Automating the Accept and Reject process in MTurk with R

I love spending a ton of time making an R program to accomplish something that I could do manually in 10 minutes or have an undergraduate do for essentially free!

See relevant XKCD.

I hope someone can find this bit of code useful.

Directions:
Go to MTurk and download your worker results csv file.
Then download your survey data. I use Qualtrics. The script assumes you have a two line header with variables on line 1 and descriptions of those variables on line 2. If you have a one line header you will need to tweak the code a bit.
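For the one-line-header case, the tweak amounts to dropping the separate scan() call and the skip = 2 argument. Here is a minimal sketch, assuming the export is an ordinary CSV whose first row holds the variable names. The temporary file, its columns, and the name survey.onehdr are made up for illustration.

```r
# Hypothetical one-line-header export, written to a temp file for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("WorkerId,IPIP_30,NA_2",
             "A1B2C3,2,3",
             "D4E5F6,1,3"), tmp)

# With a single header row, read.csv() takes the names from the first line,
# so no separate scan() of the header and no skip = 2 are needed.
survey.onehdr <- read.csv(tmp, stringsAsFactors = FALSE)
names(survey.onehdr)
```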

The way I structure my MTurk data collection is a field in the HIT asking for a survey key. I honestly don’t bother too much with generating one for every participant. I find it doesn’t add much. What I do instead is have the worker paste a static code into the box. On the survey side I have participants go to their dashboard and copy/paste their Worker ID into my survey on the last page.

The script takes this static key and checks whether it was entered. It then checks whether there is a substantial amount of missing data. It also scores the attention check items and compares the number failed to a threshold. Last, it looks for people who have IDs in the MTurk file but not in the survey (the people who find your static key on a website and enter it). Failing any of these conditions results in a rejection. The rejection messages are specific to the type of rejection, so feedback is customized.
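To make the four checks concrete, here is a toy sketch in base R. It is not the script itself: every worker ID, code, threshold, and column name below is invented, and the rejection reasons are shortened to labels.

```r
# Toy data; every ID, code, and value here is made up for illustration.
key <- "STATICKEY"
mturk.toy <- data.frame(
  WorkerId = c("W1", "W2", "W3", "W4", "W5"),
  Answer.surveycode = c("STATICKEY", "wrong", "STATICKEY", "STATICKEY", "STATICKEY"),
  stringsAsFactors = FALSE
)
survey.toy <- data.frame(
  WorkerId = c("W1", "W2", "W3", "W4"),   # W5 never took the survey
  q1 = c(5, 4, NA, 3),
  q2 = c(3, 2, NA, 4),
  attn = c(2, 2, 2, 1),                   # the correct answer is 2
  stringsAsFactors = FALSE
)

# One rejection reason per MTurk worker; NA means "approve".
reason <- rep(NA_character_, nrow(mturk.toy))
names(reason) <- mturk.toy$WorkerId

# 1. Wrong or blank survey key
reason[mturk.toy$Answer.surveycode != key] <- "bad key"

# 2. Too much missing data (more than 1 NA per respondent in this toy case)
too.many.na <- survey.toy$WorkerId[rowSums(is.na(survey.toy)) > 1]
reason[too.many.na] <- "missing data"

# 3. Failed attention check
failed <- survey.toy$WorkerId[survey.toy$attn != 2]
reason[failed] <- "attention check"

# 4. Worker ID in the MTurk file but not in the survey
reason[!(mturk.toy$WorkerId %in% survey.toy$WorkerId)] <- "ID not in survey"

reason
```

In this toy data W1 is the only worker left with NA, so W1 is the only one who would be approved.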

To run the script you will need the dplyr and psych packages installed: dplyr for general awesomeness and the pipe operator, and psych for the ease of scoring multiple choice tests (used for the attention check items). If one or more attention check items have multiple correct responses, I recommend recoding them into a binary right/wrong format beforehand.
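One way to do that recoding, sketched in base R under the assumption that the item is stored as a numeric vector. The names responses, acceptable, and recoded are hypothetical.

```r
# Hypothetical item where responses 2 and 3 both count as correct.
responses <- c(1, 2, 3, 4, NA)
acceptable <- c(2, 3)

# 1 = right, 0 = wrong; missing responses stay NA.
recoded <- ifelse(is.na(responses), NA, as.integer(responses %in% acceptable))
recoded
```

After recoding, the correct answer listed for that item in attncheck.correct would be 1.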

#Scott Withrow
#2015-02-12

# Match Worker IDs from survey and MTurk
# Rejects people that fail attention check items

# In your survey and mturk data name the workerID variable "WorkerId"
# MTurk should automatically call your survey key Answer.surveycode but make sure that is true

# Fill in the Values Below ----------------------------------------
#Key provided for "completed survey"
surveykey <- "yUacuEugjBohzK1OMrqmU7o6P"

#The number of acceptable NAs in a respondent.
acceptablena = 10

#List of attention check items. Set to NA if none.
attncheck = c("IPIP_30", "NA_2")

#List of correct answers to the attncheck items.
attncheck.correct = c(2, 3)

#Number of attention check items that can be failed
attnfail = 2

#Working Directory (where the datafiles are)
wrkdir <- "E:/data/GEC/"

#Names of the datafiles
survey.data <- "GEC_Validity_Study.csv"
mturk.data <- "mturk.csv"

# The program begins! ---------------------------------------------
#Setup
setwd(wrkdir)
library(dplyr)
library(psych)

#Read in that funky Qualtrics csv output.
header <- scan(survey.data, nlines = 1, what = character(), sep=",")
survey <- read.table(survey.data, skip = 2, stringsAsFactors = FALSE, fill = TRUE, header = FALSE, sep = ",")
names(survey) <- header
survey <- tbl_df(survey)
#survey

#Read in the Amazon.com worker output file.
mturk <- tbl_df(read.table(mturk.data, stringsAsFactors = FALSE, fill = TRUE, header = TRUE, sep = ","))
#mturk

#Returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

#Clean up the Worker ID string so we can do a validation check.
survey$WorkerId <- trim(survey$WorkerId)

#Clean up the survey key code field for validation checks.
mturk$Answer.surveycode <- trim(mturk$Answer.surveycode)

#Identify people whose IDs are not in the survey and reject them
bad.id <- !(mturk$WorkerId %in% intersect(survey$WorkerId, mturk$WorkerId))
mturk$Reject[bad.id] <- "Your Worker ID was not found in the survey and your survey key code could not be authenticated. Sorry."

#Reject people that don't have a matching survey code.
mturk$Reject[mturk$Answer.surveycode != surveykey] <- "The survey key code entered into the HIT did not match the code provided by the survey or was left blank. Sorry."

#Reject people that have too many NAs
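#Count the missing values in each respondent's row
survey$numNA <- rowSums(is.na(survey))
#Start with no survey-side rejections
survey$Reject <- NA_character_
survey$Reject[survey$numNA > acceptablena] <- "The survey submitted contains more than an acceptable minimum number of missing values and is not considered a completed HIT. Sorry."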

#Reject people that failed the attention check items.
survey.attncheck <- tbl_df(data.frame(score.multiple.choice(attncheck.correct, survey[attncheck], score = FALSE)))
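#Count the attention check items each respondent failed (scored 0)
survey$attntotal <- rowSums(survey.attncheck == 0, na.rm = TRUE)
survey$Reject[survey$attntotal > attnfail] <- "The survey submitted contains more than an acceptable minimum number of failed attention checks and is not considered a completed HIT. Sorry."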

#Create a smaller dataset containing the rejected people from the survey
survey.reject <- filter(survey, !is.na(Reject)) %>%
select(Reject, WorkerId)

#Merge the datasets
merged <- merge(mturk, survey.reject, by="WorkerId", all.x = TRUE)

#Merge the reject columns
#Keep the MTurk-side reason when there is one, otherwise the survey-side reason
merged <- within(merged, {
Reject <- ifelse(is.na(Reject.x), Reject.y, Reject.x)
})

#Approve everyone that didn't get rejected
merged$Approve[is.na(merged$Reject)] <- "x"

#Drop the extra columns
merged <- select(merged, -(Reject.x), -(Reject.y))

#Rejected Participants
Rejected <- filter(merged, !is.na(Reject)) %>% select(WorkerId, Reject, Answer.surveycode, AssignmentId)

#Approved Participants
Approved <- filter(merged, is.na(Reject)) %>% select(WorkerId, AssignmentId)

#Save as a csv for uploading to mturk
write.csv(merged, file = "Upload_Mturk.csv", row.names = FALSE, na = "")

#Save a rejected csv for double checking
write.csv(Rejected, file = "Rejected_Participants.csv", row.names = FALSE, na = "")

#Save an approved csv
write.csv(Approved, file = "Approved_Participants.csv", row.names = FALSE, na = "")