A quality comparison of data collected on this website to data collected on Amazon Mechanical Turk

Updated November 2018

Amazon Mechanical Turk (mTurk, also abbreviated AMT below) is a service often used in psychology research in which workers are paid a small amount of money to complete tasks; in psychology research the task is typically a survey. The consensus appears to be that mTurk data is at least as high quality as data from student samples.

To validate the quality of the data collected on this website, a survey was run on this website and then on Mechanical Turk.

  • The survey on this website was run as an attachment to some of the main personality tests offered. After visitors completed a personality test, they were asked if they were willing to complete a short supplemental research survey before they viewed their personality test results. At the end of the survey respondents were asked to confirm that their responses were accurate and give their consent for use in research. Data collection took place in July, August, and September 2018 and yielded 54,632 usable responses.
  • The survey run on Mechanical Turk was identical. Turk users could participate if they had previously completed at least 100 HITs with a 97%+ approval rating, and were paid $0.20 for a response. This was run in August 2018 and 1,403 usable responses were collected.
  • The survey consisted of two pages. The first had 26 items rated on a five-point scale, and the second had six additional demographic questions.

    The survey was designed so that data validity could be looked at in four ways.

    Endorsement of items unlikely to be true

    The first comparison to be made is the endorsement of items that are very unlikely to be true. The survey contained two of those.

    Item A: "I have been sent to the hospital by an electric shock"

    The base rate of severe electrical burns is presumably low. Gordon, Reid, Awwaad (1986) report a rate of 2.6 cases per million per year, so we should expect extremely rare agreement with this item.

    Current website (n=54,631) mTurk (n=1,403)
    [NO RESPONSE] 1.10% 0.42%
    Strongly disagree 88.23% 76.81%
    Disagree 7.92% 11.05%
    Neutral 1.60% 4.63%
    Agree 0.62% 4.77%
    Strongly agree 0.82% 2.28%

    Both sources of respondents reported implausibly high rates of electrical burns, but this presumably invalid responding was more prevalent in AMT users. AMT users were 4.9 times more likely to select Agree or Strongly agree in response to this item.
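The 4.9 figure can be reproduced from the Item A table. The snippet below is a minimal sketch of that arithmetic; the function and variable names are illustrative and not part of the original analysis.

```python
# Percentages copied from the Item A table above.
website = {"Agree": 0.62, "Strongly agree": 0.82}
mturk = {"Agree": 4.77, "Strongly agree": 2.28}


def endorsement_rate(row_percentages):
    """Combined percentage selecting Agree or Strongly agree."""
    return row_percentages["Agree"] + row_percentages["Strongly agree"]


website_rate = endorsement_rate(website)
mturk_rate = endorsement_rate(mturk)

# How many times more often mTurk respondents endorsed the item.
ratio = mturk_rate / website_rate

print(f"{website_rate:.2f}% vs {mturk_rate:.2f}% -> {ratio:.1f}x")
```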

    Item B: "I own a goat"

    The proportion of people who own goats is presumably low. Statistics on this were hard to find, but this low-quality source suggests that in the UK 53,000 households keep goats (compared to a population of 65 million), so agreement with this item should be extremely rare.

    Current website (n=54,631) mTurk (n=1,403)
    [NO RESPONSE] 1.14% 1.06%
    Strongly disagree 88.55% 79.52%
    Disagree 6.91% 8.55%
    Neutral 1.71% 4.27%
    Agree 0.74% 4.27%
    Strongly agree 1.21% 2.28%

    Respondents from both sources reported an implausibly high rate of goat ownership, but again AMT users were worse. Based on the table above, AMT users were 3.4 times more likely to select Agree or Strongly agree (6.55% versus 1.95%).

    Inconsistent responding

    The second comparison is of inconsistent responding. If a respondent is providing valid responses, they should not give responses that are incompatible with each other.

    This survey contained one pair of items where if you agreed with one you should disagree with the other. These two items were "I am tall" and "I am short". The table below takes the data from people who strongly agreed with "I am short" and breaks down their responses to the item "I am tall".

    Response to "I am tall" Users of this website who selected "Strongly agree" for "I am short" (n=7,132) AMT users who selected "Strongly agree" for "I am short" (n=189)
    [NO RESPONSE] 0.42% 0.52%
    Strongly disagree 92.31% 79.36%
    Disagree 4.64% 5.82%
    Neutral 0.85% 0.52%
    Agree 0.70% 5.29%
    Strongly agree 1.06% 8.46%

    AMT respondents were 7.8 times more likely to Agree or Strongly agree with the statement "I am tall" after already selecting Strongly agree for the statement "I am short". Note that the 13.7% rate of invalid responding on this question cannot be taken as an absolute rate, because it applies to only a subset of the respondents, and this subset could be more likely to contain low-quality responders. For example, if valid responders give responses with a peak at three and invalid responders give responses distributed randomly, a higher percentage of people responding 5 or 1 will be invalid responders than of those who responded 3.
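The subset argument can be made concrete with a small worked example. Everything numeric below is a hypothetical assumption chosen for illustration (a 5% share of invalid responders, a valid-response distribution peaked at 3, and uniformly random invalid responses); none of it is estimated from the survey data.

```python
# Hypothetical assumption: 5% of all respondents answer at random.
p_invalid = 0.05

# Hypothetical distribution of valid answers on the 1-5 scale, peaked at 3.
valid_dist = {1: 0.05, 2: 0.20, 3: 0.50, 4: 0.20, 5: 0.05}

# Invalid responders are assumed to pick each option with equal probability.
invalid_dist = {k: 0.20 for k in range(1, 6)}


def share_invalid(response):
    """Share of invalid responders among those who gave this response."""
    invalid_mass = p_invalid * invalid_dist[response]
    valid_mass = (1 - p_invalid) * valid_dist[response]
    return invalid_mass / (invalid_mass + valid_mass)


posterior = {k: share_invalid(k) for k in range(1, 6)}

for response, share in posterior.items():
    print(f"response {response}: {share:.1%} invalid")
```

Under these assumptions, about 17% of those answering 1 or 5 are invalid responders, compared with about 2% of those answering 3, even though only 5% of the whole sample is invalid.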

    Free response plausibility

    The survey contained one free response question that required careful responding. The question asked respondents to report their height and is reproduced below.

    What is your height? (use your preferred units)
    feet, inches
    centimeters

    This comparison looks only at those who entered imperial units (the most commonly chosen way to respond) and whose inputted values could be automatically parsed to numbers (because of laziness). Values that could not be parsed could be either valid or invalid responses (compare inches->"ZR#$GT" vs inches->"3 or 4").

    One way data quality can be compared is by looking at impossible values of inches. The value entered in inches must be: 0 <= x < 12. Values fell outside that range for 0.45% of responses on this website and 3.15% of mTurk responses.

    Another way is looking at implausible heights. Values of feet that are not 4, 5, or 6 are implausible. Values fell outside that range for 0.2% of responses on this website and 1.29% of mTurk responses.
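The two checks above can be sketched as follows. This is an illustrative reconstruction, not the script used for the analysis: the function names are hypothetical, and "automatically parsed" is assumed here to mean that Python's `float()` accepts the text.

```python
def parse_number(text):
    """Return the text as a float, or None if it cannot be parsed."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def check_height(feet_text, inches_text):
    """Apply the two plausibility checks to one feet/inches response."""
    feet = parse_number(feet_text)
    inches = parse_number(inches_text)
    if feet is None or inches is None:
        # Unparseable input is excluded from the comparison, not counted as invalid.
        return {"parsed": False, "impossible_inches": None, "implausible_feet": None}
    return {
        "parsed": True,
        # Inches must satisfy 0 <= x < 12.
        "impossible_inches": not (0 <= inches < 12),
        # Feet values other than 4, 5, or 6 are treated as implausible.
        "implausible_feet": feet not in (4, 5, 6),
    }


print(check_height("5", "10"))
print(check_height("5", "14"))
print(check_height("9", "2"))
print(check_height("5", "3 or 4"))
```

The two flags are kept separate because the text reports a separate rate for each check.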

    Estimate of a known effect

    This dataset contains values that can be used to estimate an effect whose true value is known for the population. The gender difference in height is known to be ~2SD (Young, 2012). To make this comparison, heights that were implausible on their face were first removed, as was done above. In the data collected on this website, the gender difference in height was 1.78SD. In the data collected on mTurk the gender difference was 1.62SD. So this website captured 89% of the true effect and mTurk captured 81% of the true effect, though there are probably some sample differences that make this comparison hard to interpret.
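The share of the true effect captured is the observed difference divided by the known ~2SD difference. The sketch below shows that arithmetic, along with one common way to express a group difference in SD units. The text does not say how the SD was computed, so the pooled-SD formula is an assumption, and the sample heights are made-up illustrative values.

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(group_a, group_b):
    """Difference in means in pooled-SD units (assumed formula)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up heights in inches, only to show the function being called.
example_men = [70, 72, 68, 71, 69]
example_women = [64, 66, 62, 65, 63]
example_d = standardized_difference(example_men, example_women)

# Share of the known effect captured, using the figures reported in the text.
TRUE_EFFECT_SD = 2.0
captured = {
    "this website": 1.78 / TRUE_EFFECT_SD,
    "mTurk": 1.62 / TRUE_EFFECT_SD,
}

for source, share in captured.items():
    print(f"{source}: {share:.0%} of the true effect")
```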

    Conclusions

    Across multiple measures, data collected on this website appears to contain less than 25% of the rate of invalid responding found in AMT data, so estimates can be expected to be modestly closer to their true values.

    Data

    The data collected for this comparison can be downloaded at mTurk-comparison-data.zip.
