Tell HN: Mechanical Turk is twenty years old today
94 points | csmoak | 3 months ago
At the time, AWS was about 100 people (when you were on call, you were on call for all of AWS), Amazon had just hit 10,000, S3 was still in private beta, and EC2 was a whitepaper.
What did you create with MTurk and the incredibly patient hard-working workforce behind it?
pvankessel|3 months ago
frmersdog|3 months ago
linkregister|3 months ago
larodi|3 months ago
maxrmk|3 months ago
ryandvm|3 months ago
crossbody|3 months ago
edoceo|3 months ago
comrade1234|3 months ago
Most of the work was done by one person - I think she was a woman in the Midwest; it's been like 15 years so the details are hazy. A few recipes were transcribed by people overseas, but they didn't stick at it. I had to reject only one transcription.
I used MTurk in some work projects too, but those were boring and maybe also a little unethical (basically paying people $0.50 to give us all of their Facebook graph data, for example).
cactusplant7374|3 months ago
social_quotient|3 months ago
For a major mall operator in the USA, we had an issue with tenants keeping their store hours in sync between the mall site and their own site. So we deployed MTurk workers in redundant multiples for each retail listing… 22k stores at the time, checked weekly from October through mid-January.
Another use case: figuring out whether a restaurant had OpenTable as an option. This also changes from time to time, so we’d check weekly via MTurk. 52 weeks a year across over 100 malls. Far fewer in quantity, think 200-300. But it’s still more work than you’d want to staff.
A fun, more nuanced use case: In retail mall listings, there’s typically a link to the retailer’s website. For GAP, no problem… it’s stable. But for random retailers (think kiosk operators), sometimes they’d lose their domain, which would then get forwarded to an adult site. The risk here is extremely high. So daily we would hit all retailer website links to determine if they contained adult or objectionable content. If flagged, we’d first send to MTurk for confirmation, then to client management for final determination. In the age of AI this would be very different, but the number of false positives was comical. Take a typical lingerie retailer and send it to a skin-detection algorithm… you’d maybe be surprised how many common retailers have NSFW homepages.
Now some pro tips I’ll leave you with.
- Any job worth doing on mturk is worth paying a decent amount of money for.
- Never run a job once. Run it 3-5 times and then build a consensus algo on the results to get confidence.
- Assume they will automate things you would not have assumed automated, and be ready to get some junk results at scale.
- Think deeply on the flow and reduce the steps as much as possible.
- Similar to how I manage AI now: consider how you can prove they did the work if you needed a real human and not an automation.
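The "run it 3-5 times, then build a consensus algo" tip might look like this minimal sketch (hypothetical Python; the answer normalization and the 2/3 agreement threshold are my assumptions, not from the comment):

```python
from collections import Counter

def consensus(answers, min_agreement=2/3):
    """Majority-vote over redundant MTurk submissions for one item.

    answers: the 3-5 worker answers collected for the same task.
    Returns (winning_answer, confidence); winning_answer is None when
    agreement falls below the threshold and the item needs manual review.
    """
    counts = Counter(a.strip().lower() for a in answers)  # light normalization
    answer, votes = counts.most_common(1)[0]
    confidence = votes / len(answers)
    return (answer if confidence >= min_agreement else None), confidence

# Example: 3 redundant submissions for one store-hours check.
# Two of three agree, so the answer is accepted with 2/3 confidence.
consensus(["9am-9pm", "9AM-9PM", "10am-9pm"])
```

The same pattern flags disagreement instead of silently averaging it, which is what lets you catch the "junk results at scale" the comment warns about.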
pvankessel|3 months ago
unknown|3 months ago
[deleted]
mtmail|3 months ago
goykasi|3 months ago
People kept asking me to automate it, but I felt it was against the spirit of MTurk. So, another member would take my updates and add an auto-clicker. That lasted for a couple of weeks at most before the HIT volume dried up and very few would be released. I guess Amazon caught on to what was happening. But before that, several forum members made enough to get some high-dollar items: laptops, speakers, etc. Eventually, I relented and created a wishlist. That's how I ended up with the box sets for the first run of Futurama seasons.
stevejb|3 months ago
The idea was that the Prosper data set contained all of the information that a lending officer would have, but they also had user-submitted pictures. We wanted to see if there was value in the information conveyed in the pictures. For example, if they had a puppy or a child in the picture, did this increase the probability that the loan would get funded? That sort of thing. It was a very fun project!
Paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1343275
hiddencost|3 months ago
rgovostes|3 months ago
In desperation I turned to Mechanical Turk to have the labels redone as bounding boxes, with some amount of agreement between labelers. Even then, results were so-so, with many workers making essentially random labels. I had to take another pass, flipping rapidly through thousands of frames with low confidence scores, which gave me nausea not unlike seasickness.
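The "some amount of agreement between labelers" step is commonly done by comparing redundant bounding boxes with intersection-over-union; this is my sketch of that idea, not necessarily what was used here:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def labelers_agree(boxes, threshold=0.5):
    """True when every pair of redundant boxes overlaps above threshold.

    Essentially-random labels (the problem described above) produce
    near-zero IoU, so those frames fall out for manual review.
    """
    return all(iou(p, q) >= threshold
               for i, p in enumerate(boxes) for q in boxes[i + 1:])
```

Frames where `labelers_agree` is False are exactly the low-confidence ones that still needed a human pass.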
electroly|3 months ago
Later, we needed to choose the best-looking product picture from a series of possible pictures (collected online) for every SKU, for use in our website's inventory browser. MTurk to the rescue--their human taste was perfect, and it was effortless on my part.
Neither of these was earth-shattering from a tech perspective, and I'm sure these days AI could do it, but back then MTurk was the perfect solution. Humans make both random and consistent errors, and it was kinda fun to learn how to deal with both kinds. I learned lots of little tricks to lower the error rate. As a rule, I always paid out erroneous submissions (you can choose to reject them, but it's easier to just pay for all submissions) and just worked to improve my prompts. I never had anyone maliciously or intentionally try to submit incomplete or wrong work, but lots of "junk" happens with the best of intentions.
danpalmer|3 months ago
But every time I looked at it I persuaded myself out of it. The docs really downplayed the level of critical thinking that we could expect; they made it clear that you couldn't trust any result to even human-error levels: you needed to run each task 3-5 times and "vote". You couldn't really get good results for unstructured outputs; instead, it was designed around classification across a small number of options. The bidding also made pricing hard to estimate.
In the end we hired a company that sat somewhere between MTurk and fully skilled outsourcing. We trained the team in our specific needs and they would work through data processing when available, asking clarifying questions on Slack, and would reference a huge Google doc that we had with various disambiguations and edge cases documented. They were excellent. More expensive than MTurk on the surface, but likely cheaper in the long run because the results were essentially as correct as anyone could get them and we didn't need to check their work much.
Because of this, I wonder if MTurk ever found great product-market fit. It languished in AWS's portfolio for most of its 20 years. Maybe it was just too limited?
sireat|3 months ago
I spent a month in 2012 roughly 4 hours a day doing various tasks.
It was horrible; even if I followed all the "best practices" of Turkers, it was not a way to make a living.
By the end of the month, I had become so jaded by all the "priming" experiments run by graduate and undergraduate psychology students. Those usually paid at least something, 3-4 USD an hour.
Did some porn labeling tasks, those were horrible after the novelty wore off.
Did very few other labeling tasks because they paid next to nothing.
To have someone actually depend on these tasks for a living seemed like torture.
stickfigure|3 months ago
There are places where $3-$4 USD per hour is significantly higher than the prevailing wage. This is not a great fact about global wealth disparity, but that money goes towards improving the situation, not making it worse.
kittikitti|3 months ago
dr_dshiv|3 months ago
edoceo|3 months ago
mtlynch|3 months ago
The most valuable prospects were businesses in buildings where we had a direct fiber connection. There were sites online that purported to list the buildings and leads that the company bought from somewhere, but the sources were all really noisy. Like 98% of the time, the phone number was disconnected or didn't match the address the source said, so basically nobody used these sources.
I thought MTurk would be my secret weapon. If I could pay someone like $0.10/call to call a business and confirm the business name and address, then I'd turn these garbage data sources into something where 100% of the prospects were valid, and none of the sales reps competing with me would have time to burn through these low-probability phone numbers.
The first day, I was so excited to call all the numbers that the MTurk workers had confirmed, and...
The work was all fake. They hadn't called anyone at all. And they were doing the jobs at like 4 AM local time when certainly nobody was answering phones at these businesses.
I tried a few times to increase the qualifications and increase pay, but everyone who took the job just blatantly lied about making the calls and gave me useless data.
Still, I thought MTurk was a neat idea and wish I'd found a better application for it.
rzzzt|3 months ago
Delightful.
ruralfam|3 months ago
ebcase|3 months ago
It wasn’t perfect, but it didn’t need to be. We essentially needed a “good enough to start with” dataset that we could refine going forward. It got the job done.
johntfella|3 months ago
For those using it to "get by", idk. I mean, I knew someone who qualified for SSDI in theory but was still denied. He used it to offset some cash needs, but it obviously was not sufficient. I think the bigger issue around ethics is more societal: people shouldn't have to rely on it in place of SSDI when they shouldn't have been denied. I suppose the same can apply to food stamps etc. With this administration, obviously things are tightening and getting stricter. Potential SSDI reforms will probably exacerbate the need for this service. The ideal is that this wouldn't be the case and the service would just provide the small cash niceties for small-time gaming and donations.
ref: https://www.propublica.org/article/social-security-disabilit...
firefax|3 months ago
https://web.archive.org/web/20170809155252id_/http://kittur....
nvarsj|3 months ago
malshe|3 months ago
slyall|3 months ago
None of the companies I've worked for have used it AFAIK, despite them all using AWS. I think I've mostly ignored it as one of the niche AWS products that isn't relevant.
[0] https://blog.mturk.com/weve-made-it-easier-for-more-requeste...
coderintherye|3 months ago
abhiyerra|3 months ago
cole-k|3 months ago
https://github.com/cole-k/turksort
dmd|3 months ago
ada1981|3 months ago
I got about 100 of each.
jonatron|3 months ago
antaviana|3 months ago
jgalt212|3 months ago
BoredPositron|3 months ago
brazukadev|3 months ago
hshdhdhehd|3 months ago
nextworddev|3 months ago
deadbabe|3 months ago
Bad data or false work was a big problem on MTurk, but now LLMs should be able to act as reasonable quality assurance for each and every piece of work a worker commits. The workers can be ranked and graded based on the quality of their work, instantly, instead of requiring human review.
You can also flip the model and have LLMs do the unit of work, and have humans as a verification layer, and the human review sanity checked again by an LLM to ensure people aren’t just slacking off and rubber stamping everything. You can easily do this by inserting blatantly bad data at some points and seeing if the workers pick up on it. Fail the people who are letting bad data pass through.
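The "insert blatantly bad data" idea is the classic gold-question (honeypot) pattern: seed the batch with items whose answer is already known, then grade each reviewer on those alone. A minimal sketch; the item structure and field names here are hypothetical:

```python
import random

def build_batch(real_items, gold_items, gold_rate=0.1):
    """Mix known-answer gold items into a batch of real review items."""
    n_gold = max(1, int(len(real_items) * gold_rate))
    batch = real_items + random.sample(gold_items, n_gold)
    random.shuffle(batch)  # workers can't tell gold from real
    return batch

def score_worker(submissions):
    """Fraction of gold items the worker answered correctly.

    submissions: list of dicts like
      {"answer": "reject", "gold_answer": "reject"}   # gold item
      {"answer": "approve", "gold_answer": None}      # real item
    Returns None when the worker has seen no gold items yet.
    """
    gold = [s for s in submissions if s["gold_answer"] is not None]
    if not gold:
        return None
    correct = sum(s["answer"] == s["gold_answer"] for s in gold)
    return correct / len(gold)

# A rubber-stamper approves everything, including the planted bad item,
# so their gold score is 0.0 and they fail the check.
subs = [{"answer": "approve", "gold_answer": None},
        {"answer": "approve", "gold_answer": "reject"}]
score_worker(subs)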
For a lot of people, I think this will be the future of work. People will go to schools to get rounded educations and get degrees in “Human Cognitive Tasks” which makes them well suited for doing all kinds of random stuff that fills in gaps for AI. Perhaps they will also minor in some niche fields for specific industries. Best of all, they can work their own hours and at home.
hiddencost|3 months ago