
How to create an AI startup – convince some humans to be your training set

137 points | simplystats | 10 years ago | simplystatistics.org

32 comments

[+] AznHisoka|10 years ago|reply
"It will also be interesting if there is a legal claim for the gig workers at these companies to make that their labor helped “create the value” at the companies that replace them."

Well, I would certainly hope every employee helps create value for the company they work for... even if they get laid off eventually.

[+] exolymph|10 years ago|reply
The key word here is "legal". Without having a contract to that effect, no employee or contractor can just appropriate equity, regardless of how much sweat they put into building the company. I'm not sure why OP thinks there might be a legal claim.
[+] danblick|10 years ago|reply
I think he's really missing the point about the importance of self-play in AlphaGo. Human play provided a seed for training the system, but what made it work was that the computer could play an unlimited number of games against itself; because Go is a game with clear rules, a huge number of board positions could be labeled without any human-derived training data at all. The human-derived training set alone isn't nearly enough for this.
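The parent's point can be sketched with a toy game: because the rules alone label outcomes, self-play can generate unlimited labeled positions with no human data at all. An illustrative Python sketch using tic-tac-toe with a random policy (names and structure are mine, not anything from AlphaGo):

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X', 'O', or None. The game's rules label positions for free."""
    for a, b, c in WIN_LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_game(rng):
    """Play one random-policy game; label every position with the final outcome."""
    board = [None] * 9
    history, player = [], 'X'
    while True:
        w = winner(board)
        moves = [i for i, cell in enumerate(board) if cell is None]
        if w or not moves:
            # Outcome comes from the rules alone: +1 X win, -1 O win, 0 draw.
            z = {None: 0, 'X': 1, 'O': -1}[w]
            return [(pos, z) for pos in history]
        board[rng.choice(moves)] = player
        history.append(tuple(board))
        player = 'O' if player == 'X' else 'X'

def generate_dataset(n_games, seed=0):
    """Unlimited (position, outcome) training pairs, zero human labels."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_games):
        data.extend(self_play_game(rng))
    return data
```

Replace the random policy with the network being trained and you have the self-play loop the parent describes: the dataset grows as fast as you can run games.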
[+] nazka|10 years ago|reply
Do you have first-hand sources for this? I have been hearing everything and anything about what makes AlphaGo so great... First it was the hardware, then it was the use of Monte-Carlo tree search with NNs... And even more just 1 day ago: https://news.ycombinator.com/item?id=11382954
[+] morganK|10 years ago|reply
Would have liked to hear at least one concrete example of a startup actually doing this. It seems a bit theoretical at the moment: big companies don't need to do it thanks to existing datasets, and I've never heard of any startup using dozens (hundreds?) of contractors for this kind of job.
[+] HillRat|10 years ago|reply
CrowdFlower does AI and ML-focused microtasking, though I have no experience with them. Even large companies need plenty of preprocessing done on their datasets, so it's common to use offshored services companies or divisions to do annotation and cleanup work on corpora before using them as training sets.
[+] johndavi|10 years ago|reply
In very broad strokes this is how we power many of our API features at Diffbot. We have hundreds of thousands of human-trained web pages amounting to millions of individual elements that have helped to train our system.
[+] RobertoG|10 years ago|reply
Not a start-up and not deep learning (until now, I suppose), but this has been done for years in the translation industry.

They feed their automatic systems with the output of the human translators. Every input means less and less manual work that needs to be done in the future.
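A toy sketch of that feedback loop, assuming nothing about any real CAT tool (exact-match reuse only; real translation-memory systems also do fuzzy matching and machine translation):

```python
class TranslationMemory:
    """Each human translation is remembered, so repeated segments
    need no manual work the next time they appear."""

    def __init__(self):
        self.memory = {}

    def translate(self, segment, human_translate):
        """Return (translation, was_manual)."""
        if segment in self.memory:
            # Automated path: the human's earlier output is reused.
            return self.memory[segment], False
        # Manual path: ask the human once, then remember the answer.
        target = human_translate(segment)
        self.memory[segment] = target
        return target, True
```

Usage: the first call to `translate("hello", ...)` invokes the human; every later call for the same segment is answered from memory, which is exactly the "less and less manual work" dynamic.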

[+] globba22|10 years ago|reply
the post office used humans for many years to train OCR models, e.g. zip code readers.

I visited a postal routing facility once in the '90s and saw a long row of metal stations staffed by about 20 people, 10 to each side. Envelopes passed through on a sort of pneumatic-tube-like conveyor and paused in front of a human operator, who read a single digit of the zip code, keyed it in, and sent the envelope on to be read by the next person.

[+] lifeisstillgood|10 years ago|reply
This does hit at one of the most basic debates of the next decade - how much of my actions and behaviours do I own? Creating a link from one page to another, thus providing PageRank with value - do I get a cut of that value? Purchasing a book or a film, thus making profit for the reseller's recommendation engine? Driving around, populating maps with my GPS co-ordinates? Just generally leaving digital footprints makes someone a training set somewhere - and yet instead of this being a public good, it's private profit. The term bandied around after 2008 was "socialising risk, privatising profits". The same debate should be happening here - but I only occasionally hear anything like it.

Or am I listening in wrong places?

[+] thinkingkong|10 years ago|reply
It won't work this way in the short term.

Any company doing "AI" will get there over a long period of time by employing people to do actual work and then slowly automating that work away. If you wait for a huge dataset or some new technique there will be tons of competition.

[+] zodPod|10 years ago|reply
>It will also be interesting if there is a legal claim for the gig workers at these companies to make that their labor helped “create the value” at the companies that replace them.

I'd assume that you'd be waiving any legal claim they might have when they sign the ToS or w/e. I mean, in all fairness, they are getting paid to perform these actions and be recorded. What more would they have any claim for anyway? A percentage based on the times their anonymized playthroughs were used?

"Well, we've got 1,000 people and each played 100 games of Go. We took those 100,000 games and trained a single model to play against itself." The user is 1 player out of 1,000. The company makes $20,000,000 and sets aside 25% (magically) for paying back the original people. Each of those people now gets $5,000. That $5,000 is cool, but it's not life changing.

EDIT: It occurs to me that my numbers could be skewed. This could be significant if they only used 100 people or so, I guess. My point wasn't necessarily to shoot down the notion, just to discuss it. What would the person have a claim to, be it legal or otherwise?
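The back-of-the-envelope numbers above, parameterized so the EDIT's caveat is easy to check (all figures are the commenter's hypotheticals, not real data):

```python
def payout_per_person(company_profit, share_set_aside, n_people):
    """Split a fixed share of profit evenly among the data contributors."""
    return company_profit * share_set_aside / n_people

# 1,000 contributors: $5,000 each, as in the comment.
print(payout_per_person(20_000_000, 0.25, 1_000))  # 5000.0
# Only 100 contributors: an order of magnitude more each.
print(payout_per_person(20_000_000, 0.25, 100))    # 50000.0
```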

[+] pbkhrv|10 years ago|reply
Microsoft, perhaps inadvertently, did that. Tay's stream of consciousness can now be used as a training set for an abusive content monitoring AI.
[+] bliti|10 years ago|reply
You could crawl 4chan and get a bigger dataset of abusive content. But that could lead to terminators showing up on my lawn.
[+] tariqali34|10 years ago|reply
The interesting question is what would you call these humans who are serving as your training set. Do you call them "Machine Therapists" (trying to coax the AI to proper behavior)? "AI Educators" (providing the material that is used to teach the AI)? "Data Scientists" (they are curating data and handing it off to the machine)?
[+] pdkl95|10 years ago|reply
Hopefully they call them "people who gave their informed consent to use their data in this specific AI project".
[+] nxzero|10 years ago|reply
It's unclear how this is new: Google, Amazon, etc. have been doing this internally, offering it as a service, being susceptible to man-in-the-middle exploits that mine real-world data for training sets, releasing data, and so on.
[+] awinter-py|10 years ago|reply
Spot on. One recipe for becoming a tech acquisition target is to collect a 'new kind' of user data -- all the big companies are hungry for this.

This phenomenon is not at all new; data has been informing investment models forever, and access to that data comes from having the right customers and is closely hoarded once gotten.

Some of the largest companies in the late Middle Ages were wool buyers -- they weren't permitted to trade internationally, but they used locally owned franchises and market knowledge to corner the market anyway. And many of the largest ag-commodities futures traders in this century also own substantial farm acreage. Those Capital One guys who were SEC'd for trading options on credit card receipts were leveraging customer activity.

Point being -- you've always needed data to train a good model.

[+] graycat|10 years ago|reply
With so many parameters, the normal equations will become large. In that case, one can consider solving the equations with the old iterative Gauss-Seidel method.
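A minimal pure-Python sketch of the suggestion, for illustration only (for real normal equations you would reach for a library solver; Gauss-Seidel converges when the matrix is, for example, symmetric positive definite, as regularized normal-equation matrices are):

```python
def gauss_seidel(A, b, iters=100):
    """Iteratively solve A x = b, sweeping through the unknowns and
    using each freshly updated value immediately (the Gauss-Seidel rule)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            # Sum the off-diagonal terms with the current estimates.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x

# Small symmetric positive definite example: 4x + y = 1, x + 3y = 2.
x = gauss_seidel([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Unlike forming and factoring the full matrix, each sweep only needs one pass over the rows, which is the appeal for large systems.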
[+] verbify|10 years ago|reply
I've only a little experience with NNs, but getting trainers is rarely the bottleneck - it's usually programming the NN.
[+] tmaly|10 years ago|reply
I plan to do just that, but my end goal is to provide a free service that has tons of value for my users.
[+] graycat|10 years ago|reply
There's a chance that some Web site ad targeting is being done this way.
[+] madelinecameron|10 years ago|reply
This is kind of "no duh".

Not really an article that adds much value or understanding, especially for a blog seemingly targeted at a technical audience.