
How to create an AI startup – convince some humans to be your training set

137 points | simplystats | 10 years ago | simplystatistics.org

32 comments

[+] AznHisoka|10 years ago|reply
"It will also be interesting if there is a legal claim for the gig workers at these companies to make that their labor helped “create the value” at the companies that replace them."

Well, I would certainly hope every employee helps create value for the company they work for... even if they get laid off eventually.

[+] exolymph|10 years ago|reply
The key word here is "legal". Without having a contract to that effect, no employee or contractor can just appropriate equity, regardless of how much sweat they put into building the company. I'm not sure why OP thinks there might be a legal claim.
[+] danblick|10 years ago|reply
I think he's really missing the point about the importance of self-play in AlphaGo. Human play provided a seed for training the system, but what made it work was that the computer could play an unlimited number of games against itself; because Go is a game with clear rules, a huge number of board positions could be labeled without any human-derived training data at all. The human-derived training set alone isn't nearly enough for this.
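The parent's point can be sketched with a toy game: because the rules alone label outcomes, self-play can generate unlimited labeled positions with no human data at all. An illustrative Python sketch using tic-tac-toe with a random policy (names and structure are mine, not anything from AlphaGo):

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X', 'O', or None. The game's rules label positions for free."""
    for a, b, c in WIN_LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_game(rng):
    """Play one random-policy game; label every position with the final outcome."""
    board = [None] * 9
    history, player = [], 'X'
    while True:
        w = winner(board)
        moves = [i for i, cell in enumerate(board) if cell is None]
        if w or not moves:
            # Outcome comes from the rules alone: +1 X win, -1 O win, 0 draw.
            z = {None: 0, 'X': 1, 'O': -1}[w]
            return [(pos, z) for pos in history]
        board[rng.choice(moves)] = player
        history.append(tuple(board))
        player = 'O' if player == 'X' else 'X'

def generate_dataset(n_games, seed=0):
    """Unlimited (position, outcome) training pairs, zero human labels."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_games):
        data.extend(self_play_game(rng))
    return data
```

Replace the random policy with the network being trained and you have the self-play loop the parent describes: the dataset grows as fast as you can run games.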
[+] nazka|10 years ago|reply
Do you have first-hand sources for this? I have been hearing everything and anything about what makes AlphaGo so great... First it was the hardware, then it was the use of Monte-Carlo tree search with NNs... And even more just 1 day ago: https://news.ycombinator.com/item?id=11382954
[+] morganK|10 years ago|reply
Would have liked to hear at least one concrete example of a startup actually doing this. It seems a bit theoretical at the moment: big companies don't need to do it thanks to existing datasets, and I've never heard of any startup using dozens (hundreds?) of contractors for this kind of job.
[+] HillRat|10 years ago|reply
CrowdFlower does AI and ML-focused microtasking, though I have no experience with them. Even large companies need plenty of preprocessing done on their datasets, so it's common to use offshored services companies or divisions to do annotation and cleanup work on corpora before using them as training sets.
[+] johndavi|10 years ago|reply
In very broad strokes this is how we power many of our API features at Diffbot. We have hundreds of thousands of human-trained web pages amounting to millions of individual elements that have helped to train our system.
[+] RobertoG|10 years ago|reply
Not a start-up and not deep learning (until now, I suppose), but this has been done for years in the translation industry.

They feed their automatic systems with the output of the human translators. Every input means less and less manual work that needs to be done in the future.
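A toy sketch of that feedback loop, assuming nothing about any real CAT tool (exact-match reuse only; real translation-memory systems also do fuzzy matching and machine translation):

```python
class TranslationMemory:
    """Each human translation is remembered, so repeated segments
    need no manual work the next time they appear."""

    def __init__(self):
        self.memory = {}

    def translate(self, segment, human_translate):
        """Return (translation, was_manual)."""
        if segment in self.memory:
            # Automated path: the human's earlier output is reused.
            return self.memory[segment], False
        # Manual path: ask the human once, then remember the answer.
        target = human_translate(segment)
        self.memory[segment] = target
        return target, True
```

Usage: the first call to `translate("hello", ...)` invokes the human; every later call for the same segment is answered from memory, which is exactly the "less and less manual work" dynamic.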

[+] globba22|10 years ago|reply
the post office used humans for many years to train OCR models, e.g. zip code readers.

I visited a postal routing facility once in the '90s and saw a long row of metal stations staffed by about 20 people, 10 to each side. Envelopes passed through on a sort of pneumatic-tube-like conveyor and paused in front of a human operator, who read a single digit of the zip code, keyed it in, and sent the envelope on to be read by the next person.

[+] lifeisstillgood|10 years ago|reply
This does hit at one of the most basic debates of the next decade - how much of my actions and behaviours do I own? Creating a link from one page to another, thus providing PageRank with value - do I get a cut of that value? Purchasing a book or a film, thus making profit for the reseller's recommendation engine? Driving around, populating maps with my GPS co-ordinates? Just generally leaving digital footprints makes someone a training set somewhere - and yet instead of this being a public good, it's private profit. The term bandied around after 2008 was "socialising risk, privatising profits". The same debate should be happening here - but I only occasionally hear anything like it.

Or am I listening in wrong places?

[+] thinkingkong|10 years ago|reply
It won't work this way in the short term.

Any company doing "AI" will get there over a long period of time by employing people to do actual work and then slowly automating that work away. If you wait for a huge dataset or some new technique there will be tons of competition.

[+] zodPod|10 years ago|reply
>It will also be interesting if there is a legal claim for the gig workers at these companies to make that their labor helped “create the value” at the companies that replace them.

I'd assume that you'd be waiving any legal claim they might have when they sign the ToS or w/e. I mean, in all fairness, they are getting paid to perform these actions and be recorded. What more would they have any claim for anyway? A percentage based on the times their anonymized playthroughs were used?

"Well, we've got 1,000 people and each played 100 games of Go. We took those 100,000 games and trained a single model to play against itself." The user is 1 player out of 1,000. The company makes $20,000,000 and sets aside 25% (magically) for paying back the original people. Each of those people now gets $5,000. That $5,000 is cool, but it's not life changing.

EDIT: It occurs to me that my numbers could be skewed. This could be significant if they only used 100 people or so, I guess. My point wasn't necessarily to shoot down the notion, just to discuss it. What would the person have a claim to, be it legal or otherwise?
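The back-of-the-envelope numbers above, parameterized so the EDIT's caveat is easy to check (all figures are the commenter's hypotheticals, not real data):

```python
def payout_per_person(company_profit, share_set_aside, n_people):
    """Split a fixed share of profit evenly among the data contributors."""
    return company_profit * share_set_aside / n_people

# 1,000 contributors: $5,000 each, as in the comment.
print(payout_per_person(20_000_000, 0.25, 1_000))  # 5000.0
# Only 100 contributors: an order of magnitude more each.
print(payout_per_person(20_000_000, 0.25, 100))    # 50000.0
```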

[+] pbkhrv|10 years ago|reply
Microsoft, perhaps inadvertently, did that. Tay's stream of consciousness can now be used as a training set for an abusive content monitoring AI.
[+] bliti|10 years ago|reply
You could crawl 4chan and get a bigger dataset of abusive content. But that could lead to terminators showing up on my lawn.
[+] tariqali34|10 years ago|reply
The interesting question is what would you call these humans who are serving as your training set. Do you call them "Machine Therapists" (trying to coax the AI to proper behavior)? "AI Educators" (providing the material that is used to teach the AI)? "Data Scientists" (they are curating data and handing it off to the machine)?
[+] pdkl95|10 years ago|reply
Hopefully they call them "people who gave their informed consent to use their data in this specific AI project".
[+] nxzero|10 years ago|reply
It's unclear how this is new: Google, Amazon, etc. have been doing this internally, offering it as a service, being susceptible to man-in-the-middle exploits that mine real-world data for training sets, releasing data, and so on.
[+] awinter-py|10 years ago|reply
Spot on. One recipe for becoming a tech acquisition target is to collect a 'new kind' of user data -- all the big companies are hungry for this.

This phenomenon is not at all new; data has been informing investment models forever, and access to that data comes from having the right customers and is closely hoarded once gotten.

Some of the largest companies in the late Middle Ages were wool buyers -- they weren't permitted to trade internationally, but they used locally owned franchises and market knowledge to corner the market anyway. And many of the largest ag-commodities futures traders in this century also own substantial farm acreage. Those Capital One guys who were SEC'd for trading options on credit card receipts were leveraging customer activity.

Point being -- you've always needed data to train a good model.

[+] graycat|10 years ago|reply
With so many parameters, the normal equations will become large. In that case, one can consider solving the equations with the old iterative Gauss-Seidel method.
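A minimal pure-Python sketch of the suggestion, for illustration only (for real normal equations you would reach for a library solver; Gauss-Seidel converges when the matrix is, for example, symmetric positive definite, as regularized normal-equation matrices are):

```python
def gauss_seidel(A, b, iters=100):
    """Iteratively solve A x = b, sweeping through the unknowns and
    using each freshly updated value immediately (the Gauss-Seidel rule)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            # Sum the off-diagonal terms with the current estimates.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x

# Small symmetric positive definite example: 4x + y = 1, x + 3y = 2.
x = gauss_seidel([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Unlike forming and factoring the full matrix, each sweep only needs one pass over the rows, which is the appeal for large systems.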
[+] verbify|10 years ago|reply
I've only a little experience with NNs, but getting trainers is rarely the bottleneck - it's usually programming the NN.
[+] tmaly|10 years ago|reply
I plan to do just that, but my end goal is to provide a free service that has tons of value for my users.
[+] graycat|10 years ago|reply
There's a chance that some Web site ad targeting is being done this way.
[+] madelinecameron|10 years ago|reply
This is kind of "no duh".

Not really an article that adds much value or understanding, especially for a blog seemingly targeted at a technical audience.