
Lyft releases self-driving research dataset

301 points| dchengy | 6 years ago |medium.com | reply

102 comments

[+] jedberg|6 years ago|reply
> Academic research accelerates innovation, but it requires costly data that is out of reach for most academic teams.

This is true of pretty much any AI research. Look at Puffer[0], which was just on HN a couple of days ago. They're running a free streaming service just to get enough data to train their algorithms, and in fact mention in their FAQ that they would love to use commercial data if they could get it.

Unfortunately, academic and commercial incentives don't really align here. Most commercial entities don't want to share their data because it's valuable to them, and if they let researchers in, they want the output of the research to remain proprietary to their commercial enterprise.

I wonder if there isn't some sort of governance solution to this. Like give companies big tax breaks for sharing their data with researchers, or something like that. Essentially subsidize academia indirectly.

[0] https://puffer.stanford.edu/player/

[+] ISL|6 years ago|reply
I've seen semiconductor industry companies collaborate on grant-funding fundamental condensed-matter physics research. If it is a question of interest to all parties, and the work is too blue-sky to be immediately profitable, sometimes they'll fund the work.
[+] bko|6 years ago|reply
> Unfortunately, academic and commercial incentives don't really align here. Most commercial entities don't want to share their data because it's valuable to them, and if they let researchers in, they want the output of the research to remain proprietary to their commercial enterprise... I wonder if there isn't some sort of governance solution to this.

You're commenting on an article in which a commercial entity is sharing their data despite it being valuable to them. Maybe they are the outlier but I've seen plenty of companies share data, especially in the ML space. Here are some datasets[0]. Maybe you would prefer more, but compared to other fields there is a lot of sharing. A "governance solution" could make things worse. If there was some mandate that companies that collect this data have to share it in a costly way, then it would discourage collection.

[0] https://blog.cambridgespark.com/50-free-machine-learning-dat...

[+] hwillis|6 years ago|reply
> I wonder if there isn't some sort of governance solution to this. Like give companies big tax breaks for sharing their data with researchers, or something like that. Essentially subsidize academia indirectly.

I think that's a really great idea. Not sure how many would take advantage of it but if it could be made to work then it would be really awesome.

It would also be extremely prone to abuse, though. Patenting is already an art of pretending to explain in clear terms what you are doing, while actually describing something as broadly and vaguely as possible. It would be pretty easy for a TON of releases to leave out some key pieces, making the information impossible or unhelpful to use.

You could form industry-specific regulations or even an active agency to prosecute abuses like that, but it would be immediately overwhelmed. The patent office is already heavily gamed by patent trolls, who bank on long odds for small judgements. Now imagine if millions or billions of dollars of taxes were on the line, and major companies were investing significant resources to open source while protecting their IP.

Even if that were all figured out, how would you value open sourcing stuff, even something as simple as data? Do you give breaks by size, importance, proportion of profit or future profit? Cost of the research? How do you guard against overvaluations and abuse of accounting? Even if you had perfectly accurate, annually-updated solutions for all that, companies can still game the system. Lyft has decided this dataset is what they need; if they could get a bigger break by collecting more data, they'd do that. Plus, Facebook and Google release tons of open source stuff. Do they deserve more than, say, pharmaceutical research?[1]

Similar (IIRC Nixon) tax breaks already exist for R&D, and they are a notoriously abused loophole. Simplified but illustrative example: you build your R&D lab in the shape of a factory, do your research for a while and then suddenly scale back and replace it with machinery. Well, the original building was still deducted from taxes.

Pharma is actually a perfect example. It's a well known fact that R&D only accounts for 22% of pharma industry revenue (almost equal to advertising at 19%), but only ~30% of that actually goes to new drugs. The rest takes advantage of marketing and the patent system to re-release drugs that are essentially the same. Two thirds of their research is obvious changes that are only protected because they owned the original patent; those shouldn't be getting the benefit of incentives.

[+] trophycase|6 years ago|reply
Slightly tangential, but might this be another argument for people "owning" their data while companies "own" the procedures for processing it? If people "owned" their data, it would presumably be much easier for them to give it out for research purposes.
[+] Bombthecat|6 years ago|reply
AI (neural networks) is sadly a "winner takes it all" market...

Even if you create an algorithm five times better and faster, you still lack the data to feed it.

[+] choppaface|6 years ago|reply
Just some context here:

* The raw data in nuscenes ( https://www.nuscenes.org/ ) is about 5x larger than this dataset from Lyft: 300GB train vs 60GB train. Argoverse ( https://www.argoverse.org/ ) is about 3x larger at 200GB. The Waymo dataset will (allegedly) be an order of magnitude larger than nuscenes ( https://i2.wp.com/syncedreview.com/wp-content/uploads/2019/0... ). BDD100k ( https://bair.berkeley.edu/blog/2018/05/30/bdd/ ) is the "largest" public dataset to date, but lacks lidar, and labels are inconsistent; most of the 100,000 scenes only have one labeled frame.

* The Lyft sensor suite has bumper-mounted lidar, which is absent from other existing datasets. Point cloud data in these areas is critical for pedestrians, bikes, and various road hazards. So this dataset alone is useful for validating work trained through other means.

* The current Lyft Level 5 release has no explicit test / validation set, which is crucial for properly measuring performance of any experiment one might do with the data. In nuscenes and Argoverse, there's a small snippet dataset that helps you prepare your pipeline. Feels like Lyft might have rushed things a little here; they could have posted a "teaser" and then the full train and test/validation set a couple of weeks later.

Great to see more public data (especially from a more modern sensor suite), plus investment into a contest with prizes.
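
Until an official split shows up, one workaround for the missing validation set is to hold out whole scenes yourself. Below is a minimal sketch of that idea, not anything from Lyft's SDK: the scene IDs, the `assign_split` and `split_scenes` helpers, and the 20% validation fraction are all made up for illustration. It hashes each scene ID so the assignment is deterministic and frames from one scene never land on both sides of the split.

```python
import hashlib


def assign_split(scene_id: str, val_fraction: float = 0.2) -> str:
    """Deterministically map a scene ID to 'train' or 'val' (hypothetical helper)."""
    digest = hashlib.sha256(scene_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if bucket < val_fraction else "train"


def split_scenes(scene_ids, val_fraction: float = 0.2):
    """Group scene IDs by split; every frame of a scene follows its scene."""
    splits = {"train": [], "val": []}
    for scene_id in scene_ids:
        splits[assign_split(scene_id, val_fraction)].append(scene_id)
    return splits


# Placeholder IDs; a real run would use the dataset's own scene tokens.
scene_ids = [f"scene-{i:04d}" for i in range(1000)]
splits = split_scenes(scene_ids)
print(len(splits["train"]), len(splits["val"]))
```

Splitting by scene rather than by frame matters because consecutive frames are nearly identical, so a frame-level split would leak training data into validation. Numbers from a homemade split like this still would not be comparable across teams the way an official held-out set would be.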

[+] cardigan|6 years ago|reply
(I work at scale)

Hmm, this blog post and the website don't mention that this dataset was mostly annotated by Scale (scale.ai), as part of a partnership with Lyft ... We're going to publish a blog post about this soon, but if anyone at Lyft is reading this, please figure out how to reasonably credit Scale since I doubt leaving out Scale completely from the announcement is in the spirit of the agreement. Scale should probably also be added to the bibliography and website in some form.

Contrast this with the website for nuScenes, a dataset which was also annotated by Scale and whose data format set the standard for this dataset: they credit Scale pretty reasonably.

[+] ayw|6 years ago|reply
Hi, I'm the CEO of Scale.ai.

This comment does not represent the company's viewpoint, and cardigan is not speaking on behalf of Scale.

We are very excited to have been able to work with Lyft in open-sourcing this dataset and advancing the research community. We are also very grateful to Lyft for choosing to leverage our point cloud viewer and have credited the annotations to us on their launch page.

[+] cm2012|6 years ago|reply
Quick tip: voice concerns about partners in private. Lyft will probably be happy to credit Scale more - it was likely an honest mistake. But now you dragged them through the mud publicly, which is going to make big companies less likely to work with Scale in the future.
[+] partingshots|6 years ago|reply
Yeah, this comment to me really gives off a bad impression of Scale as a whole. My immediate reaction, assuming this is how Scale normally deals with PR, is that this company is still far too immature to be properly handling any sort of legitimate partnership.

FYI, you’re unlikely to solve any of your problems airing grievances on a public forum instead of just directly emailing the people involved.

[+] rubyfan|6 years ago|reply
If you are an officer of Scale you should take this offline. If not, then you’re probably not authorized to speak on behalf of Scale. Check your NDAs and service agreements. This is in such poor taste; anyone who would consider Scale’s service now has to consider this sort of public commentary.
[+] _coveredInBees|6 years ago|reply
Out of curiosity, isn't Lyft just a customer that pays Scale for annotation services? Or is there a reason for this to be more of a partnership and less of a customer-client relationship?
[+] cardigan|6 years ago|reply
OP here, just waking up (I'm remote) - I can't edit my original comments so let me modify them here:

I wasn't involved in our communications with Lyft, so I was talking about something I didn't know much about. I thought my audience was just the anonymous commentariat: turns out a lot of people whose opinion makes a material difference to Lyft/Scale read these comments too. Sorry for not realizing that; I probably wouldn't have posted an uninformed personal opinion had I realized that.

I was being way too aggressive - genuinely sorry to anyone at Lyft who felt maligned by these comments. I woke up to 20 messages from coworkers who told me I was being an ass - genuinely sorry :(

Also I really should've clarified I was not speaking on behalf of the company: this was just a personal, uninformed opinion.

I cannot go into the details I learned about Scale's agreement with Lyft since it's confidential.

[+] cardigan|6 years ago|reply
Also, the viewer packaged with nuScenes was built by Steven Hao from Scale, and while it was packaged as part of nuScenes it should probably be called Scale's viewer instead of nuScenes' viewer. The original viewer in the nuscenes SDK has the Scale logo, but it looks like Lyft removed that in the fork. Maybe a bit of public shaming will fix that...

Dear Lyft marketing person who wrote this: we are a data labeling company, and you may think that means we have a bunch of useless bozos working here like most other data labeling companies, but that's not true. E.g., Steven is one of the smartest people in the world - https://stats.ioinformatics.org/people/3113 - he learns ridiculously quickly, e.g., gets to number one on random video games in a few weeks and learned to boulder L10 in a few months from scratch (normally takes years/decades and most climbers never get there).

[+] buboard|6 years ago|reply
I don't care, but god, what is wrong with the replies? Let the guy speak his/her mind and stop playing Game of Thrones.
[+] alexmlamb|6 years ago|reply
(I work at Lyft)

I've had it with you Scale.AI people always trying to take credit for Lyft's work. We've been working weekends and nights for years, even the hourly workers have had to take unpaid overtime. All that time, I've never seen Scale.AI do extra work to help us before a big deadline.

[+] DeonPenny|6 years ago|reply
You should delete this.
[+] saadalem|6 years ago|reply
I just researched Scale before your comment here! I thought: "This isn't done by Scale?"
[+] malandrew|6 years ago|reply
Hope you guys get acquired by someone that recognizes the talent of your team. You deserve it.
[+] azinman2|6 years ago|reply
$25k in prizes seems silly given this is a multi-billion dollar market to crack.
[+] arathore|6 years ago|reply
Such competitions do not usually result in a comprehensive "solution" by themselves - pushing the state-of-the-art is more common. Also the value is not going to be derived solely from the algorithm but more from its deployment to real world applications and the surrounding infrastructure to make it possible.
[+] jacobn|6 years ago|reply
“There will be $25,000 in prizes, and we’ll be flying the top researchers to the NeurIPS Conference in December, as well as allowing the winners to interview with our team.”

I guess it’s a decent opportunity if you’re trying to break into DL?

[+] bearpelican|6 years ago|reply
I had a bad experience with Lyft's previous self driving challenge - https://www.udacity.com/lyft-challenge

I unofficially got first place after finding a bug in their test set (allowing me to blow the competition away). I reported the problem directly - they decided not to fix it and asked me to take my submission down. They said they'd still offer an interview.

However - this interview wasn't even for their DL team. They offered an interview with the web tools support team because they felt I didn't have enough experience...

Reference - https://github.com/bearpelican/lyft-perception-challenge

[+] m0zg|6 years ago|reply
>> allowing the winners to interview with our team

This was pretentious AF. People who win such competitions _allow companies_ to interview them sometimes, not the other way around. It's not like working at Lyft is some amazing privilege.

[+] google2342|6 years ago|reply
I suspect the winners will already be well established in the field.
[+] yodon|6 years ago|reply
The post indicates there is a competition and prizes but I'm not seeing any discussion of what sort of license the data is being made available under (or the competition for that matter). Hopefully it's there and I'm just not seeing it.
[+] ekc|6 years ago|reply
The Github they link to says it's under the CC BY-NC-SA 4.0.
[+] vinayms|6 years ago|reply
This will go against the grain here on HN. I don't know how anyone can imagine self driving even succeeding in the real world, let alone in the so-called third world, unless all vehicles are self driven and operate in a controlled environment. There are some really important things pending, like accurate NLP and computer vision, but no, we need something shiny and useless. I think some smart computer scientists are getting rich by carrot sticking some gullible billionaire investors. Good for them. I hope some of the really useful stuff piggybacks on this rather lofty endeavor.
[+] buboard|6 years ago|reply
I'm with you; I think self-driving is an aspiration rather than a concrete goal. It literally means solving the quintessential problem of robotics, which is a very, very hard problem. We'll probably have human level NLP before that.

Car driving is in decline http://www.washingtonpost.com/blogs/wonkblog/files/2013/04/m...

A more concrete goal for transportation would be to reduce driving times even more by adopting remote work. That is reachable within the decade. In the meantime, car safety features should be ramped up, but autonomous driving so far doesn't seem very safe.

[+] astrostl|6 years ago|reply
> Self-driving is too big — and too important — an endeavor for any one team to solve alone. Transportation serves all of us, and we should all be invested in the next step of its evolution.

Imagine how much better things could be if everyone working on maps felt the same.

[+] jaimex2|6 years ago|reply
It's looking more and more like everyone is just going to have to license Tesla's FSD when it's finished.

They are the only ones with a broad real world data source and seem to have wisely taken the right path by not adopting LIDAR, focusing purely on passive vision.

[+] KaiserPro|6 years ago|reply
It's not really that wise; it's a large gamble.

Tesla decided not to include lidar because they couldn't find a manufacturer that would make one cheap enough for them, or fell out over terms. It's not a statement of vision. It's exactly the same kind of decision as Apple dropping Flash support for the iPhone: the processor and RAM were too limited to support it, Adobe refused to make compromises, and it was too late to change before launch.

Firstly, Tesla is not focussing purely on passive vision, they are using radar as well. But because radar is nowhere near high resolution enough, they need vision to provide categorization.

Now Musk makes a lot of noise about avoiding lidar; that's mostly because he knows it's a massive gamble. Yes, he bleats on about its power budget and cost, but using pure AI costs a whole lot more in R&D, plus a boatload of latency. Not to mention the massive power budget needed to run the custom silicon.

_eventually_ vision + radar will be more than enough to provide life critical level 5 autonomy. However, Tesla barely provides more than level 2.

They have a number of problems to overcome, rain/bug occlusion of vision sensor, low light performance, sunrise/sunset, fog, reliable realtime depth estimation, etc.

I suspect that CCD based time of flight depth sensors will become cheaper, low power, and small (They are almost certainly going to end up in mobile phones soon) before pure vision realtime life critical depth estimation is a thing.

[+] obese_by_nature|6 years ago|reply
No LIDAR means zero ability to detect pedestrians in heavy fog, rain, and snow, right?