Announcing YouTube-8M: A Large and Diverse Labeled Video Dataset for Research

314 points | runesoerensen | 9 years ago | research.googleblog.com

37 comments

[+] JosephRedfern|9 years ago|reply
If, for some reason, you wanted a list of all of the video IDs (I couldn't easily find such a list), then I wrote a crappy scraper to pull them out: https://gist.github.com/JosephRedfern/d60bdc584d84b1451cc605....

I can post a URL to the output once it's finished running, if it'd be of any use to anyone. Oh, and be warned, there's a strong chance that it's buggy. It's certainly not optimised (no threads).

EDIT: The script has now run. I've scraped ~10,000,000 video IDs, but only ~5.5m of them are unique, so there's probably a bug in my script somewhere (but I need sleep now). Files containing IDs for various categories are listed here: https://redfern.me/public/yt8m/, some notes are here: https://redfern.me/public/yt8m/README.md, and a .tar.gz'd archive is available here: https://redfern.me/public/yt8m/yt8m-ids-probably-incomplete.....
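A quick way to check a scrape like this is to count total versus unique IDs and list the repeats. The snippet below is only a sketch: the sample IDs are made up, and it says nothing about how the actual scraper works. One possible explanation for the gap (not confirmed here) is that a video listed under several categories gets scraped once per category, in which case the duplicates would not be a bug at all.

```python
from collections import Counter


def summarize_ids(ids):
    """Return (total, unique, duplicated_ids) for an iterable of video IDs."""
    counts = Counter(ids)
    total = sum(counts.values())
    # IDs seen more than once, sorted so the result is deterministic.
    duplicated = sorted(vid for vid, n in counts.items() if n > 1)
    return total, len(counts), duplicated


def dedupe_preserving_order(ids):
    """Drop repeated IDs, keeping the first occurrence of each."""
    # dict preserves insertion order (Python 3.7+), so fromkeys dedupes in order.
    return list(dict.fromkeys(ids))


# Hypothetical sample data, standing in for the scraped ID list.
sample = ["a1", "b2", "a1", "c3", "b2", "a1"]
total, unique, duplicated = summarize_ids(sample)
deduped = dedupe_preserving_order(sample)
print(total, unique, duplicated, deduped)
```

Running `summarize_ids` per category file and then on the concatenation of all files would show whether the repeats occur within a category or only across categories.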

[+] garysieling|9 years ago|reply
I'd love a list of IDs - I'm doing a research project that is a search engine for lectures (https://www.findlectures.com) and I'm interested to see if there is any overlap.

It seems like it'd be interesting to explore their tagging compared to what is in video transcripts.

[+] chirau|9 years ago|reply
This is wonderful. Though I wish I could just specify the columns I need and download those, or limit the number of rows. 1.5 TB is quite a bit. Regardless, this is wonderful.

Would I be violating any law or copyright if I reformatted it and put it on my server for that kind of consumption, or served it as JSON?

[+] aub3bhat|9 years ago|reply
The 1.5 TB is just the 1024-dimensional feature vectors (8 bits per dimension), sampled at 1 frame per second over the first 300 seconds of 5 million videos.
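The figures above can be checked with a little arithmetic. This is a back-of-the-envelope sketch that assumes every video contributes the full 300 frames and ignores any file-format overhead, so it is an upper bound rather than an exact size; the constant names are just for illustration.

```python
# Numbers taken from the comment above.
DIMS = 1024             # feature dimensions per frame
BYTES_PER_DIM = 1       # 8 bits each
MAX_SECONDS = 300       # 1 frame per second, first 300 seconds
NUM_VIDEOS = 5_000_000

bytes_per_frame = DIMS * BYTES_PER_DIM
bytes_per_video = bytes_per_frame * MAX_SECONDS   # at most, for videos >= 300 s
total_bytes = bytes_per_video * NUM_VIDEOS
total_tb = total_bytes / 10**12                   # decimal terabytes

print(f"{bytes_per_video} bytes per video, about {total_tb:.3f} TB in total")
```

That works out to roughly 1.5 TB, consistent with the quoted size.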

You can actually download each shard (~300 MB) separately. They haven't yet released the PCA matrix and quantization parameters used with the Inception model, but should release them soon.
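To give a feel for what "8 bits each" implies, here is a generic sketch of uniform 8-bit quantization and the matching dequantization. It is not the scheme used for the dataset: the actual quantization parameters are, per the comment above, not released yet, so the range `LO`/`HI` and both function names are purely hypothetical.

```python
def quantize(values, lo, hi):
    """Map floats in [lo, hi] to integer codes 0..255, clipping anything outside."""
    scale = 255.0 / (hi - lo)
    return [min(255, max(0, round((v - lo) * scale))) for v in values]


def dequantize(codes, lo, hi):
    """Map integer codes 0..255 back to approximate floats in [lo, hi]."""
    step = (hi - lo) / 255.0
    return [lo + c * step for c in codes]


# Hypothetical clipping range; the real parameters are not public here.
LO, HI = -2.0, 2.0

original = [-2.0, -0.5, 0.0, 1.25, 2.0]
codes = quantize(original, LO, HI)
restored = dequantize(codes, LO, HI)

# For in-range inputs, the round trip loses at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(original, restored))
print(codes, max_error)
```

Without the real range (and the PCA matrix), the stored bytes can still be used as features, but they cannot be mapped back to the original Inception activations.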

[+] iverjo|9 years ago|reply
This is nice :) Kudos to the YouTube guys for releasing this. I'm a data scientist in a startup where one of the things I do is create multi-label models for classifying YouTube videos. My current model has 90% precision and 69% recall, while YouTube-8M has 78% precision and 14% recall, with respect to the human raters. I guess one of the reasons is that my model only has around 100 categories, while YouTube-8M has 4800. It's like comparing apples with pears, but still interesting.
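For readers unfamiliar with these metrics in the multi-label setting, here is a minimal sketch of how precision and recall can be computed against human labels. It assumes micro-averaging (pooling counts over all videos), which the comment does not specify, and the labels are made up for illustration.

```python
def micro_precision_recall(true_labels, predicted_labels):
    """Micro-averaged precision and recall over per-video label sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # predicted and correct
        fp += len(pred - truth)   # predicted but not in the human labels
        fn += len(truth - pred)   # human labels the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical human labels and model predictions for three videos.
truth = [{"music", "guitar"}, {"cooking"}, {"gaming", "minecraft", "tutorial"}]
pred = [{"music"}, {"cooking", "food"}, {"gaming"}]

precision, recall = micro_precision_recall(truth, pred)
print(precision, recall)
```

In this toy example the predictions are mostly right when made (high precision) but miss many of the human labels (lower recall), which is the same pattern as the 78% / 14% figures quoted above. A larger label vocabulary gives more labels to miss, so recall tends to suffer first.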
[+] tiplus|9 years ago|reply
Sounds interesting, do you guys have a blog at mashtime? What kind of hardware/software do you use for training? TensorFlow? On AWS or bare-metal GPUs?
[+] edent|9 years ago|reply
I don't see anything about the rights of video owners? Have people (inadvertently) licensed their content to be used in this way?
[+] scott_karana|9 years ago|reply
I wish they'd addressed that too.

I'd guess the reasoning is, because it's a list of public URLs, there's no expectation of privacy.

[+] tdaltonc|9 years ago|reply
How good do labels need to be for you to be able to get good results on something like this? There's a lot of data, so that's great, but the labels seem a bit spotty.
[+] lifeisstillgood|9 years ago|reply
Oh man.

I am searching (thrashing) around for my next "big" project. I have been thinking of drones measuring roof / building quality, and the CV/ML requirements are fairly high - getting my teeth stuck into these would really give me a better feel for training my own system.

The problem is, how do I feed my family while taking the six months to do it all?

[+] timClicks|9 years ago|reply
If you're serious about this concept, create a drone company that takes real estate photos. That will give you hands-on experience with the regulations, quality control issues, etc., while giving you time to build up your training set.
[+] misiti3780|9 years ago|reply
I'm not sure this database will be able to help with that. I doubt building quality is going to be in the annotations, although I did not check.
[+] lolive|9 years ago|reply
Did someone make an RDF dump of that? (Aligned with DBpedia ;)
[+] bitmapbrother|9 years ago|reply
Get back to us when A.I. reaches the level of a cockroach.