Yahoo Releases the Largest-Ever Machine Learning Dataset for Researchers

[+] rodionos|10 years ago|reply

Note that they're sharing this dataset only with *.edu, which is unfortunate for the rest of us. I wish they would allow access to a fraction of the dataset, e.g. 5% of records, for the rest of the community.

[+] hellameta|10 years ago|reply

To clarify from source[1]

TO BE ELIGIBLE TO RECEIVE WEBSCOPE DATA, UNLESS SPECIFIED IN A PARTICULAR DATASET, YOU MUST:

- Be a faculty member, research employee or student from an accredited university
- Send the data request from an accredited university .edu or domain name (for international universities) email address

UNLESS SPECIFIED IN A PARTICULAR DATASET, WE ARE NOT ABLE TO SHARE DATA WITH:

- Commercial entities
- Employees of commercial entities with university appointment
- Research institutions not affiliated with a research university

--

[1] http://webscope.sandbox.yahoo.com/

[+] mrfusion|10 years ago|reply

That really sucks. I hate the whole attitude that you need to be a Ph.D. to do research.

[+] dave_sullivan|10 years ago|reply

To Yahoo: that seems like an unbelievably lame restriction. Even aside from commercial entities, there are many, many people working to improve machine learning with no university affiliation.

You're basically taking something really cool here and shooting yourself in the knee with a shotgun.

I hope other companies don't think open data initiatives count if they're not actually open. If you want to keep your data internal and top secret, totally fine, but open data should be available to anyone or it doesn't count.

Almost, Yahoo, almost...

Edit:

>> I didn't see the word "open" mentioned once in this article...

Touché sir. Sentiment still stands: "released to researchers" and "released to the public" should not be different things.

[+] Estragon|10 years ago|reply

I imagine you'll be able to torrent it fairly soon.

[+] boomzilla|10 years ago|reply

Would an .edu email suffice, or does one need a formal document from university officials? I tried to click through from the sandbox link, but a Yahoo account is required.

[+] frik|10 years ago|reply

Sad that the future of "Yahoo!" (the tech company, not the Alibaba stock) is uncertain. They were always very open with their research. Thinking back to 2008/09, they had the biggest Hadoop clusters, etc. Even the first edition of O'Reilly's Hadoop book says "Yahoo press".

[+] blazespin|10 years ago|reply

That's really irrelevant here. I think we should focus on what an incredible contribution this is. Perhaps a sign of good things to come from Yahoo.

[+] alceufc|10 years ago|reply

Flickr was -- and maybe still is -- very useful for the computer vision research community.

[+] magicmu|10 years ago|reply

As a brand, I'm honestly impressed that they've lasted through so many sea changes.

[+] BetaCygni|10 years ago|reply

I'm torn. I love open data, but I fully expect that someone will (partially) deanonymize this.

[+] rectang|10 years ago|reply

I share your concern.

Once data like this is deanonymized, it's out there forever -- there's no going back to fix it like you would a software bug. So you need perfect understanding and provable security at release time to guarantee safety into the indefinite future. That's not an easy constraint to satisfy.

[+] japaw|10 years ago|reply

Probably. Hopefully they have learned from the fallout of the AOL search log case ( https://en.wikipedia.org/wiki/AOL_search_data_leak ). That case was certainly a big mess.

[+] fweespeech|10 years ago|reply

Yep, and this is why stuff like this should have a formal opt-in process.

[+] mooreds|10 years ago|reply

It is actually 1.5TB compressed. Direct link to the dataset:

http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did...

"The dataset may be used by researchers to validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods."

Edit: added quote

[+] glxc|10 years ago|reply

13.5 TB - that is pretty huge!

Great to get some truly "Big Data" sets out there. I consider "Big Data" to be data that can't be conventionally processed on a commodity machine; otherwise it's just analytics.

Yahoo must be applauded for supplying various data sets and helping progress machine learning research.

[+] collyw|10 years ago|reply

I saw a course advertised in my email yesterday. Big data with MySQL. The description talked about queries and aggregate functions. That isn't big data - that's just "using a database" before the term "big data" appeared in the mainstream.

[+] wahsd|10 years ago|reply

Can someone please explain to me why this dataset needs to be one big file? They couldn't have broken it down? I need to download the full 1.5TB? Also, they couldn't have simply made the data available on one of the "big-data" services? Seems redundant and inefficient.

[+] boltzmannbrain|10 years ago|reply

It's unfortunate Yahoo assumes only those with .edu email addresses make up "the research community".

[+] rovr138|10 years ago|reply

No they don't.

http://webscope.sandbox.yahoo.com

[+] wdr1|10 years ago|reply

Is it possible to get the readme w/o downloading the entire thing?

They state "The readme file for this dataset is located in part 1 of the download. Please refer to the readme file for a detailed overview of the dataset.", but I only see an option to get the full 1.5T.

[+] jedberg|10 years ago|reply

It's too bad they aren't publishing this as an EBS snapshot. That would probably be the most useful way their intended audience could consume it given that most universities get a ton of free Amazon credits for exactly this type of research.

[+] lqdc13|10 years ago|reply

My university had no Amazon credits (2 years ago). I did have access to several supercomputers though, which would work out much better for this type of data.

Yahoo is also somewhat closer to Microsoft than to Amazon.

[+] satyajeet23|10 years ago|reply

Released Publicly?

You need an .edu mail address, a yahoo account with verified sms to download this!

Very unfortunate.

[+] zo1|10 years ago|reply

It is unfortunate. But who knows what sort of restrictions have to be imposed by the various sources of the data and other various contractual obligations? I'd imagine most of us would feel quite differently if we knew that we were sources for certain parts of the dataset.

[+] inglor|10 years ago|reply

People who downloaded this - does this contain any form of tagging of the data? For example, do news articles contain visit counts? Article sentiment? Any form of structured information?

Otherwise, what benefit does this have over scraping news sites?

[+] GrantS|10 years ago|reply

The interesting data here aren't the news articles themselves, but the news-browsing history of 20 million people over a 4 month period.

To answer your first question, though, according to the official description of the dataset [1], "On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article."

[1] http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did...

[+] redrummr|10 years ago|reply

Did you read the article? It seems like they are providing data on the stories people clicked, and at what time, so you can draw temporal and recommendation hypotheses. Some device and location specifics are provided. Scraping can only tell the scraper's story. This data tells millions of people's stories.

[+] fsaintjacques|10 years ago|reply

I really need a yahoo account with verified sms to download this?

[+] jonesb6|10 years ago|reply

1) Begin registration to a community college.

2) Get .edu email address

3) Profit

[+] IshKebab|10 years ago|reply

Not the most interesting dataset though.

[+] astazangasta|10 years ago|reply

I am so sick of the implication that all data is equivalent, and there is some generic notion of "big data" that we generic "data scientists" can learn how to "mine" using some generic technique called "deep learning" that will give us all the answers we need like some kind of oracle.

I study biology. The shape of the data, the way it is structured, the problems we face in analyzing it, are quite different than the ones faced in user-news interaction data. Techniques that are useful for reshaping and summarizing one dataset are not necessarily applicable to another.

[+] blazespin|10 years ago|reply

Word2Vec!
