Note that they're sharing this dataset only with *.edu, which is unfortunate for the rest of us. I wish they would allow access to a fraction of the dataset, e.g. 5% of records, for the rest of the community.
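A fractional release like that would be cheap to implement on the provider side. For example, hashing the anonymized user ID deterministically selects ~5% of users while keeping each sampled user's full history intact (a sketch; the record shape here is hypothetical):

```python
import hashlib

def in_sample(user_id: str, fraction: float = 0.05) -> bool:
    """Deterministically include ~`fraction` of users by hashing their ID."""
    digest = hashlib.sha256(user_id.encode()).digest()
    # Map the first 8 bytes of the hash to [0, 1) and compare to the cutoff.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction

# Filter a stream of (user_id, event) records down to ~5% of users.
events = [("u1", "click"), ("u2", "click"), ("u3", "view")]
sampled = [e for e in events if in_sample(e[0])]
```

Hashing (rather than random sampling) means every release of the sample contains the same users, so per-user browsing histories stay complete.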
TO BE ELIGIBLE TO RECEIVE WEBSCOPE DATA, UNLESS SPECIFIED IN A PARTICULAR DATASET, YOU MUST:
Be a faculty member, research employee or student from an accredited university
Send the data request from an accredited university .edu or domain name (for international universities) email address
UNLESS SPECIFIED IN A PARTICULAR DATASET, WE ARE NOT ABLE TO SHARE DATA WITH:
Commercial entities
Employees of commercial entities with university appointment
Research institutions not affiliated with a research university
To Yahoo: that seems like an unbelievably lame restriction. Even aside from commercial entities, there are many, many people working to improve machine learning with no university affiliation.
You're basically taking something really cool here and shooting yourself in the knee with a shotgun.
I hope other companies don't think open data initiatives count if they're not actually open. If you want to keep your data internal and top secret, totally fine, but open data should be available to anyone or it doesn't count.
Almost yahoo, almost...
Edit:
>> I didn't see the word "open" mentioned once in this article...
Touché sir. Sentiment still stands: "released to researchers" and "released to the public" should not be different things.
Would an .edu email suffice, or does one need a formal document from university officials? I tried to click through from the sandbox link, but a Yahoo account is required.
It's sad that the future of "Yahoo!" (the tech company, not the Alibaba stock) is uncertain. They were always very open with their research. Thinking back to 2008/09, they had the biggest Hadoop clusters, etc.; even the first edition of O'Reilly's Hadoop book says "Yahoo press".
Once data like this is deanonymized, it's out there forever -- there's no going back to fix it like you would a software bug. So you need perfect understanding and provable security at release time to guarantee safety into the indefinite future. That's not an easy constraint to satisfy.
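The classic failure mode is a linkage attack: joining "anonymized" records to a public source on quasi-identifiers. A toy illustration (all data, field names, and identifiers here are made up):

```python
# Toy linkage attack: "anonymized" logs are re-identified by joining on
# quasi-identifiers (here, zip code + typical browsing hour).
anonymized_logs = [
    {"user": "a1", "zip": "94301", "hour": 9,  "article": "politics/123"},
    {"user": "a2", "zip": "10001", "hour": 22, "article": "sports/456"},
]
public_profiles = [
    {"name": "Alice", "zip": "94301", "hour": 9},
    {"name": "Bob",   "zip": "10001", "hour": 22},
]

def reidentify(logs, profiles):
    """Match each log row to a named profile sharing the same quasi-identifiers."""
    matches = {}
    for log in logs:
        for person in profiles:
            if (log["zip"], log["hour"]) == (person["zip"], person["hour"]):
                matches[person["name"]] = log["article"]
    return matches

print(reidentify(anonymized_logs, public_profiles))
# {'Alice': 'politics/123', 'Bob': 'sports/456'}
```

With only two quasi-identifiers the join is already unique per person, which is exactly why stripping names alone is not a release-time safety guarantee.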
"The dataset may be used by researchers to validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods."
Great to get some truly "Big Data" sets out there. I consider "Big Data" to be data that can't be conventionally processed on a commodity machine; otherwise it's just analytics.
Yahoo must be applauded for supplying various data sets and helping progress machine learning research.
I saw a course advertised in my email yesterday. Big data with MySQL. The description talked about queries and aggregate functions. That isn't big data - that's just "using a database" before the term "big data" appeared in the mainstream.
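For the record, the kind of aggregate query such a course covers has run on modest hardware for decades; a sketch using Python's built-in sqlite3 in place of MySQL:

```python
import sqlite3

# An ordinary aggregate query of the sort such a course teaches --
# nothing "big" about it; any single machine handles this easily.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id TEXT, article TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("u1", "a"), ("u1", "b"), ("u2", "a")])
rows = conn.execute(
    "SELECT article, COUNT(*) AS n FROM clicks GROUP BY article ORDER BY n DESC"
).fetchall()
print(rows)  # [('a', 2), ('b', 1)]
```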
Can someone please explain to me why this dataset needs to be one big file? They couldn't have broken it down? I need to download the full 1.5 TB? Also, they couldn't have simply made the data available on one of the "big data" services? Seems redundant and inefficient.
Is it possible to get the readme w/o downloading the entire thing?
They state "The readme file for this dataset is located in part 1 of the download. Please refer to the readme file for a detailed overview of the dataset.", but I only see an option to get the full 1.5 TB.
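If their download server honors HTTP range requests (not verified here, and the URL below is a placeholder), you could in principle pull just the first few megabytes of part 1 -- enough to reach a readme at the start of the archive -- without fetching the full 1.5 TB:

```python
import urllib.request

def fetch_head(url: str, num_bytes: int = 4 * 1024 * 1024) -> bytes:
    """Fetch only the first `num_bytes` of a file via an HTTP Range request.

    Works only if the server honors Range headers; if it ignores them and
    returns the whole body, we still truncate defensively on read.
    """
    req = urllib.request.Request(
        url, headers={"Range": f"bytes=0-{num_bytes - 1}"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read(num_bytes)

# head = fetch_head("https://example.com/dataset-part1.tgz")  # placeholder URL
```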
It's too bad they aren't publishing this as an EBS snapshot. That would probably be the most useful way their intended audience could consume it given that most universities get a ton of free Amazon credits for exactly this type of research.
My university had no Amazon credits (2 years ago). I did have access to several supercomputers though, which would work out much better for this type of data.
Yahoo is also somewhat closer to Microsoft than to Amazon.
It is unfortunate. But who knows what sort of restrictions had to be imposed by the various sources of the data, or what other contractual obligations apply? I'd imagine most of us would feel quite differently if we knew that we were sources for certain parts of the dataset.
People who downloaded this - does this contain any form of tagging of the data? For example, do news articles contain visit counts? Article sentiment? Any form of structured information?
Otherwise, what benefit does this have over scraping news sites?
The interesting data here aren't the news articles themselves, but the news-browsing history of 20 million people over a 4 month period.
To answer your first question, though, according to the official description of the dataset [1], "On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article."

[1] http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did...
Did you read the article? It seems like they are providing data on which stories people clicked, and at what time, so you can test temporal and recommendation hypotheses. Some device and location specifics are provided. Scraping can only tell the scraper's story. This data tells millions of people's stories.
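With interaction logs of that shape (a hypothetical `(user, article, timestamp)` schema; the real fields may differ), even a simple co-click counter yields "people who read X also read Y" recommendations that no scraper could reconstruct:

```python
from collections import Counter
from itertools import combinations

# Hypothetical (user, article, timestamp) click log.
clicks = [
    ("u1", "world/1", 100), ("u1", "tech/2", 160),
    ("u2", "world/1", 200), ("u2", "tech/2", 230), ("u2", "sports/3", 300),
    ("u3", "world/1", 400), ("u3", "sports/3", 460),
]

# Group articles by user, then count how often two articles share a reader.
by_user = {}
for user, article, _ts in clicks:
    by_user.setdefault(user, set()).add(article)

co_clicks = Counter()
for articles in by_user.values():
    for a, b in combinations(sorted(articles), 2):
        co_clicks[(a, b)] += 1

# "People who read world/1 also read ..."
recs = {pair: n for pair, n in co_clicks.items() if "world/1" in pair}
```

This is the simplest possible collaborative-filtering signal; the point is that it needs per-user histories, which is exactly what the dataset adds over the articles themselves.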
I am so sick of the implication that all data is equivalent, and there is some generic notion of "big data" that we generic "data scientists" can learn how to "mine" using some generic technique called "deep learning" that will give us all the answers we need like some kind of oracle.
I study biology. The shape of the data, the way it is structured, the problems we face in analyzing it, are quite different than the ones faced in user-news interaction data. Techniques that are useful for reshaping and summarizing one dataset are not necessarily applicable to another.
You need an .edu email address and a Yahoo account with verified SMS to download this!
Very unfortunate.
2) Get .edu email address
3) Profit