
Twitter to Release All Tweets to Scientists

153 points | digital55 | 12 years ago | scientificamerican.com

53 comments

[+] chbrown|12 years ago|reply
I've heard that before.

* Library of Congress: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archi...

* Twitter Data grants: https://blog.twitter.com/2014/introducing-twitter-data-grant...

I'll admit, I haven't applied for access through either one, but neither have I seen any papers cite access through those venues—and I read quite a few NLP + Twitter papers.

[+] etiam|12 years ago|reply
from http://www.loc.gov/today/pr/2013/files/twitter_report_2013ja...

"Transfer of Data to the Library

In December, 2010, Twitter named a Colorado-based company, Gnip, as the delivery agent for moving data to the Library. Shortly thereafter, the Library and Gnip began to agree on specifications and processes for the transfer of files - "current" tweets - on an ongoing basis.

In February 2011, transfer of "current" tweets was initiated and began with tweets from December 2010.

On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.

As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies."

I find the quantities hilarious. But since they haven't been able to cope with providing access yet, I'm pessimistic about their prospects of doing so any time soon.

Can we do something to help them?

I've been thinking maybe GPU-accelerated databases like MapD could mitigate the cost issue for them, but I'm pretty sure that doesn't go all the way to solving the problem...
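
For scale, here is a rough back-of-the-envelope pass over the figures in the report quoted above. It assumes decimal terabytes (10^12 bytes) and that the 133.2 TB figure splits evenly across the two copies; the excerpt states neither, so treat the results as estimates only.

```python
# Figures come from the quoted Library of Congress report.
# ASSUMPTION: "terabyte" means 10**12 bytes.
TB = 10**12

archive_tweets = 21e9              # 2006-2010 archive, approximate count
archive_uncompressed = 20 * TB     # size after decompression
archive_compressed = 2.3 * TB      # size as delivered (three compressed files)

total_tweets = 170e9               # everything received as of December 1, 2012
two_copies_compressed = 133.2 * TB # ASSUMPTION: two equal-sized copies

# Average uncompressed footprint of one tweet plus its metadata fields.
bytes_per_tweet_raw = archive_uncompressed / archive_tweets

# How much the 2006-2010 archive shrank under compression.
compression_ratio = archive_uncompressed / archive_compressed

# Average compressed footprint per tweet in a single stored copy.
bytes_per_tweet_stored = (two_copies_compressed / 2) / total_tweets

print(f"uncompressed bytes per tweet: {bytes_per_tweet_raw:.0f}")
print(f"compression ratio:            {compression_ratio:.1f}x")
print(f"stored bytes per tweet:       {bytes_per_tweet_stored:.0f}")
```

Either way, most of each record is metadata rather than the 140-character text itself.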

[+] stokedmartin|12 years ago|reply
Twitter started granting datasets some time back (the program is now closed)[0] on the merits of a short proposal. Very few groups eventually got access to the data[1]. I hope they increase the number of grants in the future.

[0] https://blog.twitter.com/2014/introducing-twitter-data-grant...

[1] https://blog.twitter.com/2014/twitter-datagrants-selections

[+] alexleavitt|12 years ago|reply
Yes, I am pretty sure this article is just rehashing the Twitter grants (I believe there were only 6 to 8 awards), rather than announcing full open data to any researchers (thereby making the title misleading).
[+] apetresc|12 years ago|reply
This is exciting to me; does anyone know how Twitter will go about this? Will there be a public dataset available for download? A research contract through the recently-acquired GNIP? Or just firehose access for future streams?
[+] beejiu|12 years ago|reply
Considering there's at least 400GB of data generated per day, I don't think it'll be readily available for the public as a download.
[+] umanwizard|12 years ago|reply
Doesn't Twitter make a fair bit of money from selling access to various slices of their data? I'd be surprised if they released it all to the general public. I imagine scientists would have to be under some sort of NDA.
[+] JoshTriplett|12 years ago|reply
This would likely make a great natural language data set for compression algorithms.
[+] hyperbovine|12 years ago|reply
@JoshTriplett tweets r alrdy #compressed. hth
[+] bgwhn|12 years ago|reply
True, but doesn't Twitter already provide an API for access to a fraction of the firehose? Surely that would be enough data. If Twitter doesn't have a good API, Reddit allows full access to all comments through their API (although Reddit has orders of magnitude less data).
[+] Smulv|12 years ago|reply
It appears as if the data is only available to those scientists who apply for the data grant and win it. Furthermore, applications for the grant have been closed since midway through March. Yeah, I'm not surprised Twitter isn't making its historical data public. That would literally end Gnip, which is a revenue source for Twitter not based on advertising to users.
[+] jebus989|12 years ago|reply
How about just loosening the API rate limits, or making a better token request process with resource allocation? For example: I'd like 1500 requests per 15-minute window (as opposed to 15, for some things) for 72 hours. I guess this could be limited to those with an academic email address if they insist.
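
To put numbers on that request, here is a quick sketch of how many calls each limit allows over a 72-hour grant. The function name and the fixed, back-to-back 15-minute windows are assumptions for illustration, not Twitter's documented accounting.

```python
WINDOW_MINUTES = 15  # window length mentioned in the comment
GRANT_HOURS = 72     # duration of the requested allocation


def total_requests(per_window: int,
                   hours: int = GRANT_HOURS,
                   window_minutes: int = WINDOW_MINUTES) -> int:
    """Requests available over `hours`, assuming fixed back-to-back windows."""
    windows = hours * 60 // window_minutes
    return per_window * windows


current_budget = total_requests(15)     # the existing limit "for some things"
proposed_budget = total_requests(1500)  # the limit asked for above

print(f"windows in {GRANT_HOURS} h: {GRANT_HOURS * 60 // WINDOW_MINUTES}")
print(f"at 15 per window:   {current_budget}")
print(f"at 1500 per window: {proposed_budget}")
```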
[+] uptown|12 years ago|reply
One thing I've wondered: is it possible to follow "everyone" on Twitter? If not, what type of cap does Twitter enforce on the number of accounts you're allowed to follow? I realize it'd be difficult to know which new accounts to add as people join, but how far could you push a roll-your-own stream of the Twitter firehose?
[+] freehunter|12 years ago|reply
Hypothetically: I'm sure you could do it algorithmically; if your program sees a retweet from someone who is not on your following list, you then follow them. You might miss a few, but you would get most everyone.
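
A minimal sketch of that idea, run against an in-memory list of tweets rather than any real Twitter API. The `Tweet` shape, the `expand_following` name, and the optional `follow_cap` parameter are all hypothetical; the cap only stands in for whatever limit Twitter actually enforces.

```python
from typing import Iterable, Optional, Set, TypedDict


class Tweet(TypedDict):
    author: str                # account that posted (or retweeted)
    retweet_of: Optional[str]  # original author if this is a retweet, else None


def expand_following(seed: Set[str],
                     timeline: Iterable[Tweet],
                     follow_cap: Optional[int] = None) -> Set[str]:
    """Follow every retweeted account that is not yet on the following list."""
    following = set(seed)
    for tweet in timeline:
        # Only tweets from accounts we already follow are visible to us.
        if tweet["author"] not in following:
            continue
        original = tweet["retweet_of"]
        if original is None or original in following:
            continue
        # Stop once the (hypothetical) follow cap is reached.
        if follow_cap is not None and len(following) >= follow_cap:
            break
        following.add(original)
    return following


timeline = [
    {"author": "alice", "retweet_of": "bob"},
    {"author": "bob", "retweet_of": "carol"},
    {"author": "dave", "retweet_of": "erin"},  # dave is never followed, so unseen
    {"author": "carol", "retweet_of": None},
]

print(sorted(expand_following({"alice"}, timeline)))
print(sorted(expand_following({"alice"}, timeline, follow_cap=2)))
```

The sample also shows the "you might miss a few" part: accounts that nobody in your growing list ever retweets are never discovered.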
[+] theg2|12 years ago|reply
A release for journalists would be nice too...
[+] NamTaf|12 years ago|reply
I can't wait to see someone legitimately design a better sewerage system by using Twitter's geolocation.
[+] nevinera|12 years ago|reply
Their geo-data is utter crap. The vast majority of it is based on 'profile location', which means that there are almost a million people tweeting from the exact center of Atlanta. It's a crowded spot, must be a Starbucks there or something.
[+] izzydata|12 years ago|reply
It is all available to the public to begin with anyway. I don't see the dilemma here.
[+] _RPM|12 years ago|reply
So much for the "protected" tweet illusion.
[+] of|12 years ago|reply
Who cares? Isn't it already available?
[+] callesgg|12 years ago|reply
Yes it is, however not in Excel. (written so a non-tech person could understand)
[+] extesy|12 years ago|reply
Article date is Jun 1, 2014. Is the author from the future?
[+] flycaliguy|12 years ago|reply
There is no magic discovery about the nature of man hidden away in that data. Nothing your average stand-up comedian hasn't already written a bit about.
[+] dirtyaura|12 years ago|reply
That was a good one.

On a more serious note, there is one area of research that Twitter data is very valuable for: how information and disinformation are created and spread during major news events: wars, catastrophes, uprisings, school shootings etc.

My gut feeling is that today a lot of quality journalism happens outside of the traditional journalistic organisations. The downside is that a lot of wild speculation and rumours also get spread, but it would be valuable to see how good this modern "crowd" journalism is. A skilled research group can use Twitter data and the Internet Archive to track down the original sources of information pretty well.

[+] coherentpony|12 years ago|reply
That's a little cynical. I like to think we can predict big historical social events (regime changes, major protests, climate change?) based on Twitter's data.

You never know unless you try.

[+] scalene|12 years ago|reply
Not to be skeptical, but I'm pretty sure one of these scientists may happen to work for the NSA.