This article is just talking about the Twitter Data Grants, for which six universities were selected as winners [0]. You won't see papers from these grants yet because, well, the winners were announced about 40 days ago!
Twitter started a data grant program some time back (now closed) [0], awarding datasets on the merits of a short proposal. The number of groups that eventually got access to the data was very small [1]. I hope they increase the number of grants in the future.
Yes, I am pretty sure this article is just rehashing the Twitter data grants (I believe there were only 6 to 8 awards), rather than announcing fully open data for any researcher (which makes the title misleading).
This is exciting to me; does anyone know how Twitter will go about this? Will there be a public dataset available for download? A research contract through the recently acquired Gnip? Or just firehose access for future streams?
Doesn't Twitter make a fair bit of money from selling access to various slices of their data? I'd be surprised if they released it all to the general public. I imagine scientists would have to be under some sort of NDA.
True, but doesn't Twitter already provide an API for access to a fraction of the firehose? Surely that would be enough data. If Twitter doesn't have a good API, Reddit allows full access to all comments through their API (although Reddit has orders of magnitude less data).
It appears as if the data is only available to those scientists who apply for the data grant and win it. Furthermore, applications for the grant have been closed since midway through March. Yeah, I'm not surprised Twitter isn't making its historical data public. That would literally end Gnip, which is a revenue source for Twitter not based on advertising to users.
How about just loosening the API rate limits, or making a better token request process with resource allocation, e.g. "I'd like 1,500 requests per 15-minute window (as opposed to 15, for some endpoints) for 72 hours." I guess this could be limited to those with an academic email address if they insist.
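Client-side, that kind of window-based allocation is easy to respect. Here is a minimal sketch (the class name, limits, and injectable clock are illustrative assumptions, not part of any real Twitter API):

```python
import time
from collections import deque

class WindowRateLimiter:
    """Allow at most `limit` requests per `window` seconds, e.g. Twitter's
    15-requests-per-15-minutes default, or a hypothetical 1,500-request grant."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock        # injectable for testing
        self.calls = deque()      # timestamps of recent requests

    def try_acquire(self):
        """Return True if a request may be sent now, recording its timestamp."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False
```

A caller would check `try_acquire()` before each API request and sleep (or queue the work) when it returns False.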
One thing I've wondered: is it possible to follow "everyone" on Twitter? If not, what kind of cap does Twitter enforce on the number of accounts you're allowed to follow? I realize it'd be difficult to know which new accounts to add as people join, but how far could you push a roll-your-own stream of the Twitter firehose?
Hypothetically: I'm sure you could do it algorithmically; if your program sees a retweet from someone who is not on your following list, you then follow them. You might miss a few, but you would catch almost everyone.
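The snowball idea above can be sketched in a few lines; the function and the (retweeter, original author) pairs are made up for illustration, and in practice each new follow would be an API call subject to rate limits:

```python
def expand_following(stream, following):
    """Snowball sampling sketch: whenever a retweet surfaces an author we
    don't yet follow, add them to the following set.

    `stream` yields (retweeter, original_author) pairs observed on the
    timeline; `following` is the initial set of followed accounts.
    """
    following = set(following)
    for retweeter, original_author in stream:
        if original_author not in following:
            following.add(original_author)
    return following
```

Anyone who is never retweeted by someone already in the set stays invisible, which is why this would "miss a few".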
Their geo-data is utter crap. The vast majority of it is based on 'profile location', which means that there are almost a million people tweeting from the exact center of Atlanta. It's a crowded spot; must be a Starbucks there or something.
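That failure mode is at least easy to detect: geocoding a free-text profile location snaps every "Atlanta" user to the same city centroid, so thousands of accounts share bit-identical coordinates. A hedged sketch of such a check (function name, threshold, and coordinates are all made up):

```python
from collections import Counter

def suspicious_centroids(coords, threshold=1000):
    """Flag exact lat/lon pairs shared by at least `threshold` accounts --
    a telltale sign the coordinates came from geocoded profile text rather
    than device GPS. `coords` is an iterable of (lat, lon) tuples."""
    counts = Counter(coords)
    return {point: n for point, n in counts.items() if n >= threshold}
```

Genuine GPS fixes almost never collide exactly, so any point that clears the threshold is near-certainly a geocoder centroid.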
There is no magic discovery about the nature of man hidden away in that data. Nothing your average stand-up comedian hasn't already written a bit about.
That's a little cynical. I like to think we can predict big historical social events (regime changes, major protests, climate change?) based on Twitter's data.
chbrown|12 years ago|reply
* Library of Congress: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archi...
* Twitter Data grants: https://blog.twitter.com/2014/introducing-twitter-data-grant...
I'll admit, I haven't applied for access through either one, but neither have I seen any papers cite access through those venues—and I read quite a few NLP + Twitter papers.
denzil_correa|12 years ago|reply
[0] https://blog.twitter.com/2014/twitter-datagrants-selections
etiam|12 years ago|reply
"Transfer of Data to the Library
In December, 2010, Twitter named a Colorado-based company, Gnip, as the delivery agent for moving data to the Library. Shortly thereafter, the Library and Gnip began to agree on specifications and processes for the transfer of files - "current" tweets - on an ongoing basis.
In February 2011, transfer of "current" tweets was initiated and began with tweets from December 2010.
On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.
As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies."
I find the quantities hilarious. But since they haven't been able to cope with providing access yet, I get pessimistic about their prospects of doing so any time soon.
Can we do something to help them?
I've been thinking that maybe GPU-accelerated databases like MapD could mitigate the cost issue for them, but I'm pretty sure that doesn't go all the way to solving the problem...
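Some quick arithmetic on the figures quoted above puts the quantities in perspective (the Dec 2012 estimate below is my own extrapolation, not a Library of Congress number):

```python
# Back-of-envelope calculation from the Library of Congress figures.
tweets_2006_2010 = 21e9      # ~21 billion tweets in the 2006-2010 archive
uncompressed_tb = 20         # 20 TB uncompressed

# Average storage per tweet including its ~50 metadata fields.
bytes_per_tweet = uncompressed_tb * 1e12 / tweets_2006_2010   # ~950 bytes

# Extrapolated uncompressed size of the full ~170 billion tweet archive
# as of December 2012, assuming the per-tweet size stayed similar.
total_tweets = 170e9
est_uncompressed_tb = total_tweets * bytes_per_tweet / 1e12   # ~160 TB
```

So each tweet averages roughly a kilobyte with metadata, and the full archive would be on the order of 160 TB uncompressed, which is large but hardly beyond a modern database cluster.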
stokedmartin|12 years ago|reply
[0] https://blog.twitter.com/2014/introducing-twitter-data-grant...
[1] https://blog.twitter.com/2014/twitter-datagrants-selections
alexleavitt|12 years ago|reply
apetresc|12 years ago|reply
beejiu|12 years ago|reply
umanwizard|12 years ago|reply
JoshTriplett|12 years ago|reply
hyperbovine|12 years ago|reply
bgwhn|12 years ago|reply
Smulv|12 years ago|reply
jebus989|12 years ago|reply
uptown|12 years ago|reply
freehunter|12 years ago|reply
jonknee|12 years ago|reply
unknown|12 years ago|reply
[deleted]
theg2|12 years ago|reply
NamTaf|12 years ago|reply
nevinera|12 years ago|reply
izzydata|12 years ago|reply
namenotrequired|12 years ago|reply
unclesaamm|12 years ago|reply
mike415|12 years ago|reply
_RPM|12 years ago|reply
of|12 years ago|reply
callesgg|12 years ago|reply
extesy|12 years ago|reply
flycaliguy|12 years ago|reply
dirtyaura|12 years ago|reply
On a more serious note, there is one area of research for which Twitter data is very valuable: how information and disinformation are created and spread during major news events: wars, catastrophes, uprisings, school shootings, etc.
My gut feeling is that today a lot of quality journalism happens outside the traditional journalistic organisations. The downside is that a lot of wild speculation and rumours are spread as well, but it would be valuable to see how good this modern "crowd" journalism is. A skilled research group could use Twitter data and the Internet Archive to track down the original sources of information pretty well.
coherentpony|12 years ago|reply
You never know unless you try.
scalene|12 years ago|reply