Thanks for the interest! We collected tweets by sampling a small percentage of all tweets in a time window, to emulate what one might get from the streaming API. (I did this as part of the VI-A master's program at MIT, http://vi-a.mit.edu/, and was an employee of Twitter.) We did track a fixed set of topics, but those topics were randomly sampled. We removed topics that trended multiple times within a large time window, such as the name of a football player who scored in multiple matches, because in that case we don't know which event we are trying to detect. One thing the algorithm doesn't do at the moment is come up with its own trending topics. It only tests prediction of trending vs. non-trending on a hold-out set taken from the original set of topics.
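
For anyone who wants to reproduce a similar setup, here is a rough Python sketch of the three steps described above: sampling a fraction of tweets, dropping topics that trended more than once within a window, and holding out part of the topic set. This is not the code we actually ran. The function names, the data shapes (a list of tweets, a dict from topic to trending times), and the parameters `rate`, `window`, and `holdout_fraction` are all assumptions made up for illustration.

```python
import random


def sample_tweets(tweets, rate, rng):
    """Keep each tweet independently with probability `rate` (hypothetical helper)."""
    return [t for t in tweets if rng.random() < rate]


def drop_repeat_trenders(trend_times, window):
    """Keep topics that did not trend twice within `window` time units.

    `trend_times` is an assumed structure: topic -> list of times at which
    the topic trended. Topics that never trended have an empty list.
    """
    kept = []
    for topic, times in trend_times.items():
        ordered = sorted(times)
        # A topic is ambiguous if any two consecutive trending times are close.
        repeated = any(b - a <= window for a, b in zip(ordered, ordered[1:]))
        if not repeated:
            kept.append(topic)
    return kept


def split_holdout(topics, holdout_fraction, rng):
    """Shuffle topics and split them into (train, holdout) lists."""
    shuffled = list(topics)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

A call might look like `train, holdout = split_holdout(drop_repeat_trenders(trend_times, window), 0.3, random.Random(0))`, with the classifier then evaluated only on `holdout`. Passing an explicit `random.Random` instance keeps the sampling and the split reproducible.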