Looking through the 2002 season, there's an oddity around touchdowns and extra points. It seems that the 6 points for the touchdown are bundled with the extra point, and the score is not updated until the extra point is complete.
This seems like as good a time as any to share something I've been working on which uses the same source data, even though it's pretty rough at the moment (slow, bad data, only currently goes through week 8 of 2012, etc.):
I'm frankly surprised that this information is allowed to be distributed. I spent a while in the financial services industry, and while it was really easy to obtain "public" information like stock quote data, I recall that we weren't allowed to simply scrape data from public sites... we had to pay a license fee to get a feed of the data if we were planning on repackaging & distributing it.
Generally speaking, stats are public domain, since they describe a public event that occurred. That a sports league may disagree with this position doesn't mean it isn't true. However, it's entirely possible to violate a given site's TOU by scraping the data; that doesn't mean the data itself can't be compiled or distributed.
IANAL, but I asked one about this a while ago; let's see if I can remember: It's complicated. The NFL broadcasts are copyrighted, and come with a statement that (among other things) distributing descriptions of the game is not allowed. Such descriptions could be considered a derivative work.
IANAL, but there was a landmark case where the NBA sued Motorola and STATS Inc. for distributing live game statistics. The ruling ended up in favor of STATS; the decision was that pure facts could not be copyrighted.
The NFL is not nearly as strict as MLB (note how you never see an MLB highlight on YouTube?), yet MLB allows http://retrosheet.org/ to exist. I don't believe it's technically "dissemination".
Fun fact: In the UK even the FA Premier League fixture list is copyrighted, and websites or publications wishing to publish all or part of the fixture list for a given season need to pay a licence fee.
Here is an idea: build a predictive model of an offensive coach that predicts the play he will call, given a game situation (and based on that, build a predictiveness quotient for a coach).
It doesn't work like that in practice. Football is very dependent on matchups. Coaches will vary gameplans from week-to-week to exploit weaknesses they see on film.
If only a single play call had a single potential outcome, and that outcome was always met. Using these stats for predictions would seem extremely difficult beyond answering, "will it be a run or a pass?"
There are too many missing variables. The most obvious being who made the actual play call, the head coach, the offensive coordinator, or the quarterback.
The CSV file format is nice, but if you're looking for a Python API to play with NFL stats without having to parse play-data fields, check out nflgame [1]. I've written up a quick primer. [2] It also includes the ability to get play-by-play statistics live.
Here's some soccer data, doesn't include play-by-play though (soccer generally isn't suited to that kind of breakdown, although Opta Sports do track it).
The comments on that are awesome too - great advice for parsing, categorizing, and such. I couldn't download 2010 though - "Sorry, we are unable to generate a view of the document at this time. Please try again later."
Amazing!!! Thanks to www.advancednflstats.com for doing all the leg-work. Highly recommend their site too. Their in-game win probability statistics are always a must-have for me on game-day ^_^
This looks like great fun... Judging by some of the sample entries, it will also be an instructive example of the limitations of CSV and why serious analysts who want to work with unstructured data need to know a scripting language, or at least regexes.
The Excel function looks ridiculous, but it probably didn't take more than 10 minutes to make, tops. Nested conditionals are easy.
God this is such interesting stuff. How do we still not have a fully featured open source NFL stats-rosters-game charting API? Who wouldn't want to contribute to that project?
A bit OT, but I thought this might be a good opportunity to mention the upcoming SportsHackDay in Seattle from Feb 1-3 which culminates in a group viewing of the SuperBowl.
http://sportshackday.com/
I've used data from Brian Burke's site before. I think it's the exact PBP data the NFL has, but you'll find that the structure and common phrasings change over the years. I had to write a lot of regular expressions and I was still catching edge cases for weeks.
There is a lot to have fun with here. I would imagine though that in a lot of NFL coaching rooms there has to be a balance between coaching and analysis.
As a French person not interested in sports at all, this would have made no sense to me before I watched the TV series The League [1]. Now I kind of enjoy the fact that these stats exist and are available in an open format, even if I don't really care myself.
[+] [-] edw519|13 years ago|reply
20090201_PIT@ARI,2,30,18,ARI,PIT,1,1,1,(:18) (Shotgun) K.Warner pass short middle intended for A.Boldin INTERCEPTED by J.Harrison at PIT 0. J.Harrison for 100 yards TOUCHDOWN. Super Bowl Record longest interception return yards. Penalty on ARZ-E.Brown Face Mask (15 Yards) declined. The Replay Assistant challenged the runner broke the plane ruling and the play was Upheld.,7,10,2008
They forgot: for(i=0;i<92;i++){yell('edw519','GO!')}
Seriously, I had plans for the next 4 days, but I just scrapped them. Funny how jazzed I get when it's data that I can really relate to...
I've already structured my data warehouse and started the loads. (I'll probably need a whole day just to parse the text in Field 10.) Then I'm going to build a Business Intelligence system on top of it. I will finally have the proof I need that I, not the offensive coordinator, should be texting each play to Coach Tomlin.
See you guys on Monday.
EDIT: OK, I'm back, but not for long. I'm having way too much fun with this...
fleaflicker: Cool website & domain name. Thanks for the tips. I expect shortcomings in the data, but it looks like it's in a lot better shape than the usual free form enterprise quality/vendor/customer comments I usually have to parse. We'll see...
MattSayer & sjs382: I don't plan to do any analysis. I prefer to build an app that enables others to do their own analyses, answering questions that nobody else is asking. Like "Which Steeler makes the most tackles on opposing runs of more than 5 yards when it's 3rd down and longer than 8 yards to go, the temperature is below 38, and edw519 is twirling his Terrible Towel clockwise?"
jerf: Nice thought. I've spent years trying to earn enough money to buy the Pittsburgh Steelers just to fire the slackers and fumblers and win the Super Bowl every year. Maybe I should just take an easier route and solve that problem like any self-respecting hacker should: with data & logic. No Steeler game this weekend; I may have found my destiny </sarcasm>
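For anyone else starting on "Field 10": a minimal sketch of splitting one line into named fields. The column names below are assumptions inferred from the sample row, not taken from any documentation, and the split assumes the file is unquoted, so it peels nine fields off the left and three off the right and treats whatever remains as the description (which keeps any commas inside the play text intact).

```python
# Column names are guesses based on the sample row in this thread.
FIELDS = ["gameid", "qtr", "min", "sec", "off", "def", "down", "togo",
          "ydline", "description", "offscore", "defscore", "season"]


def parse_row(line):
    """Split one play-by-play line into a dict keyed by FIELDS."""
    # First nine fields come off the left; the remainder is kept whole.
    head = line.rstrip("\n").split(",", 9)
    if len(head) < 10:
        raise ValueError("expected at least 10 comma-separated fields")
    # Last three fields come off the right; the middle is the description.
    tail = head[9].rsplit(",", 3)
    if len(tail) != 4:
        raise ValueError("expected description plus 3 trailing fields")
    return dict(zip(FIELDS, head[:9] + tail))


sample = (
    "20090201_PIT@ARI,2,30,18,ARI,PIT,1,1,1,(:18) (Shotgun) K.Warner pass "
    "short middle intended for A.Boldin INTERCEPTED by J.Harrison at PIT 0. "
    "J.Harrison for 100 yards TOUCHDOWN.,7,10,2008"
)
row = parse_row(sample)
print(row["off"], row["def"], row["season"])
print(row["description"])
```

All values come back as strings; converting scores and downs to integers is left out because some rows may have empty fields.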
[+] [-] tghw|13 years ago|reply
It seems this might result in bugs, as in the Oct 20, 2002 game between Dallas and Arizona. In the third quarter, with a score of Arizona 6 - Dallas 0, Dallas scored a touchdown (row 13900) but "aborted" the extra point (row 13901). The 6 points for the Cowboys are not recorded in the data.
The game eventually went to overtime, with the Cardinals kicking a winning field goal in OT for a final score of Arizona 9 - Dallas 6, but the data here records it as Arizona 6 - Dallas 0.
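A rough way to hunt for other cases like this one: flag every touchdown row after which the combined score never rises for the rest of the game. This is only a sketch. It assumes the rows for a single game are already in play order as dicts with `description`, `offscore`, and `defscore` keys (hypothetical names), and that the score fields are always numeric. The example rows are made up to mimic the Dallas/Arizona situation, not copied from the file.

```python
def total_points(row):
    """Combined score of both teams as recorded on this row."""
    return int(row["offscore"]) + int(row["defscore"])


def unscored_touchdowns(rows):
    """Return indexes of TOUCHDOWN rows never followed by a score increase."""
    flagged = []
    for i, row in enumerate(rows):
        if "TOUCHDOWN" not in row["description"]:
            continue
        before = total_points(row)
        # Highest combined score seen on any later row of the same game.
        later = max((total_points(r) for r in rows[i + 1:]), default=before)
        if later <= before:
            flagged.append(i)
    return flagged


# Made-up rows: a touchdown, a failed extra point, and no score update.
game = [
    {"description": "pass deep right for 20 yards TOUCHDOWN.",
     "offscore": "0", "defscore": "6"},
    {"description": "extra point is Aborted.",
     "offscore": "0", "defscore": "6"},
    {"description": "kicks 65 yards to end zone, Touchback.",
     "offscore": "6", "defscore": "0"},
]
print(unscored_touchdowns(game))
```

Expect false positives (touchdowns nullified by penalty or replay, or a touchdown on the last row of a game) and false negatives (a later score by either team hides the missing points), so treat the output as a list of rows to check by hand.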
[+] [-] danso|13 years ago|reply
http://www.advancednflstats.com/2007/02/contact.html
Of particular interest:
Where did you get your data?
Most of my team data comes from open online sources such as espn.com, nfl.com, myway.com, and yahoo.com. It's easy for anyone to grab whatever they're interested in from those sites.
My play-by-play data comes from a source that's not publicly available, and at this time I regret that I cannot share it. However, I am working hard to develop a way to spread the wealth. One of my biggest goals is to help create a larger, more open, and more collaborative community for football research.
----
There are no real terms of service, so I'm curious about the constraints on using this for commercial purposes. I most definitely want to use this for teaching purposes (how to text-mine, how to build a web app from data, etc.) but want to know under what terms the data can be redistributed.
[+] [-] petersalas|13 years ago|reply
http://nfl-query.herokuapp.com/
The basic syntax is [stats] [conditions] : [row] / [column].
There's some autocompletion to try to make it possible to discover what is accepted.
Examples:
passing yards : team / season
first downs / first down attempts : down / distance
rushing yards min 100 rushing yards : player, game, quarter
rushing yards / carries min 200 carries : player
One of the biggest problems is that it's currently way too easy to shoot yourself in the foot by making a really slow query.
[+] [-] arscan|13 years ago|reply
It seems to me that the NFL would want to have exclusive rights to distribute this data and charge people a fee for access to it. Clearly I'm no expert in these legal affairs though.
[+] [-] vsprabhakara1|13 years ago|reply
IANAL, but I worked at ESPN and founded Fanvibe (YC S'10), and worked quite a bit with the leagues and lawyers on rights-related topics.
[+] [-] aidenn0|13 years ago|reply
On the other hand, a live performance is generally not protected by copyright, so if you attend a live game to collect the data, you may be in the clear.
The data isn't owned by the NFL, but all recordings of the games are, and so any data obtained by watching recordings of the games could potentially be controlled by the NFL.
[+] [-] dkoch|13 years ago|reply
[+] [-] kodablah|13 years ago|reply
Edit: I should note that even http://gdx.mlb.com/components/game/mlb/ contents are governed by http://gdx.mlb.com/components/copyright.txt which doesn't allow commercial use.
[+] [-] saturdayplace|13 years ago|reply
[+] [-] frozenport|13 years ago|reply
[+] [-] aes256|13 years ago|reply
I can't even give you a list of football fixtures coming up this weekend without breaking copyright law.
[+] [-] dude_abides|13 years ago|reply
[+] [-] fleaflicker|13 years ago|reply
[+] [-] nchuhoai|13 years ago|reply
[+] [-] zgohr|13 years ago|reply
[+] [-] sethist|13 years ago|reply
[+] [-] euroclydon|13 years ago|reply
[+] [-] DanBC|13 years ago|reply
[+] [-] pinchyfingers|13 years ago|reply
[+] [-] burntsushi|13 years ago|reply
[1] - https://github.com/BurntSushi/nflgame
[2] - http://blog.burntsushi.net/nfl-live-statistics-with-python
[+] [-] ImJasonH|13 years ago|reply
[+] [-] danvoell|13 years ago|reply
[+] [-] snake_plissken|13 years ago|reply
[+] [-] patrickk|13 years ago|reply
http://www.football-data.co.uk/downloadm.php
Tons of European leagues, going back to 1993 in some cases.
Here are some sites that give detailed stats and match reports:
http://www.eplindex.com/
http://www.whoscored.com/
http://www.soccerstats.com/
http://www.soccerway.com/
http://www.squawka.com/
Man City use their petro-dollars to open up Opta Sports (detailed match stats) to all: http://www.mcfc.co.uk/the-club/mcfc-analytics
Someone needs to compile stats equivalent to these NFL ones for European football! Hmmmm...
[+] [-] ScottWhigham|13 years ago|reply
[+] [-] gavinlynch|13 years ago|reply
[+] [-] tghw|13 years ago|reply
http://www.advancednflstats.com/2009/09/4th-down-study-part-...
The tl;dr version can be found at:
http://www.advancednflstats.com/2010/05/4th-down-briefs.html
The conclusion is that teams should go for it on 4th down much more often than they currently do.
He also has a calculator where you can get the exact values:
http://wp.advancednflstats.com/4thdncalc1.php
[+] [-] danso|13 years ago|reply
Sample description field:
> 20020905_SF@NYG,1,59,20,NYG,SF,3,11,81,(14:20) (Shotgun) K.Collins pass intended for T.Barber INTERCEPTED by T.Parrish (M.Rumph) at NYG 29. T.Parrish to NYG 23 for 6 yards (T.Barber).,0,0,2002
In the comments section of the OP, someone posted this sample Excel function:
Dear god, at what point do people finally realize that it's worth learning some simple scripting to work with text files?
[+] [-] HelloMcFly|13 years ago|reply
At any rate, what would you recommend most to accomplish the task? I'm learning Python and know R a bit, so I was just wondering how I was going to go about combing through the data.
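Since Python came up, here is one way a regex pass over the description field might look, using the interception phrasing from the two sample rows quoted in this thread. It is a sketch, not a complete parser: the pattern only covers the "pass ... intended for X INTERCEPTED by Y" wording, player names are assumed to look like `K.Warner`, and the function name `parse_interception` is made up for illustration.

```python
import re

# Player names in the samples look like "K.Warner" or "T.Barber".
NAME = r"[A-Z][A-Za-z]*\.[A-Z][A-Za-z'\-]+"

INTERCEPTION = re.compile(
    rf"(?P<passer>{NAME}) pass .*?"
    rf"intended for (?P<target>{NAME}) "
    rf"INTERCEPTED by (?P<defender>{NAME})"
)


def parse_interception(description):
    """Return passer/target/defender for an interception, else None."""
    match = INTERCEPTION.search(description)
    return match.groupdict() if match else None


samples = [
    "(:18) (Shotgun) K.Warner pass short middle intended for A.Boldin "
    "INTERCEPTED by J.Harrison at PIT 0. J.Harrison for 100 yards TOUCHDOWN.",
    "(14:20) (Shotgun) K.Collins pass intended for T.Barber INTERCEPTED "
    "by T.Parrish (M.Rumph) at NYG 29. T.Parrish to NYG 23 for 6 yards "
    "(T.Barber).",
]
for text in samples:
    print(parse_interception(text))
```

In practice you would end up with one pattern per play type (run, sack, punt, penalty, and so on), and the wording reportedly drifts between seasons, so rows where no pattern matches are worth logging rather than silently dropping.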
[+] [-] Groxx|13 years ago|reply
[+] [-] AlwaysBCoding|13 years ago|reply
Other than cool data visualization stuff, the obvious implication is the potential to devise a profitable system to pick games against the spread. The guys at Football Outsiders have done a decent job at it and made a proprietary algorithm that picked games at 58% this year (which is over the threshold you need to be profitable in Vegas). But even those guys are still having some trouble getting access to and aggregating the data in a usable format.
I really want to sit down and start playing around with some of this data so I appreciate you putting this together for everyone. The NFL needs an open source API and this is definitely a step in the right direction.
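To put a number on that threshold: the arithmetic below assumes the common pricing of risking 110 to win 100 on a spread bet, which is not stated anywhere in this thread, so swap in whatever prices you actually get. The break-even win rate is simply the amount risked divided by the total of risk plus winnings.

```python
def break_even_win_rate(risk, win):
    """Win rate at which a bet risking `risk` to win `win` breaks even."""
    return risk / (risk + win)


def expected_profit(win_rate, risk, win):
    """Average profit per bet at the given win rate."""
    return win_rate * win - (1 - win_rate) * risk


# Assumed pricing: risk 110 to win 100 on every game.
threshold = break_even_win_rate(110, 100)
print(f"break-even win rate: {threshold:.4f}")   # 0.5238
print(f"profit per bet at 58%: {expected_profit(0.58, 110, 100):.2f}")  # 11.80
```

Under that assumption a 58% picker clears the bar comfortably, though one season of picks is a small sample.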
[+] [-] nchuhoai|13 years ago|reply
https://www.dropbox.com/s/cy04oxaq83mxvoz/report.pdf
[+] [-] evanjacobs|13 years ago|reply
[+] [-] kevinburke|13 years ago|reply
http://downanddistance.herokuapp.com/
[+] [-] jredwards|13 years ago|reply
btw, pro-football-reference has pbp data now too, and it probably goes back a lot further, but I think they discourage mass scraping of their site.
[+] [-] pmarsh|13 years ago|reply
Like someone else said, it's about match-ups.
Semi-related : http://profootballtalk.nbcsports.com/2013/01/03/polian-think...
[+] [-] p4bl0|13 years ago|reply
[1] http://www.imdb.com/title/tt1480684/