
Every NFL play for the past 10 years in CSV format

421 points| edw519 | 13 years ago |advancednflstats.com | reply

101 comments

[+] edw519|13 years ago|reply
From Line 42536 of the 2008 CSV file:

20090201_PIT@ARI,2,30,18,ARI,PIT,1,1,1,(:18) (Shotgun) K.Warner pass short middle intended for A.Boldin INTERCEPTED by J.Harrison at PIT 0. J.Harrison for 100 yards TOUCHDOWN. Super Bowl Record longest interception return yards. Penalty on ARZ-E.Brown Face Mask (15 Yards) declined. The Replay Assistant challenged the runner broke the plane ruling and the play was Upheld.,7,10,2008

They forgot: for(i=0;i<92;i++){yell('edw519','GO!')}

Seriously, I had plans for the next 4 days, but I just scrapped them. Funny how jazzed I get when it's data that I can really relate to...

I've already structured my data warehouse and started the loads. (I'll probably need a whole day just to parse the text in Field 10.) Then I'm going to build a Business Intelligence system on top of it. I will finally have the proof I need that I, not the offensive coordinator, should be texting each play to Coach Tomlin.
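
For anyone wondering what parsing Field 10 might look like, here is a minimal Python sketch. The column names are an assumption inferred from the sample row above (check them against the file's real header), the sample is a shortened copy of that row, and the regex covers only this one phrasing of an interception return.

```python
import csv
import io
import re

# Assumed column names, inferred from the sample row; verify against the real header.
COLUMNS = ["gameid", "qtr", "min", "sec", "off", "def", "down", "togo",
           "ydline", "description", "offscore", "defscore", "season"]

# Shortened copy of the sample row quoted above.
SAMPLE = ("20090201_PIT@ARI,2,30,18,ARI,PIT,1,1,1,"
          "(:18) (Shotgun) K.Warner pass short middle intended for A.Boldin "
          "INTERCEPTED by J.Harrison at PIT 0. J.Harrison for 100 yards "
          "TOUCHDOWN.,7,10,2008\n")

# Handles only the "INTERCEPTED by <player> ... for <n> yards" phrasing.
INT_RETURN = re.compile(
    r"INTERCEPTED by (?P<player>[A-Z]\.[A-Za-z'-]+).*?for (?P<yards>-?\d+) yards"
)

def parse_rows(text):
    """Yield each CSV line as a dict keyed by the assumed column names."""
    for row in csv.reader(io.StringIO(text)):
        yield dict(zip(COLUMNS, row))

def interception_return(description):
    """Return (player, yards) for an interception return, or None if no match."""
    match = INT_RETURN.search(description)
    if match is None:
        return None
    return match.group("player"), int(match.group("yards"))

play = next(parse_rows(SAMPLE))
result = interception_return(play["description"])
print(play["off"], play["qtr"], result)
```

A description containing an unquoted comma would shift the later columns, so a real loader should check that every row has the expected number of fields.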

See you guys on Monday.

EDIT: OK, I'm back, but not for long. I'm having way too much fun with this...

fleaflicker: Cool website & domain name. Thanks for the tips. I expect shortcomings in the data, but it looks like it's in a lot better shape than the usual free form enterprise quality/vendor/customer comments I usually have to parse. We'll see...

MattSayer & sjs382: I don't plan to do any analysis. I prefer to build an app that enables others to do their own analyses, answering questions that nobody else is asking. Like "Which Steeler makes the most tackles on opposing runs of more than 5 yards when it's 3rd down and longer than 8 yards to go, the temperature is below 38, and edw519 is twirling his Terrible Towel clockwise?"

jerf: Nice thought. I've spent years trying to earn enough money to buy the Pittsburgh Steelers just to fire the slackers and fumblers and win the Super Bowl every year. Maybe I should just take an easier route and solve that problem like any self-respecting hacker should: with data & logic. No Steeler game this weekend; I may have found my destiny </sarcasm>

[+] tghw|13 years ago|reply
Looking through the 2002 season, there's an oddity around touchdowns and extra points. It seems that the 6 points for the touchdown are bundled with the extra point, and the score is not updated until the extra point is complete.

It seems this might result in bugs, as in the Oct 20, 2002 game between Dallas and Arizona. In the third quarter, with a score of Arizona 6 - Dallas 0, Dallas scored a touchdown (row 13900) but "aborted" the extra point (row 13901). The 6 points for the Cowboys are not recorded in the data.

The game eventually went to overtime, with the Cardinals kicking a winning field goal in OT for a final score of Arizona 9 - Dallas 6, but the data here records it as Arizona 6 - Dallas 0.
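
A rough way to hunt for other cases like this, sketched in Python. It assumes each play is a dict with `gameid`, `description`, `offscore` and `defscore` keys (names guessed from the sample rows in this thread), and the rows below are made up for illustration, not copied from the file. The `window` parameter is also just a guess at how many rows it takes for the score to catch up.

```python
def combined_score(play):
    # offscore/defscore swap sides with possession, so compare their sum instead.
    return int(play["offscore"]) + int(play["defscore"])

def unscored_touchdowns(plays, window=3):
    """Return indices of TOUCHDOWN plays whose 6 points never show up in the
    combined score within the next `window` rows of the same game."""
    flagged = []
    for i, play in enumerate(plays):
        if "TOUCHDOWN" not in play["description"]:
            continue
        before = combined_score(play)
        later = [p for p in plays[i + 1:i + 1 + window]
                 if p["gameid"] == play["gameid"]]
        # Flag only if there are later rows and none of them reflect the score.
        if later and all(combined_score(p) < before + 6 for p in later):
            flagged.append(i)
    return flagged

# Made-up rows: game G1 loses its touchdown, game G2 records it after the kick.
plays = [
    {"gameid": "G1", "description": "X.Runner up the middle for 3 yards, TOUCHDOWN.", "offscore": "0", "defscore": "6"},
    {"gameid": "G1", "description": "Extra point aborted.", "offscore": "0", "defscore": "6"},
    {"gameid": "G1", "description": "X.Kicker kicks 65 yards.", "offscore": "0", "defscore": "6"},
    {"gameid": "G2", "description": "Y.Back left end for 1 yard, TOUCHDOWN.", "offscore": "0", "defscore": "0"},
    {"gameid": "G2", "description": "Y.Kicker extra point is GOOD.", "offscore": "0", "defscore": "0"},
    {"gameid": "G2", "description": "Y.Kicker kicks 70 yards.", "offscore": "0", "defscore": "7"},
]
flagged = unscored_touchdowns(plays)
print(flagged)
```

Touchdowns that are later nullified by a penalty or a replay reversal would also be flagged, so the output is a list of rows to inspect by hand, not a list of confirmed bugs.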

[+] danso|13 years ago|reply
There's a FAQ for this data that is on the site's main nav:

http://www.advancednflstats.com/2007/02/contact.html

Of particular interest:

Where did you get your data?

Most of my team data comes from open online sources such as espn.com, nfl.com, myway.com, and yahoo.com. It's easy for anyone to grab whatever they're interested in from those sites.

My play-by-play data comes from a source that's not publicly available, and at this time I regret that I cannot share it. However, I am working hard to develop a way to spread the wealth. One of my biggest goals is to help create a larger, more open, and more collaborative community for football research.

----

There are no real terms of service, so I'm curious about the constraints on using this for commercial purposes. I most definitely want to use this for teaching purposes (how to text-mine, how to build a web app from data, etc.) but want to know under what terms the data can be redistributed.

[+] petersalas|13 years ago|reply
This seems like as good a time as any to share something I've been working on which uses the same source data, even though it's pretty rough at the moment (slow, bad data, only currently goes through week 8 of 2012, etc.):

http://nfl-query.herokuapp.com/

The basic syntax is [stats] [conditions] : [row] / [column].

There's some autocompletion to try to make it possible to discover what is accepted.

Examples:

passing yards : team / season

first downs / first down attempts : down / distance

rushing yards min 100 rushing yards : player, game, quarter

rushing yards / carries min 200 carries : player

One of the biggest problems is that it's currently way too easy to shoot yourself in the foot by making a really slow query.

[+] arscan|13 years ago|reply
I'm frankly surprised that this information is allowed to be distributed. I spent awhile in the financial services industry, and while it was really easy to obtain "public" information like stock quote data, I recall that we weren't allowed to simply scrape data from public sites... we had to pay a license fee to get a feed of the data if we were planning on repackaging & distributing it.

It seems to me that the NFL would want to have exclusive rights to distribute this data and charge people a fee for access to it. Clearly I'm no expert in these legal affairs though.

[+] vsprabhakara1|13 years ago|reply
Generally speaking, stats are public domain as they are a public event that occurred. Just because a sports league may disagree with this position doesn't mean that it isn't true. However, it's entirely possible to violate a given site's TOU by scraping the data; that doesn't mean the data itself isn't allowed to be compiled or distributed.

IANAL, but I worked at ESPN and founded Fanvibe (YC S'10), and worked quite a bit with the leagues and lawyers on rights-related topics.

[+] aidenn0|13 years ago|reply
IANAL, but I asked one about this a while ago; let's see if I can remember: It's complicated. The NFL broadcasts are copyrighted, and come with a statement that (among other things) distributing descriptions of the game is not allowed. That could be considered a derivative work.

On the other hand, a live performance is generally not protected by copyright, so if you attend a live game to collect the data, you may be in the clear.

The data isn't owned by the NFL, but all recordings of the games are, and so any data obtained by watching recordings of the games could potentially be controlled by the NFL.

[+] dkoch|13 years ago|reply
IANAL, but there was a landmark case where the NBA sued Motorola and STATS Inc. for distributing live game statistics. The ruling ended up in favor of STATS, with the court deciding that pure facts could not be copyrighted.
[+] frozenport|13 years ago|reply
Let's pretend he got it from memory.
[+] aes256|13 years ago|reply
Fun fact: In the UK even the FA Premier League fixture list is copyrighted, and websites or publications wishing to publish all or part of the fixture list for a given season need to pay a licence fee.

I can't even give you a list of football fixtures coming up this weekend without breaking copyright law.

[+] dude_abides|13 years ago|reply
Here is an idea: build a predictive model of an offensive coach that predicts the play he will call, given a game situation (and based on that, build a predictiveness quotient for a coach).
[+] fleaflicker|13 years ago|reply
It doesn't work like that in practice. Football is very dependent on matchups. Coaches will vary gameplans from week-to-week to exploit weaknesses they see on film.
[+] zgohr|13 years ago|reply
If only a single play call had a single potential outcome, and that outcome was always met. Using these stats for predictions would seem extremely difficult beyond answering, "will it be a run or a pass?"
[+] sethist|13 years ago|reply
There are too many missing variables. The most obvious being who made the actual play call, the head coach, the offensive coordinator, or the quarterback.
[+] euroclydon|13 years ago|reply
How many of you are thinking right now: I'm going to generate an HTML page for every game and throw ads on it? Be Honest!
[+] DanBC|13 years ago|reply
Is anyone going to try a 'moneyball' style Fix_Your_Fantasy_League_LineUp site with ads?
[+] ImJasonH|13 years ago|reply
I've started uploading these CSVs to a public Google BigQuery dataset called [nfl], so you can run queries over them like this:

    SELECT off, COUNT(off) AS count
    FROM [nfl.2012reg]
    WHERE description CONTAINS "INTERCEPTED"
    GROUP BY off
    ORDER BY count DESC
(This counts the number of plays that resulted in an interception by the team that threw the interception, sorted from most to fewest INTs)
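
If you'd rather run the same count locally, here is a minimal Python sketch of the equivalent logic. It assumes each row is a dict with `off` and `description` keys (the names used in the query above); the three rows are made up for illustration and the filename in the comment is hypothetical.

```python
from collections import Counter

def interceptions_by_offense(plays):
    """Count plays whose description contains INTERCEPTED, grouped by the
    team on offense, sorted from most to fewest."""
    counts = Counter(p["off"] for p in plays if "INTERCEPTED" in p["description"])
    return counts.most_common()

# Made-up rows, only to show the shape of the input and output.
rows = [
    {"off": "ARI", "description": "K.Warner pass INTERCEPTED by J.Harrison."},
    {"off": "NYG", "description": "K.Collins pass INTERCEPTED by T.Parrish."},
    {"off": "ARI", "description": "K.Warner pass short left INTERCEPTED by T.Polamalu."},
    {"off": "PIT", "description": "W.Parker up the middle for 4 yards."},
]
int_counts = interceptions_by_offense(rows)
print(int_counts)

# With a real file (hypothetical filename, header assumed to include off/description):
# import csv
# with open("2012_nfl_pbp_data.csv", newline="") as f:
#     print(interceptions_by_offense(csv.DictReader(f)))
```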
[+] danvoell|13 years ago|reply
I'm new to BigQuery, how do I access a public dataset? I ran the query and got the error Not Found: Dataset 578707073226:nfl
[+] snake_plissken|13 years ago|reply
mean reversion between ints for NFL live betting on next int? i smell greenbacks!
[+] patrickk|13 years ago|reply
Here's some soccer data, doesn't include play-by-play though (soccer generally isn't suited to that kind of breakdown, although Opta Sports do track it).

http://www.football-data.co.uk/downloadm.php

Tons of European leagues, going back to 1993 in some cases.

Here's some sites that give detailed stats and match reports:

http://www.eplindex.com/

http://www.whoscored.com/

http://www.soccerstats.com/

http://www.soccerway.com/

http://www.squawka.com/

Man City use their petro-dollars to open up Opta Sports (detailed match stats) to all: http://www.mcfc.co.uk/the-club/mcfc-analytics

Someone needs to compile stats equivalent to these NFL ones for european football! Hmmmm...

[+] ScottWhigham|13 years ago|reply
The comments on that are awesome too - great advice for parsing, categorizing, and such. I couldn't download 2010 though - "Sorry, we are unable to generate a view of the document at this time. Please try again later."
[+] gavinlynch|13 years ago|reply
Amazing!!! Thanks to www.advancednflstats.com for doing all the leg-work. Highly recommend their site too. Their in-game win probability statistics are always a must-have for me on game-day ^_^
[+] danso|13 years ago|reply
This looks like great fun...Judging by some of the sample entries, it will also be an instructive example of the limitations of CSV and why serious analysts who want to work with unstructured data need to know a scripting language, or at least regexes.

Sample description field: > 20020905_SF@NYG,1,59,20,NYG,SF,3,11,81,(14:20) (Shotgun) K.Collins pass intended for T.Barber INTERCEPTED by T.Parrish (M.Rumph) at NYG 29. T.Parrish to NYG 23 for 6 yards (T.Barber).,0,0,2002

In the comments section of the OP, someone posted this sample Excel function:

    =IF(ISNUMBER(SEARCH("right tackle",J2)),"rush",
    IF(ISNUMBER(SEARCH("right guard",J2)),"rush",
    IF(ISNUMBER(SEARCH("left guard",J2)),"rush",
    IF(ISNUMBER(SEARCH("up the middle",J2)),"rush",
    IF(ISNUMBER(SEARCH("left tackle",J2)),"rush",
    IF(ISNUMBER(SEARCH("left end",J2)),"rush",
    IF(ISNUMBER(SEARCH("right end",J2)),"rush",
    IF(ISNUMBER(SEARCH("pass",J2)),"pass",
    IF(ISNUMBER(SEARCH("kneel",J2)),"kneel",
    IF(ISNUMBER(SEARCH("punt",J2)),"punt",
    IF(ISNUMBER(SEARCH("kicks",J2)),"kickoff",
    IF(ISNUMBER(SEARCH("extra point",J2)),"extrapoint",
    IF(ISNUMBER(SEARCH("sacked",J2)),"sack",
    IF(ISNUMBER(SEARCH("PENALTY",J2)),"penalty",
    IF(ISNUMBER(SEARCH("field goal",J2)),"fieldgoal",
    IF(ISNUMBER(SEARCH("FUMBLES",J2)),"fumble",
    IF(ISNUMBER(SEARCH("spiked",J2)),"spike",
    IF(ISNUMBER(SEARCH("scrambles",J2)),"rush","rush"))))))))))))))))))

Dear god, at what point do people finally realize that it's worth learning some simple scripting to work with text files?
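
For comparison, a minimal Python sketch of the same classification. It is a direct port of the formula's keyword order, not a better classifier, so it inherits the same weaknesses (plain substring matching, and anything unmatched falls through to "rush").

```python
# (keyword, label) pairs checked in order; the first match wins,
# mirroring the nesting order of the Excel formula.
RULES = [
    ("right tackle", "rush"), ("right guard", "rush"), ("left guard", "rush"),
    ("up the middle", "rush"), ("left tackle", "rush"), ("left end", "rush"),
    ("right end", "rush"), ("pass", "pass"), ("kneel", "kneel"),
    ("punt", "punt"), ("kicks", "kickoff"), ("extra point", "extrapoint"),
    ("sacked", "sack"), ("penalty", "penalty"), ("field goal", "fieldgoal"),
    ("fumbles", "fumble"), ("spiked", "spike"), ("scrambles", "rush"),
]

def classify(description, default="rush"):
    """Return the play type for a description using ordered keyword rules."""
    text = description.lower()  # Excel's SEARCH is case-insensitive
    for keyword, label in RULES:
        if keyword in text:
            return label
    return default

sample = ("(14:20) (Shotgun) K.Collins pass intended for T.Barber INTERCEPTED "
          "by T.Parrish (M.Rumph) at NYG 29. T.Parrish to NYG 23 for 6 yards (T.Barber).")
print(classify(sample))
```

Adding or reordering a rule is a one-line change to `RULES`, which is the main practical advantage over editing eighteen levels of nested IFs.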
[+] HelloMcFly|13 years ago|reply
The Excel function looks ridiculous, but it probably didn't take more than 10 minutes to make, tops. Nested conditionals are easy.

At any rate, what would you recommend most to accomplish the task? I'm learning Python and know R a bit, so I was just wondering how I was going to go about combing through the data.

[+] Groxx|13 years ago|reply
There are limitations to CSV? As long as it's properly escaped, it works as well as any other I'm aware of. What would you suggest, XML?
[+] AlwaysBCoding|13 years ago|reply
God this is such interesting stuff. How do we still not have a fully featured open source NFL stats-rosters-game charting API? Who wouldn't want to contribute to that project?

Other than cool data visualization stuff, the obvious implication is the potential to devise a profitable system to pick games against the spread. The guys at Football Outsiders have done a decent job at it and made a proprietary algorithm that picked games at 58% this year ( which is over the threshold you need to be profitable in Vegas ). But even those guys are still having some trouble getting access to and aggregating the data in a usable format.

I really want to sit down and start playing around with some of this data so I appreciate you putting this together for everyone. The NFL needs an open source API and this is definitely a step in the right direction.

[+] evanjacobs|13 years ago|reply
A bit OT, but I thought this might be a good opportunity to mention the upcoming SportsHackDay in Seattle from Feb 1-3 which culminates in a group viewing of the SuperBowl. http://sportshackday.com/
[+] jredwards|13 years ago|reply
I've used data from Brian Burke's site before. I think it's the exact PBP data the NFL has, but you'll find that the structure and common phrasings change over the years. I had to write a lot of regular expressions and I was still catching edge cases for weeks.

btw, pro-football-reference has pbp data now too, and it probably goes back a lot further, but I think they discourage mass scraping of their site.

[+] p4bl0|13 years ago|reply
As a French person not interested in sports at all, this would have made no sense to me before I watched the TV series The League [1]. Now I kind of enjoy the fact that these stats exist and are available in an open format, even if I don't really care myself.

[1] http://www.imdb.com/title/tt1480684/