Science of The Super Bowl

A couple of days ago, I participated in a Science of the Super Bowl panel discussion organized by Newswise. I was asked to give a 5-minute overview (which turned into about 10 minutes), so I focused on answering three questions.

  • What is data science?
  • How is data science used in the NFL?
  • How might data science affect the outcome of the Super Bowl?

For my talk (around the 21:45 mark), I made some visuals, so I thought I would recreate one here and include the R code. Here is the finished product; the code to reproduce it is below. Here is a higher-resolution copy that is actually readable.

9 thoughts on “Science of The Super Bowl”

  • Hi,
    Interesting output.
    However, I am getting a problem running the code:
    game_list <- lapply(all_games, game_play_by_play)
    Error in `[<-.POSIXlt`(`*tmp*`, not_same, value = c(1486196100, 1486196100, :
    NAs are not allowed in subscripted assignments

    The first problem seems to be in the 17th game, as
    game_play_by_play("2016091500") produces the same error.

    • Yep, that’s a pasting oversight on my part. The playoff vector has the IDs for the Super Bowl first and the Pro Bowl second. Since the Super Bowl doesn’t have any data yet, it will error out. Well, at least until tomorrow night.
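
      For anyone who hits this before the data is posted, one possible workaround is to wrap each call in tryCatch so that a game without data is skipped instead of stopping the whole lapply. This is only a minimal sketch: the helper name safe_scrape, the stand-in scraper, and the placeholder ID "0000000000" are hypothetical and not part of nflscrapR.

      ```r
      # Hypothetical helper: apply a scraping function to each game ID and
      # drop any ID whose call raises an error.
      safe_scrape <- function(game_ids, scrape_fun) {
        results <- lapply(game_ids, function(id) {
          tryCatch(
            scrape_fun(id),
            error = function(e) {
              message("Skipping game ", id, ": ", conditionMessage(e))
              NULL  # mark the failed game so it can be filtered out below
            }
          )
        })
        names(results) <- game_ids
        Filter(Negate(is.null), results)
      }

      # Stand-in for game_play_by_play, for illustration only: it fails on a
      # made-up placeholder ID to mimic a game that has no data yet.
      fake_scraper <- function(id) {
        if (id == "0000000000") stop("no play-by-play data yet")
        data.frame(GameID = id, play = 1:2, stringsAsFactors = FALSE)
      }

      game_list <- safe_scrape(c("2016091500", "0000000000"), fake_scraper)
      ```

      With the real package, the same idea would presumably be safe_scrape(all_games, game_play_by_play), though I have not checked how that function signals missing data.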

  • Thanks for getting back to me. Has the data been updated? I’m still getting the same error, and I don’t think it was just the one game. That one just happened to be the first.

    • Andrew, I just reran the code and was not able to recreate the error. One potential difference could be that I downloaded the source package from GitHub (https://github.com/maksimhorowitz/nflscrapR) and not from CRAN. In the version I downloaded, I received an error when running the game_play_by_play function on certain games. The error you are getting is not the same error I received from the GitHub issue, but the version difference could be behind both. I was able to get everything working by wrapping the input string for all occurrences of stringr::str_extract_all in unlist within the game_play_by_play function. For example, lines 314:315 in (https://github.com/maksimhorowitz/nflscrapR/blob/master/R/PlayByPlayBoxScore.R) are currently
      two.point.result2 <- stringr::str_extract_all(PBP$desc[two.point.result.ind], pattern = "ATTEMPT FAILS|SUCCEEDS")
      I changed that to
      two.point.result2 <- stringr::str_extract_all(unlist(PBP$desc[two.point.result.ind]), pattern = "ATTEMPT FAILS|SUCCEEDS")
      I have been meaning to send a pull request, but I have not had a chance yet. Thanks for reading, by the way. I appreciate it. I also enjoy your site. You post great stuff too.

  • Jesse, thanks for the GitHub suggestion. That update seemed to work as far as getting the data scraped without error, and I was able to replicate your work. However, I noticed that there are additional rows for some of the players if you do not limit it by team. For instance, D.Freeman has 3 plays as Rusher with his posteam as DEN on 2016-10-09, when they played Atlanta. They all appear to be related to a shotgun play. I’m afraid football isn’t my sport, but I would probably want to exclude this data if, say, I was building a Shiny app and creating a chart based on a player name? Can you help me here? Cheers

    • Andrew,

      First, my apologies for not replying sooner. I’m sure at this point (after the NFL season) the utility of my reply is greatly diminished, but reply I shall.

      You are exactly right: you would want to exclude those, and it is here that we run into the inescapable problem of trying to use data outside the purposes it was collected for. Since this data is used by the NFL for displaying game stats on their site, there are some conventions (and, I’m sure, unavoidably some errors) that seem strange from an analysis point of view. One particular issue that I tried to look into was finding a unique ID for each player so we wouldn’t have to rely on “D. Freeman”, which of course is not guaranteed to be unique. For that matter, player-team is not guaranteed to be unique either, since players quite often get traded midseason. I did not find what I considered to be a reliable source of unique IDs for players.
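
      In the absence of a unique ID, one rough way to exclude the stray rows is to filter on both the player name and the team(s) the player is known to have played for. The sketch below uses a made-up toy data frame; only the Rusher and posteam column names come from the discussion above, and the helper player_plays is hypothetical.

      ```r
      # Toy play-by-play rows with made-up values, for illustration only.
      plays <- data.frame(
        Rusher  = c("D.Freeman", "D.Freeman", "D.Freeman", "X.Example"),
        posteam = c("ATL", "ATL", "DEN", "DEN"),
        stringsAsFactors = FALSE
      )

      # Hypothetical helper: keep a player's rushing rows only when the
      # possessing team is one of the supplied teams.
      player_plays <- function(pbp, player, teams) {
        keep <- !is.na(pbp$Rusher) &
          pbp$Rusher == player &
          pbp$posteam %in% teams
        pbp[keep, ]
      }

      freeman_atl <- player_plays(plays, "D.Freeman", "ATL")
      ```

      Passing teams as a vector leaves room for a player who was traded midseason, but it still cannot separate two players who share both a name and a team.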

      If you are still working on using the package, let me know and we can see if there is something we can figure out.

      Jesse
