This post contains excerpts from the code I presented and handed out at the tutorial. The full tutorial expands on my previous 'package spotlight' post on nhlscrapr. This post includes only the bare bones: downloading the raw games, examining the rate of goals scored and shots fired throughout the game, and making a basic player summary.
Also included is a patch to nhlscrapr I wrote that fixes a couple of functions (full.game.database(), player.summary()) that were throwing errors, and adds a function (aggregate.roster.by.name()) that helps match player summaries to the proper names.
You can load the nhlscrapr package and use the patch with the following code:
library(nhlscrapr)
source("Patch for nhlscrapr.r")
After that, you can do things like define seasons beyond 2014-15 to extract. The following line builds a data frame of game IDs that includes one season beyond 2014-15.
fgd = full.game.database(extra.seasons = 1)
This lets you download, process, and compile the games in the 2015-16 season:
thisyear = "20152016"
# Keep only the game IDs for the chosen season
game_ids = subset(fgd, season == thisyear)
# Download the raw game files, then process and compile them
dummy = download.games(games = game_ids, wait = 5)
process.games(games = game_ids, override.download = FALSE)
gc()
compile.all.games(output.file = "NHL play-by-play 2014-5.RData")
After extraction and compiling, we have two files:
# .. the events log
temp = load("source-data\\nhlscrapr-20142015.RData")
ev_all = get(temp)
# .. and the player roster
temp = load("source-data\\nhlscrapr-core.RData")
roster = get(temp)
Analysis Snippet 1: Goals and Shots per minute of play
# First define a minutes variable based on the existing variable 'seconds'
ev_all$minutes = floor((ev_all$seconds - 0.5) / 60)
ev_all$minutes = pmax(ev_all$minutes, 0)
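To see what this binning does at the edges, here is a small sketch on made-up event times (the `seconds` vector below is illustrative, not taken from the real event log). Subtracting half a second means an event logged at exactly 60 seconds still counts toward the first minute, and `pmax()` catches an event at 0 seconds, which would otherwise fall into minute -1.

```r
# Toy event times in seconds from the start of the game (made up for illustration)
seconds = c(0, 1, 60, 61, 1200, 3599, 3600)

# Same binning as in the post: shift by half a second, then floor to whole minutes
minutes = floor((seconds - 0.5) / 60)

# An event at exactly 0 seconds lands in minute -1, so clamp at 0
minutes = pmax(minutes, 0)

print(data.frame(seconds, minutes))
```

With this scheme, regulation play maps to the minute bins 0 through 59, with the final second of the third period (3600) landing in bin 59.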
# Isolate the database to situations that are
# in regulation time and during the regular season
ev_reg = subset(ev_all, period <= 3 & gcode < 30000 & gcode >= 20001)
ev_goals = subset(ev_reg, etype == "GOAL")
ev_shots = subset(ev_reg, etype == "SHOT")
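# Count events in each minute of play, average over the 1230 regular-season games,
# and multiply by 60 to turn a per-minute figure into a per-hour rate
goals_per_hr = as.numeric(table(ev_goals$minutes)) / 1230 * 60
shots_per_hr = as.numeric(table(ev_shots$minutes)) / 1230 * 60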
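The arithmetic is easier to check on a tiny example. The sketch below uses made-up goal times and a hypothetical `n_games` of 4 in place of 1230. It also shows one thing to watch for: `table()` only returns bins that occur at least once, so on sparse data it can come back with fewer than 60 entries. Wrapping the minutes in a factor with explicit levels is one way to keep the empty minutes. With a full season of data every minute is likely to be populated, so this is a precaution rather than a fix.

```r
# Toy data: the minute of play of each goal across n_games games (made up)
n_games = 4
goal_minutes = c(0, 0, 12, 12, 12, 59)

# table() on the raw values only returns minutes that appear at least once
sparse_counts = table(goal_minutes)
print(length(sparse_counts))

# A factor with all 60 regulation minutes as levels keeps the empty bins as zeros
counts = table(factor(goal_minutes, levels = 0:59))

# Goals in that minute per game, scaled to goals per hour of play
goals_per_hr_toy = as.numeric(counts) / n_games * 60

print(length(goals_per_hr_toy))
print(goals_per_hr_toy[13])  # minute 12 is the 13th element
```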
Which produces plots like these..
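The plotting code itself isn't part of these excerpts, so here is a minimal base-graphics sketch of how rate vectors like these could be drawn. The two vectors below are simulated placeholders (names ending in `_sim`) so that the block runs on its own; they are not the real rates, and the shape of the resulting curves means nothing. Swap in goals_per_hr and shots_per_hr to plot the real thing.

```r
# Placeholder rates, simulated only so this sketch runs without the real data
set.seed(1)
goals_per_hr_sim = rpois(60, lambda = 100) / 1230 * 60
shots_per_hr_sim = rpois(60, lambda = 1000) / 1230 * 60

# Minute bins 0..59 on the x axis, one panel per event type
minute_bins = 0:59
par(mfrow = c(1, 2))
plot(minute_bins, goals_per_hr_sim, type = "l",
     xlab = "Minute of play", ylab = "Goals per hour", main = "Goals")
plot(minute_bins, shots_per_hr_sim, type = "l",
     xlab = "Minute of play", ylab = "Shots per hour", main = "Shots")
```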
Analysis Snippet 2: Sedin Summary
If we just want the basic event counts, we can use the roster to find the player IDs for the Sedin twins and see the number of goals, shots, hits, etc. they had in the 2015-16 season.
sedins = subset(roster,last=="SEDIN")$player.id
ev_sedin = subset(ev_all, ev.player.1 %in% sedins)
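Once the events are filtered to a set of player IDs, a cross-tabulation gives the counts per player and event type. Here is a sketch on a made-up event log: the data frame `ev_toy`, the IDs 101, 102, and 203, and the events themselves are all invented for illustration, and only the column names ev.player.1 and etype come from the real data.

```r
# Toy event log with the two columns used above (all values made up)
ev_toy = data.frame(
  ev.player.1 = c(101, 102, 101, 203, 102, 101),
  etype = c("GOAL", "SHOT", "SHOT", "HIT", "HIT", "SHOT"),
  stringsAsFactors = FALSE
)

# Hypothetical player IDs standing in for the ones pulled from the roster
twin_ids = c(101, 102)

# Keep only events where the primary player is one of the two IDs
ev_twins = subset(ev_toy, ev.player.1 %in% twin_ids)

# Rows are player IDs, columns are event types
twin_counts = table(ev_twins$ev.player.1, ev_twins$etype)
print(twin_counts)
```

On the real data the same idea would be table(ev_sedin$ev.player.1, ev_sedin$etype).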
We can also get player summaries for more complex events using the player.summary() function:
roster_name = aggregate.roster.by.name(roster)
ps = player.summary(ev_all, roster_name)
The output of player.summary() is a three-dimensional array of 5 tables, each with one row per player and one column per event type.
The first table counts events for the person that did the event (e.g. scored the goal, got the penalty, made or missed the shot, or blocked the shot).
player_summary = as.data.frame(ps[,,1])
The second table is the second person in the event, if relevant (e.g. 1st assist, victim of the penalty (?), 2nd block (?)).
The third table is the third person in the event, if relevant. (only 2nd assists)
player_summary$ASSIST = ps[,3,2] + ps[,3,3] # Third column is the GOAL event
The fourth table is anyone who was on ice when the event happened and it was their team that was ev.team
The fifth table is anyone who was on ice when the event happened and it was the opposing team that was ev.team
ev.team refers to the team that scored, took the shot, won the faceoff, or received the penalty
player_summary$PLUSMINUS = ps[,3,4] - ps[,3,5]
player_summary$PLUSMINUS_SHOTS = ps[,2,4] - ps[,2,5]
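The triple indexing is easier to follow on a toy array. The sketch below builds a made-up 2 x 3 x 5 array with the same layout I described above: players in rows, event types in columns, and the five roles as the third dimension. The player names, the dimension labels, and every count are invented, and the first column name ("OTHER") is only a placeholder since all we need here is that the second column is shots and the third is goals.

```r
# Toy array: 2 players x 3 event types x 5 roles (all names and counts made up)
ps_toy = array(0, dim = c(2, 3, 5),
               dimnames = list(c("PLAYER A", "PLAYER B"),
                               c("OTHER", "SHOT", "GOAL"),
                               paste0("table", 1:5)))

# Fill in the GOAL column for PLAYER A across the five roles
ps_toy["PLAYER A", "GOAL", 1] = 2   # goals scored
ps_toy["PLAYER A", "GOAL", 2] = 3   # 1st assists
ps_toy["PLAYER A", "GOAL", 3] = 1   # 2nd assists
ps_toy["PLAYER A", "GOAL", 4] = 10  # on ice, own team scored
ps_toy["PLAYER A", "GOAL", 5] = 7   # on ice, opposing team scored

# ps_toy[, , 1] is a plain players-by-events matrix for the first role
first_table = as.data.frame(ps_toy[, , 1])

# Same indexing as in the post: column 3 is GOAL, tables 2 and 3 are the assists
assists = ps_toy[, 3, 2] + ps_toy[, 3, 3]
plusminus = ps_toy[, 3, 4] - ps_toy[, 3, 5]

print(first_table)
print(assists)
print(plusminus)
```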
Finally, we can use information from the roster to fill in the name information
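# Keep only roster entries that appear in the summary
roster_name = subset(roster_name, firstlast %in% rownames(player_summary))
# For each summary row, find the position of the matching roster entry
name_idx = match(row.names(player_summary), roster_name$firstlast)
player_summary$firstlast = roster_name$firstlast[name_idx]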
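The match() step is what lines up two tables that are in different orders. A small sketch with made-up names: `summary_names`, `roster_toy`, and its `pos` column are all hypothetical and are only there to show how the index vector works.

```r
# Row names of a toy summary table, in summary order (made up)
summary_names = c("PLAYER C", "PLAYER A", "PLAYER B")

# A toy roster in a different order, with one extra column to carry over
roster_toy = data.frame(
  firstlast = c("PLAYER A", "PLAYER B", "PLAYER C"),
  pos = c("L", "R", "C"),
  stringsAsFactors = FALSE
)

# For each summary name, the row of roster_toy that holds it (NA if absent)
idx = match(summary_names, roster_toy$firstlast)
print(idx)

# Indexing by idx reorders any roster column into summary order
pos_in_summary_order = roster_toy$pos[idx]
print(pos_in_summary_order)
```

The same index can pull any other roster column into the summary, as long as that column exists in the roster being matched.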
And we can look at the Sedins again for comparison:
# Match on the full name stored in firstlast
subset(player_summary, grepl("SEDIN", firstlast))
Apologies to A.C. Thomas, the author of nhlscrapr, if I'm stepping on your toes with this patch.