Last time I discussed the importance of correctly modeling the game you are interested in if you want to address the problem of data analysis in football. If you are new to the blog, I would suggest reading that post before reading this one. It will give you a good overview of how the analytics are built around here.
For everyone else, a quick refresher. We’re assuming that the correct model of a football offense is shown below.
Yards gained on the field begin as a play called by the offensive play caller. That call filters down to the quarterback’s execution, which in turn filters down to the wide receiver’s execution. Obviously there are important aspects of the offense we don’t model here, most notably the offensive line and any interactions between the stated roles of the model, but those bridges are a substantial distance down the road.
Now that we have our model, we need two things: 1) a question to answer and 2) an analytical tool that can take our question, address the realities of the model, and give us back a number for us to interpret.
I want to answer one of Bill Connelly’s 45 Reasons to Care about College Football Analytics. These are a set of questions Bill created to drive interest in analyzing college football data. Specifically, I want to begin to address question #5, quantifying how important the quarterback is to the offense. Believe it or not, we can address this question with already existing tools. The first bit of technology we need is the Better Box Score I detailed last week. Here is an example of a Better Box Score.
So our question is: how important is the quarterback to the offense? We have our basic model up top and the Better Box Score to help us. Now all we need is the tool. To answer this question we will use intraclass correlations (ICCs). ICCs are analytical tools designed to measure how similar the members of each sub-set of a group are to one another. Similarity among group members can be interpreted as the effect of a common factor higher up the hierarchy, in this case the quarterback.
For example, below you see a scatterplot of the four individuals who a) played quarterback for Utah State in 2014 and b) targeted two different receivers at least five times over the course of the season. Utah State presents a nice example of this as they had so many quarterbacks and no single quarterback ran away with the team’s attempts. Receptions are on the X axis and Targets are on the Y. Each mark represents a different pass receiver targeted by that quarterback, but the same receiver could be marked on this graph multiple times if they were targeted by multiple quarterbacks. Remember that’s not a problem for us because our model says a pass exists as a connection between quarterback and receiver, not as individual performances. Notice the general patterns in the subsets.
First, Craig Harrison is markedly different from the rest of the quarterbacks on this list. His points are concentrated in the lower right hand corner. The second thing that should be noted is how tightly clustered Darell Garretson’s completion percentage is whereas Kent Myers is more spread out. This may be easier to see if we draw ellipses around each sub-set.
See how Darell Garretson’s ellipse is much more squished compared to Kent Myers’s? That means the receivers targeted by Darell Garretson are more similar to one another than the receivers targeted by Kent Myers are. Likely, this indicates that Darell Garretson gets more consistent performances out of his pass receivers. Note that more consistent does not necessarily mean better: one could throw the ball into the dirt on every pass play and still be consistent.
But we want more than just looking at graphs and guessing if they mean anything. We want to quantify whether those ellipses are meaningfully different from one another. The ICC is a good tool to use here as it returns both a null hypothesis significance test and an effect size: the percentage of variance in a statistic that is attributable to a particular “focal person,” in this case the quarterback.
I will be using the terms focal person and partner repeatedly throughout the explanation, so let’s define those terms. A “focal person” is any entity on the higher end of a two-level hierarchy and the “partner” is on the lower level of the hierarchy. So in the hierarchy of our model, the quarterback would be the “focal person” and the receivers would be the “partners.” Note that, in this case, higher up on the hierarchy does not mean “better.” It just means that, when we model the actual game, multiple wide receivers are paired with a single quarterback.
The formula I will use for the ICC is a bit different than what you might find in other sources. Psychologists, most notably David Kenny, have evolved the formula of the ICC so it better matches the questions we care about. The formula we use can be interpreted as an assessment of the similarity of the members of a sub-group. It will tell us whether or not receivers targeted by a particular quarterback are more similar to one another than they are to other random points in our data set. Therefore, using the following formula we can assess the percentage of variance explained by having the ball thrown to you by a particular quarterback.
Where k’ = either the number of partners if group size is fixed or, if group size is variable as it is in our case
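For readers who want the algebra spelled out, a standard ANOVA-based form of this ICC is sketched below. I am assuming this is the version intended here. The symbols g (number of focal persons), n_j (number of partners for focal person j), and N (total number of partners) are my own labels for illustration.

```latex
\mathrm{ICC} = \frac{MS_{B} - MS_{W}}{MS_{B} + (k' - 1)\,MS_{W}},
\qquad
k' = \frac{N^{2} - \sum_{j=1}^{g} n_{j}^{2}}{N\,(g - 1)}
```

When every focal person has the same number of partners n, the expression for k’ reduces to n, so the two cases agree.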
To calculate our ICCs we first need to choose a dependent variable. I will focus on yards gained (rather than the completion percentage that I showed above). As a teaching example, let’s first calculate the ICC for our Utah State quarterbacks. Here are the data we’ll be using.
To calculate the ICC, we first run a univariate ANOVA on yards with quarterback (the focal person) as the independent variable. This returns our between-groups and within-groups variance. In this case those numbers are
This means that on Utah State during the 2014 season, 18.7% of the variance in passing yards gained can be attributed to the quarterback. Now let’s do this same thing for the entire league, but we have one final wrinkle to overcome: we have nested hierarchies, with receivers within quarterbacks within teams.
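As a rough sketch of that calculation, here is one way the ANOVA step and the ICC could be computed in plain Python. The function name `icc_one_way`, the standard unequal-group-size adjustment used for k’, and the toy numbers are all illustrative assumptions; they are not the Utah State data or the exact procedure behind the figures quoted here.

```python
from collections import defaultdict


def icc_one_way(records):
    """Compute a one-way ANOVA ICC from (focal_person, value) pairs.

    Hypothetical helper: assumes the ICC form
    (MS_B - MS_W) / (MS_B + (k' - 1) * MS_W).
    """
    groups = defaultdict(list)
    for focal_person, value in records:
        groups[focal_person].append(float(value))

    sizes = [len(values) for values in groups.values()]
    g = len(groups)          # number of focal persons
    n_total = sum(sizes)     # total number of partners
    if g < 2 or n_total <= g:
        raise ValueError("need at least two groups and more rows than groups")

    grand_mean = sum(sum(values) for values in groups.values()) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in values)

    ms_between = ss_between / (g - 1)
    ms_within = ss_within / (n_total - g)
    # Adjusted group size for unequal groups; equals n when all groups have size n.
    k_prime = (n_total ** 2 - sum(n * n for n in sizes)) / (n_total * (g - 1))
    icc = (ms_between - ms_within) / (ms_between + (k_prime - 1) * ms_within)
    return {
        "ms_between": ms_between,
        "ms_within": ms_within,
        "k_prime": k_prime,
        "f_ratio": ms_between / ms_within,  # F statistic with (g - 1, N - g) df
        "icc": icc,
    }


# Made-up toy yardage for two hypothetical quarterbacks, three receivers each.
toy = [
    ("QB_A", 10), ("QB_A", 12), ("QB_A", 14),
    ("QB_B", 20), ("QB_B", 22), ("QB_B", 24),
]
result = icc_one_way(toy)
print(result)  # icc is 146/158, roughly 0.92, for these toy numbers
```

The toy groups are deliberately far apart and internally tight, which is why the ICC comes out so high; real quarterback data will be much noisier.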
To tease all this nonsense apart, we’re going to start at the very top of our hierarchy. Team will be our focal person and quarterback-receiver connections will be the partners. We need to enter more than one season’s worth of data into this analysis because we need to be sure that every team has at least two of the next level down in the hierarchy, in other words quarterbacks. Because of the NCAA’s eligibility rules, this means we need to have at least six seasons of data to guarantee this criterion is met for every single team. So we have data from 2009-2014 in the data set.
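The "at least two quarterbacks per team" requirement is easy to check before running anything. Below is a minimal sketch, assuming the data can be reduced to (team, quarterback) pairs; the function name and the toy rows are hypothetical.

```python
from collections import defaultdict


def teams_below_quarterback_minimum(rows, minimum=2):
    """Return the teams that have fewer than `minimum` distinct quarterbacks."""
    quarterbacks_by_team = defaultdict(set)
    for team, quarterback in rows:
        quarterbacks_by_team[team].add(quarterback)
    return sorted(
        team
        for team, quarterbacks in quarterbacks_by_team.items()
        if len(quarterbacks) < minimum
    )


# Made-up rows: one (team, quarterback) pair per quarterback-receiver connection.
toy_rows = [
    ("Team X", "QB 1"), ("Team X", "QB 2"), ("Team X", "QB 1"),
    ("Team Y", "QB 3"), ("Team Y", "QB 3"),
]
short_teams = teams_below_quarterback_minimum(toy_rows)
print(short_teams)  # teams that would need more seasons of data
```

An empty list means every team meets the criterion and the team-level analysis can proceed.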
Calculating out the Mean Square (MS) between and MS within (a.k.a. between and within groups variance respectively) gives us the following.
So 3.1% of the variance in yards can be attributed to the team. This would be anything that is common among all receivers and quarterbacks, so things like the offensive system, facilities, average offensive line ability, average relative defense strength played against, etc.
Now we run the same analysis on the same data but now we change the focal person from team to quarterbacks. Running this analysis gets us the following result.
This result tells us that 6.4% of the variance is attributable to…what? Because it’s not directly true that this result explains everything about the quarterback only. Instead it says 6.4% of the variance is attributable to everything that is held in common among the partners, which would be quarterbacks but would also include play callers, facilities, etc. So, we need to do a simple subtraction here to get a pure quarterback metric.
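Written out with the two figures reported above (the subscript labels are mine, for illustration only):

```latex
\mathrm{ICC}_{\text{QB only}}
  = \mathrm{ICC}_{\text{QB level}} - \mathrm{ICC}_{\text{team level}}
  = 6.4\% - 3.1\%
  = 3.3\%
```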
And there’s our answer. Quarterbacks in NCAA FBS football have 3.3% of the variance in passing yards attributed directly to them. I also find it very interesting that knowing who the quarterback on a team is will explain almost exactly as much of the variance in passing yards gained as knowing who the play caller is.