Apologies for the unexpected hiatus last week. I had to fly to Seattle for a wedding. But now I’m back in the Frozen Tundra and ready to tell you about some stats.
I’ve been adjusting my statistical methodology over the last couple of weeks. Nothing has changed about how I calculate the Completions Away from Average (CAA) metric, but I have been changing how I use that metric to predict NFL success. Be advised that this is another post that is heavy on technical details. If you are not a technical reader, the take-home point of this post is that the changes will make my predictions better in the long term, but add more uncertainty in the short term.
Change #1: Dependent Variable
The dependent variable (DV) I’ve been using for predicting is NFL Passer Rating after three years in the league. I chose three years arbitrarily. I thought, “Well…most draft prospects have a decent chance of playing after three years in the league, and we don’t want to get too far away in time from college because we can’t account for growth and coaching, so let’s use three years.” “It was arbitrary” isn’t the best reason to choose a DV, so let’s use something with a little more meaning. I’m going to stick with Passer Rating as it’s a reasonably good per-attempt statistic. The per-attempt part is what I really care about. Yes, there are problems with it, but I think the good outweighs the bad.
But choosing three years as an arbitrary time-period doesn’t make much sense. We don’t have to change this much to make it meaningful, but we should change it. Rather than Passer Rating after three years in the league, I went with Passer Rating after four years in the league. Four years is much more meaningful because it is the length of all rookie contracts under the new CBA. Thus, I’m predicting what a player is most likely to do during his rookie contract. I think that is much more meaningful than what I was doing previously.
Change #2: Prediction Model
Changing the prediction model solves two problems, which is handy. When I have two problems, I like being able to solve them both at the same time. So what are the two problems that need solving?
Problem #1: Adding Data
The typical method of designing a statistical model is you have some theory, you collect data in a way to test that theory, and you use the data you collected to create a mathematical formula that minimizes the errors between your theory and the data. In most cases, my statistical model included, the mathematical formula is fairly simple. We draw a scatterplot of our data, and draw a straight line through the cloud of data points. Then all we need is our 8th grade math skills to solve for the important values of that line, y = mx + b, where m is the slope of the line and b is where the line crosses the vertical axis (a.k.a. intercept). We also know x, in this case Career CAA, for each individual player, so we just solve for y, in this case Passer Rating after four years in the league, and we have our prediction. This process is called regression modeling.
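For readers who like to see the arithmetic, here is a minimal sketch of the straight-line fit described above. It is not the code behind my model, and the CAA and Passer Rating numbers in it are made up purely for illustration. The function name fit_line is my own invention for this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values for illustration only (not real players)
career_caa = [-40.0, -10.0, 0.0, 25.0, 60.0]
passer_rating = [66.0, 70.0, 69.0, 75.0, 79.0]

m, b = fit_line(career_caa, passer_rating)
# Prediction for a hypothetical prospect with a Career CAA of 30
predicted_rating = m * 30.0 + b
print(m, b, predicted_rating)
```

Once m and b are in hand, a prediction is just one multiplication and one addition.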
But then what? New players will always be entering the NFL. Also, some of the players that I have in the model haven’t been in the league for four years, so their data will change. But that’s not a problem, right? We just keep collecting the data and everything should get even better, right?
What I’m about to say might surprise those of you who are not familiar with traditional methods of statistical modeling. Continually adding data and using traditional regression techniques will actually make your predictions worse. Regression models hate adding new data. They want to take the information you give them the first time and do the best they can with it. When you take the same model and feed in the old data plus some new data, the model is more likely to follow blind alleys and tell you that unimportant things might be useful predictors. So the realities of the NFL make traditional modeling techniques less useful.
Problem #2: Distribution of the DV
Traditional regression assumes a normally distributed dependent variable. For most things, this is fine. For Passer Rating this is not fine, as passer rating isn’t normally distributed. In a normal distribution, most cases are in the middle with a few cases on the high and low ends. However, a per-attempt statistic like passer rating isn’t normally distributed because some quarterbacks won’t have many attempts. For example, let’s look at Ryan Mallett this year. He had 4 attempts this season, one of which was intercepted. Now, the passer rating metric just takes what it has and extrapolates from that information. The number that is Ryan Mallett’s 2012 passer rating assumes he would throw an interception every 4 attempts, which we humans know is not true. If he was on the field and made 100 attempts, we wouldn’t expect him to throw 25 interceptions. To our brain, that’s ridiculous. But the math equation doesn’t know this. It’s just taking what we gave it and spitting out a result. So Ryan Mallett’s passer rating is currently 5. Some players, like Mallett, have really poor numbers. Other players, like Kirk Cousins, have really good numbers. We shouldn’t believe either number just yet because they haven’t had enough attempts to give us a good picture of what they can do when they are on the field. This is the nature of the beast in the NFL; some quarterbacks have passer ratings that are extreme compared to others. But regression models don’t like extreme values. Regression models like everything to be nice and normal. Extreme values confuse the model and exert undue influence on the final result.
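To see how jumpy the number is with only a handful of attempts, here is a sketch of the standard NFL passer rating formula. The 4 attempts and 1 interception come from the example above. The 1 completion for 17 yards and 0 touchdowns are my assumption about the rest of the stat line, chosen because they reproduce a rating of about 5. The fifth pass in the second call is entirely hypothetical.

```python
def clamp(value, low=0.0, high=2.375):
    """Each component of the formula is capped between 0 and 2.375."""
    return max(low, min(high, value))


def nfl_passer_rating(attempts, completions, yards, touchdowns, interceptions):
    """Standard NFL passer rating, built from four capped per-attempt components."""
    a = clamp((completions / attempts - 0.3) * 5)        # completion percentage
    b = clamp((yards / attempts - 3) * 0.25)             # yards per attempt
    c = clamp((touchdowns / attempts) * 20)              # touchdown rate
    d = clamp(2.375 - (interceptions / attempts) * 25)   # interception rate
    return (a + b + c + d) / 6 * 100


# Assumed stat line: 1 of 4 for 17 yards, 0 TD, 1 INT
four_attempts = nfl_passer_rating(4, 1, 17, 0, 1)

# Hypothetical: add a single 30-yard touchdown pass as a fifth attempt
five_attempts = nfl_passer_rating(5, 2, 47, 1, 1)

print(round(four_attempts, 1), round(five_attempts, 1))
```

Under these assumptions, one extra throw moves the rating from roughly 5 to roughly 75. That kind of swing is exactly why tiny samples produce extreme values.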
Solution: Bayesian Robust Linear Regression
I won’t say too much about the Bayesian portion of this. When I start talking Bayesian methods in class, my students tend to glaze over like fully fed zombies. Baaaaaayes. Regardless, the Bayesian portion solves Problem #1. Bayesian analyses are built to accept new data in ways that traditional regression models are not.
The Robust Linear Regression part is the really cool part. In this analysis, we can change the assumption about the distribution of our DV. We can, for example, assume our values are distributed as a t-distribution, which means most of the values are in the middle, but extreme values are expected and dealt with easily. So, let’s revise the predictions with this analysis.
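Here is a rough sketch of why the t-distribution assumption helps. To be clear, this is not my Bayesian model. It is a simpler, non-Bayesian stand-in that fits a line by maximum likelihood with t-distributed errors, using iteratively reweighted least squares with the degrees of freedom fixed at an arbitrary value (nu = 3). All names and data are made up for illustration. The point is only that points far from the line get small weights.

```python
def weighted_line(xs, ys, ws):
    """Weighted least-squares line; returns (slope, intercept)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    slope = sxy / sxx
    return slope, my - slope * mx


def t_line(xs, ys, nu=3.0, iterations=50):
    """Line fit assuming Student-t errors with fixed degrees of freedom nu."""
    n = len(xs)
    ws = [1.0] * n
    slope, intercept = weighted_line(xs, ys, ws)  # equal weights = ordinary least squares
    for _ in range(iterations):
        residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
        # Update the squared scale from the current weights and residuals
        scale2 = max(sum(w * r * r for w, r in zip(ws, residuals)) / n, 1e-12)
        # Large residuals (relative to the scale) receive small weights
        ws = [(nu + 1.0) / (nu + r * r / scale2) for r in residuals]
        slope, intercept = weighted_line(xs, ys, ws)
    return slope, intercept


# Made-up data: points near y = 2x + 1, plus one extreme value at the end
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]
ys[-1] += 40.0  # the outlier

ols_slope, _ = weighted_line(xs, ys, [1.0] * len(xs))
robust_slope, _ = t_line(xs, ys)
print(ols_slope, robust_slope)
```

In this toy example the ordinary fit gets dragged toward the outlier, while the t-based fit stays close to the slope of 2 that the other nine points suggest.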
New Model
We will set some very general priors before we run the analysis. We will assume that our DV is distributed as a t-distribution. We will also assume that possible intercepts and slopes are distributed normally. The analysis will work out which degrees of freedom to use by estimating a value called tau. Tau is conceptually similar to, but not quite the same as, degrees of freedom. It’s a measure of how fat the tails are. The closer to 0 this is, the better this analysis will do compared to traditional techniques. If tau is > 30, we shouldn’t see much difference. We assume possible values of tau follow a gamma distribution beginning at 1.
So what do we find? Note, we’re using 4-year NFL passer rating as our DV rather than 3-year as in past analyses.
Tau
The first question is whether we made a wise choice in assuming our dependent variable is t-distributed rather than normally distributed. Our most credible estimate of tau from these data is 2.56. Remember, the closer to 30 this is, the less this matters. Since the most credible estimate is much, much less than 30, we should improve our predictions substantially with this procedure.
Equation of the Line
We’re still trying to find a line that best predicts our outcomes. Bayesian regression is a little different in that it gives us lots of different lines that are all credible given the data we have.
In the initial estimates, we have a small sadness. 0 remains a credible slope. This is what I meant above about increasing uncertainty in the short term. There is a chance that everything I’ve been talking about is not true and this CAA metric isn’t worth anything. The nice thing about Bayesian analyses is that they put a probability on this chance. Given the data we have, there is about a 6% chance that I’ve been blowing smoke at you this entire time. I’m willing to keep going with that, but you can make your own decisions regarding that.
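For the curious, a probability like that one usually comes from counting. A Bayesian analysis produces a big pile of credible slopes, and the chance that the slope is zero or worse is roughly the share of that pile at or below zero. Here is a minimal sketch of that counting step. The draws below are invented for illustration and are not from my analysis, and the function name is hypothetical.

```python
def share_at_or_below(draws, threshold=0.0):
    """Fraction of posterior draws that are at or below the threshold."""
    return sum(1 for d in draws if d <= threshold) / len(draws)


# Invented slope draws, for illustration only (a real analysis has thousands)
slope_draws = [-0.02, 0.05, 0.09, 0.11, 0.13, 0.14, 0.15, 0.18, 0.21, 0.26]

p_no_effect = share_at_or_below(slope_draws)
print(p_no_effect)
```

With real output you would run the same count over every slope the analysis produced.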
To create the actual equation, I took the most credible intercept and the most credible slope and stuck them in the same equation. Note that I’m still new to Bayesian estimation techniques and I’m not quite sure if this is kosher or not. If I find that it’s not, I will revise the model.
New Equation for Prediction
4-Year NFL Passer Rating = 70.5 + 0.143 * (Career NCAA CAA)
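If you want to plug in numbers yourself, the equation above translates directly into a few lines of code. The intercept and slope are the ones from the equation. The function name and the example CAA values are just placeholders I made up.

```python
INTERCEPT = 70.5   # most credible intercept from the equation above
SLOPE = 0.143      # most credible slope from the equation above


def predict_four_year_rating(career_caa):
    """Predicted 4-year NFL Passer Rating from Career NCAA CAA."""
    return INTERCEPT + SLOPE * career_caa


# Placeholder CAA values, not real prospects
for caa in (-50.0, 0.0, 100.0):
    print(caa, round(predict_four_year_rating(caa), 2))
```

A prospect with a Career CAA of exactly 0 is predicted to land right at the intercept, and every additional completion above average adds 0.143 rating points to the prediction.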
I’ve added a new column to my predictions for the 2013 draft class. Note that this doesn’t change their relative rankings any, just the estimate of their NFL passer rating four years from now.