In the last post I left off with this linear model of home runs hit per games played:

Because the regression line fits the data so closely, it's hard to say whether home run hitting depended on anything other than the number of games played, i.e., whether the rate of home runs remained constant over the course of the season.

However, a residual plot of this model gives us a better feel for the situation.

This chart takes the actual total HR for each day of the season and shows how far it is from the regression line in the previous chart. If the actual number of HR hit by a given date was more than the linear model predicted, the corresponding mark appears above the 0-line of this residual plot. If the number hit by that date was less than the model predicted, it appears below the 0-line. So what this chart shows is that in the middle part of the season (approximately days 22-57) the actual number of total HRs hit was consistently above the value predicted by the model.
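A residual plot like this falls straight out of the fitted line: subtract each predicted total from the actual total. Here's a minimal sketch in NumPy; the cumulative totals below are made-up placeholder values, not the real 2020 numbers, which would come from the season's game logs.

```python
import numpy as np

# Placeholder data: cumulative games played and cumulative HR by day.
# These are illustrative values only, not the actual 2020 totals.
games = np.array([10, 25, 40, 60, 85, 110, 140, 170, 200])
hr = np.array([24, 66, 108, 160, 225, 288, 362, 430, 505])

slope, intercept = np.polyfit(games, hr, 1)  # least-squares linear fit
predicted = slope * games + intercept
residuals = hr - predicted  # positive = more HR hit than the model predicted

for g, r in zip(games, residuals):
    print(f"games={g:3d}  residual={r:+.2f}")
```

Plotting `residuals` against the day of the season gives exactly the kind of chart described above, with the 0-line as the regression line flattened out.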

I think this chart is very suggestive if we reason the following way.

Assume that the total number of home runs hit is merely a function of the number of games played, i.e., the rate is constant over the season: t = c * n, where t is the total number of home runs, n is the number of games played, and c is some constant (the number of HRs hit per game).

Then, given our data, the linear regression gives us the best estimate for c for the 2020 season, which is the slope of the line in our chart. Any variation from the prediction should be random, so the residuals should be scattered roughly evenly across the 0-line of the residual plot (whether a point falls above or below the line should be random).
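Under that assumption the model has no intercept (zero games means zero home runs), and the least-squares estimate of c has a simple closed form: c = Σ(n·t) / Σ(n²). A sketch, again using placeholder cumulative totals rather than the real 2020 data:

```python
import numpy as np

# Placeholder data: n = cumulative games played, t = cumulative HR.
n = np.array([10.0, 25, 40, 60, 85, 110, 140, 170, 200])
t = np.array([24.0, 66, 108, 160, 225, 288, 362, 430, 505])

# Least-squares slope for a line through the origin: c = sum(n*t) / sum(n^2)
c_hat = (n @ t) / (n @ n)
print(f"estimated HR per game: {c_hat:.3f}")
```

The estimate `c_hat` is the constant HR/game rate that best fits the data under the t = c * n assumption.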

But the distribution of points across the 0-line does not appear to be random: the model overestimated the number of HRs hit during the beginning and ending parts of the season and underestimated how many were hit in the middle.

The tentative conclusion I draw from this is that the rate of HR-hitting *is not actually constant*. Instead, it increases during the middle segment of the season and then decreases at the end.

We can perhaps get a better feel for this if we cut the season into a few segments and compare them. For reference, the average number of HR/game over the entire season was 2.566.

Here's how the HR/Game looked for each month of the season, which began July 23.

Admittedly, the sample size in July was much smaller than in the other months (104 games were played in July, compared to 405 in August and 389 in September). However, this chart does suggest a surge in HR hitting around the middle of the season that pushed the August rate above the season average.
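The monthly rates are just totals divided by game counts. The game counts below are the ones quoted above; the monthly HR totals are hypothetical stand-ins, since the real monthly totals aren't reproduced here.

```python
# Game counts are from the post; the HR totals are made-up placeholders.
games = {"July": 104, "August": 405, "September": 389}
hr = {"July": 255, "August": 1110, "September": 940}  # hypothetical values

rates = {m: hr[m] / games[m] for m in games}
season_rate = sum(hr.values()) / sum(games.values())

for m, r in rates.items():
    print(f"{m:9s} {r:.3f} HR/game")
print(f"Season    {season_rate:.3f} HR/game")
```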

We can look even deeper with a week-by-week breakdown.

Now, the "weeks" I defined here are actually seven-day groups of days on which games were played, so they don't exactly line up with calendar weeks or actual seven-day periods (since there was at least one calendar day on which no games were played). However, they do line up pretty closely if you take July 23 as the first day of the first week of the season and add seven days for each week from there.
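Grouping game days into these seven-day "weeks" amounts to chunking the sorted list of dates on which games were played. A sketch with placeholder dates (the real list would come from the schedule; here I just invent one off day):

```python
from datetime import date, timedelta

# Placeholder: dates on which games were played, starting July 23,
# with one invented off day to mimic the gap mentioned above.
start = date(2020, 7, 23)
game_dates = sorted(start + timedelta(days=d) for d in range(30) if d != 4)

# "Week" k is the k-th group of seven game days (week 0, week 1, ...),
# so the groups drift from calendar weeks wherever there are off days.
weeks = [game_dates[i:i + 7] for i in range(0, len(game_dates), 7)]
print(len(weeks), "groups; week 0 runs", weeks[0][0], "to", weeks[0][-1])
```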

What this chart shows is that around weeks 3 - 6 of the season (2 - 5 on the chart above, since I started at week 0), which would be August 6 - September 2 in real time, an above-average number of HRs were hit per game (about 2.74 HR/game over this span). This supports my hypothesis that there was a "peak period" for home run hitting in the middle part of the season.

But this is more of a hunch than a theory at this point. The next step is to look at some other season-long data and see how it compares...
