Once upon a time, I wrote for the blog Seamheads.com. As that time was short-lived, and Mike Lynch needed the disk space to host other writers, my posts were deleted from their active archive. Fortunately, they have survived thanks to the Internet Archive. Since SABR101x used the Lahman Database, I was reminded of a toy stat I created back in 2008 in a post on Seamheads. The article below is an updated and edited version of what was originally published there. Treat it as something to hold you over until SABR 44 begins in Houston next week.
You might have guessed that today’s discussion will be about predicting a no hitter based on the wealth of numerical data that exists, and you’d be right. So where should we start? Well, let’s start with (who else?) Bill James, who estimated the odds of a no-no by a pitcher in a single game using his career statistics:
Simple probability and logic: the more frequently you get people out without allowing a hit, the better chance you have of getting 27 batters out without allowing a hit. J.C. Bradbury modifies this formula slightly to center it more on the pitcher’s ability and to use the Poisson distribution, but essentially follows the same idea.
Let’s stay within Bill’s framework, but consider a few extra components:
- Defense is typically something that, courtesy of DIPS, tends to be ignored, but it always seems to play a critical role in a no hitter. A statistic like DER can account for the percentage of batted balls that become outs. Let’s use this statistic here. For the estimate of runners reaching on error, I used the approximation of .71 * team errors that Baseball Reference uses.
- Since we’re accounting for defense, we’ll need to know how often the ball gets batted into play where the defense can field it. The denominator from the formula for BABIP is a good start, but we want it from the pitcher’s perspective and expressed as a percentage. Let’s use (BFP-BB-HBP-K-HRA)/BFP. This will be the estimate for BIP%.
- Some balls leave the yard and never come back. Home runs will be accounted for with HR%, which is HRA/BFP.
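- We’ll also need to know what percentage of batters do not get a hit during their turn at bat. Start with everything not covered by the previous two bullet points, which is 1-HR%-BIP%, then add back the balls in play that the defense turns into outs, BIP%*DER. Together that is 1-HR%-BIP%*(1-DER).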
- What about walks and other ways of getting on that aren’t hits? Forget them here, as they are not hits or outs, the only factors that matter in a no hitter. This does ignore the issue of pitching with runners on, which tends to be a detriment to preserving a no hitter.
- The exponential factor is the last component. Bill used 26 as an estimate, assuming that somewhere along the way a caught stealing or double play factored into the mix. What he probably wanted was to find the average number of batters that had an out-producing at bat per 27 outs. Easy to say, harder to show how to calculate. When I first did this analysis in 2008, I used the Retrosheet play-by-play files from 2000-2007, counting the number of plays with an out and dividing by the number of games times 2. This came out to 25.8, close to James’ 26. Let’s keep using 26 since it’s a whole number, and you can’t have 80% of an at bat during a game.
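The rate components above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample totals in the usage note are made up, and the DER estimate uses a simplified form, 1 - (H - HR + .71*E)/(BFP - BB - K - HBP - HR), which is my assumption here and not necessarily the exact Baseball Reference calculation.

```python
def hr_pct(hra, bfp):
    """Share of batters faced who homer: HRA / BFP."""
    return hra / bfp


def bip_pct(bfp, bb, hbp, k, hra):
    """Share of batters faced who put the ball in play: (BFP-BB-HBP-K-HRA)/BFP."""
    return (bfp - bb - hbp - k - hra) / bfp


def estimated_der(h, hr, errors, bfp, bb, k, hbp):
    """Simplified team DER (an assumption, not an official formula).

    Runners reaching on error are approximated as .71 * team errors.
    """
    balls_in_play = bfp - bb - k - hbp - hr
    reached_on_ball_in_play = (h - hr) + 0.71 * errors
    return 1 - reached_on_ball_in_play / balls_in_play
```

With a hypothetical pitcher line of 800 BFP, 50 BB, 5 HBP, 180 K and 20 HRA, `hr_pct(20, 800)` is 0.025 and `bip_pct(800, 50, 5, 180, 20)` is 0.68125. The pitcher rates come from the pitcher’s own line, while the DER inputs are team totals.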
And there you have it. The full formula thus looks something like this:
NoHitOdds = [1-HR%-BIP%*(1-DER)]^26
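As a rough sketch of how the formula behaves, here it is in Python. The function name, the parameter names, and the sample rates in the usage note are hypothetical and do not come from any real pitcher season.

```python
def no_hit_odds(hr_pct, bip_pct, der, out_batters=26):
    """NoHitOdds = [1 - HR% - BIP% * (1 - DER)] ^ 26.

    hr_pct:      home runs allowed per batter faced
    bip_pct:     balls in play per batter faced
    der:         share of balls in play the defense turns into outs
    out_batters: the exponent, 26 in the formula above
    """
    # Chance that a single batter does not get a hit
    no_hit_per_batter = 1 - hr_pct - bip_pct * (1 - der)
    return no_hit_per_batter ** out_batters


# Made-up example: 2.5% HR rate, 68% balls in play, .700 DER
example_odds = no_hit_odds(0.025, 0.68, 0.70)
```

With those made-up rates, the per-batter chance of no hit is 0.771, and raising it to the 26th power leaves a little over 0.1% for a single start. Nudging DER up or BIP% down raises the odds, which is the defensive effect the formula is meant to capture.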
A few notes before I get to the numbers:
- This estimator works best as a season-level estimator. The reason is twofold: DER is most stable over an extended period, and defensive rosters generally change from year to year, especially in the age of free agency.
- When I first wrote this post, it came just before Jon Lester’s no hitter in 2008 against Kansas City. At the time, I wondered whether there was an extra component that could account for predicting these slim odds against a given opponent. The 2008 Royals were noteworthy for being a team that didn’t hit very well, and so even a left-hander in a park that doesn’t like southpaws a lot (thanks Mr. G. Monster) was able to do it. Calculating this would probably include some sort of batting metric and maybe even an “over-aggressiveness” component; the idea being that a poor hitting team with bad plate discipline is more likely to be no-hit. I have not thought much about how to do this yet.
- I think the importance of defense in a no-hitter is well accounted for here, though even I’ll continue to say this metric is far from perfect. Using an aggregated DER doesn’t account for the differences in fielders. Maybe this will become possible with the new technology being tested this year, but we’ll have to see what that data ends up looking like and what will be available for public consumption.
- There are still a few other things not included, such as any kind of park factor or a way of accounting for the difficulty of fielding plays. The latter information, however, is hindered by the lack of fielder positioning data at the moment. We won’t be able to get that data historically, but it is coming, maybe even as soon as 2015.
As Bill James did, we can multiply these odds by the number of starts in a season and add those totals up for all seasons a player pitched to get an expected number of no hitters. The top 10 since 1920 are, for the most part, who you’d expect:
|nameFirst||nameLast||Sum of ExpNoHit|