Statcasting Expectations

The next level of public baseball data has arrived. MLB Advanced Media’s Statcast made a hyped television debut, although it had made cameos in online replay videos last year. With the system installed in all 30 ballparks to track all movement on the field, hopes are high for discovering many things about the game via data that previously could only be imprecisely discerned by watching a lot of baseball.

However, while MLBAM has stated that Statcast data will be made public, it is still unclear what types of data and how much of it will be available for public use. Bits and pieces of the data have slowly appeared as the 2015 season started. Among the first pieces have been the velocity and angle of the ball off the bat, which savvy scrapers of the Gameday files, such as Daren Willman of Baseball Savant fame, have captured and published. But whether the public will have access to the raw data remains to be seen.

It seems unlikely to me that there will be public access to the raw Statcast data anytime soon. The first challenge is the sheer size of the data set, which is already measured in petabytes. This is unlike the pitchF/X data, which can be scraped and saved on a home PC. Raw Statcast data is best stored on a cloud server. While MLBAM is certainly using “the cloud” as the method for allowing the 30 teams to access the data, it would be a massive security risk to open that server up to the public domain. Setting up a public server would be an additional cost, and it’s hard to argue that there would be any significant return on that investment for MLBAM. However, Statcast is already sponsored by Amazon Web Services, so the possibility is there for the raw data to be made public via the AWS platform. That possibility seems very remote at this time.

A more likely scenario (at least in my mind) for the release of Statcast data is something like what the NBA did with its SportVU data. SportVU, the player tracking system developed by a subsidiary company of STATS, Inc., is akin to Statcast in that it tracks player and ball movement. The Stats section of the NBA’s website shows various measures and animations gleaned from the SportVU data, but does not provide fans access to the raw data. This is the path I expect MLBAM to take. The batted ball data that has already shown up in Gameday is like this, and many of the other metrics that have been teased via broadcast, such as route efficiency and perceived velocity, could also be distributed in this manner.

Releasing the data in a summarized or snapshot form isn’t as risky to the teams, who were not all that happy when pitchF/X data made its way into the open world. Letting public researchers draw insights from that data and make them available to all teams took away an opportunity to gain a competitive advantage. This is why the other Sportvision products, like hitF/X, which also provided batted ball information, and commandF/X, which tracked where the catcher’s glove was positioned, have been available to teams but not the public.

Regardless of what form the data takes when it is released, Statcast data should enable saberists to use more granular data to show what it takes to succeed in the game of baseball. Some of these data-driven discoveries may merely affirm what scouts and those in the game have been taught and believed for years and decades, but I’m sure some will not. Like many others, I can’t wait to get my hands on it.


Sabermetrically Gaming: Strat-O-Matic Baseball

I happen to be a man of many interests. Besides baseball, one of my other primary interests is gaming, especially tabletop gaming. My interest in games is rooted in my love of competition and my explorations of the world via mathematics and statistics. It’s no coincidence those are traits inherent to following baseball as well.

Today, I’m starting a series exploring various games that attempt to simulate baseball. My focus will be more on the math that underlies each game and how closely it helps replicate the on-field experience, though I’m sure some game play commentary will filter in. Leading off is perhaps the most well-known of the baseball tabletop simulations, Strat-O-Matic Baseball.

Jim Albert and Jay Bennett open their book Curve Ball with a dissection of how various baseball tabletop games model the actual action of a baseball game. Naturally, they cover Strat-O-Matic Baseball, explaining some of the math behind the model and how it assigns credit to the batter, pitcher, and defense. I want to focus more on the game design and the probabilities involved.

The basic mechanics of the game are relatively simple, though there are optional levels of complexity that can be added to the game now that were not a part of the original edition. There are batter cards and pitcher cards, and each card contains a table of possible results that are determined by the roll of 3 six-sided dice. One die, typically white, determines which card and which column the result comes from, with the result corresponding to the sum of the 2 other dice, typically red, in the designated column. In many instances, a result then requires the roll of an additional 20-sided die. This provides 6 × 36 × 20 = 4,320 different possible outcomes.
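To make that count concrete, here is a minimal sketch that enumerates the dice. The source only says the white die picks the card and column; the split of white-die values 1-3 for the batter’s card and 4-6 for the pitcher’s card is an assumption of this sketch, as are all the variable names.

```python
from collections import Counter
from itertools import product

D6 = range(1, 7)
D20 = range(1, 21)

# Every (white, red1, red2) roll of the three six-sided dice.
rolls = list(product(D6, D6, D6))

# Assumption: white die 1-3 reads the batter's card, 4-6 the pitcher's card.
per_card = Counter("batter" if white <= 3 else "pitcher" for white, _, _ in rolls)

# Number of ways to reach each red-dice sum (2 through 12) within one column.
ways_per_sum = Counter(r1 + r2 for r1, r2 in product(D6, D6))

# Pairing every roll with each face of the 20-sided die.
total_with_d20 = len(rolls) * len(D20)

print(len(rolls))        # 216
print(dict(per_card))    # {'batter': 108, 'pitcher': 108}
print(ways_per_sum[7])   # 6
print(total_with_d20)    # 4320
```

The 108 chances per card are why a result listed next to a red-dice sum of 7 is worth 6 chances in its column while a 2 or a 12 is worth only 1.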

[Images: Strat-O-Matic pitcher card and batter card]

Unfortunately, there is no master database of SOM player cards that is available to fully analyze this model. However, a massive Strat-O-Matic Baseball fan by the name of Bruce Bundy put together a bunch of formulas to forecast how a player’s card would be created. My impression is that he created these formulas by looking at a bunch of player card sheets. I’ll use them here because they’re the best publicly available information about the game model that I can find.

Looking at the formulas provides insights into a number of assumptions made about baseball by Strat-O-Matic. Player cards are customized based on their statistics, but this customization is achieved using some assumptions about the probabilities of certain events occurring that are built into the game model.

Consider the old-fashioned base-on-balls, the least sexy of the Three True Outcomes. The Walk formulas for both Batters and Pitchers are adjusted by a constant of 9. In terms of SOM, this means that the batter and pitcher cards are designed with the assumption that 9 out of the 108 results from the other card will result in a walk. Thus, the game implies an unintentional walk occurs about 8.3% of the time in baseball, with the credit being split between the pitcher and the batter. While the latter claim is not possible to investigate prior to pitch-by-pitch data being available, the former is. Here’s the overall major league non-intentional walk rate year-by-year since 1952, using the event logs courtesy of Retrosheet:

[Figure: non-IBB walk rate by year]

You see that for most seasons here, the actual MLB non-intentional walk rate (in red) is slightly less than the estimated rate modeled by SOM (in blue). The average across these seasons is that non-IBB walks occur in 7.81% of the plate appearances, which is about 17/216. Since 17 is an odd number, it can’t be divided equally between the batter and pitcher, a key component of the Strat-O-Matic model. Thus, it seems that the walk rate implied by Bundy’s formulas is reasonable, though a bit high.
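The arithmetic behind that comparison can be sketched in a few lines. The constants come from the text above (9 chances per 108-chance card, a 7.81% observed average); the variable names are mine, and the reading of the even-split requirement as "the total out of 216 must be even" is an assumption of this sketch.

```python
from fractions import Fraction

CHANCES_PER_CARD = 108                      # 3 columns x 36 red-dice combinations
CHANCES_BOTH_CARDS = 2 * CHANCES_PER_CARD   # 216

# Walk rate implied by a constant of 9 chances on the opposing card.
som_rate = Fraction(9, CHANCES_PER_CARD)

# Average non-IBB walk rate quoted above from the Retrosheet event logs.
actual_rate = 0.0781

# How many chances out of 216 the observed rate corresponds to.
exact_chances = actual_rate * CHANCES_BOTH_CARDS
nearest = round(exact_chances)

# Assumed constraint: an equal batter/pitcher split needs an even total.
even_candidates = [n for n in (nearest - 1, nearest, nearest + 1) if n % 2 == 0]

print(f"SOM implied rate: {float(som_rate):.2%}")
print(f"Observed rate as chances out of 216: {exact_chances:.2f}")
for n in even_candidates:
    print(f"{n}/216 = {n / CHANCES_BOTH_CARDS:.2%} ({n // 2} per card)")
```

The even totals on either side of 17 are 16 and 18, and 18/216 is the same rate as the 9-per-card constant in Bundy’s formulas.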

Here are the implied rates from Bundy’s formulas and their actual instance rates from the same Retrosheet data for a few other events:

  • Doubles – SOM rate of 180/4320 = 4.2%, Actual = 4.1%
  • Triples – SOM rate of 30/4320 = 0.7%, Actual = 0.6%
  • HRs – SOM rate of 100/4320 = 2.3%, Actual = 2.3%
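For anyone who wants to check these figures or add more events, here is a small sketch that converts chances out of 4,320 into rates. The chance counts and actual rates are the ones listed above; the function and variable names are hypothetical.

```python
TOTAL_OUTCOMES = 6 * 36 * 20   # white die x red-dice pairs x 20-sided die

# event: (chances out of 4,320 per Bundy's formulas, actual rate in percent)
events = {
    "Doubles": (180, 4.1),
    "Triples": (30, 0.6),
    "HRs": (100, 2.3),
}


def implied_pct(chances, total=TOTAL_OUTCOMES):
    """Convert a count of chances into a percentage of all outcomes."""
    return 100 * chances / total


for name, (chances, actual) in events.items():
    som = implied_pct(chances)
    # Dividing by 20 expresses the same rate as chances out of 216 dice rolls.
    per_216 = chances / 20
    print(f"{name:8s} SOM {som:.1f}%  actual {actual:.1f}%  ({per_216:g} of 216)")
```

Dividing by 20 shows why the 20-sided die matters: triples work out to 1.5 chances out of 216, a rate the three six-sided dice alone could not express.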

This replication of a generic baseball reality is why Strat-O-Matic has been so beloved for over 50 years. Hal Richman, the game’s inventor and mastermind, has created a game model that is flexible enough to work across eras. This enables SOM to sell new sets based on every season and specially designed sets, all of which can be mixed and matched as the gamer sees fit. If you’ve never played the game, find a way to do so at least once.

2015 SABR Analytics Conference Research Awards

Voting closed on President’s Day for this year’s SABR Analytics Conference Research Awards, and like last year, I have taken a great interest in seeing which articles were nominated. Although the voting is closed, I once again am sharing which articles I voted for and runners up in each category.

Contemporary Baseball Analysis: Harry Pavlidis and Dan Brooks, “Framing and Blocking Pitches: A Regressed, Probabilistic Model,” Baseball Prospectus, March 3, 2014.
This category was stacked. I could have reasonably voted for 4 of the 5 articles. But Pavlidis and Brooks managed to stand out above the rest by a hair. Like Max Marchi’s winning article from last year, this is another landmark addition to our statistical understanding of catcher framing, possibly the hottest topic in sabermetric research until the Statcast data sees the public light of day. While Jonathan Judge and this duo have already updated and improved on their work, its import to quantifying catcher framing was without equal in 2014.
Runner up: Jon Roegele, “The Effects of Pitch Sequencing,” The Hardball Times, November 24, 2014.
Pitch sequencing is my current favorite topic in sabermetric research. It’s not quite as popular as catcher framing because sequencing is largely dependent on the pitcher’s arsenal and the techniques needed to study sequencing tend to go beyond basic data mining. Jon’s work is the best on the topic that doesn’t require an understanding of Markov chains and/or the mathematical mechanics of game theory.
The other 2 articles I almost voted for were:

  • Russell Carleton, “N=1,” Baseball Prospectus 2014: The Essential Guide to the 2014 Season, January 2014. Pizza asks what we really know about an individual player, and explores swing rates for individual players using regression. (Yes, I’m one of those who instantly started mouthing GLM, HLM, and MLM at the words “gory math” and “regression” in the article.)
  • Jeff Sullivan, “Alex Gordon Barely Had a Chance,” FanGraphs, October 30, 2014. The best breakdown of the most scrutinized play of this year’s World Series.

Historical Analysis/Commentary: Steve Treder, “The Strikeout Ascendant (and What Should Be Done About It),” The Hardball Times Baseball Annual 2014.
A tough category to pick, but Steve’s breakdown of strikeout eras in baseball history was an exploration reminiscent of the essays Bill James wrote in his 1980s Abstracts. He explores strikeout rates through history, arguing that the increase is part of a natural rise of the power game in baseball, both at the plate and on the mound. Nothing, not even a proposal to lop off the bottom three inches of the strike zone, will change the minds of batters sacrificing discipline for power or pitchers trying to keep that power in check by throwing hard at the expense of in-game longevity.
Runner Up: Bryan Soderholm-Difatte, “The 1914 Stallings Platoon: Assessing Execution, Impact, and Strategic Philosophy,” SABR Baseball Research Journal, Fall 2014.
While platoons aren’t anything new, I always find it interesting when someone looks at a season in the distant past using modern tools. Bryan’s analysis of the 1914 Stallings platoon was well thought out and about as comprehensive as such an analysis is capable of being.

Contemporary Baseball Commentary: Lewie Pollis, “If You Build It: Rethinking the Market for Major League Baseball Front Office Personnel,” Brown University, senior honors thesis, Spring 2014.
Most senior theses don’t make it beyond the adviser’s desk. If you happen to read one, it’s probably because you know the person who wrote it or you were in the person’s graduating class and major while they wrote it. Lewie’s thesis is clearly more public than that. It’s also an extremely articulate breakdown as to why wages for lower-level front office personnel should be higher. It won my vote in a rout.
Runner Up: Eno Sarris, “Learning the Language of the Clubhouse,” The Hardball Times, March 13, 2014.
Eno’s article was full of wonderful anecdotes and personal reflections on speaking the ballplayer’s language. It’s the runner up almost by default, as the other 3 articles rehashed (or completely missed) ideas I have previously seen explored.

Organizing the World’s Sabermetric Research, Part 4 – Designing a Database

Here’s an idea of how much other stuff has gone on in my life: I talked about building a sabermetric research database 11 months ago. Version 1 has yet to be published. Much like my postings here at this blog, the time to work on this project has been sporadic. That inconsistency made designing the database challenging.

While I did pick up a minor in computer science while an undergrad, I’ve been primarily a user, rather than a designer, of databases ever since. I know the basic principles of database design, but designing one with minimal experience from years ago is not easy. So I looked for examples.

I started with the best example of a database I knew of for recording information on various printed and recorded materials: the digital library catalog. I couldn’t get access to the database schema that a real library uses, but did manage to find an example. Granted, this example of an entity-relationship diagram covers only books, but it was a start. It affirmed 3 different base tables that were pretty obvious to me based on what I wanted when I first talked about the design: author, book, and category. The intermediate link tables between book and author and between book and category were something I didn’t have in mind at first, but incorporating those kinds of link tables for the underlying database is actually a key element of a third normal form relational database. The link tables will help with database organization.

I also found the schema for a database that served as an inspiration for this idea. As a statistician with a slight academic bent working in industry, one of my resources is the Current Index to Statistics. While their schema wasn’t displayed in a nice entity-relationship diagram, it is available in code form. Of course, there are many things the CIS is interested in that I am not, but the schema follows the core idea of third normal form: each element of the data needs its own table.

All this matters because I want to make sure I record all the information for the DB with a few passes through the material, and knowing which pieces of information to collect is critical to that process. A few of the elements I want to collect are universal to all of the material types I discussed in Part 2, with a few notes on the columns:

  • Author – first and last name, with a key built using the same logic as the Retrosheet player ID
  • Publisher – name and city. The name could be the key, but I think creating a shortened version of the name will be a better key and make queries easier.
  • Citations – The heart of this project. Just a listing of two publication IDs, one being the piece of research cited in the other. At one point early on, I considered including page numbers, but that seems to be more effort than it’s worth at this point and could be added later.
  • Subject – The subject list needs to be uniform across all media types. The subject table will be like the Citations table, with a publication ID and a column identifying the subject. I’m thinking that the subject list will be coded to help conserve disk space as this database grows. Players and teams can be included as subjects, and I’ll use the same codes as Retrosheet.
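As a rough illustration of the author key, here is a minimal sketch. It assumes the Retrosheet-style pattern of the first four letters of the last name, the first initial, and a three-digit number; the padding of short names with dashes and the simple 001, 002, ... sequence are simplifications of mine rather than Retrosheet’s exact rules, and `make_author_key` is a hypothetical helper.

```python
def make_author_key(first, last, existing):
    """Build a Retrosheet-style key and record it in the set of existing keys.

    Assumed pattern: 4 letters of the last name + first initial + 3-digit
    sequence number. Short names are padded with dashes (an assumption).
    """
    last_letters = "".join(ch for ch in last.lower() if ch.isalpha())
    first_letters = "".join(ch for ch in first.lower() if ch.isalpha())
    stem = last_letters[:4].ljust(4, "-") + (first_letters[:1] or "-")

    # Bump the sequence number until the key is unused.
    seq = 1
    while f"{stem}{seq:03d}" in existing:
        seq += 1

    key = f"{stem}{seq:03d}"
    existing.add(key)
    return key


keys = set()
print(make_author_key("Bill", "James", keys))   # jameb001
print(make_author_key("Bob", "James", keys))    # jameb002 (hypothetical second author)
print(make_author_key("Pete", "Palmer", keys))  # palmp001
```

The sequence number is what keeps two authors with the same stem from colliding, which is the main reason to prefer this over a plain name as the key.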

The other tables are specific to different media types:

  • Book – publication ID, title, author IDs, publisher ID, publication year, ISBN. Books only published electronically will be treated the same as printed books. Publication IDs will start with “b” to denote book. ISBN would be the key for this table if it weren’t for the need for a unified key across the other media types that can’t be identified that way.
  • Article – publication ID, title, author IDs, journal ID, publication date, start page, end page, URL. This should work for both journals and magazines. Publication IDs will start with “a” to denote article. I’m also including URL since so much of what’s in print is migrating to or simultaneously published online nowadays.
  • Journal – journal ID, journal name, publisher ID, domain URL. Magazines are included here as well; journal ID is just so that the field name is distinct from other fields in the database.
  • Presentations – publication ID, title, author IDs, speaker IDs, presentation date, conference ID. I separate out the speaker and the author IDs only because not all authors will present and, in rare instances, someone else presents who didn’t author the presentation. Publication IDs will start with “p”.
  • Conference – conference ID, conference name. I’m not going to list each annual conference separately, as that can be inferred by the presentation date when these two tables are linked. This will just be to identify different conferences and conventions (e.g. SABR, JSM, NESSIS, SaberSeminar, etc.)
  • Web articles – publication ID, title, author IDs, website ID, publication date, URL. The web is a nebulous place, and the article I read today may be different than what I read tomorrow, but reputable online sites will note original publication date and edits if they occur, so I’m not worried about that as an issue. Publication IDs will start with “w”.
  • Websites – website ID, website name, domain URL. Pretty straightforward. I don’t want to combine this with the Journal table so that it uses few columns.
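To show how a few of these tables might fit together, here is a minimal sketch using Python’s built-in sqlite3 module. Every table name, column name, and ID below is hypothetical, only the book media type is included, and the author list is moved into a link table as in the library example rather than stored as a column.

```python
import sqlite3

SCHEMA = """
CREATE TABLE author (
    author_id  TEXT PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL
);
CREATE TABLE publisher (
    publisher_id TEXT PRIMARY KEY,   -- shortened name used as the key
    name         TEXT NOT NULL,
    city         TEXT
);
CREATE TABLE book (
    publication_id   TEXT PRIMARY KEY CHECK (publication_id LIKE 'b%'),
    title            TEXT NOT NULL,
    publisher_id     TEXT REFERENCES publisher(publisher_id),
    publication_year INTEGER,
    isbn             TEXT
);
-- Link table: one row per (publication, author) pair.
CREATE TABLE publication_author (
    publication_id TEXT NOT NULL,
    author_id      TEXT NOT NULL REFERENCES author(author_id),
    PRIMARY KEY (publication_id, author_id)
);
-- Citations: the cited publication appears in the citing publication.
CREATE TABLE citation (
    citing_id TEXT NOT NULL,
    cited_id  TEXT NOT NULL,
    PRIMARY KEY (citing_id, cited_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

# Sample rows; the IDs are made up for illustration.
conn.executemany(
    "INSERT INTO author VALUES (?, ?, ?)",
    [("thorj001", "John", "Thorn"), ("palmp001", "Pete", "Palmer")],
)
conn.execute(
    "INSERT INTO publisher (publisher_id, name) VALUES (?, ?)",
    ("dbl", "Doubleday"),
)
conn.execute(
    "INSERT INTO book (publication_id, title, publisher_id, publication_year) "
    "VALUES (?, ?, ?, ?)",
    ("b0001", "The Hidden Game of Baseball", "dbl", 1984),
)
conn.executemany(
    "INSERT INTO publication_author VALUES (?, ?)",
    [("b0001", "thorj001"), ("b0001", "palmp001")],
)
# A hypothetical article "a0001" that cites the book.
conn.execute("INSERT INTO citation VALUES (?, ?)", ("a0001", "b0001"))

rows = conn.execute(
    "SELECT a.last_name FROM publication_author pa "
    "JOIN author a ON a.author_id = pa.author_id "
    "WHERE pa.publication_id = ? ORDER BY a.last_name",
    ("b0001",),
).fetchall()
authors = [row[0] for row in rows]

cited_by = conn.execute(
    "SELECT COUNT(*) FROM citation WHERE cited_id = ?", ("b0001",)
).fetchone()[0]

print(authors)    # ['Palmer', 'Thorn']
print(cited_by)   # 1
```

The citation table deliberately has no foreign keys in this sketch, since publication IDs would live in several media-specific tables; a single parent table of publication IDs would be one way to enforce them.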

If you’ve stuck with me this far, I’m going to add one last note about how I’m building this database. I’m breaking up my exploration into 3 eras to help with identifying and finding sabermetric research. The first era ends with the publication of The Hidden Game of Baseball. That’s my starting point for this project and that book marks a pretty significant milestone in sabermetric history. I also feel that going backwards in time from 1984 will be of more value to the sabermetric community, and it will allow me to focus on printed material initially. The second era is the period between 1984 and 1996, which is mostly printed material. 1996 is the year Baseball Prospectus was founded, and it serves as a proxy for the start of the explosion of sabermetric research courtesy of the Internet. The Internet era (1996-present day) will be handled last.

Version 1 will hopefully be ready in the next few months, and if it isn’t published by the start of SABR 45 at the end of June, this project will have been abandoned.

SABR Day 2015

It might have been a week later than the official date, but avoiding fan fest date conflicts proved to be a wise decision for the Ken Keltner Badger State and Emil Rothe chapters of the Society for American Baseball Research. 48 baseball fans made their way to and from Kenosha’s “world famous” Brat Stop for what has become an annual Hot Stove tradition.

This year’s meeting opened in the sadness that could only be brought on by the death of a beloved ballplayer. Not only was Ernie Banks’ funeral playing on the TVs as baseball fans arrived, but the meeting also took place on what would have been Mr. Cub’s 84th birthday. Rich Schabowski (dressed in football attire of teams not from the state of Wisconsin or Illinois) opened the meeting and led all in attendance in a moment of silence for the Cubs legend, which ended with a shout of “Let’s Play Two!” That call ended up symbolizing the day: each chapter organized half of the meeting, with a lunch break in between. It really was like playing a doubleheader.

The portion of the meeting organized by the Chicago chapter took the top half of the program. Leading off was guest speaker Ozzie Guillen, Jr. Currently employed as a financial adviser, he worked in the clubhouses while his father coached in Atlanta and Florida and also while Ozzie, Sr. managed the Pale Hose to their first championship in 88 years. Holding high expectations for both of Chicago’s clubs in 2015, he spoke his mind on two issues in baseball today. The first is that the pendulum has swung too far in favor of analytics and sabermetrics within some organizations. The second is that today’s players make too much money, exacerbating the disconnect between the players and the fans and mirroring the current stratification of American society. He then took many questions from the audience, discussing everything from his favorite player as a clubhouse manager (“The best tipper”) and observations on the aforementioned World Series champion 2005 White Sox to pitch counts and broadcasting the team his father managed in Florida.

Batting second was a man whom his boss has called the “Ben Zobrist of Baseball Prospectus”, prospect writer Mauricio Rubio, Jr. With a deep love of baseball inherited from his family and dreams of being a pro scout, Mauricio started working for the fantasy side of BP before his constant pestering finally landed him a chance to write on prospects. With a focus on the Midwest League, he commented on how his writing tends to focus on melding stats with scouting, in line with BP’s brand as a leading sabermetric site. He also remarked about how mechanical analysis has become big with saber-scouts, but cautioned that mechanical analysis might be overemphasized, concurring with some of the commentary on the importance of a prospect’s character from Ozzie Guillen, Jr. The Q+A revealed his typical day at the park starts with a focus on 2 pitchers (typically the starters) and 2 hitters, moving around from the bullpen to behind home plate to a side view for the hitters and a rear view to better analyze arm action.

The final speaker before lunch was Merle Branner, who shared a paper from a leadership course she took as part of her studies in Library and Information Science. The paper examines the leadership dynamic between Branch Rickey and Jackie Robinson using the Servant Leadership model proposed by Robert Greenleaf. She examines all 10 aspects of the model in relation to Rickey’s signing of Robinson and integration of the major leagues. Once she was done, it was undoubtedly time for lunch.

While lunch was delicious (the Cajun bratwurst is highly recommended if you’re ever able to stop at the Brat Stop), there was more baseball to be discussed, and the Badger State portion of the meeting commenced. Jim Nitz told the story of the Milwaukee Chicks, the 1944 champions of the All-American Girls Professional Baseball League made famous by the film A League of Their Own. Their only year in Milwaukee was a turbulent one despite the on-field success. Media coverage for the team was poor in Milwaukee, failing to replicate the success of teams like the Rockford Peaches and leading to multiple nicknames used in the papers (primarily Schnitts and Brewerettes). The Chicks cohabited in Milwaukee’s Borchert Field with the Brewers (the minor league club), leading to a cavernous stadium that was sparsely inhabited. Nitz noted their success was largely due to some fantastic ballplayers like Connie Wisniewski and Hall of Famer Max Carey’s well-regarded management of the team, and also shared anecdotes on each of the players. His Q+A was enhanced by some women who play in an AAGPBL re-enactment league.

Afterwards, it was time to close the silent auction (a Chicago chapter fundraiser) and draw the winner of the 50/50 raffle (a Badger State chapter fundraiser). After attendees claimed their items from the silent auction, a presentation on Ginger Beaumont was up next. Unfortunately, it was at this point when I had to leave, so I can’t comment on the rest of the meeting.

Thank goodness Spring Training was only 2 weeks away.

For those that failed to make it, Emil Rothe chapter secretary David Malamut took photos and even video of the day’s events. The photos can be seen on Twitter @sabrchicago, and links to the videos can be found here

Mr. Cub

Somehow, thanks to my slower-than-a-tortoise pace in getting some research articles written for posting here, this is post #14 for this blog. The previous post focused on the last #14 for Chicago’s South Siders. Yet, for many of the North Side partisans, #14 will always be associated with Ernest Banks.

Needless to say, his death was a surprise.

Perhaps the defining characteristic of a Cubs fan is his or her boundless optimism that, some day, some way, some how, their beloved nine will find a way to win the last game played in October. It comes as no surprise that their most beloved players share this trait, and the moniker “Mr. Cub” was bestowed on the man who radiated that hope each and every day since September 17, 1953.

For someone born well after Ernie Banks stopped playing, most memories of the man come from replays and interactions with him as an ambassador for his beloved Cubs. Perhaps it is fitting, then, that a song at a concert epitomizes the man for me.

I’m a White Sox fan. I still played it twice.

Rest in peace, Ernie.


There once was a baseball. This, however, was not just any old baseball. This baseball had participated on the biggest stage it could. It was hurled by a large man at 97 MPH. It did not make a lot of contact with the refined sticks of ash used by those who attempted to hit it. It never left the infield, until it vanished.

For 3 days it went missing. Many speculated on where it may have disappeared to. How could a ball that significant be unaccounted for? Surely it wasn’t left in a room 925 miles away, soaked by alcohol. Someone had it, for this baseball was too valuable for someone not to have.

Those who watched the baseball’s last known appearance knew in their hearts where the baseball was. Many of them focused on one man, the last known possessor of the ball, a player. And, after 3 days, he made sure the man who paid his wages had that baseball in his hands.

9 years later, that player decided his time to step out of the spotlight had come. He didn’t get, want, or need an elaborate farewell tour. But the man who paid his wages made sure that, when it was time for the player’s team to honor him, all the stops would be pulled out.

And that is how Paul Konerko got a statue in left field, his World Series grand slam baseball, and a retired number from Jerry Reinsdorf.