2015 SABR Analytics Conference Research Awards

Voting closed on President’s Day for this year’s SABR Analytics Conference Research Awards, and like last year, I have taken a great interest in seeing which articles were nominated. Although the voting is closed, I once again am sharing which articles I voted for and runners up in each category.

Contemporary Baseball Analysis: Harry Pavlidis and Dan Brooks, “Framing and Blocking Pitches: A Regressed, Probabilistic Model,” Baseball Prospectus, March 3, 2014.
This category was stacked. I could have reasonably voted for 4 of the 5 articles. But Pavlidis and Brooks managed to stand out above the rest by a hair. Like Max Marchi’s winning article from last year, this is another landmark addition to our statistical understanding of catcher framing, possibly the hottest topic in sabermetric research until the StatCast data sees the public light of day. While Jonathan Judge and this duo have already updated and improved on their work, its import to quantifying catcher framing was without equal in 2014.
Runner up: Jon Roegele, “The Effects of Pitch Sequencing,” The Hardball Times, November 24, 2014.
Pitch sequencing is my current favorite topic in sabermetric research. It’s not quite as popular as catcher framing because sequencing is largely dependent on the pitcher’s arsenal and the techniques needed to study sequencing tend to go beyond basic data mining. Jon’s work is the best on the topic that doesn’t require an understanding of Markov chains and/or the mathematical mechanics of game theory.
The other 2 articles I almost voted for were:

  • Russell Carleton, “N=1,” Baseball Prospectus 2014: The Essential Guide to the 2014 Season, January 2014. Pizza asks what we really know about an individual player, and explores swing rates for individual players using regression. (Yes, I’m one of those who instantly started mouthing GLM, HLM, and MLM at the words “gory math” and “regression” in the article.)
  • Jeff Sullivan, “Alex Gordon Barely Had a Chance,” FanGraphs, October 30, 2014. The best breakdown of the most scrutinized play of this year’s World Series.

Historical Analysis/Commentary: Steve Treder, “The Strikeout Ascendant (and What Should Be Done About It),” The Hardball Times Baseball Annual 2014.
A tough category to pick, but Steve’s breakdown of strikeout eras in baseball history was an exploration reminiscent of the essays Bill James wrote in his 1980s Abstracts. He explores strikeout rates through history, arguing that the increase is part of a natural rise of the power game in baseball, both at the plate and on the mound. Nothing, not even a proposal to lop off the bottom three inches of the strike zone, will change the minds of batters sacrificing discipline for power or pitchers trying to keep that power in check by throwing hard at the expense of in-game longevity.
Runner Up: Bryan Soderholm-Difatte, “The 1914 Stallings Platoon: Assessing Execution, Impact, and Strategic Philosophy,” SABR Baseball Research Journal, Fall 2014.
While platoons aren’t anything new, I always find it interesting when someone looks at a season in the distant past using modern tools. Bryan’s analysis of the 1914 Stallings platoon was well thought out and about as comprehensive as such an analysis can be.

Contemporary Baseball Commentary: Lewie Pollis, “If You Build It: Rethinking the Market for Major League Baseball Front Office Personnel,” Brown University, senior honors thesis, Spring 2014.
Most senior theses don’t make it beyond the adviser’s desk. If you happen to read one, it’s probably because you know the person who wrote it or you were in the person’s graduating class and major while they wrote it. Lewie’s thesis is clearly more public than that. It’s also an extremely articulate breakdown as to why wages for lower-level front office personnel should be higher. It won my vote in a rout.
Runner Up: Eno Sarris, “Learning the Language of the Clubhouse,” The Hardball Times, March 13, 2014.
Eno’s article was full of wonderful anecdotes and personal reflections on speaking the ballplayer’s language. It’s the runner up almost by default, as the other 3 articles rehashed (or completely missed) ideas I have previously seen explored.


Organizing the World’s Sabermetric Research, Part 4 – Designing a Database

Here’s an idea of how much other stuff has gone on in my life: I talked about building a sabermetric research database 11 months ago. Version 1 has yet to be published. Much like my postings here at this blog, the time to work on this project has been sporadic. That inconsistency made designing the database challenging.

While I did pick up a minor in computer science while an undergrad, I’ve been primarily a user, rather than a designer, of databases ever since. I know the basic principles of database design, but designing one with minimal experience from years ago is not easy. So I looked for examples.

I started with the best example of a database I knew of for recording information on various printed and recorded materials: the digital library catalog. I couldn’t get access to the database schema that a real library uses, but did manage to find an example. Granted, this example of an entity-relationship diagram covers only books, but it was a start. It affirmed 3 different base tables that were pretty obvious to me based on what I wanted when I first talked about the design: author, book, and category. The intermediate link tables between book and author and between book and category were something I didn’t have in mind at first, but incorporating those kinds of link tables into the underlying database is actually a key element of a third normal form relational database. The link tables will help with database organization.

I also found the schema for a database that served as an inspiration for this idea. As a statistician with a slight academic bent working in industry, one of my resources is the Current Index of Statistics. While their schema wasn’t displayed in a nice entity-relationship diagram, it is available in code form. Of course, there are many things the CIS is interested in that I am not, but the schema follows the core idea of third normal form: each element of the data needs its own table.

All this matters because I want to make sure I record all the information for the DB with a few passes through the material, and knowing which pieces of information to collect is critical to that process. A few of the elements I want to collect are universal to all of the material types I discussed in Part 2, with a few notes on the columns:

  • Author – first and last name, with a key built using the same logic as the Retrosheet player ID
  • Publisher – name and city. The name could be the key, but I think creating a shortened version of the name will be a better key and make queries easier.
  • Citations – The heart of this project. Just a listing of two publication IDs, one being the piece of research cited in the other. At one point early on, I considered including page numbers, but that seems to be more effort than it’s worth at this point and could be added later.
  • Subject – The subject list needs to be uniform across all media types. The subject table will be like the Citations table, with a publication ID and a column identifying the subject. I’m thinking that the subject list will be coded to help conserve disk space as this database grows. Players and teams can be included as subjects, and I’ll use the same codes as Retrosheet.
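As a rough sketch of the author key idea, here is one way it could look in Python. It assumes the familiar Retrosheet pattern of the first four letters of the last name, the first initial, and a three-digit number. The function name, the "-" padding for short last names, and the simple counter for the number are assumptions made for illustration, not Retrosheet's actual rules.

```python
import re
import unicodedata


def author_key(first: str, last: str, taken: set[str]) -> str:
    """Build a Retrosheet-style key: 4 letters of last name + first initial + 3 digits."""

    def letters(name: str) -> str:
        # Strip accents, lowercase, and keep only the letters a-z.
        plain = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
        return re.sub(r"[^a-z]", "", plain.lower())

    # Assumption: short last names are padded with "-" to four characters.
    stem = letters(last)[:4].ljust(4, "-") + letters(first)[:1]

    # Assumption: the numeric suffix is a plain counter starting at 001.
    n = 1
    while f"{stem}{n:03d}" in taken:
        n += 1
    key = f"{stem}{n:03d}"
    taken.add(key)  # remember the key so the next collision gets a new number
    return key
```

Passing the same `taken` set on every call keeps two authors with similar names from ending up with the same key.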

The other tables are specific to different media types:

  • Book – publication ID, title, author IDs, publisher ID, publication year, ISBN. Books only published electronically will be treated the same as printed books. Publication IDs will start with “b” to denote book. ISBN would be the key for this table if it weren’t for the need for a unified key across the other media types that can’t be identified that way.
  • Article – publication ID, title, author IDs, journal ID, publication date, start page, end page, URL. This should work for both journals and magazines. Publication IDs will start with “a” to denote article. I’m also including URL since so much of what’s in print is migrating to or simultaneously published online nowadays.
  • Journal – journal ID, journal name, publisher ID, domain URL. Magazines are included here as well; journal ID is just so that the field name is distinct from other fields in the database.
  • Presentations – publication ID, title, author IDs, speaker IDs, presentation date, conference ID. I separate out the speaker and the author IDs only because not all authors will present and, in rare instances, someone else presents who didn’t author the presentation. Publication IDs will start with “p”.
  • Conference – conference ID, conference name. I’m not going to list each annual conference separately, as that can be inferred by the presentation date when these two tables are linked. This will just be to identify different conferences and conventions (e.g. SABR, JSM, NESSIS, SaberSeminar, etc.)
  • Web articles – publication ID, title, author IDs, website ID, publication date, URL. The web is a nebulous place, and the article I read today may be different than what I read tomorrow, but reputable online sites will note original publication date and edits if they occur, so I’m not worried about that as an issue. Publication IDs will start with “w”.
  • Websites – website ID, website name, domain URL. Pretty straightforward. I don’t want to combine this with the Journal table so that it uses few columns.
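To make the link-table and citation ideas above a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the final design: the table and column names, the single shared publication table standing in for the per-media tables, and the two demo records are all made up for illustration.

```python
import sqlite3

# Table and column names are illustrative, not a finished design.
SCHEMA = """
CREATE TABLE author (
    author_id  TEXT PRIMARY KEY,      -- Retrosheet-style key
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL
);
CREATE TABLE publication (
    publication_id TEXT PRIMARY KEY,  -- first letter marks the type: b, a, p, w
    title          TEXT NOT NULL
);
-- Link table: one row per (publication, author) pair.
CREATE TABLE publication_author (
    publication_id TEXT NOT NULL REFERENCES publication(publication_id),
    author_id      TEXT NOT NULL REFERENCES author(author_id),
    PRIMARY KEY (publication_id, author_id)
);
-- Citations: the work in citing_id cites the work in cited_id.
CREATE TABLE citation (
    citing_id TEXT NOT NULL REFERENCES publication(publication_id),
    cited_id  TEXT NOT NULL REFERENCES publication(publication_id),
    PRIMARY KEY (citing_id, cited_id)
);
"""


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database holding two made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO author VALUES ('authe001', 'Example', 'Author')")
    conn.executemany(
        "INSERT INTO publication VALUES (?, ?)",
        [("b0001", "Example Book"), ("a0001", "Example Article")],
    )
    conn.executemany(
        "INSERT INTO publication_author VALUES (?, ?)",
        [("b0001", "authe001"), ("a0001", "authe001")],
    )
    # The made-up article cites the made-up book.
    conn.execute("INSERT INTO citation VALUES ('a0001', 'b0001')")
    return conn


def cited_by(conn: sqlite3.Connection, publication_id: str) -> list[str]:
    """Return the titles of the works that cite the given publication."""
    rows = conn.execute(
        """
        SELECT p.title
        FROM citation AS c
        JOIN publication AS p ON p.publication_id = c.citing_id
        WHERE c.cited_id = ?
        ORDER BY p.title
        """,
        (publication_id,),
    ).fetchall()
    return [title for (title,) in rows]
```

Calling `cited_by(build_demo(), "b0001")` walks the citation table in the "who cites this?" direction, which is the kind of query the citation link is meant to support. Keeping authors in a link table rather than as a list inside each record means a work with five authors is just five rows.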

If you’ve stuck with me this far, I’m going to add one last note about how I’m building this database. I’m breaking up my exploration into 3 eras to help with identifying and finding sabermetric research. The first era ends with the publication of The Hidden Game of Baseball. That’s my starting point for this project and that book marks a pretty significant milestone in sabermetric history. I also feel that going backwards in time from 1984 will be of more value to the sabermetric community, and it will allow me to focus on printed material initially. The second era is the period between 1984 and 1996, which is mostly printed material. 1996 is the year Baseball Prospectus was founded, and it serves as a proxy for the start of the explosion of sabermetric research courtesy of the Internet. The Internet era (1996-present day) will be handled last.

Version 1 will hopefully be ready in the next few months, and if it isn’t published by the start of SABR 45 at the end of June, this project will have been abandoned.

SABR Day 2015

It might have been a week later than the official date, but avoiding fan fest date conflicts proved to be a wise decision for the Ken Keltner Badger State and Emil Rothe chapters of the Society for American Baseball Research. 48 baseball fans made their way to and from Kenosha’s “world famous” Brat Stop for what has become an annual Hot Stove tradition.

This year’s meeting opened in the sadness that could only be brought on by the death of a beloved ballplayer. Not only was Ernie Banks’ funeral playing on the TVs as baseball fans arrived, but the meeting also took place on what would have been Mr. Cub’s 84th birthday. Rich Schabowski (dressed in football attire of teams not from the state of Wisconsin or Illinois) opened the meeting and led all in attendance in a moment of silence for the Cubs legend, which ended with a shout of “Let’s Play Two!” That call ended up symbolizing the day: each chapter organized half of the meeting, with a lunch break in between. It really was like playing a doubleheader.

The portion of the meeting organized by the Chicago chapter took the top half of the program. Leading off was guest speaker Ozzie Guillen, Jr. Currently employed as a financial adviser, he worked in the clubhouses while his father coached in Atlanta and Florida and also while Ozzie, Sr. managed the Pale Hose to their first championship in 88 years. Holding high expectations for both of Chicago’s clubs in 2015, he spoke his mind on two issues in baseball today. The first is that the pendulum has swung too far in favor of analytics and sabermetrics within some organizations. The second is that today’s players make too much money, exacerbating the disconnect between the players and the fans and mirroring the current stratification of American society. He then took many questions from the audience, discussing everything from his favorite player as a clubhouse manager (“The best tipper”) and observations on the aforementioned World Series champion 2005 White Sox to pitch counts and broadcasting the team his father managed in Florida.

Batting second was a man whom his boss has called the “Ben Zobrist of Baseball Prospectus”, prospect writer Mauricio Rubio, Jr. With a deep love of baseball inherited from his family and dreams of being a pro scout, Mauricio started working for the fantasy side of BP before his constant pestering finally landed him a chance to write on prospects. With a focus on the Midwest League, he commented on how his writing tends to focus on melding stats with scouting, in line with BP’s brand as a leading sabermetric site. He also remarked about how mechanical analysis has become big with saber-scouts, but cautioned that mechanical analysis might be overemphasized, concurring with some of the commentary on the importance of a prospect’s character from Ozzie Guillen, Jr. The Q+A revealed his typical day at the park starts with a focus on 2 pitchers (typically the starters) and 2 hitters, moving around from the bullpen to behind home plate to a side view for the hitters and a rear view to better analyze arm action.

The final speaker before lunch was Merle Branner, who shared a paper from a leadership course she took as part of her studies in Library and Information Science. The paper examines the leadership dynamic between Branch Rickey and Jackie Robinson using the Servant Leadership model proposed by Robert Greenleaf. She examines all 10 aspects of the model in relation to Rickey’s signing of Robinson and integration of the major leagues. Once she was done, it was undoubtedly time for lunch.

While lunch was delicious (the cajun bratwurst is highly recommended if you’re ever able to stop at the Brat Stop), there was more baseball to be discussed, and the Badger State portion of the meeting commenced. Jim Nitz told the story of the Milwaukee Chicks, the 1944 champions of the All American Girls Professional Baseball League made famous by the film A League of Their Own. Their only year in Milwaukee was a turbulent one despite the on-field success. Media coverage for the team was poor in Milwaukee, failing to replicate the success of teams like the Rockford Peaches and leading to multiple nicknames used in the papers (primarily Schnitts and Brewerettes). The Chicks cohabited in Milwaukee’s Borchert Field with the Brewers (the minor league club), leading to a cavernous stadium that was sparsely inhabited. Nitz noted their success was largely due to some fantastic ballplayers like Connie Wisniewski and Hall of Famer Max Carey’s well-regarded management of the team, and also shared anecdotes on each of the players. His Q+A was enhanced by some women who play in an AAGPBL re-enactment league.

Afterwards, it was time to close the silent auction (a Chicago chapter fundraiser) and draw the winner of the 50/50 raffle (a Badger State chapter fundraiser). Once the silent auction items had been claimed, a presentation on Ginger Beaumont was up next. Unfortunately, it was at this point when I had to leave, so I can’t comment on the rest of the meeting.

Thank goodness Spring Training was only 2 weeks away.

For those that failed to make it, Emil Rothe chapter secretary David Malamut took photos and even video of the day’s events. The photos can be seen on Twitter @sabrchicago, and links to the videos can be found here

Mr. Cub

Somehow, thanks to my slower-than-a-tortoise pace in getting some research articles written for posting here, this is post #14 for this blog. The previous post focused on the last #14 for Chicago’s South Siders. Yet, for many of the North Side partisans, #14 will always be associated with Ernest Banks.

Needless to say, his death was a surprise.

Perhaps the defining characteristic of a Cubs fan is his or her boundless optimism that, some day, some way, some how, their beloved nine will find a way to win the last game played in October. It comes as no surprise that their most beloved players share this trait, and the moniker “Mr. Cub” was bestowed on the man who radiated that hope each and every day since September 17, 1953.

For someone born well after Ernie Banks stopped playing, most memories of the man come from replays and interactions with him as an ambassador for his beloved Cubs. Perhaps it is fitting, then, that a song at a concert epitomizes the man for me.

I’m a White Sox fan. I still played it twice.

Rest in peace, Ernie.


There once was a baseball. This, however, was not just any old baseball. This baseball had participated on the biggest stage it could. It was hurled by a large man at 97 MPH. It did not make a lot of contact with the refined sticks of ash used by those who attempted to hit it. It never left the infield, until it vanished.

For 3 days it went missing. Many speculated on where it may have disappeared to. How could a ball that significant be unaccounted for? Surely it wasn’t left in a room 925 miles away, soaked by alcohol. Someone had it, for this baseball was too valuable for someone not to have.

Those who watched the baseball’s last known appearance knew in their hearts where the baseball was. Many of them focused on one man, the last known possessor of the ball, a player. And, after 3 days, he made sure the man who paid his wages had that baseball in his hands.

9 years later, that player decided his time to step out of the spotlight had come. He didn’t get, want, or need an elaborate farewell tour. But the man who paid his wages made sure that, when it was time for the player’s team to honor him, all the stops would be pulled out.

And that is how Paul Konerko got a statue in left field, his World Series grand slam baseball, and a retired number from Jerry Reinsdorf.

Rebalancing the Schedule

Ed. note – First post in a long time due to many a thing happening in my personal life. Thanks for coming back!

One of the more challenging aspects of modern sabermetrics is the unbalanced schedule. This imbalance began in 1997 with the introduction of Interleague play during the regular season, and the balance was tilted further when MLB decided to put an additional focus on divisional play and have 19 games a year between teams within each division. An additional wrinkle was added last year when the Astros, as a condition of their sale to Jim Crane, switched leagues and caused the AL and NL to have an odd number of teams.

Let’s try to rebalance the schedule. I’m going to make a few assumptions:

  • The teams will stay in their leagues
  • The possibility of expansion or contraction of the leagues will be ignored
  • The schedule will remain at 162 games
  • No 2 game “series” will be allowed

It’s fairly simple to see that, under these restrictions, a truly balanced schedule across both leagues is impossible. 162 is not divisible by 29, and with 15 teams in each league, interleague play is required. The closest possibilities to a balanced schedule would each violate at least one of my assumptions: 162 divided by 29 is approximately 5.6, so having one team play each other team 5 or 6 times would result in 145 or 174 game schedules, respectively.

The next closest thing to a completely balanced schedule across both leagues is to try and keep the number of games played against each other team as close as possible. Additionally, it makes sense, for both logical and historic reasons, to make sure that a team plays more games within its league than outside of it. Let’s look at a scenario where a team plays each team from the other league 4 times. This leaves 102 games against the other teams in the same league. The schedule could then be completed with an almost balanced intraleague schedule: 7 games against the 10 teams in the other 2 divisions and 8 games against the other 4 teams within the same division.
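The arithmetic in the last two paragraphs is easy to check with a few lines of Python. This is only a sanity check of the numbers above; the function and constant names are made up for illustration.

```python
TEAMS_PER_LEAGUE = 15
DIVISION_SIZE = 5


def season_length(interleague: int, other_divisions: int, same_division: int) -> int:
    """Total games when each argument is the games played per opponent of that type."""
    interleague_opponents = TEAMS_PER_LEAGUE                       # 15 teams in the other league
    same_division_opponents = DIVISION_SIZE - 1                    # 4 division rivals
    other_division_opponents = TEAMS_PER_LEAGUE - DIVISION_SIZE    # 10 other league-mates
    return (
        interleague * interleague_opponents
        + other_divisions * other_division_opponents
        + same_division * same_division_opponents
    )


# The proposed scheme: 4 interleague, 7 vs. other divisions, 8 within the division.
print(season_length(4, 7, 8))        # 60 + 70 + 32 = 162

# A fully balanced schedule would need 162 to be a multiple of 29 opponents.
print(162 % 29)                      # 17, so it is not
print([29 * games for games in (5, 6)])  # [145, 174]
```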

That schedule actually could work out pretty well. With 15 series against teams from the other league each year, each team could alternate home and away each year, playing 8 interleague series at home one season and 7 the next. This impacts how the 7 games against the teams in the other 2 divisions within the same league would be split. You can’t have the intraleague ideal of playing 5 teams 4 games at home and 3 games on the road and 5 teams 4 games on the road and 3 at home. In the season with 8 home interleague series, a team has 2 extra interleague home games to give back, so only 3 of these 10 opponents would be played with 4 games at home and 3 on the road; the other 7 would get the extra game in their own park. This would get reversed in the season with 7 home interleague series. The 2 teams impacted could be changed every 2 years, resulting in a 10 year cycle for this schedule scheme.
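The home and road split can be checked the same way. This sketch rests on a few assumptions that are not spelled out above: every team plays 81 home games, each 4-game interleague series is played entirely in one park, and the 8 games against each division rival are split 4 and 4. The function name is made up for illustration.

```python
HOME_GAMES = 81           # assumption: half of the 162-game schedule
DIVISION_RIVALS = 4       # 8 games each, assumed split 4 home / 4 road
OTHER_LEAGUE_MATES = 10   # 7 games each, split either 4-3 or 3-4


def opponents_hosted_four_times(home_interleague_series: int) -> int:
    """How many of the 10 seven-game opponents must be hosted for 4 games."""
    interleague_home = 4 * home_interleague_series
    division_home = 4 * DIVISION_RIVALS
    remaining = HOME_GAMES - interleague_home - division_home
    # Every seven-game opponent visits for at least 3 games; each home game
    # beyond that baseline means one opponent is hosted for 4 instead of 3.
    return remaining - 3 * OTHER_LEAGUE_MATES


for series in (8, 7):
    print(series, opponents_hosted_four_times(series))
```

The two results add up to 10, which is what lets the split simply flip from one season to the next.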

There’s still a bit of imbalance in the schedule, but I feel it’s more balanced than what MLB currently uses. It also makes sure fans who follow their teams would have a chance to see all of baseball’s stars, making the players, and the sport by extension, more marketable. An additional bonus is the hype for those storied intradivisional rivalries is more justifiable (looking at you, Entertainment and Sports Programming Network).

Mr. Manfred, my phone line is open if you’d like to discuss.

Organizing the World’s Sabermetric Research, Part 3 – Plugging into SABR

It’s been just about a month since SABR 44. As I teased in my very verbose recap, a few things came out of the committee meetings that relate to this possibly quixotic quest I have to catalog all the world’s sabermetric research.

The first thing to note is that SABR already has an effort to catalog all baseball writing: The Baseball Index. Clicking through that link will show a functional, but dated and incomplete, reference of baseball documents, recordings, and other materials that any baseball researcher could want to know of. It does include many sabermetric works already, but finding these works isn’t all that easy. This is mostly because of a fickle search function that doesn’t work quite as well as a modern internet user would like, but also because the tags on the articles are designed around indexing all baseball research, not just sabermetrics. Most sabermetric entries are listed with the tag of “statistical analysis” and nothing deeper, unlike the articles on Saber Archive. Another thing of note is its current state of incompleteness, which is due to a broken data entry system. I will note here that, at SABR 44, the committee did mention someone is working on an upgrade to this system (hint: he runs a very popular website).

Secondly, the Statistical Analysis committee, in its 7 AM Friday morning meeting at SABR 44, brought up the idea of a group project to create a centralized reference list for sabermetric research. Many members of the committee had various ideas about what such a resource should look like: a list of the most recommended articles, a full literature review of one area of sabermetrics (e.g. defensive metrics), working with the Baseball Index Project committee, and a wiki were all suggested. Phil Birnbaum, who chairs the Stat Analysis committee, is currently collecting names of those interested in helping with this committee project, even if you’re not a member of SABR.

How do these two things affect what I had in mind? The Baseball Index, when upgraded and if designed better than its current state, would contain a lot of the features I am working to include in my database. There are a few features I plan to have that are not in TBI, most notably a citation link between works and topical tags that are more like Saber Archive’s. Thus, my database and the Baseball Index should be able to co-exist. I’ll be using TBI as an additional source for locating sabermetric research works to be included. I’ll also be contributing to the Baseball Index, focusing on articles in academic journals, which appears to be a major gap in their listing at the present time.

The Statistical Analysis committee project is too new to really know how things will shake out. I’m already on board to help with this committee project, and there’s a non-zero chance I take a leadership role with it. However, there are just too many unknowns to really know how my work will fit in with what this project turns into. All I can do is keep on keeping on.