Organizing the World’s Sabermetric Research, Part 1

Lent started recently for Western Christianity, and I have a confession that may not be all that surprising: I read an absurd amount of sabermetric research. I read articles on the big sites like Baseball Prospectus and FanGraphs. I read the blogs of a number of fellow SABR members. I read articles from academic journals like the Journal of Quantitative Analysis in Sports and The American Statistician. I buy the Baseball Prospectus, Hardball Times, and Bill James annual books each winter. At a minimum, I keep tabs on the papers presented at the big conferences like the recently completed MIT Sloan Sports Analytics Conference, the summer SABR Convention, and the Joint Statistical Meetings.

All of this information is being disseminated, but finding it all is still rather difficult and time-consuming. The links above are just a small sampling of the bookmarks and RSS feeds that I use to read baseball statistical research. That’s ultimately not a good thing for sabermetrics, because it leads to two major issues:

  1. There are two main types of researchers doing work in the field of sabermetrics: the analytically inclined fan or sports industry professional, and the academic statistician. For the most part, neither type ever looks into what is going on with the other type. The fan avoids the academic in part because of the inherent barriers built into accessing and understanding academic research, and in part because the academic is typically more focused on the statistical technique used than on subject matter applicability. The academic avoids the fan because of the perceived lack of statistical rigor in the fan’s analyses and the fact that a blog post can go up without any sort of peer review before publishing.
  2. As Colin Wyers noted in one of his final articles at Baseball Prospectus, there is no single record of the research done by saberists/sabermetricians, in part because a lot of this research was done on an individual basis and it is not all channeled through any single organization. While this decentralization is generally good for making research available, it does mean that articles and books with old ideas can easily be discarded and lost despite their merit at the time of publication.

How can this be fixed? One method would be an aggregator site that captures and categorizes articles. This is the approach of Saber Archive, currently in closed beta testing. (Note: I have participated in this closed beta. You can sign up to participate here.) Matt Dennewitz has built a fine interface to capture the text of research articles and make these articles searchable by key words and by research topic category. This should become an invaluable resource, especially once the site is public and more fully populated with articles. However, its scope of research will be limited to what is currently published on the web and can be archived by Matt’s software. Granted, that covers much of the research published today, but it does leave out anything that is only published the old-fashioned way: on paper, in books and magazines.

It seems that to bridge this gap between the historic print world and the modern web, sabermetrics needs something closer to the modern library catalog: a database of books and articles with sabermetric content. Charlie Pavitt has attempted this with his Statistical Baseball Research Bibliography, though with a narrow scope. Like me, he is an avid reader of sabermetric material, and he draws upon that knowledge base to cull through the research published and identify the works of sabermetric research that, as he states in the description of the file, “have been intended to make a contribution to our knowledge about baseball as a statistical science.” What he’s produced is a fine resource, though because this definition is ultimately arbitrary, his list, in my opinion, leaves off some pretty key pieces of research, even if you assume he just hasn’t had a chance to update the list since 2011, the most recent publication date of any entry.

These two are the only attempts I know of to catalog even a part of the world’s sabermetric research. Both have good features; both have drawbacks. I feel that sabermetrics deserves a little bit more than what is currently provided. Thus, I am proposing to build a new comprehensive database of sabermetric research articles. Consider this my attempt to take up Colin Wyers’ challenge to the sabermetric community.

This new database will undoubtedly share some features with the efforts described above. It will be built as a database, which I presume is the underlying framework of Saber Archive. However, the contents of that database, all merged together, will result in a table more like Pavitt’s Bibliography. The main goal of it all will be to capture every piece of sabermetric research and link it all together, with the ancillary goals of bringing the fans and the academics together more often and, just maybe, increasing everyone’s ability to understand statistical methods and results.
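To make the "database underneath, bibliography-style table on top" idea a bit more concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. Everything in it is an assumption made for illustration: the table names (works, authors, work_authors), the columns, and the helper functions build_catalog and flat_bibliography are hypothetical, not a settled design and not how Saber Archive or Pavitt’s Bibliography actually work. The only detail taken from this post is the first entry.

```python
import sqlite3


def build_catalog():
    """Create a tiny in-memory catalog (hypothetical schema, illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE works (
            work_id INTEGER PRIMARY KEY,
            title   TEXT NOT NULL,
            medium  TEXT NOT NULL          -- e.g. 'book', 'article', 'blog post'
        );
        CREATE TABLE authors (
            author_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL UNIQUE
        );
        -- Link table: one row per (work, author), with author order preserved.
        CREATE TABLE work_authors (
            work_id   INTEGER NOT NULL REFERENCES works(work_id),
            author_id INTEGER NOT NULL REFERENCES authors(author_id),
            position  INTEGER NOT NULL,
            PRIMARY KEY (work_id, author_id)
        );
        """
    )
    # Entry #1, as named in this post.
    conn.execute(
        "INSERT INTO works VALUES (?, ?, ?)",
        (1, "The Hidden Game of Baseball", "book"),
    )
    conn.executemany(
        "INSERT INTO authors VALUES (?, ?)",
        [(1, "John Thorn"), (2, "Pete Palmer")],
    )
    conn.executemany(
        "INSERT INTO work_authors VALUES (?, ?, ?)",
        [(1, 1, 1), (1, 2, 2)],
    )
    return conn


def flat_bibliography(conn):
    """Merge the linked tables into one flat, bibliography-style list."""
    rows = conn.execute(
        """
        SELECT w.work_id, w.title, w.medium, a.name
        FROM works AS w
        JOIN work_authors AS wa ON wa.work_id = w.work_id
        JOIN authors AS a ON a.author_id = wa.author_id
        ORDER BY w.work_id, wa.position
        """
    ).fetchall()
    entries = {}
    for work_id, title, medium, name in rows:
        entry = entries.setdefault(
            work_id,
            {"id": work_id, "title": title, "medium": medium, "authors": []},
        )
        entry["authors"].append(name)
    return list(entries.values())


if __name__ == "__main__":
    for entry in flat_bibliography(build_catalog()):
        print(entry["id"], entry["title"], "/", ", ".join(entry["authors"]))
```

Keeping authors in their own table, rather than as a text field on each work, is one way to get the "link it all together" part: the same author row can be attached to any number of works, while the merged view still reads like a single bibliography.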

Admittedly, this is a huge project for one person to take up on his own. This is a “Part 1” post partly because of the project’s scale, and partly because I still have a number of features to work out and design issues to confront. I’ll cover those issues in future installments. I do have my starting point: The Hidden Game of Baseball by John Thorn and Pete Palmer will be entry #1 in the database. I feel that’s fitting, as it was one of my gateways into sabermetrics.

Help will most certainly be needed, especially in identifying new sources of research to be added to the database. You can offer your assistance in the comment section or by any other means of contacting me if you happen to know them.


