Organizing the World’s Sabermetric Research, Part 2 – Contents and Design

In Part 1, I discussed current efforts to capture sabermetric research and announced my intention to create a new sabermetric research reference database that will be more comprehensive than those efforts. In this part, I address two critical questions: which works should be included, and how the database should be designed.

Because my goal for the yet-to-be-named sabermetric research database is to capture all sabermetric research, it’s important to define what qualifies as sabermetric research. There are certain articles and books that most people involved in sabermetrics would agree should be included. Those are the easy entries. But to be comprehensive, the database will need to include more than the “greatest hits” of sabermetrics. That means there will need to be some way to decide what goes in and what doesn’t, and that decision shouldn’t rest with a single person.

So how will it be determined whether a published work is included in the database? This is where the “greatest hits” will be of great use. Using pieces of research that are already very well known as a starting point (such as The Hidden Game of Baseball or the Bill James Abstracts), the database can be expanded by looking at the references and citations in those works and including the cited works that are also sabermetric in nature. This, however, is merely a good start.
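The expansion from well-known works amounts to a breadth-first walk over citations. Here is a minimal sketch in Python, not the actual process used for the database. It assumes a hypothetical `citations` mapping (a work's title to the titles it cites) and a hypothetical `is_sabermetric` check standing in for the inclusion decision. The titles in the example are placeholders, not real works.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set


def expand_from_seeds(
    seeds: Iterable[str],
    citations: Dict[str, List[str]],
    is_sabermetric: Callable[[str], bool],
) -> Set[str]:
    """Start from seed works, follow their citations, and keep every
    cited work that passes the (hypothetical) inclusion check."""
    included: Set[str] = set(seeds)
    queue = deque(included)
    while queue:
        work = queue.popleft()
        # A work with no recorded citations is a dead end.
        for cited in citations.get(work, []):
            if cited not in included and is_sabermetric(cited):
                included.add(cited)
                queue.append(cited)
    return included


# Placeholder data, purely for illustration.
citations = {
    "Seed A": ["Work B", "Unrelated C"],
    "Work B": ["Work D", "Seed A"],
}
found = expand_from_seeds(
    ["Seed A"],
    citations,
    is_sabermetric=lambda title: not title.startswith("Unrelated"),
)
print(sorted(found))
```

Works that fail the check are not followed further, so the walk stays within sabermetric material. It stops once every reachable citation has been visited.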

One result of a well-populated comprehensive research database will be the ability to identify schools of thought, and it’s possible that the “greatest hits” list I use will be from the currently dominant school of thought. It is also likely that, at some point, I’ll run into dead ends walking through citations in other works. So after exhausting the “greatest hits” path, I’ll turn to looking for work by the most noted researchers. They’ll be identified either by having won awards from SABR or by having been hired by a big league front office as a result of their published work. It should be noted that this author-based method could encounter the same problems as the “greatest hits” method.

A third angle for finding works to be added to the database will be based on the source of the research. Certain publications and websites definitely carry more weight than others. Books and academic journal articles are certain to be included due to the review process these works undergo before being published. Presentations at conferences are also fairly likely to be included, though duplication of work will be avoided, as many presentations will have corresponding published papers, especially those from academia. Certain blogs and websites, like The Hardball Times, will also be considered. There are a few sabermetric primers that can be cross-referenced to help identify other sources.

For getting the database initially populated, those methods should work just fine. Having all those different inputs also means that a variety of materials needs to be considered and managed. One of the main considerations is how to record the information for each type of published work. Much like how different types of references had to be cited using different syntax in high school papers, different types of research sources need to be treated differently in the database to accommodate the unique features of each type. Let’s consider each type of research source:

  • Books – Seemingly simple, as you have lots of publication information included (title, author, publication date/year, etc.). However, many sabermetric books cover multiple topics of research; thus, there will need to be a way to tag subject matter based on page number.
  • Journal/Magazine Articles – Perhaps the easiest type to enter into this database. Usually focused on a single topic, with easily identifiable publication information (title, author, journal, page #s, etc.). Also often includes keywords for topical focus.
  • Blog posts – Like journal articles, typically single topic. Tags are often used to identify topics covered. Can run into the issue of not being able to properly identify authorship. URL will need to be captured, but that can become inaccurate if the website is deleted/moved. Otherwise, cited very similarly to journal articles.
  • Online forums – Specifically thinking of the old Usenet group here, but could also apply to e-mail lists like SABR-L and other online arenas like Tango’s Forum. Many of the same issues as blog posts. I’m really disinclined to include postings from these sources, as they are more of a discussion forum for working out ideas than they are a place to publish research. However, if a posting here is cited in another work included in the database, it’ll probably be included. Citations will be almost exactly the same as blog posts.
  • Conference presentation slides/posters – Inconsistently published, but valuable when available. Usually accompanied by formal paper, though content can differ slightly to account for new information. These materials are typically meant to be accompanied by someone speaking. This database will only include those presentation slides that can be accessed electronically or were published as part of a conference proceedings book that could be found in a library. It will also link the slides to the corresponding paper or…
  • Audio/Video – One of the great things about being a researcher in modern sabermetrics is the ability to virtually attend conferences, thanks to posted video/audio from the event. These will probably be included in the same manner as conference slides/posters. Citations for both will incorporate information typically used for citing conference proceedings: presenter, conference, title, and date at a minimum.
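To make the differences between source types concrete, here is a rough sketch of how the records might be modeled in Python. It is an illustration only and not the database's actual schema, which is still being worked out. Every class and field name is hypothetical, and the `retrieved` field is my own assumption about one way to cope with URLs that move or disappear.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Work:
    """Fields shared by every source type (hypothetical names)."""
    title: str
    authors: List[str]  # may be empty when authorship is unclear
    year: Optional[int] = None
    topics: List[str] = field(default_factory=list)


@dataclass
class Book(Work):
    # Maps an inclusive (first_page, last_page) range to its topics,
    # since one book can cover several areas of research.
    page_topics: Dict[Tuple[int, int], List[str]] = field(default_factory=dict)

    def topics_on_page(self, page: int) -> List[str]:
        """Return every topic whose page range contains `page`."""
        found: List[str] = []
        for (first, last), tags in self.page_topics.items():
            if first <= page <= last:
                found.extend(tags)
        return found


@dataclass
class Article(Work):
    journal: str = ""
    pages: str = ""


@dataclass
class OnlinePost(Work):
    """Blog posts and forum postings, which are cited almost identically."""
    url: str = ""
    retrieved: str = ""  # assumption: date the URL was last confirmed


@dataclass
class Presentation(Work):
    """Slides, posters, audio, and video from a conference."""
    conference: str = ""
    date: str = ""
    related_paper: Optional[Article] = None  # link to the corresponding paper


# Placeholder data, purely for illustration.
book = Book(
    title="Example Book",
    authors=["A. Author"],
    page_topics={(1, 40): ["run estimation"], (30, 60): ["park effects"]},
)
print(book.topics_on_page(35))
```

Page ranges are allowed to overlap, so a page that touches two subjects returns both. The shared `Work` base keeps the common citation fields in one place, while each subclass carries only what is unique to its type.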

The schema of the database is still being edited as I work towards building version 1. I’ll discuss that once the first version is released. That date is still TBD, as this is not my full-time job. If you want to help or have a suggested name, please feel free to drop me a line on Twitter or in the comments section.



