The SABR 101 Project

One of the things I missed when I had to skip out of SABR 45 Saturday was the committee meeting for SABR’s largest research committee, Statistical Analysis. Unlike many of the other committees, the Stat Analysis committee didn’t have a group project to work on, in part due to the individual nature of most members’ research. A couple ideas were bandied about the meeting during SABR 44, but it took until SABR 45 to get one of those ideas off the ground.

A few weeks back, Phil Birnbaum, the chair of the committee and editor of the By the Numbers newsletter, announced that group project. The idea is to create a crowd-sourced list of key resources that help newcomers to sabermetrics learn what has been done and give them the foundation for additional contributions.

There are plenty of books and articles which I could cite, so I’m going to start with the broad resources that cover multiple topics. That means it does skew towards books. They are listed in the order they came out of my head.

Before I get into my long list, I want to invite you, dear reader, to contribute your recommendations to this project. If you do so in the comments, I’ll be sure to pass them on.

  1. The Numbers Game, by Alan Schwarz. This book came up recently when Graham Womack of Baseball Past & Present and I discussed which title, among this one and a few others that will make their way onto this list, we’d recommend first. We both agreed that this title is where we’d tell others to start. It’s a fantastic history of baseball’s numbers, and understanding how a particular stat like batting average or OBP came to be is key to understanding any analysis with those measures.
  2. The Hidden Game of Baseball, by John Thorn and Pete Palmer. It’s over 30 years old, and it might be the most important book in sabermetric history. There’s a reason I started my sabermetric research database project with this book: it was The Numbers Game before Schwarz wrote his book with its concise history of baseball statistics AND it introduced the linear weights model to the world, which is much more of the mathematical foundation of modern sabermetrics than anything put out by the most famous name in the field.
  3. The Bill James Abstracts, both the annuals printed from 1977-1988 and the Historical Abstract (first published in 1986, revised and updated in 2001). For the many who grew up before Al Gore’s invention came to the masses, these books were how they were introduced to sabermetrics. Bill isn’t a statistician in the academic sense, but his understanding of baseball endows his analyses with tremendous insight.
  4. Curve Ball, by Jim Albert and Jay Bennett. I have a rare relationship with this book. I read it before I ever read anything by Bill James. It steered me from being a pure mathematics major in college to a statistics major, which is one of the 5 best decisions I have made in my life. So yeah, I hold this title in high esteem for many personal reasons. That being said, it might be the best book for helping aspiring saberists to start understanding mathematical statistics, which is essential to advancing the field.
  5. The Book, by Tango, Lichtman, and Dolphin. For many saberists, this is the modern treatise on the subject. Grounded in an understanding of Palmer’s Linear Weights system, they introduce wOBA and use it to explore every facet of the game.
  6. For online reference guides, the FanGraphs Sabermetric Library is my preferred site, as I consider it to be the most complete. Neil Weinberg is also authoring weekly posts to explain the ins and outs of various metrics, helping keep the reference guide current with new research.
  7. The Best of Baseball Prospectus: 1996-2011 is a 2 volume set that is a compilation of the most important articles from the first 15 years of that site’s history. This is essentially my proxy for the excellent writing on that website, including Voros McCracken’s article on DIPS Theory and Keith Woolner’s “Baseball’s Hilbert Problems”.
  8. Baseball Hacks, by Joseph Adler. The ability to analyze data is great, but it is useless if you can’t get data to analyze. While the book is somewhat dated, it’s a great introduction to many of the coding skills required to do sabermetrics efficiently in the computing era, and one I still find worthwhile to have on my shelf.
  9. SABR101x, the massive open online course at edX administered by Boston University and designed by Andy Andres et al. If you prefer a class-based method for learning sabermetrics, this is as good as you’ll find. There are tracks on the history of sabermetrics, statistics, the SQL/R skills needed, and a build up to understanding some key metrics used by saberists.

One thing I want to keep separate from this list is SABR’s own Guide to Sabermetric Research, which was put together by the aforementioned Phil Birnbaum. His involvement spearheading this SABR 101 project is why I leave it out for now. I have a sense that it will be that guide that is updated as a result of this group work.

SABR 45: A Partial Review for a Partial Experience

Almost 2 years ago, I was sitting on my computer scrolling through Twitter when this appeared:

Yeah, I got a little bit excited when I saw that.


It was never a question of whether or not I would be attending the SABR Convention this year. Having a convention in your backyard has some benefits, the biggest of which is cost. Aside from the convention registration, I had my choice as far as how to get to and from the Palmer House and whether I wanted to sleep in a hotel bed or my own. With an infant crawling around my house, I chose to not book a hotel room (at approximately $200/night) and took commuter rail in and out of Chicago each day.

The downsides of my lodging and travel decision were twofold: 1) I didn’t partake in nearly as many hallway and bar conversations as I did last year, depriving me of what many consider the most fun part of the convention experience; and 2) it made it easier for other things to pull me away from the convention activities. Having to catch a train at 7 am to just make it to a day of events running from 8 am to 10 pm meant having to reconcile sleep with the train schedule. Then, family events cropped up on the weekend, making it unfeasible for me to go downtown Saturday or Sunday. While missing Sunday only cost me the Historic Ballpark Site tour, not being able to attend Saturday cost me half of the presentations and panels and most of the committee meetings I was interested in.

However, what I did attend and help with as a volunteer and member of the host chapter was quite fantastic. Wednesday is typically a travel and get acquainted with the city day. With minimal travel, I helped as a volunteer with registration and Cubs ticket distribution. As with past conventions, there was a tour of the host city. I skipped this year’s walking tour due to the aforementioned volunteer work, but Jacob Pomrenke put together a fantastic document highlighting the sites with baseball history attached to them as the tour traversed downtown Chicago. (If a KML file gets created for it, I’ll link to it here). After registration closed down for the night, I sacrificed the welcome reception in order to catch the train and be home.

Thursday was what I presume is a rare day in recent SABR Convention history. At no time did any attendees have to pick between different meetings or presentations, as it was a single program of events for the day. Cubs broadcasters Len Kasper, Jim Deshaies, and Ron Coomer graced the broadcasters panel in the morning, chiming in when moderator Curt Smith would let them do so. Many of Smith’s questions centered on the Cubs, and all three provided the level of insight that I’ve become accustomed to when I do tune in for Cubs broadcasts. This was followed by the annual business meeting, which showed the continued positive growth of the society but, unlike last year, revealed no final verdict on next year’s convention. It seems the society learned its lessons the hard way: Houston had a hotel location near a mall instead of the ballpark due to the latter option’s lack of availability after the 2014 MLB schedule was released; Chicago corrected for that by getting the ideal hotel location early, but ended up victimized by selecting the one weekend BOTH Chicago clubs were on the road. There is a tentative plan for SABR 46’s host next year, but it would be unwise to get excited for seeing a bobble head museum and the most colorful home run sculpture in MLB quite yet (never mind my own personal ability to attend next year). Thankfully, despite the lack of weekend games, the Cubs were finishing up a series with the Dodgers, so Thursday afternoon’s getaway day contest ended up being the convention game. It was entertaining simply because of Joe Maddon’s tinkering with the line-up every 2 innings or thereabouts.

Thursday night ended up being what I think was the biggest highlight of the convention (and perhaps a way of the national office apologizing for the schedule debacle): a concert in the Palmer House’s Grand Ballroom with the Baseball Project.

[Photo: The Baseball Project Rocks SABR45]

From left to right, Scott McCaughey, Linda Pitmon, Mike Mills, and Steve Wynn rocked the house with their songs about Harvey Haddix, Ted Williams, Larry Yount, Big Ed Delahanty, and many others. Wisely, they opened with “Box Scores” off their album 3rd, which to me is the quintessential SABR song. It was pretty awesome. If you like baseball and rock music (especially R.E.M.), you’ll love this band.

After forgetting my phone at home Friday morning, I made it in time for the second group of presentations Friday morning, dropping in on Tara Kreiger’s presentation about Andy Coakley’s labor struggles with organized baseball. It was a fascinating story that I was unfamiliar with, but it exemplified the blackballing many early players went through when they complained about their contract. This was followed by 2 panels: one titled Pitching Prodigies that featured Steve Trout and Joe Berton a.k.a. “Sidd Finch”, and a presentation by the 4 Letters on an upcoming project. The former was my favorite panel I attended, as Berton told the story of how he got involved in the Sidd Finch hoax perpetrated by George Plimpton and Sports Illustrated. Trout seemed more subdued about his experiences, which I guess is to be expected from an 8th overall pick who did not have the career he expected to have. The latter was a “stealth announcement” about a project entitled “1927: The Diary of Myles Thomas”, which looks to chronicle the 1927 Yankees via “real-time historical fiction” storytelling. I kind of like the concept, but will probably wait and see what ends up being produced by Steve Wulf and Douglas Alden Warshaw. The presentation I saw after the panels was entitled “Aging Fan Base: Using Twitter to Develop a New Generation of Baseball Fans” and given by Allison Levin. Unfortunately, she didn’t get to many suggestions in her slides, as most of the time was spent looking at Twitter usage during the 2014 World Series. But she has a few avenues for further exploration that will hopefully yield some results, though I have a sense that MLB might be ahead of her on doing this.

The morning block was followed by a tribute-filled awards luncheon. I skipped this last year, since my meal times were spent with my wife who graciously traveled to Houston with me. I’m glad I went this year, because I got a better sense of what this organization means to so many people. Tom Hufford couldn’t avoid breaking down as he eulogized two of his fellow Cooperstown 16 that founded SABR, Ray Nemec and Joe Semenick. Phil Rogers had it a bit easier in terms of emotions, but still had to encapsulate what Ernie Banks and Minnie Minoso meant to their adopted hometown. He did so, and did it well. After the banquet I took time to peruse the vendor room, which is a dangerous endeavor given the number of baseball books that are available for sale. My wallet came away only somewhat dented. The only committee meeting I attended was for the Business of Baseball, which gave an update on the Winter Meetings project (all years are being researched by someone!), the Team Ownership bios (4 of 30 done or in progress), and a reminder from chair Michael Haupert about the importance of examining the source of data in research, using examples from the pre-1983 salary database to show how what’s printed isn’t always accurate.

I then attended 5 more presentations between the committee meeting and heading home. In order:

  • David Kaiser questioned “What Makes a Dynasty?” He counted teams that played postseason baseball in 3 of 6 seasons as a dynasty, splitting the analysis into 3 eras based on the postseason structure in place. He noted which ones were dominated by pitching and which ones weren’t. Most of the expected teams showed up where you would expect. The only bone I pick is that, based on the average winning percentage by era for the dynastic teams in the study, he said mediocrity was more prevalent today than it used to be. I think that’s just a function of his definition of dynasty.
  • David W. Smith, the Retrosheet president, updated his look at run scoring in the 1st inning, asserting that travel doesn’t seem to have an effect but that the number of runs the visiting team scores in the top of the 1st is highly correlated with the number of runs they allow in the bottom of the 1st. You can find his paper on Retrosheet’s site.
  • Zach Moser gave an oral presentation on how Cap Anson’s views on colored players in professional baseball were portrayed over time. While revered in his time, Anson’s racism became a hot topic while he was among the early players considered for induction into Cooperstown’s most noted museum. Anson’s racism was revisited as many of his team records for the Cubs were eclipsed by the aforementioned Ernie Banks, and Moser suggests that most modern apologists for Anson are deficient in their criticism.
  • John Burbridge examined “The Increasing Importance of Quality Starts” by mostly just doing an x-ray on the definition of a quality start. He ultimately came to the conclusion that 6 IP with 3 or fewer runs allowed is reasonable, and claims that it is increasingly relevant as bullpens are utilized more and more.
  • Finally, Bruce Allardice talked about how pro baseball became a big part of Chicago in the mid 1800s. Baseball grew in popularity in Chicago, paralleling the game’s growth in popularity nationwide. By 1870, the city’s elite coveted the status of being the nation’s pork capital, vying against a river town called Cincinnati. Because of this rivalry with the 2015 All Star Game host city, Chicago’s wealthy pooled funds to found the first professional club in the city. The White Stockings did manage to beat Cincinnati twice late in that season, and would go on to claim the championship based on a disputed victory over the New York Mutuals, who also claimed the title. Unfortunately, baseball took a 2 year hiatus after a cow tipped a lantern and ignited a magnificent blaze that required years of rebuilding.

I’d love to say more about SABR 45, but (1) I’m already at 1,750 words if you’ve read to this point and (2) the downside of a local convention is that you can be pulled to do other things since you aren’t travelling. That’s what happened to me on the weekend, as a family event popped up and hindered my ability to get in and out of the city. I don’t know if I’ll get to go to another convention for a while at this point, and next year looks doubtful regardless of location. When I do go again, I’m going to make sure of 2 things: I’m staying at the hotel, and I’m hanging out at the bars to talk baseball over beers. That’s the convention experience that I missed, and why those who go to one convention try to make it an annual trip.

Statcasting Expectations

The next level of public baseball data has arrived. MLB Advanced Media’s Statcast made a hyped television debut, although it had made cameos in online replay videos last year. With the system installed in all 30 ballparks to track all movement on the field, hopes are high for discovering many things about the game via data that previously could only be imprecisely discerned by watching a lot of baseball.

However, while MLBAM has stated that Statcast data will be made public, it is still unclear what types of data and how much of it will be available for public use. Bits and pieces of the data have slowly appeared as the 2015 season started. Among the first pieces have been the velocity and angle of the ball off the bat, which savvy scrapers of the Gameday files, such as Daren Willman of Baseball Savant fame, have captured and published. But whether the public will have access to the raw data remains to be seen.

It seems unlikely to me that there will be public access to the raw Statcast data anytime soon. The first challenge is the sheer size of the data set, which is already measured in petabytes. This is unlike the pitchF/X data, which can be scraped and saved on a home PC. Raw Statcast data is best stored on a cloud server. While MLBAM is certainly using “the cloud” as the method for allowing the 30 teams to access the data, it would be a massive security risk to open that server up to the public domain. Setting up a public server would be an additional cost, and it’s hard to argue that there would be any significant return on that investment for MLBAM. However, Statcast is already sponsored by Amazon Web Services, so the possibility is there for the raw data to be made public via the AWS platform. That possibility seems very remote at this time.

A more likely scenario (at least in my mind) for the release of Statcast data is something like what the NBA did with its SportVU data. SportVU, the player tracking system developed by a subsidiary company of STATS, Inc., is akin to Statcast in that it tracks player and ball movement. The Stats section of NBA.com (linked above) shows various measures and animations gleaned from the SportVU data, but does not provide fans access to the raw data. This is the path I expect MLBAM to take. The batted ball data that has already shown up in Gameday is like this, and many of the other metrics that have been teased via broadcast, such as route efficiency and perceived velocity, could also be distributed in this manner.

Releasing the data in a summarized or snapshot form isn’t as risky to the teams, who were not all that happy when pitchF/X data made its way into the open world. Letting public researchers publish insights from that data, where every team could read them, took away an opportunity to gain a competitive advantage. This is why the other Sportvision products, like hitF/X that also provided batted ball information and commandF/X that tracked where the catcher’s glove was positioned, have been available to teams but not the public.

Regardless of what form the data takes when it is released, Statcast data should enable saberists to use more granular data to show what it takes to succeed in the game of baseball. Some of these data-driven discoveries may merely affirm what scouts and those in the game have been taught and believed for years and decades, but I’m sure some will not. Like many others, I can’t wait to get my hands on it.

Sabermetrically Gaming: Strat-O-Matic Baseball

I happen to be a man of many interests. Besides baseball, one of my other primary interests is gaming, especially tabletop gaming. My interest in games is rooted in my love of competition and my explorations of the world via mathematics and statistics. It’s no coincidence those are traits inherent to following baseball as well.

Today, I’m starting a series exploring various games that attempt to simulate baseball. My focus will be more of the math that underlies each game and how closely it helps replicate the on-field experience, though I’m sure some game play commentary will filter in. Leading off is perhaps the most well-known of the baseball table top simulations, Strat-O-Matic Baseball.

Jim Albert and Jay Bennett open their book Curve Ball with a dissection of how various baseball tabletop games model the actual action of a baseball game. Naturally, Strat-O-Matic Baseball is covered: they explain some of the math behind the model and how it assigns credit to the batter, pitcher, and defense. I want to focus more on the game design and the probabilities involved.

The basic mechanics of the game are relatively simple, though there are optional levels of complexity that can be added to the game now that were not a part of the original edition. There are batter cards and pitcher cards, and each card contains a table of possible results that are determined by the roll of 3 six-sided dice. One die, typically white, determines which card and which column the result comes from, with the result corresponding to the sum of the 2 other dice, typically red, in the designated column. In many instances, a result then requires the roll of an additional 20-sided die. This provides 6 × 36 × 20 = 4,320 different possible outcomes.
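To make the counting concrete, here is a minimal sketch that enumerates the dice rolls described above. The variable names are mine, and the assumption that white-die columns 1 through 3 sit on one card and 4 through 6 on the other is an illustration, not something spelled out in this post.

```python
from collections import Counter
from itertools import product

DIE = range(1, 7)    # faces of a six-sided die
D20 = range(1, 21)   # faces of the optional 20-sided die

# Every (white, red, red) roll of the three six-sided dice.
three_dice_rolls = list(product(DIE, DIE, DIE))

# A card "cell" is (column picked by the white die, sum of the two red dice).
cell_counts = Counter(
    (white, red1 + red2) for white, red1, red2 in three_dice_rolls
)

total_three_dice = len(three_dice_rolls)      # 6 * 36
total_with_d20 = total_three_dice * len(D20)  # 6 * 36 * 20

# ASSUMPTION: columns 1-3 belong to one card, columns 4-6 to the other.
chances_per_card = sum(
    count for (column, _), count in cell_counts.items() if column <= 3
)


def cell_probability(column, red_sum):
    """Chance that a single roll lands on one (column, red-dice sum) cell."""
    return cell_counts[(column, red_sum)] / total_three_dice


print(total_three_dice, total_with_d20, chances_per_card)
print(cell_probability(1, 7))   # a 7 is the most likely red-dice sum
print(cell_probability(1, 2))   # a 2 (or 12) is the least likely
```

The point of the enumeration is that cells are not equally likely: a result placed on a 7 comes up six times as often as one placed on a 2 or 12, which is what lets a card fine-tune a player's rates.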

[Images: Strat-O-Matic pitcher card and batter card]

Unfortunately, there is no master database of SOM player cards that is available to fully analyze this model. However, a massive Strat-O-Matic Baseball fan by the name of Bruce Bundy put together a bunch of formulas to forecast how a player’s card would be created. My impression is that he created these formulas by looking at a bunch of player card sheets. I’ll use it here because it’s the best publicly available information about the game model that I can find.

Looking at the formulas provides insights into a number of assumptions made about baseball by Strat-O-Matic. Player cards are customized based on their statistics, but this customization is achieved using some assumptions about the probabilities of certain events occurring that are built into the game model.

Consider the old fashioned base-on-balls, the least sexy of the Three True Outcomes. The Walk formulas for both Batters and Pitchers are adjusted by a constant of 9. In terms of SOM, this means that the batter and pitcher cards are designed with the assumption that 9 out of the 108 results from the other card will result in a walk. Thus, the game implies an unintentional walk occurs about 8.3% of the time in baseball, with the credit being split between the pitcher and the batter. While the latter claim is not possible to investigate prior to pitch-by-pitch data being available, the former is. Here’s the overall major league non-intentional walk rate year-by-year since 1952, using the event logs courtesy of Retrosheet:

[Chart: non-IBB walk rate by season, actual MLB rate vs. SOM estimate]

You see that for most seasons here, the actual MLB non-intentional walk rate (in red) is slightly less than the estimated rate modeled by SOM (in blue). The average across these seasons is that non-IBB walks occur in 7.81% of the plate appearances, which is about 17/216. Since 17 is an odd number, it can’t be divided equally between the batter and pitcher, a key component of the Strat-O-Matic model. Thus, it seems that the walk rate implied by Bundy’s formulas is reasonable, though a bit high.
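The arithmetic behind that comparison can be worked through in a few lines. This is only a sketch using the figures quoted above (the constant of 9, the 108 chances per card, and the 7.81% average); the variable names are my own.

```python
from fractions import Fraction

WALK_CONSTANT = 9        # walk chances assumed on the other card
CHANCES_PER_CARD = 108   # 3 columns x 36 red-dice rolls
THREE_DICE_ROLLS = 216   # both cards combined
OBSERVED_RATE = 0.0781   # average non-IBB walk rate quoted above

# Walk rate implied by the constant: 9/108, the same as 18/216.
som_walk_rate = Fraction(WALK_CONSTANT, CHANCES_PER_CARD)

# The observed rate expressed as chances out of 216.
observed_chances = OBSERVED_RATE * THREE_DICE_ROLLS
nearest_whole = round(observed_chances)

# An equal batter/pitcher split needs an even total, so the candidates
# on either side of the observed rate are 8 per card or 9 per card.
even_split_options = {
    per_card: Fraction(2 * per_card, THREE_DICE_ROLLS) for per_card in (8, 9)
}

print(f"SOM implied rate: {float(som_walk_rate):.2%}")
print(f"Observed chances per 216: {observed_chances:.2f} (nearest: {nearest_whole})")
for per_card, rate in even_split_options.items():
    print(f"{per_card} per card -> {float(rate):.2%}")
```

The two even splits bracket the observed rate, which is the sense in which an odd total like 17 cannot be represented exactly when both cards carry the same number of walk chances.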

Here are the implied rates from Bundy’s formulas and their actual instance rates from the same Retrosheet data for a few other events:

  • Doubles – SOM rate of 180/4320 = 4.2%, Actual = 4.1%
  • Triples – SOM rate of 30/4320 = 0.7%, Actual = 0.6%
  • HRs – SOM rate of 100/4320 = 2.3%, Actual = 2.3%
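For anyone who wants to check the rounding, the implied percentages follow directly from the counts out of 4,320. A small sketch, with the "actual" figures copied from the list above rather than recomputed from Retrosheet:

```python
TOTAL_OUTCOMES = 4320  # 6 columns x 36 red-dice rolls x 20 d20 faces

# Counts implied by Bundy's formulas and actual rates, as listed above.
som_counts = {"Doubles": 180, "Triples": 30, "HRs": 100}
actual_pct = {"Doubles": 4.1, "Triples": 0.6, "HRs": 2.3}


def implied_pct(count, total=TOTAL_OUTCOMES):
    """Convert a count of outcomes into a percentage, to one decimal."""
    return round(100 * count / total, 1)


comparison = {
    event: (implied_pct(count), actual_pct[event])
    for event, count in som_counts.items()
}

for event, (som, actual) in comparison.items():
    print(f"{event}: SOM {som}% vs actual {actual}%")
```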

This replication of a generic baseball reality is why Strat-O-Matic has been so beloved for over 50 years. Hal Richman, the game’s inventor and mastermind, has created a game model that is flexible enough to work across eras. This enables SOM to sell new sets based on every season and specially designed sets, all of which can be mixed and matched as the gamer sees fit. If you’ve never played the game, find a way to do so at least once.

2015 SABR Analytics Conference Research Awards

Voting closed on President’s Day for this year’s SABR Analytics Conference Research Awards, and like last year, I have taken a great interest in seeing which articles were nominated. Although the voting is closed, I once again am sharing which articles I voted for and runners up in each category.

Contemporary Baseball Analysis: Harry Pavlidis and Dan Brooks, “Framing and Blocking Pitches: A Regressed, Probabilistic Model,” Baseball Prospectus, March 3, 2014.
This category was stacked. I could have reasonably voted for 4 of the 5 articles. But Pavlidis and Brooks managed to stand out above the rest by a hair. Like Max Marchi’s winning article from last year, this is another landmark addition to our statistical understanding of catcher framing, possibly the hottest topic in sabermetric research until the StatCast data sees the public light of day. While Jonathan Judge and this duo have already updated and improved on their work, its import to quantifying catcher framing was without equal in 2014.
Runner up: Jon Roegele, “The Effects of Pitch Sequencing,” The Hardball Times, November 24, 2014.
Pitch sequencing is my current favorite topic in sabermetric research. It’s not quite as popular as catcher framing because sequencing is largely dependent on the pitcher’s arsenal and the techniques needed to study sequencing tend to go beyond basic data mining. Jon’s work is the best on the topic that doesn’t require an understanding of Markov chains and/or the mathematical mechanics of game theory.
The other 2 articles I almost voted for were:

  • Russell Carleton, “N=1,” Baseball Prospectus 2014: The Essential Guide to the 2014 Season, January 2014. Pizza asks what we really know about an individual player, and explores swing rates for individual players using regression. (Yes, I’m one of those who instantly started mouthing GLM, HLM, and MLM at the words “gory math” and “regression” in the article.)
  • Jeff Sullivan, “Alex Gordon Barely Had a Chance,” FanGraphs, October 30, 2014. The best breakdown of the most scrutinized play of this year’s World Series.

Historical Analysis/Commentary: Steve Treder, “The Strikeout Ascendant (and What Should Be Done About It),” The Hardball Times Baseball Annual 2014.
A tough category to pick, but Steve’s breakdown of strikeout eras in baseball history was an exploration reminiscent of a Bill James essay in his 1980s Abstracts. He explores strikeout rates through history, arguing that the increase is part of a natural rise of the power game in baseball, both at the plate and on the mound. Nothing, not even a proposal to lop off the bottom three inches of the strike zone, will change the minds of batters sacrificing discipline for power or pitchers trying to keep that power in check by throwing hard at the expense of in-game longevity.
Runner Up: Bryan Soderholm-Difatte, “The 1914 Stallings Platoon: Assessing Execution, Impact, and Strategic Philosophy,” SABR Baseball Research Journal, Fall 2014.
While platoons aren’t anything new, I always find it interesting when someone looks at a season in the distant past using modern tools. Bryan’s analysis of the 1914 Stallings platoon was well thought out and about as comprehensive as such an analysis is capable of being.

Contemporary Baseball Commentary: Lewie Pollis, “If You Build It: Rethinking the Market for Major League Baseball Front Office Personnel,” Brown University, senior honors thesis, Spring 2014.
Most senior theses don’t make it beyond the adviser’s desk. If you happen to read one, it’s probably because you know the person who wrote it or you were in the person’s graduating class and major while they wrote it. Lewie’s thesis is clearly more public than that. It’s also an extremely articulate breakdown as to why wages for lower-level front office personnel should be higher. It won my vote in a rout.
Runner Up: Eno Sarris, “Learning the Language of the Clubhouse,” The Hardball Times, March 13, 2014.
Eno’s article was full of wonderful anecdotes and personal reflections on speaking the ballplayer’s language. It’s the runner up almost by default, as the other 3 articles rehashed (or completely missed) ideas I have previously seen explored.

Organizing the World’s Sabermetric Research, Part 4 – Designing a Database

Here’s an idea of how much other stuff has gone on in my life: I talked about building a sabermetric research database 11 months ago. Version 1 has yet to be published. Much like my postings here at this blog, the time to work on this project has been sporadic. That inconsistency made designing the database challenging.

While I did pick up a minor in computer science while an undergrad, I’ve been primarily a user, rather than a designer, of databases ever since. I know the basic principles of database design, but designing one with minimal experience from years ago is not easy. So I looked for examples.

I started with the best example of a database I knew of for recording information on various printed and recorded materials: the digital library catalog. I couldn’t get access to the database schema that a real library uses, but did manage to find an example. Granted, this example of an entity-relationship diagram covers only books, but it was a start. It affirmed 3 different base tables that were pretty obvious to me based on what I wanted when I first talked about the design: author, book, and category. The intermediate link tables between book and author and between book and category were something I didn’t have in mind at first, but incorporating those kinds of link tables into the underlying database is actually a key element of a third normal form relational database. Because a book can have several authors and an author several books, the link tables hold those pairings without repeating any book or author details.
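As a rough illustration of what those link tables buy you, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names, the IDs, and the sample row are my own placeholders and not the schema from the diagram I found.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE author (
        author_id  TEXT PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    );
    CREATE TABLE book (
        book_id TEXT PRIMARY KEY,
        title   TEXT NOT NULL
    );
    CREATE TABLE category (
        category_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    -- Link tables: one row per (book, author) or (book, category) pairing.
    CREATE TABLE book_author (
        book_id   TEXT NOT NULL REFERENCES book(book_id),
        author_id TEXT NOT NULL REFERENCES author(author_id),
        PRIMARY KEY (book_id, author_id)
    );
    CREATE TABLE book_category (
        book_id     TEXT NOT NULL REFERENCES book(book_id),
        category_id INTEGER NOT NULL REFERENCES category(category_id),
        PRIMARY KEY (book_id, category_id)
    );
    """
)

# Placeholder IDs for illustration only.
conn.executemany(
    "INSERT INTO author VALUES (?, ?, ?)",
    [("thorj001", "John", "Thorn"), ("palmp001", "Pete", "Palmer")],
)
conn.execute(
    "INSERT INTO book VALUES (?, ?)", ("b00001", "The Hidden Game of Baseball")
)
conn.executemany(
    "INSERT INTO book_author VALUES (?, ?)",
    [("b00001", "thorj001"), ("b00001", "palmp001")],
)


def authors_of(connection, book_id):
    """Return the last names of a book's authors via the link table."""
    rows = connection.execute(
        """
        SELECT a.last_name
        FROM author AS a
        JOIN book_author AS ba ON ba.author_id = a.author_id
        WHERE ba.book_id = ?
        ORDER BY a.last_name
        """,
        (book_id,),
    ).fetchall()
    return [row[0] for row in rows]


hidden_game_authors = authors_of(conn, "b00001")
print(hidden_game_authors)
```

A two-author book takes two rows in book_author and nothing more, so neither the book row nor the author rows need to be duplicated.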

I also found the schema for a database that served as an inspiration for this idea. As a statistician with a slight academic bent working in industry, one of my resources is the Current Index of Statistics. While their schema wasn’t displayed in a nice entity-relationship diagram, it is available in code form. Of course, there are many things the CIS is interested in that I am not, but the schema follows the core idea of third normal form: each element of the data needs its own table.

All this matters because I want to make sure I record all the information for the DB with a few passes through the material, and knowing which pieces of information to collect is critical to that process. A few of the elements I want to collect are universal to all of the material types I discussed in Part 2, with a few notes on the columns:

  • Author – first and last name, with a key built using the same logic as the Retrosheet player ID
  • Publisher – name and city. The name could be the key, but I think creating a shortened version of the name will be a better key and make queries easier.
  • Citations – The heart of this project. Just a listing of two publication IDs, one being the piece of research cited in the other. At one point early on, I considered including page numbers, but that seems to be more effort than it’s worth at this point and could be added later.
  • Subject – The subject list needs to be uniform across all media types. The subject table will be like the Citations table, with a publication ID and a column identifying the subject. I’m thinking that the subject list will be coded to help conserve disk space as this database grows. Players and teams can be included as subjects, and I’ll use the same codes as Retrosheet.
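Here is a sketch of how two of those universal pieces might look in practice: an author key and a citations lookup. The key builder is a simplified, hypothetical pattern (letters from the last name, a first initial, a sequence number) and does not try to reproduce Retrosheet's actual rules for short names or numbering; the table layout and publication IDs are placeholders as well.

```python
import re
import sqlite3


def author_key(first_name, last_name, sequence=1):
    """Build a hypothetical author key such as 'palmp001'.

    ASSUMPTION: up to 4 letters of the last name (padded with '-'),
    the first initial, and a 3-digit sequence number.
    """
    last = re.sub(r"[^a-z]", "", last_name.lower())[:4].ljust(4, "-")
    first = re.sub(r"[^a-z]", "", first_name.lower())[:1]
    return f"{last}{first}{sequence:03d}"


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Each row says: citing_id cites cited_id.
    CREATE TABLE citation (
        citing_id TEXT NOT NULL,
        cited_id  TEXT NOT NULL,
        PRIMARY KEY (citing_id, cited_id)
    );
    -- Each row tags one publication with one coded subject.
    CREATE TABLE subject (
        publication_id TEXT NOT NULL,
        subject_code   TEXT NOT NULL,
        PRIMARY KEY (publication_id, subject_code)
    );
    """
)

# Placeholder publication IDs, not real citation data.
conn.executemany(
    "INSERT INTO citation VALUES (?, ?)",
    [("b00002", "b00001"), ("a00001", "b00001"), ("a00001", "b00002")],
)


def cited_by(connection, publication_id):
    """List the publications that cite the given publication."""
    rows = connection.execute(
        "SELECT citing_id FROM citation WHERE cited_id = ? ORDER BY citing_id",
        (publication_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(author_key("Pete", "Palmer"))
print(cited_by(conn, "b00001"))
```

Storing citations as bare pairs of IDs keeps the table narrow, and page numbers could be added later as an extra column without touching the existing rows.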

The other tables are specific to different media types:

  • Book – publication ID, title, author IDs, publisher ID, publication year, ISBN. Books only published electronically will be treated the same as printed books. Publication IDs will start with “b” to denote book. ISBN would be the key for this table if it weren’t for the need for a unified key across the other media types that can’t be identified that way.
  • Article – publication ID, title, author IDs, journal ID, publication date, start page, end page, URL. This should work for both journals and magazines. Publication IDs will start with “a” to denote article. I’m also including URL since so much of what’s in print is migrating to or simultaneously published online nowadays.
  • Journal – journal ID, journal name, publisher ID, domain URL. Magazines are included here as well; journal ID is just so that the field name is distinct from other fields in the database.
  • Presentations – publication ID, title, author IDs, speaker IDs, presentation date, conference ID. I separate out the speaker and the author IDs only because not all authors will present and, in rare instances, the presenter is someone who didn’t author the presentation. Publication IDs will start with “p”.
  • Conference – conference ID, conference name. I’m not going to list each annual conference separately, as that can be inferred by the presentation date when these two tables are linked. This will just be to identify different conferences and conventions (e.g. SABR, JSM, NESSIS, SaberSeminar, etc.)
  • Web articles – publication ID, title, author IDs, website ID, publication date, URL. The web is a nebulous place, and the article I read today may be different than what I read tomorrow, but reputable online sites will note the original publication date and any edits, so I’m not worried about that as an issue. Publication IDs will start with “w”.
  • Websites – website ID, website name, domain URL. Pretty straightforward. I’m keeping this separate from the Journal table so that it can use fewer columns.
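
As a rough sketch, the media-specific tables and the prefixed publication IDs could look like the following in Python and SQLite. The prefix letters and the column lists come from the descriptions above; everything else is an assumption made for illustration, including the table names, the six-digit counter in the ID, the author keys in the example, and especially the publication_person link table, which is one possible way to store multiple author and speaker IDs per publication.

```python
import sqlite3

# Prefix letters come from the text; the six-digit counter is an assumption.
PREFIXES = {"book": "b", "article": "a", "presentation": "p", "web": "w"}

def make_publication_id(media_type, number):
    """Return an ID such as 'b000001' for the first book."""
    return f"{PREFIXES[media_type]}{number:06d}"

def media_type_of(publication_id):
    """Recover the media type from the leading letter of a publication ID."""
    for media_type, prefix in PREFIXES.items():
        if publication_id.startswith(prefix):
            return media_type
    raise ValueError(f"unknown prefix in {publication_id!r}")

MEDIA_SCHEMA = """
CREATE TABLE book (
    publication_id TEXT PRIMARY KEY, title TEXT NOT NULL,
    publisher_id TEXT, publication_year INTEGER, isbn TEXT);
CREATE TABLE journal (
    journal_id TEXT PRIMARY KEY, journal_name TEXT NOT NULL,
    publisher_id TEXT, domain_url TEXT);
CREATE TABLE article (
    publication_id TEXT PRIMARY KEY, title TEXT NOT NULL,
    journal_id TEXT REFERENCES journal(journal_id),
    publication_date TEXT, start_page INTEGER, end_page INTEGER, url TEXT);
CREATE TABLE conference (
    conference_id TEXT PRIMARY KEY, conference_name TEXT NOT NULL);
CREATE TABLE presentation (
    publication_id TEXT PRIMARY KEY, title TEXT NOT NULL,
    presentation_date TEXT,
    conference_id TEXT REFERENCES conference(conference_id));
CREATE TABLE website (
    website_id TEXT PRIMARY KEY, website_name TEXT NOT NULL, domain_url TEXT);
CREATE TABLE web_article (
    publication_id TEXT PRIMARY KEY, title TEXT NOT NULL,
    website_id TEXT REFERENCES website(website_id),
    publication_date TEXT, url TEXT);
-- Assumed link table: one row per person per publication and role.
CREATE TABLE publication_person (
    publication_id TEXT NOT NULL, person_id TEXT NOT NULL,
    role TEXT NOT NULL CHECK (role IN ('author', 'speaker')),
    PRIMARY KEY (publication_id, person_id, role));
"""

media_db = sqlite3.connect(":memory:")
media_db.executescript(MEDIA_SCHEMA)

# Register the starting-point book; publisher and ISBN are left empty here.
book_id = make_publication_id("book", 1)
media_db.execute(
    "INSERT INTO book (publication_id, title, publication_year) VALUES (?, ?, ?)",
    (book_id, "The Hidden Game of Baseball", 1984),
)
# Hypothetical author keys, one row per author.
media_db.executemany(
    "INSERT INTO publication_person VALUES (?, ?, 'author')",
    [(book_id, "thorj001"), (book_id, "palmp001")],
)

book_authors = [
    row[0]
    for row in media_db.execute(
        "SELECT person_id FROM publication_person "
        "WHERE publication_id = ? AND role = 'author' ORDER BY person_id",
        (book_id,),
    )
]
print(book_id, media_type_of(book_id), book_authors)
```

Because every publication ID carries its media type in the first letter, the shared Citations and Subject tables can point at any of these tables without a separate type column.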

If you’ve stuck with me this far, I’m going to add one last note about how I’m building this database. I’m breaking up my exploration into 3 eras to help with identifying and finding sabermetric research. The first era ends with the publication of The Hidden Game of Baseball. That’s my starting point for this project and that book marks a pretty significant milestone in sabermetric history. I also feel that going backwards in time from 1984 will be of more value to the sabermetric community, and it will allow me to focus on printed material initially. The second era is the period between 1984 and 1996, which is mostly printed material. 1996 is the year Baseball Prospectus was founded, and it serves as a proxy for the start of the explosion of sabermetric research courtesy of the Internet. The Internet era (1996-present day) will be handled last.

Version 1 will hopefully be ready in the next few months, and if it isn’t published by the start of SABR 45 at the end of June, this project will have been abandoned.

SABR Day 2015

It might have been a week later than the official date, but avoiding fan fest date conflicts proved to be a wise decision for the Ken Keltner Badger State and Emil Rothe chapters of the Society for American Baseball Research. 48 baseball fans made their way to and from Kenosha’s “world famous” Brat Stop for what has become an annual Hot Stove tradition.

This year’s meeting opened in the sadness that could only be brought on by the death of a beloved ballplayer. Not only was Ernie Banks’ funeral playing on the TVs as baseball fans arrived, but the meeting also took place on what would have been Mr. Cub’s 84th birthday. Rich Schabowski (dressed in football attire of teams not from the state of Wisconsin or Illinois) opened the meeting and led all in attendance in a moment of silence for the Cubs legend, which ended with a shout of “Let’s Play Two!” That call ended up symbolizing the day: each chapter organized half of the meeting, with a lunch break in between. It really was like playing a doubleheader.

The portion of the meeting organized by the Chicago chapter took the top half of the program. Leading off was guest speaker Ozzie Guillen, Jr. Currently employed as a financial adviser, he worked in the clubhouses while his father coached in Atlanta and Florida and also while Ozzie, Sr. managed the Pale Hose to their first championship in 88 years. Holding high expectations for both of Chicago’s clubs in 2015, he spoke his mind on two issues in baseball today. The first is that the pendulum has swung too far in favor of analytics and sabermetrics within some organizations. The second is that today’s players make too much money, exacerbating the disconnect between the players and the fans and mirroring the current stratification of American society. He then took many questions from the audience, discussing everything from his favorite player as a clubhouse manager (“The best tipper”) and observations on the aforementioned World Series champion 2005 White Sox to pitch counts and broadcasting the team his father managed in Florida.

Batting second was a man whom his boss has called the “Ben Zobrist of Baseball Prospectus”, prospect writer Mauricio Rubio, Jr. With a deep love of baseball inherited from his family and dreams of being a pro scout, Mauricio started working for the fantasy side of BP before his constant pestering finally landed him a chance to write on prospects. With a focus on the Midwest League, he commented on how his writing tends to focus on melding stats with scouting, in line with BP’s brand as a leading sabermetric site. He also remarked about how mechanical analysis has become big with saber-scouts, but cautioned that mechanical analysis might be overemphasized, concurring with some of the commentary on the importance of a prospect’s character from Ozzie Guillen, Jr. The Q+A revealed his typical day at the park starts with a focus on 2 pitchers (typically the starters) and 2 hitters, moving around from the bullpen to behind home plate to a side view for the hitters and a rear view to better analyze arm action.

The final speaker before lunch was Merle Branner, who shared a paper from a leadership course she took as part of her studies in Library and Information Science. The paper examines the leadership dynamic between Branch Rickey and Jackie Robinson using the Servant Leadership model proposed by Robert Greenleaf. She examines all 10 aspects of the model in relation to Rickey’s signing of Robinson and integration of the major leagues. Once she was done, it was undoubtedly time for lunch.

While lunch was delicious (the cajun bratwurst is highly recommended if you’re ever able to stop at the Brat Stop), there was more baseball to be discussed, and the Badger State portion of the meeting commenced. Jim Nitz told the story of the Milwaukee Chicks, the 1944 champions of the All-American Girls Professional Baseball League made famous by the film A League of Their Own. Their only year in Milwaukee was a turbulent one despite the on-field success. Media coverage for the team was poor in Milwaukee, failing to replicate the success of teams like the Rockford Peaches and leading to multiple nicknames used in the papers (primarily Schnitts and Brewerettes). The Chicks shared Milwaukee’s Borchert Field with the Brewers (the minor league club), which left them playing in a cavernous stadium that was sparsely filled. Nitz noted their success was largely due to some fantastic ballplayers like Connie Wisniewski and Hall of Famer Max Carey’s well-regarded management of the team, and also shared anecdotes on each of the players. His Q+A was enhanced by some women who play in an AAGPBL re-enactment league.

Afterwards, it was time to close the silent auction (a Chicago chapter fundraiser) and draw the winner of the 50/50 raffle (a Badger State chapter fundraiser). After attendees claimed their items from the silent auction, a presentation on Ginger Beaumont was up next. Unfortunately, it was at this point that I had to leave, so I can’t comment on the rest of the meeting.

Thank goodness Spring Training was only 2 weeks away.

For those who couldn’t make it, Emil Rothe chapter secretary David Malamut took photos and even video of the day’s events. The photos can be seen on Twitter @sabrchicago, and links to the videos can be found here.