3 members of the baseball media took a couple hours out of their Labor Day weekend to talk baseball and media with 15 members of the Emil Rothe chapter.
UPDATED: Now with videos!
Batting lead-off was Sun-Times beat writer Gordon Wittenmeyer. I can’t comment on any opening remarks he made, because I was a little late for an unusually prompt meeting start time. He covers the Cubs, so much of the Q+A I did hear centered on the struggles of the club this year in the wake of last year’s championship. A few highlights:
- Having covered 4 different clubs (Seattle, LAnaheim, and Minnesota previously), he sees the language barrier with Spanish-speakers from Latin America as a big issue in baseball that should have been solved 20 years ago. Too many organizations, even a few years ago, had the mindset that “baseball is the only language that matters”. Yet the experiences of Dennis, Ramon, and Pedro Martinez are a clear example that it isn’t. Clubs are coming around to realize this, but still make mistakes. The Cubs exemplified this with the fiasco of a press conference that occurred after they acquired Aroldis Chapman.
- The closest the Cubs came to falling out of contention seemed to be the last week of June. In case you’re not inundated with Cubs talk as often as I am, that was the week the team took a 2nd visit to the White House and Montero was made an example of and released for speaking his mind. Good thing the NL Central never ran away from their talent level.
- David Ross is missed somewhat in the clubhouse, but his absence is minor compared to the absence of the massively unifying goal that breaking the drought was.
- He lauded the organization on what it has done with the ballpark improvements, but was less keen on how the team has driven down property values to take over the rooftops and neighborhood around the ballpark.
Batting second, and for me the highlight of the afternoon, was Peabody award winner Julie DiCaro. I’m fairly certain that if you know her for one thing, it’s this video that won her said award. A former lawyer, she meandered her way into sports media through the explosion of the blogosphere. The now-radio host ignited a lively discussion on the usefulness of stats, discussing both her use as a member of the media and how the public consumes the information explosion that sabermetrics and now Statcast are producing. She also talked about the efforts she’s made to help women in or interested in sports media to network with each other, and opined on the possibility there will be a female GM in the next 15 years.
Wait, that’s it? My notes are way shorter for Julie than they were for Gordon, yet I said she was the highlight of my afternoon. Why? She talked less. Why did she talk less? With Julie at the podium, that lively discussion was very much an open discussion, with multiple people (myself included at a couple points) chiming in on particular stats and their usefulness. With Gordon, most everyone held to the Q + A protocol: someone asks, the speaker answers. Thinking back, I wonder what she thought of that difference in the dynamic and whether it was as she intended. Julie left before the meeting ended and before I could ask her. You can watch the videos once they’re posted and opine on your own.
Anyway, back to the meeting, where one more speaker took the audience on a trip through the minor leagues. Emily Waldon writes about Tigers prospects for 20/80 Baseball and The Athletic Detroit. Her interest in baseball started with having 4 brothers and took off when the West Michigan Whitecaps moved to town in 1994. She started covering the Tigers minor league affiliates as part of the Bless You Boys blog, moving on to her current posts subsequently. Her visit became much more timely with the waiver deadline deal of Justin Verlander to Houston, allowing her to talk about the new acquisitions, the other prospects in the system, and Avila’s philosophy for the rebuild and player development. She also noted that the parents of minor leaguers greatly appreciate her coverage, whether they’re local, across the country, or from one of the Latin American hotbeds of baseball.
Ed. note – I’m back, and hopefully for a good long while.
This year I found myself with 5 major conventions I was interested in going on within a 2 week span, covering my varied interests in baseball, stats, gaming, and religion. Alas, I could only afford to go to one, so I took advantage of the fact that the Joint Statistical Meetings were being held in Chicago, thus eliminating my need to pay for airfare.
JSM is the largest single gathering of statisticians in the world. The American Statistical Association, the prime organizer among the half dozen statistical societies that co-sponsor the event, always books the host city’s signature convention center in order to hold 6,000 attendees over the course of 6 days. This was only my second time attending, having previously attended the 2008 conference in Denver.
The conference has many different subjects being analyzed, from finance and risk to data modeling and visualization. But, with this being a blog that focuses on baseball, I’ll share 4 things I learned (and 1 thing I already knew) from the sports sessions.
1) The gap between academics and practitioners in statistics is narrowing
The ASA has a well-deserved reputation as an organization that favors academic pursuits in statistics. Its membership is predominantly employed in academia, and membership growth has not kept pace with the growth of the profession. Yet, it seems the explosion of data has led to more cross-over. One paper that was presented listed a FiveThirtyEight writer among its authors.
2) There are more ways to get into baseball than by studying it directly.
One of the presenters I was able to interact with has spent a lot of time looking at SportVu, the NBA’s player tracking data. This person is now slated to start working for a baseball team in the fall because of that work. You can probably guess that the position will focus on understanding Statcast data.
3) Sports teams aren’t looking for subject matter experts with stats knowledge; they want stats experts with subject matter knowledge.
At a panel discussion about stats in sports, both panelists who are currently employed by major sports teams noted that front offices are loaded with SMEs. They want people with stats backgrounds who can analyze the data well. It is highly unlikely that anyone will get into a front office today in the same manner as Bill James.
4) It pays to think about analogs from other fields.
Dan Cervone presented a model trying to value court space in the NBA. He built his model analogous to how real estate valuations are made. It’s the type of thinking that leads me to pay attention to JSM, and it also is, in my opinion, a requirement for getting the most out of the conference.
5) Ideas at JSM are starting points, not end points.
The fact that JSM does tend to emphasize statistical methods over results means many papers aren’t necessarily providing new insights into well studied issues, but new ways of analyzing the questions at the heart of the issue. Take this presentation on predicting outcomes of plate appearances. It uses a type of regression modeling designed to handle structured outcomes like in baseball. It may not provide any new insights into how baseball is played, but an idea like this could end up in the next great forecasting system.
One of the things I missed when I had to skip out of SABR 45 Saturday was the committee meeting for SABR’s largest research committee, Statistical Analysis. Unlike many of the other committees, the Stat Analysis committee didn’t have a group project to work on, in part due to the individual nature of most members’ research. A couple ideas were bandied about the meeting during SABR 44, but it took until SABR 45 to get one of those ideas off the ground.
A few weeks back, Phil Birnbaum, the chair of the committee and editor of the By the Numbers newsletter, announced that group project. The idea is to create a crowd-sourced list of key resources that help newcomers to sabermetrics learn what has been done and give them a foundation for additional contributions.
There are plenty of books and articles which I could cite, so I’m going to start with the broad resources that cover multiple topics. That means it does skew towards books. They are listed in the order they came out of my head.
Before I get into my long list, I want to invite you, dear reader, to contribute your recommendations to this project. If you do so in the comments, I’ll be sure to pass them on.
- The Numbers Game, by Alan Schwarz. This book came up recently when Graham Womack of Baseball Past & Present and I discussed its importance, along with a few other titles that will make their way onto this list, and which one we’d recommend first. We both agreed that this title is where we’d tell others to start. It is a fantastic history of baseball’s numbers, and understanding how a particular stat like batting average or OBP came to be is key to understanding any analysis with those measures.
- The Hidden Game of Baseball, by John Thorn and Pete Palmer. It’s over 30 years old, and it might be the most important book in sabermetric history. There’s a reason I started my sabermetric research database project with this book: it was The Numbers Game before Schwarz wrote his book with its concise history of baseball statistics AND it introduced the linear weights model to the world, which is much more of the mathematical foundation of modern sabermetrics than anything put out by the most famous name in the field.
- The Bill James Abstracts, both the annuals printed from 1977-1988 and the Historical Abstract (first published in 1986, revised and updated in 2001). For the many who grew up before Al Gore’s invention came to the masses, these books were how they were introduced to sabermetrics. Bill isn’t a statistician in the academic sense, but his understanding of baseball endows his analyses with tremendous insight.
- Curve Ball, by Jim Albert and Jay Bennett. I have a rare relationship with this book. I read it before I ever read anything by Bill James. It steered me from being a pure mathematics major in college to a statistics major, which is one of the 5 best decisions I have made in my life. So yeah, I hold this title in high esteem for many personal reasons. That being said, it might be the best book for helping aspiring saberists to start understanding mathematical statistics, which is essential to advancing the field.
- The Book, by Tango, Lichtman, and Dolphin. For many saberists, this is the modern treatise on the subject. Grounded in an understanding of Palmer’s Linear Weights system, they introduce wOBA and use it to explore every facet of the game.
- For online reference guides, the FanGraphs Sabermetric Library is my preferred site, as I consider it to be the most complete. Neil Weinberg is also authoring weekly posts to explain the ins and outs of various metrics, helping keep the reference guide current with new research.
- The Best of Baseball Prospectus: 1996-2011 is a 2 volume set that is a compilation of the most important articles from the first 15 years of that site’s history. This is essentially my proxy for the excellent writing on that website, including Voros McCracken’s article on DIPS Theory and Keith Woolner’s “Baseball’s Hilbert Problems”.
- Baseball Hacks, by Joseph Adler. The ability to analyze data is great, but it is useless if you can’t get data to analyze. While the book is somewhat dated, it’s a great introduction to many of the coding skills required to do sabermetrics efficiently in the computing era, and one I still find worthwhile to have on my shelf.
- SABR101x, the massive open online course at edX administered by Boston University and designed by Andy Andres et al. If you prefer a class-based method for learning sabermetrics, this is as good as you’ll find. There are tracks on the history of sabermetrics, statistics, SQL/R skills needed, and a build up to understanding some key metrics used by saberists.
One thing I want to keep separate from this list is SABR’s own Guide to Sabermetric Research, which was put together by the aforementioned Phil Birnbaum. His involvement spearheading this SABR 101 project is why I leave it out for now. I have a sense that it will be that guide that is updated as a result of this group work.
Almost 2 years ago, I was sitting on my computer scrolling through Twitter when this appeared:
— sabr (@sabr) August 1, 2013
Yeah, I got a little bit excited when I saw that.
It was never a question of whether or not I would be attending the SABR Convention this year. Having a convention in your backyard has some benefits, the biggest of which is cost. Aside from the convention registration, I had my choice as far as how to get to and from the Palmer House and whether I wanted to sleep in a hotel bed or my own. With an infant crawling around my house, I chose to not book a hotel room (at approximately $200/night) and took commuter rail in and out of Chicago each day.
The downsides of my lodging and travel decision were twofold: 1) I didn’t partake in nearly as many hallway and bar conversations as I did last year, depriving me of what many consider the most fun part of the convention experience; and 2) it made it easier for other things to pull me away from the convention activities. Having to catch a train at 7 am to just make it to a day of events running from 8 am to 10 pm meant having to reconcile sleep with the train schedule. Then, family events cropped up on the weekend, making it unfeasible for me to go downtown Saturday or Sunday. While missing Sunday only cost me the Historic Ballpark Site tour, not being able to attend Saturday cost me half of the presentations and panels and most of the committee meetings I was interested in.
However, what I did attend and help with as a volunteer and member of the host chapter was quite fantastic. Wednesday is typically a travel and get acquainted with the city day. With minimal travel, I helped as a volunteer with registration and Cubs ticket distribution. As with past conventions, there was a tour of the host city. I skipped this year’s walking tour due to the aforementioned volunteer work, but Jacob Pomrenke put together a fantastic document highlighting the sites with baseball history attached to them as the tour traversed downtown Chicago. (If a KML file gets created for it, I’ll link to it here). After registration closed down for the night, I sacrificed the welcome reception in order to catch the train and be home.
Thursday was what I presume is a rare day in recent SABR Convention history. At no time did any attendees have to pick between different meetings or presentations, as it was a single program of events for the day. Cubs broadcasters Len Kasper, Jim Deshaies, and Ron Coomer graced the broadcasters panel in the morning, chiming in when moderator Curt Smith would let them do so. Many of Smith’s questions centered on the Cubs, and all three provided the level of insight that I’ve become accustomed to when I do tune in for Cubs broadcasts. This was followed by the annual business meeting, which showed the continued positive growth of the society but, unlike last year, revealed no final verdict on next year’s convention. It seems the society learned its lessons the hard way: Houston had a hotel location near a mall instead of the ballpark due to the latter option’s lack of availability after the 2014 MLB schedule was released; Chicago corrected for that by getting the ideal hotel location early, but ended up victimized by selecting the one weekend BOTH Chicago clubs were on the road. There is a tentative plan for SABR 46’s host next year, but it would be unwise to get excited for seeing a bobble head museum and the most colorful home run sculpture in MLB quite yet (never mind my own personal ability to attend next year). Thankfully, despite the lack of weekend games, the Cubs were finishing up a series with the Dodgers, so Thursday afternoon’s getaway day contest ended up being the convention game. It was entertaining simply because of Joe Maddon’s tinkering with the line-up every 2 innings or thereabouts.
Thursday night ended up being what I think was the biggest highlight of the convention (and perhaps a way of the national office apologizing for the schedule debacle): a concert in the Palmer House’s Grand Ballroom with the Baseball Project. From left to right, Scott McCaughey, Linda Pitmon, Mike Mills, and Steve Wynn rocked the house with their songs about Harvey Haddix, Ted Williams, Larry Yount, Big Ed Delahanty, and many others. Wisely, they opened with “Box Scores” off their album 3rd, which to me is the quintessential SABR song. It was pretty awesome. If you like baseball and rock music (especially R.E.M.), you’ll love this band.
After forgetting my phone at home Friday morning, I made it in time for the second group of presentations, dropping in on Tara Kreiger’s presentation about Andy Coakley’s labor struggles with organized baseball. It was a fascinating story that I was unfamiliar with, but it exemplified the blackballing many early players went through when they complained about their contract. This was followed by 2 panels: one titled Pitching Prodigies that featured Steve Trout and Joe Berton a.k.a. “Sidd Finch”, and a presentation by the 4 Letters on an upcoming project. The former was my favorite panel I attended, as Berton told the story of how he got involved in the Sidd Finch hoax perpetrated by George Plimpton and Sports Illustrated. Trout seemed more subdued about his experiences, which I guess is to be expected from an 8th overall pick who did not have the career he expected to have. The latter was a “stealth announcement” about a project entitled “1927: The Diary of Myles Thomas”, which looks to chronicle the 1927 Yankees via “real-time historical fiction” storytelling. I kind of like the concept, but will probably wait and see what ends up being produced by Steve Wulf and Douglas Alden Warshaw. The presentation I saw after the panels was entitled “Aging Fan Base: Using Twitter to Develop a New Generation of Baseball Fans” and given by Allison Levin. Unfortunately, she didn’t get to many suggestions in her slides, as most of the time was spent looking at Twitter usage during the 2014 World Series. But she has a few avenues for further exploration that will hopefully yield some results, though I have a sense that MLB might be ahead of her on doing this.
The morning block was followed by a tribute-filled awards luncheon. I skipped this last year, since my meal times were spent with my wife who graciously traveled to Houston with me. I’m glad I went this year, because I got a better sense of what this organization means to so many people. Tom Hufford couldn’t avoid breaking down as he eulogized two of his fellow Cooperstown 16 that founded SABR, Ray Nemec and Joe Semenick. Phil Rogers had it a bit easier in terms of emotions, but still had to encapsulate what Ernie Banks and Minnie Minoso meant to their adopted hometown. He did so, and did it well. After the banquet I took time to peruse the vendor room, which is a dangerous endeavor given the number of baseball books that are available for sale. My wallet came away only somewhat dented. The only committee meeting I attended was for the Business of Baseball, which gave an update on the Winter Meetings project (all years are being researched by someone!), the Team Ownership bios (4 of 30 done or in progress), and a reminder from chair Michael Haupert about the importance of examining the source of data in research, using examples from the pre-1983 salary database to show how what’s printed isn’t always accurate.
I then attended 5 more presentations between the committee meeting and heading home. In order:
- David Kaiser questioned “What Makes a Dynasty?” He counted teams who played postseason baseball in 3 of 6 seasons as a dynasty, splitting the analysis into 3 eras based on the postseason structure in place. He noted which ones were dominated by pitching and which ones weren’t. Most of the expected teams showed up where you would expect. The only bone I pick is that, based on the average winning percentage by era for the dynastic teams in the study, he said mediocrity was more prevalent today than it used to be. I think that’s just a function of his definition of dynasty.
- David W. Smith, the Retrosheet president, updated his look at run scoring in the 1st inning, asserting that travel doesn’t seem to have an effect but that the number of runs the visiting team scores in the top of the 1st is highly correlated with the number of runs they allow in the bottom of the 1st. You can find his paper on Retrosheet’s site.
- Zach Moser gave an oral presentation on how Cap Anson’s views on colored players in professional baseball were portrayed over time. While revered in his time, Anson’s racism became a hot topic while he was among the early players considered for induction into Cooperstown’s most noted museum. Anson’s racism was revisited as many of his team records for the Cubs were eclipsed by the aforementioned Ernie Banks, and Moser suggests that most modern apologists for Anson are deficient in their criticism.
- John Burbridge examined “The Increasing Importance of Quality Starts” by mostly just doing an x-ray on the definition of a quality start. He ultimately came to the conclusion that 6 IP with 3 or fewer runs allowed is reasonable, and claims that it is increasingly relevant as bullpens are utilized more and more.
- Finally, Bruce Allardice talked about how pro baseball became a big part of Chicago in the mid 1800s. Baseball grew in popularity in Chicago, paralleling the game’s growth in popularity nationwide. By 1870, the city’s elite coveted the status of being the nation’s pork capital, vying against a river town called Cincinnati. Because of this rivalry with the 2015 All Star Game host city, Chicago’s wealthy pooled funds to found the first professional club in the city. The White Stockings did manage to beat Cincinnati twice late in that season, and would go on to claim the championship based on a disputed victory over the New York Mutuals, who also claimed the title. Unfortunately, baseball took a 2 year hiatus after a cow tipped a lantern and ignited a magnificent blaze that required years of rebuilding.
I’d love to say more about SABR 45, but (1) I’m already at 1,750 words if you’ve read to this point and (2) the downside of a local convention is that you can be pulled to do other things since you aren’t travelling. That’s what happened to me on the weekend, as a family event popped up and hindered my ability to get in and out of the city. I don’t know if I’ll get to go to another convention for a while at this point, and next year looks doubtful regardless of location. When I do go again, I’m going to make sure of 2 things: that I’m staying at the hotel, and that I go hang at the bars and talk baseball over beers. That’s the convention experience that I missed, and why those who go to one convention try to make it an annual trip.
The next level of public baseball data has arrived. MLB Advanced Media’s Statcast made a hyped television debut, although it had made cameos in online replay videos last year. With the system installed in all 30 ballparks to track all movement on the field, hopes are high for discovering many things about the game via data that previously could only be imprecisely discerned by watching a lot of baseball.
However, while MLBAM has stated that Statcast data will be made public, it is still unclear what types of data and how much of it will be available for public use. Bits and pieces of the data have slowly appeared as the 2015 season started. Among the first pieces have been the velocity and angle of the ball off the bat, which savvy scrapers of the Gameday files, such as Daren Wilman of Baseball Savant fame, have captured and published. But whether the public will have access to the raw data remains to be seen.
It seems unlikely to me that there will be public access to the raw Statcast data anytime soon. The first challenge is the sheer size of the data set, which is already measured in petabytes. This is unlike the pitchF/X data, which can be scraped and saved on a home PC. Raw Statcast data is best stored on a cloud server. While MLBAM is certainly using “the cloud” as the method for allowing the 30 teams to access the data, it would be a massive security risk to open that server up to the public domain. Setting up a public server would be an additional cost, and it’s hard to argue that there would be any significant return on that investment for MLBAM. However, Statcast is already sponsored by Amazon Web Services, so the possibility is there for the raw data to be made public via the AWS platform. That possibility seems very remote at this time.
A more likely scenario (at least in my mind) for the release of Statcast data is something like what the NBA did with its SportVU data. SportVU, the player tracking system developed by a subsidiary company of STATS, Inc., is akin to Statcast in that it tracks player and ball movement. The Stats section of NBA.com (linked above) shows various measures and animations gleaned from the SportVU data, but does not provide fans access to the raw data. This is the path I expect MLBAM to take. The batted ball data that has already shown up in Gameday is like this, and many of the other metrics that have been teased via broadcast, such as route efficiency and perceived velocity, could also be distributed in this manner.
Releasing the data in a summarized or snapshot form isn’t as risky to the teams, who were not all that happy when pitchF/X data made its way into the open world. Allowing public researchers to produce insights from that data that were available to all teams took away an opportunity to gain a competitive advantage. This is why the other Sportvision products, like hitF/X that also provided batted ball information and commandF/X that tracked where the catcher’s glove was positioned, have been available to teams but not the public.
Regardless of what form the data takes when it is released, Statcast data should enable saberists to use more granular data to show what it takes to succeed in the game of baseball. Some of these data-driven discoveries may merely affirm what scouts and those in the game have been taught and believed for years and decades, but I’m sure some will not. Like many others, I can’t wait to get my hands on it.
Voting closed on President’s Day for this year’s SABR Analytics Conference Research Awards, and like last year, I have taken a great interest in seeing which articles were nominated. Although the voting is closed, I once again am sharing which articles I voted for and runners up in each category.
Contemporary Baseball Analysis: Harry Pavlidis and Dan Brooks, “Framing and Blocking Pitches: A Regressed, Probabilistic Model,” Baseball Prospectus, March 3, 2014.
This category was stacked. I could have reasonably voted for 4 of the 5 articles. But Pavlidis and Brooks managed to stand out above the rest by a hair. Like Max Marchi’s winning article from last year, this is another landmark addition to our statistical understanding of catcher framing, possibly the hottest topic in sabermetric research until the StatCast data sees the public light of day. While Jonathan Judge and this duo have already updated and improved on their work, its import to quantifying catcher framing was without equal in 2014.
Runner up: Jon Roegele, “The Effects of Pitch Sequencing,” The Hardball Times, November 24, 2014.
Pitch sequencing is my current favorite topic in sabermetric research. It’s not quite as popular as catcher framing because sequencing is largely dependent on the pitcher’s arsenal and the techniques needed to study sequencing tend to go beyond basic data mining. Jon’s work is the best on the topic that doesn’t require an understanding of Markov chains and/or the mathematical mechanics of game theory.
The other 2 articles I almost voted for were:
- Russell Carleton, “N=1,” Baseball Prospectus 2014: The Essential Guide to the 2014 Season, January 2014. Pizza asks what we really know about an individual player, and explores swing rates for individual players using regression. (Yes, I’m one of those who instantly started mouthing GLM, HLM, and MLM at the words “gory math” and “regression” in the article.)
- Jeff Sullivan, “Alex Gordon Barely Had a Chance,” FanGraphs, October 30, 2014. The best breakdown of the most scrutinized play of this year’s World Series.
Historical Analysis/Commentary: Steve Treder, “The Strikeout Ascendant (and What Should Be Done About It),” The Hardball Times Baseball Annual 2014.
A tough category to pick, but Steve’s breakdown of strikeout eras in baseball history was an exploration reminiscent of what a Bill James essay would do in his 1980s Abstracts. He explores strikeout rates through history, citing that the increase is part of a natural rise of the power game in baseball, both at the plate and on the mound. Nothing, not even a proposal to lop off the bottom three inches of the strike zone, will change the minds of batters sacrificing discipline for power or pitchers trying to keep that power in check by throwing hard at the expense of in-game longevity.
Runner Up: Bryan Soderholm-Difatte, “The 1914 Stallings Platoon: Assessing Execution, Impact, and Strategic Philosophy,” SABR Baseball Research Journal, Fall 2014.
While platoons aren’t anything new, I always find it interesting when someone looks at a season in the distant past using modern tools. Bryan’s analysis of the 1914 Stallings platoon was well thought out and about as comprehensive as such an analysis is capable of being.
Contemporary Baseball Commentary: Lewie Pollis, “If You Build It: Rethinking the Market for Major League Baseball Front Office Personnel,” Brown University, senior honors thesis, Spring 2014.
Most senior theses don’t make it beyond the adviser’s desk. If you happen to read one, it’s probably because you know the person who wrote it or you were in the person’s graduating class and major while they wrote it. Lewie’s thesis is clearly more public than that. It’s also an extremely articulate breakdown as to why wages for lower-level front office personnel should be higher. It won my vote in a rout.
Runner Up: Eno Sarris, “Learning the Language of the Clubhouse,” The Hardball Times, March 13, 2014.
Eno’s article was full of wonderful anecdotes and personal reflections on speaking the ballplayer’s language. It’s the runner up almost by default, as the other 3 articles rehashed (or completely missed) ideas I have previously seen explored.
Here’s an idea of how much other stuff has gone on in my life: I talked about building a sabermetric research database 11 months ago. Version 1 has yet to be published. Much like my postings here at this blog, the time to work on this project has been sporadic. That inconsistency made designing the database challenging.
While I did pick up a minor in computer science while an undergrad, I’ve been primarily a user, rather than a designer, of databases ever since. I know the basic principles of database design, but designing one with minimal experience from years ago is not easy. So I looked for examples.
I started with the best example of a database I knew of for recording information on various printed and recorded materials: the digital library catalog. I couldn’t get access to the database schema that a real library uses, but did manage to find an example. Granted, this example of an entity-relationship diagram covers only books, but it was a start. It affirmed 3 different base tables that were pretty obvious to me based on what I wanted when I first talked about the design: author, book, and category. The intermediate link tables between book and author and book and category were something I didn’t have in mind at first, but incorporating those kinds of link tables for the underlying database is actually a key element of a third normal form relational database. The link tables will help with database organization.
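To make the link-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The three base tables (author, book, category) and the two link tables come from the description above; every column name, the integer IDs, and the helper functions are assumptions made for illustration, not the actual design.

```python
import sqlite3

# Hypothetical column names; the post only names the three base tables
# (author, book, category) and the two link tables between them.
SCHEMA = """
CREATE TABLE author   (author_id   INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE book     (book_id     INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE category (category_id INTEGER PRIMARY KEY, label TEXT NOT NULL);

-- Link tables: one row per (book, author) or (book, category) pairing,
-- so a book can have many authors and an author can have many books.
CREATE TABLE book_author (
    book_id   INTEGER NOT NULL REFERENCES book(book_id),
    author_id INTEGER NOT NULL REFERENCES author(author_id),
    PRIMARY KEY (book_id, author_id)
);
CREATE TABLE book_category (
    book_id     INTEGER NOT NULL REFERENCES book(book_id),
    category_id INTEGER NOT NULL REFERENCES category(category_id),
    PRIMARY KEY (book_id, category_id)
);
"""


def build_demo():
    """Create an in-memory database holding one two-author book."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO author VALUES (?, ?)",
                     [(1, "John Thorn"), (2, "Pete Palmer")])
    conn.execute("INSERT INTO book VALUES (1, 'The Hidden Game of Baseball')")
    conn.executemany("INSERT INTO book_author VALUES (?, ?)",
                     [(1, 1), (1, 2)])
    return conn


def authors_of(conn, title):
    """Return the author names for a title by joining through the link table."""
    rows = conn.execute(
        """SELECT a.name
           FROM author a
           JOIN book_author ba ON ba.author_id = a.author_id
           JOIN book b ON b.book_id = ba.book_id
           WHERE b.title = ?
           ORDER BY a.author_id""",
        (title,),
    )
    return [row[0] for row in rows]
```

Calling `authors_of(build_demo(), "The Hidden Game of Baseball")` returns both authors without either name being stored in the book table, which is the organizational benefit the link tables provide.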
I also found the schema for a database that served as an inspiration for this idea. As a statistician with a slight academic bent working in industry, one of my resources is the Current Index to Statistics. While their schema wasn’t displayed in a nice entity-relationship diagram, it is available in code form. Of course, there are many things the CIS is interested in that I am not, but the schema follows the core idea of third normal form: each kind of entity gets its own table, and every other column in that table depends only on the table’s key.
All this matters because I want to make sure I record all the information for the DB in just a few passes through the material, and knowing which pieces of information to collect is critical to that process. A few of the elements I want to collect are universal to all of the material types I discussed in Part 2. Here they are, with a few notes on the columns:
- Author – first and last name, with a key built using the same logic as the Retrosheet player ID
- Publisher – name and city. The name could be the key, but I think creating a shortened version of the name will be a better key and make queries easier.
- Citations – The heart of this project. Just a listing of two publication IDs, one being the piece of research cited in the other. At one point early on, I considered including page numbers, but that seems to be more effort than it’s worth at this point and could be added later.
- Subject – The subject list needs to be uniform across all media types. The subject table will be like the Citations table, with a publication ID and a column identifying the subject. I’m thinking that the subject list will be coded to help conserve disk space as this database grows. Players and teams can be included as subjects, and I’ll use the same codes as Retrosheet.
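The four shared tables above could be sketched roughly as follows, again assuming SQLite through Python's sqlite3 module. Every table and column name is hypothetical, and so are the placeholder publication IDs. The `make_author_id` helper assumes the Retrosheet-style key is the first four letters of the last name, the first initial, and a three-digit sequence number; treat that format as an assumption rather than a statement of Retrosheet's exact rules.

```python
import sqlite3

def make_author_id(first, last, seq=1):
    """Assumed Retrosheet-style key: 4 letters of the last name
    (padded with '-'), first initial, 3-digit sequence number."""
    letters = "".join(ch for ch in last.lower() if ch.isalpha())
    return f"{letters[:4].ljust(4, '-')}{first[0].lower()}{seq:03d}"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (
    author_id  TEXT PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL
);
CREATE TABLE publisher (
    publisher_id TEXT PRIMARY KEY,   -- shortened version of the name
    name         TEXT NOT NULL,
    city         TEXT
);
-- One row per citation: citing_id cites cited_id.
CREATE TABLE citation (
    citing_id TEXT NOT NULL,
    cited_id  TEXT NOT NULL,
    PRIMARY KEY (citing_id, cited_id)
);
-- Coded subject list shared by every media type.
CREATE TABLE subject_code (
    subject_code TEXT PRIMARY KEY,
    description  TEXT NOT NULL
);
CREATE TABLE subject (
    publication_id TEXT NOT NULL,
    subject_code   TEXT NOT NULL REFERENCES subject_code(subject_code),
    PRIMARY KEY (publication_id, subject_code)
);
""")

for first, last in [("John", "Thorn"), ("Pete", "Palmer")]:
    conn.execute("INSERT INTO author VALUES (?, ?, ?)",
                 (make_author_id(first, last), first, last))

# Placeholder publication IDs, not real records.
conn.executemany("INSERT INTO citation VALUES (?, ?)",
                 [("a0002", "b0001"), ("w0003", "b0001"), ("w0003", "a0002")])

# Which publications cite b0001?
cited_by = [row[0] for row in conn.execute(
    "SELECT citing_id FROM citation WHERE cited_id = ? ORDER BY citing_id",
    ("b0001",))]
print(cited_by)
```

Because a citation is just a pair of publication IDs, page numbers could be added later as extra columns on the citation table without touching anything else.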
The other tables are specific to different media types:
- Book – publication ID, title, author IDs, publisher ID, publication year, ISBN. Books only published electronically will be treated the same as printed books. Publication IDs will start with “b” to denote book. ISBN would be the key for this table if it weren’t for the need for a unified key across the other media types that can’t be identified that way.
- Article – publication ID, title, author IDs, journal ID, publication date, start page, end page, URL. This should work for both journals and magazines. Publication IDs will start with “a” to denote article. I’m also including URL since so much of what’s in print is migrating to or simultaneously published online nowadays.
- Journal – journal ID, journal name, publisher ID, domain URL. Magazines are included here as well; journal ID is just so that the field name is distinct from other fields in the database.
- Presentations – publication ID, title, author IDs, speaker IDs, presentation date, conference ID. I separate out the speaker and the author IDs only because not all authors will present and, in rare instances, someone presents who didn’t author the presentation. Publication IDs will start with “p”.
- Conference – conference ID, conference name. I’m not going to list each annual conference separately, as that can be inferred by the presentation date when these two tables are linked. This will just be to identify different conferences and conventions (e.g. SABR, JSM, NESSIS, SaberSeminar, etc.)
- Web articles – publication ID, title, author IDs, website ID, publication date, URL. The web is a nebulous place, and the article I read today may be different than what I read tomorrow, but reputable online sites will note original publication date and edits if they occur, so I’m not worried about that as an issue. Publication IDs will start with “w”.
- Websites – website ID, website name, domain URL. Pretty straightforward. I’m keeping this separate from the Journal table so that it needs only a few columns.
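Here is one way two of the media-specific tables and the prefixed publication IDs might fit together, as a sketch under the same SQLite assumption. The prefixes come from the list above; the four-digit zero padding, the table and column names, the `publication` view, the `tht` website ID, and the author keys are all hypothetical. Fields I don't know (ISBN, publisher, URLs) are simply left empty.

```python
import sqlite3

# Prefixes are from the list above; the zero-padded number is an assumed format.
PREFIXES = {"book": "b", "article": "a", "presentation": "p", "web": "w"}

def make_publication_id(media_type, number):
    return f"{PREFIXES[media_type]}{number:04d}"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (
    publication_id   TEXT PRIMARY KEY,
    title            TEXT NOT NULL,
    publisher_id     TEXT,
    publication_year INTEGER,
    isbn             TEXT
);
CREATE TABLE website (
    website_id   TEXT PRIMARY KEY,
    website_name TEXT NOT NULL,
    domain_url   TEXT
);
CREATE TABLE web_article (
    publication_id   TEXT PRIMARY KEY,
    title            TEXT NOT NULL,
    website_id       TEXT REFERENCES website(website_id),
    publication_date TEXT,
    url              TEXT
);
-- "Author IDs" is plural, so authorship goes in a link table.
CREATE TABLE publication_author (
    publication_id TEXT NOT NULL,
    author_id      TEXT NOT NULL,
    PRIMARY KEY (publication_id, author_id)
);
-- The unified key lets one view span every media type.
CREATE VIEW publication AS
    SELECT publication_id, title, 'book' AS media_type FROM book
    UNION ALL
    SELECT publication_id, title, 'web' AS media_type FROM web_article;
""")

book_id = make_publication_id("book", 1)
web_id = make_publication_id("web", 1)

conn.execute(
    "INSERT INTO book (publication_id, title, publication_year) VALUES (?, ?, ?)",
    (book_id, "The Hidden Game of Baseball", 1984))
conn.execute(
    "INSERT INTO website (website_id, website_name) VALUES (?, ?)",
    ("tht", "The Hardball Times"))
conn.execute(
    "INSERT INTO web_article (publication_id, title, website_id, publication_date) "
    "VALUES (?, ?, ?, ?)",
    (web_id, "Learning the Language of the Clubhouse", "tht", "2014-03-13"))
conn.executemany(
    "INSERT INTO publication_author VALUES (?, ?)",
    [(book_id, "thorj001"), (book_id, "palmp001"), (web_id, "sarre001")])

catalog = conn.execute(
    "SELECT publication_id, title, media_type FROM publication "
    "ORDER BY publication_id").fetchall()
for row in catalog:
    print(row)
```

The view is the payoff of the unified key: a citation row can point at a book, an article, or a web page, and one lookup resolves it regardless of which table holds the record.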
If you’ve stuck with me this far, I’m going to add one last note about how I’m building this database. I’m breaking up my exploration into 3 eras to help with identifying and finding sabermetric research. The first era ends with the publication of The Hidden Game of Baseball. That’s my starting point for this project and that book marks a pretty significant milestone in sabermetric history. I also feel that going backwards in time from 1984 will be of more value to the sabermetric community, and it will allow me to focus on printed material initially. The second era is the period between 1984 and 1996, which is mostly printed material. 1996 is the year Baseball Prospectus was founded, and it serves as a proxy for the start of the explosion of sabermetric research courtesy of the Internet. The Internet era (1996-present day) will be handled last.
Version 1 will hopefully be ready in the next few months, and if it isn’t published by the start of SABR 45 at the end of June, this project will have been abandoned.