Organizing the World’s Sabermetric Research, Part 3 – Plugging into SABR

It’s been just about a month since SABR 44. As I teased in my very verbose recap, a few things came out of the committee meetings that relate to this possibly quixotic quest I have to catalog all the world’s sabermetric research.

The first thing to note is that SABR already has an effort to catalog all baseball writing: The Baseball Index. Clicking through that link will show a functional, but dated and incomplete, reference of baseball documents, recordings, and other materials that any baseball researcher could want to know of. It does include many sabermetric works already, but finding these works isn’t all that easy. This is mostly because of a fickle search function that doesn’t work quite as well as a modern internet user would like, but also because the tags on the articles are designed around indexing all baseball research, not just sabermetrics. Most sabermetric entries are listed with the tag of “statistical analysis” and nothing deeper, unlike the articles on Saber Archive. Another thing of note is its current state of incompleteness, which is due to a broken data entry system. I will note here that the committee did mention at SABR 44 that someone is working on an upgrade to this system (hint: he runs a very popular website).

Secondly, the Statistical Analysis committee, in its 7 AM Friday morning meeting at SABR 44, brought up the idea of a group project to create a centralized reference list for sabermetric research. Many members of the committee had various ideas about what such a resource should look like: a list of the most recommended articles, a full literature review of one area of sabermetrics (e.g. defensive metrics), working with the Baseball Index Project committee, and a wiki were all suggested. Phil Birnbaum, who chairs the Stat Analysis committee, is currently collecting names of those interested in helping with this committee project, whether or not they are SABR members.

How do these two things affect what I had in mind? The Baseball Index, when upgraded and if designed better than its current state, would contain a lot of the features I am working to include in my database. There are a few features I plan to have that are not in TBI, most notably a citation link between works and topical tags that are more like Saber Archive’s. Thus, my database and the Baseball Index should be able to co-exist. I’ll be using TBI as an additional source for locating sabermetric research works to be included. I’ll also be contributing to the Baseball Index, focusing on articles in academic journals, which appears to be a major gap in their listing at the present time.

The Statistical Analysis committee project is too new to really know how things will shake out. I’m already on board to help with this committee project, and there’s a non-zero chance I take a leadership role with it. However, there are just too many unknowns to really know how my work will fit in with what this project turns into. All I can do is keep on keeping on.

SABR 44: 3,000+ words on my first convention

In the 8 years since I joined SABR, I’ve been to a few dozen chapter meetings, at times driving more than 2.5 hours just to get to the meeting locale. I served on a local chapter board, partially redrawing the chapter map for SABR. I’ve consumed baseball material since I was 6. Many times in the past, I had been asked if I was going to the upcoming SABR convention, and for many years, my answer would always be a disappointed “no”.

That answer changed about 8 months ago. With ample vacation time from my day job and not needing that time for other purposes, I was finally able to go to a SABR convention. I’m pretty sure I wasn’t the first person to register, but I definitely made sure to register as soon as I saw the notice that I could do so in the weekly SABR notes e-mail. And so it was that last week, I made my way to Houston for SABR 44.

A typical SABR convention is a mix of research presentations, player panels, research committee meetings, a ballgame or two, and seeing how many attendees can shut down the hotel bar. With a jam packed schedule, the baseball chatter starts early (around 7 AM on the earliest days) and goes well past midnight. Since I haven’t been blessed with the ability to be in two places at once, I’ll go over what I was able to attend, ranking things in order of my favorites as I go, and close with some general thoughts on the experience.

Panels/Keynote

The biggest draw of any convention is the set of player and media panels. Thanks to SABR’s improving relationship with MLB and the Larry Dierker Chapter’s close relationship with the Houston Astros and its chapter namesake, the panels here had a special twist: the last 2 were held Saturday afternoon at Minute Maid Park before that night’s game between Toronto and Houston. I’ll note those two in the ranks and comments below. I’m also including Reid Ryan’s opening keynote here in this section, since it is set aside like a panel in the schedule. For reference, I’m using the names of the panels as listed in the convention schedule, and have included links to audio/video where available:

  1. College Baseball Panel (audio/video) – The recordings don’t do justice to the hush that came over the room as soon as Roger Clemens entered and everyone noticed. I enjoyed the wide range of topics covered here (recruiting, bats, experiences) and the perspectives that Clemens, Mike Gustafson, and Lamar head coach Jim Gilligan provided.
  2. From Playing Field to Front Office – This ended up being more entertaining than informative, because at least 5 different Yogi Berra stories were shared. Dr. Bobby Brown is still sharp as a tack, Bob Watson was defiant when asked about his “struggles” against Don Sutton, and Eddie Robinson talked about his friendship with Joe DiMaggio and Marilyn Monroe.
  3. Reid Ryan keynote (audio/video) – Like most people who get a podium to themselves at a SABR meeting, Ryan told his story of involvement in the game and how he got to where he is today. What made him extra fascinating was hearing the perspective of a player’s son who has been involved at all levels of baseball.
  4. Decision Sciences Panel (at MMP) – For me, the most anticipated session. Moderated by and featuring Astros GM Jeff Luhnow, along with AGM David Stearns and Sig Mejdal (official title: Director of Decision Sciences. Real title: guy who has one of my 30 dream jobs). Lots of discussion about how the Astros front office works, and a little bit of insight into how they do things. Baseball Prospectus referred to as the “minor leagues” by Mejdal.
  5. Astros Player Panel (at MMP) – Larry Dierker, Alan Ashby, and Art Howe featured on this panel, and started swapping tales and interjecting into each other’s stories. A joy to hear multiple perspectives on the same story, especially the 1980 NLCS. Howe also commented on how he would have been involved in Steven Soderbergh’s version of Moneyball, and Howe talked to Philip Seymour Hoffman about his portrayal only after the movie premiered because of how Bennett Miller directed Hoffman to play the role.
  6. Colt .45s Panel (audio/video) – 4 players and a beat writer discussing the early seasons of Houston’s MLB franchise, with a lot of references to the heat, humidity, and mosquitos that made Houston the most interesting addition with the 1962 expansion. Featured Bob Aspromonte, Hal Smith, Carl Warwick, Jimmy Wynn, and Mickey Herskowitz. You can guess which one was the writer.
  7. Media Panel – This was the one panel that included someone without ties to Houston, and he is probably the most well known of the four panelists: Buck Martinez, currently working for the Toronto Blue Jays broadcasts. Writers Evan Drellich (Houston Chronicle) and Alyson Footer (MLB.com) discussed the print side, while Martinez and Bill Brown (Astros TV play-by-play) discussed the TV side. It’s tough to rate this one seventh, which goes to show just how good the majority of these panels were.
  8. Women in Baseball panel (audio) – In a week where the role of women in sports was getting plenty of play in national media, this panel ended up with more of a media slant than what seemed to be intended. Marie “Red” Mahoney was the headliner, as the only woman from Houston to play in the AAGPBL. Jana Howser talked about things from her perspective as the head of development for the College Baseball Hall of Fame. Alyson Footer and Laila Rahimi addressed the media issues.
  9. 1980 Houston Astros – What should have been a more interesting panel ended up as 10 minutes of panelist introductions and Tal Smith talking for almost 15 minutes with all the details of how that 1980 Astros team came together. The players and coach from that 1980 team, Enos Cabell, Deacon Jones, and Jose Cruz (Sr.), weren’t left a lot of time to tell their sides of the story. Jose Cruz looks like he could still swing a bat today.

One additional note: many veteran convention attendees remarked that this was the first convention where question cards were used as opposed to open mics. This did expedite the asking of questions, as some SABR members tend to take a long time to ask a question at a mic. However, it did seem to allow for the questions to be screened, which kept the questions on the panel topic but also likely enabled some of the potentially thorny questions for the panelists to be avoided. This process will likely be adopted for SABR 45, especially if it turns out that using the cards was a condition for some of the panelists to appear. (Here’s looking at you, Roger.)

Presentations

The heart of the convention is the research presentations. Only 32 are given, each supposed to run only 15-20 minutes, with time for Q+A towards the end of each 25-minute session. 2 presentations are given during each time slot on the convention schedule, meaning there’s no way to attend all of them. Below are the ones I attended, again ranked from my favorite to my least liked, with a brief recap of their findings as best as I could take notes on them:

  1. RP21: An Expanded Game-Theoretic Model of a Batter-Pitcher Confrontation in Baseball – Yes, it was as academic as it sounds. But game theory is something that fascinates me, and seeing Anton Dahbura’s model was interesting to me, as it was the first time I had seen a game-theory model account for the number of balls and strikes and the rate at which the umpire misses the call. A few assumptions made here to simplify things, such as pitchers being able to throw strikes on command, that aren’t practical in real life. In the 3-2 count example he went over, Dahbura advocated the pitcher throw a strike around 91% of the time, while the batter swing only 75% of the time.
  2. RP16: The Ballpark Sportscape: Outfield Advertising and the Branding Issue – Ed Mayo presented on behalf of co-authors Dobb Mayo and John Weitzel, looking at outfield wall ads in 8 major league parks with a panel of 6 interior designers, 2 ad industry pros, and 2 marketing consultants. Their most important take-away: the fan experience, not the baseball game, is the core product.
  3. RP25: Why Does the Home Team Score So Much in the First Inning? – Retrosheet founder David W. Smith noticed an uptick in 1st inning runs compared to all other innings, and just kept asking questions of the data. Based on what he showed, it appears to be some combination of the best hitters tending to be at the top of lineups, how long the away starting pitcher has to wait to throw in the bottom of the 1st, and travel impacts, though no single metric that was used to investigate these impacts was found to be a strong correlating factor. Interaction effects were not investigated.
  4. RP20: William Hulbert and the Birth of the Business of Professional Baseball – Business of Baseball committee chair, and UW-LaCrosse economist, Mike Haupert discussed William Hulbert’s influence on the formation of the National League in the late 1800s. Many of the ideas he sold the owners on are now hallmarks of the modern American professional sports landscape: territorial exclusivity, fixed schedules (a problem at the time), and no admittance of teams from “small towns”. This was also the winner of the award for best presentation at the convention.
  5. RP27: The Strike Zone Squeeze – Richard Thurston explored the jump in BB in the AL between 1948 and 1950 (inclusive). He created a metric called WALA, Walks Above League Average, which uses a bit of a WOWY methodology to calculate how much better or worse a player is at drawing walks. (The formula was too complicated to copy in notes in the time it was displayed.) Thurston theorizes it was an attempt by the AL owners to pressure umpires into calling a strike zone that would cause run scoring to go up and lead to better attendance.
  6. RP32: Was Mantle’s Peak Value Really Greater than Mays’? – David Kaiser revisited Bill James’ articles in the Historical Baseball Abstract and the New Historical Baseball Abstract. Instead of using Win Shares, as James did in his updated comparison in the latter book, Kaiser used Wins Above Average from Baseball-Reference, but substituted their fielding wins estimates with those created by Michael Humphreys in Wizardry. His results? Mantle and Mays are similar in their best seasons when you account for league difficulty, but Mays had more great years than Mantle did. And I have another book to add to the reading list.
  7. RP10: “The Biker Boys Beat the Boy Scouts”: Facial Hair and the 1972 World Series – Maxwell Kates, a very colorful Canadian, examined MLB teams’ facial hair policies over time, using the 1972 World Series between the mustachioed Oakland A’s and the baby-faced Cincinnati Reds as a springboard. Kates comes back around to cite that year as a turning point for facial hair in the game, which seems plausible just from looking at players’ photos from the Topps card sets in those years. I especially enjoyed the point he made that MLB was marketing to families from 1962 to 1972. (As a postscript, it gave me great joy to see the mustache featured on the Reds’ 2015 All Star Game logo).
  8. RP23: An In-Depth Study of Team Chemistry in Baseball – SABR president Vince Gennaro presented some early findings from interviewing players and front office personnel, then tied it into teamwork studies in the business and military realms. Most of the saberists you meet will question whether chemistry matters at all, yet all those in the game continue to insist that how players get along and interact matters. I agree with Vince that something is there, and it seems as though we’re just getting our arms around how to study it.
  9. RP04: Just a Little Bit Outside…: Drs. Nick Miceli and Tom Bertoncino attempted to use pitchF/X data to see if pitcher injuries could be better predicted. Their results show that prior injuries and mix of pitches thrown are the biggest keys. I’m a little skeptical, because despite consulting with many knowledgeable people, including Harry Pavlidis and Alan Nathan, they still decided to use the pitchF/X pitch type tags to classify pitches. Those tags are notably inaccurate in many cases.
  10. Poster presentations – It’s a shame they only had these up for an hour on Friday evening. The posters should have been up sooner, and hopefully will be in Chicago. Kudos to Evan Wassman for basically creating his own linear weights system without having read any of the previous work (he’s still in high school, so there’s time) and to Matthew Crownover and Dr. Jimmy Sanderson on being awarded best poster presentation for their work looking at roster construction.
  11. RP17: The Cuban Baseball “Defectors”: An Insider’s Full Revelation – Peter Bjarkman is one of the authorities on Cuban baseball, and gave a pretty solid overview of the origins of the current wave of Cubans in MLB, what the change in Cuban regulations means for players on the island, and an outlook on the potential for future talent. The main takeaway for me: the talent levels in Cuba have dropped significantly in recent years, and opening up the borders too much would likely turn the Serie Nacional into a low-level minor league. Its ranking is more a sign of the quality of the other presentations.
  12. RP14: Lead Me Out to the Ballgame: A Study Investigating the Leadership of MLB Managers – I take lots of notes at SABR meetings during all presentations. So it says something when a presentation only has 3 lines filled in my papers. It’s nice that Dr. Howard Fero and Dr. Rebecca Herman have a leadership manual based on their interviews with big league managers, but all of the leadership traits they cite can be found in dozens of other leadership books. This would have been more interesting if they had tried to find a trait that was unique to managing a baseball team that isn’t required in other industries.
  13. RP07: Let Them Play! The Houston Astrodome, the 190s, and America’s Golden Age of Popular Culture – There’s a certain style of presentation that’s common when exploring baseball history. I like to think of it as show and tell: show a picture and tell a story with the picture on the screen. David Krell just tried to tell a story, positing the Astrodome’s central role in helping turn sporting contest coverage into the event model that Fox has used to excess. Led Zeppelin’s “Ramble On” played in my head as I left early to make it to another presentation.

This doesn’t even cover a couple of the presentations I wish I could have attended, which I either skipped so I could eat a midday meal or missed because of conversations going for a half hour after the previous presentation/committee meeting.

Committees

I attended 4 committee meetings, each with a slightly different structure and format. The Business of Baseball committee reviewed the status of its current and recently completed committee projects. The Retrosheet meeting had a few presentations on investigations into data discrepancies. The Bibliography committee discussed updates to The Baseball Index system and the need to track a number of publications for baseball-related material. The Statistical Analysis committee discussed starting a committee project to create a centralized reference or bibliography. Given what I’ve already discussed here and here, there will be more on this last project to come.

Other Events

The biggest event of any convention is the trip to the local Major League stadium. You can always tell where the SABR group is seated at these games; just look for every other seat in a group of 10 rows or so keeping score of the game. For me the game was a first trip to some hallowed ground for White Sox fans, as Minute Maid Park is the site where the first World Series since 1917 was clinched for next year’s host city. The game itself had its share of interesting events as well: the first time in 10 years with the roof open for an August game, an inside-the-park HR that was confirmed by replay after Jon Singleton was called out at home, R.A. Dickey’s knuckleball, and a robbed HR that those of us in the right field mezzanine could only see by video replay.

The trivia contest is the nerdiest part of the convention. What’s impressive isn’t just the depth of knowledge that those who compete have, but how quickly the contestants can answer some of these questions. The most entertaining category had contestants pantomime various batting stances and incidents in baseball history. My favorite was the best call to the bullpen by any manager in history: Ozzie Guillen’s signal for Bobby Jenks to come into Game 2 of the 2005 World Series.

The city history tour is a prelude to each convention. While its focus is on exposing attendees to the area’s historical (and non-baseball) highlights, it does occasionally point out locations tied to baseball. Our tour in Houston highlighted the early years of the city’s history, wandering through downtown, the ritzy River Oaks neighborhood, the museum campus, the massive medical campus, and past the sadly neglected stadium in the Harris County Sports Center complex.

I didn’t attend the outing for the Sugar Land Skeeters game, the Awards Banquet, or the Historical Ballpark Sites Tour. Two of these I wish I had attended. I’ll let you guess which ones.

Overall

SABR 44 was 4 days of almost total immersion into baseball. Even more than a week later, I still wish I was there. Thankfully, the Internet has made the world a smaller place, allowing me to keep in touch with some of the fantastic people I met there. The baseball chatter that I engaged in with the likes of Graham Womack, Phil Birnbaum, Anthony Rescan, Andy McCue, Sean Lahman, Maxwell Kates, Chip Atkinson, Tara Kreiger and countless others is really the heart of any SABR convention. It’s no wonder the hotel lounge was hopping most of the weekend, even though many reckoned the drink prices to be too high.

That being said, I didn’t spend as much time chatting there as I might have under different circumstances. My wife made the trip to Houston with me so she could get some R&R before her school year starts up again. Thus, I ended up spending most of my time after each day’s sessions were done with her instead of hanging around and talking to whoever happened to be lounging in the lobby. This will probably not be the case for me next year at SABR 45.

The schedule is jam-packed. This is not likely to change soon, as the organization doesn’t want to extend the convention a day longer for a variety of reasons. I didn’t get to attend everything I wanted to, but making it to 95% of it was still pretty good. That being said, it is the first convention or conference I’ve attended where events started before 8 AM.

Since SABR 45 will be in my homeland of Chicago, this time constraint could be extra challenging. This is a rare occasion where the convention dates have been announced ahead of the 2015 MLB schedule being released, and the games for the MLB team(s) are centerpiece events. We on the planning committee hope to have 2 games to schedule, along with the 32 research presentations, 8-10 panels, and other regular convention events. Houston was so well organized that it will be a tough act to follow.

If you made it this far, thanks for reading. Hope to see you at the Palmer House Hilton in Chicago, June 24-28, 2015.

 

No-No Fever (Reprised)!

Once upon a time, I wrote for the blog Seamheads.com. As that time was short-lived, and Mike Lynch needed the disk space to host other writers, my posts were deleted from their active archive. Thankfully, they have survived thanks to the Internet Archive. Since SABR101x used the Lahman Database, I was reminded of a toy stat I created back in 2008 in a post on Seamheads. The article below is an updated and edited version of what was originally published there. Treat it as something to hold you over until SABR 44 begins in Houston next week.

You might have guessed that today’s discussion will be about predicting a no hitter based on the wealth of numerical data that exists, and you’d be right. So where should we start? Well, let’s start with (who else?) Bill James, who estimated the odds of a no-no by a pitcher in a single game using his career statistics:

[(3*IP)/((3*IP)+H)]^26

Simple probability and logic: the more frequently you get people out without allowing a hit, the better chance you have of getting 27 batters out without allowing a hit. J.C. Bradbury modifies this formula slightly to be more centered on the pitcher’s ability as well as utilizing the Poisson distribution, but essentially follows the same idea.
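For anyone who wants to play along at home, here is a minimal sketch of James’ estimate in Python. The function name and the sample career line are my own illustration, not anything from James, and I’m assuming innings pitched are given as a true decimal (200.333 rather than the box-score shorthand 200.1):

```python
def james_no_hit_odds(innings_pitched: float, hits: int, exponent: int = 26) -> float:
    """Estimate single-game no-hitter odds as [(3*IP)/((3*IP)+H)]^exponent.

    innings_pitched is assumed to be a true decimal, not box-score notation.
    """
    outs = 3 * innings_pitched
    if outs + hits <= 0:
        raise ValueError("need at least one out or one hit to estimate odds")
    # Share of outs among (outs + hits), compounded over `exponent` batters
    return (outs / (outs + hits)) ** exponent


# Hypothetical career line: 3,000 IP and 2,500 hits allowed
odds = james_no_hit_odds(3000, 2500)
print(f"{odds:.4%} chance per start")
```

With that made-up career line, the estimate comes out to somewhere under two tenths of one percent per start, which is a good reminder of how rare these things are.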

Let’s stay within Bill’s framework, but consider a few extra components:

  • Defense is typically something that, courtesy of DIPS, tends to be ignored, but it always seems to play a critical role in a no hitter. A statistic like DER can account for the percentage of batted balls that become outs. Let’s use this statistic here. For the estimate of runners reaching on error, I used the approximation of .71 * team errors that Baseball Reference uses
  • Since we’re accounting for defense, we’ll need to know how often the ball gets batted into play where the defense can field it. The denominator from the formula for BABIP is a good start, but we want it from the pitcher’s perspective, and in the form of a percentage. Let’s use (BFP-BB-HBP-K-HRA)/BFP. This will be the estimate for BIP%
  • Some balls leave the yard and never come back. Home runs will be accounted for with HR%, HR/BFP
  • We’ll also need to know what percentage of batters do not get a hit during their turn at bat. This is everything else not covered by the previous 2 bullet points, so 1-HR%-BIP% should work fine
  • What about walks and other ways of getting on that aren’t hits? Forget them here, as they are not hits or outs, the only factors that matter in a no hitter. This does ignore the issue of pitching with runners on, which tends to be a detriment to preserving a no hitter.
  • The exponential factor is the last component. Bill used 26 as an estimate, assuming that somewhere along the way a caught stealing or double play factored into the mix. What he probably wanted was to find the average number of batters that had an out-producing at bat per 27 outs. Easy to say, harder to show how to calculate. When I first did this analysis in 2008, I used the Retrosheet play-by-play files from 2000-2007, counting the number of plays with an out and dividing by the number of games times 2. This came out to 25.8, close to James’ 26. Let’s keep using 26 since it’s a whole number, and you can’t have 80% of an at bat during a game.

And there you have it. The full formula thus looks something like this:

NoHitOdds = [1-HR%-BIP%*(1-DER)]^26
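The pieces above can be sketched in a few lines of Python. Treat this as an illustration under assumptions rather than the code I actually ran in 2008: the function name, argument names, and sample season are made up for the example, and DER is passed in as an already-computed number (however you choose to estimate it).

```python
def no_hit_odds(bfp: int, bb: int, hbp: int, so: int, hr: int,
                der: float, exponent: int = 26) -> float:
    """NoHitOdds = [1 - HR% - BIP% * (1 - DER)] ^ exponent.

    bfp: batters faced; bb, hbp, so, hr: walks, hit batters, strikeouts,
    and home runs allowed; der: team defensive efficiency (0 to 1).
    """
    if bfp <= 0:
        raise ValueError("batters faced must be positive")
    bip_pct = (bfp - bb - hbp - so - hr) / bfp  # balls in play per batter faced
    hr_pct = hr / bfp                           # home runs per batter faced
    # Chance a given batter does NOT get a hit: no homer, and any ball
    # in play gets turned into an out by the defense
    p_no_hit = 1 - hr_pct - bip_pct * (1 - der)
    return p_no_hit ** exponent


# Hypothetical season: 900 BFP, 60 BB, 5 HBP, 200 K, 20 HR, team DER of .700
print(f"{no_hit_odds(900, 60, 5, 200, 20, der=0.70):.4%}")
```

Note that a higher DER or a higher strikeout total both push the odds up, which matches the intuition that strikeout pitchers with good gloves behind them are the ones who throw no-nos.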

A few notes before I get to the numbers:

  • This estimator works best as a season-level estimator, for two reasons: DER is most stable over an extended period, and defensive rosters generally change from year to year, especially in the age of free agency.
  • When I first wrote this post, it came just before Jon Lester’s no hitter in 2008 against Kansas City. At the time, I wondered if there’s an extra component that could account for predicting these slim odds against a given opponent. The 2008 Royals were noteworthy for being a team that doesn’t hit very well, and so even a left-hander in a park that doesn’t like southpaws a lot (thanks Mr. G. Monster) was able to do it. Calculating this would probably include some sort of batting metric and maybe even an “over-aggressiveness” component; the idea being that a poor hitting team with bad plate discipline is more likely to be no-hit. I have not thought much about how to do this yet.
  • I think the importance of defense in a no-hitter is well accounted for here, though even I’ll continue to say this metric is far from perfect. Using an aggregated DER doesn’t account for the differences in fielders. Maybe this will become possible with the new technology being tested this year, but we’ll have to see what that data ends up looking like and what will be available for public consumption.
  • There are still a few other things not included, such as any kind of park factor or a way of accounting for the difficulty of fielding plays. The latter information, however, is hindered by the lack of fielder positioning data at the moment. We won’t be able to get that data historically, but it is coming, maybe even as soon as 2015.

Just like Bill James did, multiplying these odds by the number of starts in a season and adding those totals up for all seasons a player pitched will give us an expected number of no hitters. The top 10 since 1920 are, for the most part, who you’d expect:

Pitcher           Expected no-hitters
Nolan Ryan        2.67
Randy Johnson     1.50
Sam McDowell      1.13
Roger Clemens     0.80
Pedro Martinez    0.80
Bob Feller        0.68
Tom Seaver        0.67
Steve Carlton     0.67
Sandy Koufax      0.61
J. R. Richard     0.60
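The career totals are just the per-start odds for each season multiplied by that season’s games started, summed across seasons. A minimal sketch of that roll-up, with a made-up function name and made-up season numbers (not any real pitcher’s line):

```python
from typing import Iterable, Tuple


def expected_no_hitters(seasons: Iterable[Tuple[float, int]]) -> float:
    """Sum (per-start no-hit odds * games started) over a pitcher's seasons."""
    return sum(odds * games_started for odds, games_started in seasons)


# Hypothetical two-season career: (per-start odds, games started)
career = [(0.002, 30), (0.003, 35)]
print(round(expected_no_hitters(career), 3))
```

Because each season carries its own odds, a pitcher’s total reflects both how long he pitched and how good the defense behind him was in each of those years.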

SABR Chicago June Meeting

On a classically hot and humid day to celebrate the summer solstice in Chicago, a small but fervent gathering of seamheads came together at the Lisle Library for an afternoon of baseball chatter.

Rich Hansen opened up the meeting with a few announcements, noting that the upcoming schedule for future meetings is still to be determined and that a site has been selected for the 2015 SABR convention that the Emil Rothe chapter is playing host to. That site is expected to be announced at SABR 44 in Houston at the end of July.

After Rich’s announcements, the keynote speaker of the day was introduced. Coming to Chicago all the way from Kansas City, Bob Kendrick, president of the Negro Leagues Baseball Museum, spoke about the history and significance of the Negro Leagues, weaving in a multitude of stories about many of the Negro Leagues’ biggest stars. Noting how intertwined black baseball was with its communities and culture, Mr. Kendrick drew the parallel of the rise and fall of the Negro Leagues with the rise and fall of black-owned businesses. He discussed how the Negro Leagues came into being in 1920, giving players with recognized major league talent, who were kept off Major League teams by American social conditions at the time, a place to play professional baseball. Noting that the NLBM’s mission is to ultimately help people see that the Negro Leagues were on the same level as the Major Leagues, Bob told many tales about Martin Dihigo, Buck Leonard, John Henry Lloyd, James Thomas “Cool Papa” Bell, Josh Gibson, Oscar Charleston, Satchel Paige, Rube Foster, and, of course, Buck O’Neil and Jackie Robinson, along with a number of others. During his Q+A, Bob Kendrick addressed the gap in the collection of Negro Leagues statistics and observed that baseball today has evolved into a “country club” sport.

Next up came a slightly briefer version of Chris Kamka’s This Day In Baseball report, noting that the meeting was being held on the anniversary of the appearance of the 1st black player in major league history, Ted Lyons’ 250th win, and Carlton Fisk’s breaking of Bob Boone’s games-caught record. He also mentioned his contribution to an upcoming book celebrating old Comiskey Park.

Brian Bernardoni presented again about Wrigley Field, picking up almost right where he left off at the chapter’s celebration for the venerable ballpark’s centennial. Rather than present any new findings, he posited a number of questions about the ballpark’s history that could be answered by researchers, many of them related to the folklore that surrounds the park. Notable topics include biographies of architect Zachary Taylor Davis and Charles Weeghman, investigation into some of the decisions made by previous owners of the Cubs such as “Did Wrigley allow ads at the ballpark for his own company’s products only?” and “Why didn’t Wrigley or the Tribune Co. just buy the rooftops when they had the chance?”, and explorations into whether the Federal League should be considered a Major League and Chicago weather in April.

Jessica Jensen closed out the meeting with a lively demonstration on the importance of footwear in the major leagues, reporting that she has yet to find a professional club that ensures its players have properly fitted cleats. The discussion held as much, if not more, interest for attendees with foot issues of their own, as she interactively demonstrated what players should be looking for in a properly fitting shoe. Her company Saberfeet is eager to help new clients avoid foot injuries (e.g. turf toe) that can affect player performance.

The meeting ended early, as a number of scheduled presenters were forced to back out of their commitments at the last minute due to extenuating circumstances. For this author and a few others in attendance, it was a great way to start the countdown clock towards the SABR Convention in Houston.

Summer of MOOC: Sabermetrics 101

As part of my (potentially quixotic) efforts to catalog every single piece of sabermetric research ever published, there is one particular resource that is of great help with identifying and categorizing the research works: college course syllabi. In recent years, a small number of colleges and universities have offered elective courses focused on sabermetrics. These courses have focused not only on studying baseball through its numbers, but also teaching some of the necessary coding and statistical skills needed to do these analyses properly.

The first of these courses I heard about was offered at Tufts University. SABR Member Andy Andres wanted to offer a course that didn’t teach statistical methods through baseball but that focused on doing the sabermetric studies that have shaped how people understand the game. Tufts offered an opportunity to do so through its Experimental College, and his syllabus for that course is one of the great starting points for those who aspire to emulate Bill James.

Starting May 29, you don’t have to be a Tufts student to experience this course. Teaming up with Boston University, Andres is offering the latest iteration of the Tufts course to be taken by the masses via edX. I’ve signed up for it for 3 reasons:

  1. I’ve never been involved with a massively open online course (MOOC) before, so I’m interested to see how the course operates from the student perspective. (As an aside, I did do quite a bit of distance learning for my master’s degree.)
  2. Andres’ Sabermetrics 101 course is something I wish I had been able to take while I was in college. I get to do that now, even if I learn nothing new.
  3. It’ll hopefully motivate some new research that will end up on this blog.

If you haven’t signed up for it and are sitting around reading this post on Memorial Day weekend in 2014, I’ll recommend you do so. Hopefully you’ll learn a little something about sabermetrics, mathematical statistics, and coding.

A Party for a Century

April 23, 2014 marks the 100th anniversary of one of the most famous landmarks in Chicago, and certainly the most famous on the North Side, Wrigley Field. Naturally, this celebration could not go unrecognized by the members and friends of the Emil Rothe Chapter of the Society for American Baseball Research, based in that same city. And so it was that on Saturday, April 19, 50 members and friends gathered in downtown Chicago to celebrate the most famous mix of brick, steel, concrete, and ivy that Zachary Taylor Davis ever conceived.

Gathering at the Cliff Dweller’s Club overlooking the Lake Michigan waterfront, Rich Hanson, Emil Rothe chapter president, welcomed the attendees by noting the appropriateness of the location in relation to Chicago baseball history such as the sites of Lakefront Park and the Federal League offices. Also recalled were other centennials being celebrated in 2014, including Babe Ruth’s debut and WWI. He then ran through the meeting’s agenda of 4 presentations, all related to Wrigley Field in some way, and gave a reminder to all about the upcoming SABR Conventions in Houston this summer and Chicago next summer.

Dan Levitt led off the presentations with an overview of the short history of the Federal League and how that battle played out with Chicago at the center. Founded in 1913 as a minor league, the Federal League began to aspire to be a major league after Jim Gilmore bought the Chicago team and wrested control of the league. Gilmore then sold his club to a Chicago restaurateur named Charlie Weeghman, who wanted to move the club to a site on the North Side and bring current Major League players to the Federal League. Weeghman’s first signing was Joe Tinker, who he swooped in to take from the Reds as they tried to sell him to the Dodgers. As a result, the owners tried to buy Weeghman, the richest owner in the Federal League, out and cause the Federal League to collapse. That deal, and a subsequent one, would be scuttled by Charles Murphy. Needing a place to play in 1914, Weeghman financed the construction of a stadium at the corner of Addison and Clark Sts. in 6 weeks. The war between the leagues would go on through 1915, when Weeghman was offered the Cubs for approximately $500K and most of the other Federal League owners wanted to be bought out. Despite its short tenure, the legacy of the Federal League has lasted far longer, as it led to MLB’s anti-trust exemption and the implementation of the reserve clause in many player contracts.

After this overview of the historical circumstances around the construction of the ballpark, Brian Bernardoni gave a history of the host club and, more pertinently, the Friendly Confines. Brian is a tour guide at Wrigley, and is well versed in its history. He provided many of the historical artifacts on display during the meeting, including the newspaper announcement for the first game at then-named Weeghman Field and a pennant from the Chicago Federals. Brian mostly talked about the construction of the field and the key people behind that construction, including Weeghman, Davis, William & PK Wrigley, and Fr. Gorman of DePaul University. (If you must know the details behind that last name, ask Brian.) He also highlighted the additions and changes made under each ownership group, from the Wrigleys (ivy, organ, baskets) to the Tribune (lights, suites), and gave an overview of the Ricketts plans. Brian’s presentation was followed by a break and some Wrigley-inspired trivia questions, all of which were answered by someone in attendance.

Next up was writer Ed Sherman, who discussed his book about the most disputed moment in Wrigley’s history, Babe Ruth’s Called Shot. After relating some personal memories of his days as a vendor in the stands at the venerable ballpark, he described his motivations for writing the book. One of those motivations was interviewing someone who was at that game, former U.S. Supreme Court Justice John Paul Stevens. After detailing the history of what the key players claimed in regards to whether the Bambino did or did not call his shot and the difficulties in accurately researching the event, Ed settled on noting that the at-bat itself was extraordinary and reserved his judgement on the matter for those who read the book.

Also with a book on hand, Sam Pathy followed Ed and gave a brief overview of famous home runs at Wrigley. He handed out a 7-page list of long home runs, defined as those with an estimated distance greater than 440 ft. The list included many citations of these tremendous clouts, from newspapers to TV (usually WGN) calls. Steroid-phobic fans looking through this list will squirm at the last 2 pages that are utterly dominated by a man who forgot how to speak English in front of a Congressional grandstand, Sammy Sosa.

The last presentation, by Stuart Shea, discussed the Ricketts’ renovation plans for Wrigley and evolved into a lively discussion about the future of the ballpark. Stu noted that one of the reasons fans are able to celebrate Wrigley’s 100th birthday is the foresight of Davis and Wrigley. Davis’ original modular design allowed for the ballpark to be modified in many ways, and Wrigley was very diligent about making sure the structure received the necessary upkeep and renovations that have helped the building stay standing for so many years. Shea then went on to discuss what he feels the big changes at the park will be with the proposed renovations, noting most of the changes were about allowing Wrigley Field to generate more revenue for the club and the Ricketts family. The discussion was ignited after Stu provided his views on the most controversial element of the plan, the Jumbotron.

This discussion continued both in large and small groups as the meeting wound to a close. After attendees took the opportunity to get an autographed book or 3 and take some pictures from the rooftop, people went their separate ways. For this group of baseball fans, not all of whom support the Cubs, Wrigley’s birthday party had just started.

Organizing the World’s Sabermetric Research, Part 2 – Contents and Design

In Part 1, I discussed current efforts to capture sabermetric research and announced my intention to create a new sabermetric research reference database that will be more comprehensive than those efforts. In this part, the critical questions of what works should be included and how the database is designed will be addressed.

Considering my goal for the yet-to-be-named sabermetric research database is to capture all sabermetric research, it’s important to consider what qualifies as sabermetric research. There are certain articles or books which most involved in sabermetrics would certainly agree should be included. Those are the easy entries. But to be comprehensive, the database will need to include more than the “greatest hits” of sabermetrics. That means there will need to be some way to decide what goes in and what doesn’t, and it shouldn’t be an individual person.

So how will it be determined that a published work will be included in the database? This is where the “greatest hits” will be of great use. Using those pieces of research that are already very well known as a starting point (such as The Hidden Game of Baseball or the Bill James abstracts), the database can be expanded upon by looking at the references and citations used in those works and including those works cited that are also sabermetric in nature. This, however, is merely a good start.
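As a rough illustration of that citation-walking idea, here is a minimal Python sketch. It is only one way the process could be automated, not the design of the actual database: the function name, the `citations` lookup, and the `is_sabermetric` check are all hypothetical, and the cited titles are placeholders rather than real references.

```python
from collections import deque


def expand_from_seeds(seeds, citations, is_sabermetric):
    """Walk outward from the seed works, breadth first, following citations.

    seeds          -- titles of the well-known starting works
    citations      -- dict mapping a title to the titles it cites (hypothetical)
    is_sabermetric -- function deciding whether a cited work qualifies (hypothetical)
    """
    included = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        work = queue.popleft()
        included.append(work)
        # Works with no recorded citations are the "dead ends" of the walk.
        for cited in citations.get(work, []):
            if cited not in seen and is_sabermetric(cited):
                seen.add(cited)
                queue.append(cited)
    return included


# Placeholder data: only the seed title comes from this post.
citations = {
    "The Hidden Game of Baseball": ["Cited Work A", "Cited Work B"],
    "Cited Work A": ["Cited Work C"],
}
not_sabermetric = {"Cited Work B"}

result = expand_from_seeds(
    ["The Hidden Game of Baseball"],
    citations,
    lambda title: title not in not_sabermetric,
)
print(result)
```

In practice the `is_sabermetric` decision is the hard part, and this sketch leaves it as a stand-in for whatever review process ends up being used.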

One result of a well-populated comprehensive research database will be the ability to identify schools of thought, and it’s possible that the “greatest hits” list I use will be from the currently dominant school of thought. It is also likely that, at some point, I’ll run into dead ends walking through citations in other works.  So after exhausting the “greatest hits” path, I’ll turn to looking for work by the most noted researchers. They’ll be determined either by having won awards from SABR or having been hired by a big league front office as a result of their published work. It should be noted that this author-based method could encounter the same problems as the “greatest hits” method.

A third angle for finding works to be added to the database will be based on the source of the research. Certain publications and websites definitely carry more weight than others. Books and academic journal articles are certain to be included due to the review process these works undergo before being published. Presentations at conferences are also fairly likely to be included, though duplication of  work will be avoided as many presentations will have corresponding papers published, especially from academia. Certain blogs and websites, like The Hardball Times, will also be considered. There are a few sabermetric primers that can be cross referenced to help identify other sources.

For getting the database initially populated, those methods should work just fine. With all those different inputs, it also means that a variety of materials need to be considered and managed. One of the main considerations is how to record the information for each type of published work. Much like how different types of references had to be cited using different syntax in papers in high school, different types of research sources need to be treated differently in the database to accommodate the unique features of each type. Let’s consider each type of research source:

  • Books – Seemingly simple, as you have lots of publication information included (title, author, publication date/year, etc.). However, many sabermetric books cover multiple topics of research; thus, there will need to be a way to tag subject matter based on page number.
  • Journal/Magazine Articles – Perhaps the easiest type to enter into this database. Usually focused on a single topic, publication information is easily identifiable (title, author, journal, page #s, etc). Also often includes keywords for topical focus.
  • Blog posts – Like journal articles, typically single topic. Tags are often used to identify topics covered. Can run into the issue of not being able to properly identify authorship. URL  will need to be captured, but that can become inaccurate if website is deleted/moved. Otherwise, cited very similarly to journal articles.
  • Online forums – Specifically thinking of the old rec.sport.baseball Usenet group here, but could also apply to e-mail lists like SABR-L and other online arenas like Tango’s Forum. Many of the same issues as blog posts. I’m really disinclined to include postings from these sources, as they are more of a discussion forum for working out ideas than they are a place to publish research. However, if a posting here is cited in another work included in the database, it’ll probably be included. Citations will be almost exactly the same as blog posts.
  • Conference presentation slides/posters – Inconsistently published, but valuable when available. Usually accompanied by formal paper, though content can differ slightly to account for new information. These materials are typically meant to be accompanied by someone speaking. This database will only include those presentation slides that can be accessed electronically or were published as part of a conference proceedings book that could be found in a library. It will also link the slides to the corresponding paper or…
  • Audio/Video – One of the great things about being a researcher in modern sabermetrics is the ability to virtually attend conferences, thanks to posted video/audio from the event. These will probably be included in the same manner as conference slides/posters. Citations for both will incorporate information typically used for citing conference proceedings: presenter, conference, title, and date at a minimum.
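To make the differences between these source types a little more concrete, here is a small Python sketch of how the records might be modeled. This is not the actual schema, which is still being worked out; every class and field name below is an assumption made for illustration, and the page range in the example is made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Work:
    """Fields shared by every type of source (hypothetical layout)."""
    title: str
    authors: List[str]
    year: Optional[int] = None
    topics: List[str] = field(default_factory=list)


@dataclass
class Book(Work):
    # Books cover many topics, so map each topic to a (first, last) page range.
    topic_pages: Dict[str, Tuple[int, int]] = field(default_factory=dict)


@dataclass
class JournalArticle(Work):
    journal: str = ""
    pages: str = ""


@dataclass
class BlogPost(Work):
    # Also usable for forum postings, which are cited almost the same way.
    site: str = ""
    url: str = ""  # may go stale if the site is moved or deleted


@dataclass
class Presentation(Work):
    # Covers slides, posters, audio, and video from conferences.
    conference: str = ""
    date: str = ""
    media: str = "slides"
    paper_title: Optional[str] = None  # link to the corresponding paper, if any


# Example entry; the page range is invented for illustration only.
book = Book(
    title="The Hidden Game of Baseball",
    authors=["John Thorn", "Pete Palmer"],
    topic_pages={"example topic": (1, 10)},
)
print(book)
```

Keeping the shared fields on one base record and the type-specific ones on subclasses mirrors the way citation styles treat a book differently from a journal article while still asking for title and author on both.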

The schema of the database is still being edited as I work towards building version 1. I’ll discuss that once the first version is released. That date is still TBD, as this is not my full time job. If you want to help or have a suggested name, please feel free to drop me a line on Twitter or in the comments section.

Organizing the World’s Sabermetric Research, Part 1

Lent started recently for western Christianity, and I have a confession that may not be all that surprising: I read an absurd amount of sabermetric research. I read articles on the big sites like Baseball Prospectus and FanGraphs. I read the blogs of a number of fellow SABR members. I read articles from academic journals like the Journal of Quantitative Analysis in Sports and The American Statistician. I buy the Baseball Prospectus, Hardball Times, and Bill James annual books each winter. I, at a minimum, keep tabs on what papers are presented at the big conferences like the recently completed MIT Sloan Sports Analytics Conference, the summer SABR Convention, and the Joint Statistical Meetings.

All of this information is being disseminated, but finding it all is still rather difficult and time consuming. The links above are just a small sampling of the bookmarks and RSS feeds that I use to read baseball statistical research. That’s ultimately not a good thing for sabermetrics, because it leads to 2 major issues:

  1. There are 2 main types of researchers doing work in the field of sabermetrics: the analytically-inclined fan or sports industry professional, and the academic statistician. For the most part, neither type ever looks into what is going on with the other type. The fan avoids the academic in part because of the inherent barriers built into accessing and understanding academic research and in part because the academic is typically more focused on the statistical technique used than subject matter applicability. The academic avoids the fan because of the perceived lack of statistical rigor in the fan’s analyses and the fact that a blog post can go up without any sort of peer review before publishing.
  2. As Colin Wyers noted in one of his final articles at Baseball Prospectus, there is no single record of the research done by saberists/sabermetricians, in part because a lot of this research was done on an individual basis and it is not all channeled through any single organization. While this decentralization is generally good for making research available, it does mean that articles and books with old ideas can easily be discarded and lost despite their merit at the time of publication.

How can this be fixed? One method would be an aggregator site that captures and categorizes articles. This is the approach of Saber Archive, currently in closed beta testing. (Note: I have participated in this closed beta. You can sign up to participate here.) Matt Dennewitz has built a fine interface to capture the text of research articles and make these articles searchable by key words and by research topic category. This should become an invaluable resource, especially once the site is public and more fully populated with articles. However, its scope of research will be limited to what is currently published on the web and can be archived by Matt’s software. Granted, that covers much of the research published today, but it does leave out anything that is only published the old-fashioned way: on paper, in books and magazines.

It seems that to bridge this gap between the historic print world and the modern web, sabermetrics needs something closer to the modern library catalog: a database of books and articles with sabermetric content. Charlie Pavitt has attempted this with his Statistical Baseball Research Bibliography, though with a narrow scope. Like myself, he is an avid reader of sabermetric material, and he draws upon that knowledge base to cull through the research published and identify the works of sabermetric research that, as he states in the description of the file, “have been intended to make a contribution to our knowledge about baseball as a statistical science.” What he’s produced is a fine resource, though this definition is ultimately arbitrary. In my opinion, his list leaves off some pretty key pieces of research, even if you assume he just hasn’t had a chance to update the list since 2011, the most recent publication date of any entry.

These two are the only attempts that I know of that try to catalog even a part of the world’s sabermetric research. Both have good features, both have drawbacks. I feel that sabermetrics deserves a little bit more than what is currently provided. Thus, I am proposing to build a new comprehensive database of sabermetric research articles. Consider this my attempt to take up Colin Wyers’ challenge to the sabermetric community.

This new database will undoubtedly share some features with the efforts described above. It will be built as a database, which I presume is the underlying framework of Saber Archive. However, the contents of that database all merged together will result in a table more like Pavitt’s Bibliography. The main goal of it all will be to capture every piece of sabermetric research and link it all together, with the ancillary goals of bringing the fans and the academics together more often and, just maybe, increasing everyone’s ability to understand statistical methods and results.

Admittedly, this is a huge project for one person to take up on his own. Part of the reason this is a “Part I” post is because of the project’s scale, but also because I still do have a number of features to work out and design issues to confront. I’ll cover those issues in future installments. I do have my starting point; The Hidden Game of Baseball by John Thorn and Pete Palmer will be entry #1 in the database. I feel that’s fitting, as it was one of my gateways into sabermetrics.

Help will most certainly be needed, especially in identifying new sources of research to be added to the database. You can offer your assistance in the comment section or by any other means of contacting me if you happen to know them.

2014 SABR Analytics Conference Research Awards

Three years ago, the Society for American Baseball Research (SABR) decided to finally take advantage of the rise in sabermetric research and organized an analytics conference during Spring Training. In that short time, this conference has become one of the big events of the year for aspiring and practicing sabermetricians/saberists. Last year, SABR decided to recognize achievements in sabermetric research by inaugurating the SABR Analytics Conference Research Awards.

My favorite things about this award are that it highlights some of the great written work being published in sabermetrics in all media (primarily the web) and lets the sabermetric community vote on the awards. Awards are handed out in three categories: Historical Analysis/Commentary, Contemporary Commentary, and Contemporary Analysis. This year’s nominees in each of the three categories can be found here, which also includes links to the nominated articles and the ballot for voting.

Here’s who I voted for in each category, and why:

Historical Analysis/Commentary:  Max Marchi, “Catcher Framing Before PITCHf/x,” Baseball Prospectus, May 16, 2013.
Catcher framing has been one of the hot topics in sabermetric research over the past few years. It began as an offshoot from the PITCHf/x data that MLB Advanced Media (MLBAM) has graciously allowed to be freely accessed by the public. (SABR has thought so highly of this that Cory Schwartz of MLBAM is being honored with the Chadwick Award this year.) However, using PITCHf/x data means most framing analyses only go back to 2008 or so. Max Marchi’s analysis utilized Retrosheet’s pitch sequencing data contained in its play-by-play files from 1988 onward to build a model that estimated the effects of framing prior to the advent of PITCHf/x data, and he used a modeling technique that I’m not sure has been applied to baseball data before. Because it went into unexplored territory, both in terms of topic and technique used, I gave it my vote.
Runner up: Russell Carleton, “Dating the Impulse to Protect Pitchers,” Baseball Prospectus, December 2, 2013.
The saberist formerly known as “Pizza Cutter” from his days running the now-defunct Statspeak blog spent a lot of his time at Baseball Prospectus last year exploring the evolution of how pitchers are used. The whole series of articles is worth a read, including all the gory math. Like the other articles in the category that I didn’t vote for, I find it’s not quite as innovative as Marchi’s work.

Contemporary Baseball Commentary: Jon Roegele, “The Strike Zone in the PITCHf/x Era,” The Hardball Times Annual 2014, November 2013.
For me, this was the highlight article of this year’s HBT annual. Thankfully for you, they’ve reprinted the article online because of these awards so you don’t have to buy the book, although I highly recommend doing so.
As noted above, PITCHf/x analysis in recent years has started to branch beyond analyzing the pitch itself. While catcher framing is one offshoot, strike zone evaluation is a second offshoot. While there are a number of articles using the data to evaluate individual umpires’ abilities behind the plate, Jon Roegele examines the longer term effect on the strike zone that has been arguably influenced by the use of PITCHf/x data as an evaluation tool for the umpires. He considers not only how the strike zone has changed, but how the players have adjusted to the  changes. Very thoroughly done.
Runner Up: Jonah Keri, “Grand Theft Baseball,” Grantland.com, March 20, 2013.
I’m a big believer that numbers should be paired with narrative when the numbers are used and depicted properly. Jonah is one of the best at doing just that. This is a taste of why stats and scouts should never have been at odds with each other. I put it below Roegele’s work because a lot of the analysis was specific to the examples used, though a number of insights were gleaned.

Contemporary Baseball Analysis: Andrew Ball, “2013 MLB Draft: How Valuable Are Draft Picks?” Beyond the Box Score, June 25, 2013.
This was admittedly the toughest category for me to pick. I ended up leaning towards the more numerically based analysis done by Andrew Ball. I’m not necessarily sure it’s better than a couple of the other nominees in terms of originality or depth, but I’ll have the tiers in the back of my head as I keep an eye on the draft this June.
Runner Up: Adam Kilgore, Sohail Al-Jamea, Wilson Andrews, Bonnie Berkowitz, Todd Lindeman, Jonathan Newton, “A Swing of Beauty,” WashingtonPost.com, May 14, 2013.
This breakdown of Bryce Harper’s swing was so thorough, it only missed my vote by a slim margin. Perhaps it was just as surprising that it didn’t come out of one of the sabermetric blogs, but having the resources of a major newspaper definitely enabled the multimedia presentation.

Welcome!

Welcome to my blog, Four Pitch Random Walk!

This blog was created first and foremost as a place to publish my baseball research and discuss some of my research projects. I also plan to use this space to recap my baseball-related experiences and occasionally comment on baseball and other topics.

My hope is to write a new post every 2 weeks or so. If time allows, I might post more frequently, but right now every other week feels like a realistic goal.

Thanks for reading!