The next level of public baseball data has arrived. MLB Advanced Media’s Statcast made a hyped television debut, although it had made cameos in online replay videos last year. With the system installed in all 30 ballparks to track all movement on the field, hopes are high for discovering many things about the game via data that previously could only be imprecisely discerned by watching a lot of baseball.
However, while MLBAM has stated that Statcast data will be made public, it is still unclear what types of data, and how much of it, will be available for public use. Bits and pieces of the data have slowly appeared since the 2015 season started. Among the first pieces have been the velocity and angle of the ball off the bat, which savvy scrapers of the Gameday files, such as Daren Willman of Baseball Savant fame, have captured and published. But whether the public will have access to the raw data remains to be seen.
It seems unlikely to me that there will be public access to the raw Statcast data anytime soon. The first challenge is the sheer size of the data set, which is already measured in petabytes. This is unlike the pitchF/X data, which can be scraped and saved on a home PC. Raw Statcast data is best stored on a cloud server. While MLBAM is certainly using “the cloud” as the method for allowing the 30 teams to access the data, it would be a massive security risk to open that server up to the public. Setting up a separate public server would be an additional cost, and it’s hard to argue that there would be any significant return on that investment for MLBAM. Statcast is already sponsored by Amazon Web Services, so the possibility is there for the raw data to be made public via the AWS platform, but that possibility seems very remote at this time.
A more likely scenario (at least in my mind) for the release of Statcast data is something like what the NBA did with its SportVU data. SportVU, the player tracking system developed by a subsidiary of STATS, Inc., is akin to Statcast in that it tracks player and ball movement. The Stats section of NBA.com shows various measures and animations gleaned from the SportVU data, but does not provide fans access to the raw data. This is the path I expect MLBAM to take. The batted ball data that has already shown up in Gameday fits this pattern, and many of the other metrics that have been teased via broadcast, such as route efficiency and perceived velocity, could also be distributed in this manner.
Releasing the data in a summarized or snapshot form isn’t as risky to the teams, who were not all that happy when pitchF/X data made its way into the open world. Letting public researchers make insights based on that data available to all teams took away an opportunity to gain a competitive advantage. This is why the other Sportvision products, like hitF/X, which also provided batted ball information, and commandF/X, which tracked where the catcher’s glove was positioned, have been available to teams but not the public.
Regardless of what form the data takes when it is released, Statcast data should enable saberists to use more granular data to show what it takes to succeed in the game of baseball. Some of these data-driven discoveries may merely affirm what scouts and those in the game have been taught and believed for years, even decades, but I’m sure some will not. Like many others, I can’t wait to get my hands on it.