Rounding the Bases

We can’t quit you, baseball! The season might be over, but we want more. So, we’re dipping into the baseball data to see what else we can learn. Read on for one more run around the bases!

Put Me In, Coach

This season, all anyone talked about was home runs. There were 6,770 homers hit during the regular season this year. That’s 665 MORE than the previous record! And exactly half of the teams in the league set franchise home run records. Holy homer! 

But, do all these home runs lead teams to the playoffs? We looked at the past five years (2014-2019), pulling the top five teams for home runs in each year. Edge colors indicate:

  • Blue Edge: Appeared in playoffs
  • Green Edge: Won the World Series
  • Orange Edge: Neither won the World Series nor appeared in the playoffs
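
The selection and coloring rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the graphs were actually built: the row layout, the function names, and every team name and total below are made-up placeholders.

```python
from collections import defaultdict

# Hypothetical season rows: (year, team, home_runs, made_playoffs, won_world_series).
# Team names and totals are placeholders, not real MLB statistics.
SEASONS = [
    (2018, "Team A", 250, True, False),
    (2018, "Team B", 240, True, True),
    (2018, "Team C", 230, False, False),
    (2018, "Team D", 220, True, False),
    (2018, "Team E", 210, False, False),
    (2018, "Team F", 200, True, False),
]


def edge_color(made_playoffs, won_world_series):
    """Map a season outcome to the edge colors listed above."""
    if won_world_series:
        return "green"
    if made_playoffs:
        return "blue"
    return "orange"


def top_home_run_edges(seasons, top_n=5):
    """Build one colored team-to-statistic edge per top-N team per year."""
    by_year = defaultdict(list)
    for row in seasons:
        by_year[row[0]].append(row)

    edges = []
    for year in sorted(by_year):
        # Rank that year's teams by home runs, highest first.
        ranked = sorted(by_year[year], key=lambda r: r[2], reverse=True)
        for _, team, _, playoffs, champion in ranked[:top_n]:
            edges.append({"team": team, "stat": "HR", "year": year,
                          "color": edge_color(playoffs, champion)})
    return edges
```

With the placeholder rows, `top_home_run_edges(SEASONS)` keeps Team A through Team E and drops Team F, which has the lowest total.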

Here’s the result:

How do other stats affect a team’s success? To find out, we layered in Earned Run Average (ERA), Strikeouts (SO), and Runs Batted In (RBI) one by one.

Home runs + ERA

Home runs + ERA + SO

Home runs + ERA + SO + RBI

Each of these graphs used separation constraints to pull the statistic nodes (home runs, RBI, ERA, SO) away from each other. This adds a customization layer to the standard symmetric layout. In Tom Sawyer Perspectives Designer, a spacing specification in the Model Rule forces the layout to pull a team that is connected only to a single statistic closer to that statistic node. For a team connected to two statistic nodes, the pulling happens in both directions and the team node displays in the middle of the two statistic nodes. Likewise, a team that is connected to multiple stats displays in the middle of the whole graph. By experimenting with the distance between statistic nodes, we can generate a drawing that clusters teams based on the number of nodes to which they connect.
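
To get a feel for the pulling effect, here is a deliberately simplified Python sketch. It is not the Tom Sawyer Perspectives layout algorithm: it assumes the statistic nodes sit at fixed, hypothetical coordinates and simply places each team at the centroid of the statistics it connects to.

```python
# Hypothetical fixed positions for the statistic nodes. Spreading them apart
# stands in for the separation constraints described above.
STAT_POSITIONS = {
    "HR": (0.0, 0.0),
    "ERA": (10.0, 0.0),
    "SO": (10.0, 10.0),
    "RBI": (0.0, 10.0),
}


def place_team(connected_stats, stat_positions=STAT_POSITIONS):
    """Return the centroid of the statistic nodes a team connects to."""
    points = [stat_positions[stat] for stat in connected_stats]
    if not points:
        raise ValueError("a team must connect to at least one statistic")
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    return (x, y)


print(place_team(["HR", "ERA"]))               # (5.0, 0.0): midway between two stats
print(place_team(["HR", "ERA", "SO", "RBI"]))  # (5.0, 5.0): middle of the whole graph
```

In this toy version a team tied to a single statistic lands exactly on that node; a real layout keeps it close without overlapping.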

Earth’s Favorite Pastime

For fun, we used our data to see where MLB players were born. After removing the United States (but including Puerto Rico), this was the result:

We applied a node drawing template that scales each node’s width and height with the number of players born in that country. For additional detail, we expanded the Canada node to show each of the 11 Canadian-born players in the league last year.
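
As a rough idea of what a size rule like this might look like, here is a small Python sketch. The square-root scaling and the `base` and `scale` values are our own assumptions for illustration, not the settings used in the drawing template.

```python
import math


def node_size(player_count, base=20.0, scale=10.0):
    """Return (width, height) for a country node.

    `base` and `scale` are hypothetical values. Square-root scaling is an
    assumption that keeps very large countries from dwarfing the rest.
    """
    if player_count < 0:
        raise ValueError("player_count cannot be negative")
    side = base + scale * math.sqrt(player_count)
    return (side, side)


print(node_size(4))  # (40.0, 40.0)
```

Using the same value for width and height keeps every node square, so only the overall size changes with the player count.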

We can add even more information with an inspector panel to the left of the drawing view, which displays additional stats for a selected node. To help with navigation, we also added a graph overview in the lower right corner that shows a thumbnail of the entire graph.

Armchair Analysis

With the orthogonal layout of the home run graph, we can easily see that of the 21 teams in the top 5, only 2 won the World Series. However, since only 3 teams did NOT make it to the playoffs, home runs do seem to be a big factor in a team’s overall season success. 

Adding ERA is interesting mainly because it doesn’t seem to matter. Our graph shows that in the past five years, no team that performed in the top five for ERA even made it to the division playoffs. Can teams stop worrying about their ERA?

Things perk up when we add in SO. The A’s, Astros, Mariners, and Brewers are being pulled between the home run and strikeout nodes, meaning they have consistently put up good numbers for both. We can also see several playoff appearances and one World Series win in this cluster. The same can’t be claimed by the Rangers and the Orioles. Both teams display in the center of the graph, meaning they performed well in all three areas. Yet all connected edges are orange except the 2016 playoff appearance by the Orioles. Why doesn’t high performance in these areas lead to season success?

Finally, the most striking observation of the final 4-stat graph is the overwhelming amount of orange on the right half. When visualized this way, with orange edges denoting that the team neither won the World Series nor went to the playoffs for that year, SO and ERA seem wholly insignificant. So are they?

One thing is certain: we didn’t notice any of these things until we visualized the data in a graph.

Field of Graphs

2019 marks the 115th year for World Series baseball in the United States. In a year when home run records were smashed with authority, we thought we’d take a look at America’s pastime through the lens of graph visualization. 

Simple trade history of Washington Nationals star pitcher “Mad Max” Scherzer

Over the next several weeks, we’ll ask two fundamental questions about Major League Baseball (MLB):

  1. Who’s on the playoff team rosters?
    Do most players start on other teams? Which players have been traded and from which teams?

  2. What’s with all the homers?
    6,770 home runs were hit during the regular season this year–a new record! Sure, they’re fun to watch, but do they lead teams to the playoffs or the World Series?

Let’s pull some data, make some graphs, and see what we can figure out. Today we’ll:

  • walk through the overarching process, using MLB datasets as our subject
  • drop some knowledge about how the graphs were created
  • show how we used the customization tools resident in Tom Sawyer Perspectives to easily tailor the graphs to show important statistics we can use in our analysis

Come on…let’s play ball!

The Preseason

Before pitching the first graph visualizations, customizations, and algorithms to analyze players, teams, and homers, let’s warm up the data and create a schema.

Schema defines different entity and relationship types and their attributes

A schema is the blueprint that defines the structure of data. It consists of element types, element attributes, and model attributes. To get the statistics we needed, we gathered multiple sources of MLB data. One set of comma-separated values (CSV) files came from Lahman’s Baseball Database. It included pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2014. For the more recent statistics, we used the Retrosheet Encyclopedia dataset. Also imported as a CSV file, this dataset included game stats from 1990 through 2018.
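
If the idea of a schema feels abstract, this small Python sketch may help. It is not the Tom Sawyer Perspectives schema format, and it covers only element types and their attributes; the type names and attribute names are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementType:
    name: str          # e.g. "Player"
    kind: str          # "node" for entities, "edge" for relationships
    attributes: tuple  # attribute names every element of this type carries


# Hypothetical schema; names are illustrative, not taken from the real project.
SCHEMA = {
    "Player": ElementType("Player", "node", ("player_id", "name", "birth_country")),
    "Team": ElementType("Team", "node", ("team_id", "name")),
    "PlayedFor": ElementType("PlayedFor", "edge", ("player_id", "team_id", "year")),
}


def missing_attributes(type_name, record, schema=SCHEMA):
    """Return the schema attributes that a record (a dict) does not provide."""
    element_type = schema[type_name]
    return [attr for attr in element_type.attributes if attr not in record]


print(missing_attributes("Team", {"team_id": "T1"}))  # ['name']
```

Checking records against the schema before drawing is one way to catch rows that are missing a field.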

Here’s a glimpse of the raw CSV dataset:

Let’s see what insights we can gather about player trades (known as transaction history to baseball insiders) from visualizing this data. We generated the graph below from our dataset. Tom Sawyer Perspectives can also load the data into a Neo4j database, for example by reading it from a CSV file or another format and transforming it into a graph structure.
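
For readers curious about what a CSV-to-graph transformation involves, here is a minimal sketch using only the Python standard library. It does not use Tom Sawyer Perspectives or Neo4j, and the three-column layout (`player`, `team`, `year`) and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical CSV layout (player, team, year); the rows are placeholders.
SAMPLE_CSV = """player,team,year
Player One,Team A,2015
Player One,Team B,2018
Player Two,Team B,2017
"""


def csv_to_graph(text):
    """Turn CSV rows into typed nodes and player-to-team edges."""
    nodes = {}  # node label -> node type
    edges = []  # (player, team, year) tuples
    for row in csv.DictReader(io.StringIO(text)):
        nodes[row["player"]] = "player"
        nodes[row["team"]] = "team"
        edges.append((row["player"], row["team"], int(row["year"])))
    return nodes, edges


nodes, edges = csv_to_graph(SAMPLE_CSV)
print(len(nodes), len(edges))  # 4 3
```

Each row becomes one edge, while repeated players and teams collapse into a single node each.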

Initial graph drawing of player transaction history from 1871 through 2018 showing over 4,000 edges

Oof! That’s obviously too much data to throw at fans or managers. This graph shows hundreds of team and player nodes and thousands of relationship edges that reflect countless trades. To gain focus, we’ll filter to a subset of data showing only transaction histories for players on the final four teams in the 2019 playoffs. First, let’s narrow the field to only one year and one team: the 2019 Washington Nationals.
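
Narrowing the data this way amounts to a two-step filter: find the players on one team’s roster for one year, then keep every edge that belongs to those players. The Python sketch below shows the idea with hypothetical `(player, team, year)` edges; it is an illustration, not the filter we configured in the product.

```python
# Hypothetical (player, team, year) edges; a player can appear with several teams.
EDGES = [
    ("Player One", "Team A", 2015),
    ("Player One", "Team B", 2019),
    ("Player Two", "Team B", 2019),
    ("Player Three", "Team C", 2019),
]


def roster_history(edges, team, year):
    """Keep every edge belonging to a player who was on `team` in `year`."""
    roster = {player for player, t, y in edges if t == team and y == year}
    return [edge for edge in edges if edge[0] in roster]


for edge in roster_history(EDGES, "Team B", 2019):
    print(edge)
```

Player One’s earlier stint with Team A survives the filter, which is what lets the drawing show where each player came from.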

Transaction history of players on 2019 Washington Nationals roster, no formatting

Phew…much better! We can now clearly see the transaction history of all of the players on the team. While a number of players seem to have started with the Nationals (likely from their minor league team), many arrived on the team through a series of trades from other teams.

The Uniform

What about the look and feel of the graph drawing? Luckily, Tom Sawyer Perspectives makes it easy for us to dress up the drawing elements on the graph, so we did. Here, we added team logos for all team nodes and changed the formatting of the player nodes.

Transaction history of players on 2019 Washington Nationals roster, with formatting

Small changes—big impact. The Perspectives Designer allows us to easily control hundreds of seemingly tiny attributes in a drawing. From the drawing template, we used the Preserve Aspect option for the Resizability setting to maintain a consistent node size throughout the drawing.

The Shape Clipping setting ensures that edges are clipped tightly and precisely to the edge of nodes, no matter what their shape. And Background Color, Border Color, and Text Color all helped improve the readability of the graph.

Precise shape clipping ensures edges are clipped neatly—even around uniquely shaped nodes like the Rays and Dodgers logos

Since Tom Sawyer Perspectives provides multiple layouts out of the box, we can easily view the graph multiple ways to figure out which layout is the most useful. Although the circular and spring-embedded layouts seem to be the most popular, we found orthogonal to be the easiest to read for this graph.

Here’s circular:

And here’s symmetric:

For transaction history graphs, the orthogonal layout works best. Orthogonal uses predominantly horizontal and vertical edges, allowing us to make a very compact drawing. With cleaner lines, we can easily see connections and relationships between players and teams, both present and past. In baseball, as with other pro sports, the comings and goings of players can not only cause fan emotions to run high, but can also make or break a team. The orthogonal layout allows us to clearly see the family history of these teams.

Now that we’ve landed on a graph layout style and format for the elements in the graph, we’ll recreate it for the other three teams in the finals. Check them out! Are you surprised to see where some of your favorite players got their start?

Transaction history of players on 2019 Houston Astros roster
Transaction history of players on 2019 New York Yankees roster
Transaction history of players on 2019 St. Louis Cardinals roster

The Final Score

Did we make unseen connections in our data?

Looking at the Washington Nationals team graph, it’s interesting to see so many players coming from just a few teams. The Arizona Diamondbacks look almost like a farm team, showing five of their former players now with the Nationals—including the infamous “Mad Max” Scherzer!

The visualization also makes it easier to see how many players have never played elsewhere. For the Yankees, Astros, and Nationals, fewer than 25% of the players on each 40-man roster have no experience with other clubs. The 2019 Cardinals, on the other hand, show more than 42% with no outside MLB experience. Could this be a clue for managers? Do the graphs show that the Cardinals did a poor job adding new talent throughout the year? We aren’t sure what the ultimate conclusion is, but we are sure of this: we didn’t make the connection until we saw the data in a graph.
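
The percentages above come down to simple arithmetic: count the players whose only club is the current one and divide by the roster size. Here is a small Python sketch of that calculation. The roster below is hypothetical and far smaller than a real 40-man roster.

```python
# Hypothetical roster: player -> set of MLB clubs the player has appeared for.
ROSTER = {
    "Player One": {"Team B"},
    "Player Two": {"Team A", "Team B"},
    "Player Three": {"Team B", "Team C"},
    "Player Four": {"Team B", "Team D"},
}


def single_club_share(roster, team):
    """Return the percentage of players who have only ever played for `team`."""
    if not roster:
        raise ValueError("roster is empty")
    single_club = sum(1 for clubs in roster.values() if clubs == {team})
    return 100.0 * single_club / len(roster)


print(single_club_share(ROSTER, "Team B"))  # 25.0
```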

Next Week: Knocking ‘Em Out of the Park

To say that home runs are off the charts this season would be the cliché of all clichés! Players broke records. Teams broke records. Why? Were walls moved? Are the balls the cause? The Internet is rife with theories.

But we wonder, does it matter? Do all of these home runs lead teams to the playoffs or even to World Series wins? The Oakland A’s Billy Beane (famously played by Brad Pitt in Moneyball) asked these questions years ago—but he had nothing but boring spreadsheets to analyze. We’re going to visualize the data! Here’s a sneak peek of the top five home run teams for the past ten years:

Blue edge shows playoff teams, green edge shows World Series champs, black edge shows top five home run teams that did not make the playoffs