Rounding the Bases

We can’t quit you, baseball! The season might be over, but we want more. So, we’re dipping into the baseball data to see what else we can learn. Read on for one more run around the bases!

Put Me In, Coach

This season, all anyone talked about was home runs. There were 6,770 homers hit during the regular season this year. That’s 665 MORE than the previous record! And exactly half of the teams in the league set franchise home run records. Holy homer! 

But, do all these home runs lead teams to the playoffs? We looked at the past five years (2014-2019), pulling the top five teams for home runs in each year. Edge colors indicate:

  • Blue Edge: Appeared in playoffs
  • Green Edge: Won the World Series
  • Orange Edge: Neither won nor appeared in the playoffs

Here’s the result:

How do other stats affect a team’s success? To find out, we layered in Earned Run Average (ERA), Strikeouts (SO), and Runs Batted In (RBI) one by one.

Home runs + ERA

Home runs + ERA + SO

Home runs + ERA + SO + RBI

Each of these graphs used separation constraints to pull the statistic nodes (home runs, RBI, ERA, SO) away from each other. This adds a customization layer to the standard symmetric layout. In Tom Sawyer Perspectives Designer, a spacing specification in the Model Rule forces the layout to pull a team that is connected only to a single statistic closer to that statistic node. For a team connected to two statistic nodes, the pulling happens in both directions and the team node displays in the middle of the two statistic nodes. Likewise, a team that is connected to multiple stats displays in the middle of the whole graph. By experimenting with the distance between statistic nodes, we can generate a drawing that clusters teams based on the number of nodes to which they connect.
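To get a feel for why teams land where they do, here is a rough sketch of the pulling effect described above. It is not the Tom Sawyer Perspectives layout algorithm. It simply assumes the four statistic nodes are pinned at hypothetical corner positions and that each team settles at the average position of the statistics it connects to, which is where equal-strength springs would balance. The names `STAT_ANCHORS` and `place_team` are made up for this illustration.

```python
from statistics import fmean

# Hypothetical fixed positions for the four statistic nodes, spread apart
# the way a separation constraint would keep them.
STAT_ANCHORS = {
    "HR": (-1.0, 1.0),
    "ERA": (1.0, 1.0),
    "SO": (1.0, -1.0),
    "RBI": (-1.0, -1.0),
}


def place_team(connected_stats, anchors=STAT_ANCHORS):
    """Return the balance point of a team pulled equally by each connected stat."""
    if not connected_stats:
        raise ValueError("a team must connect to at least one statistic node")
    xs = [anchors[stat][0] for stat in connected_stats]
    ys = [anchors[stat][1] for stat in connected_stats]
    return (fmean(xs), fmean(ys))


one_stat = place_team(["HR"])                        # sits on the HR node
two_stats = place_team(["HR", "ERA"])                # midway between HR and ERA
all_stats = place_team(["HR", "ERA", "SO", "RBI"])   # center of the drawing
```

In this simplified model a single-stat team lands exactly on its statistic node; a real layout also keeps nodes from overlapping, so the team would sit near the node rather than on top of it.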

Earth’s Favorite Pastime

For fun, we used our data to see where MLB players were born. After removing the United States (but including Puerto Rico), this was the result:

We applied a node drawing template that scales each node’s width and height with the number of players born in that country. For additional detail, we expanded the Canada node to show each of the 11 Canadian-born players in the league last year.
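The sizing rule can be pictured with a few lines of code. This is only a sketch of the idea, not the node drawing template itself: the function name `node_dimensions`, the linear formula, and the base, per-player, and maximum sizes are all assumptions chosen for illustration.

```python
def node_dimensions(player_count, base=20, per_player=4, max_size=200):
    """Return (width, height) for a country node, growing with player count.

    The linear growth and the size cap are illustrative assumptions.
    """
    if player_count < 0:
        raise ValueError("player_count cannot be negative")
    size = min(base + per_player * player_count, max_size)
    return (size, size)


# Canada had 11 players in the text above.
canada = node_dimensions(11)  # (64, 64)
```

Capping the size keeps a country with a very large number of players from crowding out the rest of the drawing.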

Even more information can be added by including an inspector panel to the left of the drawing view, which displays additional stats on a selected node. To help with navigation, we also added a graph overview in the lower right corner that shows a thumbnail of the entire graph.

Armchair Analysis

With the orthogonal layout of the home run graph, we can easily see that of the 21 teams in the top 5, only 2 won the World Series. However, since only 3 teams did NOT make it to the playoffs, home runs do seem to be a big factor in a team’s overall season success. 

Adding in ERA adds interest only in that it doesn’t seem to matter. Our graph shows that in the past five years, no team that performed in the top five for ERA even made it to the division playoffs. Can teams stop worrying about their ERA?

Things perk up when we add in SO. The A’s, Astros, Mariners, and Brewers are being pulled between the home run and strikeout nodes, meaning they have consistently put up good numbers for both. We can also see several playoff appearances and one World Series win in this cluster. The same can’t be claimed by the Rangers and the Orioles. Both teams display in the center of the graph, meaning they performed well in all three areas. Yet all connected edges are orange except the 2016 playoff appearance by the Orioles. Why doesn’t high performance in these areas lead to season success?

Finally, the most striking observation of the final 4-stat graph is the overwhelming amount of orange on the right half. When visualized this way, with orange edges denoting that the team neither won the World Series nor went to the playoffs for that year, SO and ERA seem wholly insignificant. So are they?

One thing is certain: we didn’t notice any of these things until we visualized the data in a graph.

Field of Graphs

2019 marks the 115th year for World Series baseball in the United States. In a year when home run records were smashed with authority, we thought we’d take a look at America’s pastime through the lens of graph visualization. 

Simple trade history of Washington Nationals star pitcher “Mad Max” Scherzer

Over the next several weeks, we’ll ask two fundamental questions about Major League Baseball (MLB):

  1. Who’s on the playoff team rosters?
    Do most players start on other teams? Which players have been traded and from which teams?

  2. What’s with all the homers?
    6,770 home runs were hit during the regular season this year–a new record! Sure, they’re fun to watch, but do they lead teams to the playoffs or the World Series?

Let’s pull some data, make some graphs, and see what we can figure out. Today we’ll:

  • walk through the overarching process, using MLB datasets as our subject
  • drop some knowledge about how the graphs were created
  • show how we used the customization tools built into Tom Sawyer Perspectives to easily tailor the graphs to show important statistics we can use in our analysis

Come on…let’s play ball!

The Preseason

Before pitching the first graph visualizations, customizations, and algorithms to analyze players, teams, and homers, let’s warm up the data and create a schema.

Schema defines different entity and relationship types and their attributes

A schema is the blueprint that defines the structure of data. It consists of element types, element attributes, and model attributes. To get the statistics we needed, we gathered multiple sources of MLB data. One set of comma-separated values (CSV) files came from Lahman’s Baseball Database. It included pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2014. For the more recent statistics, we used the Retrosheet Encyclopedia dataset. Also imported as a CSV file, this included game stats from 1990 through 2018.

Here’s a glimpse of the raw CSV dataset:

Let’s see what insights we can gather about player trades (known as transaction history to baseball insiders) from visualizing this data. We generated the graph below from our dataset. Tom Sawyer Perspectives can also load the data into a Neo4j database, for example, by reading the data from a CSV, or other formats, and doing a data transformation to a graph structure.

Initial graph drawing of player transaction history from 1871 through 2018 showing over 4,000 edges

Oof! That’s obviously too much data to throw at fans or managers. This graph shows hundreds of team and player nodes and thousands of relationship edges that reflect countless trades. To gain focus, we’ll filter to a subset of data showing only transaction histories for players on the final four teams in the 2019 playoffs. First, let’s narrow the field to only one year and one team: the 2019 Washington Nationals.
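If you want to try the same two steps at home, turning CSV rows into a graph and then filtering to one roster, a minimal sketch might look like the following. It uses only the Python standard library rather than Tom Sawyer Perspectives. The column names (`player`, `from_team`, `to_team`, `year`) and the placeholder rows are assumptions, since the real Lahman and Retrosheet files have their own headers.

```python
import csv
import io

# Hypothetical column names and placeholder rows, not real MLB data.
SAMPLE_CSV = """player,from_team,to_team,year
Player A,Team X,Team Y,2015
Player A,Team Y,Team Z,2018
Player B,Team X,Team Z,2017
Player C,Team W,Team Y,2016
"""


def load_transactions(csv_file):
    """Return (nodes, edges): typed player/team nodes and one edge per move."""
    nodes, edges = set(), []
    for row in csv.DictReader(csv_file):
        player, src, dst = row["player"], row["from_team"], row["to_team"]
        nodes.update({("player", player), ("team", src), ("team", dst)})
        edges.append((player, src, dst, int(row["year"])))
    return nodes, edges


def filter_to_roster(edges, roster):
    """Keep only the transactions of players on the given roster."""
    return [edge for edge in edges if edge[0] in roster]


nodes, edges = load_transactions(io.StringIO(SAMPLE_CSV))

# Roster membership would come from the roster data; here it is assumed.
team_z_roster = {"Player A", "Player B"}
focused = filter_to_roster(edges, team_z_roster)
```

Filtering by player rather than by team keeps a player’s whole path to the roster, including earlier moves between other clubs.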

Transaction history of players on 2019 Washington Nationals roster, no formatting

Phew…much better! We can now clearly see the transaction history of all of the players on the team. While a number of players seem to have started with the Nationals (likely from their minor league team), many arrived on the team through a series of trades from other teams.

The Uniform

What about the look and feel of the graph drawing? Luckily, Tom Sawyer Perspectives makes it easy for us to dress up the drawing elements on the graph, so we did. Here, we added team logos for all team nodes and changed the formatting of the player nodes.

Transaction history of players on 2019 Washington Nationals roster, with formatting

Small changes—big impact. The Perspectives Designer allows us to easily control hundreds of seemingly tiny attributes in a drawing. From the drawing template, we used the Preserve Aspect option for the Resizability setting to maintain a consistent node size throughout the drawing.

The Shape Clipping setting ensures that edges are clipped tightly and precisely to the edge of nodes, no matter what their shape. And Background Color, Border Color, and Text Color all helped improve the readability of the graph.

Precise shape clipping ensures edges are clipped neatly—even around uniquely shaped nodes like the Rays and Dodgers logos

Since Tom Sawyer Perspectives provides multiple layouts out of the box, we can easily view the graph multiple ways to figure out which layout is the most useful. Although the circular, spring embedded layout seems to be the most popular, we found orthogonal to be the easiest to read for this graph.

Here’s circular:

And here’s symmetric:

For transaction history graphs, the orthogonal layout works best. It uses predominantly horizontal and vertical edges, allowing us to make a very compact drawing. With cleaner lines, we can easily see connections and relationships between players and teams, both present and past. In baseball, as with other pro sports, the comings and goings of players can not only cause fan emotions to run high, but can sometimes make or break a team. The orthogonal layout allows us to clearly see the family history of these teams.

Now that we’ve landed on a graph layout style and format for the elements in the graph, we’ll recreate it for the other three teams in the finals. Check them out! Are you surprised to see where some of your favorite players got their start?

Transaction history of players on 2019 Houston Astros roster
2019 Transaction History of the New York Yankees
Transaction history of players on 2019 St. Louis Cardinals roster

The Final Score

Did we make unseen connections in our data?

Looking at the Washington Nationals team graph, it’s interesting to see so many players coming from just a few teams. The Arizona Diamondbacks look almost like a farm team, showing five of their former players now with the Nationals—including the infamous “Mad Max” Scherzer!

The visualization also makes it easier to see how many players have never played elsewhere. For the Yankees, Astros, and Nationals, fewer than 25% of the players on each 40-man roster have no experience with other clubs. The 2019 Cardinals, on the other hand, show more than 42% with no outside MLB experience. Could this be a clue for managers? Do the graphs show that the Cardinals did a poor job adding new talent throughout the year? We aren’t sure what the ultimate conclusion is, but we are sure of this: we didn’t make the connection until we saw the data in a graph.
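The percentages above boil down to a simple count. Here is a sketch of that arithmetic, assuming each player is mapped to the set of clubs he played for before joining the roster. The function name `homegrown_share` and the placeholder players are hypothetical and are not taken from our dataset.

```python
def homegrown_share(previous_clubs_by_player):
    """Return the percent of players with no previous clubs."""
    if not previous_clubs_by_player:
        raise ValueError("roster is empty")
    homegrown = sum(1 for clubs in previous_clubs_by_player.values() if not clubs)
    return 100.0 * homegrown / len(previous_clubs_by_player)


# Placeholder roster: one of four players has never played elsewhere.
roster = {
    "Player A": set(),
    "Player B": {"Team X"},
    "Player C": {"Team X", "Team Y"},
    "Player D": {"Team W"},
}
share = homegrown_share(roster)  # 25.0
```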

Next Week: Knocking ‘Em Out of the Park

To say that home runs are off the charts this season would be the cliché of all clichés! Players broke records. Teams broke records. Why? Were walls moved? Are the balls the cause? The Internet is rife with theories.

But we wonder, does it matter? Do all of these home runs lead teams to the playoffs or even to World Series wins? The Oakland A’s Billy Beane (famously played by Brad Pitt in Moneyball) asked these questions years ago—but he had nothing but boring spreadsheets to analyze. We’re going to visualize the data! Here’s a sneak peek of the top five home run teams for the past ten years:

Blue edge shows playoff teams, green edge shows World Series champs, black edge shows top five home run team who did not make the playoffs

AWS Amazon Neptune Team and Tom Sawyer Software Integrate New Graph Technologies

At AWS re:Invent, we revealed our partnership with Amazon Web Services (AWS) that supports the integration of Tom Sawyer Graph Database Browser and Tom Sawyer Perspectives with Amazon Neptune. Customers can take advantage of the integration and build applications with Perspectives for Amazon’s newest cloud-based graph database service, Neptune.

Amazon Neptune is a fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. Our flagship product, Tom Sawyer Perspectives, is designed specifically for building and deploying web applications that visualize and analyze highly related datasets. The two products are tightly integrated.

Tom Sawyer Perspectives now supports Amazon Neptune's fully-managed graph database service.

Tom Sawyer Graph Database Browser makes it easy for anyone to load data into Amazon Neptune and start analyzing the graphs. The application is the first and only to visualize and analyze data in Amazon Neptune as well as automatically execute native Neptune load commands to import data directly from Amazon Simple Storage Service (S3) into the database.

Combined with Amazon Neptune, this deep integration provides a convenient, easy-to-use, fast, and scalable cloud-based graph database and visualization solution.

Want to try visualizing AWS Amazon Neptune Data?

It’s easy to test drive the Tom Sawyer Graph Database Browser using a free trial on AWS Marketplace. It works with other graph databases too.

The Graph Database Browser includes a number of unique capabilities that go beyond a typical database browser to enhance the AWS Amazon Neptune user experience. Below are just a few.


Pioneer in Graph Analytics Technology Talks about Trends and Why Graphs Are Going Mainstream

Dr. Ioannis Tollis, Ph.D., Professor, University of Crete

I recently learned about Dr. Ioannis Tollis’ impressive career in graph analytics and computer science. He is a pioneer in graph analytics technology and a professor at the University of Crete. Dr. Tollis spoke with us about exciting developments and trends in the field. Read on to find out more.

You’ve been a leader in the global graph analytics technology community since before it hit the mainstream. What are some of the trends you’ve noticed over time?

There are many trends in graph analytics technology that appeared in the past. Some have disappeared, while others are still with us today.

The most recent graph analytics technology trend deals with large or huge graphs.

Beta Release of Tom Sawyer Perspectives, Version 8.0, Java Edition is Now Available

We are excited to announce that you can now download the Beta release of Tom Sawyer Perspectives, Version 8.0, Java Edition.

This version represents a strategic upgrade to our core graph and data visualization architecture and adds new enterprise solutions for systems engineers and architects, and business process analysts and users. With an underlying Spring architecture, this version lays the foundation for further innovation and enterprise-wide deployment.

The Beta release is your chance for early access to powerful new capabilities and solutions, and to influence the future direction of Tom Sawyer Perspectives.

Version 8.0 Delivers Web-based, Automated Business Process Modeling (BPM) and Model-based Engineering (MBE) Solutions Built on Tom Sawyer Perspectives

Our team members across three continents have invested more than 50,000 hours designing, developing, and testing to bring this new version to you.


Automating Swimlane Diagrams

Swimlane diagrams give context to processes. If you haven’t used them, swimlane diagrams combine the logic of a flow chart with the accountability of an action item list. They show who or what is responsible for each step in a process. Because people and processes are continually changing, automating swimlane diagrams saves you hours of repetitive labor in rearranging and updating content.

Automation is easier said than done.