
Using big data to accurately predict death in the Game of Thrones series

Spoiler warning: The following article contains spoilers from the “A Song of Ice and Fire” book series as well as the “Game of Thrones” TV series

Most predictive analysis doesn’t require a spoiler warning. That is, until a data scientist uses it to accurately predict the death of characters in one of today’s most talked-about pop culture productions.

Clement Fredembach, data scientist and “dreamer” at Teradata, hosted a panel at Teradata PARTNERS 2016 to complement a blog post in which he described a way to use data to accurately predict character deaths in one of today’s most unpredictable works of literature. Inspired by the social networks created in Florence during the reign of the Medici, Fredembach took a similar approach, linking characters together to determine the likelihood of their demise.

Teradata PARTNERS 2016

He read each book a minimum of four times to become familiar with the subject material.

“It is important to become an expert on the subject material. We care about just the text, so there is no real validation except expert knowledge from the reader/author,” Fredembach said. “Unfortunately, G.R.R. Martin was not available for comment, so you are getting me for subject expert.”

The numbers of Game of Thrones

  • Fredembach used information from the first five “A Song of Ice and Fire” books. (“Game of Thrones” is only book one, and the title of the television series.)
    • Both the books and TV series require about 50 hours to read/watch and contain about the same information. Text, however, is much easier to analyze and manage than video, according to Fredembach’s blog post. Moreover, the books correspond to G.R.R. Martin’s original story, while adaptations always rewrite the material to some extent.
  • The five books also correspond to the first five TV seasons, with some variations (the TV series is further along in the story, but omits a number of secondary characters for clarity and budgetary reasons).
  • GRRM’s series contains a total of around 1.4 million words.
  • “Game of Thrones” has become famous because any character is “fair game” and can die at any moment, regardless of how prominent he or she was up to that point.
  • GoT’s complexity is actually an advantage: instead of relying on constant pronoun references, the books usually spell out full character names for clarity. Moreover, the name of the point-of-view (PoV) character, presented at the beginning of each chapter, is frequently mentioned throughout that chapter.

Not just about knowing before everyone else

Fredembach attempts to answer the question:

“Are those deaths the result of whims of the author, or are they part of a longer constructed narrative?”

According to the data scientist, text IS structured. It is written language, and therefore it must be. Because of this, there must be a way to extract, and perhaps even predict, events.

How was the experiment set up?

The Teradata “dreamer” turned to network graphs, starting with raw text data. He:

  • Automatically created a social network of characters for any time period of the story.
    • Loaded e-book files into a Teradata Aster database and separated them into chapters, preserving the entire text content.
  • Analyzed the structure and story lines of individual books through network relations.
  • Measured the importance and position of characters throughout the story based on their network centrality and graph visualizations.
  • Accurately predicted major character deaths with a Loopy Belief Propagation (LBP) algorithm over the social network.

All of the findings required only the text data and a list of character names. In spite of the experiment’s simplicity, the resulting networks accurately predicted major character deaths.
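As a rough illustration of the first step, here is a minimal Python sketch of splitting raw book text into chapters. It is not the Teradata Aster workflow Fredembach used. It assumes each chapter opens with a line containing only the PoV character's name in capitals, and `POV_NAMES` and `split_chapters` are hypothetical names chosen for this example.

```python
# Hypothetical set of PoV chapter headings; a real run would list every PoV character.
POV_NAMES = {"ARYA", "SANSA", "JON", "EDDARD"}

def split_chapters(text, pov_names=POV_NAMES):
    """Split raw book text into (pov, chapter_text) pairs.

    Assumption: a chapter starts at any line that holds only a PoV name.
    Text before the first heading is ignored.
    """
    chapters = []
    pov, lines = None, []
    for line in text.splitlines():
        heading = line.strip()
        if heading in pov_names:
            if pov is not None:
                chapters.append((pov, "\n".join(lines).strip()))
            pov, lines = heading, []
        elif pov is not None:
            lines.append(line)
    if pov is not None:
        chapters.append((pov, "\n".join(lines).strip()))
    return chapters

sample = "ARYA\nArya ran after Jon.\n\nSANSA\nSansa smiled at Arya.\n"
chapters = split_chapters(sample)
print(chapters)
```

Keeping the full chapter text alongside the PoV name mirrors the article's point that the entire text content was preserved for later analysis.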

Mapping social networks without Facebook or Twitter

Because GoT is written as a series of “Point of View” chapters, with each chapter told from a character’s perspective, Fredembach estimated whose story is effectively being told over the books by counting the number of words devoted to the perspective of each PoV character.
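A minimal sketch of that word count, assuming the chapters are already available as `(pov, text)` pairs and that splitting on whitespace is a good enough definition of a word. The sample chapters are invented for illustration.

```python
from collections import Counter

def words_per_pov(chapters):
    """Sum the whitespace-separated word count of every chapter under its PoV character."""
    totals = Counter()
    for pov, text in chapters:
        totals[pov] += len(text.split())
    return totals

# Invented sample data, not taken from the books.
chapters = [
    ("Arya", "Arya ran through the yard"),
    ("Jon", "Jon climbed the Wall at dawn"),
    ("Arya", "Needle was hidden"),
]
totals = words_per_pov(chapters)
for pov, count in totals.most_common():
    print(pov, count)
```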

In each chapter, the number of occurrences of every character name was counted to create a link between the point-of-view character and the characters whose names occur in the chapter. For example, in an “Arya” PoV chapter, if “Arya” is mentioned 20 times, “Sansa” 12 times and “Jon” 11 times, the result would be a link from Arya to Sansa and a link from Arya to Jon, built from those counts.

Using this information, the data scientist created social networks from text information by finding mentions of 102 characters throughout the books. This includes common nicknames from the book appendixes, for example: Kingslayer or Khaleesi.
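The counting step might look something like the sketch below. The alias table is a tiny illustrative subset rather than the 102-character list. Using the raw mention count as the link weight is an assumption of this example, and `ALIASES`, `count_mentions` and `chapter_edges` are hypothetical names.

```python
import re
from collections import Counter

# Illustrative alias table (nickname -> canonical name); not the full character list.
ALIASES = {
    "Arya": "Arya",
    "Sansa": "Sansa",
    "Jon": "Jon",
    "Jaime": "Jaime",
    "Kingslayer": "Jaime",
    "Daenerys": "Daenerys",
    "Khaleesi": "Daenerys",
}

# Whole-word match on any known name or nickname; longer names are tried first.
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, ALIASES), key=len, reverse=True)) + r")\b"
)

def count_mentions(text):
    """Count mentions per canonical character, folding nicknames into one name."""
    return Counter(ALIASES[m] for m in _PATTERN.findall(text))

def chapter_edges(pov, text):
    """Link the PoV character to every other character mentioned in the chapter.

    Assumption: the link weight is simply the number of mentions.
    """
    mentions = count_mentions(text)
    return {(pov, name): n for name, n in mentions.items() if name != pov}

# Reproduces the article's "Arya" chapter example with synthetic text.
text = "Arya " * 20 + "Sansa " * 12 + "Jon " * 11
edges = chapter_edges("Arya", text)
print(edges)  # {('Arya', 'Sansa'): 12, ('Arya', 'Jon'): 11}
```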

source: Teradata

While it may look like flight data, the graphic above is the social network graph of “A Game of Thrones,” the first book of the “A Song of Ice and Fire” series. Here are guidelines to help you understand what is going on:

  • The node color corresponds to the output of the modularity algorithm (graph-based clustering of social communities).
  • Node size represents the number of mentions of a character.
  • Edge thickness and color represent the relative strength of connections between characters (thicker and redder = stronger).
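To give a sense of what a modularity-based clustering optimizes, the sketch below scores a given grouping of characters with the standard modularity measure Q. It only evaluates a partition and does not search for the best one, which is what the tool behind the graphic would have done. The toy graph and its unit weights are made up.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a partition of an undirected weighted graph.

    edges: {(u, v): weight}, each undirected edge listed once, no self-loops.
    community: {node: community label}.
    """
    total = sum(edges.values())      # total edge weight m
    inside = defaultdict(float)      # edge weight falling inside each community
    degree = defaultdict(float)      # summed weighted degree of each community
    for (u, v), w in edges.items():
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            inside[community[u]] += w
    return sum(inside[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)

# Made-up toy graph: two tight groups joined by a single link.
edges = {
    ("Eddard", "Robert"): 1, ("Robert", "Cersei"): 1, ("Eddard", "Cersei"): 1,
    ("Daenerys", "Drogo"): 1, ("Drogo", "Viserys"): 1, ("Daenerys", "Viserys"): 1,
    ("Robert", "Daenerys"): 1,
}
split = {"Eddard": "A", "Robert": "A", "Cersei": "A",
         "Daenerys": "B", "Drogo": "B", "Viserys": "B"}
merged = dict.fromkeys(split, "A")

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(q_split, q_merged)
```

The split into two groups scores higher than lumping everyone together, which is the signal a modularity algorithm uses when it assigns node colors.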

Book one shows large purple clusters centered on the events at King’s Landing, with strong links between Eddard, King Robert and the Small Council. The large red cluster concerns the events in the North and the Riverlands, with a strongly connected Stark family and Tyrion’s capture later in the book. Jon’s move to the Wall gives him his own cyan cluster, while the events across the sea are correctly represented by the small but strongly connected green cluster of the Targaryens and the Dothraki.

 

source: Teradata

The image above shows the graph of “A Clash of Kings,” the second book of the “A Song of Ice and Fire” series.

The second book’s main story relates events in King’s Landing (cyan) with Tyrion, Sansa and Joffrey, as well as the “war of the kings” involving the Starks and Baratheons (yellow). A number of smaller, individual clusters appear around Bran in Winterfell (red) and around Arya’s escape (green), detached from the “main story”. The Wall (purple) and Slaver’s Bay (blue) still have their individual, independent clusters.

source: Teradata

This third image shows the social network of “A Storm of Swords,” the third book of the series. The book continues the evolution towards smaller, more separated communities, with Arya and Bran having their own story lines that intersect little with the smaller main story. A notable change is the growth of the Wall cluster with the arrival of Stannis in the North (pink).

source: Teradata

Next is “A Feast for Crows,” the fourth book of the series. The network takes a distinct shape, with isolated communities and the appearance of new characters. The book centers on Cersei and Jaime, whereas Daenerys and the Wall are not part of book four.

source: Teradata

“A Dance with Dragons” is the fifth and most recent novel. It continues to illustrate the change in network structure, with fewer links overall and a more central structure around its main characters. Tyrion has left Westeros and now appears in Slaver’s Bay. The proliferation of “side stories” and the introduction of main characters have stretched the network into a series of individuals.

Character importance, another essential variable

Mapping out all of the social networks for each book was the first major step in predicting who would die next in the “A Song of Ice and Fire” novels. The next was to determine how important each character is.

source: Teradata

Here is a graph of the “betweenness” (how much information transits through a node) and “closeness” (how central a node is within its local network) values for characters across the books. As you can see, book one is about Eddard, book two is about Tyrion and Joffrey, and book three is the best balanced. Book four is Lannister-centric, while the last is all about the North. The increase of betweenness in book five is a direct result of a more decentralized network, as seen in the graphs. Characters that travel or act as gateways to their communities score high on betweenness, while characters central to their communities have a high closeness.
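For readers who want to see how the two measures behave, here is a small pure-Python sketch on an invented five-character network. It ignores edge weights and treats every link as one hop, which is a simplification and not necessarily how Fredembach computed his values.

```python
from collections import deque

def closeness(graph, source):
    """Number of reachable nodes divided by the summed hop distance from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalised)."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        order = []                           # nodes in BFS visiting order
        preds = {v: [] for v in graph}       # predecessors on shortest paths
        sigma = dict.fromkeys(graph, 0)      # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = dict.fromkeys(graph, 0.0)
        for v in reversed(order):
            for u in preds[v]:
                delta[u] += sigma[u] / sigma[v] * (1 + delta[v])
            if v != s:
                score[v] += delta[v]
    # Every unordered pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}

# Invented network: Tyrion links a Lannister group to a Stark group.
graph = {
    "Cersei": ["Jaime", "Tyrion"],
    "Jaime": ["Cersei", "Tyrion"],
    "Tyrion": ["Cersei", "Jaime", "Catelyn"],
    "Catelyn": ["Tyrion", "Robb"],
    "Robb": ["Catelyn"],
}
bet = betweenness(graph)
print(bet["Tyrion"], bet["Catelyn"], closeness(graph, "Tyrion"))
```

In this toy network the "gateway" character scores highest on betweenness, which matches the article's description of characters who connect communities.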

Ground rules:

Being able to calculate social networks at any point of the story allowed Teradata to monitor the state of the social network up to the point of major deaths and determine whether the character dying was a surprise, or whether it was determined by the network structure and previous deaths.
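One simple way to get the network "at any point of the story" is to accumulate per-chapter links up to a chosen chapter. The sketch below assumes link weights add up across chapters, which is an assumption of this example. `network_up_to` is a hypothetical helper, and the chapter data is invented.

```python
from collections import Counter

def network_up_to(chapter_edges, last_chapter):
    """Sum edge weights over chapters 0..last_chapter (inclusive)."""
    network = Counter()
    for edges in chapter_edges[: last_chapter + 1]:
        network.update(edges)   # Counter.update adds counts for existing keys
    return network

# Invented per-chapter links, in reading order.
chapter_edges = [
    {("Eddard", "Robert"): 9},
    {("Eddard", "Robert"): 4, ("Eddard", "Cersei"): 6},
    {("Catelyn", "Robb"): 7},
]
snapshot = network_up_to(chapter_edges, 1)
print(dict(snapshot))
```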

To predict character deaths, Fredembach employed a Loopy Belief Propagation (LBP) algorithm, which takes ground-truth beliefs about a character’s status (known information, with 0 = alive and 1 = dead) and propagates them throughout the graph according to the weight of each edge.

The prediction rules are:

  • Jon Arryn, Viserys and Robert Baratheon are “dead” (i.e., no predictions are made before Robert’s death because there is not enough information).
  • When characters die (major or not) we add them to the “dead” ground truth.
  • Daenerys and Jon Snow are alive (our only “alive” ground truth).
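To make the idea concrete, here is a minimal sum-product loopy belief propagation sketch on a binary (alive/dead) model. The potentials, the 0.99/0.01 evidence strength, the edge weights and the tiny example network are all assumptions chosen for illustration. The details of the implementation Fredembach ran are not described in the article.

```python
def loopy_bp(edges, evidence, iterations=50):
    """Sum-product loopy belief propagation on a binary pairwise model.

    edges: {(u, v): weight in (0, 1]}, undirected, each edge listed once.
    evidence: {node: 0 or 1} ground truth (0 = alive, 1 = dead).
    Returns {node: belief that the node is dead}.
    """
    neighbours, strength = {}, {}
    for (u, v), w in edges.items():
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)
        strength[(u, v)] = strength[(v, u)] = w

    def prior(node):
        # Assumed soft evidence: ground truth is trusted at 99%.
        if node in evidence:
            return (0.99, 0.01) if evidence[node] == 0 else (0.01, 0.99)
        return (0.5, 0.5)

    def compat(u, v, xu, xv):
        # Assumed potential: stronger edges make neighbours likelier to share a state.
        same = 0.5 + 0.49 * strength[(u, v)]
        return same if xu == xv else 1.0 - same

    msgs = {(u, v): (0.5, 0.5) for u in neighbours for v in neighbours[u]}
    for _ in range(iterations):
        new = {}
        for (u, v) in msgs:
            out = []
            for xv in (0, 1):
                total = 0.0
                for xu in (0, 1):
                    term = prior(u)[xu] * compat(u, v, xu, xv)
                    for k in neighbours[u]:
                        if k != v:
                            term *= msgs[(k, u)][xu]
                    total += term
                out.append(total)
            z = out[0] + out[1]
            new[(u, v)] = (out[0] / z, out[1] / z)
        msgs = new  # synchronous update of all messages

    beliefs = {}
    for node in neighbours:
        b = list(prior(node))
        for k in neighbours[node]:
            b[0] *= msgs[(k, node)][0]
            b[1] *= msgs[(k, node)][1]
        beliefs[node] = b[1] / (b[0] + b[1])
    return beliefs

# Invented weights: Eddard is tied closely to dead Robert, Sam to living Jon.
edges = {("Robert", "Eddard"): 0.9, ("Eddard", "Jon"): 0.3, ("Jon", "Sam"): 0.8}
beliefs = loopy_bp(edges, evidence={"Robert": 1, "Jon": 0})
for name, p in sorted(beliefs.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {p:.2f}")
```

As in the article, the useful output is the ranking rather than the raw numbers: the character most strongly connected to known deaths rises to the top of the list.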

Predicting death (accurately)

*A reminder on spoilers

Death risk (top ranked characters) before Eddard Stark’s execution

source: Teradata

Note: the “probability of death” column should not be taken literally. According to Fredembach, there is not enough ground truth data (or data altogether) for the value to be precise; the order is what is meaningful.

At that point in the story, the chapter before Eddard’s death, Eddard is the most likely person to die. Note that Khal Drogo (who dies at about the same time) is second on the list.

Arguably, Eddard was central to all network measures, so this result could be due to “the most central character is marked as the most likely to die.” But according to Fredembach’s blog, the state of the social network at the time of Eddard’s death indicates that it is his close proximity to King Robert that raises his risk.

Death risk (top ranked characters) before Renly Baratheon’s murder

source: Teradata

After Eddard and Khal Drogo, Renly is the next major character to die. Moreover, he is not a particularly central character: he has low betweenness and is not a PoV character. Still, the algorithm accurately picks him as the most likely character to die.

Death risk (top ranked characters) before the “Red Wedding”

source: Teradata

The chapter (and TV episode) labeled as the “Red Wedding” sees the whole Stark host getting murdered, but most notably Robb and Catelyn Stark.

The predictions continue to be highly reliable: the “likeliest to die” character is determined to be Catelyn Stark, while 5 of the 10 people with the highest likelihoods are part of the “Red Wedding” itself. Catelyn scores significantly above her son Robb due to her closer involvement with Stannis and Renly. At that point in the books, the relationship between Robb and Catelyn is the strongest between any two characters, so when one dies, the other one will instantly top the risk list.

The accuracy of the Belief Propagation algorithm over the first 3 books is remarkable considering the limited information it uses, Fredembach said.

Predicting death (a bit less accurately)

Only Joffrey’s death is missed (he ranks 7th out of the 80 characters). In the books there is little time between the two weddings (10 chapters or so), and without new ground truth or relationships, the predictions are heavily skewed towards the survivors of the “Red Wedding”.

Teradata PARTNERS 2016

From the fourth book onward, however, the accuracy of LBP declines for three reasons, according to Fredembach:

  1. The number of dead people keeps growing, while additional “survivors” are hard to pin down without 20/20 hindsight, so most characters are predicted to die.
  2. The story becomes less focused, with a “hub and spoke” model; propagating beliefs along this type of graph can be inaccurate.
  3. Books four and five occur “at the same time,” making propagation prediction more difficult.

Place your bets: What happens in book 6? (and the problem with Jon Snow)

By the end of book five, the “timeline” is back to normal, allowing Fredembach to predict who dies in book 6 based on the state of the social network over the entire story.

However, an important decision must be made: is Jon Snow dead or alive? Jon Snow was labeled with “alive” status for the analysis. LBP therefore predicted that he is alive. If that truth status is removed, most characters will be predicted to die (since only Daenerys would remain as “alive”).

So the algorithm was run under two distinct hypotheses:

  • Label Jon Snow ‘unknown’ and Tyrion as ‘alive’ to predict who’s next to die under the assumption that Jon Snow is “fair game”.
  • Keep Jon ‘alive’ (the standing hypothesis).
source: Teradata

The numbers are close to each other, which is expected considering the low number of ‘alive’ characters. Relationships are built over five books though, so small differences are more meaningful than in earlier measures.

***Fredembach says that while the books do not equal the TV series, it is likely that from book five on, the TV series will actually have an impact on the novels, making predictions even less accurate.
