Word2Vec and Game of Thrones - Part 1

10 minute read

Following this post by Phelipe Teles, and because it takes so long until the episode 3, I decided to play a bit with Game of Thrones subtitles. And since I wanted to use Word2Vec for a while, it was an obvious choice.

In this first part, we will process the dataset, remove artifacts and prepare it for Word2Vec.

The dataset

The subtitles, in json format, can be obtained here. The pandas package, handily, has .read_json, so no extra tools were needed. Due to the format of the original file, .read_json insisted in reading each season as a separate column, so a small transformation with .melt was necessary to store the subs as one column, and season and episode as another.

Then, I read all 7 seasons into a list, and concatenated it into a single dataframe.

import numpy as np
import pandas as pd

#Read the season into a dataframe and reformat it
def get_season(season):
    temp = pd.read_json('season' + str(season) + '.json')
    temp = pd.melt(temp, value_vars = list(temp), value_name = 'text').dropna()
    return temp
    
#Create a list and read the files into it	
li = []
for i in range(1, 8):
    li.append(get_season(i))
    
#Concatenate into single dataframe
got_subs = pd.concat(li, axis = 0, ignore_index = True)

The data looks like this:

id variable text
0 Game Of Thrones S01E01 Winter Is Coming.srt Easy, boy.
1 Game Of Thrones S01E01 Winter Is Coming.srt Our orders were to track the wildlings.
2 Game Of Thrones S01E01 Winter Is Coming.srt - Right. Give it here. - No!
3 Game Of Thrones S01E01 Winter Is Coming.srt Put away your blade.
4 Game Of Thrones S01E01 Winter Is Coming.srt - I take orders from your father, not you. - P…
5 Game Of Thrones S01E01 Winter Is Coming.srt I’m sorry, Bran.
6 Game Of Thrones S01E01 Winter Is Coming.srt Lord Stark?
7 Game Of Thrones S01E01 Winter Is Coming.srt There are five pups.
8 Game Of Thrones S01E01 Winter Is Coming.srt One for each of the Stark children.
9 Game Of Thrones S01E01 Winter Is Coming.srt The direwolf is the sigil of your house.

As I do not plan using season and episode in the analysis (we already have very few words), I will not extract season or episode numbers, and just drop it.

got_subs.drop('variable', axis = 1, inplace = True)

Cleaning up the data

As it always happens with raw data, there is always dirt here and there in our database. The main issue here is that those little things are not always obvious, and not necessarely appear in the first lines. You can find the most common ones by browsing your dataset at random - just fire .sample a couple of times. Also, later on, always pay close attention to your results, as the word font showing up in your Game of Thrones data may indicate that you forgot to clean up some html from your dataset.

Since Word2Vec works with context, we should ensure that we remove as many things that resemble meaningful words but are not as we can.

Our first fix are the non ASCII characters. A common one is a musical note to identify parts where characters sing. I’m removing those characters first as they sometimes appear in the middle of sentences, which may create some extra complications later on (and this also means I noticed those way after I started cleaning up the data and had to go back).

id text
30754 ♪ The Dornishman’s wife was as fair as the…
30755 ♪ And her kisses were warmer than spring ♪
30756 ♪ The Dornishman’s blade ♪
30757 ♪ It was made of black steel ♪

Another thing you can notice in the table above are the italics, which means we need to remove the <i> and </i>. There are bolds also, but they only appear when the episode title is displayed and are coupled with <font color = ...>:

id text
32948 <font color=#FFFF00> Theme music playing </f…
33059 <font color=#00FF00> Game of Thrones 5x10 </fo…
33170 == sync, corrected by <font color=#00FF00>elde…
33281 <font color=#FFFF00> Music playing ends </font>
33471 <font color=#FFFF00> Music playing ends </font>

Those are not actual dialogue, but episode titles and other similar info (for example, identifying the theme music beginning and end). Those lines are meaningless for our purpose, so we will remove them entirely, and keep everything else.

Next thing we notice is that the subtitles contain some text describing the noises we hear in the show:

id text
39216 (PIANO PLAYING)
39217 (CHANTING CONTINUES)
39231 ( gate rumbling )
39239 (GRUNTS, SCREAMS)
39242 - ( laughter ) - Not just the boys.

We cannot drop those lines entirely, because some of them contain speech too, for example the line 39242 in the table above. But since parentheses are not used for anything else in the subs, we can just remove whatever text appears between ().

I will also use this occasion to remove multiple “-“ characters, such as “–” and “- -“, which will simplify some of the next fixes. While fully aware that this is not optimal, this allows us to avoid some overly complicated regex expressions. Those are pretty hard to read for dummies like me, and in a couple of weeks I will have no clue about what is going on in there, so I prefer to throw a couple of extra .replace instead.

#removes non ASCII and italics
got_subs['text'] = got_subs['text'].str.replace(r'[^\x00-\x7F]+', '') \
                                   .str.replace('<i>', ' ') \
                                   .str.replace('</i>', ' ')
                                   
#keep the lines that do not contain 'font color'
got_subs.drop(got_subs[got_subs['text'].str.contains('font color')].index, \
				inplace = True)

#remove text between '()' and also do some additional cleanup 
got_subs['text'] = got_subs['text'].str.replace(r'\(.*\)', ' ') \
                           .str.replace(r'- +-', '-') \
                           .str.replace(r'-+', '-')

Another issue you have probably noticed is that some lines contain multiple sentences, or even multiple people speaking. Since Word2Vec operates with sentences, we should also break those up.

id text
2 - Right. Give it here. - No!
4 - I take orders from your father, not you. - P…
12 We tracked them. They won’t trouble us no more.
19 The runt of the litter. That one’s yours, Snow.
22 - It’s starting to show. - And you never worry…
25 100-foot drop into the water… you were never…
27 “We’re Lannisters. Lannisters don’t act like f…
28 - What if Jon Arryn told someone? - But who wo…
49 - Tell me. - There was a raven from King’s Lan…
53 - Your sister, the boy? - They both have their…

The idea is to first split the text whenever we have a sentence ending character. This will create a dataframe with multiple columns, so once more we will use .melt to convert it back to a single column dataframe, and then drop NA’s (because most rows have 1 or 2 sentences at most).

got_subs = got_subs['text'].str.split(r'(\.{1,3} )|(\?+ )|(!+ )|((?!^\/)- )', \
								expand = True)\
                   .melt(value_vars = list(range(0, 31)), value_name = 'text') \
                   .drop('variable', axis = 1) \
                   .dropna()

Another problem is that some of the lines idetify who is speaking:

id text
20815 CERSEl: Thank you for your help with the other…
22911 MISSANDEl: Very well, Your Grace.
23155 MISSANDEl: “Isles.”
266403 - Bronn: Get in line!
266414 - Man 2: Hold together!

This is another thing we should remove. More ugly regex and lack of, some things listed as they are because it would be nearly same effort, don’t copy me =)

got_subs['text'] = got_subs['text'].str.replace(r'^(()|(- ))([\w\-]+: )', ' ') \
                           .str.replace(r'^(()|(- ))([\w\-]+ [\w\-]+: )', ' ') \
                           .str.replace('Bran\'s voice:', ' ') \
                           .str.replace('Rhaegar and Lyanna:', ' ') \
                           .str.replace('Gold cloak 2:', ' ')

Stopwords and stuff

Now that we removed most of the artifacts, and organized a data in a series of sentences, it is time to remove stopwords - a purpose specific set of common words - and prepare the data for Word2Vec. The first step is to convert everything to lowercase.

got_subs['text'] = got_subs['text'].str.lower()

To remove stopwords, we will use the nltk package, and then add some custom words based on the text specifics. To start with, I found that nltk english stopwords include words such as “I” and “you”, but do not include “us”. Morals: always check the stopwords list before applying it, because it can be weird.

Then, since we are dealing with transcribed dialog, it has as many “hmm”s and “uh”s as any average student presentation. Lets get rid of those too.

from nltk.corpus import stopwords

#get the english ones
stop = stopwords.words('english')

#replace them in text
got_subs['text'] = got_subs['text'].str.replace(r'\b(?:{})\b' \
							.format('|'.join(stop)), '')

#some additional stopwords
stop = ['us', 'mine', 'whose', 'welcome', 'aye', 'oh', 'ah', 'uh', 'hmm', 
		'shh', 'em' 'yah', 'hyah', 'hey', 'mm']
got_subs['text'] = got_subs['text'].str.replace(r'\b(?:{})\b' \
							.format('|'.join(stop)), '')

I spent quite some time pondering if words such as “king”, “lord”, “lady”, “ser” and so on, extremely frquent in those subtitles, should be also removed as stopwords. But then, on the other hand, those are somewhat meaningful, the sentences are short and the dataset is small. My final decision is to keep them.

#thematic stopwords
stop = ['father', 'mother', 'queen', 'king', 'prince', 'princess',
                   'lord', 'lady', 'grace', 'ser', 'knight']
got_subs['text'] = got_subs['text'].str.replace(r'\b(?:{})(s\b|\b)' \
							.format('|'.join(stop)), '')

Now that we are done with stopwords, a small cleanup, removing everything except letters, and also dropping lines that contain no letters at all. I did not do this before because some of the stopwords, such as “I’m”, contain such characters, and it is easier to just do it in order.

#remove everything except letters
got_subs['text'] = got_subs['text'].str.replace('[^a-zA-Z]', ' ') 

#finally we keep only the lines with some text
got_subs = got_subs[got_subs['text'].str.contains(r'[a-zA-Z]')]

Tokenization and Lemmatization

English is a so called Inflected Language, which means there are words that are derived from other words, for example “know”, “knew”, “knowledge”, “knowingly”. Since we work with words, it makes sense to try and group as many of them as possible. There are two possible approaches to this, stemming and lemmatization. Stemming is a process where we remove the ends of words, while lemmatization uses a vocabulary to try finding a base form of a word (lemma). For example, if we stem the word “ponies” we will get “poni” as result, while lemmatization will return “pony”.

Both approaches have their strengths and weaknesses. My choice here is to use lemmatization, because:

  • it outputs actual words and is less likely to affect names of characters or places (we are dealing with a fantasy world after all)
  • since our datset is small, the performance loss is irrelevant
  • i rarely get to use it =)

Since our current dataset is made of sentences, we start with tokenization. The tokenizer from the nltk package separates sentences into words. There are several tokenizers available, I chose the one that didn’t tokenize “gonna” as “gon” + “na”.

from nltk.tokenize import WhitespaceTokenizer
sentences = [WhitespaceTokenizer().tokenize(sent) for sent in got_subs['text']]

The tricky aspect of lemmatization is part-of-speech (POS) identification: we need to inform if the word is a noun, a verb, an adjective and so on. POS identification is still relatively difficult for a computer, but somehow works. To do it, once again, we use nltk tools.

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

#lemmatizer
wn_lemmatizer = WordNetLemmatizer()

#get part-of-speech (works better when you feed it a sentence)
def get_wn_pos(sent):
    tags = pos_tag(sent)
    tag_dict = {'J': wordnet.ADJ,
                'N': wordnet.NOUN,
                'V': wordnet.VERB,
                'R': wordnet.ADV}
    return [tag_dict.get(t[1][0], wordnet.NOUN) for t in tags]

#lemmatize our data
s_lemmatized = []
for s in sentences:
    newsentence = []
    pos = get_wn_pos(s)
    for i in range(len(s)):
        newsentence.append(wn_lemmatizer.lemmatize(s[i], pos = pos[i]))
    s_lemmatized.append(newsentence)

This is a big deal: we went from 8912 to 6856 different words.

As I mentioned, lemmatization will likely not affect GoT-specific names (because they are not real words in English, it will leave them as they are). For example, we have “Stark” and “Starks”, “Tyrell” and “Tyrells”… We will just make a small custom dictionary based on some of the common Westerosi words and family names, and replace the plurals.

#make a custom dictionary
additional_dict ={'starks': 'stark',
                 'lannisters': 'lannister',
                 'targaryens': 'targaryen',
                 'baratheons': 'baratheon',
                 'greyjoys': 'greyjoy',
                 'tyrells': 'tyrell',
                 'boltons': 'bolton',
                 'umbers': 'umber',
                 'karstarks': 'karstark',
                 'maesters': 'maester',
                 'wildlings': 'wildling'}
#remove plurals
s_lemm = []
for s in s_lemmatized:
    newsentence = []
    for w in s:
        newsentence.append(additional_dict.get(w, w))
    s_lemm.append(newsentence)

The data is now clean (somewhat) and ready to be used in a Word2Vec model, which will be trained in the next part.