The Speech of Star Wars

a text analysis of the original trilogy

Introduction

One of the most iconic choices that the Star Wars franchise made was to include the peculiar speech patterns of Yoda, one of Luke Skywalker's mentors in the ways of the Force. In this project, I aim to explore the subtler speech patterns that distinguish other characters' lines, using text analysis methods available in the programming language R.

I was inspired by similar text analyses of popular media, including these analyses of The Office, Parks and Recreation, and mystery novels.

Methodology

This project uses a dataset that includes all spoken lines in the original Star Wars trilogy, which I downloaded from Kaggle. Initial examination revealed that the data didn't include the full scripts, so factors like setting or directorial notes could not be included in this analysis.

Instead, I focused primarily on discovering any unique characteristics of the major speaking characters. I applied sentiment analysis, word frequency analysis, and topic modeling to illustrate differences in the characters' tendencies.

  1. Sentiment Analysis

    I used a two-step process to analyze sentiment in the dialogue. First, I used the NRC Word-Emotion Association Lexicon to sort words into "positive" and "negative" categories. Then I used the AFINN Sentiment Lexicon to give each word a weighted "score": higher scores for more positive words, and lower scores for more negative ones.

    I used this method to analyze the dialogue of all three Star Wars scripts as a whole, and then to analyze dialogue from specific characters. For the individual characters, I normalized each score by the total number of words the character spoke, giving an average sentiment per word across that character's dialogue.

  2. Word Frequency Analysis

    To get meaningful results from this analysis, I eliminated common words (like "a," "the," "we," and "it," among many others) from the dataset. Then I used R to find the words most commonly used by each of the major characters I identified.

  3. Topic Modeling

    Using the same cleaned text, I used R to find "topics" within the remaining dialogue. This method allowed me to see which words were most likely to appear together in the same phrase or sentence; those groupings could then be interpreted as broad "topics."

  4. Visualizations

    While the main focus of this project was the text analysis itself, it did help to display some of the results visually. I used bar charts to display the results of the sentiment analysis and topic modeling; I used word clouds to visualize the word frequency analysis.
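As a rough illustration of the two-step sentiment scoring described in item 1 above, the following R sketch tags each word with a category and then averages weighted scores per speaker. It is not the project's actual code: the speakers, the lines, and the two small lookup vectors (`nrc_like`, `afinn_like`) are invented stand-ins for the real NRC and AFINN lexicons, and the `tokenize` helper is a hypothetical simplification.

```r
# Toy dialogue; the speakers and lines are invented for illustration only.
dialogue <- data.frame(
  character = c("A", "A", "B"),
  line = c("I love this good ship",
           "This is a terrible awful plan",
           "Great, what a wonderful idea"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-ins for the two lexicons (assumed values, not the real files).
nrc_like <- c(love = "positive", good = "positive", great = "positive",
              wonderful = "positive", terrible = "negative", awful = "negative")
afinn_like <- c(love = 3, good = 3, great = 3, wonderful = 4,
                terrible = -3, awful = -3)

# Lowercase the text and split it on anything that is not a letter or apostrophe.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# One row per spoken word, keeping track of who said it.
tokens <- do.call(rbind, lapply(seq_len(nrow(dialogue)), function(i) {
  words <- tokenize(dialogue$line[i])
  data.frame(character = rep(dialogue$character[i], length(words)),
             word = words, stringsAsFactors = FALSE)
}))

# Step 1: positive/negative category (NA for words not in the lexicon).
tokens$category <- unname(nrc_like[tokens$word])
category_counts <- table(tokens$category)

# Step 2: weighted score; words missing from the lexicon contribute 0.
tokens$score <- unname(afinn_like[tokens$word])
tokens$score[is.na(tokens$score)] <- 0

# Normalize by the total number of words each character speaks.
total_score <- tapply(tokens$score, tokens$character, sum)
total_words <- tapply(tokens$score, tokens$character, length)
avg_sentiment <- total_score / total_words

print(category_counts)
print(avg_sentiment)
```

With this toy input, speaker A's positive and negative words cancel out, while speaker B's two positive words out of five give the higher per-word average.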

The code used to conduct this analysis was written in R; it can be downloaded here.
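For readers who want a feel for the word frequency step before opening the full code, here is a minimal base R sketch under stated assumptions: the speakers, lines, and the six-word `stop_words` vector are invented for illustration, and `clean_words` and `top_words` are hypothetical helper names. The last few lines count how often two words share a line. That is a much simpler stand-in for the "words that appear together" idea, not the topic model used in the project.

```r
# Toy lines keyed by speaker; speakers and lines are invented for illustration.
lines_by_speaker <- list(
  pilot  = c("Chewie, we need the ship", "Chewie, it is a fast ship", "Chewie!"),
  farmer = c("The station is far", "We go to the station")
)

# A tiny hypothetical stop-word list; real lists contain hundreds of entries.
stop_words <- c("a", "the", "we", "it", "is", "to")

# Lowercase, split on non-letters, and drop empty strings and stop words.
clean_words <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words) & !words %in% stop_words]
}

# Most frequent remaining words for one speaker's lines.
top_words <- function(lines, n = 3) {
  head(sort(table(clean_words(lines)), decreasing = TRUE), n)
}

top_by_speaker <- lapply(lines_by_speaker, top_words)
print(top_by_speaker)

# Simple co-occurrence: in how many lines do two words both appear?
all_lines <- unlist(lines_by_speaker, use.names = FALSE)
per_line <- lapply(all_lines, function(l) unique(clean_words(l)))
vocab <- sort(unique(unlist(per_line)))
presence <- t(sapply(per_line, function(w) as.integer(vocab %in% w)))
colnames(presence) <- vocab
cooccurrence <- crossprod(presence)  # word-by-word matrix of shared-line counts
print(cooccurrence["chewie", "ship"])
```

Counting words per speaker after removing stop words is what surfaces names such as a frequently addressed companion; the co-occurrence matrix only hints at groupings and would need a proper topic model to produce topics.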

Findings

Highlighted findings include:

  1. Overview

    In total, the three Star Wars movies contain 2,523 lines of dialogue, with a total of 25,938 words. Out of those, the NRC analysis categorized 319 distinct words as positive and 173 as negative. Including repeats of the same word, positive words appeared 1,401 times, and negative words appeared 1,121 times.

    The AFINN analysis assigned a score between -5 and +5 to words in order to convey their positive or negative meanings. This method gave Star Wars an overall sentiment score of 112, for an average of 0.004 per spoken word.

    Finally, topic modeling reveals a few different themes across Star Wars; unsurprisingly, as the main character, Luke Skywalker appears in two of the four word groupings. Additionally, one of the groups seems devoted to the dark side of the Force, with words like "Vader" and "star" (presumably for the Death Star).

  2. Character Dialogue

    Seven characters had over 100 lines of dialogue across all three movies: Luke Skywalker, Han Solo, Leia Organa, Darth Vader, Obi-Wan Kenobi, C-3PO, and Lando Calrissian. Interestingly, Leia is the only female character to make it into this analysis.

    Among these characters, Han Solo has the most positive sentiment score (an average of 0.032 per word), while Darth Vader has the most negative (-0.009 per word). Han isn't a particularly optimistic character, but he uses sarcasm liberally; because a lexicon scores each word at face value, sarcastic praise still counts as positive, which may account for his high score.

    Word frequency analysis yields some interesting results, including which characters are most likely to speak to or about each other. For example, "father" and "Artoo" appear fairly large in Luke's word cloud, while "Chewie" is the most common word for Han.

    The word groupings produced by topic modeling give us insight into what our characters are most likely to discuss. Luke and Han both dedicate much of their dialogue to friends or family, while Leia's most repeated words come from her recorded message: "Help me, Obi-Wan Kenobi. You're my only hope."
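The selection of major characters and the per-word average reported in this section can be checked with a few lines of R. This is a sketch, not the project's code: the `script` data frame and its speaker labels and line counts are invented, and only the totals 112 and 25,938 come from the findings above.

```r
# Toy table with one row per spoken line; labels and counts are invented.
script <- data.frame(
  character = rep(c("A", "B", "C"), times = c(120, 101, 40)),
  stringsAsFactors = FALSE
)

# Keep only characters with over 100 lines, as in the analysis.
line_counts <- table(script$character)
major <- names(line_counts[line_counts > 100])
print(major)

# Overall AFINN score divided by total words spoken (figures from the text).
overall_avg <- 112 / 25938
print(round(overall_avg, 3))
```

The threshold is strict ("over 100"), so a character with exactly 100 lines would be excluded under this reading.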

Conclusions

Uncovering the deeply embedded trends in the language of the original Star Wars movies can help explain what has drawn generations of fans to the space epic. Beyond helping future entertainment replicate that success, this kind of analysis may point to deeper truths about what inspires, impassions, and moves us as human beings.

Future analysis might explore what the rest of the Star Wars franchise has to offer by analyzing the speech patterns in novelizations, extended-universe comics, TV shows, and other movies, including the prequels and sequels.

The final deliverable for this project, an analytical report, can be downloaded here.

References

Allen, J. "Text Mining: Every Line from The Office." Retrieved from https://www.jennadallen.com/post/text-analytics-every-line-from-the-office/.

Baert, S. "Happy Galentine's Day!" Retrieved from https://suzan.rbind.io/2018/02/happy-galentines-day/.

Corpus. "AFINN Sentiment Lexicon." Retrieved from http://corpustext.com/reference/sentiment_afinn.html.

"Cultural impact of Star Wars." Retrieved from https://en.wikipedia.org/wiki/Cultural_impact_of_Star_Wars.

Government of Canada. "Sentiment and Emotion Lexicons." Retrieved from https://nrc.canada.ca/en/research-development/products-services/technical-advisory-services/sentiment-emotion-lexicons.

R Project. "What is R?" Retrieved from https://www.r-project.org/about.html.

Walsh, B. "Sentiment Analysis of Every Mystery Novel on Gutenberg.org." Retrieved from https://blogs.elon.edu/com329/2019/03/01/sentiment-analysis-of-every-mystery-novel-on-gutenberg-org/.

Xavier, V. "Star Wars Movie Scripts." Retrieved from https://www.kaggle.com/xvivancos/star-wars-movie-scripts.