Dear 100 Hour Board,
I remember hearing on the DVD extras of Gilmore Girls (maybe season 1) that an average 1-hour tv show has X number of pages of dialogue, but an episode of Gilmore Girls had a much greater number of pages because everyone talks so much and so fast. But now that no one watches DVDs anymore, I have no idea what the actual number of pages is.
Recently I've been watching Better Call Saul, which is pretty much the exact opposite--large portions of any given episode have no dialogue at all.
Short of counting the words in all the episodes ever... any idea what the average amount of dialogue is in a 1-hour show, and in Gilmore Girls and Better Call Saul? Bonus points if anyone has a way to figure out if the average has changed over time.
Why do you ask such interesting questions? You have brought Tally M. back from the dead (well, retirement) to answer this question because she just couldn't help herself. I hope you're happy.
You have no idea how excited I've been working on this project. This doesn't mean that it hasn't been frustrating and annoying and crazy, and it doesn't mean I haven't rewritten the programs at least a few times over. But it's been exciting! I have neglected to describe the multitudinous problems I've had in getting this program to work. Regardless, the following write-up should be sufficient. If you just want to know the answer to the question, jump down to the Conclusion.
Luckily, I already had an idea of where to get transcripts for TV episodes, which makes this experiment much easier. The website Forever Dreaming has a lot of television episode transcripts, as well as some movie transcripts.
First, I gathered all of the titles of TV series that Forever Dreaming had available. I figured there wasn't any point in doing analysis if I couldn't calculate the number of words per episode. Then using IMDb's unofficial API, I could quickly get the average episode runtime for each series. Luckily, there was already a Python module that utilized the IMDb API so I didn't have to completely set it up myself. Unfortunately, it necessitated me using Python 2 rather than Python 3, which was only slightly more troublesome (but meant I didn't have to use parentheses in my print statements, so you win some, you lose some). Just as unfortunate, the module (imdbpy for those interested) had very limited documentation.
So, using the API, I kept the series whose average episode runtime was longer than 40 minutes and shorter than 65 minutes. Considering commercials, that seemed the most logical range for looking at "hour-long" television shows. From there, I got the title, runtime, and year (as well as series number and episode number) for each episode of each series.
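As a rough illustration of the runtime filter described above, here is a minimal Python 3 sketch. It is not the original program: the series titles and runtimes are made up, the names `average_runtimes` and `is_hour_long` are hypothetical, and the lookup against IMDb is left out entirely.

```python
# Hypothetical data: series title -> average episode runtime in minutes.
# In the write-up these values came from IMDb; the ones here are invented.
average_runtimes = {
    "Hypothetical Sitcom": 22,
    "Hypothetical Drama": 44,
    "Hypothetical Prestige Drama": 58,
    "Hypothetical Miniseries": 90,
}

MIN_MINUTES = 40  # lower bound from the write-up (exclusive)
MAX_MINUTES = 65  # upper bound from the write-up (exclusive)


def is_hour_long(runtime_minutes):
    """Return True when the runtime is strictly between the two bounds."""
    return MIN_MINUTES < runtime_minutes < MAX_MINUTES


# Keep only the series that count as "hour-long" under this definition.
hour_long_series = sorted(
    title for title, runtime in average_runtimes.items() if is_hour_long(runtime)
)
print(hour_long_series)
```

Treating both bounds as exclusive is a literal reading of "longer than 40 minutes and shorter than 65 minutes"; the original code may have handled the edges differently.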
I then wrote a separate program to get the transcripts of the episodes. Every transcript is recorded slightly differently, which caused some minor (okay, major) problems. There were some pretty consistent issues I could deal with right off the bat. Any line that began with a music note or a square bracket was deleted, since those lines weren't words to be included in the overall average words per minute. Also, each series needed to have at least ten episodes with data in order for me to add it to the database. However, most of the transcripts aren't consistent in their formatting, which caused a bit of a headache. After some discussion with Katya (thanks, by the way!), I found a workaround that mostly consisted of only including series whose transcripts were formatted closely enough to the others. What constitutes a word is super complex, and not even linguists agree on what a word is, so I just split the transcript on spaces. From there, it's just a matter of dividing the number of words in an episode by the number of minutes in the episode.
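The cleaning and counting steps above could look something like the following Python 3 sketch. This is an assumption-laden reconstruction, not the original code: the function names `count_spoken_words` and `words_per_minute` are hypothetical, the sample transcript is invented, and `str.split()` splits on any whitespace rather than on single spaces only.

```python
# Characters treated as "music notes" here; the exact set is an assumption.
MUSIC_NOTES = ("♪", "♫")


def count_spoken_words(transcript):
    """Count whitespace-separated words, skipping lyric and bracketed lines."""
    total = 0
    for line in transcript.splitlines():
        stripped = line.strip()
        # Drop lines that begin with a music note or a square bracket.
        if stripped.startswith(MUSIC_NOTES) or stripped.startswith("["):
            continue
        total += len(stripped.split())
    return total


def words_per_minute(transcript, runtime_minutes):
    """Divide the word count by the episode runtime in minutes."""
    if runtime_minutes <= 0:
        raise ValueError("runtime_minutes must be positive")
    return count_spoken_words(transcript) / runtime_minutes


# Invented four-line transcript, just to exercise the functions.
sample = """[door slams]
A: Coffee, please. Now.
♪ la la la ♪
B: You've had five cups already."""

print(count_spoken_words(sample))
print(words_per_minute(sample, 2))
```

Note that speaker labels such as `A:` count as words under this approach, which is one reason inconsistently formatted transcripts are hard to compare.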
After throwing out shows whose transcripts didn't behave, I was left with 6,552 episodes from 195 series.
The average number of words per minute in an hour-long episode is 100.39, with the median number of words per minute being 105.49.
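For anyone wondering how the average can sit below the median, here is a small Python 3 sketch using the standard `statistics` module. The numbers are invented, and it assumes the average is the mean of per-episode wpm values, which the write-up doesn't spell out. It only shows one pattern consistent with the figures above: a few very quiet episodes pull the mean down while barely moving the median.

```python
from statistics import mean, median

# Invented per-episode wpm values: one very quiet episode, four typical ones.
episode_wpm = [30.0, 95.0, 105.0, 110.0, 115.0]

average = mean(episode_wpm)   # sensitive to the low outlier
middle = median(episode_wpm)  # the middle value once sorted

print(average)
print(middle)
```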
The episode with the highest words per minute (in my dataset) had 276.17 wpm, and while it didn't belong to Gilmore Girls, Gilmore Girls had seven out of the top ten wpm episodes.
The episode with the lowest words per minute (in my dataset) had 22.57 wpm. The bottom ten episodes were from The Walking Dead, Vikings, or Fear the Walking Dead. To be honest, that didn't really surprise me given the subject matter of those series.
I wanted to include a graph with the wpm of all episodes, but it was a little bit crowded, given the fact that I was plotting 6.5k data points.
Looking at the average wpm for each series' individual seasons, the top ten are dominated by Gilmore Girls seasons.
On the other end, the bottom ten series' seasons are either from Vikings or The Walking Dead.
Now we can take a look at the overall series' average words per minute.
Not all of the series are included in the bottom legend, simply because there's not enough room, but suffice it to say that the series are all in alphabetical order, and more or less cluster around 100 wpm. See that outlier in the middle? The one with a much higher average wpm? Yep, that's Gilmore Girls.
Here's the top ten series:
And the bottom six series (don't ask me why I only added the bottom six to a chart):
My favorite graph is next. Taking a look at the average words per minute by year was considerably more interesting than I expected.
The average number of words per minute has actually significantly decreased since 2002, though there does seem to be a slight uptick in the last couple of years. I'll be interested to see if this upward trend continues.
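A by-year average like the one graphed above can be computed by grouping episodes on their year. The Python 3 sketch below is only an illustration under stated assumptions: the `(year, wpm)` records are made up and are not the real dataset, and `average_wpm_by_year` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean

# Invented (year, words-per-minute) records, one per episode.
episodes = [
    (2002, 120.0),
    (2002, 110.0),
    (2010, 95.0),
    (2010, 105.0),
    (2016, 98.0),
]


def average_wpm_by_year(records):
    """Group wpm values by year and return {year: mean wpm}, sorted by year."""
    by_year = defaultdict(list)
    for year, wpm in records:
        by_year[year].append(wpm)
    return {year: mean(values) for year, values in sorted(by_year.items())}


yearly = average_wpm_by_year(episodes)
for year, avg in yearly.items():
    print(year, round(avg, 2))
```

The same grouping works for per-season or per-series averages by swapping the year for a different key.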
Conclusion
Average amount of dialogue in a 1-hour show: 100 words per minute.
Average amount of dialogue in Gilmore Girls: 186 words per minute.
Average amount of dialogue in Better Call Saul: 88 words per minute.
Has the average changed over time: Yes, it's decreased since 2002, but may be trending upwards.
It looks like your intuitions were right! I'm very glad to present data that supports your hypotheses.
If you contact me (through Spectre) I'm willing to send you the link to my code (once I get it uploaded). It can be pretty easily tweaked to do whatever you want—I just used it to only get what I needed to answer this question. On a similar note, if any reader has any other questions like this, I'm always in search of interesting research questions, and I'd love to do the research to put on my blog. So, contact me. Please.
So there you go. I hope you are ok with this answer going over hours a little bit. If you aren't, tough luck because my wife just did some awesome stuff for you.