I tried to learn R before this module through Coursera, but I wasn’t able to continue the course after the second week because I found it a bit hard. Although one of my favorite characters, Homer Simpson, would say “You tried your best and failed miserably. The lesson is, never try”, I have started using and learning R again with the Data Management and Analytics module.
I started my re-learning process with CodeSchool’s Try R online course. It was a good reminder of the different features of R, and I learnt how to create different graphs, use factors and so on during those eight chapters of R adventure.
After completing those eight chapters I was ready to get real-life data and conquer the world with my beautiful data stories. Obviously, it hasn’t happened yet! I have joined a few DBS Analytics Society meetings on Saturdays and started to analyse different data sets with R. Although I could have done most of those analyses in Excel in a short time, this time I am determined to learn R, so I am still wrestling with it.
While I was looking for interesting data sets to analyze, I found that Reddit and Kaggle.com were really useful for finding different data sets. fivethirtyeight.com also provides a lot of data sets in its GitHub account, but they are very good at getting everything out of a data set, so there is not much you can add to the story they tell.
For my first attempt at analyzing data with R, I decided to go with the Simpsons data from kaggle.com, and I can honestly say that reading this article by Todd Schneider motivated me too.
Although that article already covers many different findings, I decided to try something different and check how many times the Simpson family characters have been used in episode titles. Then I will compare how many people watched those episodes and what their IMDb ratings are.
So here is the first part of my code, where I read the data set into R and then count how many episode titles mention each character:
#Read the data into R
simpsons <- read.csv("simpsons_episodes.csv", stringsAsFactors = FALSE)
#collapse all titles into a single string
titlewords <- paste(simpsons$title, collapse = " ")
#remove the punctuation from the collapsed titles
titlewords <- gsub("[[:punct:]]", "", titlewords)
#count the titles that mention each character; grepl matches substrings, so words like Homers are included
HomerCount <- sum(grepl('Homer', simpsons$title))
BartCount <- sum(grepl('Bart', simpsons$title))
MargeCount <- sum(grepl('Marge', simpsons$title))
LisaCount <- sum(grepl('Lisa', simpsons$title))
MaggieCount <- sum(grepl('Maggie', simpsons$title))
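The five separate counts can also be produced in one pass. This is only a minimal sketch: `sample_titles` is a small hand-typed vector standing in for `simpsons$title`, and the names `family` and `title_counts` are made up for illustration.

```r
# Hand-typed titles standing in for simpsons$title (illustration only)
sample_titles <- c("Homer's Odyssey", "Bart the Genius", "Lisa's Substitute",
                   "Homer vs. Lisa and the 8th Commandment", "Treehouse of Horror")

# Family members to look for
family <- c("Homer", "Bart", "Marge", "Lisa", "Maggie")

# For each name, count the titles that contain it.
# fixed = TRUE treats the name as plain text instead of a regular expression.
title_counts <- sapply(family, function(name) {
  sum(grepl(name, sample_titles, fixed = TRUE))
})

print(title_counts)
```

Because `sapply` keeps the names of the input vector, `title_counts[["Homer"]]` returns the count for Homer, and the whole vector could be passed straight into a data frame for plotting.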
After finding out the episode counts, I created a data frame so that I could draw a graph with ggplot2. I went with the basics and didn’t change the colors, but it looks colorful enough, I guess.
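#load ggplot2 for plotting
library(ggplot2)
#creating a data frame for barplot
plotdata <- data.frame(
  CharacterName = factor(c("Homer", "Bart", "Marge", "Lisa", "Maggie"),
                         levels = c("Homer", "Bart", "Marge", "Lisa", "Maggie")),
  CountOfEpisodes = c(55, 42, 20, 38, 2))
#creating the graph
#the original listing is cut off after "+"; the bar layer below is an assumed completion
ggplot(data = plotdata, aes(x = CharacterName, y = CountOfEpisodes, fill = CharacterName)) +
  geom_bar(stat = "identity")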
So, as expected, Homer is the most used character in episode titles: 55 of the 600 episodes use his name in the title. Bart and Lisa are very close, with 42 and 38 episode titles respectively.
**Maggie is used in only two episode titles, so I will not include her in the viewing and IMDb rating statistics.
After cleaning and sorting my data, I calculated the average IMDb rating and the average US viewers (in millions) for each character.
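One way to get such per-character averages is sketched below. The rows are made-up numbers, not values from the data set, and the column names `imdb_rating` and `us_viewers_in_millions` are assumptions about how the Kaggle file is laid out, so adjust them to match your copy.

```r
# Made-up rows for illustration; the column names are assumed, not confirmed
episodes <- data.frame(
  title = c("Homer's Odyssey", "Bart the Genius", "Lisa's Substitute",
            "Homer Alone", "Marge vs. the Monorail"),
  imdb_rating = c(7.0, 8.0, 8.5, 8.0, 9.0),
  us_viewers_in_millions = c(20, 24, 18, 22, 23),
  stringsAsFactors = FALSE
)

# Maggie is left out, as in the post
family <- c("Homer", "Bart", "Marge", "Lisa")

# Build one row per character: mean rating and mean viewers over matching titles
averages <- do.call(rbind, lapply(family, function(name) {
  hits <- episodes[grepl(name, episodes$title, fixed = TRUE), ]
  data.frame(
    CharacterName = name,
    Episodes = nrow(hits),
    AvgImdbRating = mean(hits$imdb_rating),
    AvgViewersMillions = mean(hits$us_viewers_in_millions),
    stringsAsFactors = FALSE
  )
}))

print(averages)
```

An episode whose title names two characters is counted once for each of them, which matches how the title counts were made with `grepl`.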
Use a character name and get two million more viewers!!!
Although there doesn’t seem to be a big difference between the characters’ viewing and IMDb numbers, having a character name in the title does seem to have an effect on average viewing numbers. Since the start of the show, an episode of The Simpsons has been watched by 11.84 million people on average. Interestingly, when I checked the episodes whose titles use a family character, I realized that the average viewing is 3.68 million higher. As you can see in the graph below, the average viewing across all Simpsons episodes is also higher than the average viewing of the episodes whose titles do not include a family character.
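The comparison between the overall average and the two groups of titles can be sketched like this. The viewer numbers are invented for the example and `us_viewers_in_millions` is an assumed column name, so the result will not reproduce the 11.84 or 3.68 figures.

```r
# Invented numbers for illustration; the column name is an assumption
episodes <- data.frame(
  title = c("Homer's Odyssey", "Bart the Genius", "Treehouse of Horror",
            "Lisa's Substitute", "Cape Feare", "Last Exit to Springfield"),
  us_viewers_in_millions = c(20, 24, 16, 18, 12, 14),
  stringsAsFactors = FALSE
)

# Flag titles that contain any family member's name
family_pattern <- "Homer|Bart|Marge|Lisa|Maggie"
episodes$has_family_name <- grepl(family_pattern, episodes$title)

# Average over all episodes, then over each group separately
overall_avg <- mean(episodes$us_viewers_in_millions)
avg_with    <- mean(episodes$us_viewers_in_millions[episodes$has_family_name])
avg_without <- mean(episodes$us_viewers_in_millions[!episodes$has_family_name])

# How far the "family name" group sits above the overall average
difference <- avg_with - overall_avg

print(c(overall = overall_avg, with_name = avg_with, without_name = avg_without))
```

Since the overall average is a weighted mix of the two groups, it always lands between `avg_without` and `avg_with`, which is why the overall figure is higher than the average for titles without a family name.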
Although it is easy to assume that using character names in your title gets you more viewers, there are obviously other factors that affect this. The probable main reason for the difference between the two values is the producers! 65% of the episode titles that use a family member’s name were aired in the early seasons, when people were still watching TV. When you take the decline in TV viewing into consideration, it is expected that the average viewing in the later seasons has declined, too.
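A share like that 65% can be checked with a few lines of R. This is a rough sketch on hand-typed rows: the `season` column name and the cut-off of season 10 for "early" are assumptions made for the example, not definitions taken from the analysis above.

```r
# Hand-typed rows for illustration; "season" is an assumed column name
episodes <- data.frame(
  title = c("Homer's Odyssey", "Bart the Genius", "Lisa's Substitute",
            "Cape Feare", "Homer the Whopper", "The Town"),
  season = c(1, 1, 2, 5, 21, 28),
  stringsAsFactors = FALSE
)

# Flag titles that contain any family member's name
episodes$has_family_name <- grepl("Homer|Bart|Marge|Lisa|Maggie", episodes$title)

# "Early" is an arbitrary cut-off chosen for this sketch
early_cutoff <- 10

# Keep only the titles with a family name, then take the share aired early
named <- episodes[episodes$has_family_name, ]
share_early <- 100 * mean(named$season <= early_cutoff)

print(share_early)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, so multiplying by 100 turns it into a percentage.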
Who has the biggest IMDb rating?
Finally, I created four more graphs where you can see each family character’s IMDb rating and viewing numbers for the episodes that use their name in the title. It looks like Bart was really popular in the early seasons 🙂
As part of my second Continuous assignment, I analyzed the Simpsons data with R and created some graphs. I learnt a lot of new things while analyzing the data and creating the graphs. I will keep posting about my challenges with R.
Sources & References:
- Simpsons Data – https://www.kaggle.com/wcukierski/the-simpsons-by-the-data
- Memes – https://frinkiac.com
*All code and data files can be found in my GitHub account: