'Tis the season to be jolly, and what better way to spend the holidays than by watching Christmas movies.
Amidst all the lists and recommendations for best christmas movies, I decided to settle the score using data. Hopefully, this will make it easier for my friends to pick a movie for our next movie night!
Whether you're a die-hard festive film fanatic or just looking for the perfect movie to accompany your cozy holiday evenings, hopefully this analysis will help you plan your next watch.
1. Trends over Years
Let's understand the trends in movies released every year from 1898 to 2023
yearly_counts <- table(movies$release_year)
# Data frame for plotting
trends_data <- data.frame(
year = as.numeric(names(yearly_counts)),
count_movies = as.numeric(yearly_counts)
)
trends_palette <- c("#ff0000", "#00ff00", "#0000ff", "#ffcc00")
# Visualize
ggplot(trends_data, aes(x = year, y = count_movies)) +
geom_line() +
geom_point(color = rep(trends_palette, length.out = nrow(trends_data)), size = 2, shape = 16) +
labs(
title = "Number of Christmas Movies Released Each Year (2016-2022)",
x = NULL,
y = NULL
) +
theme_minimal() +
theme(
text = element_text(family = "Junicode"),
plot.title = element_text(hjust = 0.5)
) +
scale_x_continuous(breaks = seq(min(trends_data$year), max(trends_data$year), by = 5)) +
scale_y_continuous(breaks = seq(min(trends_data$count_movies), max(trends_data$count_movies), by = 9))
The table function is used to count the number of occurrences of each unique release year.
trends_data: new data frame with columns for the release year (year) and the count of movies released in that year (count_movies).
trends_palette: color palette to have multi-colored data points and give christmas lights effect.
scale_x_continuous and scale_y_continuous: control the ticks on the x and y axes.
Output:
Christmas movies have been on a rise since 2000's, more so in the last decade. It's peak being 88 movies in 2020.
2. Genre distribution
Next, let's see the genres and themes that are most prevalent in Christmas movies.
# Split genres into individual rows
genres_data <- movies %>%
separate_rows(genre, sep = ", ")
# Count the occurrences of each genre and store in dataframe
genre_counts <- genres_data %>%
group_by(genre) %>%
summarize(count_movies = n()) %>%
arrange(desc(count_movies))
# Custom color palette
genre_palette <- c("#741102", "#042D29")
# Visualize
ggplot(genre_counts, aes(x = reorder(genre, -count_movies), y = count_movies)) +
geom_bar(stat = "identity", fill = rep(genre_palette, length.out = nrow(genre_counts))) +
labs(
title = "Most Common Genres for Christmas Movies",
x = NULL,
y = NULL
) +
theme_minimal() +
theme(
text = element_text(family = "Junicode"),
plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
)
separate_rows: split rows with multiple genres into individual rows.
group_by and summarize: count the occurrences of each genre and create a summary table.
The table is arranged in descending order based on the count of movies in each genre.
genre_palette: red-green christmas theme color palette for bar chart.
reorder: order the genres based on the count of movies in descending order.
Output:
Comedy, Romance and Drama dominate the Christmas season followed by Family movies. This is quite expected as most people tend to gravitate towards ligh-hearted comedy or rom-coms this time of the year.
3. Top & Lowest Rated/Voted Movies
# Convert votes to numeric
movies$votes_numeric <- as.numeric(gsub(",", "", movies$votes))
# Top movies based on ratings
top_movies <- movies %>%
arrange(-imdb_rating, -votes_numeric) %>%
head(30)
# Filter movies released in the year 2000 and later
filtered_movies <- movies %>%
filter(release_year >= 2000)
# Top movies based on ratings (2000 and later)
top_movies <- filtered_movies %>%
arrange(-imdb_rating, -votes_numeric) %>%
head(50)
# Visualize
ggplot(top_movies, aes(x = reorder(title, imdb_rating), y = imdb_rating)) +
geom_bar(stat = "identity", fill = '#042D29', width = 0.7) +
geom_text(aes(label = imdb_rating), hjust = -0.2, size = 3) + # Add text labels
coord_flip() +
labs(
title = "Top-Rated Christmas Movies",
x = NULL,
y = NULL
) +
theme_minimal() +
theme(
text = element_text(family = "Junicode"),
legend.position = "none",
axis.text.x = element_blank()
)
arrange and head: gets the top 15 movies based on IMDb ratings and votes. If rating is same then votes are considered.
Output:
The first two are episodes from TV shows. We can see Christmas classics here like 'Klaus', 'How the Grinch Stole Christmas', 'It's a Wonderful Life', etc.
Also, do we all finally agree that 'Die Hard' is a Christmas movie ?
I have never heard of the first three and was even more surprised when I realized 'Jingle Vingle' is an Indian movie. I was expecting to see few other movies as the popular one's frequent the recommendation lists. So, I decided to find the most voted (popular) movies.
That looks more like everyone's favorite lists! Funny that the popular and beloved movies aren't rated higher.
Well, now I'm curious to watch 'We Wish You a Turtle Christmas' because of the name :P
4. Underlying themes
Lastly, let's take a look at the movie description and find the most recurring words.
This will help us uncover the common themes, prevalent keywords and core motifs that make Christmas movies a timeless tradition.
corpus <- Corpus(VectorSource(movies$description))
# Preprocess the corpus
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Create a tdm
tdm <- TermDocumentMatrix(corpus)
# Convert the tdm to a matrix
m <- as.matrix(tdm)
# Get word frequencies
word_freqs <- sort(rowSums(m), decreasing=TRUE)
# Get top 100 word frequencies
top_word_freqs <- head(word_freqs, 100)
# Custom color palette
my_color_palette <- brewer.pal(8, "Dark2")
# Create a word cloud with top 100 words
wordcloud(
words = names(top_word_freqs),
freq = top_word_freqs,
scale = c(3, 0.5),
min.freq = 1,
colors = my_color_palette
)
wordcloud(words = names(word_freqs), freq = word_freqs, scale = c(3, 0.5), min.freq = 1, colors = brewer.pal(8, "Dark2"))
Create a text corpus from the 'description' column of the movies data frame.
Preprocess to convert text to lowercase, remove punctuation, numbers, and common English stopwords.
Create a tdm matrix from the preprocessed corpus and convert it to a standard matrix, then word frequencies are calculated.
We define a custom color palette using the brewer.pal function.
The wordcloud function is used to create a word cloud with the top 100 words, scaling, and custom color palette.
Output:
Christmas, holiday, family, love, santa, home, town- well that pretty much sums it up!
The one disappointing thing I realized while analyzing is 'The Holiday' is not a part of this dataset. How is a Christmas movies dataset complete without it? That movie has the best plot ;)
Next on my watchlist is 'We Wish You a Turtle Christmas' 🐢🎄 What's on yours ?
Day5 of my #100daysofdata challenge.
Let me know if you have any suggestions or ideas for me.
Complete code available on Github.
Happy analyzing!
Which Die Hard is that? That series is definitely good to watch on Christmas (also any other day)