After a brief hiatus due to laptop issues, I'm back to the data challenge. Today, I continued working on the NYT Bestsellers dataset that I'd scraped earlier.
The NYT API was missing genre information, which is quite pivotal for our analysis. I tried several other book APIs and approaches, but none gave me the level of granularity I was looking for (I want the genre to suggest more than just 'Fiction', since we are already dealing with fiction books). Goodreads lists multiple genres for any particular book, but their API does not expose that information. So, you guessed it right: I decided to scrape it!
1. Scraping Goodreads for Genres
We begin by creating a function which takes a book title as input and extracts genre information from Goodreads.
library(rvest)

scrapeGenres <- function(title) {
  # construct the Goodreads search URL (URLencode handles spaces etc.)
  search_url <- paste0("https://www.goodreads.com/search?q=",
                       URLencode(title, reserved = TRUE))
  # run the Goodreads search
  search_results <- read_html(search_url)
  # extract the link to the first result
  first_result_link <- search_results %>%
    html_nodes("a.bookTitle") %>%
    html_attr("href") %>%
    head(1)
  if (length(first_result_link) == 0) return(character(0))
  # construct the full URL for the first result
  full_url <- paste0("https://www.goodreads.com", first_result_link)
  # scrape the genre information from the book page
  genres <- full_url %>%
    read_html() %>%
    html_nodes(".BookPageMetadataSection__genres a") %>%
    html_text()
  return(genres)
}
The function starts by URL-encoding the title to build a valid search query.
It then constructs the Goodreads search URL and retrieves the search results page. Usually the first result on the page is the book we want.
Using the extracted URL, the genre information is scraped from the book's page.
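As a quick sanity check, the function can be called on a single title before running the full loop. The title here is just an illustrative example; the genres returned depend on Goodreads' live page content:

```r
# example call (output depends on the live Goodreads page)
genres <- scrapeGenres("Tomorrow, and Tomorrow, and Tomorrow")
head(genres, 4)
```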
2. Iterating through book titles
Next, let's iterate through the titles in the NYT dataframe and call the scrapeGenres function for each title.
# iterate through titles in the NYT dataframe
combined_data_with_genres <- combined_data
combined_data_with_genres$Genre1 <- NA
combined_data_with_genres$Genre2 <- NA
combined_data_with_genres$Genre3 <- NA
combined_data_with_genres$Genre4 <- NA

for (i in seq_len(nrow(combined_data_with_genres))) {
  title <- combined_data_with_genres$Title[i]
  genres <- scrapeGenres(title)
  # assign the top 4 genres (indexing past the end of the vector yields NA)
  combined_data_with_genres$Genre1[i] <- genres[1]
  combined_data_with_genres$Genre2[i] <- genres[2]
  combined_data_with_genres$Genre3[i] <- genres[3]
  combined_data_with_genres$Genre4[i] <- genres[4]
  # delay to avoid overwhelming the site
  Sys.sleep(10)
}
We assign the top 4 genres to corresponding columns in the dataframe.
To avoid overwhelming the site, a delay of 10 seconds is introduced between each request.
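Any single request can still fail (a timeout, or a title with no search results), which would abort the whole loop partway through. One option, sketched below, is to wrap the scraper in `tryCatch` so a bad title just gets empty genres; `safeScrapeGenres` is a hypothetical helper name, not part of the original code:

```r
# sketch: skip titles whose scrape fails instead of stopping the loop
safeScrapeGenres <- function(title) {
  tryCatch(
    scrapeGenres(title),
    error = function(e) {
      message("Failed for '", title, "': ", conditionMessage(e))
      character(0)  # empty result, so Genre1..Genre4 become NA
    }
  )
}
```

Calling `safeScrapeGenres(title)` inside the loop in place of `scrapeGenres(title)` would let the iteration run to completion even when individual lookups fail.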
3. Final genre column
Although it's useful to have more than one genre per book, we also want a final genre column that captures the main genre. Usually the scraped genres are quite suggestive, unless they are generic ones like 'Fiction' or 'Audiobook'.
library(dplyr)

# create a final genre column based on conditions
NYT_df_combined_genre <- combined_data_with_genres %>%
  mutate(Genre = case_when(
    !(Genre1 %in% c('Fiction', 'Audiobook', 'Contemporary')) ~ Genre1,
    !(Genre2 %in% c('Fiction', 'Audiobook', 'Contemporary')) ~ Genre2,
    !(Genre3 %in% c('Fiction', 'Audiobook', 'Contemporary')) ~ Genre3,
    !(Genre4 %in% c('Fiction', 'Audiobook', 'Contemporary')) ~ Genre4,
    TRUE ~ 'Unknown'
  ))
We use the `mutate` function from the dplyr package to create a new 'Genre' column from the scraped genres, with `case_when` picking the first non-generic genre in order.
We prioritize non-generic genres, falling back to a default of 'Unknown' if all four are generic or missing.
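Once the final column exists, a quick `count` shows how the genres are distributed and how many titles fell through to 'Unknown' (assuming dplyr is loaded and the dataframe above exists):

```r
library(dplyr)

# tally the final genres, most common first
NYT_df_combined_genre %>%
  count(Genre, sort = TRUE)
```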
Day6 of my #100daysofdata challenge.
Let me know if you have any suggestions or ideas, or any analysis you'd like me to perform on this dataset. I'm quite excited to analyse this data.
Complete code available on Github.
Happy analyzing!