
Scrape NYT Bestsellers via API in R

Lekha Mirjankar

As the usual year-end book lists ('Best books of 2023') started popping up, I thought of analyzing the books that made the New York Times Best Sellers list. NYT publishes this list weekly, so before any analysis we first need to gather all the weekly data together.


Thankfully, they provide an API which will make it a lot easier for us to do so.


1. Obtaining API Key from NYT

To access the NYT Best Sellers API, you need an API key from the NYT Developer Portal. Create an account and follow the instructions listed here to obtain the key.


2. Setting Up the R Environment

Make sure the necessary R packages are installed. Run the install.packages line only if they are not already installed, then load them:

install.packages(c("httr", "jsonlite", "magrittr", "openxlsx"))
library(httr)
library(jsonlite)
library(magrittr)
library(openxlsx)

3. Make API request

get_nyt_data <- function(api_key, date) {
  url <- paste0("https://api.nytimes.com/svc/books/v3/lists/", date, "/combined-print-and-e-book-fiction.json?api-key=", api_key)
  res <- GET(url)

  Sys.sleep(12)
  • First, we construct the URL for the API request from the provided date and api_key, following the format the NYT API requires.

  • GET: This function from the httr package makes an HTTP GET request to the specified URL.

  • Sys.sleep: Pauses the script for the specified number of seconds (12 here) so that we don't send requests too rapidly and stay within the API's rate limits.
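The request step above does not check whether the call actually succeeded. Here is a minimal sketch of a guarded request. Note that build_nyt_url and fetch_nyt_list are hypothetical helper names that are not part of the original code, and the 12-second pause simply mirrors the value used above.

```r
# Hypothetical helper: builds the request URL in the same format used above.
build_nyt_url <- function(date, api_key,
                          list_name = "combined-print-and-e-book-fiction") {
  paste0("https://api.nytimes.com/svc/books/v3/lists/",
         date, "/", list_name, ".json?api-key=", api_key)
}

# Hypothetical helper: sends the request and stops with an R error on a
# failed HTTP status instead of passing a bad response downstream.
fetch_nyt_list <- function(date, api_key, pause = 12) {
  res <- httr::GET(build_nyt_url(date, api_key))
  Sys.sleep(pause)            # same pause as above, to respect rate limits
  httr::stop_for_status(res)  # raises an error for 4xx/5xx responses
  res
}
```

With a check like this, an invalid key or an exhausted quota shows up as a clear error at the request step rather than as a confusing failure later, when the data frame is built.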

To see which lists are available, you can request all the NYT list names and substitute the one you need for 'combined-print-and-e-book-fiction' in the code above.

res <- GET("https://api.nytimes.com/svc/books/v3/lists/names.json?api-key=<add your api key here>")

all_lists <- fromJSON(rawToChar(res$content))
all_lists
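The full list of names is long, so it helps to filter it. The sketch below runs on a small hand-made data frame, names_df, that stands in for all_lists$results. It assumes the response carries a list_name_encoded column holding the URL-ready name; check the column names in your own output before relying on this.

```r
# Hypothetical stand-in for all_lists$results; the real response has more
# rows and columns. The column names here are an assumption.
names_df <- data.frame(
  list_name = c("Combined Print and E-Book Fiction", "Hardcover Nonfiction"),
  list_name_encoded = c("combined-print-and-e-book-fiction",
                        "hardcover-nonfiction"),
  stringsAsFactors = FALSE
)

# Keep only the lists whose URL-ready name mentions e-books.
ebook_lists <- names_df[grepl("e-book", names_df$list_name_encoded, fixed = TRUE), ]
ebook_lists$list_name_encoded
```

The value in list_name_encoded is what goes into the request URL in place of 'combined-print-and-e-book-fiction'.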

4. Process Response and Extract Data

  content <- fromJSON(rawToChar(res$content), flatten = TRUE)
  
  list_name <- content$results$list_name
  bestsellers_date <- content$results$bestsellers_date
  published_date <- content$results$published_date
  display_name <- content$results$display_name
  normal_list_ends_at <- content$results$normal_list_ends_at
  books <- content$results$books
  rank <- books$rank
  title <- books$title
  author <- books$author
  • fromJSON: Converts the JSON content of the response to an R object.

  • rawToChar(res$content): Extracts the raw content of the response and converts it to a character string.

  • flatten = TRUE: Flattens nested data frames in the parsed result into a single non-nested data frame, making it easier to work with.

  • Then we extract specific fields from the content.
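To see the shape that fromJSON produces without calling the API, here is a small sketch that parses a hand-written JSON string. The sample is made up and imitates only the fields used above; a real response is much larger. With a real response you would pass rawToChar(res$content) instead of sample_json.

```r
library(jsonlite)

# Made-up sample that imitates only the fields used above.
sample_json <- '{
  "results": {
    "list_name": "Combined Print and E-Book Fiction",
    "bestsellers_date": "2022-12-24",
    "published_date": "2023-01-08",
    "books": [
      {"rank": 1, "title": "TITLE A", "author": "Author A"},
      {"rank": 2, "title": "TITLE B", "author": "Author B"}
    ]
  }
}'

content <- fromJSON(sample_json, flatten = TRUE)

# The JSON array of book objects is simplified into a data frame,
# so each field can be pulled out as a column.
books <- content$results$books
str(books)
```

Because books is a data frame, books$rank, books$title and books$author are plain vectors with one entry per book.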

5. Create DataFrame

  book_data <- data.frame(
    Rank = rank,
    Title = title,
    Author = author,
    BestsellerDate = bestsellers_date,
    PublishedDate = published_date
  )
  
  return(book_data)
}

Lastly, we create a data frame (book_data) from the extracted fields. Each field becomes a column; the single-valued bestsellers_date and published_date are repeated across every row.


6. Function Call

api_key <- "<add your api key here>"
start_date <- as.Date("2023-01-01")
end_date <- as.Date("2023-12-31")
date_range <- seq(start_date, end_date, by = "weeks")
all_data <- lapply(date_range, function(date) get_nyt_data(api_key, format(date, "%Y-%m-%d")))

combined_data <- do.call(rbind, all_data)

We call the get_nyt_data function once for each week by passing the required arguments.


  • date_range: vector containing a sequence of dates, generated using seq(start_date, end_date, by = "weeks"). Each date represents a week in the year 2023.

  • function(date) get_nyt_data(api_key, format(date, "%Y-%m-%d")): anonymous function that takes a single argument date and calls the get_nyt_data function with the api_key and the formatted date in the "YYYY-MM-DD" format.

  • lapply: used to apply a function to each element of a list/vector. It returns a list where each element is the result of applying the specified function to the corresponding element of the input list.

  • do.call(rbind, all_data): calls rbind with every data frame in all_data as an argument, stacking them row-wise into a single data frame.
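The pattern above can be tried out without calling the API. In this sketch, fake_fetch is a hypothetical stand-in for get_nyt_data that returns a tiny data frame and deliberately fails for one date. The tryCatch wrapper is an addition that is not in the original code: it keeps one failed week from stopping the whole run. The date range works out to 53 weekly dates, so with a 12-second pause per request the real loop spends over ten minutes just waiting.

```r
start_date <- as.Date("2023-01-01")
end_date <- as.Date("2023-12-31")
date_range <- seq(start_date, end_date, by = "weeks")
length(date_range)  # number of weekly requests the loop will make

# Hypothetical stand-in for get_nyt_data: returns one tiny data frame per
# date and fails on one date to imitate a bad request.
fake_fetch <- function(date) {
  if (date == "2023-01-08") stop("simulated request failure")
  data.frame(Rank = 1:2, Title = c("A", "B"), PublishedDate = date)
}

# tryCatch turns a failed week into NULL so the loop keeps going.
all_data <- lapply(date_range, function(date) {
  tryCatch(fake_fetch(format(date, "%Y-%m-%d")), error = function(e) NULL)
})

# Drop the failed weeks, then stack the remaining data frames row-wise.
all_data <- Filter(Negate(is.null), all_data)
combined_data <- do.call(rbind, all_data)
nrow(combined_data)
```

Swapping fake_fetch for get_nyt_data(api_key, ...) gives the same flow against the real API.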

7. Export Data

excel_file_path <- "nyt_bestsellers.xlsx"

write.xlsx(combined_data, file = excel_file_path, sheetName = "Bestsellers 2023", rowNames = TRUE)

We can export the data frame to an Excel file using the write.xlsx function from the openxlsx package.


We have our data ready to be analyzed.

This is day 4 of my #100daysofdata challenge.

Let me know if you have any suggestions or ideas for me.


Complete code available on GitHub.


Happy analyzing!
