Scrape-It-Yourself: Spotify Charts

Learn to scrape any website with rvest, purrr and SelectorGadget.

March 20, 2018

DISCLAIMER: This isn’t just some over-the-top article you tweet to make yourself look smart. I’m going to go slow and equip you with the right tools so YOU understand the little differences when the data source changes.

Recently, I’ve been looking for neat projects that teach deeper technical understanding of a practical data science method, rather than showcasing code for a specific instance you cannot replicate.

There have been lots of articles using R’s rvest package to show how easy it can be to scrape things off websites, like reviews, sport stats, book titles, or whatever you please. I was inspired by my recent visit to Toronto’s Spotify HQ (where I received a controversial red toque you can ask the staff about) to combine my love for music with a hands-on project I can teach.

Here’s what we are going to do:

Create a URL sequence function that will feed our scraper the right directions to find WHERE the data is.
Create a scraper function that will collect WHAT attributes we want into a tibble.
Do some neat analysis with dplyr for visualization!
how do I stop making a list, it won’t let me st-

Also we need the following packages:

library(rvest)
library(tidyverse)
library(magrittr)
library(scales)
library(knitr)
library(lubridate)
library(tibble)

Creating the URL Sequence

We want to replicate a URL over the range of pages we want our scraper to pull from. I determined my range will be Daily, Top 200, and Canadian hits from the month of February.

We need to find the variable that defines the way pages are listed on the site and loop it into the constant. A little investigation is all that’s required:

Constant ----------------------------------------------Variable
https://spotifycharts.com/regional/ca/daily/2018-02-01
https://spotifycharts.com/regional/ca/daily/2018-02-02
https://spotifycharts.com/regional/ca/daily/2018-02-03

This is a simple, yet crucial step. I will use a bonus example from Arvid Kingl’s R Web Scraping Tutorial, where he chose to scrape the website TrustPilot for reviews of Amazon. Their URLs were in page=n format as below.

https://www.trustpilot.com/review/www.amazon.com?page=2
https://www.trustpilot.com/review/www.amazon.com?page=3
https://www.trustpilot.com/review/www.amazon.com?page=4

Each site is unique. Be aware of your doings, and you’ll be fine.
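
For a page=n site like the TrustPilot example, the same constant-plus-variable idea can be sketched with a plain range of integers. The page numbers 2 to 4 come from the URLs above; the object names are my own, chosen only for illustration.

```r
# Constant part of the TrustPilot URL, taken from the example above
review_url <- "https://www.trustpilot.com/review/www.amazon.com?page="

# Variable part: here just page numbers instead of dates
pages <- 2:4

# paste0() is vectorised, so one call yields one URL per page
page_urls <- paste0(review_url, pages)
page_urls
```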

1. Fix the Constant

We want to assign our constant to be a character object so that we can use it in our function, combining it with variables later.

url <- "https://spotifycharts.com/regional/ca/daily/"

2. Define the Range for Our Variable

Let’s define the range: we want to create dates from 2018-02-01 to 2018-02-28. We can tackle this problem with the seq() function. R’s intuitive date object lets us define our sequence to be counted by “day.”

timevalues <- seq(as.Date("2018/02/01"), as.Date("2018/02/28"), by = "day")
timevalues[1:3]
 [1] "2018-02-01" "2018-02-02" "2018-02-03"

3. Uniting the Two

We will create a function, unitedata, to feed our variable into our constant URL.

unitedata<- function(x){
 full_url <- paste0(url, x)
 full_url
}

finalurl <- unitedata(timevalues)
finalurl[1:3]
[1] "https://spotifycharts.com/regional/ca/daily/2018-02-01" 
[2] "https://spotifycharts.com/regional/ca/daily/2018-02-02"
[3] "https://spotifycharts.com/regional/ca/daily/2018-02-03"
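
If you expect to reuse this for other date ranges, the three steps can be folded into one function. This is only a sketch: build_chart_urls() is a hypothetical helper, not part of the original script, and treating the "ca" slot of the URL as a swappable region code is an assumption based on the URL shape rather than something I have checked for every country.

```r
# Hypothetical helper: same idea as unitedata(), but with the region
# and the date range passed in as arguments.
build_chart_urls <- function(region, from, to) {
  base <- paste0("https://spotifycharts.com/regional/", region, "/daily/")
  days <- seq(as.Date(from), as.Date(to), by = "day")
  # paste0() converts the Date values to "YYYY-MM-DD" text
  paste0(base, days)
}

feb_urls <- build_chart_urls("ca", "2018-02-01", "2018-02-28")
length(feb_urls)   # one URL per day in the range
head(feb_urls, 2)
```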

Ok, we have the site mapped and ready to be scraped. So next up is designing our scraper to hunt down the right data.

Shopping for Attributes with SelectorGadget

Take some time to digest what you see here; just know this gets easier the more you do it, especially once you discover SelectorGadget. Without this tool, I would need to right-click→Inspect the website and scour the HTML code for attributes, which isn’t ideal… or fun… or particularly a strength of mine.

Thankfully, SelectorGadget makes this mindless. It lets you select the attributes you want on the page and returns html_node values, like “strong,” that are required for our scraper. See the slideshow below where I select all the variables I want for my pull: Rank, Track, Artist, Streams and Date.


Note: Be sure to deselect the Country and Daily/Weekly drop-downs to isolate for Date.
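
A quick way to sanity-check a selector is to count how many nodes it matches. Since each daily chart is a Top 200, a selector for a per-row attribute should match 200 nodes on the real page. The sketch below runs on a tiny made-up HTML snippet (two rows, not the real page) so it works offline; count_nodes() is a hypothetical helper.

```r
library(rvest)

# Made-up HTML that imitates the chart layout; the real page is much larger
toy_page <- read_html('
<div class="header"><span>Filter</span></div>
<table>
  <tr>
    <td class="chart-table-position">1</td>
    <td class="chart-table-track"><strong>Song A</strong> <span>by Artist A</span></td>
    <td class="chart-table-streams">1,234</td>
  </tr>
  <tr>
    <td class="chart-table-position">2</td>
    <td class="chart-table-track"><strong>Song B</strong> <span>by Artist B</span></td>
    <td class="chart-table-streams">987</td>
  </tr>
</table>')

# Hypothetical helper: how many nodes does a CSS selector match?
count_nodes <- function(page, css) length(html_nodes(page, css))

# A bare "span" also catches the span in the header
broad <- count_nodes(toy_page, "span")
# Scoping it to the track cell keeps only the artist spans
narrow <- count_nodes(toy_page, ".chart-table-track span")
c(broad = broad, narrow = narrow)
```

If the count is higher than the number of chart rows, the selector is too broad and needs to be narrowed before it goes into the scraper.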

Get That Scraper Running

Great, now let’s gather all of these outputs (see the slideshow above) from SelectorGadget and slap them in the rvest scraper format:

SpotifyScrape <- function(x){
 # download and parse the page once, then reuse it for every attribute
 page <- read_html(x)
 rank <- page %>% html_nodes('.chart-table-position') %>% html_text()
 track <- page %>% html_nodes('strong') %>% html_text()
 artist <- page %>% html_nodes('.chart-table-track span') %>% html_text()
 streams <- page %>% html_nodes('td.chart-table-streams') %>% html_text()
 dates <- page %>% html_nodes('.responsive-select~ .responsive-select+ .responsive-select .responsive-select-value') %>% html_text()

 # combine into a named tibble; a length-one date is recycled down the rows
 chart <- tibble(Rank = rank, Track = track, Artist = artist, Streams = streams, Date = dates)
 return(chart)
}

Note: In future projects, you’ll swap in the selectors and column names that fit each new site.

Now we have written WHERE our scraper will search with the finalurl output, and we have designed HOW it will look using our SpotifyScrape function. Let’s combine the two with purrr’s map_df function and store the dataset:

spotify <- map_df(finalurl, SpotifyScrape)
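
If one bad page stopping the whole run worries you, here is a hedged sketch of a gentler loop: pause between requests and let a failed page return nothing instead of an error. To keep it runnable offline, fake_scrape() stands in for SpotifyScrape(); it, polite_scrape() and the toy "URLs" are all made up for illustration.

```r
library(purrr)
library(tibble)

# Stand-in for SpotifyScrape() so the sketch runs offline (hypothetical):
# it fails for one "URL" the way a missing chart page might.
fake_scrape <- function(u) {
  if (grepl("missing", u)) stop("page not available")
  tibble(Source = u, Rows = 200L)
}

# possibly() turns an error into a fallback value instead of stopping the run
safe_scrape <- possibly(fake_scrape, otherwise = NULL)

# Pause between requests, then scrape; `pause` is in seconds
polite_scrape <- function(u, pause = 0.1) {
  Sys.sleep(pause)
  safe_scrape(u)
}

toy_urls <- c("day-1", "missing-day", "day-3")

# map_df() row-binds the results (via dplyr) and drops the NULL from the failure
toy_result <- map_df(toy_urls, polite_scrape)
toy_result
```

With the real scraper you would wrap SpotifyScrape the same way and pick a longer pause; how long is polite enough depends on the site.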

TIME TO EXECUTE THE SCRAPER!!!! EVERYONE REMAIN CALM AND get a drink or something, you can finally chill for a minute. You’ve been learning lots of good data science methods. Maybe play The Weeknd, some Justin Bieber or a little Shania Twain? I think she just made another album and I found that kind of odd, but I guess she’s still doing halftime shows at the Grey Cup and other stuff up here in Canada…

Ok time’s up, let’s have a look:

Lovely! That chart looks good… to the average person maybe.

Cleaning the Pull

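Immediately, we can see two glaring issues in our chart: every artist is prefixed with “by ” and the stream counts contain commas. Two more are lying within the structure of our data: Streams and Date are stored as text rather than as a number and a date.

# "^by " only strips the prefix, so names that contain "by " elsewhere survive
spotify %<>% 
mutate(Artist = sub("^by ", "", Artist), 
       Streams = gsub(",", "", Streams), 
       Streams = as.numeric(Streams), 
       Date = as.Date(Date, "%m/%d/%Y"))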
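
To see what each cleaning step does, here is the same kind of mutate() on two made-up rows in the raw shape the scraper returns. The rows are invented for illustration; only the column names and the month/day/year date format come from the code above.

```r
library(dplyr)
library(tibble)

# Two made-up rows in the raw shape the scraper returns (all text)
raw <- tibble(
  Rank    = c("1", "2"),
  Track   = c("Song A", "Song B"),
  Artist  = c("by Artist A", "by Artist B"),
  Streams = c("1,234,567", "987,654"),
  Date    = c("02/01/2018", "02/01/2018")
)

clean <- raw %>%
  mutate(Artist  = sub("^by ", "", Artist),             # drop the leading "by "
         Streams = as.numeric(gsub(",", "", Streams)),  # remove commas, then convert
         Date    = as.Date(Date, "%m/%d/%Y"))           # text to Date

str(clean)
```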

Through the power of our writer/reader dynamic, it looks like we’ve had our first successful taste of scraping a website. We have plenty of data to work with, and our table is now cleaner than Drake’s fade.

Quick Analysis Fun

Speaking of Drake, it looks like he gets a lot of streams on that “God’s Plan” track. I wonder if he had the most streams in February:

spotify %>% 
     group_by(Artist) %>% 
     summarise(Total = sum(Streams)) %>% 
     arrange(desc(Total)) %>%
     top_n(25, Total) %>%
 ggplot() +
     geom_col(aes(x = reorder(Artist, Total), y = Total), fill = "forest green") +
     coord_flip() + 
     scale_y_continuous(labels = unit_format(unit = "B", scale = 1e-9))

Yup. Drake is King.
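
The same filter-and-count pattern answers other questions too, for example how many days each track spent at number one. Here is a sketch on made-up rows; the column names match the cleaned table, and Rank is converted first because the scraper returns it as text.

```r
library(dplyr)
library(tibble)

# Three made-up days of a two-track chart
toy_chart <- tibble(
  Rank   = c("1", "2", "1", "2", "1", "2"),
  Track  = c("Song A", "Song B", "Song A", "Song B", "Song B", "Song A"),
  Artist = c("Artist A", "Artist B", "Artist A", "Artist B", "Artist B", "Artist A"),
  Date   = as.Date(c("2018-02-01", "2018-02-01", "2018-02-02",
                     "2018-02-02", "2018-02-03", "2018-02-03"))
)

days_at_top <- toy_chart %>%
  mutate(Rank = as.integer(Rank)) %>%   # the scraper returns Rank as text
  filter(Rank == 1) %>%                 # keep only the number-one rows
  count(Track, Artist, name = "DaysAtNumberOne") %>%
  arrange(desc(DaysAtNumberOne))

days_at_top
```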

Recap

Kingl’s article actually ran a hypothesis test on two companies based on their average reviews. It just shows that there are plenty of things to do with all types of data, and we now know how to load up on lots of it.

Here’s everything we learned. If you TL;DR’d to the bottom, maybe consider scrolling up and reading.

We figured out how to sequence a range of URL pages.
We learned to pick the nodes of data we want from the web using SelectorGadget.
We tailored a scraper to find our five HTML nodes containing data off Spotify Charts.
After a bit of Shania Twain trivia and some data cleaning, we could query and visualize our data successfully.
We became better data scientists!
how do i stop doing the bullet poin-

Thanks for reading!
Find the full GitHub script here.

11 comments on “Scrape-It-Yourself: Spotify Charts”

Bas

March 28, 2018

This is awesome! It is the first time I used R, really nice introduction to the language and to web scraping. Will use it for my Data Science course.

PS: I also needed to include library(tibble)

- Jake Daniels
  
  November 20, 2018
  
  Glad it’s helped! I added tibble to the list of packages in the introduction just to be sure.
  
Paul

August 22, 2018

Tremendous – thanks very much. Been studying pipes and ggplot and this has made so much sense – massive help.

- Jake Daniels
  
  August 22, 2018
  
  Glad to hear Paul! Writing it helped reinforce my learning of scraping too. Kudos.
  
Rowie

August 22, 2018

Hi, Jake! The post is very useful and very well explained. It has made some concepts really clear. By the way, I would like to ask whether you have another post, or could give me some advice, on scraping a page like this one: “http://www.bna.com.ar/Personas”, where to access the information you want to scrape you have to make some clicks: “Ver histórico” (search with ctrl + f) > “Ver Histórico Principal”, then click the calendar to choose different days to get the exchange value for each day. Is it possible to do this with rvest or another R library?

- Jake Daniels
  
  August 22, 2018
  
  Hi Rowie, seeing as it lies within a JavaScript app, it is not easy to navigate using any URL strings, and my expertise falls off. Here is what I’m finding from right-click > Inspect in the calendar app.
  $.datepicker.regional['es'] = {
  closeText: 'Cerrar',
  prevText: '',
  currentText: 'Hoy',
  monthNames: ['Enero', 'Febrero', 'Marzo', 'Abril', 'Mayo', 'Junio', 'Julio', 'Agosto', 'Septiembre', 'Octubre', 'Noviembre', 'Diciembre'],
  etc…..
  
  I think you need to design something flexible that can crawl through these values to gather your desired nodes of info, since that is how they are sorted. I hope this helps point you in the right direction to find more resources!
  
Jorge (@JorgeArgueta)

October 7, 2018

I’m learning how to scrape with Python in school; however, I’ve been coding in R for the past year and this article fits exactly what I need for my class project.

Thanks for taking the time to put this together!!

Jorge (@JorgeArgueta)

October 8, 2018

One question: when I used the SelectorGadget tool, I was getting different html_nodes. For artist I was getting “span” rather than “.chart-table-track span”, and for streams I was getting “chart-table-streams” instead of “td.chart-table-streams”.

How did you figure out that you needed to alter those? When I tried to do this myself, I was getting errors in the scraping function, seems like my data frame was more than 200 rows…which was the expected row number.

Any tips on how you figure the correct html node?

Thanks.

- Jake Daniels
  
  October 9, 2018
  
  Hi Jorge!
  
  Yes, so if we check the slideshow you can see where I specifically clicked (and filtered) for the columns. The hint does lie with the number count in SelectorGadget; if it is equal to 200 results, you’ve made the correct adjustments to pull.
  
  We can also investigate the website further by [ Right-Click -> Inspect ] and isolate for the correct element names as well. There is trial and error, but using more tools helps us find it faster.
  
  Hope this helps!
  