Intermediate Learning R

Scrape-It-Yourself: Spotify Charts

Learn to scrape any website with rvest, purrr and SelectorGadget.

DISCLAIMER: This isn’t just some over-the-top article you tweet to make yourself look smart. I’m going to go slow and equip you with the right tools so YOU understand the little differences when the data source changes.


Recently, I’ve been looking for neat projects that teach deeper technical understanding of a practical data science method, rather than showcasing code for a specific instance you cannot replicate.

Plenty of articles use R’s rvest package to show how easy it can be to scrape things off websites, like reviews, sport stats, book titles, or whatever you please. Inspired by my recent visit to Toronto’s Spotify HQ (where I received a controversial red toque you can ask the staff about), I decided to combine my love for music with a hands-on project I can teach.

Here’s what we are going to do:

  1. Create a URL sequence function that will feed our scraper the right directions to find WHERE the data is.
  2. Create a scraper function that will find WHAT attributes we want and collect them into a tibble.
  3. Do some neat analysis with dplyr for visualization!
  4. how do I stop making a list, it won’t let me st-

Also we need the following packages:

library(rvest)
library(tidyverse)
library(magrittr)
library(scales)
library(knitr)
library(lubridate)

 

Creating the URL Sequence


We want to replicate a URL over the range of pages we want our scraper to pull from. I decided my range will be the Daily Top 200 chart for Canada over the month of February 2018.

We need to find the variable that defines the way pages are listed on the site and loop it into the constant. A little investigation is all that’s required:

Constant ----------------------------------------------Variable
https://spotifycharts.com/regional/ca/daily/2018-02-01
https://spotifycharts.com/regional/ca/daily/2018-02-02
https://spotifycharts.com/regional/ca/daily/2018-02-03

Note: Yes, since I’m looking at February the day is our only variable, but just for good measure we will work with the full date. Checkmate, comment section.

This is a simple, yet crucial step. I will use a bonus example from Arvid Kingl’s R Web Scraping Tutorial, where he chose to scrape the website TrustPilot for reviews of Amazon. Their URLs were in page=n format as below.

https://www.trustpilot.com/review/www.amazon.com?page=2
https://www.trustpilot.com/review/www.amazon.com?page=3
https://www.trustpilot.com/review/www.amazon.com?page=4

Each site is unique. Be aware of your doings, and you’ll be fine.
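For a page=n site like that one, the same constant-plus-variable idea can be sketched in a couple of lines. This is a minimal sketch, not Kingl’s code: the names base_url, pages and page_urls are made up for illustration, and the page range 2 to 4 simply mirrors the three URLs above.

```r
# Constant part of the URL, copied from the Trustpilot example above
base_url <- "https://www.trustpilot.com/review/www.amazon.com?page="

# Variable part: the page numbers (2 to 4 here, purely for illustration)
pages <- 2:4

# paste0() is vectorised, so one call builds every URL
page_urls <- paste0(base_url, pages)
page_urls
```

Swap the page numbers for dates and you have exactly what we are about to build for Spotify.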

1. Fix the Constant

We want to assign our constant to be a character object so that we can use it in our function, combining it with variables later.

> url <- "https://spotifycharts.com/regional/ca/daily/"

2. Define the Range for Our Variable

Let’s define the range: we want to create dates from 2018-02-01 to 2018-02-28. We can tackle this problem with the seq() function. R’s intuitive date object lets us define our sequence to be counted by “day.”

> timevalues <- seq(as.Date("2018/02/01"), as.Date("2018/02/28"), by = "day")
> timevalues[1:3]
 [1] "2018-02-01" "2018-02-02" "2018-02-03"
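One reason to work with the full date rather than just the day number: seq() knows the calendar, so a range that crosses a month boundary still comes out right. A quick sketch (the object name crossing is mine, and the dates are only an illustration):

```r
# Four days that straddle the end of February 2018
crossing <- seq(as.Date("2018-02-27"), as.Date("2018-03-02"), by = "day")

# format() turns the Date objects into the YYYY-MM-DD text the URL needs
crossing_text <- format(crossing)
crossing_text
```

No hand-rolled logic for 28-day months required.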

3. Uniting the Two

We will create a function, unitedata, to feed our variable into our constant URL.

> unitedata <- function(x){
 full_url <- paste0(url, x)
 full_url
}
> finalurl <- unitedata(timevalues)
> finalurl[1:3]
[1] "https://spotifycharts.com/regional/ca/daily/2018-02-01" 
[2] "https://spotifycharts.com/regional/ca/daily/2018-02-02"
[3] "https://spotifycharts.com/regional/ca/daily/2018-02-03"

Note: paste0() is used instead of paste() because it behaves like paste(..., sep = ""): the pieces are joined with nothing in between, so we don’t need to set any arguments ourselves.
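If you want to see the difference for yourself, here is a small side-by-side. The object names with_paste, with_paste0 and same_thing are just labels I picked for the comparison.

```r
url <- "https://spotifycharts.com/regional/ca/daily/"
day <- as.Date("2018-02-01")

with_paste  <- paste(url, day)            # default sep = " " leaves a space before the date
with_paste0 <- paste0(url, day)           # nothing in between, so the URL is usable
same_thing  <- paste(url, day, sep = "")  # paste() can do it too, with one extra argument

with_paste
with_paste0
```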

Ok, we have the site mapped and ready to be scraped. So next up is designing our scraper to hunt down the right data.

 

Shopping for Attributes with SelectorGadget


Take some time to digest what you see here; just know this gets easier the more you do it, especially once you discover SelectorGadget. Without this tool, I would need to right-click→Inspect the website and scour the HTML code for attributes, which isn’t ideal… or fun… or particularly a strength of mine.

Thankfully, SelectorGadget makes this mindless. It lets you click the elements you want on the page and returns the CSS selectors, like “strong,” that html_nodes() needs in our scraper. See the slideshow below where I select all the variables I want for my pull: Rank, Track, Artist, Streams and Date.

 

[Slideshow: selecting Rank, Track, Artist, Streams and Date with SelectorGadget]

Note: Be sure to deselect the Country and Daily/Weekly drop-downs to isolate for Date.
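To get a feel for what those selectors do before pointing them at a live site, you can try them on a scrap of HTML typed straight into R. Big caveat: the markup below is made up by me and only imitates the class names used in this post; it is not copied from the real Spotify Charts page.

```r
library(rvest)

# Made-up HTML for illustration only (two fake chart rows)
toy_html <- '
<table>
  <tr>
    <td class="chart-table-position">1</td>
    <td class="chart-table-track"><strong>Song A</strong> <span>by Artist A</span></td>
    <td class="chart-table-streams">1,234</td>
  </tr>
  <tr>
    <td class="chart-table-position">2</td>
    <td class="chart-table-track"><strong>Song B</strong> <span>by Artist B</span></td>
    <td class="chart-table-streams">987</td>
  </tr>
</table>'

# read_html() also accepts literal HTML, not just a URL
page <- read_html(toy_html)

# Each selector picks out one "column" of the page as a character vector
tracks  <- page %>% html_nodes("strong") %>% html_text()
artists <- page %>% html_nodes(".chart-table-track span") %>% html_text()
streams <- page %>% html_nodes("td.chart-table-streams") %>% html_text()

tracks
artists
streams
```

Each call hands back one character vector per selector, which is exactly the raw material the scraper needs.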

Get That Scraper Running


Great, now let’s gather all of these outputs (see the slideshow above) from SelectorGadget and slap them in the rvest scraper format:

> SpotifyScrape <- function(x){
 # download and parse the page once, then reuse it for every selector
 page <- read_html(x)
 rank <- page %>% html_nodes('.chart-table-position') %>% html_text()
 track <- page %>% html_nodes('strong') %>% html_text()
 artist <- page %>% html_nodes('.chart-table-track span') %>% html_text()
 streams <- page %>% html_nodes('td.chart-table-streams') %>% html_text()
 dates <- page %>% html_nodes('.responsive-select~ .responsive-select+ .responsive-select .responsive-select-value') %>% html_text()

 # combine into a named tibble; the single date value is recycled down every row
 chart <- tibble(Rank = rank, Track = track, Artist = artist, Streams = streams, Date = dates)
 return(chart)
}

Note: In future projects, you’ll swap in the selectors and column names that fit each site.

Now we have written WHERE our scraper will search with the finalurl output, and we have designed HOW it will look using our SpotifyScrape function. Let’s combine the two with purrr’s map_df function and store the dataset:

> spotify <- map_df(finalurl, SpotifyScrape)
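One failed request will stop map_df and lose the whole pull. Here is a hedged sketch of one way to guard against that, using purrr’s possibly(). Note the assumptions: fake_scrape is a made-up stand-in for SpotifyScrape (it makes no web requests and simply pretends the second day is missing), and the short Sys.sleep() pause is my own addition to space requests out, not something the site asks for.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for SpotifyScrape: no network access, fails on one date
fake_scrape <- function(u) {
  Sys.sleep(0.1)                                   # small pause between requests
  if (grepl("2018-02-02", u)) stop("page not available")
  tibble(Source = u, Rows = 200L)
}

# possibly() returns NULL instead of throwing, so one bad page cannot sink the run
safe_scrape <- possibly(fake_scrape, otherwise = NULL)

urls <- paste0("https://spotifycharts.com/regional/ca/daily/",
               c("2018-02-01", "2018-02-02", "2018-02-03"))

# NULL results are dropped when the rows are bound together
pulled <- map_df(urls, safe_scrape)
pulled
```

In the real pull you would wrap the scraper the same way: map_df(finalurl, possibly(SpotifyScrape, otherwise = NULL)).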

TIME TO EXECUTE THE SCRAPER!!!! EVERYONE REMAIN CALM AND get a drink or something, you can finally chill for a minute. You’ve been learning lots of good data science methods. Maybe play The Weeknd, some Justin Bieber or a little Shania Twain? I think she just made another album and I found that kind of odd, but I guess she’s still doing halftime shows at the Grey Cup and other stuff up here in Canada…

Ok time’s up, let’s have a look:

[Image: outputspofity.PNG, a preview of the scraped spotify table]

Lovely! That chart looks good… to the average person maybe.

 

Cleaning the Pull


Immediately, we can see two glaring issues in our chart, and two more are lying within the structure of our data.

I’ve hidden the code to fix these problems in the space below, but I invite you to try to fix it yourself first!

> spotify %<>% 
 mutate(Artist = gsub("^by ", "", Artist),   # drop only the leading "by "
 Streams = gsub(",", "", Streams),          # strip the thousands separators
 Streams = as.numeric(Streams),             # character -> numeric
 Date = as.Date(Date, "%m/%d/%Y"))          # character -> Date

# Hint: mutate(Artist = gsub("by ",....) , Streams = gsub(",",...)) 
# and then look at the column structures and correct them accordingly.
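To see what the fix does, here is a worked example on a two-row toy table. The rows are invented (the second artist name is made up specifically because it contains "by " in the middle), and the date text assumes the %m/%d/%Y layout used in the code above. This sketch anchors the pattern as "^by " so only the leading prefix goes; an unanchored "by " would also bite into the middle of a name.

```r
library(dplyr)

# Invented rows that mimic the raw scrape: everything arrives as text
raw <- tibble(
  Artist  = c("by Drake", "by Abby Road Band"),
  Streams = c("1,234,567", "89,012"),
  Date    = c("02/01/2018", "02/01/2018")
)

clean <- raw %>%
  mutate(Artist  = gsub("^by ", "", Artist),           # remove the prefix only
         Streams = as.numeric(gsub(",", "", Streams)), # "1,234,567" -> 1234567
         Date    = as.Date(Date, "%m/%d/%Y"))          # text -> Date

# What the unanchored pattern would do to the second name
mangled <- gsub("by ", "", "by Abby Road Band")

str(clean)
mangled
```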

Through the power of our writer/reader dynamic, it looks like we’ve had our first successful taste of scraping a website. We have plenty of data to work with, and our table is now cleaner than Drake’s fade.

 

Quick Analysis Fun


Speaking of Drake, it looks like he gets a lot of streams on that “God’s Plan” track. I wonder if he had the most streams in February:

> spotify %>% 
 group_by(Artist) %>% 
 summarise(Total = sum(Streams)) %>% 
 arrange(desc(Total)) %>%
 top_n(25, Total) %>%
 ggplot() +
 geom_col(aes(x = reorder(Artist, Total), y = Total), fill = "forest green") +
 coord_flip() + 
 scale_y_continuous(labels = unit_format(unit = "B", scale = 1e-9))

[Image: woooo.PNG, a bar chart of total streams for the top 25 artists]

Yup. Drake is King.
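If the pipeline above feels like a lot at once, here is the same group_by/summarise/arrange pattern on a toy table, minus the plot. The artists and stream counts are invented for illustration.

```r
library(dplyr)

# Invented data: artist A appears on two days, B and C on one each
toy <- tibble(
  Artist  = c("A", "B", "A", "C"),
  Streams = c(100, 50, 200, 75)
)

totals <- toy %>%
  group_by(Artist) %>%               # one group per artist
  summarise(Total = sum(Streams)) %>% # add up each artist's streams
  arrange(desc(Total))               # biggest total first

totals
```

The ggplot layers then only have to draw what this table already says.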

Recap


Kingl’s article actually ran a hypothesis test on two companies based on their average reviews. It just shows there are plenty of things to do with all types of data, and we now know how to load up on lots of it.

Here’s everything we learned. If you TL;DR’d to the bottom, maybe consider scrolling up and reading.

  • We figured out how to sequence a range of URL pages.
  • We learned to pick the nodes of data we want from the web using SelectorGadget.
  • We tailored a scraper to find our five HTML nodes containing data off Spotify Charts.
  • After a bit of Shania Twain trivia and some data cleaning, we could query and visualize our data successfully.
  • We became better data scientists!
  • how do i stop doing the bullet poin-

 

Find the full GitHub script here.

7 comments on “Scrape-It-Yourself: Spotify Charts”

  1. Pingback: Uncovering Hidden Trends in AirBnB Reviews – DataCritics

  2. This is awesome! It is the first time I used R, really nice introduction to the language and to web scraping. Will use it for my Data Science course.

    PS: I also needed to include library(tibble)

  3. Pingback: Build-a-ggplot: The Fall of The Simpsons – datacritics

  4. Tremendous – thanks very much. Been studying pipes and ggplot and this had made so much sense – massive help.

  5. Hi, Jake! The post is very useful and very well explained. It has made some concepts really clear. By the way, I would like to ask whether you have another post, or maybe some advice, on scraping a page like this one: “http://www.bna.com.ar/Personas”, where to access the information you want to scrape, you have to make some clicks: “Ver histórico” (search with ctrl + f) > “Ver Histórico Principal”. Then you click the calendar to choose the different days and get the exchange value for each day. Is it possible to do this with rvest or another R library?

    • Hi Rowie, seeing as it lies within a JavaScript app, it is not easy to navigate using any URL strings and my expertise falls off. Here is what I’m finding from right-click > Inspect in the calendar app.
      $.datepicker.regional[‘es’] = {
      closeText: ‘Cerrar’,
      prevText: ”,
      currentText: ‘Hoy’,
      monthNames: [‘Enero’, ‘Febrero’, ‘Marzo’, ‘Abril’, ‘Mayo’, ‘Junio’, ‘Julio’, ‘Agosto’, ‘Septiembre’, ‘Octubre’, ‘Noviembre’, ‘Diciembre’],
      etc…..

      I think you need to design something flexible that can crawl through these values to gather your desired nodes of info since that is how they are sorted. I hope this helps point you in the right direction to find more resources!