
Get Better at Graphing Categorical Data with ggplot2

Visualizing FordGoBike data

Hello, my name is Tiange, and I want to extract information from a large dataset and visualize it efficiently with R’s ggplot2 package. Some of the basic syntax needed can be found on RStudio’s ggplot2 cheatsheet.


Short Introduction to the Data

The data I am using for practice is the Ford GoBike public dataset, which tracked bikes and users between 2017-06-28 and 2017-12-31, found at FordGoBike.com.

If we take a glimpse at the variables in the dataset, we see the following:

[Figure: glimpse of the Ford GoBike data]

There are two types of users, which serve as the classifiers in this dataset:

  • Subscribers pay yearly or monthly fees. If they use a bicycle for less than 45 minutes, the ride is free; otherwise, $3 is charged per additional 15 minutes.
  • Customers are charged $2 for the first 30 minutes; if they keep the bike longer than 30 minutes, the rate increases to $3 per 15 minutes.

By the end, I will show you how to improve your ggplot graphs by learning new functions and arguments to best visualize the data, including:

  • how to stack bar graphs, with fill
  • how to overlap bar graphs, with position
  • how to combine multi-set data in one graph, with facet_wrap

Turning the data into an awesome ggplot graph

First, we will convert our date and time values into categorical ranges that better represent the data.

> library(dplyr)
> library(ggplot2)

1) Cutting the days into periods

I want to classify each ride into a period of the day (morning, noon, etc.) based on its start hour, to visualize how bicycle usage differs between periods. We can do this by extracting the hour from the start time, then cutting the hours into intervals that best represent these periods.

The duration is recorded in seconds, so we first convert it to a unit that is easier to digest: minutes.

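#seconds to minutes
> fordgobike$duration_min <- round(fordgobike$duration_sec/60, 2)

#assumed step: keep only rides shorter than 30 minutes, as the data frame name suggests
> fordgobike_dur_under30 <- subset(fordgobike, duration_min < 30)

#extracting the hour from the date so we can divide the day into periods
> fordgobike_dur_under30$start_hour <- as.integer(substr(fordgobike_dur_under30$start_time, 12, 13))

#cut() intervals are closed on the right; include.lowest = TRUE keeps hour 0 in the first bin instead of NA
> fordgobike_dur_under30$period <- cut(fordgobike_dur_under30$start_hour, c(0, 6, 10, 15, 19, 23),
    labels = c("else","morning","noon","afternoon","evening"), include.lowest = TRUE)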
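To see how substr() and cut() behave before running them on the full dataset, here is a minimal sketch on a few made-up timestamps. The values and the names toy_times, toy_hours and toy_period are for illustration only and are not taken from the Ford GoBike data. It assumes start_time is stored as text in the form YYYY-MM-DD HH:MM:SS, so that characters 12 and 13 hold the hour. Note that cut() intervals are closed on the right by default, so hour 6 lands in the first bin, and include.lowest = TRUE is what keeps hour 0 from turning into NA.

```r
# Made-up timestamps for illustration only
toy_times <- c("2017-12-31 00:05:12", "2017-12-31 06:59:00",
               "2017-12-31 08:15:30", "2017-12-31 12:00:00",
               "2017-12-31 17:45:10", "2017-12-31 23:10:00")

# Characters 12-13 of "YYYY-MM-DD HH:MM:SS" are the hour
toy_hours <- as.integer(substr(toy_times, 12, 13))

# Bins: [0,6], (6,10], (10,15], (15,19], (19,23]
toy_period <- cut(toy_hours,
                  breaks = c(0, 6, 10, 15, 19, 23),
                  labels = c("else", "morning", "noon", "afternoon", "evening"),
                  include.lowest = TRUE)

# Show each hour next to the period it was assigned
print(data.frame(toy_hours, toy_period))
```

Printing the hours next to their periods like this is a quick way to check that the break points fall where you expect.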

And let’s quickly visualize these totals:

> ggplot(fordgobike_dur_under30, aes(x=period)) +
    geom_bar()

[Figure: ride counts by period]

2) Differentiating user_types and their behaviour at different times of the week

We can see the usage difference between subscribers and customers by mapping user_type to the fill aesthetic in geom_bar, which stacks the bars by user type. We can then split the week into weekdays and weekends to reveal any difference in patterns between user types in each period.

I prefer to use the SQL language to filter data, and sqldf is a great package to perform SQL queries in R.

#storing the day of the week for each ride
> fordgobike_dur_under30$week <- weekdays(as.Date(fordgobike_dur_under30$start_time), abbreviate=FALSE)

#splitting into weekday and weekend (single quotes mark SQL string literals)
> library(sqldf)
> fordgobike_dur_under30_weekdays <- sqldf("
    SELECT *
    FROM fordgobike_dur_under30
    WHERE week NOT IN ('Saturday','Sunday')")

> fordgobike_dur_under30_weekends <- sqldf("
    SELECT *
    FROM fordgobike_dur_under30
    WHERE week IN ('Saturday','Sunday')")

#bike usage with user_types stacked
> ggplot(fordgobike_dur_under30_weekdays, aes(x=period)) +
    geom_bar(aes(fill=user_type)) + 
    xlab("Different Period Each Day") +  
    ggtitle("Weekdays Bike Usage Based On Different Period") +
    labs(fill="user type")
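One caveat with the weekday/weekend split: weekdays() returns day names in the language of the current locale, so comparing against "Saturday" and "Sunday" only works in an English locale. As a rough sketch of a locale-independent alternative in base R, the wday field of as.POSIXlt() numbers the days 0 to 6 starting from Sunday. The toy dates and the names toy_rides, is_weekend and by_week are illustrative assumptions, not part of the original analysis.

```r
# Made-up rides: a Monday, a Friday, a Saturday and a Sunday
toy_rides <- data.frame(
  start_time = c("2017-12-25 08:00:00", "2017-12-29 17:30:00",
                 "2017-12-30 11:00:00", "2017-12-31 14:20:00"),
  stringsAsFactors = FALSE
)

# wday is 0 for Sunday and 6 for Saturday, regardless of locale
day_num <- as.POSIXlt(as.Date(toy_rides$start_time))$wday
toy_rides$is_weekend <- day_num %in% c(0, 6)

# Split into two data frames, named "weekday" and "weekend"
by_week <- split(toy_rides, ifelse(toy_rides$is_weekend, "weekend", "weekday"))
print(by_week)
```

The two elements of by_week play the same role as the _weekdays and _weekends data frames produced by the SQL queries.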

 

3) Adding labels and overlapping the charts for better perspective 

Now, let’s add some text elements to our graph. To add percentage labels, we add a geom_text layer to the plot. For more information regarding geom_text and percentages, visit this Stack Overflow answer.

#pct: each bar segment's share of all weekend rides
> ggplot(fordgobike_dur_under30_weekends %>%
    count(start_hour, user_type) %>%
    mutate(pct = n/sum(n)),
        aes(start_hour, n, fill=user_type)) +
    geom_bar(stat="identity") +
    geom_text(aes(label=paste0(sprintf("%1.1f", pct*100), "%")),
        position=position_stack(), size=4, vjust=1)

#vjust = 1: places each percentage label just below the top of its bar segment
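Note that n/sum(n) computed right after count() gives each segment's share of all rides in the data frame. If you would rather label each segment with its share within its own hour, one possible variation is to group by start_hour before computing the percentage. The sketch below uses made-up rides, and the column names pct_all and pct_hour are hypothetical.

```r
library(dplyr)

# Made-up rides for illustration only
toy_rides <- data.frame(
  start_hour = c(8, 8, 8, 8, 17, 17),
  user_type  = c("Subscriber", "Subscriber", "Subscriber", "Customer",
                 "Subscriber", "Customer")
)

toy_counts <- toy_rides %>%
  count(start_hour, user_type) %>%
  mutate(pct_all = n / sum(n)) %>%   # share of all rides
  group_by(start_hour) %>%
  mutate(pct_hour = n / sum(n)) %>%  # share within each start hour
  ungroup()

print(toy_counts)
```

Either column can be passed to the label aesthetic of geom_text; which one is appropriate depends on the question you want the chart to answer.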

 

Based on the above plots, we can see that:

  • On weekdays, the peak hours are 8-9 a.m. and 5-6 p.m.; outside those times, not many customers use bicycles. So, subscribers may be people living in the city who need bicycles for commuting to work.
  • On weekends, most people use bicycles between 10 a.m. and 4 p.m. Weekend usage of bicycles is much lower than on weekdays.

This is a better graph, but the usage difference between customers and subscribers is still hard to see. To improve the graph further, we can unstack the bars so that the user types overlap, giving better insight into the scale. We do this with the position argument in geom_bar, setting it to “identity.”

> ggplot(fordgobike_dur_under30_weekdays) +
    geom_bar(aes(x=start_hour, fill=user_type),
        colour = "lightblue", alpha = 0.5, position = "identity") +
    scale_fill_manual(values = c("black", "pink")) +
    xlab("Weekday StartHour") +
    ggtitle("Weekdays Start Hour")

#colour: the colour of the bar outline
#position = "identity": overlaps the bars instead of stacking them
#alpha: transparency level on a scale of 0 - 1
#scale_fill_manual: overrides the default fill with custom colours
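To make the difference between stacking and overlapping concrete, here is a small sketch on made-up rides that builds both versions and inspects the bar coordinates ggplot2 computes, using layer_data(). The object names (toy_rides, stacked, overlapped) are illustrative only. With position = "stack", the second group starts where the first one ends; with position = "identity", every bar starts at zero, so the bar heights are the raw counts.

```r
library(ggplot2)

# Made-up rides for illustration only
toy_rides <- data.frame(
  start_hour = c(8, 8, 8, 8, 9, 9, 9),
  user_type  = c("Subscriber", "Subscriber", "Subscriber", "Customer",
                 "Subscriber", "Subscriber", "Customer")
)

base_plot <- ggplot(toy_rides, aes(x = start_hour, fill = user_type))

stacked    <- base_plot + geom_bar(position = "stack")
overlapped <- base_plot + geom_bar(position = "identity", alpha = 0.5)

# layer_data() returns the values ggplot2 computed for drawing each bar
print(layer_data(stacked)[, c("x", "group", "ymin", "ymax")])
print(layer_data(overlapped)[, c("x", "group", "ymin", "ymax")])
```

Because overlapping bars hide each other, a value of alpha below 1 is what keeps the smaller group visible behind the larger one.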

 

This gives a clearer picture of the difference in volume between subscribers and customers, especially on weekdays. On weekends, both user types have similar habits. However, the volume is much lower, apparently because most riders use Ford GoBikes to commute on weekdays.

Unsurprisingly, a majority of weekday users appear to be subscribers commuting to and from work. This can also be shown when we investigate the riding intervals:

 

Most users ride between 10 and 25 minutes on weekends. Usage over 25 minutes is mainly by customers instead of subscribers.


Grouping multiple plots into one graph with facet_wrap

So far, we have looked at the dataset over the span of a day. To quickly visualize how user behaviour compares on a larger scale (for example, by month), we can use the facet_wrap function in ggplot2.

> ggplot(station_name_paired, aes(x = start_hour, y = count_t)) +
    geom_bar(aes(fill = user_type), stat = "identity", position = "dodge") +
    facet_wrap(~month, scales = "fixed")
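The plot above reads from a pre-aggregated data frame, station_name_paired, whose construction is not shown here. Assuming it holds one row per month, start hour and user type, with the number of rides in count_t, a table of that shape could be built roughly as follows. The toy data and the name hourly_counts are hypothetical, and the month is taken from characters 6 and 7 of a YYYY-MM-DD HH:MM:SS string.

```r
library(dplyr)

# Made-up rides for illustration only
toy_rides <- data.frame(
  start_time = c("2017-09-01 08:10:00", "2017-09-02 08:40:00",
                 "2017-09-02 17:05:00", "2017-12-01 08:20:00"),
  user_type  = c("Subscriber", "Subscriber", "Customer", "Subscriber")
)

hourly_counts <- toy_rides %>%
  mutate(month      = as.integer(substr(start_time, 6, 7)),
         start_hour = as.integer(substr(start_time, 12, 13))) %>%
  # one row per month / start hour / user type, with the ride count in count_t
  count(month, start_hour, user_type, name = "count_t")

print(hourly_counts)
```

Because the counts are already computed, the plot needs stat = "identity" so that geom_bar uses count_t as the bar height instead of counting rows.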


The first problem here is that the shared y-axis scale makes months with low volume hard to read. To see the data in months like September or December, we change the scales argument to “free.”

> ggplot(station_name_paired, aes(x = start_hour, y = count_t)) + 
    geom_bar(aes(fill = user_type), stat = "identity", position = "dodge") + 
    facet_wrap(~month, scales = "free")
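Keep in mind that scales = "free" frees both axes, so each panel may also get its own x-axis range. If you only want the y-axis to adapt while the start hours stay aligned across panels, facet_wrap also accepts scales = "free_y". A minimal sketch on made-up counts (the data frame hourly_counts and its values are hypothetical):

```r
library(ggplot2)

# Made-up monthly counts for illustration only
hourly_counts <- data.frame(
  month      = c(9, 9, 12, 12),
  start_hour = c(8, 17, 8, 17),
  user_type  = c("Subscriber", "Customer", "Subscriber", "Customer"),
  count_t    = c(200, 40, 20, 5)
)

p <- ggplot(hourly_counts, aes(x = start_hour, y = count_t, fill = user_type)) +
  geom_col(position = "dodge")

p_free   <- p + facet_wrap(~month, scales = "free")    # both axes vary per panel
p_free_y <- p + facet_wrap(~month, scales = "free_y")  # only the y-axis varies

print(p_free_y)
```

With free scales, bar heights are no longer comparable between panels at a glance, so it is worth reading the axis labels of each panel.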


For another example, we can adjust the code to group by days of the week:

> ggplot(station_name_paired, aes(x = start_hour, y = count_t)) +
    geom_bar(aes(fill = user_type), stat = "identity", position = position_dodge(0.9)) +
    facet_wrap(~week, scales = "free") +
    xlab("Start Hour") +
    ggtitle("Start Hour in different weekdays")
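One thing to watch when faceting by week: if the column is plain text, the panels are ordered alphabetically (Friday, Monday, Saturday, and so on). facet_wrap follows factor level order, so converting the column to a factor with explicit levels puts the days in calendar order. A small base R sketch on made-up values (the names toy_week, day_levels and toy_week_f are illustrative):

```r
# Made-up day names for illustration only
toy_week <- c("Monday", "Sunday", "Friday")

day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# A factor with explicit levels sorts in calendar order
toy_week_f <- factor(toy_week, levels = day_levels)

print(sort(toy_week))    # alphabetical order
print(sort(toy_week_f))  # calendar order
```

Applying the same factor() call to the week column before plotting should give panels that run from Monday to Sunday.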



Conclusion

In this exercise, we learned to manipulate dates and times and used ggplot2 to explore our dataset. To improve our graphs, we used fill to stack bars by user type, geom_text with vjust to add percentage labels, position = “identity” to overlap bars, and facet_wrap to combine several plots into one graph. We even deduced a few things about the behaviours of our customers and subscribers.

There are some questions we could explore more:

  • How do the weather and rider age affect the usage of bicycles?
  • What kind of people are riding for 30 minutes or even longer? Should we offer different services to these customers to increase sales?
  • How can we increase the accuracy of start time to hour and minute, instead of start hour only?
  • How many bicycles are being used at each dock? Can we make a prediction model based on this information?

Look out for more tutorials from me using this data!

Thank you for reading.
