Hello, my name is Tiange, and I want to show how to extract information from a large dataset and visualize it efficiently with R’s ggplot2 package. Some of the basic syntax needed can be found on RStudio’s ggplot2 cheatsheet.
Short Introduction to the Data
The data I am using for practice is the Ford GoBike public dataset, which tracked bikes and users between 2017-06-28 and 2017-12-31, found at FordGoBike.com.
If we take a glimpse at the variables in the dataset, we see the following:
There are two types of users, which serve as the classifiers in this dataset:
Subscribers pay yearly/monthly fees, and if they use a bicycle for less than 45 minutes the ride is free; otherwise, $3 per additional 15 minutes will be charged.
Customers are charged $2 for the first 30 minutes, and if they keep the bike over 30 minutes, it increases to $3 per 15 minutes.
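The pricing rules above can be sketched as a small function. This is only an illustration: ride_fee is a hypothetical helper, the user type labels "Subscriber" and "Customer" are assumed, subscription fees are ignored, and I assume that a started 15-minute block is billed in full.

```r
# Hypothetical helper: estimated usage fee in dollars for a single ride.
# Assumption: every started 15-minute block past the included time is billed in full.
ride_fee <- function(duration_min, user_type) {
  if (user_type == "Subscriber") {
    included_min <- 45  # subscribers ride free up to 45 minutes
    base_fee <- 0
  } else {
    included_min <- 30  # customers pay $2 for the first 30 minutes
    base_fee <- 2
  }
  extra_blocks <- ceiling(max(duration_min - included_min, 0) / 15)
  base_fee + 3 * extra_blocks
}

ride_fee(40, "Subscriber")  # 0: inside the included 45 minutes
ride_fee(50, "Subscriber")  # 3: one extra 15-minute block
ride_fee(50, "Customer")    # 8: $2 base plus two extra blocks
```

A function like this makes it easy to see why rides over 30 minutes are costly for customers.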
By the end, I will show you how to improve your ggplot graphs by learning new functions and arguments to best visualize the data, including:
- how to stack bar graphs, with fill
- how to overlap bar graphs, with position
- how to combine multi-set data in one graph, with facet_wrap
Turning the data into an awesome ggplot graph
First, we want to convert our date and time values into categorical ranges that better represent the data.
1) Cutting the days into periods
I want to classify intervals of the day into time periods (morning, noon, etc.) based on start hour to visualize differences in bicycle usage. We can do this by extracting the hour from the start time, then cutting the hours into intervals that best represent these periods.
The duration is recorded in seconds and will be converted to a unit that is easier to digest: minutes.
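# seconds to minutes
> fordgobike$duration_min <- round(fordgobike$duration_sec / 60, 2)
# extracting the hour from the date so we can divide the day into periods
# (fordgobike_dur_under30 is not created in this post; presumably it holds the rides under 30 minutes)
> fordgobike_dur_under30$start_hour <- as.integer(substr(fordgobike_dur_under30$start_time, 12, 13))
# include.lowest = TRUE keeps hour 0 in the first interval instead of turning it into NA
> fordgobike_dur_under30$period <- cut(fordgobike_dur_under30$start_hour,
      c(0, 6, 10, 15, 19, 23),
      labels = c("else", "morning", "noon", "afternoon", "evening"),
      include.lowest = TRUE)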
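To see how cut() assigns hours to periods, here is a minimal sketch on made-up timestamps. The "YYYY-MM-DD HH:MM:SS" layout of start_time is an assumption based on the substr positions used above. By default the intervals are closed on the right, so hour 0 falls outside every bin unless include.lowest = TRUE is set.

```r
# Toy timestamps in the layout assumed for start_time
start_time <- c("2017-12-31 00:05:10", "2017-12-31 06:30:00",
                "2017-12-31 08:15:42", "2017-12-31 12:00:00",
                "2017-12-31 17:45:03", "2017-12-31 23:59:59")

# Characters 12-13 of each timestamp hold the hour
start_hour <- as.integer(substr(start_time, 12, 13))

hour_breaks <- c(0, 6, 10, 15, 19, 23)
period_labels <- c("else", "morning", "noon", "afternoon", "evening")

# Default intervals are (0,6], (6,10], ..., so hour 0 becomes NA
period_default <- cut(start_hour, hour_breaks, labels = period_labels)

# include.lowest = TRUE turns the first interval into [0,6]
period_closed <- cut(start_hour, hour_breaks, labels = period_labels,
                     include.lowest = TRUE)

data.frame(start_hour, period_default, period_closed)
```

Note that with these breaks the "else" period covers the hours up to and including 6, and "evening" runs from 20 to 23.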
And let’s quickly visualize these totals:
> library(ggplot2)
> ggplot(fordgobike_dur_under30, aes(x = period)) + geom_bar()
2) Differentiating user_types and their behaviour at different times of the week
We can see the usage difference between subscribers and customers by mapping user_type to the fill aesthetic in geom_bar, which stacks the bars by user type. We can then separate the week into weekdays and weekends to reveal any difference in patterns among user types per period.
I prefer to use the SQL language to filter data, and sqldf is a great package to perform SQL queries in R.
# storing the names of the days of the week (weekdays() follows the session locale; English names are assumed)
> fordgobike_dur_under30$week <- weekdays(as.Date(fordgobike_dur_under30$start_time), abbreviate = FALSE)
# splitting into weekday and weekend
> library(sqldf)
> fordgobike_dur_under30_weekdays <- sqldf("SELECT * FROM fordgobike_dur_under30 WHERE week NOT IN ('Saturday', 'Sunday')")
> fordgobike_dur_under30_weekends <- sqldf("SELECT * FROM fordgobike_dur_under30 WHERE week IN ('Saturday', 'Sunday')")
# bike usage with user_types stacked
> ggplot(fordgobike_dur_under30_weekdays, aes(x = period)) +
    geom_bar(aes(fill = user_type)) +
    xlab("Different Period Each Day") +
    ggtitle("Weekdays Bike Usage Based On Different Period") +
    labs(fill = "user type")
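If you would rather not depend on sqldf or on English day names, the same weekday/weekend split can be sketched in base R. This is an alternative to the SQL approach, not a replacement for it; the rides data frame and the day_type column are made up for the example.

```r
# Toy rides; the dates run from a Monday to a Sunday
rides <- data.frame(
  start_time = c("2017-12-25 08:10:00", "2017-12-29 17:30:00",
                 "2017-12-30 11:20:00", "2017-12-31 15:05:00"),
  user_type  = c("Subscriber", "Subscriber", "Customer", "Customer")
)

# "%u" gives the ISO weekday number (1 = Monday, ..., 7 = Sunday),
# which does not depend on the session locale
day_number <- as.integer(format(as.Date(rides$start_time), "%u"))
rides$day_type <- ifelse(day_number >= 6, "weekend", "weekday")

# Logical subsetting replaces the two SQL queries
rides_weekdays <- rides[rides$day_type == "weekday", ]
rides_weekends <- rides[rides$day_type == "weekend", ]

table(rides$day_type, rides$user_type)
```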
3) Adding labels and overlapping the charts for better perspective
Now, let’s add some text elements to our graph. To add percentage marks, we compute the percentages first and pass them as labels to ggplot’s geom_text function. For more information regarding geom_text and percentages, visit this Stack Overflow answer.
> library(dplyr)
# pct is each bar segment's share of all weekend rides
> ggplot(fordgobike_dur_under30_weekends %>%
           count(start_hour, user_type) %>%
           mutate(pct = n / sum(n)),
         aes(start_hour, n, fill = user_type)) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = paste0(sprintf("%1.1f", pct * 100), "%")),
              position = position_stack(), size = 4, vjust = 1)
# vjust = 1: places the percentage marks right under the top of each bar segment
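As a variation, the percentages can be computed within each hour rather than across all rides, so the labels in each bar add up to 100%. The sketch below uses a made-up rides data frame and assumes dplyr and ggplot2 are installed; the column names mirror the ones used above.

```r
library(dplyr)
library(ggplot2)

# Toy data: one row per ride
rides <- data.frame(
  start_hour = c(8, 8, 8, 8, 17, 17, 17, 17, 17),
  user_type  = c("Subscriber", "Subscriber", "Subscriber", "Customer",
                 "Subscriber", "Subscriber", "Customer", "Customer", "Customer")
)

# Count rides per hour and user type, then take each type's share within its hour
hourly <- rides %>%
  count(start_hour, user_type) %>%
  group_by(start_hour) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

# position_stack(vjust = 0.5) centres each label inside its bar segment
p <- ggplot(hourly, aes(factor(start_hour), n, fill = user_type)) +
  geom_col() +
  geom_text(aes(label = sprintf("%1.1f%%", pct * 100)),
            position = position_stack(vjust = 0.5), size = 4)
p
```

Whether to use overall or within-hour percentages depends on the question: the first shows when rides happen, the second shows who is riding at a given hour.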
Based on the above plots, we can see that:
- On weekdays, the peak hours are 8-9 a.m. and 5-6 p.m.; outside those times, few customers use the bicycles. This suggests that subscribers may be people living in the city who need bicycles to commute to work.
- On weekends, most people use bicycles between 10 a.m. and 4 p.m. Weekend usage is much lower than weekday usage.
This is a better graph, but the usage difference between customers and subscribers is still hard to see. To improve the graph further, we can unstack the bars so that the user_type bars overlap, giving better insight into the scale. We do this with the position argument in geom_bar, setting it to "identity".
> ggplot(fordgobike_dur_under30_weekdays) +
    geom_bar(aes(x = start_hour, fill = user_type), colour = "lightblue", alpha = 0.5, position = "identity") +
    scale_fill_manual(values = c("black", "pink")) +
    xlab("Weekday StartHour") +
    ggtitle("Weekdays Start Hour")
# colour: the colour of the bar outline
# position = "identity": overlaps the data
# alpha: transparency level on a scale of 0 - 1
# scale_fill_manual: we can override with custom colour fill
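To make the difference between stacking and overlapping concrete, here is a small sketch on made-up data (the rides data frame is hypothetical). It uses ggplot2's layer_data() to inspect where each bar starts and ends: stacked bars start where the previous one stops, while bars with position = "identity" all start at zero.

```r
library(ggplot2)

# Toy data: one row per ride
rides <- data.frame(
  start_hour = c(8, 8, 8, 9, 9, 9, 9),
  user_type  = c("Subscriber", "Subscriber", "Customer",
                 "Subscriber", "Customer", "Customer", "Customer")
)

# Default position for geom_bar is "stack"
stacked <- ggplot(rides, aes(x = start_hour, fill = user_type)) +
  geom_bar()

# position = "identity" draws every bar from zero; alpha keeps both visible
overlapped <- ggplot(rides, aes(x = start_hour, fill = user_type)) +
  geom_bar(position = "identity", alpha = 0.5)

# Compare the computed bar geometry of the two plots
layer_data(stacked)[, c("x", "ymin", "ymax")]
layer_data(overlapped)[, c("x", "ymin", "ymax")]
```

In the overlapped version each bar's height is its own count, which is why the scale of each user type becomes easier to read.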
This gives a better perspective on the difference in volume between subscribers and customers, especially on weekdays. On weekends, both user types have similar habits. However, the volume is much lower, which suggests that most riders use Ford GoBikes to commute on weekdays.
Unsurprisingly, a majority of weekday users appear to be subscribers commuting to and from work. This can also be shown when we investigate the riding intervals:
Most users ride between 10 and 25 minutes on weekends. Usage over 25 minutes is mainly by customers instead of subscribers.
Grouping multiple plots into one graph with facet_wrap
Before, we were looking at the dataset in the span of a day. To quickly visualize how user behaviour compares on a larger scale (for example, by month) we can utilize the facet_wrap function in ggplot.
> ggplot(station_name_paired, aes(x = start_hour, y = count_t)) + geom_bar(aes(fill = user_type), stat = "identity", position = "dodge") + facet_wrap(~month, scales = "fixed")
The first problem here is that the shared y-axis scale poorly visualizes the data in months with low volume. In order to see the data in months like September or December, we change the scales argument to "free", which gives each panel its own axes.
> ggplot(station_name_paired, aes(x = start_hour, y = count_t)) + geom_bar(aes(fill = user_type), stat = "identity", position = "dodge") + facet_wrap(~month, scales = "free")
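Since only the y-axis is the problem here, one option worth knowing is scales = "free_y", which frees the y-axis per panel while keeping the x-axis shared. The sketch below uses a made-up counts data frame; its column names mirror the ones above, but the numbers are invented.

```r
library(ggplot2)

# Toy pre-aggregated counts: two months with very different volumes
counts <- data.frame(
  month      = rep(c("September", "October"), each = 4),
  start_hour = rep(c(8, 17), times = 4),
  user_type  = rep(rep(c("Customer", "Subscriber"), each = 2), times = 2),
  count_t    = c(5, 7, 20, 25, 50, 70, 200, 250)
)

# geom_col() is shorthand for geom_bar(stat = "identity");
# "free_y" lets each month pick its own y range
p <- ggplot(counts, aes(x = factor(start_hour), y = count_t)) +
  geom_col(aes(fill = user_type), position = "dodge") +
  facet_wrap(~month, scales = "free_y")
p
```

Keep in mind that with free scales, bar heights are no longer comparable across panels, so the axis labels need a closer look.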
For another example, we can adjust the code to group by days of the week:
> ggplot(station_name_paired, aes(x = start_hour, y = count_t)) + geom_bar(aes(fill = user_type), stat = "identity", position = position_dodge(0.9)) + facet_wrap(~week, scales = "free") + xlab("Start Hour") + ggtitle("Start Hour in different weekdays")
In this practice, we learned to manipulate dates and times and used ggplot to explore our dataset. To improve our graphs, we used fill to stack bars by user type, geom_text with vjust to place percentage labels, position to overlap bars, and facet_wrap to compare groups side by side. We even deduced a few things about the behaviours of our customers and subscribers.
There are some questions we could explore more:
- How do weather and rider age affect the usage of bicycles?
- What kind of people are riding for 30 minutes or even longer? Should we offer different services to these customers to increase sales?
- How can we increase the accuracy of start time to hour and minute, instead of start hour only?
- How many bicycles are being used at each dock? Can we make a prediction model based on this information?
Look out for more tutorials from me using this data!
Thank you for reading.