
Examining Young Driver Risk: Is there evidence that age is not a primary determinant of driver risk?

Evaluating the findings from my 2018 Capstone

It’s common knowledge that young drivers are perceived to be high-risk, and thus pay significantly more in car insurance premiums. In fact, last I checked, many car rental companies still do not allow renters under twenty-five years of age. Yes, there have been many studies supporting this assertion. However, I still wanted to explore this subject as my capstone project, which rounded out my big data certification from Ryerson University in Toronto.

My three-plus-month project sought to uncover some positive signs in defense of this younger demographic, even if the data offered no evidence for the alternative hypothesis that young drivers are no riskier than other drivers. Notably surprising trends could inspire product innovation within the automobile insurance community, and that is what fed my passion for this project. The project was scripted entirely in R and published on GitHub.

The Data…

(Feel free to preview the Analysis & Results in the next section first.)

I analyzed a dataset from Transport Canada (via Kaggle) that captured every individual involved in a car accident in Canada from 1999 to 2014. It was a CSV file with 5.8 million rows, each row representing a person. There were 22 variables, such as year, day of week, time, road condition, weather, person’s age, their position in the vehicle, vehicle type, collision configuration, etc. Many of these fields had dozens of possible values, so one of the first things I did was simplify the data by mapping some of the values. Details on that here. The other major steps taken to prepare the data were:

  • Build a couple of identifiers so I could group people into vehicles, and vehicles into collision incidents. These identifiers were concatenations of several of the original 22 attribute values. I assumed that if people were in a collision under common road conditions, weather, road formation, collision configurations, etc., and on the same day of week at the same time, then they were in the same collision incident. (Date and location would have been helpful, but they were not part of this dataset.)
  • Build in driver age by tagging each person with the age of the driver in their vehicle.
  • In dealing with missing values, I removed entire collision incidents if there was any material information missing, such as a driver’s age or anyone’s injuries. Collisions involving any vehicle types outside of private passenger, light van or truck were also removed.
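The three preparation steps above can be sketched in base R roughly as follows. This is a minimal illustration on made-up rows, not the code from the project: every column name (`weekday`, `road_cond`, `vehicle_no`, `position` and so on) is a hypothetical stand-in for the dataset's real attributes, and the real identifiers concatenated more fields than shown here.

```r
# Toy person-level records. All column names and values are hypothetical.
persons <- data.frame(
  year       = c(2005, 2005, 2005, 2005, 2005),
  weekday    = c("Mon", "Mon", "Mon", "Tue", "Tue"),
  hour       = c(8, 8, 8, 17, 17),
  weather    = c("clear", "clear", "clear", "rain", "rain"),
  road_cond  = c("dry", "dry", "dry", "wet", "wet"),
  vehicle_no = c(1, 1, 2, 1, 2),
  position   = c("driver", "passenger", "driver", "driver", "driver"),
  age        = c(19, 18, 45, 30, NA),
  injury     = c("injured", "not injured", "not injured",
                 "not injured", "injured"),
  stringsAsFactors = FALSE
)

# Step 1: build identifiers by concatenating shared collision circumstances.
persons$collision_id <- with(
  persons, paste(year, weekday, hour, weather, road_cond, sep = "_")
)
persons$vehicle_id <- paste(persons$collision_id, persons$vehicle_no, sep = "_")

# Step 2: tag every person with the age of the driver of their vehicle.
drivers <- persons[persons$position == "driver", c("vehicle_id", "age")]
names(drivers)[2] <- "driver_age"
persons <- merge(persons, drivers, by = "vehicle_id", all.x = TRUE)

# Step 3: drop whole collisions where a driver age or an injury is missing.
bad_ids <- unique(
  persons$collision_id[is.na(persons$driver_age) | is.na(persons$injury)]
)
clean <- persons[!persons$collision_id %in% bad_ids, ]
```

In this toy input the Tuesday collision has a driver with no recorded age, so all of its rows are dropped, while the passenger in the Monday collision inherits a `driver_age` of 19.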

At this point, I had reduced the 5.8 million records by half. Finally, I extracted one record per collision occurrence, capturing the minimum age of any driver involved and the maximum injury level of anyone involved (“not injured,” “injured” or “fatal”). The rest of the analysis was done using randomized samples of this much-slimmer (by R standards) occurrence-level dataset of 700,000 rows and 8 attributes.
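Collapsing person-level rows into one record per collision might look like the sketch below. It is an assumption-laden illustration: the column names are hypothetical, the data is invented, and the ordered factor is just one way to make "maximum injury level" well defined.

```r
# Toy cleaned person-level data. Column names and values are hypothetical.
clean <- data.frame(
  collision_id = c("A", "A", "A", "B", "B"),
  driver_age   = c(19, 19, 45, 30, 52),
  injury       = c("not injured", "injured", "not injured", "fatal", "injured"),
  stringsAsFactors = FALSE
)

# Order the severity levels so the integer codes rank them from least to worst.
severity_levels <- c("not injured", "injured", "fatal")
clean$injury <- factor(clean$injury, levels = severity_levels, ordered = TRUE)

# Youngest driver and worst injury within each collision.
min_age <- tapply(clean$driver_age, clean$collision_id, min)
max_sev <- tapply(as.integer(clean$injury), clean$collision_id, max)

occurrences <- data.frame(
  collision_id   = names(min_age),
  min_driver_age = as.vector(min_age),
  max_injury     = factor(severity_levels[as.vector(max_sev)],
                          levels = severity_levels, ordered = TRUE)
)

# Draw a random sample of occurrences for later analysis.
set.seed(1)
sampled <- occurrences[sample(nrow(occurrences), size = 1), ]
```

Collision A reduces to a youngest driver of 19 with a worst outcome of "injured", and collision B to a youngest driver of 30 with a worst outcome of "fatal".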

Analysis & Results…

The traditional high risk associated with young drivers held strong in the face of evidence gathered in my study. However, I was able to uncover some positive anecdotes in favour of our young drivers:

Association rule mining showed that driving on a weekday, mid-block or at an intersection, in local traffic with no traffic signal, led to collision injuries just as often as having a youngest driver between 17 and 25 did. The Apriori algorithm, often used to find associations between items purchased together in store transactions, showed more “support” for when and where an accident would happen than for driver age. I would have expected minimum driver age to come out on top, so this was a nice surprise.

Apriori algorithm from the ‘arules’ R package, on the reduced per-occurrence data
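To make the "support" measure concrete, here is a small base-R sketch that computes support, confidence and lift for two hand-picked rules. It does not run the Apriori search itself (the arules package automates that across all candidate rules), and the data frame, its column names and its values are invented for illustration.

```r
# Toy occurrence-level flags. Columns and values are hypothetical.
occ <- data.frame(
  weekday   = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  no_signal = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE),
  young     = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE),
  injured   = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Metrics for a rule "lhs => rhs", given logical vectors for each side.
rule_metrics <- function(lhs, rhs) {
  support    <- mean(lhs & rhs)         # share of all rows matching both sides
  confidence <- support / mean(lhs)     # share of lhs rows that also match rhs
  lift       <- confidence / mean(rhs)  # confidence relative to the base rate
  c(support = support, confidence = confidence, lift = lift)
}

# Rule 1: weekday and no traffic signal => injury.
where_when <- rule_metrics(occ$weekday & occ$no_signal, occ$injured)
# Rule 2: youngest driver aged 17 to 25 => injury.
young_drv <- rule_metrics(occ$young, occ$injured)

print(rbind(where_when, young_drv))
```

A rule with higher support simply covers a larger share of all collisions, which is why rules about when and where can outrank driver age on this measure even when age still matters.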



The chi-square test of independence between driver age and collision severity showed strong evidence that age and collision severity are related, although that evidence weakened between 1999 and 2013. For each year, 30,000 random collisions were used to calculate a chi-square test statistic between minimum driver age groups and the three levels of injury severity. Essentially, the higher the statistic, the more evidence against the assertion that there is no relationship between the two variables (the null hypothesis). What we have here is a decreasing statistic (at least up to 2013), albeit still fairly high for many years; in other words, the relationship is going from “extremely related” to “very related.”

Test Statistic extracted from the ‘chisq.test’ R function for each year
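A per-year calculation of this kind could be sketched as below, using `chisq.test` from base R's stats package. The data is simulated, the column names and age bands are hypothetical, and the per-year sample is 500 rather than 30,000 to keep the example small.

```r
set.seed(42)

# Simulated occurrence-level data. Columns, bands and values are hypothetical.
n <- 3000
occ <- data.frame(
  year      = sample(1999:2001, n, replace = TRUE),
  age_group = sample(c("17-25", "26-45", "46-65", "66+"), n, replace = TRUE),
  injury    = sample(c("not injured", "injured", "fatal"), n,
                     replace = TRUE, prob = c(0.6, 0.3, 0.1))
)

per_year_sample <- 500  # the study used 30,000 collisions per year

chisq_by_year <- sapply(split(occ, occ$year), function(d) {
  # Use the same number of collisions each year so statistics are comparable.
  d <- d[sample(nrow(d), min(nrow(d), per_year_sample)), ]
  tab <- table(d$age_group, d$injury)  # age group by injury severity
  unname(chisq.test(tab)$statistic)    # Pearson chi-square statistic
})

print(chisq_by_year)
```

Holding the sample size fixed matters because the chi-square statistic grows with the number of observations, so unequal yearly samples would distort a year-over-year comparison. Because the simulated variables here are generated independently, the statistics from this sketch should stay small, unlike those in the study.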



The frequency distributions brought to light that 17- to 25-year-old drivers were decreasing their involvement in injury and fatal collisions. In fact, their rate of decrease was the steepest of any age group. The top bronze line in both charts below represents the 17–25 age group.

Number of fatal and injury collisions, by age of youngest driver involved
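One simple way to compare the decline across age groups is to count collisions per year and group, then take the percentage change between the first and last year. The sketch below does that on invented counts with hypothetical column names; the project may well have measured the rate of decrease differently.

```r
# Toy occurrence-level data. Columns and values are hypothetical.
occ <- data.frame(
  year      = c(rep(1999, 10), rep(2014, 7)),
  age_group = c(rep("17-25", 6), rep("26-45", 4),  # 1999 collisions
                rep("17-25", 3), rep("26-45", 4)), # 2014 collisions
  stringsAsFactors = FALSE
)

# Frequency distribution: collisions by age group (rows) and year (columns).
counts <- table(occ$age_group, occ$year)

# Percentage change from the first to the last year, per age group.
first_year <- counts[, "1999"]
last_year  <- counts[, "2014"]
pct_change <- 100 * (last_year - first_year) / first_year

print(counts)
print(pct_change)
```

In this toy data the 17-25 group falls from 6 collisions to 3, a 50% drop, while the 26-45 group is unchanged.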




The traditional high risk associated with young drivers holds true. However, there were some positive stories in the data that put less weight on a driver’s young age as a predictor of accidents.

This study only scratched the surface of scrutinizing the traditional risk assessment of young drivers. Still, it suggests that traditional automobile insurance underwriting practices are worth revisiting. For example:

  • Increase the weight of factors related to actual driving habits and environment. Our association rules showed that city roads and intersections without traffic signals on weekdays have major predictive power for car accidents, alongside driver age.
  • Find a way to make car insurance more accessible to young drivers and build a pipeline of future quality clients. We saw that their contribution to serious car accidents fell consistently over the sixteen years studied.

The insurance industry has been “gaining traction” on collecting telematics, as well as experimenting with usage-based insurance. Every industry is competitive, and data science is a tool for uncovering hidden opportunities. With enough accumulated telematics data, much can be done both in insurance underwriting and in raising awareness of safe driving habits. For the broader insurance industry, data science can also help discover more effective ways to write a risk (or ways to write something that wasn’t written before), and help insurers gain a leg up on their competitors.

Student of Big Data science: Machine Learning & Business Intelligence Tools | Practitioner of Insurance Data Analytics
