This is a follow-up article. You can download the Canadian Underwriter insPRESS article headlines I used to conduct this analysis here. In my previous article, I webscraped their data using Python.
After some data collection and wrangling projects, I decided it was time to explore trends and insights in Canadian Underwriter data using text mining and sentiment analysis.
Some of my colleagues have already done compelling work in this area, which I reference at the end of this article. So I thought I’d explore a couple of additional functions in the commonly used tm and wordcloud packages in R. Using the words contained in article headlines, I created the following visualizations:
- A comparison word cloud to visualize the change in commonly used words over the years, with some deep diving on word associations and word counts.
- A plot to track positive/negative sentiment scores over the years.
A fair bit of groundwork, relatively speaking, was required before I could run the functions to see what these visualizations would uncover.
Preparing the Data
First, I imported the data and structured it so that the packaged functions in tm and wordcloud would produce output stratified by year. As shown below, I transformed the list of 2,000+ headlines, as scraped with Python, into a list of ten elements in R named docArray. Each element is a concatenation of all headlines for a given year. You’ll see later that it serves as my input for building the visualizations.
Second, I built my own functions to streamline common function calls later in my analysis. For example, the tm package has a number of handy functions to remove stop words, punctuation and specified words. Stop words and punctuation matter to a sentence’s grammar, but they carry no material information about the substance of my text.
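A minimal sketch of that restructuring in base R, assuming the scraped headlines sit in a data frame with one row per headline. The data frame name, its column names and the sample headlines are placeholders of mine, not taken from the actual dataset:

```r
# Hypothetical input: one row per scraped headline, with its publication year.
headlines <- data.frame(
  year = c(2016, 2016, 2017, 2017),
  headline = c("Insurer launches online portal",
               "Broker expands in Ontario",
               "Automated payments roll out",
               "Forensic engineering firm opens office"),
  stringsAsFactors = FALSE
)

# Split the headlines by year, then collapse each year's headlines
# into a single string, giving one list element per year.
docArray <- lapply(split(headlines$headline, headlines$year),
                   paste, collapse = " ")

names(docArray)   # "2016" "2017"
```

Naming the list elements by year pays off later, because those names can be reused as column labels when the list is turned into a term matrix.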
I also removed some specific words like “Canada,” because most content on Canadian Underwriter magazine is related to Canada and insurance. It’s obvious that these words will be predominant in our headlines, so there’s no new information to gain.
I synthesized all these text-tidying functions into one generic function called sweepTxt (shown below) so that I can run this cleanup efficiently later on. I also built a similar function that applies the same cleanup to every element in a group of text passages, such as docArray.
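To give a feel for what such a helper might look like, here is a rough sketch built on tm’s removePunctuation, removeNumbers, removeWords, stripWhitespace and stopwords functions. Only the name sweepTxt comes from my write-up above; the body, the extraWords parameter and the sweepAll name are assumptions for illustration, and the version on GitHub may differ:

```r
library(tm)

# Sketch of a text-tidying helper. `extraWords` is a hypothetical parameter
# holding domain words that add no information (e.g. "canada").
sweepTxt <- function(txt, extraWords = c("canada", "insurance")) {
  txt <- tolower(txt)
  txt <- removePunctuation(txt)
  txt <- removeNumbers(txt)
  txt <- removeWords(txt, c(stopwords("english"), extraWords))
  # Collapse the gaps left by removed words and trim the ends.
  trimws(stripWhitespace(txt))
}

# The same cleanup propagated through every element of a list such as docArray.
sweepAll <- function(docs, ...) lapply(docs, sweepTxt, ...)

cleaned <- sweepTxt("Insurance premiums rise in Canada, says the 2017 report!")
cleaned
```

Lower-casing first matters here, because the stop word list is lower case and removeWords matches whole words exactly.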
Lastly, I also pre-scripted a function that outputs the top ten word counts (powered by the tm package) for some deep dives if needed.
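A word-count helper along these lines could be sketched with tm’s VCorpus, VectorSource and TermDocumentMatrix. The function name topWords and its n parameter are hypothetical, chosen only for this example:

```r
library(tm)

# Hypothetical helper: returns the n most frequent terms in a block of text.
topWords <- function(txt, n = 10) {
  corpus <- VCorpus(VectorSource(txt))
  tdm <- TermDocumentMatrix(corpus)
  # Rows are terms, columns are documents; sum across documents and sort.
  counts <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  head(counts, n)
}

top <- topWords("cyber risk cyber claims cyber broker claims", n = 2)
top
```

Note that TermDocumentMatrix drops words shorter than three characters by default, so very short tokens will not appear in the counts.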
The complete R code responsible for the above groundwork and the following analysis is shared on GitHub.
Comparison Word Cloud
I originally wanted to build a single word cloud. However, I also wanted to stratify this summary by year to identify the topics most frequently headlined through the years on Canadian Underwriter. Gathering all headlines for a given year into one element of a list of eight years (docArray from 2010-2017) was the data structure I needed for the wordcloud package to build this comparison word cloud as I intended.
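The general recipe can be sketched as follows, using comparison.cloud from the wordcloud package on a term matrix with one column per year. The two-year docArray, its contents and the colour choices are made up for illustration:

```r
library(tm)
library(wordcloud)

# Hypothetical stand-in for docArray: one cleaned string of headlines per year.
docArray <- list(
  "2016" = "cyber cyber cyber broker broker claims flood",
  "2017" = "forensic forensic engineering engineering automate claims flood"
)

# One document per year, so the term-document matrix has one column per year.
corpus <- VCorpus(VectorSource(unlist(docArray)))
termMatrix <- as.matrix(TermDocumentMatrix(corpus))
colnames(termMatrix) <- names(docArray)

# Each term is drawn in the colour of the year where it is most over-represented.
comparison.cloud(termMatrix, max.words = 50, random.order = FALSE,
                 colors = c("grey50", "turquoise4"), title.size = 1.5)
```

Words used equally often in every year, like "claims" and "flood" in this toy input, end up small because the comparison cloud sizes terms by how much one year deviates from the average.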
Each colour here represents a year. For example, 2010’s frequently used words in article headlines are in grey, 2017’s in teal. It’s a neat visualization, but perhaps due to the limited depth of the data in just the headlines, there’s nothing too groundbreaking here. Just a couple observations while we’re here:
- It appears that in 2010, Canadian Underwriter continued its efforts from the 2009 launch of its online presence. A lot of mentions of now, online, live, and updates.
- In 2013, “Burns & Wilcox” appeared in many headlines. According to some news articles I looked up, they were acquired by Kaufman Financial that year. Their website says the late Herbert W. Kaufman actually established Burns & Wilcox in 1969. Sounds like a fascinating 40+ year history in between.
- There’s a glimpse of “cyber” in 2015. It wasn’t as predominant as I thought it would be in the last few years. Instead, we see a lot of “forensic engineering” in 2017. Ok… off the top of my head, the industry could be using more advanced technology to reverse-engineer losses to learn more from them, improving preventative measures. This is the positivity that I love to see in the industry—evolving from a business of selling indemnity contracts to a business of truly managing risk.
“Automate” caught my attention in the 2017 headlines. A quick inspection into its word association revealed that it’s in reference to automated payments—payments probably being rolled out at many companies, and my bet is on the claims payments side. Sounds like good news to the consumer.
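A word association check of this kind can be sketched with tm’s findAssocs, which reports terms whose counts correlate with a given term across documents. The headlines below are invented, and treating each headline as its own document is an assumption made for this example:

```r
library(tm)

# Hypothetical headlines; each headline is its own document so that
# co-occurrence across headlines can be measured.
headlines <- c("insurer to automate claims payments",
               "carriers automate payments to policyholders",
               "broker opens new office",
               "flood losses rise in ontario")

tdm <- TermDocumentMatrix(VCorpus(VectorSource(headlines)))

# Terms whose per-headline counts correlate with "automate" at 0.5 or higher.
assoc <- findAssocs(tdm, "automate", corlimit = 0.5)
assoc$automate
```

Raising corlimit narrows the list to the terms that almost always appear alongside the word of interest.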
This is how I measured how positive or negative the overall headline sentiments were, through the years. Remember the docArray? For this, I deconstructed the list of headlines into a list of headline words. I then counted how many positive words were used, less the number of negative words in a year’s headlines, and calculated a score.
This is where a very honourable mention is due. Many sentiment analyses are powered by this Opinion Lexicon, among others. The Opinion Lexicon is a list of positive and negative words compiled by a group of academics (since around 2004) and made free to download and use in the open data community! There are even other lexicons that map words to feelings, such as joyful or sad. Much more on their work can be found in the hyperlink above.
Plotting the scores of Canadian Underwriter headline words shows a very positive trend since 2013.
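The scoring itself needs nothing beyond base R. Here is a minimal sketch, assuming the positive and negative word lists have already been loaded as character vectors. The short lists, the scoreTxt name and the two-year docArray are stand-ins for illustration, not the real lexicon or data:

```r
# Tiny illustrative stand-ins for the lexicon's positive and negative lists;
# the real lists are far longer.
posWords <- c("good", "win", "growth", "award")
negWords <- c("loss", "fraud", "damage", "risk")

# Score = number of positive words minus number of negative words.
scoreTxt <- function(txt) {
  words <- unlist(strsplit(tolower(txt), "[^a-z]+"))
  sum(words %in% posWords) - sum(words %in% negWords)
}

# Hypothetical docArray: one string of headlines per year.
docArray <- list("2016" = "Fraud loss hits insurer. Broker wins award",
                 "2017" = "Growth continues. Good year for brokers")

scores <- sapply(docArray, scoreTxt)
scores   # 2016: -1, 2017: 2

# Plot the yearly scores as a connected line to show the trend.
plot(as.integer(names(scores)), scores, type = "b",
     xlab = "Year", ylab = "Positive minus negative words")
```

Because matching is on exact words, "wins" is not counted as "win" in this sketch; stemming the headline words first would be one way to catch such variants.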
Additional credit is due here to Jalayer Academy; I used the step-by-step examples in his YouTube videos to understand the basic parameters of some features in the tm and wordcloud packages in R, and I applied them to my specific scenarios. I was able to adapt fairly easily because the videos and their narration are extremely effective; plus, the packages are well designed. More time and brain cells were definitely spent on structuring the data as described in the groundwork section of my article.
Although the exploratory insights from this project are light, it was fascinating to build a couple of visualizations from the data collected from Canadian Underwriter. Future potential for this project could be to scrape not just headlines but also entire articles, and to rebuild these visualizations with much more data, or perhaps a different dataset altogether. I’ll share a quick inspection of the 2018 YTD word count:
Either way, these are some very useful data collection and text mining tools that I must ensure I don’t lose touch with.
Here are some more text mining and sentiment analysis articles from my fellow DataCritics:
Organizing Your First Text Analytics Project
What’s Your Twitter Story? #TextMining #SentimentAnalysis
Looking for Love in Bronte’s Wuthering Heights
Uncovering Hidden Trends in AirBnB Reviews
#elxn2018 in R: Ontario Tweeters’ Top Priority is Healthcare