
What’s Your Twitter story? #TextMining #SentimentAnalysis

This mini-project was inspired by “Case study: comparing Twitter archives,” a chapter in the wonderful book Text Mining with R by Julia Silge and David Robinson. All code used can be found in this book, unless otherwise noted.


I happen to be a (very) small contributor to the 500 million daily tweets that are shared on Twitter. On multiple accounts no less. As such, after @terchablued recommended I read Text Mining with R as a first step into the text mining, natural language processing (NLP) and sentiment analysis part of the data analytics world, I decided to take a look at my very first Twitter account!

Before we proceed any further, a quick disclaimer about the Twitter account in question. What started off as a personal account spiralled uncontrollably into a sort-of fandom account. My anything-and-everything-goes account.

It’s mostly where I go to scream about One Direction.

OH MY GAWDDD WAAAAAAAAAAAAAAAAAAAA *deep breath* aaaAAAAAAAAAAAAAAA #OneDirectionBestFans!@#@!!!!

*clears throat* On to the serious business now.

When I picked up this project, I wanted to answer two questions in particular:

  1. Has anything changed when it comes to my most tweeted topic on this account? (Which you will quickly discover is One Direction.)
  2. Am I tweeting more positively or negatively than before?

The “before” in my second question is highly subjective. What should count as “before”?

To get things started, I plotted my timeline and got the following:

[Plot: timeline of my tweets over time]

Evidently, you can see two big clumps and almost a three-year gap.

Not so evidently, I had some very spotty presence between October 2013 and December 2016; I had four retweets in 2015 that don’t even show up on this graph.

Based on this observation, I split my data into two subsets:

  • Past: All tweets starting from when I created this account, up to 2015, when I graduated from college.
  • Present: All tweets from 2016 to March 4, 2018 (when I downloaded my archive).

We also need to take into account that ZAYN left One Direction in March 2015 (NOT LIKE I EVEN MISS HIM) and consider whether that affected my top words, even though I had barely tweeted anything in 2015 or 2016.


Let’s have “fun” cleaning the data!!

Feel free to skip ahead to the next section if you don’t want to hear me gripe about contractions…
BAKA!!! ノ(⋋▂⋌)ノ Not those contractions!!

Before I proceeded to unnesting and tokenization, I wanted to change all contractions to their full forms; otherwise, I was ending up with partial tokens such as “don” (part of don’t) and “ve” (from could’ve, would’ve, should’ve, etc.)

I found a solution to this in a DataCamp tutorial article on NLP and machine learning in R for lyric analysis:

library(tm)   # loaded per the DataCamp tutorial; the function itself only needs base R's gsub()

fixcontractions <- function(a) {
  a <- gsub("’", "'", a)           # the devil that is smart apostrophes: turn them into normal ones first (order matters!)
  a <- gsub("won't", "will not", a)
  a <- gsub("can't", "cannot", a)
  a <- gsub("n't", " not", a)
  a <- gsub("'ll", " will", a)
  a <- gsub("'re", " are", a)
  a <- gsub("'ve", " have", a)
  a <- gsub("'m", " am", a)
  a <- gsub("'d", "", a)           # 'd can be "had" or "would", so I did not expand it at all
  a <- gsub("'s", "", a)           # could be a possessive noun, doesn't need to be expanded

  return(a)
}

The above code has been slightly modified for my purposes. Thank you, Debbie!
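For context, applying it is just a matter of calling the function on the text column, since gsub() is vectorised. A minimal sketch, assuming the archive lives in a data frame called tweets with a text column (hypothetical names, not necessarily what is on GitHub):

# quick sanity check on a single string
fixcontractions("I can't believe they won't tour")
#> "I cannot believe they will not tour"

# clean every tweet in one go (assumes a `tweets` data frame with a character column `text`)
tweets$text <- fixcontractions(tweets$text)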

At first, nothing was transforming. After much testing, I noticed the apostrophes themselves were not being picked up. Later, I realized they were what I like to call “fancy” apostrophes. I added in a line to the very top of the function (because order matters!) to change all the fancy apostrophes to the normal ones R understands, so the rest of the code can work.

SPOILER: It didn’t!!

Or at least I thought it didn’t…

Fortunately, it worked on some test strings and data frames. So after spending the better part of two days, I gave it one last shot. I tested my subsetted past data frame and noticed it had transformed the data, but when I tried it on the present data frame, nothing was changing in the first fifteen tweets, which were all from 2018.

I scrolled a bit further and noticed that most 2017 tweets were being transformed; it was just the recent tweets that made me think nothing was affected by my function. The two partial contractions (“don” and “ve”) mentioned earlier still made it into the Top 15 words in the present subset. I can’t be sure of the exact reason, but it could have something to do with how Twitter saves the data.

Let me know in the comments if you can shine some light on this.

UPDATE: I figured out the cause (with some help) and was able to fix the non-transforming tweets! *YAY* The perfectionist in me just wouldn’t let me post this article with those pesky partial contraction tokens. To see the difference this fix made, check out the before-and-after side-by-side graphics of the Top 15 past and present words in the next section. The GitHub code will be updated with all the code in the correct order to fix the issue, as well as some more detailed notes. Link below!


Visualizations and Answers: Also if you skipped >:|

Following the case study, we then unnest, tokenize, and calculate word frequencies. For the sake of simplicity, I also removed all retweets to focus only on my own tweets.
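For anyone following along at home, that step looks roughly like this. It’s only a sketch in the spirit of the book’s case study: past and freq_past are placeholder names of mine, and the regexes follow the book’s tweet-friendly tokenizing approach.

library(dplyr)
library(stringr)
library(tidytext)

# tidy the past subset: drop retweets, strip links/HTML escapes, then tokenize
# with a regex that keeps @handles and #hashtags intact (assumes `past` has a `text` column)
replace_reg <- "https?://t\\.co/[A-Za-z\\d]+|&amp;|&lt;|&gt;"
unnest_reg  <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

tidy_past <- past %>%
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% stop_words$word, str_detect(word, "[a-z]"))

# word frequencies for the subset
freq_past <- tidy_past %>%
  count(word, sort = TRUE) %>%
  mutate(freq = n / sum(n))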

[Plot: word frequencies, past vs. present]
A quick read of this plot in terms of my first question: while I may have used @zaynmalik a lot in the past data frame, note how far it sits from the red line, up and to the left. I still use @louis_tomlinson and @niallofficial in the present about as much as I did before; those two handles sit near the top and close to the red line, suggesting almost equal frequency in the past and present.

Would this have anything to do with ZAYN leaving the GREATEST BOY BAND ON EARTH?

Maybe. Or maybe it just has to do with the fact that I’ve loved Louis and Niall just a little more than the rest since the very beginning. But shhh, don’t tell Harry and Liam because I love them too. ❤

No, but how does Zayn really fit into any of this?

If you remember, I mentioned earlier that I was not very active on this account from October 2013 to December 2016. This is because Zayn (“Management”) announced he was leaving One Direction on March 25, 2015.

I didn’t rush to Twitter and get angry or cry—which I may have done later that night into my pillow *ahem*—but when I did come back in 2016, even 1D had been on their own hiatus for a year at the time.

As such, in the above plot you can see @onedirection is also sitting on the past side, not too far from @zaynmalik. This does not mean I don’t like One Direction or Zayn anymore; I just tweet about them a lot less. In fact, I’m eagerly awaiting the 1D reunion in 2020 and looking forward to hearing more of all five boys’ solo work! Until then, I can only be excited about my wonderful plot of all the differences between my tweets.

This leads us to my second curiosity: whether the language in my tweets has become more positive or more negative. Sentiment analysis lets me dig even further into these differences, so bear with me for just a little longer.

In sentiment analysis, tokenized words are assigned to one or more “sentiments” depending on the lexicon used. My colleague @terchablued recently published an article on the very topic of lexicons, and it is worth the read.

I will be using the Bing lexicon on my subsetted tidy_past and tidy_present data frames to extract the most positive and negative words from each subset.
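Pulling those out is a short pipeline with tidytext; here is a sketch for the past subset (the present one works exactly the same way):

library(dplyr)
library(tidytext)

# attach Bing sentiment labels to each token, then count the most common
# positive and negative words in the past subset
bing_past <- tidy_past %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)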

Here are the “Frequently Used” word clouds I ended up with:

[Word clouds: PAST | PRESENT | OVERALL]
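Clouds like these can be drawn with the wordcloud package; a minimal sketch for one subset, where max.words = 100 is just an illustrative cutoff:

library(dplyr)
library(wordcloud)

# "Frequently Used" cloud for the past subset: count tokens and hand them to wordcloud()
tidy_past %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))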

And these are the Top 25-30 words (positive and negative) from Bing:

[Top positive and negative Bing words: PAST | PRESENT | OVERALL]
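A chart like that can be built straight from bing_past (or its present counterpart) with ggplot2; this sketch uses an arbitrary cutoff of 25 words per sentiment:

library(dplyr)
library(ggplot2)

# top positive and negative Bing words for the past subset, faceted by sentiment
bing_past %>%
  group_by(sentiment) %>%
  slice_max(order_by = n, n = 25) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  labs(x = "Count", y = NULL)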

Based on the positive and negative word clouds, we can confidently conclude that I tweeted positive and negative words almost equally in the past, and that I actually tweet more positive words in the present. Judging by the general word clouds from my overall past AND present data, Louis and Niall tend to be my focus. Surprise, surprise!

Another good visual for comparing the Top 15 past and present words is a horizontal bar chart.

The salmon-coloured bars refer to words I am more likely to use in the present, and the turquoise/teal bars refer to words I was more likely to use in the past. Notice how words like “midterm” and “physics” belong to the past, while words like “muslims” and “chicago” (and its variant #chicagoadventures, from my summer trip last year) show up more in the present.
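For the curious, the book’s case study builds this kind of comparison from log odds ratios, and ggplot2’s default two-colour palette is most likely where the salmon and teal come from. A rough sketch adapted to Past vs Present follows; tidy_all and the period column are placeholder names of mine.

library(dplyr)
library(tidyr)
library(ggplot2)

# log odds ratio of each word between the two periods;
# `tidy_all` is assumed to be the combined tidy data with `word` and `period` columns
word_ratios <- tidy_all %>%
  count(word, period) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%                     # keep words that show up often enough
  ungroup() %>%
  pivot_wider(names_from = period, values_from = n, values_fill = 0) %>%
  mutate(across(c(Past, Present), ~ (.x + 1) / (sum(.x) + 1))) %>%
  mutate(logratio = log(Present / Past))

# 15 words most strongly associated with each period, as diverging horizontal bars
word_ratios %>%
  group_by(logratio < 0) %>%
  slice_max(order_by = abs(logratio), n = 15) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(logratio, word, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Log odds ratio (Present / Past)", y = NULL)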

It really shows that I’ve transitioned from my school days and am more aware of things going on in the world, hence the word “muslims” showing up at the very top—a topic that I may venture into in the future.


Wrapping It Up

So there you have it! A little spin on the case study that inspired me to turn this into a project.

A cool follow-up to this timeline analysis would be to web-scrape my likes on this account and see what insights I can draw from them compared to my actual tweets, since a lot of the time I don’t bother to tweet or retweet and just like a bunch of things instead! (The 15.7K likes vs. the 3.8K tweets I just analyzed from this account are proof of that!)

Here’s an article-appropriate parting gift:

I hope this mini-project of mine has intrigued you enough to check out your own Twitter archive(s)!

Be sure to check out all the code on my GitHub.

You can follow me on Twitter at @nazneen2411. I want to say it’s my proper work Twitter account, but I just shared a programming meme on it so… #imtrying.

Until next time!

