In this blog post, I'd like to share my exploration of three different sentiment lexicons available through R's tidytext package, following up on my last post on sentiment analysis. It is also a chance to re-ground ourselves in tidy data1 principles and showcase tidytext itself: its simplicity and efficiency let you get creative with your analysis using three very different output options.
Using Emily Brontë's Wuthering Heights, the table below illustrates the output of the three lexicons. Under each header, I have also included a topline description of that lexicon.
As each output shows, the lexicon you choose will shape how you summarize and assess your project.
It is important to note that the tidytext package is firmly grounded in tidy data principles, where "each variable is a column, each observation is a row, and each type of observational unit is a table."1 The unnest_tokens() function breaks text into individual tokens (tokenization) in a tidy data structure, which produces the output above. Couple this with the built-in stop_words dataset, and your text project is ready for analysis in minutes (see this in action via the GitHub link below). All of this made me hungry for an overall approach to text mining, which I hope will help you as well.
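A minimal sketch of that tokenization step, using a couple of the novel's opening lines as a stand-in for the full text (the lxw1 name matches the object summarized later; building it this way is my assumption):

```r
library(dplyr)
library(tidytext)

# A few opening words of Wuthering Heights as a stand-in for the full text
wh <- tibble(text = c(
  "I have just returned from a visit to my landlord,",
  "the solitary neighbour that I shall be troubled with."
))

# unnest_tokens() gives one word per row (tokenization, tidy structure);
# anti_join(stop_words) drops common words like "the" and "with"
lxw1 <- wh %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```

From here, each lexicon is attached with a simple inner_join(), which is what makes the three outputs so quick to compare.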
As a structuralist, I found a succinct workflow in Text Mining with R by Julia Silge & David Robinson, further affirming my love for tidytext.
As per the workflow, the next step is to summarize (and assess) the output from the three sentiment lexicons.
```r
# Summarize "nrc"
lxw1 %>%
  inner_join(get_sentiments("nrc")) %>%
  count(sentiment, sort = TRUE)

# Summarize "bing"
lxw1 %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, sort = TRUE)

# Summarize "afinn" -- opted for sort = FALSE to see the distribution across all scores
lxw1 %>%
  inner_join(get_sentiments("afinn")) %>%
  count(score, sort = FALSE)
```
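Because AFINN assigns numeric scores rather than categories, its counts can also be collapsed into a single net-sentiment figure. Here is a hedged, self-contained sketch using a tiny hand-made AFINN-style lexicon (the real one comes from get_sentiments("afinn"); note that recent tidytext releases name the score column `value` rather than `score`):

```r
library(dplyr)

# A tiny AFINN-style lexicon for illustration only
afinn_mini <- tibble(
  word  = c("gloomy", "love", "misery"),
  value = c(-2, 3, -2)
)

# A handful of tokens standing in for the tidied text
tokens <- tibble(word = c("gloomy", "moors", "love", "misery"))

# Net sentiment: sum the score of every matched word
tokens %>%
  inner_join(afinn_mini, by = "word") %>%
  summarise(net_sentiment = sum(value))  # -2 + 3 + (-2) = -1
```

Unmatched words ("moors") simply drop out of the inner join, which is worth keeping in mind when you assess coverage across lexicons.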
I hope this quick preview will help jumpstart your next text sentiment project and help you visualize where you want to take it. Don’t forget the workflow guide above.
As for those of you wondering if something was missed, yes, I have stopped short of the visualization which is another whole discussion! Happy text mining!
1Hadley Wickham, "Tidy Data," Journal of Statistical Software, vol. 59 (2014).