Organizing Your First Text Analytics Project

Using Natural Language tools to uncover conversational data.

Text analytics or text mining is the analysis of “unstructured” data contained in natural language text using various methods, tools and techniques.
The popularity of text mining today is driven by statistics and the availability of unstructured data. With the growing popularity of social media and with the internet as a central location for all sorts of important conversations, text mining offers a low-cost method to gauge public opinion.

This was my inspiration to learn about text analytics, write this blog, and share my learnings with my fellow data scientists! 🙂 My key reference for this blog is DataCamp’s beautifully designed course Text Mining – Bag of Words.

Below are the six main steps for a text mining project. In this blog, I will focus on Steps 3, 4, 5 and 6 and discuss the key packages and functions in R which can be used for these steps.

1. Problem Definition

Identifying the specific goals or objectives for any project is key to its success. One needs to have domain understanding to define the problem statement appropriately.

For this article, I will be asking whether Amazon or Google has a better pay perception according to online reviews, and which has a better work-life balance according to current employee reviews.

2. Identifying the Text Sources

There are multiple ways to collect employee reviews: websites like Glassdoor and Indeed, published articles with workplace reviews, or even focus group interviews with employees.

3. Text Organization

This involves the multiple steps for cleaning and pre-processing your text. There are two main packages in R which can be used to perform this: qdap and tm.

Points to Remember:

  • the tm package works on the text corpus object
  • the qdap package is applied directly to the text vector

# x is a character vector of positive reviews for Amazon

# qdap cleaning function
> library(qdap)
> qdap_clean <- function(x) {
  x <- replace_abbreviation(x)
  x <- replace_contraction(x)
  x <- replace_number(x)
  x <- replace_ordinal(x)
  x <- replace_symbol(x)
  x <- tolower(x)
  return(x)
}

You can also add more cleaning functions to the above, based on specific requirements.

# Convert the review vector into a corpus so that tm can work on it
> corpus <- VCorpus(VectorSource(x))

Then use the tm_map() function, provided by the tm package, to apply cleaning functions to a corpus. Mapping these functions to an entire corpus makes scaling of the cleaning steps very easy.

# tm cleaning function
> library(tm)
> clean_corpus <- function(corpus){
 corpus <- tm_map(corpus, stripWhitespace)
 corpus <- tm_map(corpus, removePunctuation)
 corpus <- tm_map(corpus, content_transformer(tolower))
 # custom stop words are lowercase because the text has already been lowercased
 corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "google", "amazon", "company"))
 return(corpus)
}
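
To see what cleaning steps like these do to a single review, here is a minimal base R sketch. It does not use tm or qdap; the sample review, the very short stop word list and the helper name clean_text() are all made up for illustration (the list returned by stopwords("en") is much longer).

```r
# Toy review and a deliberately short stop word list (both made up for this sketch)
review <- "Great  benefits, and the Pay is GOOD at Amazon!"
stop_words <- c("and", "the", "is", "at", "amazon")

clean_text <- function(x, stop_words) {
  x <- tolower(x)                                 # lowercase everything
  x <- gsub("[[:punct:]]+", "", x, perl = TRUE)   # drop punctuation
  # build one pattern that matches any stop word as a whole word
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)          # remove the stop words
  x <- gsub("\\s+", " ", x, perl = TRUE)          # collapse repeated whitespace
  trimws(x)                                       # trim leading/trailing spaces
}

cleaned <- clean_text(review, stop_words)
print(cleaned)
```

Because the stop words are removed after lowercasing, they have to be written in lowercase, which is also why "amazon" appears in lowercase in the list.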

Word stemming and stem completion on a sentence using tm package

The tm package provides the stemDocument() function to get to a word’s root. This function either takes in a character vector and returns a character vector, or takes in a PlainTextDocument and returns a PlainTextDocument.

# Remove punctuation
> rm_punc <- removePunctuation(text_data)

# Create character vector
> n_char_vec <- unlist(strsplit(rm_punc, split = ' '))

# Perform word stemming: stem_doc
> stem_doc <- stemDocument(n_char_vec)

# Re-complete stemmed document: complete_doc
> complete_doc <- stemCompletion(stem_doc, comp_dict)

Point to remember:

Define your own comp_dict which is a custom dictionary containing words you want to use to re-complete the stemmed words.
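
To build some intuition for what stem completion does, here is a simplified base R sketch of prefix matching against a dictionary. It is not how tm implements stemCompletion(); the helper complete_stems() and the sample stems and dictionary are assumptions made up for illustration, and the sketch simply takes the first dictionary word that starts with each stem.

```r
# Hypothetical stems and completion dictionary (made up for this sketch)
stems <- c("complic", "hast", "rush", "fix")
comp_dict <- c("complicate", "haste", "rush", "fix")

# For each stem, return the first dictionary word that begins with it,
# or NA when the dictionary has no match
complete_stems <- function(stems, dictionary) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) > 0) hits[1] else NA_character_
  }, character(1))
}

completed <- complete_stems(stems, comp_dict)
print(completed)
```

A stem with no matching dictionary entry comes back as NA in this sketch, which is a reminder of why the contents of comp_dict matter.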

4. Feature Extraction

After completing the basic cleaning and pre-processing of text, the next step is to extract the key features, which can be done through sentiment scoring or by extracting n-grams and plotting them. For this purpose, the TermDocumentMatrix() (TDM) and DocumentTermMatrix() (DTM) functions come in very handy.

# Generate TDM
> coffee_tdm <- TermDocumentMatrix(clean_corp)

# Generate DTM
> coffee_dtm <- DocumentTermMatrix(clean_corp)

Points to remember:

You can use TDM when you have more words than documents to be reviewed, as it is easier to read a large number of rows than columns.

You can then convert the results to matrices using the as.matrix() function, and then slice and dice and review parts of these matrices.
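
To make the difference between the two layouts concrete, here is a small base R sketch that builds a term-document matrix by hand from two made-up documents and then transposes it into a document-term matrix. It only illustrates the shape of the data; it is not how tm builds its (sparse) matrices.

```r
# Two tiny, already cleaned documents (made up for this sketch)
docs <- c(doc1 = "good pay good benefits", doc2 = "smart people good pay")

# Split each document into words and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Count every term in every document: rows are terms, columns are documents
tdm_m <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
rownames(tdm_m) <- terms

# The document-term layout is simply the transpose
dtm_m <- t(tdm_m)

print(tdm_m)
print(rowSums(tdm_m))  # total frequency of each term across documents
```

With many terms and few documents, the term-document layout gives a long, narrow matrix, which is the easier shape to scroll through.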

Let’s see a simple example of creating a TDM for bigrams:

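To create a bigram TDM, we use TermDocumentMatrix() along with a control argument, which receives a list of control options (please refer to the TermDocumentMatrix documentation for more details). Here, the tokenize option is set to a function called tokenizer, which splits the text into bigrams. Note that tokenizer is a user-defined function rather than part of tm, so it has to be defined before this code is run.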

# Create bigram TDM
> amzn_p_tdm <- TermDocumentMatrix(
amzn_pros_corp,
control = list(tokenize = tokenizer)
)
# Create amzn_p_tdm_m
> amzn_p_tdm_m <- as.matrix(amzn_p_tdm) 

# Create amzn_p_freq 
> amzn_p_freq <- rowSums(amzn_p_tdm_m)
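
The tokenizer function passed to control above is not shown in this post. As a rough idea of what a bigram tokenizer does, here is a minimal base R sketch; the name bigram_tokenizer(), its implementation and the sample reviews are assumptions for illustration, not the function used to produce the results in this blog.

```r
# Split text on whitespace and pair each word with the word that follows it
bigram_tokenizer <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  words <- words[nzchar(words)]          # drop empty strings
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

bigrams <- bigram_tokenizer("great benefits and good pay")
print(bigrams)

# Count bigrams across a couple of made-up reviews
reviews <- c("great benefits and good pay", "good pay and smart people")
all_bigrams <- unlist(lapply(reviews, bigram_tokenizer))
bigram_freq <- sort(table(all_bigrams), decreasing = TRUE)
print(bigram_freq)
```

The frequency table plays the same role as amzn_p_freq above: one count per bigram, ready to be sorted and plotted.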

5. Feature Analysis

There are multiple ways to analyze the text features. A few of them are discussed below.

a. Barplot

# Sort term_frequency in descending order
> amzn_p_freq <- sort(amzn_p_freq, decreasing = TRUE)

# Plot a barchart of the 10 most common words
> barplot(amzn_p_freq[1:10], col = "tan", las = 2)

b. WordCloud

# Plot a wordcloud using amzn_p_freq values
> library(wordcloud)
> wordcloud(names(amzn_p_freq), amzn_p_freq, max.words = 25, colors = "red")

To further learn different ways to plot wordcloud, please refer to this article which I found quite useful.

c. Cluster Dendrograms

This is a simple clustering technique: it performs hierarchical clustering and draws a dendrogram to show how closely connected different phrases are.

# Create amzn_p_tdm2 by removing sparse terms
> amzn_p_tdm2 <- removeSparseTerms(amzn_p_tdm, sparse = .993)

# Create hc as a cluster of distance values
> hc <- hclust(dist(amzn_p_tdm2, method = "euclidean"), method = "complete")

# Produce a plot of hc
> plot(hc)

You can see similar topics throughout the dendrogram like “great benefits,” “good pay,” “smart people,” etc.
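
To show the mechanics on something small enough to check by hand, here is a base R sketch that runs the same dist() and hclust() calls on a tiny term-document matrix. The four phrases and their counts are toy values made up for this sketch, not results from the Amazon reviews.

```r
# Toy term-document matrix: rows are phrases, columns are two documents
tdm_m <- matrix(
  c(3, 0,
    3, 1,
    0, 4,
    1, 4),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Terms = c("great benefits", "good pay", "long hours", "work life"),
    Docs  = c("pros", "cons")
  )
)

# Euclidean distance between phrases, then complete-linkage clustering
hc <- hclust(dist(tdm_m, method = "euclidean"), method = "complete")

# Cut the tree into two groups to see which phrases end up together
groups <- cutree(hc, k = 2)
print(groups)

# plot(hc) would draw the dendrogram for these four phrases
```

Phrases with similar counts across documents sit close together in the tree, which is what makes related topics show up as neighbouring branches.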

d. Word Association

This is used to examine top phrases that appear in the word clouds and find associated terms using the findAssocs() function from the tm package.

The code below is used to find the most associated words with the most frequent terms in the positive reviews for Amazon.

# Find associations with Top 2 most frequent words
> findAssocs(amzn_p_tdm, "great benefits", 0.2)
$`great benefits`
   stock options     options four     four hundred    vacation time
            0.35             0.28             0.27             0.26
  benefits stock  competitive pay great management    time vacation
            0.22             0.22             0.22             0.22

> findAssocs(amzn_p_tdm, "good pay", 0.2)
$`good pay`
pay benefits     pay good  good people    work nice
        0.31         0.23         0.22         0.22
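
The scores above are correlations between how often two terms occur across documents. Here is a minimal base R sketch of that idea on a tiny made-up document-term matrix; it illustrates the calculation only and is not tm's own implementation of findAssocs().

```r
# Toy document-term matrix: rows are reviews, columns are phrases (made-up counts)
dtm_m <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    0, 0, 1,
    1, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Docs  = paste0("review", 1:4),
    Terms = c("great benefits", "stock options", "long hours")
  )
)

# Correlation of "great benefits" with every other phrase
assoc <- cor(dtm_m)["great benefits", ]
assoc <- assoc[names(assoc) != "great benefits"]

# Keep only phrases at or above the 0.2 threshold, as in the calls above
strong <- assoc[assoc >= 0.2]
print(round(strong, 2))
```

A phrase that tends to appear in the same reviews gets a positive score, while one that appears mostly in other reviews gets a negative score and is filtered out by the threshold.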

e. Comparison Clouds

This is used when you wish to examine two different corpora in one go, rather than analyzing them separately (which can be more time-consuming).

The code below compares the positive and negative reviews for Google.

# Create all_goog_corp (tm_clean() is a tm-based cleaning function, similar to clean_corpus() above)
> all_goog_corp <- tm_clean(all_goog_corpus)

# Create all_tdm
> all_tdm <- TermDocumentMatrix(all_goog_corp)
> all_tdm
Non-/sparse entries: 2845/1713
Sparsity           : 38%
Maximal term length: 27
Weighting          : term frequency (tf)

# Name the columns of all_tdm
> colnames(all_tdm) <- c("Goog_Pros", "Goog_Cons")

# Create all_m
> all_m <- as.matrix(all_tdm)

# Build a comparison cloud
> comparison.cloud(all_m, colors = c("#F44336", "#2196f3"), max.words = 100)

f. Pyramid Plots

A pyramid plot places two sets of horizontal bars back to back, which makes it easy to compare how often the same phrases appear in two corpora.

The code below compares the frequency of positive phrases for Amazon vs Google.

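# Create common_words (all_tdm_m is the matrix form of a TDM whose two columns hold the Amazon and Google positive reviews)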
> common_words <- subset(all_tdm_m, all_tdm_m[,1] > 0 & all_tdm_m[,2] > 0)
> str(common_words)
 num [1:269, 1:2] 1 1 1 1 1 3 2 2 1 1 ...
 - attr(*, "dimnames")=List of 2
 ..$ Terms: chr [1:269] "able work" "actual work" "area traffic" "atmosphere little" ...
 ..$ Docs : chr [1:2] "Amazon Pro" "Google Pro"

# Create difference
> difference <- abs(common_words[,1] - common_words[,2])

# Add difference to common_words
> common_words <- cbind(common_words, difference)
> head(common_words)
                  Amazon Pro Google Pro difference
able work                  1          1          0
actual work                1          1          0
area traffic               1          1          0
atmosphere little          1          1          0
back forth                 1          1          0
bad work                   3          1          2

# Order the rows from most differences to least
> common_words <- common_words[order(common_words[,"difference"],decreasing = TRUE),]

# Create top15_df
> top15_df <- data.frame(x = common_words[1:15,1], y = common_words[1:15,2], labels = rownames(common_words[1:15,]))

# Create the pyramid plot (pyramid.plot() comes from the plotrix package)
> pyramid.plot(top15_df$x, top15_df$y,
 labels = top15_df$labels, gap = 12,
 top.labels = c("Amzn", "Pro Words", "Google"),
 main = "Words in Common", unit = NULL)
 [1] 5.1 4.1 4.1 2.1

6. Drawing Conclusions

Based on the above visual (“Words in Common” pyramid plot), overall Amazon looks to have a better work environment and work-life balance than Google. Working hours seem to be longer at Amazon, but perhaps they provide other benefits to restore the work-life balance. We would need to collect more reviews to draw a firmer conclusion.

So, finally we come to the end of this blog. We learned how to organize our text analytics project, the different steps involved in cleaning and pre-processing and finally how to visualize the features and draw conclusions. I am on my way to completing my text analytics project based on this blog and learnings from DataCamp. I will soon post my GitHub repository for the project to help you further. Our next goal should be to perform sentiment analysis. Till then keep CODING 😀

Hope you liked this blog. Do share your comments on what you liked and what you would like me to improve in my next blog.

Keep watching this space for more. Cheers!

I am a Big Data & Data Analytics student at Ryerson University. I come from a Business & Technology background and have rich global experience in solving clients' Business & Data problems through IT & Analytics solutions. I love programming (in R, SQL and Python), painting and interacting with people. Connect with me on LinkedIn: https://www.linkedin.com/in/sakshi-gupta-mba/
