
Predicting Popularity of The New York Times Comments (Part 1)

My 2018 summer capstone project

Hello everyone! I have been away for a while because I was busy finishing my capstone project for my big data certification from Ryerson University in Toronto. The project was quite challenging, with plenty of disappointments but immense learning. I spent nearly a month completing it end to end, including preparing the report. You can follow the detailed step-by-step R code on my GitHub.

Part 1 will talk about the problem statement and introduce you to the dataset and the different visualizations I came up with to understand the data better. Part 2 will discuss more about predictive modelling, using the text-to-vector framework for Natural Language Processing.

Go give it a read and let me know what you think. As always, I love your input. Remember: “We learn together, we grow together!”


Introduction

The New York Times (NYT) has a large reader base and plays an important role in shaping public opinion and outlook on current affairs, and in setting the tone of public discourse, especially in the U.S. The comments sections for NYT articles are quite active and give insight into readers’ opinions on the subject matter of the articles. Each comment can receive other readers’ recommendations in the form of upvotes.

Challenges for NYT moderators

  • Articles can attract up to 700 comments each, and NYT moderators manually review ~12,000 comments a day.
  • Moderators need to make faster decisions when screening and sorting comments based on their predicted relevance and popularity.
  • Moderators need an easier way to group similar comments and maintain a useful conversation among readers.

Number of popular comments per article by News Desk

fig-1.png
**The NYT assigns each article to a general category (e.g., Business/Sports) called a News Desk.

Key Research Topics

  • extensive analysis of NYT’s articles and comments
  • popularity prediction of readers’ comments

Approach

approach

**Step-by-step code for this project can be found in my NYT NLP Capstone GitHub repository.

Step 1: Data Collection

NYT articles dataset:

  • The dataset consists of nine .csv files for articles published Jan–May 2017 and Jan–Apr 2018 (available on the web; it can also be scraped using NYT APIs).
  • In total, it contains 9,335 distinct articles with 15 variables.

fig 2

NYT comments dataset:

  • A second set of nine .csv files contains the comments made on these articles (available on the web; it can also be scraped using NYT APIs).
  • In total, it contains 2,176,364 comments with 34 variables.

fig 2

**All data files used in the project can be found here.

Step 2: Data Cleaning and Pre-processing

Limiting and reducing

  • Due to the sheer volume of data and limited computing resources, I decided to limit the dataset to only the top 6 of 14 available News Desks.
  • I converted certain “character” features to “factors.”
  • I changed the data type of some features, most notably converting UNIX timestamps to POSIXct.
  • I removed features that were not required for the analysis.
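
The limiting and type-conversion steps above can be sketched roughly as follows. The project itself was done in R; this is a minimal Python analogue using only the standard library, not the original code. The field names (`newDesk`, `createDate`) and the toy rows are assumptions for illustration, and the example keeps the top 2 desks instead of the top 6 only because the toy data is so small.

```python
from collections import Counter
from datetime import datetime, timezone

# Toy comment rows; the field names are assumed, not taken from the post.
comments = [
    {"newDesk": "OpEd", "createDate": 1514764800},
    {"newDesk": "OpEd", "createDate": 1514768400},
    {"newDesk": "OpEd", "createDate": 1514772000},
    {"newDesk": "National", "createDate": 1514775600},
    {"newDesk": "National", "createDate": 1514779200},
    {"newDesk": "Sports", "createDate": 1514782800},
]


def top_desks(rows, n):
    """Return the n News Desks with the most rows, busiest first."""
    counts = Counter(row["newDesk"] for row in rows)
    return [desk for desk, _ in counts.most_common(n)]


def preprocess(rows, n_desks):
    """Keep rows from the top desks and convert UNIX timestamps to datetimes."""
    keep = set(top_desks(rows, n_desks))
    cleaned = []
    for row in rows:
        if row["newDesk"] not in keep:
            continue  # drop rows from less active desks
        cleaned.append({
            "newDesk": row["newDesk"],
            # UNIX seconds -> timezone-aware datetime (analogous to POSIXct in R)
            "createDate": datetime.fromtimestamp(row["createDate"], tz=timezone.utc),
        })
    return cleaned


for row in preprocess(comments, n_desks=2):
    print(row["newDesk"], row["createDate"].isoformat())
```

Converting timestamps to a proper date-time type early makes the later temporal features (day of week, time of day, time gap) straightforward to derive.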

Number of articles published per News Desk

fig-4.png

Volume of comments per News Desk

fig-5.png

Step 3: Sentiment Orientation Score and Calculation

Organizing and cleaning text:

  • Text in the comments, snippets and article headlines was tokenized using the unnest_tokens function (from the tidytext package).
  • Each text body was split into tokens (single words), with one token per row of the new data frame.
  • I removed punctuation and converted tokens to lowercase.
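
To make the one-token-per-row idea concrete, here is a small Python sketch of the same transformation. It is an illustration only, not the tidytext implementation: `tokenize_rows` is a hypothetical helper, and the field names `commentID` and `commentBody` are assumptions.

```python
import re

# Matches lowercase words, keeping apostrophes inside a word (e.g. "isn't").
WORD_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def tokenize_rows(rows, id_key="commentID", text_key="commentBody"):
    """Return one {id, word} row per token, lowercased and without punctuation."""
    tokens = []
    for row in rows:
        for word in WORD_PATTERN.findall(row[text_key].lower()):
            tokens.append({id_key: row[id_key], "word": word})
    return tokens


sample = [{"commentID": 1, "commentBody": "Great article, isn't it?"}]
for token in tokenize_rows(sample):
    print(token)
```

Keeping the comment identifier on every token row is what later allows per-word sentiment labels to be aggregated back to a score per comment.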

Semantic orientation score determination:

  • A lexicon-based approach was used to derive the orientation of each text from the semantic orientation of its words.
  • I selected the BING lexicon from the lexicons available in R packages.
  • I used BING to assign a sentiment score to each headline, snippet of article and article comment.
  • Based on the sentiment score, each headline, snippet and comment was classified into one of three sentiment categories: negative, positive or neutral.
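
A lexicon-based score of this kind can be sketched as below. This is a hedged Python illustration, not the original R code: the tiny `TOY_LEXICON` stands in for the much larger BING word list, and defining the score as positive words minus negative words is an assumption about how the aggregate was computed.

```python
import re

# Tiny stand-in for the BING lexicon, which labels words as positive or negative.
TOY_LEXICON = {
    "good": "positive",
    "great": "positive",
    "love": "positive",
    "bad": "negative",
    "terrible": "negative",
    "corrupt": "negative",
}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Score a text as (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for w in words if lexicon.get(w) == "positive")
    negatives = sum(1 for w in words if lexicon.get(w) == "negative")
    return positives - negatives


def sentiment_label(score):
    """Map a numeric score to one of the three sentiment categories."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for text in ["A great piece, I love it",
             "Terrible and corrupt, but a good read",
             "The article was published today"]:
    score = sentiment_score(text)
    print(score, sentiment_label(score), "|", text)
```

Words that are not in the lexicon contribute nothing, so a text with no matched words, or with equal positive and negative matches, ends up neutral.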

Step 4: Extracting and Analyzing Features

Besides the variables already present in the data, a few more features were derived:

  • aggregate sentiment scores for each article headline, snippet and comment body (negative, positive, neutral)
  • total number of words in each comment
  • total number of sentences in each comment
  • average words per sentence in each comment
  • temporal features: the time difference between when the comment was added and the article publishing date/time
  • day of the week when the comment was added
  • time of the day when the comment was added
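
The text and temporal features listed above could be derived along these lines. This is a simplified Python sketch under stated assumptions: both timestamps are taken to be UNIX seconds, sentences are split naively on `.`, `!` and `?`, and `comment_features` is a hypothetical helper, not a function from the project.

```python
import re
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def comment_features(body, created_ts, published_ts):
    """Derive simple text and temporal features for a single comment."""
    words = re.findall(r"[A-Za-z0-9']+", body)
    # Naive sentence split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    created = datetime.fromtimestamp(created_ts, tz=timezone.utc)
    n_sentences = len(sentences)
    return {
        "word_count": len(words),
        "sentence_count": n_sentences,
        "avg_words_per_sentence": len(words) / n_sentences if n_sentences else 0.0,
        # Hours between article publication and the comment being added.
        "gap_hours": (created_ts - published_ts) / 3600,
        "weekday": WEEKDAYS[created.weekday()],
        "hour": created.hour,
    }


features = comment_features(
    "I agree. This is a great piece!",
    created_ts=1514772000,    # two hours after publication
    published_ts=1514764800,  # 2018-01-01 00:00:00 UTC
)
print(features)
```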

Converting recommendations to a binary variable:

To build a predictive model using classification (as discussed in my next post), the numeric target variable “recommendation” was converted to a binary variable with possible values of 0 or 1.

The popular vs. non-popular split was derived from the five-number summary of the original recommendation variable. Its overall median was 4; therefore, any comment with <=3 upvotes was marked as non-popular, and any comment with 4 or more upvotes as popular.

0 = Non-Popular, 1 = Popular
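
The median-based cut-off can be illustrated with a short Python sketch. The upvote counts below are made-up toy values chosen so that the median is 4, as in the post; the helper names are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def to_popular(recommendations, threshold):
    """Label each comment 1 (popular) if upvotes >= threshold, else 0."""
    return [1 if r >= threshold else 0 for r in recommendations]


# Toy upvote counts, not real data.
upvotes = [0, 1, 2, 4, 4, 7, 10, 25, 60]
summary = five_number_summary(upvotes)
labels = to_popular(upvotes, threshold=summary[2])  # cut at the median

print(summary)
print(labels)
```

Because upvote counts are heavily right-skewed, cutting at the median gives two classes of roughly similar size, whereas a higher threshold would make the popular class much smaller.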

fig 9

The final data frame with 26 selected features

fig 10

Step 5: Data Exploration and Visualization

When building any model, we need to understand the relationship between the predictors and the response variable. The visualizations below gave me better insight into the data and into the relationships between different variables.

Frequency of articles based on sentiment: positive, neutral or negative

Picture1

Do certain words in article headlines elicit more comments?

Picture2

Comment popularity across the top six News Desks

picture1.png

Correlation between number of comments and article word counts

Picture2

Correlation between comment popularity and article sentiments

picture3.png

Correlation between comments and temporal features

Picture5

Most commonly used words in comments

picture6.png

Exploring correlation between text features and comment popularity

ANOVA tests were run to determine whether the relationship between the response variable and each numerical predictor was statistically significant. It was significant for all three predictors, as seen below.

Number of words per sentence in comments vs. comment popularity

Picture8

Number of sentences in comments vs. comment popularity

Picture9

Article-comment time gap vs. comment popularity

Picture10
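
For readers unfamiliar with ANOVA, the F statistic behind tests like the ones above can be computed by hand. The Python sketch below shows a one-way ANOVA comparing a numerical predictor between popular and non-popular comments. The numbers are made-up toy values, and grouping the predictor by the popularity class is an assumption about how the tests were set up. Turning F into a p-value additionally requires the F distribution, which is omitted here.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for group in groups for v in group]
    grand_mean = fmean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy values: average words per sentence for each class (not real data).
popular = [20, 22, 24]
non_popular = [10, 12, 14]
print(one_way_anova_f([popular, non_popular]))
```

A large F means the class means differ by much more than the spread within each class would explain, which is what a statistically significant result reflects.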

Data Exploration Summary

  • The most popular News Desks for the NYT are OpEd, National, Foreign, Washington, Business and Editorial. Moderators can focus on these categories when moderating comments added by readers.
  • In both 2017 and 2018, articles tended to have more negative sentiments than positive. This can be linked to the political situation prevalent in the United States and the world.
  • The top-25 terms in the headlines of the most-commented articles were similar to the most commonly used words in the comments themselves, and include Trump, Russia, refugees, health and secrets.
  • Based on comment popularity distribution, National articles were the most read and liked by readers. This can be attributed to the political changes happening in the U.S. during 2017–2018.
  • Most comments were made during mornings and afternoons.
  • Tuesdays, Thursdays and Fridays were the most active days for comments, while the least active days were weekends.
  • There is a strong correlation between the popularity of a comment and a few derived features: the average number of words per sentence in a comment, the average number of sentences in a comment, and the average time gap between article and comment publishing. This implies that these features could be used to predict comment popularity.

Conclusion

Since many of the predictors show no strong correlation with the response variable, I chose a different approach to handle this problem, which I will discuss in my next post.

How do you feel about four upvotes being the cut-off for popularity? How do you think increasing or decreasing the threshold would affect the results? Let me know your thoughts in the comments section and keep watching this space for Part 2 🙂

 

I come from a Business & Technology background and have rich global experience in solving clients' Business & Data problems through IT & Analytics solutions. I love programming (in R, SQL and Python), painting and interacting with people. Connect with me on LinkedIn: https://www.linkedin.com/in/gupta-sakshi/

10 comments on “Predicting Popularity of The New York Times Comments (Part 1)”

  1. Manideep Allenki

    I saw this on LinkedIn and have just read. I loved the step 5 about visualization. It was superb. Also I am eager to learn part 2


    • Sakshi Gupta

      Thanks Manideep. I am so glad you liked it. I am equally eager to share part 2 with you guys 🙂


  2. Great work Sakshi. #WaitingForPart2


  3. Tricia Bryski

    I am curious to what extent popularity is a function of how many people viewed the comment. This could be impacted not only by when the comment was posted, but also by how readers consume them (in order of most recent, most popular so far, or NY Times picks). Are you looking to control for this and evaluate comments by the merit of what was written? In my humble opinion, I would have chosen a higher threshold than the median (4) to represent popularity. But this lower threshold should hopefully reduce bias from fewer people having the opportunity to read them.


  4. This seems pretty close to my project from May-June of this year (https://github.com/bilevg/NYTimes_Comments). I’m not saying you copied, maybe these are simply the best features in the set and we both ended up feature-engineering them. Good work!


    • Sakshi Gupta

      Great work Gavril. I quickly had a glance through your project. You have some details which I couldn’t use in my project due to paucity of time. Will read it through and see how I could have done better 🙂


  5. Auggie Heschmeyer

    On you “Comments By Days of the Week” chart, it appears that you have your x- and y-axis labels switched. Otherwise, this a great post. Thank you for sharing your process.


  6. Really interesting article, I look forward to part 2!

