Hello everyone! I have been away for a while since I was busy finishing my capstone project for my big data certification from Ryerson University in Toronto. This project was quite challenging with lots of disappointments but immense learning. I spent nearly a month finishing it end-to-end along with report preparation. Please follow the detailed step-by-step R code on my GitHub.

Part 1 will talk about the problem statement and introduce you to the dataset and the different visualizations I came up with to understand the data better. Part 2 will discuss more about predictive modelling, using the text-to-vector framework for Natural Language Processing.

Go give it a read and let me know what you think about it. As always, I love your input. Remember: “We learn together, we grow together!”

Introduction

The New York Times (NYT) has a large reader base and plays an important role in shaping public opinion and outlook on current affairs, as well as in setting the tone of public discourse, especially in the U.S. The comments sections for NYT articles are quite active and give insight into readers’ opinions on the subject matter of the articles. Each comment can receive other readers’ recommendations in the form of upvotes.

Challenges for NYT moderators

Up to 700 comments per article with NYT moderators manually reviewing ~12,000 comments in a day.
Moderators need to make faster decisions on screening and sorting comments based on their predicted relevance and popularity.
Finding an easier way to group similar comments and maintain a useful conversation among readers.

Number of popular comments per article by News Desk

**The NYT assigns each article to a general category (e.g., Business/Sports) called a **News Desk**.

Key Research Topics

extensive analysis of NYT’s articles and comments
popularity prediction of readers’ comments

Approach

**Step-by-step code for this project can be found in my NYT NLP Capstone GitHub repository.

Step 1: Data Collection

NYT articles dataset:

The dataset comprises nine .csv files for articles published Jan–May 2017 and Jan–Apr 2018 (available on the web; it can also be scraped using NYT APIs).
In total, it contains 9,335 different articles with 15 variables.

NYT comments dataset:

A second set of nine .csv files contains the comments made on these articles (available on the web; it can also be scraped using NYT APIs).
In total, it contains 2,176,364 comments with 34 variables.

**All data files used in the project can be found here.

Step 2: Data Cleaning and Pre-processing

Limiting and reducing

Due to the sheer volume of data and limited computing resources, I decided to limit the dataset to only the top 6 of 14 available News Desks.
I converted certain “character” features to “factors.”
I changed the data type of some features, most notably converting UNIX timestamps to POSIXct.
I removed features that were not required for the analysis.
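The limiting and type-conversion steps above can be sketched roughly as follows. The project itself was written in R; this is a Python illustration using only the standard library, and the rows and field names (`news_desk`, `pub_date`) are assumptions made up for the example, not the actual columns of the dataset.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical article rows; the field names are assumptions for illustration.
articles = [
    {"news_desk": "OpEd", "pub_date": 1483228800},
    {"news_desk": "OpEd", "pub_date": 1483315200},
    {"news_desk": "OpEd", "pub_date": 1483401600},
    {"news_desk": "National", "pub_date": 1483228800},
    {"news_desk": "National", "pub_date": 1483315200},
    {"news_desk": "Sports", "pub_date": 1483228800},
]


def top_news_desks(rows, n):
    """Return the n News Desks with the most rows."""
    counts = Counter(row["news_desk"] for row in rows)
    return [desk for desk, _ in counts.most_common(n)]


def to_datetime(unix_seconds):
    """Convert a UNIX timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


# The project kept the top 6 of 14 News Desks; the toy data keeps the top 2.
keep = set(top_news_desks(articles, 2))
subset = [
    {**row, "pub_date": to_datetime(row["pub_date"])}
    for row in articles
    if row["news_desk"] in keep
]
print(len(subset), subset[0]["pub_date"].isoformat())
```

Filtering before any heavier processing keeps memory use down, which matters when the full comments table has over two million rows.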

Number of articles published per News Desk

Volume of comments per News Desk

Step 3: Sentiment Orientation Score and Calculation

Organizing and cleaning text:

Text in the comments, snippets and article headlines was tokenized using the unnest_tokens function from the tidytext package.
The text body was split into tokens (single words) in each row of the new dataframe.
I removed punctuation and converted tokens to lowercase.
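To make the one-token-per-row idea concrete, here is a small Python sketch of the same reshaping (the project used R for this step). The regular expression and the helper names `tokenize` and `unnest` are my own assumptions for illustration and will not match the R tokenizer exactly.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens without punctuation."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def unnest(rows):
    """Turn (id, text) rows into one (id, token) row per word."""
    return [(row_id, token) for row_id, text in rows for token in tokenize(text)]


# Hypothetical comments keyed by an illustrative id.
comments = [(1, "Great article, NYT!"), (2, "I don't agree.")]
for row in unnest(comments):
    print(row)
```

Keeping the id next to every token is what later allows word-level sentiment values to be summed back up per headline, snippet or comment.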

Semantic orientation score determination:

A lexicon-based approach was used to determine the orientation of each text from the semantic orientation of its words.
The Bing lexicon was chosen as the appropriate lexicon from those available in R packages.
I used the Bing lexicon to assign a sentiment score to each headline, article snippet and article comment.
Based on the sentiment score, each headline, snippet and comment was classified into one of three sentiment categories: negative, positive or neutral.
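A lexicon-based score of this kind can be sketched as below, again in Python rather than the project's R. The six-word `TOY_LEXICON` is a made-up stand-in for the Bing lexicon, and the scoring rule (positive words minus negative words, then classify by sign) is an assumption about one common way to do it, not necessarily the exact rule used in the project.

```python
# Toy stand-in for a sentiment lexicon: +1 for positive words, -1 for negative.
TOY_LEXICON = {
    "good": 1, "great": 1, "love": 1,
    "bad": -1, "terrible": -1, "hate": -1,
}


def sentiment_score(tokens, lexicon=TOY_LEXICON):
    """Sum the lexicon values of the tokens; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in tokens)


def sentiment_label(score):
    """Map a numeric score to negative, positive or neutral by its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for text in ["great article love it", "terrible bad take", "the senate voted today"]:
    score = sentiment_score(text.split())
    print(text, "->", score, sentiment_label(score))
```

Texts with no lexicon words, or with equal numbers of positive and negative words, end up in the neutral category under this rule.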

Step 4: Extracting and Analyzing Features

Besides the variables already present in the data, a few more features were derived:

aggregate sentiment scores for each article headline, snippet and comment body (negative, positive, neutral)
total number of words in each comment
total number of sentences in each comment
average words per sentence in each comment
temporal features: the time difference between when the comment was added and the article publishing date/time
day of the week when the comment was added
time of the day when the comment was added
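The text and temporal features listed above could be derived along these lines. This is a Python sketch (the project used R), and the sentence splitter, the use of hours for the time gap, and the use of the UTC hour as "time of day" are all simplifying assumptions on my part.

```python
import re
from datetime import datetime, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def comment_features(body, comment_ts, article_ts):
    """Derive simple text and temporal features for one comment.

    comment_ts and article_ts are UNIX timestamps in seconds.
    """
    words = re.findall(r"[A-Za-z0-9']+", body)
    # Naive sentence split on ., ! and ?; empty fragments are discarded.
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    created = datetime.fromtimestamp(comment_ts, tz=timezone.utc)
    n_sentences = len(sentences)
    return {
        "word_count": len(words),
        "sentence_count": n_sentences,
        "avg_words_per_sentence": len(words) / n_sentences if n_sentences else 0.0,
        "hours_since_publication": (comment_ts - article_ts) / 3600,
        "day_of_week": DAY_NAMES[created.weekday()],
        "hour_of_day": created.hour,
    }


features = comment_features("I agree. Well said!", 1483236000, 1483228800)
print(features)
```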

Converting recommendations to a binary variable:

Also, for building a predictive model using classification methodology (as discussed in my next post), the target variable “recommendation” (numeric) was converted to a binary variable with possible values of 0 or 1.

The popular vs. non-popular label was derived from the five-number summary of the original numeric recommendation variable. Its overall median was 4; therefore any comment with 3 or fewer upvotes was marked as non-popular, and any comment with 4 or more as popular.

0 = Non-Popular, 1 = Popular
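As a small illustration of the median cut-off, here is a Python sketch (not the project's R code) that labels a handful of made-up upvote counts. The function name `to_binary_popularity` and the sample values are hypothetical.

```python
from statistics import median


def to_binary_popularity(recommendations, threshold=None):
    """Label each count 1 (popular) if it is at or above the threshold, else 0.

    When no threshold is given, the median of the counts is used.
    """
    if threshold is None:
        threshold = median(recommendations)
    return [1 if r >= threshold else 0 for r in recommendations]


# Made-up upvote counts whose median happens to be 4.
upvotes = [0, 1, 3, 4, 4, 10, 25]
labels = to_binary_popularity(upvotes)
print(median(upvotes), labels)
```

Passing an explicit `threshold` makes it easy to check how sensitive the class balance is to the choice of cut-off.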

The final data frame with 26 selected features

Step 5: Data Exploration and Visualization

When building any model, we need to understand the correlation between the predictors and the response variable. The visualizations below gave me better insight into the data and into the relationships between different variables.

Frequency of articles based on sentiment: positive, neutral or negative

Do certain words in article headlines elicit more comments?

Comment popularity across the top six News Desks

Correlation between number of comments and article word counts

Correlation between comment popularity and article sentiments

Correlation between comments and temporal features

Most commonly used words in comments

Exploring correlation between text features and comment popularity

ANOVA tests were run to determine whether the relationship between the response variable and the numerical predictors was statistically significant. It was significant for all three predictors, as seen below.

Number of words per sentence in comments vs. comment popularity

Number of sentences in comments vs. comment popularity

Article-comment time gap vs. comment popularity
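For readers unfamiliar with ANOVA, the F statistic behind such a test compares the variation between groups (here, popular vs. non-popular comments) with the variation within them. The Python sketch below computes it by hand on made-up numbers; it is not the project's R code, and it stops at the F statistic, since the p-value needs the F distribution from a statistics library (for example, scipy.stats.f_oneway).

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up sentence counts for non-popular and popular comments.
non_popular = [1, 2, 3]
popular = [4, 5, 6]
print(one_way_anova_f([non_popular, popular]))
```

A large F value means the group means differ by more than the spread inside each group would suggest.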

Data Exploration Summary

The most popular News Desks for the NYT are OpEd, National, Foreign, Washington, Business and Editorial. Moderators can focus on these categories when moderating comments added by readers.
In both 2017 and 2018, articles tended to have more negative sentiments than positive. This can be linked to the political situation prevalent in the United States and the world.
The top 25 terms in the most-commented articles and the most commonly used words in their comments were similar, and included Trump, Russia, refugees, health and secrets.
Based on comment popularity distribution, National articles were the most read and liked by readers. This can be attributed to the political changes happening in the U.S. during 2017–2018.
Most comments were made during mornings and afternoons.
Tuesdays, Thursdays and Fridays were the most active days for comments, while the least active days were weekends.
There is a statistically significant relationship between the popularity of a comment and three derived features: the average number of words per sentence in a comment, the number of sentences in a comment, and the time gap between article publication and comment posting. This suggests that these features could help predict comment popularity.

Conclusion

Since there is no strong correlation between many of the predictors and the response variable, I chose a different approach to handle this problem, which I will discuss in my next blog post.

How do you feel about four upvotes being the cut-off for popularity? How do you think increasing or decreasing the threshold would affect the results? Let me know your thoughts in the comments section and keep watching this space for Part 2 🙂

11 comments on “Predicting Popularity of The New York Times Comments (Part 1)”

Manideep Allenki

August 26, 2018

I saw this on LinkedIn and have just read it. I loved Step 5 on visualization. It was superb. Also, I am eager to read Part 2.

- Sakshi Gupta
  
  August 26, 2018
  
  Thanks Manideep. I am so glad you liked it. I am equally eager to share part 2 with you guys 🙂
  
  - Swati Asthana
    
    June 16, 2019
    
    Hey Sakshi,
    
    Congratulations on such brilliant work. I am curious to see Part 2 as well. As I am working on a similar project, it would be really helpful if you could share it.
    
Mahesh

August 27, 2018

Great work Sakshi. #WaitingForPart2

- Sakshi Gupta
  
  August 27, 2018
  
  Thanks Mahesh!
  
Tricia Bryski

September 4, 2018

I am curious to what extent popularity is a function of how many people viewed the comment. This could be impacted not only by when the comment was posted, but also by how readers consume them (in order of most recent, most popular so far, or NY Times picks). Are you looking to control for this and evaluate comments by the merit of what was written? In my humble opinion, I would have chosen a higher threshold than the median (4) to represent popularity. But this lower threshold should hopefully reduce bias from fewer people having the opportunity to read them.

Gavril Bilev

September 4, 2018

This seems pretty close to my project from May-June of this year (https://github.com/bilevg/NYTimes_Comments). I’m not saying you copied, maybe these are simply the best features in the set and we both ended up feature-engineering them. Good work!

- Sakshi Gupta
  
  September 4, 2018
  
  Great work Gavril. I quickly had a glance through your project. You have some details which I couldn’t use in my project due to paucity of time. Will read it through and see how I could have done better 🙂
  
Auggie Heschmeyer

September 5, 2018

On your “Comments By Days of the Week” chart, it appears that you have your x- and y-axis labels switched. Otherwise, this is a great post. Thank you for sharing your process.

- Sakshi Gupta
  
  September 5, 2018
  
  You are right! Thanks for bringing it to my notice and reading it so thoroughly 🙂
  
Jack

September 12, 2018

Really interesting article, I look forward to part 2!
