Hello everyone! I have been away for a while since I was busy finishing my capstone project for my big data certification from Ryerson University in Toronto. This project was quite challenging with lots of disappointments but immense learning. I spent nearly a month finishing it end-to-end along with report preparation. Please follow the detailed step-by-step R code on my GitHub.
Part 1 will talk about the problem statement and introduce you to the dataset and the different visualizations I came up with to understand the data better. Part 2 will discuss more about predictive modelling, using the text-to-vector framework for Natural Language Processing.
Go give it a read and let me know what you think about it. As always, I love your input. Remember: “We learn together, we grow together!”
The New York Times (NYT) has a large reader base and plays an important role in shaping public opinion on current affairs and in setting the tone of public discourse, especially in the U.S. The comments sections for NYT articles are quite active and offer insight into readers’ opinions on the subject matter of the articles. Each comment can receive other readers’ recommendations in the form of upvotes.
Challenges for NYT moderators
- Up to 700 comments per article with NYT moderators manually reviewing ~12,000 comments in a day.
- Moderators need to make faster decisions on screening and sorting comments based on their predicted relevance and popularity.
- Finding an easier way to group similar comments and maintain a useful conversation among readers.
Number of popular comments per article by News Desk
Key Research Topics
- extensive analysis of NYT’s articles and comments
- popularity prediction of readers’ comments
Note: Step-by-step code for this project can be found in my NYT NLP Capstone GitHub repository.
Step 1: Data Collection
NYT articles dataset:
- The dataset consists of nine .csv files for articles published Jan–May 2017 and Jan–Apr 2018 (available on the web, can also be scraped using NYT APIs).
- Totalling 9,335 different articles with 15 variables.
NYT comments dataset:
- There was another set of nine .csv files containing the collection of comments made on these articles (available on the web, can also be scraped using NYT APIs).
- Totalling 2,176,364 comments with 34 variables.
Note: All data files used in the project can be found here.
Step 2: Data Cleaning and Pre-processing
Limiting and reducing
- Due to the sheer volume of data and limited computing resources, I decided to limit the dataset to only the top 6 of 14 available News Desks.
- I converted certain “character” features to “factors.”
- I changed the datatype of some of the features, especially UNIX timestamp format to POSIXct.
- I removed features that were not required for the analysis.
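The cleaning steps above can be sketched in base R roughly as follows. This is a minimal illustration on a made-up data frame, not the project code: the column names (newDesk, createDate, commentBody, picURL) and the sample values are assumptions chosen for the example.

```r
# Toy stand-in for the comments data; column names and values are assumed.
comments <- data.frame(
  newDesk     = c("OpEd", "National", "Sports", "OpEd"),
  createDate  = c(1483455908, 1483460000, 1483470000, 1483480000),
  commentBody = c("Great piece.", "I disagree. Strongly!", "Go team!", "Well said."),
  picURL      = NA,
  stringsAsFactors = FALSE
)

# Limit the data to the six News Desks kept for the analysis.
top_desks <- c("OpEd", "National", "Foreign", "Washington", "Business", "Editorial")
comments <- comments[comments$newDesk %in% top_desks, ]

# Convert a character feature to a factor.
comments$newDesk <- factor(comments$newDesk)

# Convert UNIX timestamps (seconds since 1970-01-01) to POSIXct.
comments$createDate <- as.POSIXct(comments$createDate,
                                  origin = "1970-01-01", tz = "UTC")

# Drop a feature that is not needed for the analysis.
comments$picURL <- NULL

str(comments)
```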
Number of articles published per News Desk
Volume of comments per News Desk
Step 3: Sentiment Orientation Score and Calculation
Organizing and cleaning text:
- Text in the comments, snippets and article headlines was tokenized using the unnest_tokens function from the tidytext package.
- The text body was split into tokens (single words) in each row of the new dataframe.
- I removed punctuation and converted tokens to lowercase.
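To make the one-token-per-row idea concrete, here is a rough base R equivalent of what the tokenization step produces. It is only a sketch: the project itself used unnest_tokens, and the data frame, column names and headlines below are invented for illustration.

```r
# Two made-up headlines; articleID and headline are assumed column names.
headlines <- data.frame(
  articleID = c("a1", "a2"),
  headline  = c("Senate Votes on Health Bill!", "Markets Rally, Again"),
  stringsAsFactors = FALSE
)

# Lowercase the text and replace anything that is not a letter,
# apostrophe or space with a space (this drops punctuation and digits).
clean <- tolower(headlines$headline)
clean <- gsub("[^a-z' ]+", " ", clean)

# Split each headline into single words.
word_list <- strsplit(trimws(clean), "\\s+")

# One row per token, keeping the ID of the headline it came from.
tokens <- data.frame(
  articleID = rep(headlines$articleID, lengths(word_list)),
  word      = unlist(word_list),
  stringsAsFactors = FALSE
)

tokens
```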
Semantic orientation score determination:
- A lexicon-based approach was used: the orientation of a text is derived from the semantic orientation of the words it contains.
- The appropriate lexicon (BING) was identified from the available packages in R.
- I used BING to assign a sentiment score to each headline, snippet of article and article comment.
- Based on the sentiment score, each headline, snippet and comment was classified into one of three sentiment categories: negative, positive or neutral.
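The scoring idea can be illustrated with a tiny hand-made lexicon. Everything below is an assumption for the sake of the example: the four-word lexicon stands in for the real one, the column names are invented, and the scoring rule (positive words minus negative words, with a score of zero treated as neutral) is one plausible way to arrive at the three categories, not necessarily the exact rule used in the project.

```r
# Hypothetical mini-lexicon standing in for the real sentiment lexicon.
lexicon <- data.frame(
  word      = c("good", "great", "bad", "terrible"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Already-tokenized comments, one word per row (made-up data).
tokens <- data.frame(
  commentID = c(1, 1, 1, 2, 2, 3, 3),
  word      = c("a", "great", "article", "terrible", "idea", "just", "news"),
  stringsAsFactors = FALSE
)

# Look up each token; words missing from the lexicon contribute 0.
polarity <- lexicon$sentiment[match(tokens$word, lexicon$word)]
tokens$value <- ifelse(is.na(polarity), 0,
                       ifelse(polarity == "positive", 1, -1))

# Sum the word values per comment to get a sentiment score.
scores <- aggregate(value ~ commentID, data = tokens, FUN = sum)
names(scores)[2] <- "score"

# Classify by the sign of the score.
scores$sentiment <- ifelse(scores$score > 0, "positive",
                           ifelse(scores$score < 0, "negative", "neutral"))

scores
```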
Step 4: Extracting and Analyzing Features
Besides the variables already present in the data, a few more features were derived:
- aggregate sentiment scores for each article headline, snippet and comment body (negative, positive, neutral)
- total number of words in each comment
- total number of sentences in each comment
- average words per sentence in each comment
- temporal features: the time difference between when the comment was added and the article publishing date/time
- day of the week when the comment was added
- time of the day when the comment was added
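A minimal base R sketch of how features like these could be derived is shown below. The data and column names (commentBody, createDate, pubDate) are invented, and the sentence counter is a deliberately simple assumption: it counts runs of ".", "!" or "?", which will miscount abbreviations and similar edge cases.

```r
# Two made-up comments with their timestamps and the article publish time.
comments <- data.frame(
  commentBody = c("I agree. Well argued!", "Why was this published?"),
  createDate  = as.POSIXct(c("2017-01-03 14:30:00", "2017-01-04 09:00:00"), tz = "UTC"),
  pubDate     = as.POSIXct(c("2017-01-03 10:30:00", "2017-01-03 10:30:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Total words: split on whitespace and count the pieces.
comments$word_count <- lengths(strsplit(trimws(comments$commentBody), "\\s+"))

# Total sentences: count runs of terminal punctuation, with a floor of 1.
ends <- gregexpr("[.!?]+", comments$commentBody)
comments$sentence_count <- pmax(1, lengths(regmatches(comments$commentBody, ends)))

# Average words per sentence.
comments$words_per_sentence <- comments$word_count / comments$sentence_count

# Time gap between article publication and comment, in hours.
comments$gap_hours <- as.numeric(difftime(comments$createDate,
                                          comments$pubDate, units = "hours"))

# Day of week (1 = Monday ... 7 = Sunday) and hour of day of the comment.
comments$comment_wday <- as.integer(format(comments$createDate, "%u"))
comments$comment_hour <- as.integer(format(comments$createDate, "%H"))

comments[, -1]
```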
Converting recommendations to a binary variable:
Also, for building a predictive model using classification methodology (as discussed in my next post), the target variable “recommendation” (numeric) was converted to a binary variable with possible values of 0 or 1.
The popular vs. non-popular variable was derived from the five-number summary of the original (numeric) recommendation variable. Its overall median value was 4; therefore any comment with 3 or fewer upvotes was marked as non-popular, and any comment with 4 or more upvotes was marked as popular.
0 = Non-Popular, 1 = Popular
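As a small illustration of the conversion, the sketch below applies the cut-off to a made-up vector of upvote counts. The values are invented, and the cut-off of 4 is hard-coded from the median reported above rather than recomputed from the real data.

```r
# Made-up upvote counts for seven comments.
recommendations <- c(0, 1, 3, 4, 4, 9, 57)

# Five-number summary (min, lower hinge, median, upper hinge, max).
fivenum(recommendations)

# Cut-off taken from the median reported in the post.
cutoff <- 4

# 1 = popular (at or above the cut-off), 0 = non-popular.
popular <- as.integer(recommendations >= cutoff)

table(popular)
```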
The final data frame with 26 selected features
Step 5: Data Exploration and Visualization
When building any model, we need to understand the correlation between the predictors and the response variable. The visualizations below gave me better insight into my data and helped me study the relationships between different variables.
Frequency of articles based on sentiment: positive, neutral or negative
Do certain words in article headlines elicit more comments?
Comment popularity across the top six News Desks
Correlation between number of comments and article word counts
Correlation between comment popularity and article sentiments
Correlation between comments and temporal features
Most commonly used words in comments
Exploring correlation between text features and comment popularity
ANOVA tests were run to determine whether the relationship between the response variable and each of the numerical predictors was statistically significant. It was statistically significant for all three predictors, as seen below.
Number of words per sentence in comments vs. comment popularity
Number of sentences in comments vs. comment popularity
Article-comment time gap vs. comment popularity
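For readers who want to try this kind of test, here is a hedged sketch of a one-way ANOVA in base R. The data are simulated purely for illustration (they are not the NYT data and say nothing about the real results), and the setup, with the numeric feature as the outcome and popularity as the grouping factor, is an assumption about how such a test could be framed.

```r
set.seed(42)

# Simulated data: 50 non-popular and 50 popular comments, with a numeric
# feature whose mean differs between the two groups by construction.
toy <- data.frame(
  popular = factor(rep(c(0, 1), each = 50)),
  words_per_sentence = c(rnorm(50, mean = 14, sd = 3),
                         rnorm(50, mean = 20, sd = 3))
)

# One-way ANOVA: does the feature's mean differ across popularity groups?
fit <- aov(words_per_sentence ~ popular, data = toy)
summary(fit)

# Extract the p-value for the popularity term.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value
```

A small p-value only says the group means differ; it does not by itself say how strong or useful the relationship is for prediction.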
Data Exploration Summary
- The most popular News Desks for the NYT are OpEd, National, Foreign, Washington, Business and Editorial. Moderators can focus on these categories when moderating comments added by readers.
- In both 2017 and 2018, articles tended to have more negative sentiments than positive. This can be linked to the political situation prevalent in the United States and the world.
- For articles with the most comments as well as the most commonly used words in these comments, the top-25 terms were similar and include Trump, Russia, refugees, health and secrets.
- Based on comment popularity distribution, National articles were the most read and liked by readers. This can be attributed to the political changes happening in the U.S. during 2017–2018.
- Most comments were made during mornings and afternoons.
- Tuesdays, Thursdays and Fridays were the most active days for comments, while the least active days were weekends.
- There is a strong correlation between the popularity of a comment and a few derived features: the average number of words per sentence in a comment, the average number of sentences in a comment, and the average time gap between article and comment publishing. This implies that these features could be used to predict comment popularity.
Since there is no strong correlation between many of the predictors and the response variable, I chose a different approach to handle this problem, which I will discuss in my next post.
How do you feel about four upvotes being the cut-off for popularity? How do you think increasing or decreasing the threshold would affect the results? Let me know your thoughts in the comments section and keep watching this space for Part 2 🙂