An NLP Analysis of Gentrification Based on Reddit Data

Executive Summary

Natural Language Processing (NLP) techniques were critical in extracting insights from Reddit users’ opinions on various aspects of gentrification, in line with our predefined business objectives. At the beginning of this analysis, we observed a significant reduction in the number of submissions because many posts were deleted, removed, or empty, which shows the importance of data cleaning and processing. We then examined the text distribution to prepare for the subsequent NLP analysis, starting with the top words in submissions by frequency. The analysis revealed that stopwords, which occur frequently but do not contribute to meaning, dominated the results. To address this, we used Term Frequency-Inverse Document Frequency (TF-IDF) as a better measure. TF-IDF wordclouds displayed city names as prominent topics, along with location-related terms and inquiries about recommendations. Notably, words related to COVID-19 discourse appeared among the top words of some subreddits.

Sentiment analysis was used to determine the relationship between Reddit users’ sentiment about a city and certain topics related to gentrification, such as tourism, public transit, and rent. The exploration showed a surprisingly high share of negative posts compared to positive or neutral ones, ranging from 30% to 60% across subreddits. Posts about “tourists”, “public transit”, and “rent” drew a growing share of negative sentiment from 2021 to 2022. Furthermore, external data was used to investigate whether there is any correlation between median rent prices and sentiment on rent. Rent prices were visualized over time, revealing an overall increase. When the sentiment of posts containing the word “rent” was analyzed alongside rent prices per city, Atlanta stood out with a higher percentage of negative sentiment posts. Interestingly, New York, despite having the highest median rent prices, exhibited the lowest percentage of negative sentiment posts about “rent.” Looking at sentiment changes over time, Washington, DC, experienced a notable increase in negative posts about rent in 2023, indicating a shift in sentiment dynamics. Based on these analyses, the raw rent price does not appear to be strongly correlated with negative sentiment; rather, all cities’ subreddits leaned towards negative sentiment when discussing rent.

In summary, the analysis provides valuable insights into the trends, sentiments, and relationships within the Reddit data, laying the groundwork for further exploration and understanding of user opinions on the impacts of gentrification, especially rent.

Analysis Report

Data collection

The external data was collected as previously described in the EDA section, by pulling U.S. Census Bureau data on rent prices within these cities over the past few years.

The transformed data used to produce the following visualizations can be downloaded by clicking here.

The code to reproduce this analysis can be found by clicking the website’s show code tab before the visualizations, and detailed code and additional plots can be found in the repository folder “code.”

Exploration and analysis

Prior to beginning an NLP analysis, it is always good practice to familiarize yourself with the text data you are working with. Reddit submission data is inherently full of text, and therefore an initial analysis of the text was conducted. In this section, we will first look at how many submissions contain placeholder text (e.g., removed, deleted) and exclude them. We will also analyze the distribution of text length by subreddit and by year to see if there are any significant differences. Next, we will take a first glance at the top words used in the submissions. Finally, we will create a TF-IDF visual to display the most significant words using a text vectorizer.

When analyzing the Reddit submission text, it was apparent that there were a lot of submission texts that were deleted, removed, or were simply empty. Since these posts do not actually contain text, we will analyze how often they occurred and exclude them from the preliminary text analysis.

Code

import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import json
import seaborn as sns
import os
import plotly.graph_objects as go
import plotly.offline as pyo
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

Code

data_path = "../data/csv/"
img_path = "../data/plots/"

reddit_df = pd.read_csv(data_path + "reddit_df.csv")

# Label placeholder bodies by value rather than by row position
placeholder_labels = {
    '': 'Empty Submission',
    '[deleted]': 'Deleted Submission',
    '[removed]': 'Removed Submission',
}
# Note: pd.read_csv loads empty fields as NaN by default, so '' may match no rows here
is_placeholder = reddit_df['selftext'].isin(list(placeholder_labels))
sub_df = (
    reddit_df.loc[is_placeholder, 'selftext']
    .map(placeholder_labels)
    .value_counts()
    .rename_axis('Type')
    .reset_index(name='Count of Submissions')
)
sub_df

	Type	Count of Submissions
0	Removed Submission	67669
1	Deleted Submission	24059

Code

print(f'Total number of rows before removing deleted or removed text: {reddit_df.shape}')
new_df = reddit_df[~reddit_df['selftext'].isin(['[deleted]', '[removed]', ''])]
print(f'Total number of rows after removing deleted or removed text: {new_df.shape}')

Total number of rows before removing deleted or removed text: (217394, 68)
Total number of rows after removing deleted or removed text: (125666, 68)

We see a large reduction in submission rows because many posts were empty, deleted, or removed by moderators. The affected rows totaled nearly 92K (217,394 - 125,666 = 91,728). This provides insight into the trends of these subreddits because it shows that submissions being removed or deleted is not an uncommon occurrence. Additionally, Reddit appears to allow submissions with an empty body.

Furthermore, we will analyze the distribution of the submission string lengths. Subreddit post lengths can vary significantly depending on the author, topic, and subreddit. To get a better understanding of the text for the NLP analysis, we will look at the string lengths from different angles, such as the year and the subreddit. We create a column holding the length of the submission string, which is then used to show the varying distributions of the submission string lengths.
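As a minimal sketch of the length column described above, assuming a DataFrame with `selftext`, `subreddit`, and `year` columns (the toy rows and the column name `text_length` are illustrative and not taken from the project data):

```python
import pandas as pd

# Toy stand-in for the cleaned submissions; the real analysis would use new_df.
toy_df = pd.DataFrame({
    "subreddit": ["Atlanta", "Atlanta", "Seattle", "nyc"],
    "year": [2021, 2022, 2021, 2023],
    "selftext": [
        "Any good tacos nearby?",
        "Rent went up again this year.",
        "Bus delays",
        "Moving soon",
    ],
})

# Character length of each submission body.
toy_df["text_length"] = toy_df["selftext"].str.len()

# Summaries of the length distribution by subreddit and by year.
length_by_subreddit = toy_df.groupby("subreddit")["text_length"].describe()
length_by_year = toy_df.groupby("year")["text_length"].median()
print(length_by_subreddit)
print(length_by_year)
```

The same `text_length` column could then feed a histogram or density plot grouped by subreddit or year.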

The distribution plots above show that, across all years and all subreddits, the length of the submission text tends to stay below 1,000 characters, with the bulk of the distribution concentrated near 0. Across the years, the density at these lengths tends to be high, with the exception of 2023, since that year’s data was not complete. In all three plots, the Atlanta subreddit appeared to have a bimodal distribution, with one peak near zero and one near 1,000. This means that in the Atlanta subreddit, some submissions were short, but many were also long.

In the following section, we are going to look at the top words in the submissions by frequency. Similar to the analysis above, this can be analyzed by looking at the top words across the subreddit type. Looking at the top words is essential in the preliminary text analysis because it may provide some insight into the topics discussed in the subreddit and hint at the types of sentiment in the data.

Since this is a preliminary view of the text data, we will use a simple strategy of splitting the text on whitespace and new lines. After visualizing the top words in the subreddits, it became clear that the most frequent words are stopwords. Stopwords are words that occur frequently in text but do not contribute to its meaning. Examples include words like “And” and “The”. This result shows that to get meaningful insights from the text data, we will need to include stopword removal as part of the NLP pipeline conducted later on.
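The whitespace-splitting strategy described above can be sketched as follows. This is an illustration on two made-up sentences, not the project's actual tokenization code; lowercasing is an added assumption so that “The” and “the” are counted together.

```python
from collections import Counter

# Toy submissions; the real analysis would run over the selftext column.
texts = [
    "The rent in the city is too high",
    "Where is the best place to eat in the city",
]

# str.split() with no arguments splits on any run of whitespace,
# including spaces, tabs, and new lines.
tokens = [word.lower() for text in texts for word in text.split()]

# Most frequent tokens across all submissions.
top_words = Counter(tokens).most_common(3)
print(top_words)
```

Even in this tiny example the stopword “the” comes out on top, which mirrors the pattern seen in the subreddit data.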

A better way to measure the top words in a body of text is TF-IDF. Term Frequency-Inverse Document Frequency (TF-IDF) measures the importance of a word by weighing its frequency within a document against the number of documents that contain it. A word that is frequent in one document but rare across the rest of the corpus receives a high score, while a word that appears in nearly every document receives a low one. Since stopwords appeared frequently in the previous visualization, they will be excluded in this section and in the rest of the NLP analysis.
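To make the weighting concrete, here is a small pure-Python sketch of one common TF-IDF variant, tf × log(N / df). The exact formula used by the project's text vectorizer is not stated in this report, so the formula, the toy corpus, and the helper name `tf_idf` are all assumptions for illustration.

```python
import math
from collections import Counter

# Toy corpus: each string stands in for one document (stopwords already removed).
docs = [
    "atlanta traffic beltline traffic",
    "seattle rain ferry",
    "atlanta airport",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents containing each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Score each term as (term frequency) * log(N / document frequency)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

scores = tf_idf(tokenized[0])
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

In the first document, “traffic” scores highest because it is frequent there and absent elsewhere, while “atlanta” scores lowest because it also appears in another document.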

The wordclouds above display the top words in the subreddits according to TF-IDF. It’s no surprise that the top words appear to be the city names, which are the main topics of the subreddits. Words pertaining to geographical location such as “Streets” and “Places” tended to show up in the top words, which may be attributed to individuals inquiring about locations. Words like “Recommendation” were among the top words as well because people inquire about recommendations pertaining to restaurants or events. City-specific words, such as “DMV” (D.C., Maryland, Virginia) in the Washington, D.C. subreddit appeared as well. Another interesting finding is that words such as “19” and “vaccine” appeared in some of the subreddit top words, likely due to discourse surrounding the COVID-19 pandemic.

Furthermore, we predicted the sentiment of the posts in each subreddit. The next visualizations help us understand the overall sentiment of users in the selected subreddits.

Code

city_rent_sentiment_2 = pd.read_csv(data_path + "heatmap.csv")
city_rent_sentiment_2.drop(columns=["Unnamed: 0","sentiment"])

	subreddit	year	count	percentage
0	Atlanta	2021-01-01	89	45.408163
1	Atlanta	2022-01-01	68	39.306358
2	Atlanta	2023-01-01	26	47.272727
3	Seattle	2021-01-01	372	44.128114
4	Seattle	2022-01-01	479	45.359848
5	Seattle	2023-01-01	114	40.860215
6	nyc	2021-01-01	48	36.090226
7	nyc	2022-01-01	27	39.705882
8	nyc	2023-01-01	5	38.461538
9	washingtondc	2021-01-01	264	42.307692
10	washingtondc	2022-01-01	260	46.931408
11	washingtondc	2023-01-01	92	55.757576

The table shows, for each subreddit and year, the count and percentage of posts classified as negative. Additionally, to visualize the overall sentiment per city, the sentiment map below shows the density of these negative-post percentages by subreddit. A deeper color indicates a higher density of observations at that percentage of negative posts.
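A table of this shape can be derived from per-post sentiment labels roughly as follows. This is a hedged sketch: the toy rows, the `sentiment` label values, and the reading of `count` as the number of negative posts are assumptions, since the code that produced heatmap.csv is not shown here.

```python
import pandas as pd

# Toy per-post sentiment labels; column names and values are assumptions.
posts = pd.DataFrame({
    "subreddit": ["Atlanta", "Atlanta", "Atlanta", "nyc", "nyc"],
    "year": [2021, 2021, 2021, 2021, 2021],
    "sentiment": ["negative", "positive", "negative", "neutral", "negative"],
})

# Flag negative posts, then aggregate within each subreddit and year.
posts["is_negative"] = posts["sentiment"].eq("negative")
grouped = posts.groupby(["subreddit", "year"])["is_negative"]
summary = pd.DataFrame({
    "count": grouped.sum(),              # number of negative posts
    "percentage": 100 * grouped.mean(),  # share of negative posts
}).reset_index()
print(summary)
```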

Code

import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"

trace1 = go.Scatter(
    x=city_rent_sentiment_2['percentage'], y=city_rent_sentiment_2['subreddit'], mode='markers', name='points',
    marker=dict(color='rgb(102,0,0)', size=2, opacity=0.4)
)
trace2 = go.Histogram2dContour(
    x=city_rent_sentiment_2['percentage'], y=city_rent_sentiment_2['subreddit'], name='density', ncontours=30,
    colorscale='Hot', reversescale=True, showscale=False,
    hovertemplate='<br>perc. negative posts: %{x}<br>Subreddit: %{y}'
)

data = [trace1, trace2]

layout = go.Layout(
    showlegend=False,
    autosize=False,
    width=600,
    height=650,
    xaxis=dict(
        domain=[0, 1],
        showgrid=False,
        zeroline=False
    ),
    yaxis=dict(
        domain=[0, 1],
        showgrid=False,
        zeroline=False
    ),
    margin=dict(
        t=50
    ),
    hovermode='closest',
    bargap=0,
    xaxis2=dict(
        domain=[0.85, 1],
        showgrid=False,
        zeroline=False
    ),
    yaxis2=dict(
        domain=[0.85, 1],
        showgrid=False,
        zeroline=False
    )
)

fig = go.Figure(data=data, layout=layout)
font_dict=dict(family='Arial',
               color='black'
               )
fig.update_layout(font=font_dict)

fig.update_layout(yaxis_title="Subreddit",xaxis_title="Percentage of negative posts", title="Negative comments by subreddit") 

iplot(fig, filename='2dhistogram-2d-density-plot-subplots')

Normally, we don’t expect to see a high share of negative comments across social media, but the heat map shows that the percentage of negative posts across the subreddits varies between 30% and 60%. For New York, the posts concentrate between 30% and 40%. Seattle is similarly concentrated, although between roughly 40% and 45%, just short of a majority of posts being negative. Meanwhile, Atlanta spans a wider range, from 30% to 55%, similar to Washington, D.C., although its range runs from 40% to 60%.

Next, we will combine the external data previously explored with our NLP dataset to understand the data better. For example, we want to visualize the sentiment of posts containing the word “rent” next to rent prices per city. We started by visualizing the changes in rent prices over time in each of the cities included in this project. Rent has increased over the years in each of these cities, with New York experiencing the most dramatic rise from 2014 to 2015.

While these historical trends provide useful context on how these cities have changed over time, we wanted to take a deeper dive into the distribution of rent prices for one particular year.

[Plot 5] [Plot 6]

In the U.S. Census data, the most recent year with rent prices was 2021, so we used a histogram to visualize the distribution of rent amounts for each of the cities. To accompany each of these plots, we visualized the sentiment classifications of the posts from the respective cities that contain the word “rent”. To establish a baseline, the sentiment classifications of posts from each city that did not contain the word “rent” were also plotted. The city with the most drastic difference in sentiment between posts that contain the word “rent” and posts that do not is Atlanta, where posts containing the word “rent” have a negative sentiment more than 40% of the time, while all other posts have a negative sentiment only about 20% of the time. Interestingly, New York, which has the highest median rent prices and experienced the sharpest increase in rent, had the lowest percentage of negative sentiment posts associated with “rent”.
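The comparison between posts that mention “rent” and the baseline of posts that do not can be sketched as below. The toy posts are invented, and the word-boundary matching is an assumption: the report does not say whether the original filter matched whole words or substrings.

```python
import pandas as pd

# Toy posts with already-predicted sentiment labels (illustrative only).
posts = pd.DataFrame({
    "selftext": [
        "My rent went up 20% this year",
        "Looking to rent a place downtown",
        "Different parents, same park",
        "Great concert last night",
    ],
    "sentiment": ["negative", "neutral", "positive", "positive"],
})

# \b word boundaries keep "rent" from matching inside "Different" or "parents".
mentions_rent = posts["selftext"].str.contains(r"\brent\b", case=False, regex=True)
posts["group"] = mentions_rent.map({True: "rent", False: "other"})

# Share of negative posts with and without the keyword.
negative_share = posts.groupby("group")["sentiment"].apply(
    lambda labels: 100 * labels.eq("negative").mean()
)
print(negative_share)
```

A plain substring match would also pick up words such as “parents” and “currently”, which is why the sketch uses word boundaries.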

Thinking back to our business goal of “Determine if the sentiment of Reddit users’ opinions on a city is correlated with its rent prices”, we see that about 40% of posts about “rent” within each of these cities’ subreddits were associated with negative sentiment, regardless of the difference in rent prices between the cities. This seems to imply that the sentiment of Reddit users may not be correlated with the actual USD amount of rent, but rather with the fact that rent in the city has increased over the years.

We also wanted to examine whether this sentiment has changed over time, since we know that rent prices continue to increase year after year. Looking at how the percentage of negative posts about “rent” has changed over time, we see that Washington, DC has experienced the largest increase in negative posts about rent. In fact, in 2021 the majority of Reddit posts about “rent” within the “washingtondc” subreddit had positive sentiment, and that balance had flipped by 2023. For all other cities, the majority of posts about “rent” are still associated with positive sentiment.
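The change over time can be checked against the negative-post percentages in the heatmap.csv table shown earlier in this report, assuming that table describes the posts discussed here. The sketch below copies the 2021 and 2023 values from that table; years are written as integers for brevity, and the column name `change` is illustrative.

```python
import pandas as pd

# 2021 and 2023 percentages of negative posts, copied from the heatmap.csv table.
negative_pct = pd.DataFrame({
    "subreddit": ["Atlanta", "Atlanta", "Seattle", "Seattle",
                  "nyc", "nyc", "washingtondc", "washingtondc"],
    "year": [2021, 2023] * 4,
    "percentage": [45.408163, 47.272727, 44.128114, 40.860215,
                   36.090226, 38.461538, 42.307692, 55.757576],
})

# One row per subreddit, one column per year, then the change in percentage points.
wide = negative_pct.pivot(index="subreddit", columns="year", values="percentage")
wide["change"] = wide[2023] - wide[2021]
print(wide.sort_values("change", ascending=False))
```

Under these assumptions, washingtondc shows the largest increase and is the only subreddit whose 2023 value is above 50%.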