%pip install -U azureml-fsspec mltable
Analysis of Gentrification Based on Reddit Data
More than 135,000 people were displaced in the United States between 2000 and 2012 [1].
Gentrification is a demographic and economic shift that displaces established working-class communities and communities of color in favor of wealthier newcomers and companies. Displacement happens when long-time or original neighborhood residents move from a gentrified area because of higher rents, mortgages, and property taxes. It is a housing, economic, and health issue that affects a community’s history and culture.
Analyzing data related to gentrification, especially people’s opinions, is crucial to understanding its real impact on communities. It provides insight into how residents are affected by rising costs and changes in their neighborhoods, focusing on the emotional and social aspects of the problem. Finally, it can help identify solutions that ensure more equitable urban development.
Objective
The objective of this project is to analyze the opinions of Reddit users to understand how gentrification has impacted different cities in the United States. The analysis will focus on the four cities with the highest percentage of gentrified neighborhoods from 2000 to 2013, according to [1]:
- New York, New York
- Atlanta, Georgia
- Washington, D.C.
- Seattle, Washington
To achieve this objective, the following questions will be addressed using a variety of statistical analysis, natural language processing (NLP), and machine learning techniques:
- Is the sentiment of users’ opinions on a city related to its rent prices?
- Is the sentiment of users’ opinions on a city related to transportation and walkability?
- Does median income have an impact on gentrification sentiments?
- Do political beliefs have an impact on sentiments?
- What impact does gentrification have on the cost of living (restaurants, gas prices, grocery stores, etc.)?
- Does sentiment towards gentrification align with housing development?
- What kinds of changes cause shifts in sentiment towards gentrification?
- Does the number of responses within a subreddit correlate with the population of the city?
- Do more gentrified cities have a more positive or negative sentiment towards short-term rentals (e.g., Airbnb, Vrbo)?
- Does perceived gentrification align with actual gentrification?
- Do tourism trends align with gentrification trends?
- Does increased tourism increase revenue for the city?
Sources:
[1] Study by the National Community Reinvestment Coalition. March 2019. https://www.usnews.com/news/cities/slideshows/cities-with-the-highest-percentage-of-gentrified-neighborhoods
Data Description
The analysis is performed over two datasets: one contains posts from Reddit users, and the other is an external dataset containing U.S. Census Bureau data.
Reddit Data
The Reddit dataset contains posts and their metadata from January 2021 to March 2023. This analysis focuses on the following subreddits, one for each of the top cities mentioned above: ‘washingtondc’, ‘Seattle’, ‘Atlanta’, and ‘nyc’.
Census Data
A tell-tale sign of “gentrification” in an area is rising rent prices. In order to understand when gentrification began in these cities, we pulled U.S. Census Bureau data on rent prices within these cities over the past several years. Although rent prices naturally increase over time given inflation, gentrification can cause them to rise abnormally fast, typically leaving long-time residents no longer able to afford to live in these areas. Seeing when these rent spikes occurred may give us more insight as to where we’d expect Reddit posts to change in sentiment, or even in topic, due to the change of residents.
Business goals:
- Determine if the sentiment of Reddit users’ opinions on a city is correlated with its rent prices
- Investigate if the sentiment of Reddit users’ opinions on a city is influenced by factors such as transportation and walkability
- Assess the impact of median income on gentrification sentiments
- Explore whether political beliefs have an effect on sentiments towards gentrification
- Analyze how gentrification affects the cost of living, including factors like restaurants, gas prices, and grocery stores
- Investigate if sentiment towards gentrification aligns with housing development
- Identify the types of changes that lead to shifts in sentiment regarding gentrification
- Determine if the volume of responses within a subreddit correlates with the population of the city
- Investigate whether more gentrified cities exhibit a more positive or negative sentiment towards short-term rentals (e.g., Airbnb, Vrbo)
- Evaluate if perceived gentrification aligns with actual gentrification trends, and whether increased tourism is associated with gentrification and city revenue
Technical proposals:
- We will conduct sentiment analysis on Reddit posts related to the four cities using a transformer model from the Hugging Face library (a minimal sketch follows this list). We will then compare the sentiment scores with U.S. Census Bureau data on rent prices to identify correlations
- We will perform NLP techniques to extract information related to transportation and walkability from the Reddit posts. We can use techniques like Named Entity Recognition (NER) for identifying locations and distances. We will then use regression analysis or correlation techniques to determine if there is a relationship between this information and sentiment scores
- To investigate the impact of median income on gentrification sentiments, we will integrate U.S. Census Bureau data on median income with sentiment analysis results from the Reddit posts. This may involve techniques like linear regression or other regression models to identify any patterns or trends
- To explore the influence of political beliefs on sentiments towards gentrification, we will employ NLP techniques to identify political language within the Reddit posts. This could involve sentiment analysis specifically tailored to political sentiment. We will then use statistical tests or regression models to analyze sentiment scores based on political affiliations
- Regarding the impact of gentrification on the cost of living, we will utilize both U.S. Census Bureau data and sentiment analysis results. This may involve multiple regression analysis to assess correlations with factors like restaurants, gas prices, and grocery stores
- To examine the alignment between sentiment towards gentrification and housing development, we will apply NLP techniques to extract relevant information from the Reddit posts and compare it with housing development data. We may also use regression techniques to analyze the relationship between sentiment and housing development indicators
- To identify changes that affect sentiment regarding gentrification, we will conduct exploratory data analysis on the Reddit posts, looking for significant shifts in sentiment in response to specific events or developments. This can be done using time series analysis
- To investigate the correlation between subreddit responses and city population we will analyze the data for any patterns or relationships between the volume of responses and population size through correlation analysis or regression techniques
- For the question on sentiment towards short-term rentals in gentrified cities, we will perform sentiment analysis on relevant posts using a hugging face transformer. We will then use statistical tests or regression models to analyze the relationship between sentiment and short-term rental trends
- To evaluate the alignment between perceived and actual gentrification trends and their impact on city revenue, we will analyze sentiment data alongside data on gentrification trends and tourism trends using regression analysis or correlation techniques
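As a concrete starting point for the first proposal, below is a minimal sketch of the sentiment step. The model named here is simply the default for the Hugging Face sentiment pipeline and the example posts are invented; the final model choice (e.g., one fine-tuned on social media text) is still open.

from transformers import pipeline

# Illustrative placeholder: the pipeline's default sentiment model, not a final choice
sentiment = pipeline('sentiment-analysis',
                     model='distilbert-base-uncased-finetuned-sst-2-english')

# Invented example posts standing in for real Reddit submissions
posts = [
    "Rent on my block doubled in two years and every old shop is gone.",
    "The new transit line makes getting across town so much easier.",
]
for post, result in zip(posts, sentiment(posts)):
    # each result is a dict like {'label': 'NEGATIVE', 'score': 0.99}
    print(f"{result['label']:>8}  {result['score']:.3f}  {post}")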
Exploration of Reddit data
%pip install nltk
%pip install seaborn
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import nltk
import json
import requests
import os
from pathlib import Path
import re
from azureml.fsspec import AzureMachineLearningFileSystem
# Azure Machine Learning workspace details:
subscription = '58bb8a15-5d27-4d02-a5ca-772d24ae37a8'
resource_group = 'project-rg'
workspace = 'group-02-aml'
datastore_name = 'workspaceblobstore'
path_on_datastore = 'filtered-submissions-all2'
# long-form Datastore uri format:
uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}'
print(uri)
print(path_on_datastore)
azureml://subscriptions/58bb8a15-5d27-4d02-a5ca-772d24ae37a8/resourcegroups/project-rg/workspaces/group-02-aml/datastores/workspaceblobstore
filtered-submissions-all2
# create the filesystem
fs = AzureMachineLearningFileSystem(uri)
# append parquet files in folder to a list
dflist = []
for path in fs.glob(f'{path_on_datastore}/*.parquet'):
    with fs.open(path) as f:
        dflist.append(pd.read_parquet(f))
# concatenate data frames
reddit_df = pd.concat(dflist)
Original dataframe without modifications
print("Shape of the dataframe:",reddit_df.shape)
print("Columns on the dataframe:", reddit_df.columns)
reddit_df.head(2)
Shape of the dataframe: (217394, 68)
Columns on the dataframe: Index(['adserver_click_url', 'adserver_imp_pixel', 'archived', 'author',
'author_cakeday', 'author_flair_css_class', 'author_flair_text',
'author_id', 'brand_safe', 'contest_mode', 'created_utc',
'crosspost_parent', 'crosspost_parent_list', 'disable_comments',
'distinguished', 'domain', 'domain_override', 'edited', 'embed_type',
'embed_url', 'gilded', 'hidden', 'hide_score', 'href_url', 'id',
'imp_pixel', 'is_crosspostable', 'is_reddit_media_domain', 'is_self',
'is_video', 'link_flair_css_class', 'link_flair_text', 'locked',
'media', 'media_embed', 'mobile_ad_url', 'num_comments',
'num_crossposts', 'original_link', 'over_18', 'parent_whitelist_status',
'permalink', 'pinned', 'post_hint', 'preview', 'promoted',
'promoted_by', 'promoted_display_name', 'promoted_url', 'retrieved_on',
'score', 'secure_media', 'secure_media_embed', 'selftext', 'spoiler',
'stickied', 'subreddit', 'subreddit_id', 'suggested_sort',
'third_party_trackers', 'third_party_tracking',
'third_party_tracking_2', 'thumbnail', 'thumbnail_height',
'thumbnail_width', 'title', 'url', 'whitelist_status'],
dtype='object')
adserver_click_url | adserver_imp_pixel | archived | author | author_cakeday | author_flair_css_class | author_flair_text | author_id | brand_safe | contest_mode | ... | suggested_sort | third_party_trackers | third_party_tracking | third_party_tracking_2 | thumbnail | thumbnail_height | thumbnail_width | title | url | whitelist_status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | None | False | [deleted] | None | None | None | None | None | False | ... | None | None | None | None | default | NaN | NaN | Should I move to D.C. or commute from NoVa? | all_ads | |
1 | None | None | False | thewheisk | None | None | None | None | None | False | ... | None | None | None | None | default | NaN | NaN | ChatGPT - what should happen to a sitting memb... | https://www.reddit.com/r/Seattle/comments/119d... | all_ads |
2 rows × 68 columns
Subreddits collected:
# ensure that we have data from all of the subreddits we're looking for
reddit_df.subreddit.unique()
array(['washingtondc', 'Seattle', 'Atlanta', 'nyc'], dtype=object)
Conduct Basic Data Quality Checks
First, we wanted to explore what each of the columns from the Reddit submission data contained. We went through each of the columns to see what percentage of that column was populated, and dropped any column that was more than 90% null after confirming that these columns did not contain any information we wanted to use within our analysis.
# check for null values
cols_to_drop = []
for i in reddit_df.columns:
    print(i, '\t\t\t', reddit_df[i].isna().sum(), reddit_df[i].isna().sum()/len(reddit_df))
    if (reddit_df[i].isna().sum()/len(reddit_df)) > .90:
        cols_to_drop.append(i)
adserver_click_url 217394 1.0
adserver_imp_pixel 217394 1.0
archived 0 0.0
author 0 0.0
author_cakeday 216628 0.9964764436920982
author_flair_css_class 199912 0.9195837971609152
author_flair_text 196185 0.9024398097463592
author_id 217394 1.0
brand_safe 217394 1.0
contest_mode 0 0.0
created_utc 0 0.0
crosspost_parent 209795 0.9650450334415853
crosspost_parent_list 209795 0.9650450334415853
disable_comments 217394 1.0
distinguished 216232 0.9946548662796582
domain 1880 0.0086478927661297
domain_override 217394 1.0
edited 0 0.0
embed_type 217394 1.0
embed_url 217394 1.0
gilded 0 0.0
hidden 0 0.0
hide_score 0 0.0
href_url 217394 1.0
id 0 0.0
imp_pixel 217394 1.0
is_crosspostable 0 0.0
is_reddit_media_domain 0 0.0
is_self 0 0.0
is_video 0 0.0
link_flair_css_class 130259 0.5991839701187706
link_flair_text 123324 0.5672833656862655
locked 0 0.0
media 202858 0.9331352291231588
media_embed 0 0.0
mobile_ad_url 217394 1.0
num_comments 0 0.0
num_crossposts 0 0.0
original_link 217394 1.0
over_18 0 0.0
parent_whitelist_status 0 0.0
permalink 0 0.0
pinned 0 0.0
post_hint 171158 0.7873170372687379
preview 171158 0.7873170372687379
promoted 217394 1.0
promoted_by 217394 1.0
promoted_display_name 217394 1.0
promoted_url 217394 1.0
retrieved_on 47946 0.2205488651940716
score 0 0.0
secure_media 202858 0.9331352291231588
secure_media_embed 0 0.0
selftext 0 0.0
spoiler 0 0.0
stickied 0 0.0
subreddit 0 0.0
subreddit_id 0 0.0
suggested_sort 216985 0.9981186233290708
third_party_trackers 217394 1.0
third_party_tracking 217394 1.0
third_party_tracking_2 217394 1.0
thumbnail 0 0.0
thumbnail_height 118220 0.5438052568148155
thumbnail_width 118220 0.5438052568148155
title 0 0.0
url 1880 0.0086478927661297
whitelist_status 0 0.0
Basic Descriptive Statistics
There are a few columns containing numerical data that we are very interested in for our analysis - particularly num_comments and score. From these basic descriptive statistics, we see that the values in these columns have a huge range (from 0 to 5,503 for num_comments and 0 to 57,618 for score), potentially due to a large number of outliers. We wanted to take a look at the raw data for the posts with these outlier values in order to ensure that we were only using clean data for our analysis.
reddit_df = reddit_df.drop(cols_to_drop, axis='columns')
reddit_df.describe()
gilded | num_comments | num_crossposts | score | thumbnail_height | thumbnail_width | |
---|---|---|---|---|---|---|
count | 217394.000000 | 217394.000000 | 217394.000000 | 217394.000000 | 99174.000000 | 99174.000000 |
mean | 0.007015 | 20.465132 | 0.036689 | 61.546064 | 101.353792 | 134.781697 |
std | 0.094668 | 65.089161 | 0.360493 | 282.495090 | 33.220967 | 26.380477 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 78.000000 | 140.000000 |
50% | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 103.000000 | 140.000000 |
75% | 0.000000 | 14.000000 | 0.000000 | 17.000000 | 140.000000 | 140.000000 |
max | 8.000000 | 5503.000000 | 73.000000 | 57618.000000 | 140.000000 | 140.000000 |
reddit_df['selftext'].head()
0 [deleted]
1 [removed]
2 I have dental insurance but it's only with gre...
3 [deleted]
4 From NWAC:\n\nWe’ve received preliminary infor...
Name: selftext, dtype: object
# print submissions with more than 1000 comments (outliers)
print(list(reddit_df.loc[reddit_df.num_comments > 1000, :].selftext.head()))
print(list(reddit_df.loc[reddit_df.num_comments > 1000, :].title.head()))
['Let’s hear it', "I see the crazy house prices. I see them also selling for these prices. But to who? Who are these mystery people buying up these million dollar homes?\n\nI'm in my early thirties as is my partner and while we make good money (120k household) it seems like every year we get further away from home ownership.\n\nNone of our friends own a home, and I don't think any of us are particularly close to being able to afford to buy either.\n\nSo who are these mystery people? What do they do, where are they from?", 'I’m not interested in stalking anyone (exactly what a stalker would say), just curious. I know Dave Matthews is around Wallingford or something. And there’s Bill Gates if you consider him one. Anyone other celebrities who actually live here most of the time?', 'There is no reason to see a naturopath, even as a "supplement" for standard medical care.\n\nThese "naturopathic doctors" have the medical competency of the average Instagram wellness influencer and are not legitimate medical providers, even if the state has approved their licensing as such. Their entire profession is based on peddling unregulated supplements and other non-evidence-based quackery for profit. There are real harms associated with their practice, both because it encourages avoidance of actual evidence-based care and because many of their "treatments" can cause direct harm (e.g., I have seen several patients in this area with severe COVID, including one that died, that were prescribed or encouraged to obtain ivermectin, hydroxychloroquine, etc.)\n\nThe trend towards legitimization of "alternative medicine" and pseudo-science more broadly is certainly not a Seattle-specific phenomenon although seems to be much more prevalent in this area than anywhere that I have practiced.\n\nSee a real physician.\n\nsource: Seattle physician\n\n**EDIT:** There have been several anecdotes about experiences with alternative medicine providers posted in this thread, many of which describe the experience of being listened to in a way that actual medical professionals have not done. There have also been several comments that have pointed out that many people seek out alternative medicines in large part *because* of the many failings of our healthcare system (e.g., inaccessibility, excessive cost, etc.). The appeal of these clinics is almost entirely that they listen to their patients; listening is an essential component of providing patient care but being listened to by someone that is trained to prescribe black kohosh or an herbal tea is not medicine. These anecdotes are also not evidence and it is objective fact that the overwhelming majority of practices and treatments offered by these practitioners are not evidence-based. \n\n​', '']
['What was your worst meal of 2022 in DC and what restaurant will you not be going back to?', 'Serious question.... Can anyone in their twenties or thirties afford to buy property here?', 'What celebrities live in Seattle?', 'PSA: There is no reason to see a naturopath, ever.', 'r/Seattle grappling with big tech layoffs']
# print submissions with more than 1000 score (outliers)
print(list(reddit_df.loc[reddit_df.score > 1000, :].selftext.head()))
print(list(reddit_df.loc[reddit_df.score > 1000, :].title.head()))
['Please get it together.', '', '', 'I take metro almost every day. I like to trash WMATA as much as anyone for the horseshit they make me go through sometimes. But the line to Dulles is a massive convenience.\n\nThe silver flyer bus, although only a few minutes from Wiehle-Reston, complicated things enough for me to try and avoid the trip. It was also an additional $5 and you had to wait at the counter to pay. I also hate forking over $50+ each way for a ride share. \n\nLast week, I took a flight out of Dulles that was $250 cheaper than DCA. I came back on a Saturday, and had a $2 flat fare to Capitol South. Each way was an hour and 10 minutes.\n\nI saved a couple hundred bucks, timed my trip more efficiently, and sat/listened to a podcast watching the train scoot through NOVA.', '']
['PSA: if a stoplight is without power IT BECOMES A STOP SIGN.', 'In Washington heights they tour up the roads to do work and revealed the old cobblestone beneath (184 & Pinehurst)', "On the Link headed home from SeaTac and a Sheriff and a dog just boarded. They walked up and down the car and doggo sniffed everyone. I'm guessing it's a bomb dog? I highly doubt there looking for drugs. Never seen this before.", 'The Silver Line to Dulles is a bigger deal than I thought it would be', 'WA Senate passes bill to bar hiring discrimination for cannabis use']
# separate Posts that were removed or deleted
removed_filter = (reddit_df.selftext == '[removed]') | reddit_df.selftext.str.contains('Removed by reddit ')
deleted_filter = (reddit_df.selftext == '[deleted]')
num_removed = len(reddit_df[removed_filter])
num_deleted = len(reddit_df[deleted_filter])
print(f"Num removed: {num_removed}")
print(f"Num deleted: {num_deleted}")

# remove those rows - save in another df in case it's needed
removed_df = reddit_df[removed_filter | deleted_filter]
print(f"All rows: {len(reddit_df)}")
reddit_df = reddit_df[(~removed_filter) & (~deleted_filter)]
print(f"New num of rows: {len(reddit_df)}")
Num removed: 67672
Num deleted: 24059
All rows: 217394
New num of rows: 125663
Exploring Number of Posts per City
Given that gentrification is inherently about how a city changes over time, we were very interested in an initial exploration of what the data looks like over time. One of the questions we posed as part of our project plan was, “Does the amount of responses within a subreddit correlate with the population of the city?” One reason we asked this question is that Reddit as a platform is only accessible to those with internet access and technology. It would be interesting to see whether cities with a larger total population - but with a larger share of residents below the poverty line (and thus potentially with less access to the internet and technology) - produced fewer Reddit submissions than their size alone would suggest.
"""
Transforming temporal data
"""
# Convert 'created_utc' to datetime
reddit_df['created_utc'] = pd.to_datetime(reddit_df['created_utc'])

# Extract year from 'created_utc'
reddit_df['year'] = reddit_df['created_utc'].dt.year
reddit_df['year'] = reddit_df['year'].astype('str')

# Extract month from 'created_utc'
reddit_df['month'] = reddit_df['created_utc'].dt.month

# group by year, month, and subreddit, and count the number of posts
Posts_count = reddit_df.groupby(['year', 'month', 'subreddit']).size().reset_index(name='Num_Posts')
# group by year, month, and subreddit, and sum the number of comments
comments_sum_by_year = reddit_df.groupby(['year', 'month', 'subreddit'])['num_comments'].sum().reset_index(name='Total_Comments')
# combine the two on year, month, subreddit
combined_df = pd.merge(Posts_count, comments_sum_by_year, on=['year', 'month', 'subreddit'])
# show first rows of combined df
combined_df.head()
year | month | subreddit | Num_Posts | Total_Comments | |
---|---|---|---|---|---|
0 | 2021 | 1 | Atlanta | 677 | 13532 |
1 | 2021 | 1 | Seattle | 1502 | 33949 |
2 | 2021 | 1 | nyc | 1234 | 40512 |
3 | 2021 | 1 | washingtondc | 1572 | 37336 |
4 | 2021 | 2 | Atlanta | 690 | 11755 |
post_counts = reddit_df.groupby(['subreddit', 'year']).size().reset_index(name='count')
post_counts
subreddit | year | count | |
---|---|---|---|
0 | Atlanta | 2021 | 8828 |
1 | Atlanta | 2022 | 7912 |
2 | Atlanta | 2023 | 1866 |
3 | Seattle | 2021 | 21672 |
4 | Seattle | 2022 | 21170 |
5 | Seattle | 2023 | 5092 |
6 | nyc | 2021 | 14506 |
7 | nyc | 2022 | 13509 |
8 | nyc | 2023 | 3224 |
9 | washingtondc | 2021 | 12256 |
10 | washingtondc | 2022 | 12434 |
11 | washingtondc | 2023 | 3194 |
agg_df = reddit_df.groupby('year')['num_comments'].agg(['mean', 'median', 'min', 'max']).reset_index()
agg_df.columns = ['Year', 'Average Number of Comments', 'Median Number of Comments', 'Min Number of Comments', 'Max Number of Comments']
agg_df
Year | Average Number of Comments | Median Number of Comments | Min Number of Comments | Max Number of Comments | |
---|---|---|---|---|---|
0 | 2021 | 27.673833 | 8 | 0 | 3387 |
1 | 2022 | 35.007342 | 8 | 0 | 5503 |
2 | 2023 | 35.854516 | 9 | 0 | 1732 |
Visuals #1 & 2: Number of Posts per City
Findings: Our hypothesis stated above - that larger cities do not necessarily produce more Reddit posts - proved to be true, as can be seen in the visualization below. Although New York is the most populated of the cities in this analysis, Seattle, Washington has the largest number of posts. One can theorize that this is because Seattle has the largest number of tech-sector employees of all the cities analyzed (primarily due to Microsoft and Amazon being headquartered in that area). We also broke this down by year to see if there was any change over time.
subreddit_counts = reddit_df.groupby(['subreddit']).size().reset_index(name='count')

subreddit_counts

# Make barplot in Plotly
fig2 = px.bar(subreddit_counts, x="subreddit", y="count", color='subreddit', template='plotly_white',
              labels={"count": "Number of Posts", 'subreddit': 'City Subreddit'},
              title="Number of Reddit Posts per City in Entire Dataset")
# Update size
fig2.update_layout(height=500, width=800)
# Remove legend
fig2.update_traces(showlegend=False)
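To quantify this comparison, here is a minimal sketch correlating each subreddit’s post count with its city’s population. The population figures are rounded 2020 Census approximations supplied only for illustration, and with just four cities any correlation should be interpreted cautiously.

from scipy.stats import pearsonr

# Rounded 2020 Census populations (illustrative approximations, not from our datasets)
city_pop = {'nyc': 8_800_000, 'Seattle': 737_000, 'Atlanta': 499_000, 'washingtondc': 690_000}

merged = subreddit_counts.assign(population=subreddit_counts['subreddit'].map(city_pop))
r, p = pearsonr(merged['population'], merged['count'])
print(f"Pearson r = {r:.2f} (p = {p:.2f})")  # only four data points, so this is suggestive at best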
# Make barplot in Plotly
fig = px.bar(post_counts, x="subreddit", y="count", color="year", template='plotly_white',
             labels={"year": "Year", "count": "Number of Posts", 'subreddit': 'City Subreddit'},
             title="Number of Reddit Posts per City, per Year")
# Update size
fig.update_layout(height=500, width=800)
Visual #3 - Number of Posts by Month of Year
We wanted to see whether certain months of the year corresponded with higher Reddit submissions, especially since gentrified cities see such a high influx of tourism during vacation months. What we see in the visualization below is that post volume does appear somewhat cyclic, running higher during the summer and winter seasons. We will have to do further NLP analysis to see whether these higher volumes of posts are due to increased tourism during tourist season.
# The animation_frame will be based on 'year'
fig = px.scatter(
    combined_df,
    x='month',
    y='Num_Posts',
    size='Total_Comments',
    color='subreddit',
    hover_name='subreddit',
    animation_frame='year',
    size_max=40,
    range_y=[combined_df['Num_Posts'].min(), combined_df['Num_Posts'].max()+200],
    title='Monthly Subreddit Activity'
)

# Update the layout to include all months on the x-axis; the year 2023 only has three months of data
fig.update_xaxes(
    title='Month',
    tickmode='array',
    tickvals=[str(m).zfill(2) for m in range(1, 13)],
    ticktext=[str(m) for m in range(1, 13)]
)

fig.update_layout(
    height=550,  # set the height
    width=850,   # set the width
    title_text='Monthly Number of Posts by Subreddits from 2021-2023',
    title_x=0.5  # center the title
)

# Update the y-axis title
fig.update_yaxes(title='Number of Posts')
# Show the figure
fig.show()
Visual #4 - Number of Posts Over Time
We wanted to analyze the accumulation of posts over time for each of the cities to see if there were any noticeable trends over the past few years. From this visualization, we were not able to identify any sustained upward or downward trajectories. Rather, most cities fluctuate throughout the year (as we saw above) but stay within the same range of posts. Noticeably, there was a larger-than-normal volume of posts in the washingtondc subreddit in January 2021 - most likely due to people posting about the January 6th insurrection at the U.S. Capitol.
"white")
sns.set_style(= plt.subplots(figsize=(9,5.5))
fig, ax
'created_utc'] = pd.to_datetime(reddit_df.loc[:, 'created_utc'])
reddit_df.loc[:, 'created_date'] = pd.to_datetime(reddit_df.loc[:, 'created_utc'], format = '%Y-%m-%d')
reddit_df.loc[:, 'created_month'] = pd.to_datetime(pd.to_datetime(reddit_df.loc[:, 'created_date']).dt.year.astype(str)+'-'+pd.to_datetime(reddit_df.loc[:, 'created_date']).dt.month.astype(str)+'-01')
reddit_df.loc[:,
= reddit_df.groupby(['subreddit', 'created_month']).size().reset_index().rename({0:'count'}, axis='columns')
group_date
group_date.head()
= 'created_month', y = 'count', hue = 'subreddit', data = group_date)
sns.lineplot(x "Number of Posts per Month by City")
ax.set_title("Number of Posts")
ax.set_ylabel("Post Date")
ax.set_xlabel("upper left", bbox_to_anchor=(1, 1)) sns.move_legend(ax,
Summary Table #1: Exploration of comments
# Create table aggregating statistics by subreddit
agg_df = reddit_df.groupby('subreddit')['num_comments'].agg(['mean', 'median', 'min', 'max']).reset_index()

# Rename the columns
agg_df.columns = ['Subreddit', 'Average Number of Comments', 'Median Number of Comments',
                  'Min Number of Comments', 'Max Number of Comments']
agg_df
Subreddit | Average Number of Comments | Median Number of Comments | Min Number of Comments | Max Number of Comments | |
---|---|---|---|---|---|
0 | Atlanta | 20.689670 | 4 | 0 | 883 |
1 | Seattle | 32.923833 | 10 | 0 | 2761 |
2 | nyc | 39.478921 | 5 | 0 | 5503 |
3 | washingtondc | 28.479522 | 9 | 0 | 1991 |
Summary Table #2: Exploration of scores
# Create table aggregating statistics by subreddit
agg_df2 = reddit_df.groupby('subreddit')['score'].agg(['mean', 'median', 'min', 'max']).reset_index()

# Rename the columns
agg_df2.columns = ['Subreddit', 'Average Reddit Score', 'Median Reddit Score', 'Min Reddit Score', 'Max Reddit Score']
agg_df2
Subreddit | Average Reddit Score | Median Reddit Score | Min Reddit Score | Max Reddit Score | |
---|---|---|---|---|---|
0 | Atlanta | 46.051274 | 3 | 0 | 3134 |
1 | Seattle | 115.932491 | 9 | 0 | 57618 |
2 | nyc | 127.841000 | 9 | 0 | 8363 |
3 | washingtondc | 81.006778 | 10 | 0 | 3789 |
fig = px.histogram(agg_df2, x='Median Reddit Score', title='Median Reddit Score Histogram', template='plotly_white',
                   labels={'Median Reddit Score': 'Median Score'})
fig.show()
Feature Engineering
Feature 1: Categorization of Engagement:
Create low, medium, high engagement labels based on number of comments
Since the average number of comments is approximately 10-25 comments, we will create low, medium, and high labels as:
- Low: 0 <= num_comments < 20
- Medium: 20 <= num_comments < 100
- High: 100 <= num_comments
# Create labels
reddit_df['engagement_label'] = np.where((reddit_df['num_comments'] >= 0) & (reddit_df['num_comments'] < 20), 'low',
                                         np.where((reddit_df['num_comments'] >= 20) & (reddit_df['num_comments'] < 100), 'medium', 'high'))

# Example of engagement label
reddit_df[reddit_df['engagement_label']=='high'][['num_comments', 'engagement_label']].head(2)
num_comments | engagement_label | |
---|---|---|
4 | 143 | high |
12 | 158 | high |
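As a side note, the same bins can be expressed more idiomatically with pd.cut; this sketch is equivalent to the nested np.where above, using right-open bins [0, 20), [20, 100), [100, inf).

# Equivalent labeling with pd.cut (returns a Categorical column rather than plain strings)
reddit_df['engagement_label'] = pd.cut(reddit_df['num_comments'],
                                       bins=[0, 20, 100, float('inf')],
                                       labels=['low', 'medium', 'high'],
                                       right=False)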
Feature 2: Create low, medium, high labels based on Reddit Score
The average Reddit Score is around 84.
Low: 0 <= Reddit Score < 84
Medium: 84 <= Reddit Score < 200
High: 200 <= Reddit Score
# Create labels
reddit_df['score_label'] = np.where((reddit_df['score'] >= 0) & (reddit_df['score'] < 84), 'low',
                                    np.where((reddit_df['score'] >= 84) & (reddit_df['score'] < 200), 'medium', 'high'))
Feature 3: Number of words in post
The length of a post can give us meaningful information about its sentiment or value. We wanted a continuous variable so that we could preserve the wide range of post lengths.
reddit_df['num_words'] = reddit_df.apply(lambda row: len(row['selftext'].split() + row['title'].split()), axis='columns')
reddit_df['num_words'].describe()
count 125663.000000
mean 38.294892
std 84.134336
min 1.000000
25% 9.000000
50% 14.000000
75% 41.000000
max 5537.000000
Name: num_words, dtype: float64
fig, ax = plt.subplots(figsize=(9,5.5))

sns.histplot(x='num_words', data=reddit_df[reddit_df.num_words <= 41], bins=20)
ax.set_title("Number of Words per Submission (Includes Title)")
ax.set_ylabel("Frequency")
ax.set_xlabel("Number of Words")
Text(0.5, 0, 'Number of Words')
fig, ax = plt.subplots(figsize=(12, 5.5))

sns.boxplot(x='num_words', data=reddit_df[reddit_df.num_words > 41])
ax.set_title("[OUTLIERS] Number of Words per Submission (Includes Title)")
ax.set_ylabel("Frequency")
ax.set_xlabel("Number of Words")
Text(0.5, 0, 'Number of Words')
Summary Table #3 - Categorization of Engagement Level
engagement_counts = reddit_df.groupby(['subreddit', 'engagement_label']).size().reset_index(name='count')
engagement_counts
subreddit | engagement_label | count | |
---|---|---|---|
0 | Atlanta | high | 996 |
1 | Atlanta | low | 13998 |
2 | Atlanta | medium | 3612 |
3 | Seattle | high | 3588 |
4 | Seattle | low | 32592 |
5 | Seattle | medium | 11754 |
6 | nyc | high | 3386 |
7 | nyc | low | 21564 |
8 | nyc | medium | 6289 |
9 | washingtondc | high | 1873 |
10 | washingtondc | low | 19477 |
11 | washingtondc | medium | 6534 |
# Make barplot in Plotly
fig3 = px.bar(engagement_counts, x="subreddit", y="count", color="engagement_label", template='plotly_white',
              labels={"engagement_label": "Engagement Label", "count": "Number of Posts", 'subreddit': 'City Subreddit'},
              title="Engagement Level for Posts by Subreddit")
# Update size
fig3.update_layout(height=500, width=800)
Dummy Variables from Regex Searches
Since we are interested in exploring opinions on gentrification, we decided to look for key words within the posts that are related to that issue, such as airbnb, rent, gentrification, transit, and tourist.
reddit_df.loc[:, 'airbnb_yes'] = reddit_df.loc[:, 'selftext'].str.lower().str.contains('airbnb')
reddit_df.loc[:, 'rent_yes'] = reddit_df.loc[:, 'selftext'].str.lower().str.contains('rent')
reddit_df.loc[:, 'gentrification_yes'] = reddit_df.loc[:, 'selftext'].str.lower().str.contains('gentrification')
reddit_df.loc[:, 'transit_yes'] = reddit_df.loc[:, 'selftext'].str.lower().str.contains('transit')
reddit_df.loc[:, 'tourist_yes'] = reddit_df.loc[:, 'selftext'].str.lower().str.contains('tourist')
list(reddit_df[reddit_df.airbnb_yes].selftext.head())
["We are looking for a large house/ lodge/ cabins type situation to do a family gathering with 15-20 people, all adults or almost adults, within 3 hours of Seattle. Ideally there would be some outdoor and some indoor activities nearby as we hope to do this in April and the weather will be unpredictable. Suggestions for a cool place you've been? I know there's stuff on airbnb but i don't love their policies.",
"Fiance and I moved into an apartment in December. \nIt's been inconvenient -- broken heaters, shoddy fixes, and now we have to clean up and move around all of our apartment things due to pest fumigation. We both work from home and have two dogs. We asked for credit to book an airBnb or hotel while things get cleaned; they said no. Apparently, a neighboring unit had caused a pest infestation that we were not informed of before we had signed the lease.\n\nThey gave us incredibly short notice over the weekend and we have to clear out everything in specific areas. We're curious if we were supposed to be informed about a pest infestation.",
"Me & the gf have been camping at the Midsummer Renaissance Festival at Bonnie Lake for the last 7 years, but this year they sold out all weekends before we even learned they were even on sale (seems only FB users were informed, I'm a newsletter subscriber and haven't heard from them since last year's issues. Seems the challenges with scale continue). \n\nWe've always loved dressing up and hanging out with other renfaire folks, but in our instance it doesn't make sense to try for a day pass, there are no airbnbs or anything available, etc, so we've given up on that particular event. Big bummer but moving on. \n\nAnyone know of anything else in the area for nerds that like to dress up in a semi-immersive fantasy/historical type environment with other folks who like that?",
"When I was in college, I always thought making near 6 figures would put me on financial easy street. But after a messy divorce, and an agreement to move the kids to DC, I'm shocked at how I just barely make it every month. My rent is $2,000 a month for a two bedroom with no amenities. I'm currently sharing a house with my sister that she owns. This is about what she was making with the space on Airbnb. I do have $2,000 a month in legal fees that I hope will be paid off by March. \n\nBut once I pay off my legal fees I'm going to move out of my sisters. It's just hard living with her and sharing a kitchen. lol. She is actually very wonderful, but my kids and I just need our own space. So most likely I'll be looking at having to pay more once I move out. My kids go to school in Bethesda, MD so I might look at Rockville or Gaithersburg. \n\nLast month between my kids, and myself we spent $650 on groceries and eating out. So maybe that is an area I can work on. \n\nI honestly only spent $300 on Christmas presents. But then I had a fucking $1,500 expense to fix my car. \n\nI spent $145 on some entertainment stuff for the kids. This includes our netflix account. \n\nI spent $200 on clothing items, and some stuff for basketball that my daughter is involved in. \n\nI'm also switching my Social Work license from Michigan to DC and have to take a test. Overall, I've had to pay $700 to switch my license over which I wasn't expecting. \n\nI'm just dumfounded to be making so much money I feel, and yet still feel to be living on the edge month to month. I try not to get down on myself, and believe the system is just so fucked up. My other sister lives in Ann Arbor, MI and pays $1800 with house insurance towards a place she'll own. I'll never own a place in the DMV I feel. Hell, people didn't even want to rent to me because my credit score was 650 and I have a $1000,000 in debt. $80,000 of that is college debt. Some landlords kept telling me they'd only rent to working professionals, and I found out that was code for a young person whose parents are paying the bill. \n\nI guess I'm just ranting, but I just never thought I'd make this much money to then be near broke every month. I guess my Ex really got me in convincing me to move to DC. I just wanted the divorce to be over. People kept telling me I'd make more money out here, but I haven't found that the increase in pay cancels out the cost of living. My sister who is also a social worker in Michigan, makes just as much as me.",
"I found some places here beside downtown: [https://www.realtor.com/realestateandhomes-search/Seattle\\_WA/type-mfd-mobile-home/price-na-350000/commute-400-9th-Ave-N-\\_Seattle\\_WA,47.622875,-122.339419,30m,drive,traffic](https://www.realtor.com/realestateandhomes-search/Seattle_WA/type-mfd-mobile-home/price-na-350000/commute-400-9th-Ave-N-_Seattle_WA,47.622875,-122.339419,30m,drive,traffic)\n\nDo you think it'd be worthwhile to do this and just buy a place outright and rent it later as an AirBNB rather than pay rent? I'm thinking of claiming it as primary/principal residence."]
# reddit_df[reddit_df.loc[:,'airbnb_yes']].num_comments.plot(kind = 'box')
"white")
sns.set_style(= plt.subplots(figsize=(9,5.5))
fig, ax
= reddit_df[reddit_df.num_comments < 100],
sns.boxplot(data = 'airbnb_yes',
x = 'num_comments',
y = 'airbnb_yes')
hue
"Difference Between Comments on Posts Mentioning Airbnb vs. Not")
ax.set_title("Number of Comments")
ax.set_ylabel("Mention of Airbnb in Post") ax.set_xlabel(
Text(0.5, 0, 'Mention of Airbnb in Post')
# reddit_df[reddit_df.loc[:,'airbnb_yes']].num_comments.plot(kind = 'box')
"white")
sns.set_style(= plt.subplots(figsize=(9,5.5))
fig, ax
= 200
upperbound_outlier
= reddit_df[reddit_df.score < upperbound_outlier],
sns.boxplot(data = 'airbnb_yes',
x = 'score',
y = 'airbnb_yes')
hue
"Difference Between Post Score on Posts Mentioning Airbnb vs. Not")
ax.set_title("Post Score")
ax.set_ylabel("Mention of Airbnb in Post") ax.set_xlabel(
Text(0.5, 0, 'Mention of Airbnb in Post')
regex_words = ['airbnb', 'rent', 'gentrification', 'transit', 'tourist']
for word in regex_words:
    reddit_df.loc[:, word+'_yes'] = reddit_df.loc[:, 'selftext'].str.lower().str.contains(word)

regex_cols = [i+'_yes' for i in regex_words]

keyword_counts = reddit_df.groupby(regex_cols).size().reset_index()
keyword_counts = keyword_counts.rename({0: 'Posts w/ Keyword'}, axis='columns')
Summary Table #4 - Frequency of Keywords in Posts
def combine_keywords(row):
    keyword_str = []
    for i in ['airbnb_yes', 'rent_yes', 'gentrification_yes', 'transit_yes', 'tourist_yes']:
        if row[i]:
            keyword_str.append(i[:i.index("_")])
    if keyword_str == []:
        keyword_str = 'does not contain any keywords'
    else:
        keyword_str = ', '.join(keyword_str)
    return keyword_str
keyword_counts.loc[:, 'Keyword Included'] = keyword_counts[regex_cols].apply(lambda row: combine_keywords(row), axis='columns')
keyword_counts = keyword_counts[['Keyword Included', 'Posts w/ Keyword']]
keyword_counts = keyword_counts.sort_values(by='Posts w/ Keyword', ascending=False)
keyword_counts
Keyword Included | Posts w/ Keyword | |
---|---|---|
0 | does not contain any keywords | 119854 |
6 | rent | 4878 |
2 | transit | 375 |
1 | tourist | 233 |
8 | rent, transit | 132 |
12 | airbnb | 72 |
7 | rent, tourist | 46 |
14 | airbnb, rent | 36 |
4 | gentrification | 11 |
9 | rent, transit, tourist | 8 |
10 | rent, gentrification | 7 |
13 | airbnb, tourist | 4 |
5 | gentrification, tourist | 2 |
3 | transit, tourist | 2 |
11 | rent, gentrification, transit | 1 |
15 | airbnb, rent, transit, tourist | 1 |
16 | airbnb, rent, gentrification, tourist | 1 |
fig = px.bar(keyword_counts[keyword_counts != 'does not contain any keywords'],
             y='Keyword Included',
             x='Posts w/ Keyword',
             template='plotly_white',
             title='Frequency of Posts with Keyword',
             labels={'Posts w/ Keyword': 'Number of Posts'},
             text='Posts w/ Keyword',
             hover_name='Keyword Included',
             color_discrete_sequence=['blue'],
             range_x=[0, 5000]
)
# fig.update_xaxes(tickangle=90)
fig.show()
External Data to Join - Exploration of Census data
As mentioned above, we wanted to visualize rent over time, since gentrification changes rent prices.
Exploration of rent
# rent as percentage
data_path = '../data/csv/'
rent_perc_df = pd.read_csv(data_path + 'rent_as_percent_of_income.csv')
rent_perc_df.loc[:, 'Year'] = pd.to_datetime(rent_perc_df.loc[:, 'Year'])

rent_dollars_df = pd.read_csv(data_path + 'rent_in_dollars.csv')
rent_dollars_df.loc[:, 'Year'] = pd.to_datetime(rent_dollars_df.loc[:, 'Year'])
print(rent_perc_df.columns)
print(rent_dollars_df.columns)
Index(['Unnamed: 0', 'Geography', 'Geographic Area Name',
'Rent as Percent of Income', 'Year', 'County'],
dtype='object')
Index(['Unnamed: 0', 'Geography', 'Geographic Area Name', 'Rent in USD',
'Year', 'County'],
dtype='object')
Median gross rent as a percentage of household income
We looked at the rent distribution of each city in 2010 versus 2021 to see if there were dramatic differences. Surprisingly, there were not.
fig, ax = plt.subplots(2, 1, figsize=(9,11))
sns.set(rc={'figure.figsize': (11.7, 8.27)})

year_filter = (rent_perc_df.Year.dt.year == 2010)

sns.boxplot(x='County',
            y='Rent as Percent of Income',
            data=rent_perc_df[year_filter],
            ax=ax[0])
ax[0].set_title("Rent as Percentage of Income for 2010")

year_filter = (rent_perc_df.Year.dt.year == 2021)
sns.boxplot(x='County',
            y='Rent as Percent of Income',
            data=rent_perc_df[year_filter],
            ax=ax[1])
ax[1].set_title("Rent as Percentage of Income for 2021")
Text(0.5, 1.0, 'Rent as Percentage of Income for 2021')
Rent in Dollar Amounts
We can see that rent has clearly increased over the years for each of these cities, with New York experiencing the most dramatic rise from 2014 to 2015. We unfortunately don’t have Reddit data from that period, but this will still be useful information to take into consideration moving forward.
fig, ax = plt.subplots(figsize=(10, 7))
r = sns.lineplot(rent_dollars_df,
                 x='Year',
                 y='Rent in USD',
                 hue='County')

ax.set_title('U.S. Counties of Most Gentrified Cities - Rent Prices from 2010 - 2021', fontdict={'size': 15})

ylabels = ['${:,.0f}'.format(y) for y in r.get_yticks()]
r.set_yticklabels(ylabels)
/tmp/ipykernel_4268/1787382017.py:10: UserWarning:
FixedFormatter should only be used together with FixedLocator
[Text(0, 800.0, '$800'),
Text(0, 1000.0, '$1,000'),
Text(0, 1200.0, '$1,200'),
Text(0, 1400.0, '$1,400'),
Text(0, 1600.0, '$1,600'),
Text(0, 1800.0, '$1,800'),
Text(0, 2000.0, '$2,000'),
Text(0, 2200.0, '$2,200')]
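To make the notion of a rent “spike” concrete, here is a minimal sketch that flags abnormal year-over-year increases in county-level median rent, reusing rent_dollars_df from above. The 5% cutoff is an arbitrary illustrative threshold, not a formal definition of gentrification.

# Median rent per county per year (the raw file has one row per census geography)
county_rent = (rent_dollars_df
               .assign(year=rent_dollars_df['Year'].dt.year)
               .groupby(['County', 'year'])['Rent in USD']
               .median()
               .reset_index()
               .sort_values(['County', 'year']))

# Year-over-year percentage change in median rent within each county
county_rent['yoy_pct'] = county_rent.groupby('County')['Rent in USD'].pct_change() * 100

# Flag county-years where median rent grew faster than 5%
county_rent[county_rent['yoy_pct'] > 5]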