%pip install -U azureml-fsspec mltable
Analysis of Gentrification Based on Reddit Data
More than 135,000 people were displaced in the United States between 2000 and 2012 [1].
Gentrification is a demographic and economic shift that displaces established working-class communities and communities of color in favor of wealthier newcomers and companies. Displacement happens when long-time or original neighborhood residents move from a gentrified area because of higher rents, mortgages, and property taxes. It is a housing, economic, and health issue that affects a community’s history and culture.
Analyzing data related to gentrification, especially people’s opinions, is crucial to understanding its real impact on communities. It provides insight into how residents are affected by rising costs and changes in their neighborhoods, focusing on the emotional and social aspects of the problem. Finally, it can help identify solutions that ensure more equitable urban development.
Objective
The objective of this project is to analyze the opinions of Reddit users to understand how gentrification has impacted different cities in the United States. The analysis will focus on the four cities with the highest percentage of gentrified neighborhoods from 2000 to 2013, according to [1]:
- New York, New York
- Atlanta, Georgia
- Washington, D.C.
- Seattle, Washington
To achieve this objective, the following questions will be addressed using a variety of statistical analysis, natural language processing (NLP), and machine learning techniques:
- Is the sentiment of users’ opinions on a city related to its rent prices?
- Is the sentiment of users’ opinions on a city related to transportation and walkability?
- Does median income have an impact on gentrification sentiments?
- Do political beliefs have an impact on sentiments?
- What impact does gentrification have on the cost of living (restaurants, gas prices, grocery stores, etc.)?
- Does sentiment towards gentrification align with housing development?
- What kinds of changes cause shifts in sentiment towards gentrification?
- Does the number of responses within a subreddit correlate with the population of the city?
- Do more gentrified cities have a more positive or negative sentiment towards short-term rentals (e.g., Airbnb, Vrbo)?
- Does perceived gentrification align with actual gentrification?
- Do tourism trends align with gentrification trends?
- Does increased tourism increase revenue for the city?
Sources:
[1] Study by the National Community Reinvestment Coalition. March 2019. https://www.usnews.com/news/cities/slideshows/cities-with-the-highest-percentage-of-gentrified-neighborhoods
Data Description
The analysis is performed over two datasets: one contains posts from Reddit users, and the other is an external dataset containing U.S. Census Bureau data.
Reddit Data
The Reddit dataset contains posts and their metadata from January 2021 to March 2023. This analysis focuses on the following subreddits, one for each of the top cities mentioned above: ‘washingtondc’, ‘Seattle’, ‘Atlanta’, and ‘nyc’.
Census Data
A tell-tale sign of “gentrification” in an area is rising rent prices. In order to understand when gentrification began in these cities, we pulled U.S. Census Bureau data on rent prices within these cities over the past several years. Although rent prices naturally increase over time given inflation, gentrification can cause them to rise abnormally fast, typically leaving long-time residents no longer able to afford to live in these areas. Seeing when these rent spikes occurred may give us more insight as to where we’d expect Reddit posts to change in sentiment, or even in topic, due to the change of residents.
Business goals:
- Determine if the sentiment of Reddit users’ opinions on a city is correlated with its rent prices
- Investigate if the sentiment of Reddit users’ opinions on a city is influenced by factors such as transportation and walkability
- Assess the impact of median income on gentrification sentiments
- Explore whether political beliefs have an effect on sentiments towards gentrification
- Analyze how gentrification affects the cost of living, including factors like restaurants, gas prices, and grocery stores
- Investigate if sentiment towards gentrification aligns with housing development
- Identify the types of changes that lead to shifts in sentiment regarding gentrification
- Determine if the volume of responses within a subreddit correlates with the population of the city
- Investigate whether more gentrified cities exhibit a more positive or negative sentiment towards short-term rentals (e.g., Airbnb, Vrbo)
- Evaluate if perceived gentrification aligns with actual gentrification trends, and whether increased tourism is associated with gentrification and city revenue
Technical proposals:
- We will conduct sentiment analysis on Reddit posts related to the four cities using a transformer model from the Hugging Face library (a minimal sketch follows this list). We will then compare the sentiment scores with U.S. Census Bureau data on rent prices to identify correlations
- We will perform NLP techniques to extract information related to transportation and walkability from the Reddit posts. We can use techniques like Named Entity Recognition (NER) for identifying locations and distances. We will then use regression analysis or correlation techniques to determine if there is a relationship between this information and sentiment scores
- To investigate the impact of median income on gentrification sentiments, we will integrate U.S. Census Bureau data on median income with sentiment analysis results from the Reddit posts. This may involve techniques like linear regression or other regression models to identify any patterns or trends
- To explore the influence of political beliefs on sentiments towards gentrification, we will employ NLP techniques to identify political language within the Reddit posts. This could involve sentiment analysis specifically tailored to political sentiment. We will then use statistical tests or regression models to analyze sentiment scores based on political affiliations
- Regarding the impact of gentrification on the cost of living, we will utilize both U.S. Census Bureau data and sentiment analysis results. This may involve multiple regression analysis to assess correlations with factors like restaurants, gas prices, and grocery stores
- To examine the alignment between sentiment towards gentrification and housing development, we will apply NLP techniques to extract relevant information from the Reddit posts and compare it with housing development data. We may also use regression techniques to analyze the relationship between sentiment and housing development indicators
- To identify changes that affect sentiment regarding gentrification, we will conduct exploratory data analysis on the Reddit posts, looking for significant shifts in sentiment in response to specific events or developments. This can be done using time series analysis
- To investigate the correlation between subreddit responses and city population we will analyze the data for any patterns or relationships between the volume of responses and population size through correlation analysis or regression techniques
- For the question on sentiment towards short-term rentals in gentrified cities, we will perform sentiment analysis on relevant posts using a hugging face transformer. We will then use statistical tests or regression models to analyze the relationship between sentiment and short-term rental trends
- To evaluate the alignment between perceived and actual gentrification trends and their impact on city revenue, we will analyze sentiment data alongside data on gentrification trends and tourism trends using regression analysis or correlation techniques
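As a concrete starting point for the first proposal, below is a minimal sketch of the sentiment step. The model named here is simply the default for the Hugging Face sentiment pipeline and the example posts are invented; the final model choice (e.g., one fine-tuned on social media text) is still open.

from transformers import pipeline

# Illustrative placeholder: the pipeline's default sentiment model, not a final choice
sentiment = pipeline('sentiment-analysis',
                     model='distilbert-base-uncased-finetuned-sst-2-english')

# Invented example posts standing in for real Reddit submissions
posts = [
    "Rent on my block doubled in two years and every old shop is gone.",
    "The new transit line makes getting across town so much easier.",
]
for post, result in zip(posts, sentiment(posts)):
    # each result is a dict like {'label': 'NEGATIVE', 'score': 0.99}
    print(f"{result['label']:>8}  {result['score']:.3f}  {post}")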
Exploration of Reddit data
%pip install nltk
%pip install seaborn
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import nltk
import json
import requests
import os
from pathlib import Path
import re
from azureml.fsspec import AzureMachineLearningFileSystem
# Azure Machine Learning workspace details:
subscription = '58bb8a15-5d27-4d02-a5ca-772d24ae37a8'
resource_group = 'project-rg'
workspace = 'group-02-aml'
datastore_name = 'workspaceblobstore'
path_on_datastore = 'filtered-submissions-all2'
# long-form Datastore uri format:
uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}'
print(uri)
print(path_on_datastore)
azureml://subscriptions/58bb8a15-5d27-4d02-a5ca-772d24ae37a8/resourcegroups/project-rg/workspaces/group-02-aml/datastores/workspaceblobstore
filtered-submissions-all2
# create the filesystem
fs = AzureMachineLearningFileSystem(uri)
# append parquet files in folder to a list
dflist = []
for path in fs.glob(f'{path_on_datastore}/*.parquet'):
    with fs.open(path) as f:
        dflist.append(pd.read_parquet(f))
# concatenate data frames
reddit_df = pd.concat(dflist)
Original dataframe without modifications
print("Shape of the dataframe:",reddit_df.shape)
print("Columns on the dataframe:", reddit_df.columns)
reddit_df.head(2)
Shape of the dataframe: (217394, 68)
Columns on the dataframe: Index(['adserver_click_url', 'adserver_imp_pixel', 'archived', 'author',
'author_cakeday', 'author_flair_css_class', 'author_flair_text',
'author_id', 'brand_safe', 'contest_mode', 'created_utc',
'crosspost_parent', 'crosspost_parent_list', 'disable_comments',
'distinguished', 'domain', 'domain_override', 'edited', 'embed_type',
'embed_url', 'gilded', 'hidden', 'hide_score', 'href_url', 'id',
'imp_pixel', 'is_crosspostable', 'is_reddit_media_domain', 'is_self',
'is_video', 'link_flair_css_class', 'link_flair_text', 'locked',
'media', 'media_embed', 'mobile_ad_url', 'num_comments',
'num_crossposts', 'original_link', 'over_18', 'parent_whitelist_status',
'permalink', 'pinned', 'post_hint', 'preview', 'promoted',
'promoted_by', 'promoted_display_name', 'promoted_url', 'retrieved_on',
'score', 'secure_media', 'secure_media_embed', 'selftext', 'spoiler',
'stickied', 'subreddit', 'subreddit_id', 'suggested_sort',
'third_party_trackers', 'third_party_tracking',
'third_party_tracking_2', 'thumbnail', 'thumbnail_height',
'thumbnail_width', 'title', 'url', 'whitelist_status'],
dtype='object')
adserver_click_url | adserver_imp_pixel | archived | author | author_cakeday | author_flair_css_class | author_flair_text | author_id | brand_safe | contest_mode | ... | suggested_sort | third_party_trackers | third_party_tracking | third_party_tracking_2 | thumbnail | thumbnail_height | thumbnail_width | title | url | whitelist_status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | None | False | [deleted] | None | None | None | None | None | False | ... | None | None | None | None | default | NaN | NaN | Should I move to D.C. or commute from NoVa? | all_ads | |
1 | None | None | False | thewheisk | None | None | None | None | None | False | ... | None | None | None | None | default | NaN | NaN | ChatGPT - what should happen to a sitting memb... | https://www.reddit.com/r/Seattle/comments/119d... | all_ads |
2 rows × 68 columns
Subreddits collected:
# ensure that we have data from all of the subreddits we're looking for
reddit_df.subreddit.unique()
array(['washingtondc', 'Seattle', 'Atlanta', 'nyc'], dtype=object)
Conduct Basic Data Quality Checks
First, we wanted to explore what each of the columns from the Reddit submission data contained. We went through each of the columns to see what percentage of that column was populated, and dropped any column that was more than 90% null after confirming that these columns did not contain any information we wanted to use within our analysis.
# check for null values
cols_to_drop = []
for i in reddit_df.columns:
    print(i, '\t\t\t', reddit_df[i].isna().sum(), reddit_df[i].isna().sum()/len(reddit_df))
    if (reddit_df[i].isna().sum()/len(reddit_df)) > .90:
        cols_to_drop.append(i)
adserver_click_url 217394 1.0
adserver_imp_pixel 217394 1.0
archived 0 0.0
author 0 0.0
author_cakeday 216628 0.9964764436920982
author_flair_css_class 199912 0.9195837971609152
author_flair_text 196185 0.9024398097463592
author_id 217394 1.0
brand_safe 217394 1.0
contest_mode 0 0.0
created_utc 0 0.0
crosspost_parent 209795 0.9650450334415853
crosspost_parent_list 209795 0.9650450334415853
disable_comments 217394 1.0
distinguished 216232 0.9946548662796582
domain 1880 0.0086478927661297
domain_override 217394 1.0
edited 0 0.0
embed_type 217394 1.0
embed_url 217394 1.0
gilded 0 0.0
hidden 0 0.0
hide_score 0 0.0
href_url 217394 1.0
id 0 0.0
imp_pixel 217394 1.0
is_crosspostable 0 0.0
is_reddit_media_domain 0 0.0
is_self 0 0.0
is_video 0 0.0
link_flair_css_class 130259 0.5991839701187706
link_flair_text 123324 0.5672833656862655
locked 0 0.0
media 202858 0.9331352291231588
media_embed 0 0.0
mobile_ad_url 217394 1.0
num_comments 0 0.0
num_crossposts 0 0.0
original_link 217394 1.0
over_18 0 0.0
parent_whitelist_status 0 0.0
permalink 0 0.0
pinned 0 0.0
post_hint 171158 0.7873170372687379
preview 171158 0.7873170372687379
promoted 217394 1.0
promoted_by 217394 1.0
promoted_display_name 217394 1.0
promoted_url 217394 1.0
retrieved_on 47946 0.2205488651940716
score 0 0.0
secure_media 202858 0.9331352291231588
secure_media_embed 0 0.0
selftext 0 0.0
spoiler 0 0.0
stickied 0 0.0
subreddit 0 0.0
subreddit_id 0 0.0
suggested_sort 216985 0.9981186233290708
third_party_trackers 217394 1.0
third_party_tracking 217394 1.0
third_party_tracking_2 217394 1.0
thumbnail 0 0.0
thumbnail_height 118220 0.5438052568148155
thumbnail_width 118220 0.5438052568148155
title 0 0.0
url 1880 0.0086478927661297
whitelist_status 0 0.0
Basic Descriptive Statistics
There are a few columns containing numerical data that we are very interested in for our analysis - particularly num_comments and score. From these basic descriptive statistics, we see that the values in these columns have a huge range (from 0 to 5,503 for num_comments and 0 to 57,618 for score), potentially due to a large number of outliers. We wanted to take a look at the raw data for the posts with these outlier values in order to ensure that we were only using clean data for our analysis.
reddit_df = reddit_df.drop(cols_to_drop, axis='columns')
reddit_df.describe()
gilded | num_comments | num_crossposts | score | thumbnail_height | thumbnail_width | |
---|---|---|---|---|---|---|
count | 217394.000000 | 217394.000000 | 217394.000000 | 217394.000000 | 99174.000000 | 99174.000000 |
mean | 0.007015 | 20.465132 | 0.036689 | 61.546064 | 101.353792 | 134.781697 |
std | 0.094668 | 65.089161 | 0.360493 | 282.495090 | 33.220967 | 26.380477 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 78.000000 | 140.000000 |
50% | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 103.000000 | 140.000000 |
75% | 0.000000 | 14.000000 | 0.000000 | 17.000000 | 140.000000 | 140.000000 |
max | 8.000000 | 5503.000000 | 73.000000 | 57618.000000 | 140.000000 | 140.000000 |
reddit_df['selftext'].head()
0 [deleted]
1 [removed]
2 I have dental insurance but it's only with gre...
3 [deleted]
4 From NWAC:\n\nWe’ve received preliminary infor...
Name: selftext, dtype: object
# print submissions with more than 1000 comments (outliers)
print(list(reddit_df.loc[reddit_df.num_comments > 1000, :].selftext.head()))
print(list(reddit_df.loc[reddit_df.num_comments > 1000, :].title.head()))
['Let’s hear it', "I see the crazy house prices. I see them also selling for these prices. But to who? Who are these mystery people buying up these million dollar homes?\n\nI'm in my early thirties as is my partner and while we make good money (120k household) it seems like every year we get further away from home ownership.\n\nNone of our friends own a home, and I don't think any of us are particularly close to being able to afford to buy either.\n\nSo who are these mystery people? What do they do, where are they from?", 'I’m not interested in stalking anyone (exactly what a stalker would say), just curious. I know Dave Matthews is around Wallingford or something. And there’s Bill Gates if you consider him one. Anyone other celebrities who actually live here most of the time?', 'There is no reason to see a naturopath, even as a "supplement" for standard medical care.\n\nThese "naturopathic doctors" have the medical competency of the average Instagram wellness influencer and are not legitimate medical providers, even if the state has approved their licensing as such. Their entire profession is based on peddling unregulated supplements and other non-evidence-based quackery for profit. There are real harms associated with their practice, both because it encourages avoidance of actual evidence-based care and because many of their "treatments" can cause direct harm (e.g., I have seen several patients in this area with severe COVID, including one that died, that were prescribed or encouraged to obtain ivermectin, hydroxychloroquine, etc.)\n\nThe trend towards legitimization of "alternative medicine" and pseudo-science more broadly is certainly not a Seattle-specific phenomenon although seems to be much more prevalent in this area than anywhere that I have practiced.\n\nSee a real physician.\n\nsource: Seattle physician\n\n**EDIT:** There have been several anecdotes about experiences with alternative medicine providers posted in this thread, many of which describe the experience of being listened to in a way that actual medical professionals have not done. There have also been several comments that have pointed out that many people seek out alternative medicines in large part *because* of the many failings of our healthcare system (e.g., inaccessibility, excessive cost, etc.). The appeal of these clinics is almost entirely that they listen to their patients; listening is an essential component of providing patient care but being listened to by someone that is trained to prescribe black kohosh or an herbal tea is not medicine. These anecdotes are also not evidence and it is objective fact that the overwhelming majority of practices and treatments offered by these practitioners are not evidence-based. \n\n​', '']
['What was your worst meal of 2022 in DC and what restaurant will you not be going back to?', 'Serious question.... Can anyone in their twenties or thirties afford to buy property here?', 'What celebrities live in Seattle?', 'PSA: There is no reason to see a naturopath, ever.', 'r/Seattle grappling with big tech layoffs']
# print submissions with more than 1000 score (outliers)
print(list(reddit_df.loc[reddit_df.score > 1000, :].selftext.head()))
print(list(reddit_df.loc[reddit_df.score > 1000, :].title.head()))
['Please get it together.', '', '', 'I take metro almost every day. I like to trash WMATA as much as anyone for the horseshit they make me go through sometimes. But the line to Dulles is a massive convenience.\n\nThe silver flyer bus, although only a few minutes from Wiehle-Reston, complicated things enough for me to try and avoid the trip. It was also an additional $5 and you had to wait at the counter to pay. I also hate forking over $50+ each way for a ride share. \n\nLast week, I took a flight out of Dulles that was $250 cheaper than DCA. I came back on a Saturday, and had a $2 flat fare to Capitol South. Each way was an hour and 10 minutes.\n\nI saved a couple hundred bucks, timed my trip more efficiently, and sat/listened to a podcast watching the train scoot through NOVA.', '']
['PSA: if a stoplight is without power IT BECOMES A STOP SIGN.', 'In Washington heights they tour up the roads to do work and revealed the old cobblestone beneath (184 & Pinehurst)', "On the Link headed home from SeaTac and a Sheriff and a dog just boarded. They walked up and down the car and doggo sniffed everyone. I'm guessing it's a bomb dog? I highly doubt there looking for drugs. Never seen this before.", 'The Silver Line to Dulles is a bigger deal than I thought it would be', 'WA Senate passes bill to bar hiring discrimination for cannabis use']
# separate Posts that were removed or deleted
removed_filter = (reddit_df.selftext == '[removed]') | reddit_df.selftext.str.contains('Removed by reddit ')
deleted_filter = (reddit_df.selftext == '[deleted]')
num_removed = len(reddit_df[removed_filter])
num_deleted = len(reddit_df[deleted_filter])
print(f"Num removed: {num_removed}")
print(f"Num deleted: {num_deleted}")

# remove those rows - save in another df in case it's needed
removed_df = reddit_df[removed_filter | deleted_filter]
print(f"All rows: {len(reddit_df)}")
reddit_df = reddit_df[(~removed_filter) & (~deleted_filter)]
print(f"New num of rows: {len(reddit_df)}")
Num removed: 67672
Num deleted: 24059
All rows: 217394
New num of rows: 125663
Exploring Number of Posts per City
Given that gentrification is inherently about how a city changes over time, we were very interested in an initial exploration of what the data looks like over time. One of the questions we posed as part of our project plan was, “Does the amount of responses within a subreddit correlate with the population of the city?” One reason we asked this question is that Reddit as a platform is only accessible to those with internet access and technology. It would be interesting to see whether cities with a larger total population - but with a larger share of residents below the poverty line (and thus potentially with less access to the internet and technology) - produced fewer Reddit submissions than their size alone would suggest.
"""
Transforming temporal data
"""
# Convert 'created_utc' to datetime
reddit_df['created_utc'] = pd.to_datetime(reddit_df['created_utc'])

# Extract year from 'created_utc'
reddit_df['year'] = reddit_df['created_utc'].dt.year
reddit_df['year'] = reddit_df['year'].astype('str')

# Extract month from 'created_utc'
reddit_df['month'] = reddit_df['created_utc'].dt.month

# group by year, month, and subreddit, and count the number of posts
Posts_count = reddit_df.groupby(['year', 'month', 'subreddit']).size().reset_index(name='Num_Posts')
# group by year, month, and subreddit, and sum the number of comments
comments_sum_by_year = reddit_df.groupby(['year', 'month', 'subreddit'])['num_comments'].sum().reset_index(name='Total_Comments')
# combine the two on year, month, subreddit
combined_df = pd.merge(Posts_count, comments_sum_by_year, on=['year', 'month', 'subreddit'])
# show first rows of combined df
combined_df.head()
year | month | subreddit | Num_Posts | Total_Comments | |
---|---|---|---|---|---|
0 | 2021 | 1 | Atlanta | 677 | 13532 |
1 | 2021 | 1 | Seattle | 1502 | 33949 |
2 | 2021 | 1 | nyc | 1234 | 40512 |
3 | 2021 | 1 | washingtondc | 1572 | 37336 |
4 | 2021 | 2 | Atlanta | 690 | 11755 |
post_counts = reddit_df.groupby(['subreddit', 'year']).size().reset_index(name='count')
post_counts
subreddit | year | count | |
---|---|---|---|
0 | Atlanta | 2021 | 8828 |
1 | Atlanta | 2022 | 7912 |
2 | Atlanta | 2023 | 1866 |
3 | Seattle | 2021 | 21672 |
4 | Seattle | 2022 | 21170 |
5 | Seattle | 2023 | 5092 |
6 | nyc | 2021 | 14506 |
7 | nyc | 2022 | 13509 |
8 | nyc | 2023 | 3224 |
9 | washingtondc | 2021 | 12256 |
10 | washingtondc | 2022 | 12434 |
11 | washingtondc | 2023 | 3194 |
agg_df = reddit_df.groupby('year')['num_comments'].agg(['mean', 'median', 'min', 'max']).reset_index()
agg_df.columns = ['Year', 'Average Number of Comments', 'Median Number of Comments', 'Min Number of Comments', 'Max Number of Comments']
agg_df
Year | Average Number of Comments | Median Number of Comments | Min Number of Comments | Max Number of Comments | |
---|---|---|---|---|---|
0 | 2021 | 27.673833 | 8 | 0 | 3387 |
1 | 2022 | 35.007342 | 8 | 0 | 5503 |
2 | 2023 | 35.854516 | 9 | 0 | 1732 |
Visuals #1 & 2: Number of Posts per City
Findings: Our hypothesis stated above - that larger cities do not necessarily produce more Reddit posts - proved to be true, as can be seen in the visualization below. Although New York is the most populated of the cities in this analysis, Seattle, Washington has the largest number of posts. One can theorize that this is because Seattle has the largest number of tech-sector employees of all the cities analyzed (primarily due to Microsoft and Amazon being headquartered in that area). We also broke this down by year to see if there was any change over time.
subreddit_counts = reddit_df.groupby(['subreddit']).size().reset_index(name='count')

subreddit_counts

# Make barplot in Plotly
fig2 = px.bar(subreddit_counts, x="subreddit", y="count", color='subreddit', template='plotly_white',
              labels={"count": "Number of Posts", 'subreddit': 'City Subreddit'},
              title="Number of Reddit Posts per City in Entire Dataset")
# Update size
fig2.update_layout(height=500, width=800)
# Remove legend
fig2.update_traces(showlegend=False)
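To quantify this comparison, here is a minimal sketch correlating each subreddit’s post count with its city’s population. The population figures are rounded 2020 Census approximations supplied only for illustration, and with just four cities any correlation should be interpreted cautiously.

from scipy.stats import pearsonr

# Rounded 2020 Census populations (illustrative approximations, not from our datasets)
city_pop = {'nyc': 8_800_000, 'Seattle': 737_000, 'Atlanta': 499_000, 'washingtondc': 690_000}

merged = subreddit_counts.assign(population=subreddit_counts['subreddit'].map(city_pop))
r, p = pearsonr(merged['population'], merged['count'])
print(f"Pearson r = {r:.2f} (p = {p:.2f})")  # only four data points, so this is suggestive at best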
# Make barplot in Plotly
fig = px.bar(post_counts, x="subreddit", y="count", color="year", template='plotly_white',
             labels={"year": "Year", "count": "Number of Posts", 'subreddit': 'City Subreddit'},
             title="Number of Reddit Posts per City, per Year")
# Update size
fig.update_layout(height=500, width=800)
Visual #3 - Number of Posts by Month of Year
We wanted to see whether certain months of the year corresponded with higher Reddit submissions, especially since gentrified cities see such a high influx of tourism during vacation months. What we see in the visualization below is that post volume does appear somewhat cyclic, running higher during the summer and winter seasons. We will have to do further NLP analysis to see whether these higher volumes of posts are due to increased tourism during tourist season.
# The animation_frame will be based on 'year'
fig = px.scatter(
    combined_df,
    x='month',
    y='Num_Posts',
    size='Total_Comments',
    color='subreddit',
    hover_name='subreddit',
    animation_frame='year',
    size_max=40,
    range_y=[combined_df['Num_Posts'].min(), combined_df['Num_Posts'].max()+200],
    title='Monthly Subreddit Activity'
)

# Update the layout to include all months on the x-axis; the year 2023 only has three months of data
fig.update_xaxes(
    title='Month',
    tickmode='array',
    tickvals=[str(m).zfill(2) for m in range(1, 13)],
    ticktext=[str(m) for m in range(1, 13)]
)

fig.update_layout(
    height=550,  # set the height
    width=850,   # set the width
    title_text='Monthly Number of Posts by Subreddits from 2021-2023',
    title_x=0.5  # center the title
)

# Update the y-axis title
fig.update_yaxes(title='Number of Posts')
# Show the figure
fig.show()
Visual #4 - Number of Posts Over Time
We wanted to analyze the accumulation of posts over time for each of the cities to see if there were any noticeable trends over the past few years. From this visualization, we were not able to identify any sustained upward or downward trajectories. Rather, most cities fluctuate throughout the year (as we saw above) but stay within the same range of posts. Noticeably, there was a larger-than-normal volume of posts in the washingtondc subreddit in January 2021 - most likely due to people posting about the January 6th insurrection at the U.S. Capitol.
"white")
sns.set_style(= plt.subplots(figsize=(9,5.5))
fig, ax
'created_utc'] = pd.to_datetime(reddit_df.loc[:, 'created_utc'])
reddit_df.loc[:, 'created_date'] = pd.to_datetime(reddit_df.loc[:, 'created_utc'], format = '%Y-%m-%d')
reddit_df.loc[:, 'created_month'] = pd.to_datetime(pd.to_datetime(reddit_df.loc[:, 'created_date']).dt.year.astype(str)+'-'+pd.to_datetime(reddit_df.loc[:, 'created_date']).dt.month.astype(str)+'-01')
reddit_df.loc[:,
= reddit_df.groupby(['subreddit', 'created_month']).size().reset_index().rename({0:'count'}, axis='columns')
group_date
group_date.head()
= 'created_month', y = 'count', hue = 'subreddit', data = group_date)
sns.lineplot(x "Number of Posts per Month by City")
ax.set_title("Number of Posts")
ax.set_ylabel("Post Date")
ax.set_xlabel("upper left", bbox_to_anchor=(1, 1)) sns.move_legend(ax,
Summary Table #1: Exploration of comments
# Create table aggregating statistics by subreddit
agg_df = reddit_df.groupby('subreddit')['num_comments'].agg(['mean', 'median', 'min', 'max']).reset_index()

# Rename the columns
agg_df.columns = ['Subreddit', 'Average Number of Comments', 'Median Number of Comments',
                  'Min Number of Comments', 'Max Number of Comments']
agg_df
Subreddit | Average Number of Comments | Median Number of Comments | Min Number of Comments | Max Number of Comments | |
---|---|---|---|---|---|
0 | Atlanta | 20.689670 | 4 | 0 | 883 |
1 | Seattle | 32.923833 | 10 | 0 | 2761 |
2 | nyc | 39.478921 | 5 | 0 | 5503 |
3 | washingtondc | 28.479522 | 9 | 0 | 1991 |
Summary Table #2: Exploration of scores
# Create table aggregating statistics by subreddit
agg_df2 = reddit_df.groupby('subreddit')['score'].agg(['mean', 'median', 'min', 'max']).reset_index()

# Rename the columns
agg_df2.columns = ['Subreddit', 'Average Reddit Score', 'Median Reddit Score', 'Min Reddit Score', 'Max Reddit Score']
agg_df2
Subreddit | Average Reddit Score | Median Reddit Score | Min Reddit Score | Max Reddit Score | |
---|---|---|---|---|---|
0 | Atlanta | 46.051274 | 3 | 0 | 3134 |
1 | Seattle | 115.932491 | 9 | 0 | 57618 |
2 | nyc | 127.841000 | 9 | 0 | 8363 |
3 | washingtondc | 81.006778 | 10 | 0 | 3789 |
fig = px.histogram(agg_df2, x='Median Reddit Score', title='Median Reddit Score Histogram', template='plotly_white',
                   labels={'Median Reddit Score': 'Median Score'})
fig.show()
Feature Engineering
Feature 1: Categorization of Engagement:
Create low, medium, high engagement labels based on number of comments
Since the average number of comments is approximately 10-25 comments, we will create low, medium, and high labels as:
- Low: 0 <= num_comments < 20
- Medium: 20 <= num_comments < 100
- High: 100 <= num_comments
# Create labels
reddit_df['engagement_label'] = np.where((reddit_df['num_comments'] >= 0) & (reddit_df['num_comments'] < 20), 'low',
                                         np.where((reddit_df['num_comments'] >= 20) & (reddit_df['num_comments'] < 100), 'medium', 'high'))

# Example of engagement label
reddit_df[reddit_df['engagement_label']=='high'][['num_comments', 'engagement_label']].head(2)
num_comments | engagement_label | |
---|---|---|
4 | 143 | high |
12 | 158 | high |
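As a side note, the same bins can be expressed more idiomatically with pd.cut; this sketch is equivalent to the nested np.where above, using right-open bins [0, 20), [20, 100), [100, inf).

# Equivalent labeling with pd.cut (returns a Categorical column rather than plain strings)
reddit_df['engagement_label'] = pd.cut(reddit_df['num_comments'],
                                       bins=[0, 20, 100, float('inf')],
                                       labels=['low', 'medium', 'high'],
                                       right=False)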
Feature 2: Create low, medium, high labels based on Reddit Score
The average Reddit Score is around 84.
Low: 0 <= Reddit Score < 84
Medium: 84 <= Reddit Score < 200
High: 200 <= Reddit Score
# Create labels
reddit_df['score_label'] = np.where((reddit_df['score'] >= 0) & (reddit_df['score'] < 84), 'low',
                                    np.where((reddit_df['score'] >= 84) & (reddit_df['score'] < 200), 'medium', 'high'))
Feature 3: Number of words in post
The length of a post can give us meaningful information about its sentiment or value. We wanted a continuous variable so that we could preserve the wide range of post lengths.
reddit_df['num_words'] = reddit_df.apply(lambda row: len(row['selftext'].split() + row['title'].split()), axis='columns')
reddit_df['num_words'].describe()
count 125663.000000
mean 38.294892
std 84.134336
min 1.000000
25% 9.000000
50% 14.000000
75% 41.000000
max 5537.000000
Name: num_words, dtype: float64
fig, ax = plt.subplots(figsize=(9,5.5))

sns.histplot(x='num_words', data=reddit_df[reddit_df.num_words <= 41], bins=20)
ax.set_title("Number of Words per Submission (Includes Title)")
ax.set_ylabel("Frequency")
ax.set_xlabel("Number of Words")
Text(0.5, 0, 'Number of Words')
fig, ax = plt.subplots(figsize=(12, 5.5))

sns.boxplot(x='num_words', data=reddit_df[reddit_df.num_words > 41])
ax.set_title("[OUTLIERS] Number of Words per Submission (Includes Title)")
ax.set_ylabel("Frequency")
ax.set_xlabel("Number of Words")
Text(0.5, 0, 'Number of Words')
Summary Table #3 - Categorization of Engagement Level
engagement_counts = reddit_df.groupby(['subreddit', 'engagement_label']).size().reset_index(name='count')
engagement_counts
subreddit | engagement_label | count | |
---|---|---|---|
0 | Atlanta | high | 996 |
1 | Atlanta | low | 13998 |
2 | Atlanta | medium | 3612 |
3 | Seattle | high | 3588 |
4 | Seattle | low | 32592 |
5 | Seattle | medium | 11754 |
6 | nyc | high | 3386 |
7 | nyc | low | 21564 |
8 | nyc | medium | 6289 |
9 | washingtondc | high | 1873 |
10 | washingtondc | low | 19477 |
11 | washingtondc | medium | 6534 |
# Make barplot in Plotly
fig3 = px.bar(engagement_counts, x="subreddit", y="count", color="engagement_label", template='plotly_white',
              labels={"engagement_label": "Engagement Label", "count": "Number of Posts", 'subreddit': 'City Subreddit'},
              title="Engagement Level for Posts by Subreddit")
# Update size
fig3.update_layout(height=500, width=800)
Dummy Variables from Regex Searches
Since we are interested in exploring opinions on gentrification, we decided to look for key words within the posts that are related to that issue, such as airbnb, rent, gentrification, transit, and tourist.
reddit_df.loc[:, 'airbnb_yes'] = reddit_df.loc[:, 'selftext'].str.lower().str.contains('airbnb')
reddit_df.loc[:, 'rent_yes'] = reddit_df.loc[:, 'selftext'].str.lower().str.contains('rent')
reddit_df.loc[:, 'gentrification_yes'] = reddit_df.loc[:, 'selftext'].str.lower().str.contains('gentrification')
reddit_df.loc[:, 'transit_yes'] = reddit_df.loc[:, 'selftext'].str.lower().str.contains('transit')
reddit_df.loc[:, 'tourist_yes'] = reddit_df.loc[:, 'selftext'].str.lower().str.contains('tourist')
list(reddit_df[reddit_df.airbnb_yes].selftext.head())
["We are looking for a large house/ lodge/ cabins type situation to do a family gathering with 15-20 people, all adults or almost adults, within 3 hours of Seattle. Ideally there would be some outdoor and some indoor activities nearby as we hope to do this in April and the weather will be unpredictable. Suggestions for a cool place you've been? I know there's stuff on airbnb but i don't love their policies.",
"Fiance and I moved into an apartment in December. \nIt's been inconvenient -- broken heaters, shoddy fixes, and now we have to clean up and move around all of our apartment things due to pest fumigation. We both work from home and have two dogs. We asked for credit to book an airBnb or hotel while things get cleaned; they said no. Apparently, a neighboring unit had caused a pest infestation that we were not informed of before we had signed the lease.\n\nThey gave us incredibly short notice over the weekend and we have to clear out everything in specific areas. We're curious if we were supposed to be informed about a pest infestation.",
"Me & the gf have been camping at the Midsummer Renaissance Festival at Bonnie Lake for the last 7 years, but this year they sold out all weekends before we even learned they were even on sale (seems only FB users were informed, I'm a newsletter subscriber and haven't heard from them since last year's issues. Seems the challenges with scale continue). \n\nWe've always loved dressing up and hanging out with other renfaire folks, but in our instance it doesn't make sense to try for a day pass, there are no airbnbs or anything available, etc, so we've given up on that particular event. Big bummer but moving on. \n\nAnyone know of anything else in the area for nerds that like to dress up in a semi-immersive fantasy/historical type environment with other folks who like that?",
"When I was in college, I always thought making near 6 figures would put me on financial easy street. But after a messy divorce, and an agreement to move the kids to DC, I'm shocked at how I just barely make it every month. My rent is $2,000 a month for a two bedroom with no amenities. I'm currently sharing a house with my sister that she owns. This is about what she was making with the space on Airbnb. I do have $2,000 a month in legal fees that I hope will be paid off by March. \n\nBut once I pay off my legal fees I'm going to move out of my sisters. It's just hard living with her and sharing a kitchen. lol. She is actually very wonderful, but my kids and I just need our own space. So most likely I'll be looking at having to pay more once I move out. My kids go to school in Bethesda, MD so I might look at Rockville or Gaithersburg. \n\nLast month between my kids, and myself we spent $650 on groceries and eating out. So maybe that is an area I can work on. \n\nI honestly only spent $300 on Christmas presents. But then I had a fucking $1,500 expense to fix my car. \n\nI spent $145 on some entertainment stuff for the kids. This includes our netflix account. \n\nI spent $200 on clothing items, and some stuff for basketball that my daughter is involved in. \n\nI'm also switching my Social Work license from Michigan to DC and have to take a test. Overall, I've had to pay $700 to switch my license over which I wasn't expecting. \n\nI'm just dumfounded to be making so much money I feel, and yet still feel to be living on the edge month to month. I try not to get down on myself, and believe the system is just so fucked up. My other sister lives in Ann Arbor, MI and pays $1800 with house insurance towards a place she'll own. I'll never own a place in the DMV I feel. Hell, people didn't even want to rent to me because my credit score was 650 and I have a $1000,000 in debt. $80,000 of that is college debt. Some landlords kept telling me they'd only rent to working professionals, and I found out that was code for a young person whose parents are paying the bill. \n\nI guess I'm just ranting, but I just never thought I'd make this much money to then be near broke every month. I guess my Ex really got me in convincing me to move to DC. I just wanted the divorce to be over. People kept telling me I'd make more money out here, but I haven't found that the increase in pay cancels out the cost of living. My sister who is also a social worker in Michigan, makes just as much as me.",
"I found some places here beside downtown: [https://www.realtor.com/realestateandhomes-search/Seattle\\_WA/type-mfd-mobile-home/price-na-350000/commute-400-9th-Ave-N-\\_Seattle\\_WA,47.622875,-122.339419,30m,drive,traffic](https://www.realtor.com/realestateandhomes-search/Seattle_WA/type-mfd-mobile-home/price-na-350000/commute-400-9th-Ave-N-_Seattle_WA,47.622875,-122.339419,30m,drive,traffic)\n\nDo you think it'd be worthwhile to do this and just buy a place outright and rent it later as an AirBNB rather than pay rent? I'm thinking of claiming it as primary/principal residence."]
# reddit_df[reddit_df.loc[:,'airbnb_yes']].num_comments.plot(kind = 'box')
"white")
sns.set_style(= plt.subplots(figsize=(9,5.5))
fig, ax
= reddit_df[reddit_df.num_comments < 100],
sns.boxplot(data = 'airbnb_yes',
x = 'num_comments',
y = 'airbnb_yes')
hue
"Difference Between Comments on Posts Mentioning Airbnb vs. Not")
ax.set_title("Number of Comments")
ax.set_ylabel("Mention of Airbnb in Post") ax.set_xlabel(
Text(0.5, 0, 'Mention of Airbnb in Post')
# reddit_df[reddit_df.loc[:,'airbnb_yes']].num_comments.plot(kind = 'box')
"white")
sns.set_style(= plt.subplots(figsize=(9,5.5))
fig, ax
= 200
upperbound_outlier
= reddit_df[reddit_df.score < upperbound_outlier],
sns.boxplot(data = 'airbnb_yes',
x = 'score',
y = 'airbnb_yes')
hue
"Difference Between Post Score on Posts Mentioning Airbnb vs. Not")
ax.set_title("Post Score")
ax.set_ylabel("Mention of Airbnb in Post") ax.set_xlabel(
Text(0.5, 0, 'Mention of Airbnb in Post')
regex_words = ['airbnb', 'rent', 'gentrification', 'transit', 'tourist']
for word in regex_words:
    reddit_df.loc[:, word+'_yes'] = reddit_df.loc[:, 'selftext'].str.lower().str.contains(word)

regex_cols = [i+'_yes' for i in regex_words]

keyword_counts = reddit_df.groupby(regex_cols).size().reset_index()
keyword_counts = keyword_counts.rename({0: 'Posts w/ Keyword'}, axis='columns')
Summary Table #4 - Frequency of Keywords in Posts
def combine_keywords(row):
    keyword_str = []
    for i in ['airbnb_yes', 'rent_yes', 'gentrification_yes', 'transit_yes', 'tourist_yes']:
        if row[i]:
            keyword_str.append(i[:i.index("_")])
    if keyword_str == []:
        keyword_str = 'does not contain any keywords'
    else:
        keyword_str = ', '.join(keyword_str)
    return keyword_str
keyword_counts.loc[:, 'Keyword Included'] = keyword_counts[regex_cols].apply(lambda row: combine_keywords(row), axis='columns')
keyword_counts = keyword_counts[['Keyword Included', 'Posts w/ Keyword']]
keyword_counts = keyword_counts.sort_values(by='Posts w/ Keyword', ascending=False)
keyword_counts
Keyword Included | Posts w/ Keyword | |
---|---|---|
0 | does not contain any keywords | 119854 |
6 | rent | 4878 |
2 | transit | 375 |
1 | tourist | 233 |
8 | rent, transit | 132 |
12 | airbnb | 72 |
7 | rent, tourist | 46 |
14 | airbnb, rent | 36 |
4 | gentrification | 11 |
9 | rent, transit, tourist | 8 |
10 | rent, gentrification | 7 |
13 | airbnb, tourist | 4 |
5 | gentrification, tourist | 2 |
3 | transit, tourist | 2 |
11 | rent, gentrification, transit | 1 |
15 | airbnb, rent, transit, tourist | 1 |
16 | airbnb, rent, gentrification, tourist | 1 |
fig = px.bar(keyword_counts[keyword_counts != 'does not contain any keywords'],
             y='Keyword Included',
             x='Posts w/ Keyword',
             template='plotly_white',
             title='Frequency of Posts with Keyword',
             labels={'Posts w/ Keyword': 'Number of Posts'},
             text='Posts w/ Keyword',
             hover_name='Keyword Included',
             color_discrete_sequence=['blue'],
             range_x=[0, 5000]
)
# fig.update_xaxes(tickangle=90)
fig.show()
External Data to Join - Exploration of Census data
As mentioned above, we wanted to visualize rent over time, since gentrification changes rent prices.
Exploration of rent
# rent as percentage
data_path = '../data/csv/'
rent_perc_df = pd.read_csv(data_path + 'rent_as_percent_of_income.csv')
rent_perc_df.loc[:, 'Year'] = pd.to_datetime(rent_perc_df.loc[:, 'Year'])

rent_dollars_df = pd.read_csv(data_path + 'rent_in_dollars.csv')
rent_dollars_df.loc[:, 'Year'] = pd.to_datetime(rent_dollars_df.loc[:, 'Year'])
print(rent_perc_df.columns)
print(rent_dollars_df.columns)
Index(['Unnamed: 0', 'Geography', 'Geographic Area Name',
'Rent as Percent of Income', 'Year', 'County'],
dtype='object')
Index(['Unnamed: 0', 'Geography', 'Geographic Area Name', 'Rent in USD',
'Year', 'County'],
dtype='object')
Median gross rent as a percentage of household income
We looked at the rent distribution of each city in 2010 versus 2021 to see if there were dramatic differences. Surprisingly, there were not.
fig, ax = plt.subplots(2, 1, figsize=(9,11))
sns.set(rc={'figure.figsize': (11.7, 8.27)})

year_filter = (rent_perc_df.Year.dt.year == 2010)

sns.boxplot(x='County',
            y='Rent as Percent of Income',
            data=rent_perc_df[year_filter],
            ax=ax[0])
ax[0].set_title("Rent as Percentage of Income for 2010")

year_filter = (rent_perc_df.Year.dt.year == 2021)
sns.boxplot(x='County',
            y='Rent as Percent of Income',
            data=rent_perc_df[year_filter],
            ax=ax[1])
ax[1].set_title("Rent as Percentage of Income for 2021")
Text(0.5, 1.0, 'Rent as Percentage of Income for 2021')
Rent in Dollar Amounts
We can see that rent has clearly increased over the years for each of these cities, with New York experiencing the most dramatic rise from 2014 to 2015. We unfortunately don’t have Reddit data from that period, but this will still be useful information to take into consideration moving forward.
fig, ax = plt.subplots(figsize=(10, 7))
r = sns.lineplot(rent_dollars_df,
                 x='Year',
                 y='Rent in USD',
                 hue='County')

ax.set_title('U.S. Counties of Most Gentrified Cities - Rent Prices from 2010 - 2021', fontdict={'size': 15})

ylabels = ['${:,.0f}'.format(y) for y in r.get_yticks()]
r.set_yticklabels(ylabels)
/tmp/ipykernel_4268/1787382017.py:10: UserWarning:
FixedFormatter should only be used together with FixedLocator
[Text(0, 800.0, '$800'),
Text(0, 1000.0, '$1,000'),
Text(0, 1200.0, '$1,200'),
Text(0, 1400.0, '$1,400'),
Text(0, 1600.0, '$1,600'),
Text(0, 1800.0, '$1,800'),
Text(0, 2000.0, '$2,000'),
Text(0, 2200.0, '$2,200')]
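To make the notion of a rent “spike” concrete, here is a minimal sketch that flags abnormal year-over-year increases in county-level median rent, reusing rent_dollars_df from above. The 5% cutoff is an arbitrary illustrative threshold, not a formal definition of gentrification.

# Median rent per county per year (the raw file has one row per census geography)
county_rent = (rent_dollars_df
               .assign(year=rent_dollars_df['Year'].dt.year)
               .groupby(['County', 'year'])['Rent in USD']
               .median()
               .reset_index()
               .sort_values(['County', 'year']))

# Year-over-year percentage change in median rent within each county
county_rent['yoy_pct'] = county_rent.groupby('County')['Rent in USD'].pct_change() * 100

# Flag county-years where median rent grew faster than 5%
county_rent[county_rent['yoy_pct'] > 5]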