Machine Learning Analysis of Gentrification Based on Reddit Data

1. Executive Summary

Gentrification is a socioeconomic process that often has a negative effect on vulnerable communities and manifests as displacement, higher living costs, and physical changes in neighborhoods. Reddit, like other social media platforms, is a place where individuals can share their opinions, and city subreddits are a common venue for residents to vent their frustrations. The goal of this milestone is to gain insight into city subreddits and use machine learning to answer the following two Business Questions:

  1. Can we predict the sentiment of city subreddit posts using supervised machine learning models, and if so, what features drive sentiment classification?

  2. Can we predict which city subreddit a submission post belongs to using supervised machine learning models, and if so, what features drive city subreddit classification?

Using SparkML, our team was able to successfully answer these two questions. For Business Question 1, we trained both a Support Vector Machine (SVM) and a Random Forest model to identify the sentiment of the subreddit posts. Table 1.1 below summarizes the final results for both models.

Table 1.1 Model Results for Sentiment Classification

Model          Accuracy  Precision  Recall  F1
SVM            68%       62%        68%     57%
Random Forest  68%       65%        68%     58%

The closeness of these metrics shows that the two models performed comparably in identifying the sentiment of the subreddit submissions. For Business Question 2, we trained both a Decision Tree and a Random Forest model to identify the city subreddit of each post. Table 1.2 below summarizes the final results for both models.

Table 1.2 Model Results for Subreddit Classification

Model          Accuracy  Precision  Recall  F1
Decision Tree  43%       41%        43%     40%
Random Forest  40%       39%        40%     36%

The model results in Table 1.1 show that we can successfully identify sentiment using both a Random Forest model and an SVM model, with the Random Forest performing slightly better. We also found that features such as the presence of certain keywords and Reddit scoring had a positive effect on the ability to automatically identify sentiment. The model results in Table 1.2, on the other hand, show that neither the Decision Tree nor the Random Forest model was adept at predicting the subreddit. We found that more information was needed to properly identify the subreddits.

2. Analysis Report

2.1 Sentiment Analysis

2.1.1 Data Preparation

The first step in any machine learning process is data preparation. We began by reading in our Reddit dataset, which contains posts from the nyc, washingtondc, Seattle, and Atlanta subreddits. This dataset already contained a sentiment column created with a pre-trained model during the NLP milestone. Our technical goal was to perform binary classification on the Reddit submissions to identify whether the sentiment was positive or negative, so we removed rows that did not have one of these two sentiments. While we originally planned to use sentence embeddings and our full dataset, we had to exclude the embeddings from our analysis and sample 10% of the original data due to memory and Azure budget limitations.
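As a minimal sketch of this preparation step (the file path and DataFrame variable names are illustrative, not our exact code, but the sentiment column matches our dataset):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gentrification-ml").getOrCreate()

    # Hypothetical path; the real data comes from our earlier milestones.
    df = spark.read.parquet("reddit_submissions.parquet")

    # Keep only the two sentiments targeted by the binary classifier.
    binary_df = df.filter(df.sentiment.isin("positive", "negative"))

    # Downsample to 10% to fit memory and Azure budget constraints.
    sample_df = binary_df.sample(fraction=0.1, seed=42)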

2.1.2 Model Preparation

The next step in modeling involved setting up the SparkML pipeline. First, the sentiment label was transformed into binary values using a StringIndexer. The tabular features for the classification problem were then identified for modeling:

  1. num_words: Number of words in submission post
  2. num_comments: Number of comments on submission post
  3. score: Reddit score
  4. airbnb: Airbnb keyword boolean
  5. rent: Rent keyword boolean
  6. gentrification: Gentrification keyword boolean
  7. public_transit: Public transit keyword boolean

Next, the tabular features were assembled into vector format using the VectorAssembler. The dataset was then split into train and test sets using an 80-20 split. Two SparkML models, LinearSVC and RandomForestClassifier, were then initialized and ready for fitting on the training dataset, as sketched below.
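The sketch assumes the sampled DataFrame from Section 2.1.1; variable names are our own, and the model hyperparameters are those described in Section 2.1.3.

    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import LinearSVC, RandomForestClassifier

    # Map the positive/negative sentiment strings to binary 0/1 labels.
    indexer = StringIndexer(inputCol="sentiment", outputCol="label")
    indexed_df = indexer.fit(sample_df).transform(sample_df)

    # Assemble the tabular features into a single vector column.
    feature_cols = ["num_words", "num_comments", "score",
                    "airbnb", "rent", "gentrification", "public_transit"]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    prepared_df = assembler.transform(indexed_df)

    # 80-20 train/test split.
    train_df, test_df = prepared_df.randomSplit([0.8, 0.2], seed=42)

    # Initialize the two classifiers.
    svm = LinearSVC(maxIter=3, regParam=0.1, labelCol="label", featuresCol="features")
    rf = RandomForestClassifier(numTrees=50, labelCol="label", featuresCol="features")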

2.1.3 Model Execution and Evaluation

Once the data and models were prepared, the models were fit to the training set. The SVM model was trained with a maximum of 3 iterations and a regularization parameter of 0.1 to reduce overfitting. The Random Forest model was trained with 50 decision trees to reduce the variance in the results. Once the models were trained, we analyzed the feature importances and confusion matrices and evaluated the models using performance metrics. Feature importance is the influence each feature has on the model's predictive power. For SVM, it is the contribution each feature makes to the linear decision boundary. For Random Forest, it is a measure of each feature's impact on increasing the purity of the predicted classes within the decision trees. Image 2.1 below shows the feature importance for both models.

Image 2.1 Feature Importance Comparison Between SVM and Random Forest

The chart indicates a discrepancy in how the SVM and Random Forest models weigh the importance of different features when predicting sentiment. The Random Forest model places a considerable emphasis on public_transit, rent, and score, while the SVM model considers score and num_words to be the most influential features. This difference may stem from the inherent characteristics of how these models process features and make decisions. It also highlights that feature importance is model-dependent and can vary significantly between different types of models.
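A sketch of how these two notions of importance can be read off the fitted models, assuming the objects defined in the setup above:

    # Fit both models on the training split.
    svm_model = svm.fit(train_df)
    rf_model = rf.fit(train_df)

    # SVM: the magnitude of each coefficient reflects that feature's
    # contribution to the linear decision boundary.
    svm_importance = dict(zip(feature_cols, map(abs, svm_model.coefficients)))

    # Random Forest: featureImportances measures each feature's mean
    # contribution to impurity reduction across the 50 trees.
    rf_importance = dict(zip(feature_cols, rf_model.featureImportances.toArray()))

    print(sorted(svm_importance.items(), key=lambda kv: -kv[1]))
    print(sorted(rf_importance.items(), key=lambda kv: -kv[1]))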

The next step of model evaluation consisted of getting the predictions for the testing dataset and comparing them to the labels using a confusion matrix.

Image 2.2 SVM and Random Forest Confusion Test Matrices

The Random Forest model has a slightly higher number of true positives for Class 0 and a lower number of false negatives for Class 1, indicating marginally better performance in correctly classifying Class 0. The SVM model has fewer false positives for Class 1, indicating marginally better specificity for that class. Both models have a high true positive rate, but the Random Forest performs slightly better overall, with more correct predictions for Class 0 and fewer misclassifications for Class 1. Finally, the last step in model evaluation was calculating standard classification metrics (accuracy, recall, precision, and F1) on the testing data; these metrics measure the predictive power of the classifiers. Revisiting the table from the introduction, we can summarize the model performance below:

Table 1.1 Test Model Results for Sentiment Classification

Model          Accuracy  Precision  Recall  F1
SVM            68%       62%        68%     57%
Random Forest  68%       65%        68%     58%

The comparison of SVM and Random Forest models for sentiment prediction shows that both models exhibit similar accuracy at 68%, suggesting an equal proportion of correct predictions for sentiment classification. However, the Random Forest slightly surpasses the SVM in discriminative power, as indicated by a higher F1 score.
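For reference, a minimal sketch of how these test metrics can be computed with SparkML's built-in evaluator (shown for the Random Forest; the SVM is evaluated the same way):

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    predictions = rf_model.transform(test_df)

    # Confusion matrix: cross-tabulate true labels against predictions.
    predictions.groupBy("label").pivot("prediction").count().orderBy("label").show()

    # Accuracy, weighted precision/recall, and F1 on the test split.
    for metric in ["accuracy", "weightedPrecision", "weightedRecall", "f1"]:
        evaluator = MulticlassClassificationEvaluator(
            labelCol="label", predictionCol="prediction", metricName=metric)
        print(metric, round(evaluator.evaluate(predictions), 2))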

2.2 Subreddit Classification

2.2.1 Data Preparation

Similar to the data preparation for the sentiment analysis, we started with the raw data from each of our subreddits. We then removed rows whose posts contained no text, but kept posts marked as removed or deleted by Reddit, since they still contained feature data. We subset the data to the integer and float columns that were mostly populated. We then applied a universal sentence embedder to the post text but ran into memory constraints; to resolve them, we sampled 10% of the original data.
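A sketch of this preparation under the same assumptions as before (the text column name selftext is illustrative, and df is the raw DataFrame from Section 2.1.1):

    from pyspark.sql import functions as F

    # Drop rows with no post text; posts marked "[removed]" or "[deleted]"
    # by Reddit are kept, since their metadata features remain populated.
    text_df = df.filter(F.col("selftext").isNotNull() & (F.col("selftext") != ""))

    # Subset to the integer and float columns, plus the subreddit label.
    numeric_cols = [f.name for f in text_df.schema.fields
                    if f.dataType.typeName() in ("integer", "long", "float", "double")]
    subset_df = text_df.select("subreddit", *numeric_cols)

    # Downsample to 10% after hitting memory limits with the embeddings.
    subreddit_df = subset_df.sample(fraction=0.1, seed=42)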

2.2.2 Model Preparation

The next step in modeling involved setting up the SparkML pipeline. First, the subreddit labels were transformed into numerical values using a StringIndexer and then converted into one-hot encodings to remove any hierarchical ordering from the labels. The tabular features for the classification problem were then identified for modeling:

  1. num_words: Number of words in submission post
  2. num_comments: Number of comments on submission post
  3. score: Reddit score
  4. airbnb: Airbnb keyword boolean
  5. rent: Rent keyword boolean
  6. gentrification: Gentrification keyword boolean
  7. public_transit: Public transit keyword boolean

Next, the tabular features were assembled into vector format using the VectorAssembler. The dataset was then split into train and test sets using an 80-20 split. Two SparkML models, DecisionTreeClassifier and RandomForestClassifier, were then initialized and ready for fitting on the training dataset, as sketched below. We chose these two models to see whether we would gain predictive power from an ensemble method, since a Random Forest is an ensemble of decision trees.
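The sketch reuses the feature_cols list from Section 2.1.2; as before, variable names are illustrative.

    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

    # Index the four subreddit names, then one-hot encode the index so
    # no ordering is implied among the labels.
    label_indexer = StringIndexer(inputCol="subreddit", outputCol="label")
    indexed_df = label_indexer.fit(subreddit_df).transform(subreddit_df)
    encoder = OneHotEncoder(inputCols=["label"], outputCols=["label_vec"])
    encoded_df = encoder.fit(indexed_df).transform(indexed_df)

    # Assemble features and split 80-20 into train and test sets.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    prepared_df = assembler.transform(encoded_df)
    train_df, test_df = prepared_df.randomSplit([0.8, 0.2], seed=42)

    # A single tree versus an ensemble of 20 trees (see Section 2.2.3).
    dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
    rf20 = RandomForestClassifier(numTrees=20, labelCol="label", featuresCol="features")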

2.2.3 Model Execution and Evaluation

Once the data and models were prepared, the models were fit to the training set. The Decision Tree model used a single decision tree to make the classifications, while the Random Forest model was trained with 20 decision trees to reduce the variance in the results. Once the models were trained, we again analyzed the feature importances and confusion matrices and evaluated the models using performance metrics. Image 2.3 below shows the feature importance for both the Decision Tree and Random Forest models.

Image 2.3 Feature Importance Comparison Between Decision Tree and Random Forest

The above plot illustrates that num_comments is the most significant feature in predicting subreddit categories for both the Decision Tree and Random Forest models, followed by num_words, which also shows substantial importance. Other features such as score, rent, airbnb, gentrification, public_transit, and tourists have little to no impact on the models' predictions, with some being completely disregarded by both models. The consistency in the importance of num_comments and num_words across both models suggests that these factors are strong predictors for the subreddit classification task.
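A sketch of reading these importances off both fitted tree models, assuming the objects defined in the setup above:

    dt_model = dt.fit(train_df)
    rf20_model = rf20.fit(train_df)

    # Both tree-based models expose impurity-based featureImportances.
    for name, model in [("Decision Tree", dt_model), ("Random Forest", rf20_model)]:
        importances = dict(zip(feature_cols, model.featureImportances.toArray()))
        print(name, sorted(importances.items(), key=lambda kv: -kv[1]))

With the importances in hand, the next step of model evaluation consisted of getting the predictions for the testing dataset and comparing them to the labels using a confusion matrix.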

Image 2.4 Decision Tree and Random Forest Confusion Test Matrices

The Random Forest model improved the classification for classes 0 and 3 significantly, with a slight improvement for class 1. For class 2, however, the Random Forest model made fewer correct predictions, possibly reflecting a trade-off with class 3. Misclassifications between classes 2 and 3 remain high in both models, suggesting that these classes are more difficult to distinguish from each other with the given features. Overall, the Random Forest model seems to generalize better and may be more robust than the single Decision Tree, but there is still room for improvement, especially in distinguishing between classes 2 and 3. Finally, we calculated accuracy, recall, precision, and F1 on the testing data. Revisiting the table from the introduction, we can summarize the model performance below:

Table 1.2 Test Model Results for Subreddit Classification

Model          Accuracy  Precision  Recall  F1
Decision Tree  43%       41%        43%     40%
Random Forest  40%       39%        40%     36%

The Decision Tree model outperforms the Random Forest model across all reported metrics, albeit by a small margin. This suggests that for this particular task, with the given features and dataset, the simpler Decision Tree model is slightly better suited than the more complex Random Forest model. However, the overall performance of both models is quite similar and relatively low, indicating either a challenging classification problem or that there may be room for improvement in model selection, feature engineering, or parameter tuning.
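For completeness, a sketch of how the confusion matrices above can be tabulated; the four summary metrics were computed with the same MulticlassClassificationEvaluator loop shown in Section 2.1.3.

    # Confusion matrices: cross-tabulate true vs. predicted class indices.
    for name, model in [("Decision Tree", dt_model), ("Random Forest", rf20_model)]:
        preds = model.transform(test_df)
        print(name)
        preds.groupBy("label").pivot("prediction").count().orderBy("label").show()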

3. Conclusion

In conclusion, we saw that with the limited number of numerical features used, we were able to classify sentiment more accurately than subreddit labels. Given that the numerical features fed into the models largely capture Redditors' interactions with the posts (e.g., num_comments, score) or hot topics in these cities (e.g., our boolean variables for tourism and public transit), it is notable that this limited set of features was still able to classify sentiment with almost 70% accuracy. Connecting this back to our business goal, the gentrification-related topics we chose do not appear to have had a large impact on sentiment relative to the post metadata, as the feature importances for both sentiment models show. In addition, the features we used for predicting subreddit were not able to reliably separate the different classes. Because we could not use sentence embeddings to analyze the actual text of the posts or their titles, we could not directly address our business goal of analyzing whether topics in gentrified cities tend to become homogeneous over time. It is also possible that similarities in the submission posts across the four cities (e.g., post length, content) make it difficult for models with our limited features to distinguish between them. Overall, we showed that it is possible to predict sentiment and, to a more limited degree, subreddits using machine learning, thereby answering both business questions. Both classification problems could be further improved in the future by utilizing sentence context via embeddings, trying different hyperparameters, or using different models.