109 lines
4.6 KiB
Markdown
109 lines
4.6 KiB
Markdown
```toml
|
|
content_type = "project"
|
|
title = "What goes into a successful Reddit post?"
|
|
date = "2022 6 14"
|
|
```
|
|
|
|
In an attempt to find out what about a Reddit post makes it successful I will use some
|
|
classification models to try to determine which features have the highest influence on
|
|
making a correct prediction. In particular I use
|
|
[Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
|
|
and
|
|
[KNNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
|
|
classifiers. Then I'll score the results and see what the highest predictors are.
|
|
|
|
To find what goes into making a successful Reddit post we'll have to do a few things,
|
|
first of which is collecting data:
|
|
|
|
### Introducing Scrapey!
|
|
|
|
[Scrapey](https://old.doordesk.net/projects/reddit/scrapey.html) is my scraper script that takes a snapshot
|
|
of Reddit/r/all hot and saves the data to a .csv file including a calculated age for
|
|
each post about every 12 minutes. Run time is about 2 minutes per iteration and each
|
|
time adds about 100 unique posts to the list while updating any post it's already seen.
|
|
|
|
I run this in the background in a terminal and it updates my data set every ~12 minutes.
|
|
I have records of all posts within about 12 minutes of them disappearing from /r/all.
|
|
|
|
### EDA
|
|
|
|
[Next I take a quick look to see what looks useful](https://old.doordesk.net/projects/reddit/EDA.html), what
|
|
doesn't, and check for outliers that will throw off the model. There were a few outliers
|
|
to drop from the num_comments column.
|
|
|
|
Chosen Features:
|
|
|
|
* Title
|
|
* Subreddit
|
|
* Over_18
|
|
* Is_Original_Content
|
|
* Is_Self
|
|
* Spoiler
|
|
* Locked
|
|
* Stickied
|
|
* Num_Comments (Target)
|
|
|
|
Then I split the data I'm going to use into two dataframes (numeric and non) to prepare
|
|
for further processing.
|
|
|
|
### Clean
|
|
|
|
[Cleaning the data further](https://old.doordesk.net/projects/reddit/clean.html) consists of:
|
|
|
|
* Scaling numeric features between 0-1
|
|
* Converting '_' and '-' to whitespace
|
|
* Removing any non a-z or A-Z or whitespace
|
|
* Stripping any leftover whitespace
|
|
* Deleting any titles that were reduced to empty strings
|
|
|
|
### Model
|
|
|
|
If the number of comments of a post is greater than the median total number of comments
|
|
then it's assigned a 1, otherwise a 0. This is the target column. I then try some
|
|
lemmatizing, it doesn't seem to add much. After that I create and join some dummies,
|
|
then split and feed the new dataframe into
|
|
[Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
|
|
and [NNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
|
|
classifiers. Both actually scored the same with
|
|
[cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)
|
|
so I mainly used the forest.
|
|
|
|
[Notebook Here](https://old.doordesk.net/projects/reddit/model.html)
|
|
|
|
### Conclusion
|
|
|
|
Some Predictors from Top 25:
|
|
|
|
* Is_Self
|
|
* Subreddit_Memes
|
|
* OC
|
|
* Over_18
|
|
* Subreddit_Shitposting
|
|
* Is_Original_Content
|
|
* Subreddit_Superstonk
|
|
|
|
Popular words: 'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im',
|
|
'dont', and 'love'.
|
|
|
|
People on Reddit (at least in the past few days) like their memes, porn, and talking
|
|
about their day. And it's preferred if the content is original and self posted. So yes,
|
|
post your memes to memes and shitposting, tag them NSFW, use some words from the list,
|
|
and rake in all that sweet karma!
|
|
|
|
But it's not that simple, this is a fairly simple model, with simple data. To go beyond
|
|
this I think the comments would have to be analyzed.
|
|
[Lemmatisation](https://en.wikipedia.org/wiki/Lemmatisation) I thought would
|
|
be the most influential piece, and I still think that thinking is correct. But in this
|
|
case it doesn't apply because there is no real meaning to be had from reddit post
|
|
titles, at least to a computer. (or I did something wrong)
|
|
|
|
There's a lot more seen by a human than just the text in the title, there's often an
|
|
image attached, most posts reference a recent/current event, they could be an inside
|
|
joke of sorts. For some posts there could be emojis in the title, and depending on their
|
|
combination they can take on a meaning completely different from their individual
|
|
meanings. The next step from here I believe is to analyze the comments section of these
|
|
posts because in this moment I think that's the easiest way to truly describe the
|
|
meaning of a post to a computer. With what was gathered here I'm only to get 10% above
|
|
baseline and I think that's all there is to be had here, I mean we can tweak for a few
|
|
percent probably but I don't think there's much left on the table.
|
|
|