Predict the success of a reddit post based on the title alone
Find a file
2022-07-27 13:09:15 -04:00
data init 2022-07-27 13:09:15 -04:00
.gitignore init 2022-07-27 13:09:15 -04:00
clean.ipynb init 2022-07-27 13:09:15 -04:00
EDA.ipynb init 2022-07-27 13:09:15 -04:00
model.ipynb init 2022-07-27 13:09:15 -04:00
README.md init 2022-07-27 13:09:15 -04:00
scrapey.ipynb init 2022-07-27 13:09:15 -04:00

What Goes Into a Successful Reddit Post?

Here I'll pull and explore some data from Reddit

I'll attempt to find out what about a Reddit post makes it successful using classification. In particular a Random Forest model and KNNeighbors model.

Then I'll score the results and see what the highest predictors are

To find what goes into making a successful Reddit post we'll have to do a few things, first of which is collecting data:

Introducing Scrapey!

Scrapey takes a snapshot of Reddit/r/all hot, and saves the data to a .csv file including a calculated age for each post about every 12 minutes. Run time is about 2 minutes per iteration and each time adds about 100 unique posts to the list while updating any post it's already seen.

I run this in the background in a terminal and it updates my data set every ~12 minutes. I have records of all posts within about 12 minutes of them disappearing from /r/all.

EDA

Chosen Features:

  • Title
  • Subreddit
  • Over_18
  • Is_Original_Content
  • Is_Self
  • Spoiler
  • Locked
  • Stickied
  • Num_Comments (Target)

EDA

Clean

  • Scale numeric features between 0-1
  • Convert '_' and '-' to whitespace
  • Remove any not a-z or A-Z or whitespace
  • Strip any leftover whitespace
  • Delete any titles that were reduced to empty strings

Clean

Model

  • Split data to train/test samples
  • Test RandomForestClassifier, KNeighborsClassifier
  • My preferred is RandomForest

Model

Conclusion

Long story short, if you want to karma farm then post OC memes and porn.

Some Predictors from Top 25:

  • Is_Self
  • Subreddit_Memes
  • OC
  • Over_18
  • Subreddit_Shitposting
  • Is_Original_Content
  • Subreddit_Superstonk

Popular words:

'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im', 'dont', and 'love'.

People on Reddit (at least in the past few days) like their memes, porn, and talking about their day. And it's preferred if the content is original and self posted.

So yes, post your memes to memes and shitposting, tag them NSFW, use some words from the list, and rake in all that sweet karma!

But it's not that simple, this is a fairly simple model, with simple data. To go beyond this I think the comments would have to be analyzed. Tokenization I thought would be the most influential piece, and I still think that thinking is correct. But in this case it doesn't apply because there is no real meaning to be had from reddit post titles, at least to a computer. There's a lot more seen by a human than just the text in the title, there's often an image attached, most posts reference a recent/current event, they could be an inside joke of sorts. For some posts there could be emojis in the title, and depending on their combination they can take on a meaning completely different from their individual meanings.

The next step from here I believe is to analyze the comments section of these posts because in this moment I think that's the easiest way to truly describe the meaning of a post to a computer. With what was gathered here I'm only to get 10% above baseline and I think that's all there is to be had here, I mean we can tweak for a few percent probably but I don't think there's much left on the table.