<article>
<p>
To find out what makes a Reddit post successful, I use classification models to
determine which features have the most influence on a correct prediction. In
particular, I use
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">Random Forest</a>
and
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html">KNeighbors</a>
classifiers, then score the results and look at the strongest predictors.
</p>
<p>
Finding what goes into a successful Reddit post takes a few steps, the first of which
is collecting data:
</p>
<h3>Introducing Scrapey!</h3>
<p>
<a href="https://doordesk.net/projects/reddit/scrapey.html">Scrapey</a> is my scraper script. About every 12
minutes it takes a snapshot of the hot listing on Reddit's /r/all and saves the data,
including a calculated age for each post, to a .csv file. Each iteration takes about 2
minutes to run and adds about 100 unique posts to the list while updating any post it
has already seen.
</p>
<p>
I run it in the background in a terminal, where it updates my data set every ~12 minutes.
As a result, my last record of each post was taken within about 12 minutes of it
disappearing from /r/all.
</p>
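<p>
The "add new posts, update posts already seen" behavior can be sketched as a merge keyed
on post id. This is a minimal illustration, not the actual Scrapey code: the column names,
the <code>merge_snapshot</code> and <code>to_csv</code> helpers, and the sample rows are
all assumptions made for the example.
</p>

```python
import csv
import io

# Hypothetical column names; the real Scrapey output may differ.
FIELDS = ["id", "title", "subreddit", "num_comments", "created_utc", "age_seconds"]


def merge_snapshot(records, snapshot, now):
    """Add new posts to `records` and refresh posts already seen.

    `records` maps post id -> row dict, `snapshot` is a list of row dicts
    from one scrape, and `now` is the scrape time as a Unix timestamp.
    """
    for post in snapshot:
        row = dict(post)
        # Age is calculated at scrape time from the post's creation time.
        row["age_seconds"] = now - row["created_utc"]
        records[row["id"]] = row  # insert a new post or overwrite the old row
    return records


def to_csv(records):
    """Serialize all records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records.values())
    return buffer.getvalue()


# Two made-up snapshots taken 12 minutes (720 seconds) apart.
first = [{"id": "a1", "title": "Hello", "subreddit": "memes",
          "num_comments": 10, "created_utc": 1000}]
second = [{"id": "a1", "title": "Hello", "subreddit": "memes",
           "num_comments": 25, "created_utc": 1000},
          {"id": "b2", "title": "World", "subreddit": "pics",
           "num_comments": 3, "created_utc": 1500}]

records = merge_snapshot({}, first, now=1600)
records = merge_snapshot(records, second, now=1600 + 720)
csv_text = to_csv(records)
```

<p>
A real loop would fetch the listing from Reddit and sleep between iterations. Both are
left out here so the sketch runs without network access.
</p>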
<h3>EDA</h3>
<p>
<a href="https://doordesk.net/projects/reddit/EDA.html">Next I take a quick look to see what looks useful</a> and what
doesn't, and check for outliers that would throw off the model. A few outliers had to be
dropped from the num_comments column.
</p>
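<p>
The write-up doesn't say which rule was used to pick the outliers, so the following is
only one common option: dropping values more than 1.5 times the interquartile range
outside the quartiles. The <code>drop_outliers</code> helper and the sample comment
counts are made up for illustration.
</p>

```python
import statistics


def drop_outliers(values, k=1.5):
    """Keep values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Made-up num_comments values with one extreme post.
num_comments = [3, 5, 8, 12, 15, 20, 22, 30, 41, 9500]
kept = drop_outliers(num_comments)
```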
Chosen Features:
<ul>
<li>Title</li>
<li>Subreddit</li>
<li>Over_18</li>
<li>Is_Original_Content</li>
<li>Is_Self</li>
<li>Spoiler</li>
<li>Locked</li>
<li>Stickied</li>
<li>Num_Comments (Target)</li>
</ul>
<p>
Then I split the data I'm going to use into two dataframes (numeric and non-numeric) to
prepare for further processing.
</p>
<h3>Clean</h3>
<p><a href="https://doordesk.net/projects/reddit/clean.html">Cleaning the data further</a> consists of:</p>
<ul>
<li>Scaling numeric features to the range 0-1</li>
<li>Converting '_' and '-' to whitespace</li>
<li>Removing any character that is not a-z, A-Z, or whitespace</li>
<li>Stripping any leftover whitespace</li>
<li>Deleting any titles that were reduced to empty strings</li>
</ul>
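<p>
The cleaning steps above can be sketched in plain Python. This is an illustration under
assumptions, not the code from the linked notebook: the helper names and sample titles
are made up, and the scaling shown is ordinary min-max scaling.
</p>

```python
import re


def clean_title(title):
    """Apply the text-cleaning steps listed above to one title."""
    text = re.sub(r"[_-]", " ", title)       # '_' and '-' become whitespace
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # drop anything not a-z, A-Z, or whitespace
    return text.strip()                      # strip leftover whitespace


def min_max_scale(values):
    """Scale numbers to the range 0-1 (a constant column maps to all zeros)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


titles = ["I'm so-happy_today!!! 123", "???", "  New OC: my cat  "]
cleaned = [clean_title(t) for t in titles]
# Delete titles that were reduced to empty strings.
cleaned = [t for t in cleaned if t]
scaled = min_max_scale([0, 5, 10])
```

<p>
Because apostrophes are removed rather than replaced, "I'm" becomes "Im", which is why
words like 'im' and 'dont' can show up in a word list built from cleaned titles.
</p>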
<h3>Model</h3>
<p>
If a post's number of comments is greater than the median number of comments across all
posts, it's assigned a 1, otherwise a 0. This is the target column. I then try some
lemmatizing, but it doesn't seem to add much. After that I create and join some dummies,
then split the new dataframe and feed it into
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">Random Forest</a>
and
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html">KNeighbors</a>
classifiers. Both scored the same with
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html">cross validation</a>,
so I mainly used the forest.
</p>
<p><a href="https://doordesk.net/projects/reddit/model.html">Notebook Here</a></p>
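<p>
As a rough sketch of the modeling step, the example below builds a median-based target,
one-hot encodes a category, and scores both classifiers with scikit-learn's
<code>cross_validate</code>. The data is randomly generated and the feature layout,
hyperparameters, and fold count are assumptions, so the numbers it produces say nothing
about the real Reddit data.
</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Made-up stand-in data: 200 posts, 3 binary flags plus one-hot subreddit dummies.
n_posts = 200
flags = rng.integers(0, 2, size=(n_posts, 3))  # e.g. over_18, is_self, spoiler
subreddit = rng.integers(0, 4, size=n_posts)   # 4 hypothetical subreddits
dummies = np.eye(4, dtype=int)[subreddit]      # one-hot encode the subreddit
X = np.hstack([flags, dummies])

# Made-up comment counts, loosely tied to the first flag so there is some signal.
num_comments = rng.poisson(lam=20 + 30 * flags[:, 0])

# Target: 1 if a post has more comments than the median, otherwise 0.
y = (num_comments > np.median(num_comments)).astype(int)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "k_neighbors": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    result = cross_validate(model, X, y, cv=5, scoring="accuracy")
    scores[name] = result["test_score"].mean()

# Baseline: accuracy from always predicting the more common class.
baseline = max(y.mean(), 1 - y.mean())

# Feature importances come from a forest fit on all of the data.
forest = models["random_forest"].fit(X, y)
importances = forest.feature_importances_
```

<p>
Comparing each entry in <code>scores</code> against <code>baseline</code> shows how much
the model adds over always guessing the majority class, and sorting
<code>importances</code> is one way to rank predictors.
</p>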
<h3>Conclusion</h3>
<p>Some Predictors from Top 25:</p>
<ul>
<li>Is_Self</li>
<li>Subreddit_Memes</li>
<li>OC</li>
<li>Over_18</li>
<li>Subreddit_Shitposting</li>
<li>Is_Original_Content</li>
<li>Subreddit_Superstonk</li>
</ul>
<p>
Popular words: 'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im',
'dont', and 'love'.
</p>
<p>
People on Reddit (at least in the past few days) like their memes, porn, and talking
about their day. They also prefer content that is original and self-posted. So yes,
post your memes to memes and shitposting, tag them NSFW, use some words from the list,
and rake in all that sweet karma!
</p>
<p>
But it's not that simple. This is a fairly simple model with simple data. To go beyond
this, I think the comments would have to be analyzed. I thought
<a href="https://en.wikipedia.org/wiki/Lemmatisation">lemmatisation</a> would be the
most influential piece, and I still think that reasoning is sound. But in this case it
doesn't apply, because there is no real meaning to be had from Reddit post titles, at
least to a computer (or I did something wrong).
</p>
<p>
A human sees a lot more than just the text in the title. There's often an image
attached, most posts reference a recent or current event, and some are an inside joke
of sorts. Some titles contain emojis, and depending on their combination they can take
on a meaning completely different from their individual meanings. I believe the next
step is to analyze the comments section of these posts, because right now I think that's
the easiest way to truly describe the meaning of a post to a computer. With what was
gathered here I'm only able to get 10% above baseline, and I think that's all there is
to be had. We could probably tweak for a few more percent, but I don't think there's
much left on the table.
</p>
</article>