<article>
<p className="align-right date">Jun 14, 2022</p>
<h2 className="title">What Goes Into a Successful Reddit Post?</h2>
<p>
In an attempt to find out what makes a Reddit post successful, I will use some
classification models to try to determine which features have the highest influence on
making a correct prediction. In particular I use
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">Random Forest</a>
and
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html">KNeighbors</a>
classifiers. Then I'll score the results and see which features are the strongest predictors.
</p>
<p>
To find what goes into making a successful Reddit post, we'll have to do a few things,
the first of which is collecting data:
</p>
<h3>Introducing Scrapey!</h3>
<p>
<a href="https://doordesk.net/projects/reddit/scrapey.html">Scrapey</a> is my scraper script. About every 12 minutes it
takes a snapshot of Reddit's /r/all (hot) and saves the data to a .csv file, including a
calculated age for each post. Each iteration takes about 2 minutes to run, adds roughly
100 unique posts to the list, and updates any post it has already seen.
</p>
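<p>
For the curious, here's a stripped-down sketch of how a scraper like this can work. This
isn't Scrapey itself: the PRAW credentials, the hot_posts.csv file name, and the loop
timing below are placeholders.
</p>
<pre><code className="language-python">
# Rough sketch of a Scrapey-style scraper, not the real script.
import time
import pandas as pd
import praw

reddit = praw.Reddit(
    client_id="YOUR_ID", client_secret="YOUR_SECRET", user_agent="scrapey-sketch"
)

COLUMNS = [
    "id", "title", "subreddit", "over_18", "is_original_content", "is_self",
    "spoiler", "locked", "stickied", "num_comments", "age_minutes",
]
CSV_PATH = "hot_posts.csv"  # placeholder output file

def snapshot():
    now = time.time()
    rows = []
    for post in reddit.subreddit("all").hot(limit=None):
        rows.append((
            post.id, post.title, str(post.subreddit), post.over_18,
            post.is_original_content, post.is_self, post.spoiler, post.locked,
            post.stickied, post.num_comments, (now - post.created_utc) / 60,
        ))
    new = pd.DataFrame(rows, columns=COLUMNS)
    try:
        old = pd.read_csv(CSV_PATH)
        # Keep the newest record for any post we've already seen.
        combined = pd.concat([old, new]).drop_duplicates("id", keep="last")
    except FileNotFoundError:
        combined = new
    combined.to_csv(CSV_PATH, index=False)

while True:
    snapshot()
    time.sleep(12 * 60)  # take a snapshot roughly every 12 minutes
</code></pre>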
<p>
I run this in the background in a terminal and it updates my dataset every ~12 minutes.
That means I have a record of every post up until roughly 12 minutes before it disappears
from /r/all.
</p>
<h3>EDA</h3>
<p>
<a href="https://doordesk.net/projects/reddit/EDA.html">Next I take a quick look to see what looks useful</a>, what
doesn't, and check for outliers that would throw off the model. There were a few outliers
to drop from the num_comments column.
</p>
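<p>
The actual cutoff is in the EDA notebook, but dropping comment-count outliers generally
looks something like this (the 0.995 quantile below is just an illustrative number):
</p>
<pre><code className="language-python">
# Drop extreme outliers from num_comments before modeling.
import pandas as pd

df = pd.read_csv("hot_posts.csv")  # placeholder file name
cutoff = df["num_comments"].quantile(0.995)  # illustrative threshold, not the notebook's
df = df[df["num_comments"].le(cutoff)].reset_index(drop=True)
</code></pre>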
<p>Chosen Features:</p>
<ul>
<li>Title</li>
<li>Subreddit</li>
<li>Over_18</li>
<li>Is_Original_Content</li>
<li>Is_Self</li>
<li>Spoiler</li>
<li>Locked</li>
<li>Stickied</li>
<li>Num_Comments (Target)</li>
</ul>
<p>
Then I split the data I'm going to use into two dataframes (numeric and non-numeric) to
prepare for further processing.
</p>
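<p>
Roughly, and with the column grouping assumed from the feature list above:
</p>
<pre><code className="language-python">
# Split the chosen features into numeric and non-numeric dataframes.
numeric = df[["num_comments"]].copy()
non_numeric = df[["title", "subreddit", "over_18", "is_original_content",
                  "is_self", "spoiler", "locked", "stickied"]].copy()
</code></pre>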
<h3>Clean</h3>
<p><a href="https://doordesk.net/projects/reddit/clean.html">Cleaning the data further</a> consists of the following steps (sketched in code after the list):</p>
<ul>
<li>Scaling numeric features between 0-1</li>
<li>Converting '_' and '-' to whitespace</li>
<li>Removing any characters other than a-z, A-Z, or whitespace</li>
<li>Stripping any leftover whitespace</li>
<li>Deleting any titles that were reduced to empty strings</li>
</ul>
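<p>
In code, those steps come out to something like this (carrying over the numeric and
non_numeric frames from the split above; the notebook may differ in the details):
</p>
<pre><code className="language-python">
import re
from sklearn.preprocessing import MinMaxScaler

# Scale numeric features into the 0-1 range.
numeric[numeric.columns] = MinMaxScaler().fit_transform(numeric)

def clean_title(title):
    title = title.replace("_", " ").replace("-", " ")  # '_' and '-' to whitespace
    title = re.sub(r"[^a-zA-Z\s]", "", title)          # keep only letters and whitespace
    return title.strip()                               # strip leftover whitespace

non_numeric["title"] = non_numeric["title"].map(clean_title)
non_numeric = non_numeric[non_numeric["title"] != ""]  # drop titles reduced to nothing
</code></pre>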
<h3>Model</h3>
<p>
If the number of comments on a post is greater than the median number of comments, it's
assigned a 1, otherwise a 0. This is the target column. I then try some lemmatizing, but
it doesn't seem to add much. After that I create and join some dummies, then split and
feed the new dataframe into
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">Random Forest</a>
and
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html">KNeighbors</a>
classifiers. Both actually scored the same with
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html">cross validation</a>,
so I mainly used the forest.
</p>
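<p>
Condensed, that modeling step looks roughly like the sketch below. The column names, the
bag-of-words on titles, and the hyperparameters are my stand-ins here, not copied from
the notebook.
</p>
<pre><code className="language-python">
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# df is assumed to be the cleaned dataframe with the feature columns listed earlier.
df = df.reset_index(drop=True)

# Binary target: 1 if the post has more comments than the median, else 0.
y = (df["num_comments"] > df["num_comments"].median()).astype(int)

# Dummies for subreddit, the boolean flags as 0/1, and a bag of words on the titles.
flags = ["over_18", "is_original_content", "is_self", "spoiler", "locked", "stickied"]
subreddit_dummies = pd.get_dummies(df["subreddit"], prefix="subreddit")
title_words = CountVectorizer(max_features=500).fit_transform(df["title"])
title_df = pd.DataFrame(title_words.toarray()).add_prefix("word_")

X = pd.concat([subreddit_dummies, df[flags].astype(int), title_df], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for model in (RandomForestClassifier(random_state=42), KNeighborsClassifier()):
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(type(model).__name__, scores.mean())
</code></pre>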
<p><a href="https://doordesk.net/projects/reddit/model.html">Notebook Here</a></p>
<h3>Conclusion</h3>
<p>Some predictors from the top 25:</p>
<ul>
<li>Is_Self</li>
<li>Subreddit_Memes</li>
<li>OC</li>
<li>Over_18</li>
<li>Subreddit_Shitposting</li>
<li>Is_Original_Content</li>
<li>Subreddit_Superstonk</li>
</ul>
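<p>
A list like that presumably falls out of the fitted forest's feature importances;
carrying over the names from the model sketch above, pulling the top 25 looks something
like:
</p>
<pre><code className="language-python">
# Fit the forest and inspect its strongest predictors.
forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)
importances = pd.Series(forest.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(25))
</code></pre>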
<p>
Popular words: 'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im',
'dont', and 'love'.
</p>
<p>
People on Reddit (at least in the past few days) like their memes, porn, and talking
about their day. And it's preferred if the content is original and self-posted. So yes,
post your memes to /r/memes and /r/shitposting, tag them NSFW, use some words from the
list, and rake in all that sweet karma!
</p>
<p>
But it's not that simple: this is a fairly simple model, with simple data. To go beyond
this I think the comments would have to be analyzed. I expected
<a href="https://en.wikipedia.org/wiki/Lemmatisation">lemmatisation</a> to be the most
influential piece, and I still think that reasoning is sound. But in this case it doesn't
apply, because there is no real meaning to be had from Reddit post titles, at least to a
computer (or I did something wrong).
</p>
<p>
A human sees a lot more than just the text in the title: there's often an image attached,
most posts reference a recent or current event, and some are an inside joke of sorts.
Some titles contain emojis, and depending on their combination they can take on a meaning
completely different from their individual meanings. The next step from here, I believe,
is to analyze the comments section of these posts, because right now I think that's the
easiest way to truly describe the meaning of a post to a computer. With what was gathered
here I'm only able to get about 10% above baseline, and I think that's close to all there
is to be had. We could probably tweak our way to a few more percent, but I don't think
there's much left on the table.
</p>
</article>