commit 3bfeb55c48
Adam, 2022-07-27 13:09:15 -04:00
11 changed files with 286990 additions and 0 deletions

.gitignore (vendored, new file, +4 lines)

@@ -0,0 +1,4 @@
sekrit
.DS_Store
.ipynb_checkpoints
*/.ipynb_checkpoints

EDA.ipynb (new file, +885 lines)

File diff suppressed because one or more lines are too long

README.md (new file, +70 lines)

@@ -0,0 +1,70 @@
What Goes Into a Successful Reddit Post?
======================================
Here I'll pull and explore some data from Reddit, and I'll attempt to find out what makes a Reddit post successful using classification: in particular, a Random Forest model and a K-Nearest Neighbors model. Then I'll score the results and see what the strongest predictors are.
To find what goes into making a successful Reddit post we'll have to do a few things, the first of which is collecting data:
## Introducing Scrapey!
[Scrapey](scrapey.ipynb) takes a snapshot of [Reddit/r/all hot](https://www.reddit.com/r/all) about every 12 minutes and saves the data, including a calculated age for each post, to a .csv file. Each iteration takes about 2 minutes to run and adds roughly 100 unique posts to the list while updating any posts it has already seen.
I run this in the background in a terminal and it updates my data set every ~12 minutes, so I have a record of every post up to within about 12 minutes of it disappearing from /r/all.
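The core of the pull is just PRAW feeding pandas. Here's a minimal sketch, condensed from [scrapey.ipynb](scrapey.ipynb); the real loop grabs many more fields, updates the .csv in place, and sleeps on a timer:

```python
import praw
import pandas as pd

# Fill in with your own API credentials
reddit = praw.Reddit(client_id='...', client_secret='...',
                     user_agent='totally_not_a_bot')

# One dict per post currently on /r/all hot
pull = pd.DataFrame({
    'id': post.id,
    'title': post.title,
    'subreddit': post.subreddit,
    'num_comments': post.num_comments,
    'score': post.score,
} for post in reddit.subreddit('all').hot(limit=None))
```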
## EDA
Chosen Features (loading these is sketched below):
* Title
* Subreddit
* Over_18
* Is_Original_Content
* Is_Self
* Spoiler
* Locked
* Stickied
* Num_Comments (Target)
[EDA](EDA.ipynb)
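Loading just those columns from the scraped data is one line of pandas (a sketch; the column names match what Scrapey saves):

```python
import pandas as pd

cols = ['title', 'subreddit', 'over_18', 'is_original_content',
        'is_self', 'spoiler', 'locked', 'stickied', 'num_comments']
df = pd.read_csv('data/startingover.csv', usecols=cols)
df.head()
```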
## Clean
* Scale numeric features between 0 and 1
* Convert '_' and '-' to whitespace
* Remove anything that is not a letter or whitespace
* Strip any leftover whitespace
* Delete any titles that were reduced to empty strings (a condensed sketch of these steps follows)
[Clean](clean.ipynb)
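Condensed from the notebook, the title cleanup looks like this:

```python
import pandas as pd

df = pd.read_csv('data/workingdf.csv')

# Lowercase and turn '_' and '-' into spaces
df['title'] = [t.lower().replace('_', ' ').replace('-', ' ') for t in df.title]
# Remove anything that is not a letter or whitespace
df['title'] = df.title.replace(r'[^a-zA-Z\s]', '', regex=True)
# Strip leftover whitespace, then drop titles that are now empty
df['title'] = [t.strip() for t in df.title]
df.drop(df[df.title == ''].index, inplace=True)
```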
## Model
* Split the data into train/test samples
* Test RandomForestClassifier and KNeighborsClassifier (a condensed sketch follows)
* My preferred model is Random Forest
[Model](model.ipynb)
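A minimal sketch of the modeling step, condensed from the notebook (here X is the TF-IDF title matrix joined with the numeric and dummy features, and y marks whether a post's comment count beats the median):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

rf = RandomForestClassifier(n_jobs=-1).fit(X_train, y_train)

# Cross-validated accuracy on the held-out split (baseline is ~0.5)
cv = StratifiedKFold(n_splits=3, shuffle=True)
print(cross_val_score(rf, X_test, y_test, cv=cv).mean())
```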
### Conclusion
Long story short, if you want to karma farm then post OC memes and porn.
Some Predictors from Top 25:
* Is_Self
* Subreddit_Memes
* OC
* Over_18
* Subreddit_Shitposting
* Is_Original_Content
* Subreddit_Superstonk
Popular words:
'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im', 'dont', and 'love'.
People on Reddit (at least over the past few days) like their memes, porn, and talking about their day, and they prefer content that is original and self-posted.
So yes, post your memes to memes and shitposting, tag them NSFW, use some words from the list, and rake in all that sweet karma!
But it's not that simple: this is a fairly simple model, with simple data. To go beyond it I think the comments would have to be analyzed. I expected tokenization to be the most influential piece, and I still think that instinct is correct, but it doesn't apply here because there is no real meaning to be extracted from Reddit post titles, at least by a computer. A human sees a lot more than just the text in the title: there's often an image attached, most posts reference a recent or current event, and some are an inside joke of sorts. Some titles contain emojis, and depending on their combination they can take on a meaning completely different from their individual meanings.
The next step from here, I believe, is to analyze the comments sections of these posts, because right now I think that's the easiest way to truly describe the meaning of a post to a computer. With what was gathered here I'm only able to get about 10% above baseline, and I think that's all there is to be had; we could probably tweak out a few more percent, but I don't think much is left on the table.

clean.ipynb (new file, +290 lines)

@@ -0,0 +1,290 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "7b689e03-f1fa-4a63-b3e2-7342f71b4c1f",
"metadata": {},
"source": [
"# Cleaning/Processing\n",
"## Numerics\n",
"The numeric columns added in the previous step need to be scaled so they don't end up with higher influence than other features."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "85e6ae15-a4d7-474f-918f-a81b790a0ef7",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:19.692651Z",
"iopub.status.busy": "2022-06-07T15:26:19.692290Z",
"iopub.status.idle": "2022-06-07T15:26:19.952952Z",
"shell.execute_reply": "2022-06-07T15:26:19.952223Z",
"shell.execute_reply.started": "2022-06-07T15:26:19.692571Z"
},
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"numerics = pd.read_csv('data/numerics.csv')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "326de842-8fd1-4d1b-8696-6f1f7c6b916f",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:19.954589Z",
"iopub.status.busy": "2022-06-07T15:26:19.954241Z",
"iopub.status.idle": "2022-06-07T15:26:19.960763Z",
"shell.execute_reply": "2022-06-07T15:26:19.960063Z",
"shell.execute_reply.started": "2022-06-07T15:26:19.954558Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score min/max: 45 / 193501\n",
"Post Age min/max: 14846.5434589386 / 101274.16840600967\n",
"Upvote Ratio min/max: 0.51 / 1.0\n",
"\n"
]
}
],
"source": [
"print(f'\\\n",
"Score min/max: {numerics.score.min()} / {numerics.score.max()}\\n\\\n",
"Post Age min/max: {numerics.post_age.min()} / {numerics.post_age.max()}\\n\\\n",
"Upvote Ratio min/max: {numerics.upvote_ratio.min()} / {numerics.upvote_ratio.max()}\\n\\\n",
"')"
]
},
{
"cell_type": "markdown",
"id": "42f191ae-2c1e-4471-9342-b25dea2be7a0",
"metadata": {},
"source": [
"Upvote Ratio is already between 0 and 1 so there's 1/3 of the work out the way for free"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "eb3a9ce5-d4eb-47a8-bdbe-aaf34b1b948b",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:19.963570Z",
"iopub.status.busy": "2022-06-07T15:26:19.963241Z",
"iopub.status.idle": "2022-06-07T15:26:19.970133Z",
"shell.execute_reply": "2022-06-07T15:26:19.969493Z",
"shell.execute_reply.started": "2022-06-07T15:26:19.963543Z"
},
"tags": []
},
"outputs": [],
"source": [
"def normalize_numerics(col):\n",
" col_max = numerics[col].max()\n",
" return [(val/col_max) for val in numerics[col]]"
]
},
{
"cell_type": "markdown",
"id": "8c77c16b-abff-40f3-9621-78ca90096c70",
"metadata": {},
"source": [
"I wasn't sure how often I would need to do this so I wrote a function"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2e83a18d-9a20-4f18-8da5-dd1ad9a35ca1",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:19.971475Z",
"iopub.status.busy": "2022-06-07T15:26:19.971081Z",
"iopub.status.idle": "2022-06-07T15:26:20.010176Z",
"shell.execute_reply": "2022-06-07T15:26:20.009339Z",
"shell.execute_reply.started": "2022-06-07T15:26:19.971448Z"
},
"tags": []
},
"outputs": [],
"source": [
"numerics['norm_score'] = normalize_numerics('score')\n",
"numerics = numerics.drop('score',axis=1) # Prevent name collision with column for word 'score'\n",
"numerics['post_age'] = normalize_numerics('post_age')"
]
},
{
"cell_type": "markdown",
"id": "c042889a-c6ac-4dfa-92ed-28a668957296",
"metadata": {},
"source": [
"So if we check again..."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "17b0baf9-04bf-486c-b94f-5ea9cd801c37",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:20.011272Z",
"iopub.status.busy": "2022-06-07T15:26:20.011063Z",
"iopub.status.idle": "2022-06-07T15:26:20.016210Z",
"shell.execute_reply": "2022-06-07T15:26:20.015608Z",
"shell.execute_reply.started": "2022-06-07T15:26:20.011255Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score min/max: 0.0002325569376902445 / 1.0\n",
"Post Age min/max: 0.14659753511298737 / 1.0\n",
"Upvote Ratio min/max: 0.51 / 1.0\n",
"\n"
]
}
],
"source": [
"print(f'\\\n",
"Score min/max: {numerics.norm_score.min()} / {numerics.norm_score.max()}\\n\\\n",
"Post Age min/max: {numerics.post_age.min()} / {numerics.post_age.max()}\\n\\\n",
"Upvote Ratio min/max: {numerics.upvote_ratio.min()} / {numerics.upvote_ratio.max()}\\n\\\n",
"')"
]
},
{
"cell_type": "markdown",
"id": "25157505-bc47-433e-8b5f-95a73d728d74",
"metadata": {},
"source": [
"It's less human readable but better for the model, who is not human"
]
},
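{
"cell_type": "markdown",
"id": "7d2f4a9e-1b3c-4d5e-8f6a-0b1c2d3e4f5a",
"metadata": {},
"source": [
"An aside: `normalize_numerics` divides by the column max, so values land in (0, 1] rather than being stretched to exactly [0, 1]. A hypothetical drop-in alternative (not used here) is sklearn's `MinMaxScaler`, which does a true min-max scale:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e3a5b0f-2c4d-4e6f-9a7b-1c2d3e4f5a6b",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only, not run here: true min-max scaling, (x - min) / (max - min)\n",
"# from sklearn.preprocessing import MinMaxScaler\n",
"# numerics[['norm_score','post_age']] = MinMaxScaler().fit_transform(\n",
"#     numerics[['norm_score','post_age']])"
]
},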
{
"cell_type": "markdown",
"id": "e65e348b-7023-40b7-8e14-1fff5950a6ed",
"metadata": {},
"source": [
"## Titles"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "13180eef-55b3-4faa-adee-7c6f6ce4b81c",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:20.017202Z",
"iopub.status.busy": "2022-06-07T15:26:20.017007Z",
"iopub.status.idle": "2022-06-07T15:26:20.131445Z",
"shell.execute_reply": "2022-06-07T15:26:20.130521Z",
"shell.execute_reply.started": "2022-06-07T15:26:20.017184Z"
},
"tags": []
},
"outputs": [],
"source": [
"df = pd.read_csv('data/workingdf.csv')"
]
},
{
"cell_type": "markdown",
"id": "7eb7f69a-f634-4745-ad02-9ffb4d3a662c",
"metadata": {},
"source": [
"Before beginning tokenizing, the random garbage that will end up producing gibberish will need to be removed. Things like emojis, punctuation and special characters, accents, etc. I've decided to replace underscores and hyphens with whitespace, then remove anything that is not a letter or whitespace, and finally strip all extra whitespace."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "44f5369e-3bb8-4f9d-b176-50ee5f5c07e2",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:20.132495Z",
"iopub.status.busy": "2022-06-07T15:26:20.132306Z",
"iopub.status.idle": "2022-06-07T15:26:20.279374Z",
"shell.execute_reply": "2022-06-07T15:26:20.278619Z",
"shell.execute_reply.started": "2022-06-07T15:26:20.132478Z"
},
"tags": []
},
"outputs": [],
"source": [
"df['title'] = [title.lower().replace('_',' ').replace('-',' ') for title in df.title]\n",
"df['title'] = df.title.replace(\"[^a-zA-Z\\s]\",'',regex=True)\n",
"df['title'] = [title.strip() for title in df.title]\n",
"df.drop(df[df.title==''].index,inplace=True) # drop now empty titles"
]
},
{
"cell_type": "markdown",
"id": "15415269-6a3b-47eb-bfa3-b45f23195b4a",
"metadata": {},
"source": [
"# Save"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "de7c67d6-22a8-456c-8e89-d22e789e0f87",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:20.280467Z",
"iopub.status.busy": "2022-06-07T15:26:20.280158Z",
"iopub.status.idle": "2022-06-07T15:26:20.649227Z",
"shell.execute_reply": "2022-06-07T15:26:20.648747Z",
"shell.execute_reply.started": "2022-06-07T15:26:20.280449Z"
},
"tags": []
},
"outputs": [],
"source": [
"df.to_csv('data/df_clean.csv',index=False)\n",
"numerics.to_csv('data/numerics_clean.csv',index=False)"
]
},
{
"cell_type": "markdown",
"id": "425bc75b-f459-4f38-96b1-99e3414e4226",
"metadata": {},
"source": [
"Next! [Model](model.ipynb)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

data/df_clean.csv (new file, +48028 lines)

File diff suppressed because it is too large

data/numerics.csv (new file, +48545 lines)

File diff suppressed because it is too large

data/numerics_clean.csv (new file, +48545 lines)

File diff suppressed because it is too large

data/startingover.csv (new file, +91261 lines)

File diff suppressed because it is too large

data/workingdf.csv (new file, +48545 lines)

File diff suppressed because it is too large

model.ipynb (new file, +611 lines)

@@ -0,0 +1,611 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f2c2a80e-4240-4e89-ad59-039c8e65cd25",
"metadata": {},
"source": [
"# Process/Model"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7c31603a-9b72-4353-a49d-9ad3b5e245e4",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:40.982440Z",
"iopub.status.busy": "2022-06-07T15:26:40.982213Z",
"iopub.status.idle": "2022-06-07T15:26:41.538590Z",
"shell.execute_reply": "2022-06-07T15:26:41.537956Z",
"shell.execute_reply.started": "2022-06-07T15:26:40.982383Z"
},
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"# import spacy\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold"
]
},
{
"cell_type": "markdown",
"id": "cfc10edd-fec8-439d-a315-44d55056d4a3",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T04:06:14.259265Z",
"iopub.status.busy": "2022-06-07T04:06:14.258568Z",
"iopub.status.idle": "2022-06-07T04:06:14.264153Z",
"shell.execute_reply": "2022-06-07T04:06:14.262980Z",
"shell.execute_reply.started": "2022-06-07T04:06:14.259236Z"
}
},
"source": [
"### Load Files"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3512f32a-cf1f-4763-a260-8970c0dfb58b",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:41.539644Z",
"iopub.status.busy": "2022-06-07T15:26:41.539379Z",
"iopub.status.idle": "2022-06-07T15:26:41.611358Z",
"shell.execute_reply": "2022-06-07T15:26:41.610751Z",
"shell.execute_reply.started": "2022-06-07T15:26:41.539618Z"
},
"tags": []
},
"outputs": [],
"source": [
"df = pd.read_csv('data/df_clean.csv')\n",
"numerics = pd.read_csv('data/numerics_clean.csv')"
]
},
{
"cell_type": "markdown",
"id": "b4ba7548-f51d-43cc-befb-3280219a860b",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T04:06:33.188298Z",
"iopub.status.busy": "2022-06-07T04:06:33.187754Z",
"iopub.status.idle": "2022-06-07T04:06:33.218642Z",
"shell.execute_reply": "2022-06-07T04:06:33.217761Z",
"shell.execute_reply.started": "2022-06-07T04:06:33.188269Z"
},
"tags": []
},
"source": [
"### Create target column (y)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6f9c6b2a-d6b1-4ba9-9bdf-07fdf36115f9",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:41.613421Z",
"iopub.status.busy": "2022-06-07T15:26:41.613145Z",
"iopub.status.idle": "2022-06-07T15:26:41.642378Z",
"shell.execute_reply": "2022-06-07T15:26:41.641819Z",
"shell.execute_reply.started": "2022-06-07T15:26:41.613403Z"
},
"tags": []
},
"outputs": [],
"source": [
"ymed = df.num_comments.median()\n",
"y = pd.Series([1 if val > ymed else 0 for val in df.num_comments])\n",
"df.drop('num_comments',axis=1,inplace=True) # get rid of this immediately"
]
},
{
"cell_type": "markdown",
"id": "c4a4af8f-4a83-4d8c-8547-a07aa069a2bd",
"metadata": {},
"source": [
"### Lemmatise"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a7083ac9-5033-4c43-90b8-815957443793",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:41.643370Z",
"iopub.status.busy": "2022-06-07T15:26:41.643114Z",
"iopub.status.idle": "2022-06-07T15:26:41.646350Z",
"shell.execute_reply": "2022-06-07T15:26:41.645572Z",
"shell.execute_reply.started": "2022-06-07T15:26:41.643353Z"
}
},
"outputs": [],
"source": [
"# # Lemmatize and filter out ' ' tokens\n",
"# nlp = spacy.load('en_core_web_sm')\n",
"# df['title'] = [' '.join([word.lemma_ for word in nlp(title) if word.lemma_ != ' '])\\\n",
"# for title in df.title] # This should be optimized"
]
},
{
"cell_type": "markdown",
"id": "138b8776-d5cf-47b5-bd6e-319a5f7c66ae",
"metadata": {},
"source": [
"Lemmatising to my surprise seems to add no value. I thought this was going to be the most important part, but as it turns out it just takes forever to process and adds nothing but run time. I suspect this is because the post titles are so short that there is no real meaning to be extracted. This could be useful when analyzing comments"
]
},
{
"cell_type": "markdown",
"id": "a43ea9ac-8e67-41ba-b137-16c2c195f744",
"metadata": {},
"source": [
"### Create X"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "358fce01-c93b-4b35-a66f-bbf39d3eabd8",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:41.647434Z",
"iopub.status.busy": "2022-06-07T15:26:41.647206Z",
"iopub.status.idle": "2022-06-07T15:26:42.472794Z",
"shell.execute_reply": "2022-06-07T15:26:42.471961Z",
"shell.execute_reply.started": "2022-06-07T15:26:41.647417Z"
},
"tags": []
},
"outputs": [],
"source": [
"tf = TfidfVectorizer(stop_words='english',max_features=500)\n",
"tfvec = tf.fit(df.title)\n",
"X = pd.DataFrame(tfvec.transform(df.title).todense(),columns=tfvec.get_feature_names_out())"
]
},
{
"cell_type": "markdown",
"id": "cf0c3334-5795-4f2e-95da-28164de3d36b",
"metadata": {},
"source": [
"### Join X with numeric columns"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "0e1a7188-565d-48c9-8e0a-93553968980c",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:42.473997Z",
"iopub.status.busy": "2022-06-07T15:26:42.473562Z",
"iopub.status.idle": "2022-06-07T15:26:42.480297Z",
"shell.execute_reply": "2022-06-07T15:26:42.479581Z",
"shell.execute_reply.started": "2022-06-07T15:26:42.473979Z"
},
"tags": []
},
"outputs": [],
"source": [
"df = df.join(numerics)\n",
"del numerics # done with numerics"
]
},
{
"cell_type": "markdown",
"id": "2248137d-3f32-4bc7-af53-93de51791d80",
"metadata": {},
"source": [
"### Create dummies from columns that are objects or booleans"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "6c14641e-0e0e-46b5-960a-2bb41e84c1dd",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:42.481258Z",
"iopub.status.busy": "2022-06-07T15:26:42.481069Z",
"iopub.status.idle": "2022-06-07T15:26:49.841545Z",
"shell.execute_reply": "2022-06-07T15:26:49.840824Z",
"shell.execute_reply.started": "2022-06-07T15:26:42.481241Z"
},
"tags": []
},
"outputs": [],
"source": [
"def make_dummies(df):\n",
" for col_name in df.columns:\n",
" if (df[col_name].dtype == 'O') or (df[col_name].dtype == 'bool'):\n",
" dums = pd.get_dummies(df[col_name],prefix=col_name,dtype=int,drop_first=True)\n",
" df = df.drop(labels=[col_name],axis=1)\n",
" df = df.join(dums)\n",
" return df\n",
"\n",
"dums = make_dummies(df[df.columns[1:]]) # [1:] excludes first column, 'title'\n",
"del df # done with df"
]
},
{
"cell_type": "markdown",
"id": "24b13f04-8b0a-43d2-9f47-0bb0edd8b305",
"metadata": {},
"source": [
"### Now join dummies with X"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3a2e6910-4023-4e75-bd68-864fcbfc193d",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:49.842408Z",
"iopub.status.busy": "2022-06-07T15:26:49.842246Z",
"iopub.status.idle": "2022-06-07T15:26:50.247406Z",
"shell.execute_reply": "2022-06-07T15:26:50.246837Z",
"shell.execute_reply.started": "2022-06-07T15:26:49.842392Z"
},
"tags": []
},
"outputs": [],
"source": [
"X = X.join(dums)\n",
"del dums # done with dums"
]
},
{
"cell_type": "markdown",
"id": "ef94a7c0-dcde-41c1-8a0d-407938f85608",
"metadata": {},
"source": [
"### Now model it!"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "c5a322ee-a81e-491a-97d2-3d22a2fd89a7",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:50.248820Z",
"iopub.status.busy": "2022-06-07T15:26:50.248515Z",
"iopub.status.idle": "2022-06-07T15:26:52.034771Z",
"shell.execute_reply": "2022-06-07T15:26:52.034058Z",
"shell.execute_reply.started": "2022-06-07T15:26:50.248795Z"
},
"tags": []
},
"outputs": [],
"source": [
"# Do a split\n",
"X_train, X_test, y_train, y_test = train_test_split(X,y)\n",
"del X\n",
"del y"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "fb1c22e4-57ff-4133-9c36-1c3c3e00c8bd",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:52.035757Z",
"iopub.status.busy": "2022-06-07T15:26:52.035519Z",
"iopub.status.idle": "2022-06-07T15:28:35.841910Z",
"shell.execute_reply": "2022-06-07T15:28:35.841244Z",
"shell.execute_reply.started": "2022-06-07T15:26:52.035741Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Create Random Forest...\n",
"Create KNN...\n",
"fit RF...\n",
"fit KNN...\n",
"Scoring...\n",
"Score:\t0.62 ± 0.0056\n",
"Score:\t0.6 ± 0.0052\n"
]
}
],
"source": [
"print('Create Random Forest...')\n",
"rf = RandomForestClassifier(n_jobs=-1)\n",
"print('Create KNN...')\n",
"# knn = KNeighborsClassifier(n_jobs=-1)\n",
"print('fit RF...')\n",
"model_rf = rf.fit(X_train,y_train)\n",
"\n",
"print('fit KNN...')\n",
"# model_knn = knn.fit(X_train,y_train)\n",
"\n",
"# Model Scores\n",
"def score(model,X,y):\n",
"    cv=StratifiedKFold(n_splits=3,shuffle=True)\n",
"    s = cross_val_score(model,X,y,cv=cv) # n_jobs=-1 actually makes it slower here\n",
"    print(\"Score:\\t{:0.2} ± {:0.2}\".format(s.mean(), 2 * s.std()))\n",
"\n",
"print('Scoring...')\n",
"score(model_rf,X_train,y_train) # train split\n",
"score(model_rf,X_test,y_test) # test split\n",
"# score(model_knn,X_train,y_train)\n",
"# score(model_knn,X_test,y_test)"
]
},
{
"cell_type": "markdown",
"id": "ec845f9b-2df2-475e-b4a0-12d6e4d1a38c",
"metadata": {},
"source": [
"### Comparing Models\n",
"Between Random Forest, K Neighbors, and LogisticRegression, they all score about the same. But Random Forest takes two minutes to run and the rest take a lifetime. I think the data and the problem isn't complex enough to warrant more than a Random Forest.\n",
"\n",
"Everything seems to perform about 10% above the baseline, which would be 50%. In other words, the target is split cleanly in half, so if the predictions were to be 1s across the board the accuracy would be at 50%. So a cross-val score above 50 means the model is working, but doesn't necessarily give insight on exactly what it's predicting."
]
},
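{
"cell_type": "markdown",
"id": "9f4b6c1a-3d5e-4f7a-8b9c-2d3e4f5a6b7c",
"metadata": {},
"source": [
"A quick sanity check on that 50% baseline (a sketch, not run here): since the target was split at the median, the majority class should hold right around half of the rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a5c7d2b-4e6f-4a8b-9c0d-3e4f5a6b7c8d",
"metadata": {},
"outputs": [],
"source": [
"# Baseline accuracy = always predict the majority class\n",
"# print(y_train.value_counts(normalize=True).max())  # expect ~0.5"
]
},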
{
"cell_type": "code",
"execution_count": 11,
"id": "1ea3c8de-1f87-43ef-b589-7fab1552f072",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:28:35.843245Z",
"iopub.status.busy": "2022-06-07T15:28:35.842845Z",
"iopub.status.idle": "2022-06-07T15:28:35.877909Z",
"shell.execute_reply": "2022-06-07T15:28:35.877235Z",
"shell.execute_reply.started": "2022-06-07T15:28:35.843225Z"
},
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Variable</th>\n",
" <th>Importance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>500</th>\n",
" <td>post_age</td>\n",
" <td>0.080172</td>\n",
" </tr>\n",
" <tr>\n",
" <th>502</th>\n",
" <td>norm_score</td>\n",
" <td>0.077290</td>\n",
" </tr>\n",
" <tr>\n",
" <th>501</th>\n",
" <td>upvote_ratio</td>\n",
" <td>0.047547</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5545</th>\n",
" <td>is_self_True</td>\n",
" <td>0.020376</td>\n",
" </tr>\n",
" <tr>\n",
" <th>232</th>\n",
" <td>like</td>\n",
" <td>0.003663</td>\n",
" </tr>\n",
" <tr>\n",
" <th>212</th>\n",
" <td>just</td>\n",
" <td>0.003490</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4518</th>\n",
" <td>subreddit_memes</td>\n",
" <td>0.003337</td>\n",
" </tr>\n",
" <tr>\n",
" <th>428</th>\n",
" <td>time</td>\n",
" <td>0.003296</td>\n",
" </tr>\n",
" <tr>\n",
" <th>287</th>\n",
" <td>new</td>\n",
" <td>0.003230</td>\n",
" </tr>\n",
" <tr>\n",
" <th>293</th>\n",
" <td>oc</td>\n",
" <td>0.002981</td>\n",
" </tr>\n",
" <tr>\n",
" <th>198</th>\n",
" <td>im</td>\n",
" <td>0.002539</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>day</td>\n",
" <td>0.002455</td>\n",
" </tr>\n",
" <tr>\n",
" <th>157</th>\n",
" <td>got</td>\n",
" <td>0.002369</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>dont</td>\n",
" <td>0.002164</td>\n",
" </tr>\n",
" <tr>\n",
" <th>431</th>\n",
" <td>today</td>\n",
" <td>0.002146</td>\n",
" </tr>\n",
" <tr>\n",
" <th>247</th>\n",
" <td>love</td>\n",
" <td>0.002137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>253</th>\n",
" <td>man</td>\n",
" <td>0.002102</td>\n",
" </tr>\n",
" <tr>\n",
" <th>156</th>\n",
" <td>good</td>\n",
" <td>0.002101</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5546</th>\n",
" <td>spoiler_True</td>\n",
" <td>0.002013</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5544</th>\n",
" <td>is_original_content_True</td>\n",
" <td>0.002012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>141</th>\n",
" <td>game</td>\n",
" <td>0.001961</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5027</th>\n",
" <td>subreddit_shitposting</td>\n",
" <td>0.001955</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>art</td>\n",
" <td>0.001921</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5543</th>\n",
" <td>over_18_True</td>\n",
" <td>0.001905</td>\n",
" </tr>\n",
" <tr>\n",
" <th>311</th>\n",
" <td>people</td>\n",
" <td>0.001879</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Variable Importance\n",
"500 post_age 0.080172\n",
"502 norm_score 0.077290\n",
"501 upvote_ratio 0.047547\n",
"5545 is_self_True 0.020376\n",
"232 like 0.003663\n",
"212 just 0.003490\n",
"4518 subreddit_memes 0.003337\n",
"428 time 0.003296\n",
"287 new 0.003230\n",
"293 oc 0.002981\n",
"198 im 0.002539\n",
"85 day 0.002455\n",
"157 got 0.002369\n",
"99 dont 0.002164\n",
"431 today 0.002146\n",
"247 love 0.002137\n",
"253 man 0.002102\n",
"156 good 0.002101\n",
"5546 spoiler_True 0.002013\n",
"5544 is_original_content_True 0.002012\n",
"141 game 0.001961\n",
"5027 subreddit_shitposting 0.001955\n",
"11 art 0.001921\n",
"5543 over_18_True 0.001905\n",
"311 people 0.001879"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame({'Variable':X_train.columns,\n",
" 'Importance':rf.feature_importances_}).sort_values('Importance', ascending=False).head(25)"
]
},
{
"cell_type": "markdown",
"id": "4e664e35-6091-47b9-85e9-384eb3a46dcd",
"metadata": {},
"source": [
"Here are the top 25 predictors scored by importance, or the amount of influence they have. At this time of writing, at the top is age, score, upvote ratio. Of course if you want to have a popular post, make it popular, but we can't just say that. Beyond that it appears that self posts see more activity, the subreddits memes and shitposting have been very active and popular the past few days, over 18 content is popular. And some key words that may get you to the top are 'like','just','time','new','oc','good','got','day','today','im','dont', and 'love'. I don't go on Reddit often but I'm not exactly a stranger either, this looks about right. Memes is very popular, and a very easy karma farm. People love their OC (and their porn). A lot of people on reddit talk about what's going on in their day ('today', 'day')"
]
},
{
"cell_type": "markdown",
"id": "48b62749-e027-41dd-acb3-32fa1c52eeea",
"metadata": {},
"source": [
"This is a fairly simple model, with simple data. To go beyond this I think the comments would have to be analyzed. Tokenization I thought would be the most influential piece, and I still think that thinking is correct. But in this case it doesn't apply because there is no real meaning to be had from reddit post titles, at least to a computer. There's a lot more seen by a human than just the text in the title, there's often an image attached, most posts reference a recent/current event, they could be an inside joke of sorts. For some posts there could be emojis in the title, and depending on their combination they can take on a meaning completely different from their individual meanings. \n",
"\n",
"The next step from here I believe is to analyze the comments section of these posts because in this moment I think that's the easiest way to truly describe the meaning of a post to a computer. With what was gathered here I'm only to get 10% above baseline and I think that's all there is to be had here, I mean we can tweak for a few percent probably but I don't think there's much left on the table."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

scrapey.ipynb (new file, +206 lines)

@@ -0,0 +1,206 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "15017f1a-bfcb-4195-b635-d2138873a9cf",
"metadata": {},
"source": [
"## Anatomy of Scrapey!"
]
},
{
"cell_type": "markdown",
"id": "bac30d6c-6a53-4ee5-ad68-d096c7cc567d",
"metadata": {},
"source": [
"Scrapey takes a snapshot of [Reddit/r/all hot](https://www.reddit.com/r/all), and saves the data to a .csv file including a calculated age for each post about every 12 minutes. Run time is about 2 minutes per iteration and each time adds about 100 unique posts to the list while updating any post it's already seen.\n",
"\n",
"To run it yourself you should create a file ./sekrit with your:\n",
"* client_id token\n",
"* client_secret token\n",
"* username (optional)\n",
"* password (if using username, also optional)\n",
"\n",
"Each value goes on their own line in this order, or you can just hard code them below.<br />\n",
"If you don't want to use a username or password just comment out those lines below"
]
},
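{
"cell_type": "markdown",
"id": "3c9a1b2d-4e5f-4a6b-8c7d-9e0f1a2b3c4d",
"metadata": {},
"source": [
"For reference, a `sekrit` file would look something like this (placeholder values, one per line):\n",
"```\n",
"YOUR_CLIENT_ID\n",
"YOUR_CLIENT_SECRET\n",
"your_username\n",
"your_password\n",
"```"
]
},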
{
"cell_type": "markdown",
"id": "b85a5ad5-1de3-41aa-a74c-d2a3beb1d1d2",
"metadata": {},
"source": [
"Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb1dacc9-852c-436e-875f-dd5d5bb4f3d7",
"metadata": {},
"outputs": [],
"source": [
"import praw\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"import time\n",
"print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))"
]
},
{
"cell_type": "markdown",
"id": "41de2589-1fd9-41c6-b1ee-6585366cd53b",
"metadata": {},
"source": [
"Load all from current collection"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3459ab26-8640-4116-98da-d48016f93d4b",
"metadata": {},
"outputs": [],
"source": [
"# Connect to DB\n",
"db_name = 'data/startingover.csv'\n",
"# db = pd.DataFrame() # for fresh start\n",
"db = pd.read_csv(db_name)\n",
"print('Connected to DB...')\n",
"print(db.shape)"
]
},
{
"cell_type": "markdown",
"id": "1b9cb802-b6c6-4380-8026-23710c3624b5",
"metadata": {},
"source": [
"Access Reddit API via [PRAW](https://github.com/praw-dev/praw)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "17d30e19-663a-4c07-847c-ff30f16ea19c",
"metadata": {},
"outputs": [],
"source": [
"# Extremely Confidential\n",
"sekrits = open('sekrit').read().split('\\n')\n",
"\n",
"# Connect to Reddit\n",
"reddit = praw.Reddit(\n",
" client_id = sekrits[0],\n",
" client_secret = sekrits[1],\n",
" username = sekrits[2], # Optional\n",
" password = sekrits[3], # Optional\n",
" redirect_uri= 'http://localhost:8080',\n",
" user_agent = 'totally_not_a_bot', # fool everyone\n",
")\n",
"print('Connected to Reddit...')"
]
},
{
"cell_type": "markdown",
"id": "fd2c83ee-d534-4730-add9-da4e304c1c9d",
"metadata": {},
"source": [
"The following block is a little large but if I split it up it will break the loop and it can't be run from the notebook.\n",
"1. Loop through all posts on /r/all hot at the current moment, and create a dataframe of all of these posts with the listed features\n",
"2. Calculate a current age of the post and add that in its own column.\n",
"3. Append the newly pulled posts to the posts already saved\n",
"4. Overwrite any old records that have the same post id as a new record\n",
"5. Save back to the original .csv, wait 10 minutes, repeat."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49504fca-7beb-45db-b43f-98a9e719bf31",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Grab everything from /r/all hot\n",
"print('Pulling...')\n",
"while True:\n",
" pull = pd.DataFrame({\\\n",
" 'author': post.author,\n",
" # 'comments': post.comments, # takes really long, returns object\n",
" 'created_utc': post.created_utc,\n",
" 'distinguished': post.distinguished,\n",
" 'edited': post.edited,\n",
" 'id': post.id,\n",
" 'is_original_content': post.is_original_content,\n",
" 'is_self': post.is_self,\n",
" 'link_flair_text': post.link_flair_text,\n",
" 'locked': post.locked,\n",
" 'name': post.name,\n",
" 'num_comments': post.num_comments,\n",
" 'over_18': post.over_18,\n",
" 'permalink': post.permalink,\n",
" 'score': post.score,\n",
" 'selftext': post.selftext,\n",
" 'spoiler': post.spoiler,\n",
" 'stickied': post.stickied,\n",
" 'subreddit': post.subreddit,\n",
" 'title': post.title,\n",
" 'upvote_ratio': post.upvote_ratio,\n",
" 'url': post.url,\n",
" 'utc_now': datetime.utcnow().timestamp(),\n",
" 'post_age': (datetime.utcnow().timestamp()-post.created_utc) # Create age col\n",
" } for post in reddit.subreddit('all').hot(limit=None))\n",
"\n",
" # add new list to BOTTOM of old list\n",
" db = pd.concat([db,pull])\n",
" # effectively update post record in place\n",
" db = db.drop_duplicates('id',keep='last')\n",
" # save\n",
" db.to_csv(db_name, index=False)\n",
"\n",
" # stats\n",
" total = db.shape[0]\n",
" haul = pull.shape[0]\n",
" print('Haul: ',pull.shape)\n",
" print('Total:',db.shape)\n",
" print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))\n",
"\n",
" # wait\n",
" print('Now wait...')\n",
" time.sleep(600)"
]
},
{
"cell_type": "markdown",
"id": "39a4a83d-77c5-4dad-ae2d-867e61000d7a",
"metadata": {},
"source": [
"I run this in the background in a terminal and it updates my data set every ~12 minutes. I have records of all posts within about 12 minutes of them disappearing from /r/all.\n",
"\n",
"Next up: [EDA](EDA.ipynb)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}