{ "cells": [ { "cell_type": "markdown", "id": "f2c2a80e-4240-4e89-ad59-039c8e65cd25", "metadata": {}, "source": [ "# Process/Model" ] }, { "cell_type": "code", "execution_count": 1, "id": "7c31603a-9b72-4353-a49d-9ad3b5e245e4", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:40.982440Z", "iopub.status.busy": "2022-06-07T15:26:40.982213Z", "iopub.status.idle": "2022-06-07T15:26:41.538590Z", "shell.execute_reply": "2022-06-07T15:26:41.537956Z", "shell.execute_reply.started": "2022-06-07T15:26:40.982383Z" }, "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "# import spacy\n", "\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold" ] }, { "cell_type": "markdown", "id": "cfc10edd-fec8-439d-a315-44d55056d4a3", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T04:06:14.259265Z", "iopub.status.busy": "2022-06-07T04:06:14.258568Z", "iopub.status.idle": "2022-06-07T04:06:14.264153Z", "shell.execute_reply": "2022-06-07T04:06:14.262980Z", "shell.execute_reply.started": "2022-06-07T04:06:14.259236Z" } }, "source": [ "### Load Files" ] }, { "cell_type": "code", "execution_count": 2, "id": "3512f32a-cf1f-4763-a260-8970c0dfb58b", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:41.539644Z", "iopub.status.busy": "2022-06-07T15:26:41.539379Z", "iopub.status.idle": "2022-06-07T15:26:41.611358Z", "shell.execute_reply": "2022-06-07T15:26:41.610751Z", "shell.execute_reply.started": "2022-06-07T15:26:41.539618Z" }, "tags": [] }, "outputs": [], "source": [ "df = pd.read_csv('data/df_clean.csv')\n", "numerics = pd.read_csv('data/numerics_clean.csv')" ] }, { "cell_type": "markdown", "id": "b4ba7548-f51d-43cc-befb-3280219a860b", "metadata": { "execution": { "iopub.execute_input": 
"2022-06-07T04:06:33.188298Z", "iopub.status.busy": "2022-06-07T04:06:33.187754Z", "iopub.status.idle": "2022-06-07T04:06:33.218642Z", "shell.execute_reply": "2022-06-07T04:06:33.217761Z", "shell.execute_reply.started": "2022-06-07T04:06:33.188269Z" }, "tags": [] }, "source": [ "### Create target column (y)" ] }, { "cell_type": "code", "execution_count": 3, "id": "6f9c6b2a-d6b1-4ba9-9bdf-07fdf36115f9", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:41.613421Z", "iopub.status.busy": "2022-06-07T15:26:41.613145Z", "iopub.status.idle": "2022-06-07T15:26:41.642378Z", "shell.execute_reply": "2022-06-07T15:26:41.641819Z", "shell.execute_reply.started": "2022-06-07T15:26:41.613403Z" }, "tags": [] }, "outputs": [], "source": [ "ymed = df.num_comments.median()\n", "y = pd.Series([1 if val > ymed else 0 for val in df.num_comments])\n", "df.drop('num_comments',axis=1,inplace=True) # get rid of this immediately" ] }, { "cell_type": "markdown", "id": "c4a4af8f-4a83-4d8c-8547-a07aa069a2bd", "metadata": {}, "source": [ "### Lemmatise" ] }, { "cell_type": "code", "execution_count": 4, "id": "a7083ac9-5033-4c43-90b8-815957443793", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:41.643370Z", "iopub.status.busy": "2022-06-07T15:26:41.643114Z", "iopub.status.idle": "2022-06-07T15:26:41.646350Z", "shell.execute_reply": "2022-06-07T15:26:41.645572Z", "shell.execute_reply.started": "2022-06-07T15:26:41.643353Z" } }, "outputs": [], "source": [ "# # Lemmatize and filter out ' ' tokens\n", "# nlp = spacy.load('en_core_web_sm')\n", "# df['title'] = [' '.join([word.lemma_ for word in nlp(title) if word.lemma_ != ' '])\\\n", "# for title in df.title] # This should be optimized" ] }, { "cell_type": "markdown", "id": "138b8776-d5cf-47b5-bd6e-319a5f7c66ae", "metadata": {}, "source": [ "Lemmatising to my surprise seems to add no value. 
I thought this was going to be the most important part, but it turns out it just takes forever to process and adds nothing but run time. I suspect this is because the post titles are so short that there is little meaning to extract. It could still be useful when analyzing comments." ] }, { "cell_type": "markdown", "id": "a43ea9ac-8e67-41ba-b137-16c2c195f744", "metadata": {}, "source": [ "### Create X" ] }, { "cell_type": "code", "execution_count": 5, "id": "358fce01-c93b-4b35-a66f-bbf39d3eabd8", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:41.647434Z", "iopub.status.busy": "2022-06-07T15:26:41.647206Z", "iopub.status.idle": "2022-06-07T15:26:42.472794Z", "shell.execute_reply": "2022-06-07T15:26:42.471961Z", "shell.execute_reply.started": "2022-06-07T15:26:41.647417Z" }, "tags": [] }, "outputs": [], "source": [ "# TF-IDF over post titles, keeping the 500 highest-frequency terms\n", "tf = TfidfVectorizer(stop_words='english',max_features=500)\n", "tfvec = tf.fit(df.title)\n", "# toarray() gives a plain ndarray (todense() returns the legacy np.matrix)\n", "X = pd.DataFrame(tfvec.transform(df.title).toarray(),columns=tfvec.get_feature_names_out())" ] }, { "cell_type": "markdown", "id": "cf0c3334-5795-4f2e-95da-28164de3d36b", "metadata": {}, "source": [ "### Join df with numeric columns" ] }, { "cell_type": "code", "execution_count": 6, "id": "0e1a7188-565d-48c9-8e0a-93553968980c", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:42.473997Z", "iopub.status.busy": "2022-06-07T15:26:42.473562Z", "iopub.status.idle": "2022-06-07T15:26:42.480297Z", "shell.execute_reply": "2022-06-07T15:26:42.479581Z", "shell.execute_reply.started": "2022-06-07T15:26:42.473979Z" }, "tags": [] }, "outputs": [], "source": [ "df = df.join(numerics)\n", "del numerics # done with numerics" ] }, { "cell_type": "markdown", "id": "2248137d-3f32-4bc7-af53-93de51791d80", "metadata": {}, "source": [ "### Create dummies from columns that are objects or booleans" ] }, { "cell_type": "code", "execution_count": 7, "id": "6c14641e-0e0e-46b5-960a-2bb41e84c1dd", "metadata": { "execution": { 
"iopub.execute_input": "2022-06-07T15:26:42.481258Z", "iopub.status.busy": "2022-06-07T15:26:42.481069Z", "iopub.status.idle": "2022-06-07T15:26:49.841545Z", "shell.execute_reply": "2022-06-07T15:26:49.840824Z", "shell.execute_reply.started": "2022-06-07T15:26:42.481241Z" }, "tags": [] }, "outputs": [], "source": [ "# Replace every object/bool column with 0/1 dummy columns (first level dropped)\n", "def make_dummies(df):\n", "    for col_name in df.columns:\n", "        if (df[col_name].dtype == 'O') or (df[col_name].dtype == 'bool'):\n", "            dums = pd.get_dummies(df[col_name],prefix=col_name,dtype=int,drop_first=True)\n", "            df = df.drop(labels=[col_name],axis=1)\n", "            df = df.join(dums)\n", "    return df\n", "\n", "dums = make_dummies(df[df.columns[1:]]) # [1:] excludes first column, 'title'\n", "del df # done with df" ] }, { "cell_type": "markdown", "id": "24b13f04-8b0a-43d2-9f47-0bb0edd8b305", "metadata": {}, "source": [ "### Now join dummies with X" ] }, { "cell_type": "code", "execution_count": 8, "id": "3a2e6910-4023-4e75-bd68-864fcbfc193d", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:49.842408Z", "iopub.status.busy": "2022-06-07T15:26:49.842246Z", "iopub.status.idle": "2022-06-07T15:26:50.247406Z", "shell.execute_reply": "2022-06-07T15:26:50.246837Z", "shell.execute_reply.started": "2022-06-07T15:26:49.842392Z" }, "tags": [] }, "outputs": [], "source": [ "X = X.join(dums)\n", "del dums # done with dums" ] }, { "cell_type": "markdown", "id": "ef94a7c0-dcde-41c1-8a0d-407938f85608", "metadata": {}, "source": [ "### Now model it!" 
] }, { "cell_type": "code", "execution_count": 9, "id": "c5a322ee-a81e-491a-97d2-3d22a2fd89a7", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:50.248820Z", "iopub.status.busy": "2022-06-07T15:26:50.248515Z", "iopub.status.idle": "2022-06-07T15:26:52.034771Z", "shell.execute_reply": "2022-06-07T15:26:52.034058Z", "shell.execute_reply.started": "2022-06-07T15:26:50.248795Z" }, "tags": [] }, "outputs": [], "source": [ "# Do a split\n", "X_train, X_test, y_train, y_test = train_test_split(X,y)\n", "del X\n", "del y" ] }, { "cell_type": "code", "execution_count": 10, "id": "fb1c22e4-57ff-4133-9c36-1c3c3e00c8bd", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:52.035757Z", "iopub.status.busy": "2022-06-07T15:26:52.035519Z", "iopub.status.idle": "2022-06-07T15:28:35.841910Z", "shell.execute_reply": "2022-06-07T15:28:35.841244Z", "shell.execute_reply.started": "2022-06-07T15:26:52.035741Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Create Random Forest...\n", "Create Logistic Regression...\n", "fit RF...\n", "fit KNN...\n", "Scoring...\n", "Score:\t0.62 ± 0.0056\n", "Score:\t0.6 ± 0.0052\n" ] } ], "source": [ "print('Create Random Forest...')\n", "rf = RandomForestClassifier(n_jobs=-1)\n", "# The message says Logistic Regression, but the disabled model below is KNN\n", "print('Create Logistic Regression...')\n", "# knn = KNeighborsClassifier(n_jobs=-1)\n", "print('fit RF...')\n", "model_rf = rf.fit(X_train,y_train)\n", "\n", "print('fit KNN...')\n", "# model_knn = knn.fit(X_train,y_train)\n", "\n", "# Model Scores\n", "# cross_val_score clones and refits the model on each fold, so every call\n", "# below is an independent 3-fold CV on whatever data is passed in\n", "def score(model,X,y):\n", "    cv=StratifiedKFold(n_splits=3,shuffle=True)\n", "    s = cross_val_score(model,X,y,cv=cv) # n_jobs=-1 actually makes it slower here\n", "    print(\"Score:\\t{:0.2} ± {:0.2}\".format(s.mean(), 2 * s.std()))\n", "\n", "print('Scoring...')\n", "score(model_rf,X_train,y_train)\n", "score(model_rf,X_test,y_test)\n", "# score(model_knn,X_train,y_train)\n", "# score(model_knn,X_test,y_test)" ] }, { "cell_type": "markdown", "id": 
"ec845f9b-2df2-475e-b4a0-12d6e4d1a38c", "metadata": {}, "source": [ "### Comparing Models\n", "Random Forest, K Neighbors, and Logistic Regression all score about the same, but Random Forest takes about two minutes to run and the others take far longer. I don't think the data or the problem is complex enough to warrant more than a Random Forest.\n", "\n", "Everything performs about 10 percentage points above the baseline of 50% (the Random Forest cross-val scores above are 0.62 and 0.60). The target is split at the median, so roughly half the posts fall in each class, and predicting 1 across the board would give about 50% accuracy. A cross-val score above 0.50 therefore means the model is picking up some signal, but it doesn't tell us exactly what that signal is." ] }, { "cell_type": "code", "execution_count": 11, "id": "1ea3c8de-1f87-43ef-b589-7fab1552f072", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:28:35.843245Z", "iopub.status.busy": "2022-06-07T15:28:35.842845Z", "iopub.status.idle": "2022-06-07T15:28:35.877909Z", "shell.execute_reply": "2022-06-07T15:28:35.877235Z", "shell.execute_reply.started": "2022-06-07T15:28:35.843225Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>Variable</th><th>Importance</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>500</th><td>post_age</td><td>0.080172</td></tr>\n",
"    <tr><th>502</th><td>norm_score</td><td>0.077290</td></tr>\n",
"    <tr><th>501</th><td>upvote_ratio</td><td>0.047547</td></tr>\n",
"    <tr><th>5545</th><td>is_self_True</td><td>0.020376</td></tr>\n",
"    <tr><th>232</th><td>like</td><td>0.003663</td></tr>\n",
"    <tr><th>212</th><td>just</td><td>0.003490</td></tr>\n",
"    <tr><th>4518</th><td>subreddit_memes</td><td>0.003337</td></tr>\n",
"    <tr><th>428</th><td>time</td><td>0.003296</td></tr>\n",
"    <tr><th>287</th><td>new</td><td>0.003230</td></tr>\n",
"    <tr><th>293</th><td>oc</td><td>0.002981</td></tr>\n",
"    <tr><th>198</th><td>im</td><td>0.002539</td></tr>\n",
"    <tr><th>85</th><td>day</td><td>0.002455</td></tr>\n",
"    <tr><th>157</th><td>got</td><td>0.002369</td></tr>\n",
"    <tr><th>99</th><td>dont</td><td>0.002164</td></tr>\n",
"    <tr><th>431</th><td>today</td><td>0.002146</td></tr>\n",
"    <tr><th>247</th><td>love</td><td>0.002137</td></tr>\n",
"    <tr><th>253</th><td>man</td><td>0.002102</td></tr>\n",
"    <tr><th>156</th><td>good</td><td>0.002101</td></tr>\n",
"    <tr><th>5546</th><td>spoiler_True</td><td>0.002013</td></tr>\n",
"    <tr><th>5544</th><td>is_original_content_True</td><td>0.002012</td></tr>\n",
"    <tr><th>141</th><td>game</td><td>0.001961</td></tr>\n",
"    <tr><th>5027</th><td>subreddit_shitposting</td><td>0.001955</td></tr>\n",
"    <tr><th>11</th><td>art</td><td>0.001921</td></tr>\n",
"    <tr><th>5543</th><td>over_18_True</td><td>0.001905</td></tr>\n",
"    <tr><th>311</th><td>people</td><td>0.001879</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Variable Importance\n", "500 post_age 0.080172\n", "502 norm_score 0.077290\n", "501 upvote_ratio 0.047547\n", "5545 is_self_True 0.020376\n", "232 like 0.003663\n", "212 just 0.003490\n", "4518 subreddit_memes 0.003337\n", "428 time 0.003296\n", "287 new 0.003230\n", "293 oc 0.002981\n", "198 im 0.002539\n", "85 day 0.002455\n", "157 got 0.002369\n", "99 dont 0.002164\n", "431 today 0.002146\n", "247 love 0.002137\n", "253 man 0.002102\n", "156 good 0.002101\n", "5546 spoiler_True 0.002013\n", "5544 is_original_content_True 0.002012\n", "141 game 0.001961\n", "5027 subreddit_shitposting 0.001955\n", "11 art 0.001921\n", "5543 over_18_True 0.001905\n", "311 people 0.001879" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame({'Variable':X_train.columns,\n", "              'Importance':rf.feature_importances_}).sort_values('Importance', ascending=False).head(25)" ] }, { "cell_type": "markdown", "id": "4e664e35-6091-47b9-85e9-384eb3a46dcd", "metadata": {}, "source": [ "Here are the top 25 predictors ranked by importance, that is, by how much influence they have on the model. At the time of writing, the top three are post age, score, and upvote ratio. Of course, if you want a popular post, make it popular, but that isn't useful advice. Importance only says that a feature matters, not which way it pushes the prediction, so the directions below are my own reading. Beyond the top three, it appears that self posts see more activity, that the subreddits memes and shitposting have been very active and popular over the past few days, and that over-18 content is popular. Some key words that may get you to the top are 'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im', 'dont', and 'love'. I don't go on Reddit often, but I'm not exactly a stranger either, and this looks about right. Memes is very popular and a very easy karma farm. People love their OC (and their porn). 
A lot of people on Reddit talk about what's going on in their day ('today', 'day')." ] }, { "cell_type": "markdown", "id": "48b62749-e027-41dd-acb3-32fa1c52eeea", "metadata": {}, "source": [ "This is a fairly simple model with simple data. To go beyond this, I think the comments would have to be analyzed. I thought tokenization would be the most influential piece, and in general I still think that's right. In this case it doesn't apply, because there is little meaning to be had from Reddit post titles, at least for a computer. A human sees a lot more than the text in the title: there's often an image attached, most posts reference a recent or current event, and some are an inside joke of sorts. Some titles contain emojis, and depending on their combination they can take on a meaning completely different from their individual meanings.\n", "\n", "I believe the next step is to analyze the comments sections of these posts, because right now I think that's the easiest way to describe the meaning of a post to a computer. With what was gathered here I'm only able to get about 10 percentage points above baseline, and I think that's about all there is to be had. We could probably tweak for a few more percent, but I don't think there's much left on the table." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 5 }