{ "cells": [ { "cell_type": "markdown", "id": "f2c2a80e-4240-4e89-ad59-039c8e65cd25", "metadata": {}, "source": [ "# Process/Model" ] }, { "cell_type": "code", "execution_count": 1, "id": "7c31603a-9b72-4353-a49d-9ad3b5e245e4", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:40.982440Z", "iopub.status.busy": "2022-06-07T15:26:40.982213Z", "iopub.status.idle": "2022-06-07T15:26:41.538590Z", "shell.execute_reply": "2022-06-07T15:26:41.537956Z", "shell.execute_reply.started": "2022-06-07T15:26:40.982383Z" }, "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "# import spacy\n", "\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold" ] }, { "cell_type": "markdown", "id": "cfc10edd-fec8-439d-a315-44d55056d4a3", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T04:06:14.259265Z", "iopub.status.busy": "2022-06-07T04:06:14.258568Z", "iopub.status.idle": "2022-06-07T04:06:14.264153Z", "shell.execute_reply": "2022-06-07T04:06:14.262980Z", "shell.execute_reply.started": "2022-06-07T04:06:14.259236Z" } }, "source": [ "### Load Files" ] }, { "cell_type": "code", "execution_count": 2, "id": "3512f32a-cf1f-4763-a260-8970c0dfb58b", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:41.539644Z", "iopub.status.busy": "2022-06-07T15:26:41.539379Z", "iopub.status.idle": "2022-06-07T15:26:41.611358Z", "shell.execute_reply": "2022-06-07T15:26:41.610751Z", "shell.execute_reply.started": "2022-06-07T15:26:41.539618Z" }, "tags": [] }, "outputs": [], "source": [ "df = pd.read_csv('data/df_clean.csv')\n", "numerics = pd.read_csv('data/numerics_clean.csv')" ] }, { "cell_type": "markdown", "id": "b4ba7548-f51d-43cc-befb-3280219a860b", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T04:06:33.188298Z", "iopub.status.busy": "2022-06-07T04:06:33.187754Z", "iopub.status.idle": "2022-06-07T04:06:33.218642Z", "shell.execute_reply": "2022-06-07T04:06:33.217761Z", "shell.execute_reply.started": "2022-06-07T04:06:33.188269Z" }, "tags": [] }, "source": [ "### Create target column (y)" ] }, { "cell_type": "code", "execution_count": 3, "id": "6f9c6b2a-d6b1-4ba9-9bdf-07fdf36115f9", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:41.613421Z", "iopub.status.busy": "2022-06-07T15:26:41.613145Z", "iopub.status.idle": "2022-06-07T15:26:41.642378Z", "shell.execute_reply": "2022-06-07T15:26:41.641819Z", "shell.execute_reply.started": "2022-06-07T15:26:41.613403Z" }, "tags": [] }, "outputs": [], "source": [ "ymed = df.num_comments.median()\n", "y = pd.Series([1 if val > ymed else 0 for val in df.num_comments])\n", "df.drop('num_comments',axis=1,inplace=True) # get rid of this immediately" ] }, { "cell_type": "markdown", "id": "c4a4af8f-4a83-4d8c-8547-a07aa069a2bd", "metadata": {}, "source": [ "### Lemmatise" ] }, { "cell_type": "code", "execution_count": 4, "id": "a7083ac9-5033-4c43-90b8-815957443793", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:41.643370Z", "iopub.status.busy": "2022-06-07T15:26:41.643114Z", "iopub.status.idle": "2022-06-07T15:26:41.646350Z", "shell.execute_reply": "2022-06-07T15:26:41.645572Z", "shell.execute_reply.started": "2022-06-07T15:26:41.643353Z" } }, "outputs": [], "source": [ "# # Lemmatize and filter out ' ' tokens\n", "# nlp = spacy.load('en_core_web_sm')\n", "# df['title'] = [' '.join([word.lemma_ for word in nlp(title) if word.lemma_ != ' '])\\\n", "# for title in df.title] # This should be optimized" ] }, { "cell_type": "markdown", "id": "138b8776-d5cf-47b5-bd6e-319a5f7c66ae", "metadata": {}, "source": [ "Lemmatising to my surprise seems to add no value. I thought this was going to be the most important part, but as it turns out it just takes forever to process and adds nothing but run time. I suspect this is because the post titles are so short that there is no real meaning to be extracted. This could be useful when analyzing comments" ] }, { "cell_type": "markdown", "id": "a43ea9ac-8e67-41ba-b137-16c2c195f744", "metadata": {}, "source": [ "### Create X" ] }, { "cell_type": "code", "execution_count": 5, "id": "358fce01-c93b-4b35-a66f-bbf39d3eabd8", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:41.647434Z", "iopub.status.busy": "2022-06-07T15:26:41.647206Z", "iopub.status.idle": "2022-06-07T15:26:42.472794Z", "shell.execute_reply": "2022-06-07T15:26:42.471961Z", "shell.execute_reply.started": "2022-06-07T15:26:41.647417Z" }, "tags": [] }, "outputs": [], "source": [ "tf = TfidfVectorizer(stop_words='english',max_features=500)\n", "tfvec = tf.fit(df.title)\n", "X = pd.DataFrame(tfvec.transform(df.title).todense(),columns=tfvec.get_feature_names_out())" ] }, { "cell_type": "markdown", "id": "cf0c3334-5795-4f2e-95da-28164de3d36b", "metadata": {}, "source": [ "### Join X with numeric columns" ] }, { "cell_type": "code", "execution_count": 6, "id": "0e1a7188-565d-48c9-8e0a-93553968980c", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:42.473997Z", "iopub.status.busy": "2022-06-07T15:26:42.473562Z", "iopub.status.idle": "2022-06-07T15:26:42.480297Z", "shell.execute_reply": "2022-06-07T15:26:42.479581Z", "shell.execute_reply.started": "2022-06-07T15:26:42.473979Z" }, "tags": [] }, "outputs": [], "source": [ "df = df.join(numerics)\n", "del numerics # done with numerics" ] }, { "cell_type": "markdown", "id": "2248137d-3f32-4bc7-af53-93de51791d80", "metadata": {}, "source": [ "### Create dummies from columns that are objects or booleans" ] }, { "cell_type": "code", "execution_count": 7, "id": "6c14641e-0e0e-46b5-960a-2bb41e84c1dd", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:42.481258Z", "iopub.status.busy": "2022-06-07T15:26:42.481069Z", "iopub.status.idle": "2022-06-07T15:26:49.841545Z", "shell.execute_reply": "2022-06-07T15:26:49.840824Z", "shell.execute_reply.started": "2022-06-07T15:26:42.481241Z" }, "tags": [] }, "outputs": [], "source": [ "def make_dummies(df):\n", " for col_name in df.columns:\n", " if (df[col_name].dtype == 'O') or (df[col_name].dtype == 'bool'):\n", " dums = pd.get_dummies(df[col_name],prefix=col_name,dtype=int,drop_first=True)\n", " df = df.drop(labels=[col_name],axis=1)\n", " df = df.join(dums)\n", " return df\n", "\n", "dums = make_dummies(df[df.columns[1:]]) # [1:] excludes first column, 'title'\n", "del df # done with df" ] }, { "cell_type": "markdown", "id": "24b13f04-8b0a-43d2-9f47-0bb0edd8b305", "metadata": {}, "source": [ "### Now join dummies with X" ] }, { "cell_type": "code", "execution_count": 8, "id": "3a2e6910-4023-4e75-bd68-864fcbfc193d", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:49.842408Z", "iopub.status.busy": "2022-06-07T15:26:49.842246Z", "iopub.status.idle": "2022-06-07T15:26:50.247406Z", "shell.execute_reply": "2022-06-07T15:26:50.246837Z", "shell.execute_reply.started": "2022-06-07T15:26:49.842392Z" }, "tags": [] }, "outputs": [], "source": [ "X = X.join(dums)\n", "del dums # done with dums" ] }, { "cell_type": "markdown", "id": "ef94a7c0-dcde-41c1-8a0d-407938f85608", "metadata": {}, "source": [ "### Now model it!" ] }, { "cell_type": "code", "execution_count": 9, "id": "c5a322ee-a81e-491a-97d2-3d22a2fd89a7", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:50.248820Z", "iopub.status.busy": "2022-06-07T15:26:50.248515Z", "iopub.status.idle": "2022-06-07T15:26:52.034771Z", "shell.execute_reply": "2022-06-07T15:26:52.034058Z", "shell.execute_reply.started": "2022-06-07T15:26:50.248795Z" }, "tags": [] }, "outputs": [], "source": [ "# Do a split\n", "X_train, X_test, y_train, y_test = train_test_split(X,y)\n", "del X\n", "del y" ] }, { "cell_type": "code", "execution_count": 10, "id": "fb1c22e4-57ff-4133-9c36-1c3c3e00c8bd", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:26:52.035757Z", "iopub.status.busy": "2022-06-07T15:26:52.035519Z", "iopub.status.idle": "2022-06-07T15:28:35.841910Z", "shell.execute_reply": "2022-06-07T15:28:35.841244Z", "shell.execute_reply.started": "2022-06-07T15:26:52.035741Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Create Random Forest...\n", "Create Logistic Regression...\n", "fit RF...\n", "fit KNN...\n", "Scoring...\n", "Score:\t0.62 ± 0.0056\n", "Score:\t0.6 ± 0.0052\n" ] } ], "source": [ "print('Create Random Forest...')\n", "rf = RandomForestClassifier(n_jobs=-1)\n", "print('Create Logistic Regression...')\n", "# knn = KNeighborsClassifier(n_jobs=-1)\n", "print('fit RF...')\n", "model_rf = rf.fit(X_train,y_train)\n", "\n", "print('fit KNN...')\n", "# model_knn = knn.fit(X_train,y_train)\n", "\n", "# Model Scores\n", "def score(model,X,y):\n", " cv=StratifiedKFold(n_splits=3,shuffle=True)\n", " s = cross_val_score(model,X,y,cv=cv) # n_jobs=-1 actually makes it slower here\n", " print(\"Score:\\t{:0.2} ± {:0.2}\".format(s.mean(), 2 * s.std()))\n", "\n", "print('Scoring...')\n", "score(model_rf,X_train,y_train)\n", "score(model_rf,X_test,y_test)\n", "# score(model_knn,X_train,y_train)\n", "# score(model_knn,X_test,y_test)" ] }, { "cell_type": "markdown", "id": "ec845f9b-2df2-475e-b4a0-12d6e4d1a38c", "metadata": {}, "source": [ "### Comparing Models\n", "Between Random Forest, K Neighbors, and LogisticRegression, they all score about the same. But Random Forest takes two minutes to run and the rest take a lifetime. I think the data and the problem isn't complex enough to warrant more than a Random Forest.\n", "\n", "Everything seems to perform about 10% above the baseline, which would be 50%. In other words, the target is split cleanly in half, so if the predictions were to be 1s across the board the accuracy would be at 50%. So a cross-val score above 50 means the model is working, but doesn't necessarily give insight on exactly what it's predicting." ] }, { "cell_type": "code", "execution_count": 11, "id": "1ea3c8de-1f87-43ef-b589-7fab1552f072", "metadata": { "execution": { "iopub.execute_input": "2022-06-07T15:28:35.843245Z", "iopub.status.busy": "2022-06-07T15:28:35.842845Z", "iopub.status.idle": "2022-06-07T15:28:35.877909Z", "shell.execute_reply": "2022-06-07T15:28:35.877235Z", "shell.execute_reply.started": "2022-06-07T15:28:35.843225Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", " | Variable | \n", "Importance | \n", "
---|---|---|
500 | \n", "post_age | \n", "0.080172 | \n", "
502 | \n", "norm_score | \n", "0.077290 | \n", "
501 | \n", "upvote_ratio | \n", "0.047547 | \n", "
5545 | \n", "is_self_True | \n", "0.020376 | \n", "
232 | \n", "like | \n", "0.003663 | \n", "
212 | \n", "just | \n", "0.003490 | \n", "
4518 | \n", "subreddit_memes | \n", "0.003337 | \n", "
428 | \n", "time | \n", "0.003296 | \n", "
287 | \n", "new | \n", "0.003230 | \n", "
293 | \n", "oc | \n", "0.002981 | \n", "
198 | \n", "im | \n", "0.002539 | \n", "
85 | \n", "day | \n", "0.002455 | \n", "
157 | \n", "got | \n", "0.002369 | \n", "
99 | \n", "dont | \n", "0.002164 | \n", "
431 | \n", "today | \n", "0.002146 | \n", "
247 | \n", "love | \n", "0.002137 | \n", "
253 | \n", "man | \n", "0.002102 | \n", "
156 | \n", "good | \n", "0.002101 | \n", "
5546 | \n", "spoiler_True | \n", "0.002013 | \n", "
5544 | \n", "is_original_content_True | \n", "0.002012 | \n", "
141 | \n", "game | \n", "0.001961 | \n", "
5027 | \n", "subreddit_shitposting | \n", "0.001955 | \n", "
11 | \n", "art | \n", "0.001921 | \n", "
5543 | \n", "over_18_True | \n", "0.001905 | \n", "
311 | \n", "people | \n", "0.001879 | \n", "