reddit/model.ipynb

{
"cells": [
{
"cell_type": "markdown",
"id": "f2c2a80e-4240-4e89-ad59-039c8e65cd25",
"metadata": {},
"source": [
"# Process/Model"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7c31603a-9b72-4353-a49d-9ad3b5e245e4",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:40.982440Z",
"iopub.status.busy": "2022-06-07T15:26:40.982213Z",
"iopub.status.idle": "2022-06-07T15:26:41.538590Z",
"shell.execute_reply": "2022-06-07T15:26:41.537956Z",
"shell.execute_reply.started": "2022-06-07T15:26:40.982383Z"
},
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"# import spacy\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold"
]
},
{
"cell_type": "markdown",
"id": "cfc10edd-fec8-439d-a315-44d55056d4a3",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T04:06:14.259265Z",
"iopub.status.busy": "2022-06-07T04:06:14.258568Z",
"iopub.status.idle": "2022-06-07T04:06:14.264153Z",
"shell.execute_reply": "2022-06-07T04:06:14.262980Z",
"shell.execute_reply.started": "2022-06-07T04:06:14.259236Z"
}
},
"source": [
"### Load Files"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3512f32a-cf1f-4763-a260-8970c0dfb58b",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:41.539644Z",
"iopub.status.busy": "2022-06-07T15:26:41.539379Z",
"iopub.status.idle": "2022-06-07T15:26:41.611358Z",
"shell.execute_reply": "2022-06-07T15:26:41.610751Z",
"shell.execute_reply.started": "2022-06-07T15:26:41.539618Z"
},
"tags": []
},
"outputs": [],
"source": [
"df = pd.read_csv('data/df_clean.csv')\n",
"numerics = pd.read_csv('data/numerics_clean.csv')"
]
},
{
"cell_type": "markdown",
"id": "b4ba7548-f51d-43cc-befb-3280219a860b",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T04:06:33.188298Z",
"iopub.status.busy": "2022-06-07T04:06:33.187754Z",
"iopub.status.idle": "2022-06-07T04:06:33.218642Z",
"shell.execute_reply": "2022-06-07T04:06:33.217761Z",
"shell.execute_reply.started": "2022-06-07T04:06:33.188269Z"
},
"tags": []
},
"source": [
"### Create target column (y)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6f9c6b2a-d6b1-4ba9-9bdf-07fdf36115f9",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:41.613421Z",
"iopub.status.busy": "2022-06-07T15:26:41.613145Z",
"iopub.status.idle": "2022-06-07T15:26:41.642378Z",
"shell.execute_reply": "2022-06-07T15:26:41.641819Z",
"shell.execute_reply.started": "2022-06-07T15:26:41.613403Z"
},
"tags": []
},
"outputs": [],
"source": [
"ymed = df.num_comments.median()\n",
"y = pd.Series([1 if val > ymed else 0 for val in df.num_comments])\n",
"df.drop('num_comments',axis=1,inplace=True) # get rid of this immediately"
]
},
{
"cell_type": "markdown",
"id": "c4a4af8f-4a83-4d8c-8547-a07aa069a2bd",
"metadata": {},
"source": [
"### Lemmatise"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a7083ac9-5033-4c43-90b8-815957443793",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:41.643370Z",
"iopub.status.busy": "2022-06-07T15:26:41.643114Z",
"iopub.status.idle": "2022-06-07T15:26:41.646350Z",
"shell.execute_reply": "2022-06-07T15:26:41.645572Z",
"shell.execute_reply.started": "2022-06-07T15:26:41.643353Z"
}
},
"outputs": [],
"source": [
"# # Lemmatize and filter out ' ' tokens\n",
"# nlp = spacy.load('en_core_web_sm')\n",
"# df['title'] = [' '.join([word.lemma_ for word in nlp(title) if word.lemma_ != ' '])\\\n",
"# for title in df.title] # This should be optimized"
]
},
{
"cell_type": "markdown",
"id": "138b8776-d5cf-47b5-bd6e-319a5f7c66ae",
"metadata": {},
"source": [
"Lemmatising to my surprise seems to add no value. I thought this was going to be the most important part, but as it turns out it just takes forever to process and adds nothing but run time. I suspect this is because the post titles are so short that there is no real meaning to be extracted. This could be useful when analyzing comments"
]
},
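{
"cell_type": "markdown",
"id": "d3f0a1b2-1111-4abc-9def-202207270001",
"metadata": {},
"source": [
"A minimal sketch of the optimisation the comment above asks for, in case lemmatising is revisited (e.g. for comments). It assumes spaCy 3 with `en_core_web_sm` installed: `nlp.pipe` streams texts through the pipeline in batches, and disabling the parser and NER keeps only what lemmatisation needs. The mutation line is left disabled, since lemmatising added no value for titles."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3f0a1b2-1111-4abc-9def-202207270002",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only -- assumes spaCy 3 + en_core_web_sm is installed.\n",
"# Batching via nlp.pipe and disabling unused components is much faster\n",
"# than calling nlp(title) once per row.\n",
"import spacy\n",
"\n",
"nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])\n",
"\n",
"def lemmatise(texts, batch_size=256):\n",
"    \"\"\"Return lemmatised copies of `texts`, skipping whitespace tokens.\"\"\"\n",
"    return [' '.join(tok.lemma_ for tok in doc if not tok.is_space)\n",
"            for doc in nlp.pipe(texts, batch_size=batch_size)]\n",
"\n",
"# Left disabled, since lemmatising added no value for titles:\n",
"# df['title'] = lemmatise(df.title)"
]
},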
{
"cell_type": "markdown",
"id": "a43ea9ac-8e67-41ba-b137-16c2c195f744",
"metadata": {},
"source": [
"### Create X"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "358fce01-c93b-4b35-a66f-bbf39d3eabd8",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:41.647434Z",
"iopub.status.busy": "2022-06-07T15:26:41.647206Z",
"iopub.status.idle": "2022-06-07T15:26:42.472794Z",
"shell.execute_reply": "2022-06-07T15:26:42.471961Z",
"shell.execute_reply.started": "2022-06-07T15:26:41.647417Z"
},
"tags": []
},
"outputs": [],
"source": [
"tf = TfidfVectorizer(stop_words='english',max_features=500)\n",
"tfvec = tf.fit(df.title)\n",
"X = pd.DataFrame(tfvec.transform(df.title).todense(),columns=tfvec.get_feature_names_out())"
]
},
{
"cell_type": "markdown",
"id": "cf0c3334-5795-4f2e-95da-28164de3d36b",
"metadata": {},
"source": [
"### Join X with numeric columns"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "0e1a7188-565d-48c9-8e0a-93553968980c",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:42.473997Z",
"iopub.status.busy": "2022-06-07T15:26:42.473562Z",
"iopub.status.idle": "2022-06-07T15:26:42.480297Z",
"shell.execute_reply": "2022-06-07T15:26:42.479581Z",
"shell.execute_reply.started": "2022-06-07T15:26:42.473979Z"
},
"tags": []
},
"outputs": [],
"source": [
"df = df.join(numerics)\n",
"del numerics # done with numerics"
]
},
{
"cell_type": "markdown",
"id": "2248137d-3f32-4bc7-af53-93de51791d80",
"metadata": {},
"source": [
"### Create dummies from columns that are objects or booleans"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "6c14641e-0e0e-46b5-960a-2bb41e84c1dd",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:42.481258Z",
"iopub.status.busy": "2022-06-07T15:26:42.481069Z",
"iopub.status.idle": "2022-06-07T15:26:49.841545Z",
"shell.execute_reply": "2022-06-07T15:26:49.840824Z",
"shell.execute_reply.started": "2022-06-07T15:26:42.481241Z"
},
"tags": []
},
"outputs": [],
"source": [
"def make_dummies(df):\n",
" for col_name in df.columns:\n",
" if (df[col_name].dtype == 'O') or (df[col_name].dtype == 'bool'):\n",
" dums = pd.get_dummies(df[col_name],prefix=col_name,dtype=int,drop_first=True)\n",
" df = df.drop(labels=[col_name],axis=1)\n",
" df = df.join(dums)\n",
" return df\n",
"\n",
"dums = make_dummies(df[df.columns[1:]]) # [1:] excludes first column, 'title'\n",
"del df # done with df"
]
},
{
"cell_type": "markdown",
"id": "24b13f04-8b0a-43d2-9f47-0bb0edd8b305",
"metadata": {},
"source": [
"### Now join dummies with X"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3a2e6910-4023-4e75-bd68-864fcbfc193d",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:49.842408Z",
"iopub.status.busy": "2022-06-07T15:26:49.842246Z",
"iopub.status.idle": "2022-06-07T15:26:50.247406Z",
"shell.execute_reply": "2022-06-07T15:26:50.246837Z",
"shell.execute_reply.started": "2022-06-07T15:26:49.842392Z"
},
"tags": []
},
"outputs": [],
"source": [
"X = X.join(dums)\n",
"del dums # done with dums"
]
},
{
"cell_type": "markdown",
"id": "ef94a7c0-dcde-41c1-8a0d-407938f85608",
"metadata": {},
"source": [
"### Now model it!"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "c5a322ee-a81e-491a-97d2-3d22a2fd89a7",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:50.248820Z",
"iopub.status.busy": "2022-06-07T15:26:50.248515Z",
"iopub.status.idle": "2022-06-07T15:26:52.034771Z",
"shell.execute_reply": "2022-06-07T15:26:52.034058Z",
"shell.execute_reply.started": "2022-06-07T15:26:50.248795Z"
},
"tags": []
},
"outputs": [],
"source": [
"# Do a split\n",
"X_train, X_test, y_train, y_test = train_test_split(X,y)\n",
"del X\n",
"del y"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "fb1c22e4-57ff-4133-9c36-1c3c3e00c8bd",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:52.035757Z",
"iopub.status.busy": "2022-06-07T15:26:52.035519Z",
"iopub.status.idle": "2022-06-07T15:28:35.841910Z",
"shell.execute_reply": "2022-06-07T15:28:35.841244Z",
"shell.execute_reply.started": "2022-06-07T15:26:52.035741Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Create Random Forest...\n",
"Create Logistic Regression...\n",
"fit RF...\n",
"fit KNN...\n",
"Scoring...\n",
"Score:\t0.62 ± 0.0056\n",
"Score:\t0.6 ± 0.0052\n"
]
}
],
"source": [
"print('Create Random Forest...')\n",
"rf = RandomForestClassifier(n_jobs=-1)\n",
"print('Create Logistic Regression...')\n",
"# knn = KNeighborsClassifier(n_jobs=-1)\n",
"print('fit RF...')\n",
"model_rf = rf.fit(X_train,y_train)\n",
"\n",
"print('fit KNN...')\n",
"# model_knn = knn.fit(X_train,y_train)\n",
"\n",
"# Model Scores\n",
"def score(model,X,y):\n",
" cv=StratifiedKFold(n_splits=3,shuffle=True)\n",
" s = cross_val_score(model,X,y,cv=cv) # n_jobs=-1 actually makes it slower here\n",
" print(\"Score:\\t{:0.2} ± {:0.2}\".format(s.mean(), 2 * s.std()))\n",
"\n",
"print('Scoring...')\n",
"score(model_rf,X_train,y_train)\n",
"score(model_rf,X_test,y_test)\n",
"# score(model_knn,X_train,y_train)\n",
"# score(model_knn,X_test,y_test)"
]
},
{
"cell_type": "markdown",
"id": "ec845f9b-2df2-475e-b4a0-12d6e4d1a38c",
"metadata": {},
"source": [
"### Comparing Models\n",
"Between Random Forest, K Neighbors, and LogisticRegression, they all score about the same. But Random Forest takes two minutes to run and the rest take a lifetime. I think the data and the problem isn't complex enough to warrant more than a Random Forest.\n",
"\n",
"Everything seems to perform about 10% above the baseline, which would be 50%. In other words, the target is split cleanly in half, so if the predictions were to be 1s across the board the accuracy would be at 50%. So a cross-val score above 50 means the model is working, but doesn't necessarily give insight on exactly what it's predicting."
]
},
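{
"cell_type": "markdown",
"id": "e4a1b2c3-2222-4abc-9def-202207270003",
"metadata": {},
"source": [
"A quick sanity check of the 50% baseline claim, as a sketch using scikit-learn's `DummyClassifier` (not part of the original pipeline): a majority-class predictor should land at about 0.5 on this median-split target."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4a1b2c3-2222-4abc-9def-202207270004",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.dummy import DummyClassifier\n",
"\n",
"# The target is a median split, so the classes should be ~50/50 and a\n",
"# majority-class 'model' should score ~0.5. Anything above that is signal.\n",
"dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)\n",
"print('Class balance:')\n",
"print(y_train.value_counts(normalize=True))\n",
"print('Baseline accuracy: {:0.2}'.format(dummy.score(X_test, y_test)))"
]
},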
{
"cell_type": "code",
"execution_count": 11,
"id": "1ea3c8de-1f87-43ef-b589-7fab1552f072",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:28:35.843245Z",
"iopub.status.busy": "2022-06-07T15:28:35.842845Z",
"iopub.status.idle": "2022-06-07T15:28:35.877909Z",
"shell.execute_reply": "2022-06-07T15:28:35.877235Z",
"shell.execute_reply.started": "2022-06-07T15:28:35.843225Z"
},
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Variable</th>\n",
" <th>Importance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>500</th>\n",
" <td>post_age</td>\n",
" <td>0.080172</td>\n",
" </tr>\n",
" <tr>\n",
" <th>502</th>\n",
" <td>norm_score</td>\n",
" <td>0.077290</td>\n",
" </tr>\n",
" <tr>\n",
" <th>501</th>\n",
" <td>upvote_ratio</td>\n",
" <td>0.047547</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5545</th>\n",
" <td>is_self_True</td>\n",
" <td>0.020376</td>\n",
" </tr>\n",
" <tr>\n",
" <th>232</th>\n",
" <td>like</td>\n",
" <td>0.003663</td>\n",
" </tr>\n",
" <tr>\n",
" <th>212</th>\n",
" <td>just</td>\n",
" <td>0.003490</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4518</th>\n",
" <td>subreddit_memes</td>\n",
" <td>0.003337</td>\n",
" </tr>\n",
" <tr>\n",
" <th>428</th>\n",
" <td>time</td>\n",
" <td>0.003296</td>\n",
" </tr>\n",
" <tr>\n",
" <th>287</th>\n",
" <td>new</td>\n",
" <td>0.003230</td>\n",
" </tr>\n",
" <tr>\n",
" <th>293</th>\n",
" <td>oc</td>\n",
" <td>0.002981</td>\n",
" </tr>\n",
" <tr>\n",
" <th>198</th>\n",
" <td>im</td>\n",
" <td>0.002539</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>day</td>\n",
" <td>0.002455</td>\n",
" </tr>\n",
" <tr>\n",
" <th>157</th>\n",
" <td>got</td>\n",
" <td>0.002369</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>dont</td>\n",
" <td>0.002164</td>\n",
" </tr>\n",
" <tr>\n",
" <th>431</th>\n",
" <td>today</td>\n",
" <td>0.002146</td>\n",
" </tr>\n",
" <tr>\n",
" <th>247</th>\n",
" <td>love</td>\n",
" <td>0.002137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>253</th>\n",
" <td>man</td>\n",
" <td>0.002102</td>\n",
" </tr>\n",
" <tr>\n",
" <th>156</th>\n",
" <td>good</td>\n",
" <td>0.002101</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5546</th>\n",
" <td>spoiler_True</td>\n",
" <td>0.002013</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5544</th>\n",
" <td>is_original_content_True</td>\n",
" <td>0.002012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>141</th>\n",
" <td>game</td>\n",
" <td>0.001961</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5027</th>\n",
" <td>subreddit_shitposting</td>\n",
" <td>0.001955</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>art</td>\n",
" <td>0.001921</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5543</th>\n",
" <td>over_18_True</td>\n",
" <td>0.001905</td>\n",
" </tr>\n",
" <tr>\n",
" <th>311</th>\n",
" <td>people</td>\n",
" <td>0.001879</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Variable Importance\n",
"500 post_age 0.080172\n",
"502 norm_score 0.077290\n",
"501 upvote_ratio 0.047547\n",
"5545 is_self_True 0.020376\n",
"232 like 0.003663\n",
"212 just 0.003490\n",
"4518 subreddit_memes 0.003337\n",
"428 time 0.003296\n",
"287 new 0.003230\n",
"293 oc 0.002981\n",
"198 im 0.002539\n",
"85 day 0.002455\n",
"157 got 0.002369\n",
"99 dont 0.002164\n",
"431 today 0.002146\n",
"247 love 0.002137\n",
"253 man 0.002102\n",
"156 good 0.002101\n",
"5546 spoiler_True 0.002013\n",
"5544 is_original_content_True 0.002012\n",
"141 game 0.001961\n",
"5027 subreddit_shitposting 0.001955\n",
"11 art 0.001921\n",
"5543 over_18_True 0.001905\n",
"311 people 0.001879"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame({'Variable':X_train.columns,\n",
" 'Importance':rf.feature_importances_}).sort_values('Importance', ascending=False).head(25)"
]
},
{
"cell_type": "markdown",
"id": "4e664e35-6091-47b9-85e9-384eb3a46dcd",
"metadata": {},
"source": [
"Here are the top 25 predictors scored by importance, or the amount of influence they have. At this time of writing, at the top is age, score, upvote ratio. Of course if you want to have a popular post, make it popular, but we can't just say that. Beyond that it appears that self posts see more activity, the subreddits memes and shitposting have been very active and popular the past few days, over 18 content is popular. And some key words that may get you to the top are 'like','just','time','new','oc','good','got','day','today','im','dont', and 'love'. I don't go on Reddit often but I'm not exactly a stranger either, this looks about right. Memes is very popular, and a very easy karma farm. People love their OC (and their porn). A lot of people on reddit talk about what's going on in their day ('today', 'day')"
]
},
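{
"cell_type": "markdown",
"id": "f5b2c3d4-3333-4abc-9def-202207270005",
"metadata": {},
"source": [
"To make that claim concrete, a small sketch (not in the original analysis) that sums importance by feature group. Since `X` was built as 500 TF-IDF token columns (`max_features=500`) joined with the numeric and dummy columns, the first 500 columns of `X_train` are the text features."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5b2c3d4-3333-4abc-9def-202207270006",
"metadata": {},
"outputs": [],
"source": [
"# Aggregate importance by feature group: the first 500 columns are the\n",
"# TF-IDF tokens, everything after is numeric/dummy metadata.\n",
"imp = pd.Series(rf.feature_importances_, index=X_train.columns)\n",
"print('TF-IDF tokens:\\t\\t{:0.2}'.format(imp.iloc[:500].sum()))\n",
"print('Numeric + dummies:\\t{:0.2}'.format(imp.iloc[500:].sum()))"
]
},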
{
"cell_type": "markdown",
"id": "48b62749-e027-41dd-acb3-32fa1c52eeea",
"metadata": {},
"source": [
"This is a fairly simple model, with simple data. To go beyond this I think the comments would have to be analyzed. Tokenization I thought would be the most influential piece, and I still think that thinking is correct. But in this case it doesn't apply because there is no real meaning to be had from reddit post titles, at least to a computer. There's a lot more seen by a human than just the text in the title, there's often an image attached, most posts reference a recent/current event, they could be an inside joke of sorts. For some posts there could be emojis in the title, and depending on their combination they can take on a meaning completely different from their individual meanings. \n",
"\n",
"The next step from here I believe is to analyze the comments section of these posts because in this moment I think that's the easiest way to truly describe the meaning of a post to a computer. With what was gathered here I'm only to get 10% above baseline and I think that's all there is to be had here, I mean we can tweak for a few percent probably but I don't think there's much left on the table."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}