commit 3bfeb55c48
Adam, 2022-07-27 13:09:15 -04:00
11 changed files with 286990 additions and 0 deletions

.gitignore (vendored, new file, +4 lines)

@@ -0,0 +1,4 @@
sekrit
.DS_Store
.ipynb_checkpoints
*/.ipynb_checkpoints

EDA.ipynb (new file, +885 lines)

File diff suppressed because one or more lines are too long

README.md (new file, +70 lines)

@@ -0,0 +1,70 @@
What Goes Into a Successful Reddit Post?
======================================
Here I'll pull and explore some data from Reddit, and I'll attempt to find out what makes a Reddit post successful using classification: in particular, a Random Forest model and a K-Nearest Neighbors model. Then I'll score the results and see what the strongest predictors are.
To find what goes into making a successful Reddit post we'll have to do a few things, the first of which is collecting data:
## Introducing Scrapey!
[Scrapey](scrapey.ipynb) takes a snapshot of [Reddit/r/all hot](https://www.reddit.com/r/all) about every 12 minutes and saves the data, including a calculated age for each post, to a .csv file. Each iteration takes about 2 minutes to run and adds roughly 100 unique posts to the list while updating any posts it has already seen.
I run this in the background in a terminal and it updates my data set every ~12 minutes, so I have a record of every post up to within about 12 minutes of it disappearing from /r/all.
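The core of the pull is just PRAW feeding pandas. Here's a minimal sketch, condensed from [scrapey.ipynb](scrapey.ipynb); the real loop grabs many more fields, updates the .csv in place, and sleeps on a timer:

```python
import praw
import pandas as pd

# Fill in with your own API credentials
reddit = praw.Reddit(client_id='...', client_secret='...',
                     user_agent='totally_not_a_bot')

# One dict per post currently on /r/all hot
pull = pd.DataFrame({
    'id': post.id,
    'title': post.title,
    'subreddit': post.subreddit,
    'num_comments': post.num_comments,
    'score': post.score,
} for post in reddit.subreddit('all').hot(limit=None))
```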
## EDA
Chosen Features (loading these is sketched below):
* Title
* Subreddit
* Over_18
* Is_Original_Content
* Is_Self
* Spoiler
* Locked
* Stickied
* Num_Comments (Target)
[EDA](EDA.ipynb)
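Loading just those columns from the scraped data is one line of pandas (a sketch; the column names match what Scrapey saves):

```python
import pandas as pd

cols = ['title', 'subreddit', 'over_18', 'is_original_content',
        'is_self', 'spoiler', 'locked', 'stickied', 'num_comments']
df = pd.read_csv('data/startingover.csv', usecols=cols)
df.head()
```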
## Clean
* Scale numeric features between 0 and 1
* Convert '_' and '-' to whitespace
* Remove anything that is not a letter or whitespace
* Strip any leftover whitespace
* Delete any titles that were reduced to empty strings (a condensed sketch of these steps follows)
[Clean](clean.ipynb)
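Condensed from the notebook, the title cleanup looks like this:

```python
import pandas as pd

df = pd.read_csv('data/workingdf.csv')

# Lowercase and turn '_' and '-' into spaces
df['title'] = [t.lower().replace('_', ' ').replace('-', ' ') for t in df.title]
# Remove anything that is not a letter or whitespace
df['title'] = df.title.replace(r'[^a-zA-Z\s]', '', regex=True)
# Strip leftover whitespace, then drop titles that are now empty
df['title'] = [t.strip() for t in df.title]
df.drop(df[df.title == ''].index, inplace=True)
```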
## Model
* Split the data into train/test samples
* Test RandomForestClassifier and KNeighborsClassifier (a condensed sketch follows)
* My preferred model is Random Forest
[Model](model.ipynb)
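A minimal sketch of the modeling step, condensed from the notebook (here X is the TF-IDF title matrix joined with the numeric and dummy features, and y marks whether a post's comment count beats the median):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

rf = RandomForestClassifier(n_jobs=-1).fit(X_train, y_train)

# Cross-validated accuracy on the held-out split (baseline is ~0.5)
cv = StratifiedKFold(n_splits=3, shuffle=True)
print(cross_val_score(rf, X_test, y_test, cv=cv).mean())
```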
### Conclusion
Long story short, if you want to karma farm then post OC memes and porn.
Some Predictors from Top 25:
* Is_Self
* Subreddit_Memes
* OC
* Over_18
* Subreddit_Shitposting
* Is_Original_Content
* Subreddit_Superstonk
Popular words:
'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im', 'dont', and 'love'.
People on Reddit (at least over the past few days) like their memes, porn, and talking about their day, and they prefer content that is original and self-posted.
So yes, post your memes to memes and shitposting, tag them NSFW, use some words from the list, and rake in all that sweet karma!
But it's not that simple: this is a fairly simple model, with simple data. To go beyond it I think the comments would have to be analyzed. I expected tokenization to be the most influential piece, and I still think that instinct is correct, but it doesn't apply here because there is no real meaning to be extracted from Reddit post titles, at least by a computer. A human sees a lot more than just the text in the title: there's often an image attached, most posts reference a recent or current event, and some are an inside joke of sorts. Some titles contain emojis, and depending on their combination they can take on a meaning completely different from their individual meanings.
The next step from here, I believe, is to analyze the comments sections of these posts, because right now I think that's the easiest way to truly describe the meaning of a post to a computer. With what was gathered here I'm only able to get about 10% above baseline, and I think that's all there is to be had; we could probably tweak out a few more percent, but I don't think much is left on the table.

clean.ipynb (new file, +290 lines)

@@ -0,0 +1,290 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "7b689e03-f1fa-4a63-b3e2-7342f71b4c1f",
"metadata": {},
"source": [
"# Cleaning/Processing\n",
"## Numerics\n",
"The numeric columns added in the previous step need to be scaled so they don't end up with higher influence than other features."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "85e6ae15-a4d7-474f-918f-a81b790a0ef7",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:19.692651Z",
"iopub.status.busy": "2022-06-07T15:26:19.692290Z",
"iopub.status.idle": "2022-06-07T15:26:19.952952Z",
"shell.execute_reply": "2022-06-07T15:26:19.952223Z",
"shell.execute_reply.started": "2022-06-07T15:26:19.692571Z"
},
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"numerics = pd.read_csv('data/numerics.csv')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "326de842-8fd1-4d1b-8696-6f1f7c6b916f",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:19.954589Z",
"iopub.status.busy": "2022-06-07T15:26:19.954241Z",
"iopub.status.idle": "2022-06-07T15:26:19.960763Z",
"shell.execute_reply": "2022-06-07T15:26:19.960063Z",
"shell.execute_reply.started": "2022-06-07T15:26:19.954558Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score min/max: 45 / 193501\n",
"Post Age min/max: 14846.5434589386 / 101274.16840600967\n",
"Upvote Ratio min/max: 0.51 / 1.0\n",
"\n"
]
}
],
"source": [
"print(f'\\\n",
"Score min/max: {numerics.score.min()} / {numerics.score.max()}\\n\\\n",
"Post Age min/max: {numerics.post_age.min()} / {numerics.post_age.max()}\\n\\\n",
"Upvote Ratio min/max: {numerics.upvote_ratio.min()} / {numerics.upvote_ratio.max()}\\n\\\n",
"')"
]
},
{
"cell_type": "markdown",
"id": "42f191ae-2c1e-4471-9342-b25dea2be7a0",
"metadata": {},
"source": [
"Upvote Ratio is already between 0 and 1 so there's 1/3 of the work out the way for free"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "eb3a9ce5-d4eb-47a8-bdbe-aaf34b1b948b",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:19.963570Z",
"iopub.status.busy": "2022-06-07T15:26:19.963241Z",
"iopub.status.idle": "2022-06-07T15:26:19.970133Z",
"shell.execute_reply": "2022-06-07T15:26:19.969493Z",
"shell.execute_reply.started": "2022-06-07T15:26:19.963543Z"
},
"tags": []
},
"outputs": [],
"source": [
"def normalize_numerics(col):\n",
" col_max = numerics[col].max()\n",
" return [(val/col_max) for val in numerics[col]]"
]
},
{
"cell_type": "markdown",
"id": "8c77c16b-abff-40f3-9621-78ca90096c70",
"metadata": {},
"source": [
"I wasn't sure how often I would need to do this so I wrote a function"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2e83a18d-9a20-4f18-8da5-dd1ad9a35ca1",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:19.971475Z",
"iopub.status.busy": "2022-06-07T15:26:19.971081Z",
"iopub.status.idle": "2022-06-07T15:26:20.010176Z",
"shell.execute_reply": "2022-06-07T15:26:20.009339Z",
"shell.execute_reply.started": "2022-06-07T15:26:19.971448Z"
},
"tags": []
},
"outputs": [],
"source": [
"numerics['norm_score'] = normalize_numerics('score')\n",
"numerics = numerics.drop('score',axis=1) # Prevent name collision with column for word 'score'\n",
"numerics['post_age'] = normalize_numerics('post_age')"
]
},
{
"cell_type": "markdown",
"id": "c042889a-c6ac-4dfa-92ed-28a668957296",
"metadata": {},
"source": [
"So if we check again..."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "17b0baf9-04bf-486c-b94f-5ea9cd801c37",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:20.011272Z",
"iopub.status.busy": "2022-06-07T15:26:20.011063Z",
"iopub.status.idle": "2022-06-07T15:26:20.016210Z",
"shell.execute_reply": "2022-06-07T15:26:20.015608Z",
"shell.execute_reply.started": "2022-06-07T15:26:20.011255Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score min/max: 0.0002325569376902445 / 1.0\n",
"Post Age min/max: 0.14659753511298737 / 1.0\n",
"Upvote Ratio min/max: 0.51 / 1.0\n",
"\n"
]
}
],
"source": [
"print(f'\\\n",
"Score min/max: {numerics.norm_score.min()} / {numerics.norm_score.max()}\\n\\\n",
"Post Age min/max: {numerics.post_age.min()} / {numerics.post_age.max()}\\n\\\n",
"Upvote Ratio min/max: {numerics.upvote_ratio.min()} / {numerics.upvote_ratio.max()}\\n\\\n",
"')"
]
},
{
"cell_type": "markdown",
"id": "25157505-bc47-433e-8b5f-95a73d728d74",
"metadata": {},
"source": [
"It's less human readable but better for the model, who is not human"
]
},
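{
"cell_type": "markdown",
"id": "7d2f4a9e-1b3c-4d5e-8f6a-0b1c2d3e4f5a",
"metadata": {},
"source": [
"An aside: `normalize_numerics` divides by the column max, so values land in (0, 1] rather than being stretched to exactly [0, 1]. A hypothetical drop-in alternative (not used here) is sklearn's `MinMaxScaler`, which does a true min-max scale:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e3a5b0f-2c4d-4e6f-9a7b-1c2d3e4f5a6b",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only, not run here: true min-max scaling, (x - min) / (max - min)\n",
"# from sklearn.preprocessing import MinMaxScaler\n",
"# numerics[['norm_score','post_age']] = MinMaxScaler().fit_transform(\n",
"#     numerics[['norm_score','post_age']])"
]
},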
{
"cell_type": "markdown",
"id": "e65e348b-7023-40b7-8e14-1fff5950a6ed",
"metadata": {},
"source": [
"## Titles"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "13180eef-55b3-4faa-adee-7c6f6ce4b81c",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:20.017202Z",
"iopub.status.busy": "2022-06-07T15:26:20.017007Z",
"iopub.status.idle": "2022-06-07T15:26:20.131445Z",
"shell.execute_reply": "2022-06-07T15:26:20.130521Z",
"shell.execute_reply.started": "2022-06-07T15:26:20.017184Z"
},
"tags": []
},
"outputs": [],
"source": [
"df = pd.read_csv('data/workingdf.csv')"
]
},
{
"cell_type": "markdown",
"id": "7eb7f69a-f634-4745-ad02-9ffb4d3a662c",
"metadata": {},
"source": [
"Before beginning tokenizing, the random garbage that will end up producing gibberish will need to be removed. Things like emojis, punctuation and special characters, accents, etc. I've decided to replace underscores and hyphens with whitespace, then remove anything that is not a letter or whitespace, and finally strip all extra whitespace."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "44f5369e-3bb8-4f9d-b176-50ee5f5c07e2",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:20.132495Z",
"iopub.status.busy": "2022-06-07T15:26:20.132306Z",
"iopub.status.idle": "2022-06-07T15:26:20.279374Z",
"shell.execute_reply": "2022-06-07T15:26:20.278619Z",
"shell.execute_reply.started": "2022-06-07T15:26:20.132478Z"
},
"tags": []
},
"outputs": [],
"source": [
"df['title'] = [title.lower().replace('_',' ').replace('-',' ') for title in df.title]\n",
"df['title'] = df.title.replace(\"[^a-zA-Z\\s]\",'',regex=True)\n",
"df['title'] = [title.strip() for title in df.title]\n",
"df.drop(df[df.title==''].index,inplace=True) # drop now empty titles"
]
},
{
"cell_type": "markdown",
"id": "15415269-6a3b-47eb-bfa3-b45f23195b4a",
"metadata": {},
"source": [
"# Save"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "de7c67d6-22a8-456c-8e89-d22e789e0f87",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:20.280467Z",
"iopub.status.busy": "2022-06-07T15:26:20.280158Z",
"iopub.status.idle": "2022-06-07T15:26:20.649227Z",
"shell.execute_reply": "2022-06-07T15:26:20.648747Z",
"shell.execute_reply.started": "2022-06-07T15:26:20.280449Z"
},
"tags": []
},
"outputs": [],
"source": [
"df.to_csv('data/df_clean.csv',index=False)\n",
"numerics.to_csv('data/numerics_clean.csv',index=False)"
]
},
{
"cell_type": "markdown",
"id": "425bc75b-f459-4f38-96b1-99e3414e4226",
"metadata": {},
"source": [
"Next! [Model](model.ipynb)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

data/df_clean.csv (new file, +48028 lines)

File diff suppressed because it is too large

data/numerics.csv (new file, +48545 lines)

File diff suppressed because it is too large

data/numerics_clean.csv (new file, +48545 lines)

File diff suppressed because it is too large

data/startingover.csv (new file, +91261 lines)

File diff suppressed because it is too large

data/workingdf.csv (new file, +48545 lines)

File diff suppressed because it is too large

model.ipynb (new file, +611 lines)

@@ -0,0 +1,611 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f2c2a80e-4240-4e89-ad59-039c8e65cd25",
"metadata": {},
"source": [
"# Process/Model"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7c31603a-9b72-4353-a49d-9ad3b5e245e4",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:40.982440Z",
"iopub.status.busy": "2022-06-07T15:26:40.982213Z",
"iopub.status.idle": "2022-06-07T15:26:41.538590Z",
"shell.execute_reply": "2022-06-07T15:26:41.537956Z",
"shell.execute_reply.started": "2022-06-07T15:26:40.982383Z"
},
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"# import spacy\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold"
]
},
{
"cell_type": "markdown",
"id": "cfc10edd-fec8-439d-a315-44d55056d4a3",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T04:06:14.259265Z",
"iopub.status.busy": "2022-06-07T04:06:14.258568Z",
"iopub.status.idle": "2022-06-07T04:06:14.264153Z",
"shell.execute_reply": "2022-06-07T04:06:14.262980Z",
"shell.execute_reply.started": "2022-06-07T04:06:14.259236Z"
}
},
"source": [
"### Load Files"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3512f32a-cf1f-4763-a260-8970c0dfb58b",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:41.539644Z",
"iopub.status.busy": "2022-06-07T15:26:41.539379Z",
"iopub.status.idle": "2022-06-07T15:26:41.611358Z",
"shell.execute_reply": "2022-06-07T15:26:41.610751Z",
"shell.execute_reply.started": "2022-06-07T15:26:41.539618Z"
},
"tags": []
},
"outputs": [],
"source": [
"df = pd.read_csv('data/df_clean.csv')\n",
"numerics = pd.read_csv('data/numerics_clean.csv')"
]
},
{
"cell_type": "markdown",
"id": "b4ba7548-f51d-43cc-befb-3280219a860b",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T04:06:33.188298Z",
"iopub.status.busy": "2022-06-07T04:06:33.187754Z",
"iopub.status.idle": "2022-06-07T04:06:33.218642Z",
"shell.execute_reply": "2022-06-07T04:06:33.217761Z",
"shell.execute_reply.started": "2022-06-07T04:06:33.188269Z"
},
"tags": []
},
"source": [
"### Create target column (y)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6f9c6b2a-d6b1-4ba9-9bdf-07fdf36115f9",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:41.613421Z",
"iopub.status.busy": "2022-06-07T15:26:41.613145Z",
"iopub.status.idle": "2022-06-07T15:26:41.642378Z",
"shell.execute_reply": "2022-06-07T15:26:41.641819Z",
"shell.execute_reply.started": "2022-06-07T15:26:41.613403Z"
},
"tags": []
},
"outputs": [],
"source": [
"ymed = df.num_comments.median()\n",
"y = pd.Series([1 if val > ymed else 0 for val in df.num_comments])\n",
"df.drop('num_comments',axis=1,inplace=True) # get rid of this immediately"
]
},
{
"cell_type": "markdown",
"id": "c4a4af8f-4a83-4d8c-8547-a07aa069a2bd",
"metadata": {},
"source": [
"### Lemmatise"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a7083ac9-5033-4c43-90b8-815957443793",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:41.643370Z",
"iopub.status.busy": "2022-06-07T15:26:41.643114Z",
"iopub.status.idle": "2022-06-07T15:26:41.646350Z",
"shell.execute_reply": "2022-06-07T15:26:41.645572Z",
"shell.execute_reply.started": "2022-06-07T15:26:41.643353Z"
}
},
"outputs": [],
"source": [
"# # Lemmatize and filter out ' ' tokens\n",
"# nlp = spacy.load('en_core_web_sm')\n",
"# df['title'] = [' '.join([word.lemma_ for word in nlp(title) if word.lemma_ != ' '])\\\n",
"# for title in df.title] # This should be optimized"
]
},
{
"cell_type": "markdown",
"id": "138b8776-d5cf-47b5-bd6e-319a5f7c66ae",
"metadata": {},
"source": [
"Lemmatising to my surprise seems to add no value. I thought this was going to be the most important part, but as it turns out it just takes forever to process and adds nothing but run time. I suspect this is because the post titles are so short that there is no real meaning to be extracted. This could be useful when analyzing comments"
]
},
{
"cell_type": "markdown",
"id": "a43ea9ac-8e67-41ba-b137-16c2c195f744",
"metadata": {},
"source": [
"### Create X"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "358fce01-c93b-4b35-a66f-bbf39d3eabd8",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:41.647434Z",
"iopub.status.busy": "2022-06-07T15:26:41.647206Z",
"iopub.status.idle": "2022-06-07T15:26:42.472794Z",
"shell.execute_reply": "2022-06-07T15:26:42.471961Z",
"shell.execute_reply.started": "2022-06-07T15:26:41.647417Z"
},
"tags": []
},
"outputs": [],
"source": [
"tf = TfidfVectorizer(stop_words='english',max_features=500)\n",
"tfvec = tf.fit(df.title)\n",
"X = pd.DataFrame(tfvec.transform(df.title).todense(),columns=tfvec.get_feature_names_out())"
]
},
{
"cell_type": "markdown",
"id": "cf0c3334-5795-4f2e-95da-28164de3d36b",
"metadata": {},
"source": [
"### Join X with numeric columns"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "0e1a7188-565d-48c9-8e0a-93553968980c",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:42.473997Z",
"iopub.status.busy": "2022-06-07T15:26:42.473562Z",
"iopub.status.idle": "2022-06-07T15:26:42.480297Z",
"shell.execute_reply": "2022-06-07T15:26:42.479581Z",
"shell.execute_reply.started": "2022-06-07T15:26:42.473979Z"
},
"tags": []
},
"outputs": [],
"source": [
"df = df.join(numerics)\n",
"del numerics # done with numerics"
]
},
{
"cell_type": "markdown",
"id": "2248137d-3f32-4bc7-af53-93de51791d80",
"metadata": {},
"source": [
"### Create dummies from columns that are objects or booleans"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "6c14641e-0e0e-46b5-960a-2bb41e84c1dd",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:42.481258Z",
"iopub.status.busy": "2022-06-07T15:26:42.481069Z",
"iopub.status.idle": "2022-06-07T15:26:49.841545Z",
"shell.execute_reply": "2022-06-07T15:26:49.840824Z",
"shell.execute_reply.started": "2022-06-07T15:26:42.481241Z"
},
"tags": []
},
"outputs": [],
"source": [
"def make_dummies(df):\n",
" for col_name in df.columns:\n",
" if (df[col_name].dtype == 'O') or (df[col_name].dtype == 'bool'):\n",
" dums = pd.get_dummies(df[col_name],prefix=col_name,dtype=int,drop_first=True)\n",
" df = df.drop(labels=[col_name],axis=1)\n",
" df = df.join(dums)\n",
" return df\n",
"\n",
"dums = make_dummies(df[df.columns[1:]]) # [1:] excludes first column, 'title'\n",
"del df # done with df"
]
},
{
"cell_type": "markdown",
"id": "24b13f04-8b0a-43d2-9f47-0bb0edd8b305",
"metadata": {},
"source": [
"### Now join dummies with X"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3a2e6910-4023-4e75-bd68-864fcbfc193d",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:49.842408Z",
"iopub.status.busy": "2022-06-07T15:26:49.842246Z",
"iopub.status.idle": "2022-06-07T15:26:50.247406Z",
"shell.execute_reply": "2022-06-07T15:26:50.246837Z",
"shell.execute_reply.started": "2022-06-07T15:26:49.842392Z"
},
"tags": []
},
"outputs": [],
"source": [
"X = X.join(dums)\n",
"del dums # done with dums"
]
},
{
"cell_type": "markdown",
"id": "ef94a7c0-dcde-41c1-8a0d-407938f85608",
"metadata": {},
"source": [
"### Now model it!"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "c5a322ee-a81e-491a-97d2-3d22a2fd89a7",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:50.248820Z",
"iopub.status.busy": "2022-06-07T15:26:50.248515Z",
"iopub.status.idle": "2022-06-07T15:26:52.034771Z",
"shell.execute_reply": "2022-06-07T15:26:52.034058Z",
"shell.execute_reply.started": "2022-06-07T15:26:50.248795Z"
},
"tags": []
},
"outputs": [],
"source": [
"# Do a split\n",
"X_train, X_test, y_train, y_test = train_test_split(X,y)\n",
"del X\n",
"del y"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "fb1c22e4-57ff-4133-9c36-1c3c3e00c8bd",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:52.035757Z",
"iopub.status.busy": "2022-06-07T15:26:52.035519Z",
"iopub.status.idle": "2022-06-07T15:28:35.841910Z",
"shell.execute_reply": "2022-06-07T15:28:35.841244Z",
"shell.execute_reply.started": "2022-06-07T15:26:52.035741Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Create Random Forest...\n",
"Create KNN...\n",
"fit RF...\n",
"fit KNN...\n",
"Scoring...\n",
"Score:\t0.62 ± 0.0056\n",
"Score:\t0.6 ± 0.0052\n"
]
}
],
"source": [
"print('Create Random Forest...')\n",
"rf = RandomForestClassifier(n_jobs=-1)\n",
"print('Create KNN...')\n",
"# knn = KNeighborsClassifier(n_jobs=-1)\n",
"print('fit RF...')\n",
"model_rf = rf.fit(X_train,y_train)\n",
"\n",
"print('fit KNN...')\n",
"# model_knn = knn.fit(X_train,y_train)\n",
"\n",
"# Model Scores\n",
"def score(model,X,y):\n",
"    cv=StratifiedKFold(n_splits=3,shuffle=True)\n",
"    s = cross_val_score(model,X,y,cv=cv) # n_jobs=-1 actually makes it slower here\n",
"    print(\"Score:\\t{:0.2} ± {:0.2}\".format(s.mean(), 2 * s.std()))\n",
"\n",
"print('Scoring...')\n",
"score(model_rf,X_train,y_train) # train split\n",
"score(model_rf,X_test,y_test) # test split\n",
"# score(model_knn,X_train,y_train)\n",
"# score(model_knn,X_test,y_test)"
]
},
{
"cell_type": "markdown",
"id": "ec845f9b-2df2-475e-b4a0-12d6e4d1a38c",
"metadata": {},
"source": [
"### Comparing Models\n",
"Between Random Forest, K Neighbors, and LogisticRegression, they all score about the same. But Random Forest takes two minutes to run and the rest take a lifetime. I think the data and the problem isn't complex enough to warrant more than a Random Forest.\n",
"\n",
"Everything seems to perform about 10% above the baseline, which would be 50%. In other words, the target is split cleanly in half, so if the predictions were to be 1s across the board the accuracy would be at 50%. So a cross-val score above 50 means the model is working, but doesn't necessarily give insight on exactly what it's predicting."
]
},
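{
"cell_type": "markdown",
"id": "9f4b6c1a-3d5e-4f7a-8b9c-2d3e4f5a6b7c",
"metadata": {},
"source": [
"A quick sanity check on that 50% baseline (a sketch, not run here): since the target was split at the median, the majority class should hold right around half of the rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a5c7d2b-4e6f-4a8b-9c0d-3e4f5a6b7c8d",
"metadata": {},
"outputs": [],
"source": [
"# Baseline accuracy = always predict the majority class\n",
"# print(y_train.value_counts(normalize=True).max())  # expect ~0.5"
]
},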
{
"cell_type": "code",
"execution_count": 11,
"id": "1ea3c8de-1f87-43ef-b589-7fab1552f072",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:28:35.843245Z",
"iopub.status.busy": "2022-06-07T15:28:35.842845Z",
"iopub.status.idle": "2022-06-07T15:28:35.877909Z",
"shell.execute_reply": "2022-06-07T15:28:35.877235Z",
"shell.execute_reply.started": "2022-06-07T15:28:35.843225Z"
},
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Variable</th>\n",
" <th>Importance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>500</th>\n",
" <td>post_age</td>\n",
" <td>0.080172</td>\n",
" </tr>\n",
" <tr>\n",
" <th>502</th>\n",
" <td>norm_score</td>\n",
" <td>0.077290</td>\n",
" </tr>\n",
" <tr>\n",
" <th>501</th>\n",
" <td>upvote_ratio</td>\n",
" <td>0.047547</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5545</th>\n",
" <td>is_self_True</td>\n",
" <td>0.020376</td>\n",
" </tr>\n",
" <tr>\n",
" <th>232</th>\n",
" <td>like</td>\n",
" <td>0.003663</td>\n",
" </tr>\n",
" <tr>\n",
" <th>212</th>\n",
" <td>just</td>\n",
" <td>0.003490</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4518</th>\n",
" <td>subreddit_memes</td>\n",
" <td>0.003337</td>\n",
" </tr>\n",
" <tr>\n",
" <th>428</th>\n",
" <td>time</td>\n",
" <td>0.003296</td>\n",
" </tr>\n",
" <tr>\n",
" <th>287</th>\n",
" <td>new</td>\n",
" <td>0.003230</td>\n",
" </tr>\n",
" <tr>\n",
" <th>293</th>\n",
" <td>oc</td>\n",
" <td>0.002981</td>\n",
" </tr>\n",
" <tr>\n",
" <th>198</th>\n",
" <td>im</td>\n",
" <td>0.002539</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>day</td>\n",
" <td>0.002455</td>\n",
" </tr>\n",
" <tr>\n",
" <th>157</th>\n",
" <td>got</td>\n",
" <td>0.002369</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>dont</td>\n",
" <td>0.002164</td>\n",
" </tr>\n",
" <tr>\n",
" <th>431</th>\n",
" <td>today</td>\n",
" <td>0.002146</td>\n",
" </tr>\n",
" <tr>\n",
" <th>247</th>\n",
" <td>love</td>\n",
" <td>0.002137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>253</th>\n",
" <td>man</td>\n",
" <td>0.002102</td>\n",
" </tr>\n",
" <tr>\n",
" <th>156</th>\n",
" <td>good</td>\n",
" <td>0.002101</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5546</th>\n",
" <td>spoiler_True</td>\n",
" <td>0.002013</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5544</th>\n",
" <td>is_original_content_True</td>\n",
" <td>0.002012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>141</th>\n",
" <td>game</td>\n",
" <td>0.001961</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5027</th>\n",
" <td>subreddit_shitposting</td>\n",
" <td>0.001955</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>art</td>\n",
" <td>0.001921</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5543</th>\n",
" <td>over_18_True</td>\n",
" <td>0.001905</td>\n",
" </tr>\n",
" <tr>\n",
" <th>311</th>\n",
" <td>people</td>\n",
" <td>0.001879</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Variable Importance\n",
"500 post_age 0.080172\n",
"502 norm_score 0.077290\n",
"501 upvote_ratio 0.047547\n",
"5545 is_self_True 0.020376\n",
"232 like 0.003663\n",
"212 just 0.003490\n",
"4518 subreddit_memes 0.003337\n",
"428 time 0.003296\n",
"287 new 0.003230\n",
"293 oc 0.002981\n",
"198 im 0.002539\n",
"85 day 0.002455\n",
"157 got 0.002369\n",
"99 dont 0.002164\n",
"431 today 0.002146\n",
"247 love 0.002137\n",
"253 man 0.002102\n",
"156 good 0.002101\n",
"5546 spoiler_True 0.002013\n",
"5544 is_original_content_True 0.002012\n",
"141 game 0.001961\n",
"5027 subreddit_shitposting 0.001955\n",
"11 art 0.001921\n",
"5543 over_18_True 0.001905\n",
"311 people 0.001879"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame({'Variable':X_train.columns,\n",
" 'Importance':rf.feature_importances_}).sort_values('Importance', ascending=False).head(25)"
]
},
{
"cell_type": "markdown",
"id": "4e664e35-6091-47b9-85e9-384eb3a46dcd",
"metadata": {},
"source": [
"Here are the top 25 predictors scored by importance, or the amount of influence they have. At this time of writing, at the top is age, score, upvote ratio. Of course if you want to have a popular post, make it popular, but we can't just say that. Beyond that it appears that self posts see more activity, the subreddits memes and shitposting have been very active and popular the past few days, over 18 content is popular. And some key words that may get you to the top are 'like','just','time','new','oc','good','got','day','today','im','dont', and 'love'. I don't go on Reddit often but I'm not exactly a stranger either, this looks about right. Memes is very popular, and a very easy karma farm. People love their OC (and their porn). A lot of people on reddit talk about what's going on in their day ('today', 'day')"
]
},
{
"cell_type": "markdown",
"id": "48b62749-e027-41dd-acb3-32fa1c52eeea",
"metadata": {},
"source": [
"This is a fairly simple model, with simple data. To go beyond this I think the comments would have to be analyzed. Tokenization I thought would be the most influential piece, and I still think that thinking is correct. But in this case it doesn't apply because there is no real meaning to be had from reddit post titles, at least to a computer. There's a lot more seen by a human than just the text in the title, there's often an image attached, most posts reference a recent/current event, they could be an inside joke of sorts. For some posts there could be emojis in the title, and depending on their combination they can take on a meaning completely different from their individual meanings. \n",
"\n",
"The next step from here I believe is to analyze the comments section of these posts because in this moment I think that's the easiest way to truly describe the meaning of a post to a computer. With what was gathered here I'm only to get 10% above baseline and I think that's all there is to be had here, I mean we can tweak for a few percent probably but I don't think there's much left on the table."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

scrapey.ipynb (new file, +206 lines)

@@ -0,0 +1,206 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "15017f1a-bfcb-4195-b635-d2138873a9cf",
"metadata": {},
"source": [
"## Anatomy of Scrapey!"
]
},
{
"cell_type": "markdown",
"id": "bac30d6c-6a53-4ee5-ad68-d096c7cc567d",
"metadata": {},
"source": [
"Scrapey takes a snapshot of [Reddit/r/all hot](https://www.reddit.com/r/all), and saves the data to a .csv file including a calculated age for each post about every 12 minutes. Run time is about 2 minutes per iteration and each time adds about 100 unique posts to the list while updating any post it's already seen.\n",
"\n",
"To run it yourself you should create a file ./sekrit with your:\n",
"* client_id token\n",
"* client_secret token\n",
"* username (optional)\n",
"* password (if using username, also optional)\n",
"\n",
"Each value goes on their own line in this order, or you can just hard code them below.<br />\n",
"If you don't want to use a username or password just comment out those lines below"
]
},
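{
"cell_type": "markdown",
"id": "3c9a1b2d-4e5f-4a6b-8c7d-9e0f1a2b3c4d",
"metadata": {},
"source": [
"For reference, a `sekrit` file would look something like this (placeholder values, one per line):\n",
"```\n",
"YOUR_CLIENT_ID\n",
"YOUR_CLIENT_SECRET\n",
"your_username\n",
"your_password\n",
"```"
]
},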
{
"cell_type": "markdown",
"id": "b85a5ad5-1de3-41aa-a74c-d2a3beb1d1d2",
"metadata": {},
"source": [
"Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb1dacc9-852c-436e-875f-dd5d5bb4f3d7",
"metadata": {},
"outputs": [],
"source": [
"import praw\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"import time\n",
"print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))"
]
},
{
"cell_type": "markdown",
"id": "41de2589-1fd9-41c6-b1ee-6585366cd53b",
"metadata": {},
"source": [
"Load all from current collection"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3459ab26-8640-4116-98da-d48016f93d4b",
"metadata": {},
"outputs": [],
"source": [
"# Connect to DB\n",
"db_name = 'data/startingover.csv'\n",
"# db = pd.DataFrame() # for fresh start\n",
"db = pd.read_csv(db_name)\n",
"print('Connected to DB...')\n",
"print(db.shape)"
]
},
{
"cell_type": "markdown",
"id": "1b9cb802-b6c6-4380-8026-23710c3624b5",
"metadata": {},
"source": [
"Access Reddit API via [PRAW](https://github.com/praw-dev/praw)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "17d30e19-663a-4c07-847c-ff30f16ea19c",
"metadata": {},
"outputs": [],
"source": [
"# Extremely Confidential\n",
"sekrits = open('sekrit').read().split('\\n')\n",
"\n",
"# Connect to Reddit\n",
"reddit = praw.Reddit(\n",
" client_id = sekrits[0],\n",
" client_secret = sekrits[1],\n",
" username = sekrits[2], # Optional\n",
" password = sekrits[3], # Optional\n",
" redirect_uri= 'http://localhost:8080',\n",
" user_agent = 'totally_not_a_bot', # fool everyone\n",
")\n",
"print('Connected to Reddit...')"
]
},
{
"cell_type": "markdown",
"id": "fd2c83ee-d534-4730-add9-da4e304c1c9d",
"metadata": {},
"source": [
"The following block is a little large but if I split it up it will break the loop and it can't be run from the notebook.\n",
"1. Loop through all posts on /r/all hot at the current moment, and create a dataframe of all of these posts with the listed features\n",
"2. Calculate a current age of the post and add that in its own column.\n",
"3. Append the newly pulled posts to the posts already saved\n",
"4. Overwrite any old records that have the same post id as a new record\n",
"5. Save back to the original .csv, wait 10 minutes, repeat."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49504fca-7beb-45db-b43f-98a9e719bf31",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Grab everything from /r/all hot\n",
"print('Pulling...')\n",
"while True:\n",
" pull = pd.DataFrame({\\\n",
" 'author': post.author,\n",
" # 'comments': post.comments, # takes really long, returns object\n",
" 'created_utc': post.created_utc,\n",
" 'distinguished': post.distinguished,\n",
" 'edited': post.edited,\n",
" 'id': post.id,\n",
" 'is_original_content': post.is_original_content,\n",
" 'is_self': post.is_self,\n",
" 'link_flair_text': post.link_flair_text,\n",
" 'locked': post.locked,\n",
" 'name': post.name,\n",
" 'num_comments': post.num_comments,\n",
" 'over_18': post.over_18,\n",
" 'permalink': post.permalink,\n",
" 'score': post.score,\n",
" 'selftext': post.selftext,\n",
" 'spoiler': post.spoiler,\n",
" 'stickied': post.stickied,\n",
" 'subreddit': post.subreddit,\n",
" 'title': post.title,\n",
" 'upvote_ratio': post.upvote_ratio,\n",
" 'url': post.url,\n",
" 'utc_now': datetime.utcnow().timestamp(),\n",
" 'post_age': (datetime.utcnow().timestamp()-post.created_utc) # Create age col\n",
" } for post in reddit.subreddit('all').hot(limit=None))\n",
"\n",
" # add new list to BOTTOM of old list\n",
" db = pd.concat([db,pull])\n",
" # effectively update post record in place\n",
" db = db.drop_duplicates('id',keep='last')\n",
" # save\n",
" db.to_csv(db_name, index=False)\n",
"\n",
" # stats\n",
" total = db.shape[0]\n",
" haul = pull.shape[0]\n",
" print('Haul: ',pull.shape)\n",
" print('Total:',db.shape)\n",
" print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))\n",
"\n",
" # wait\n",
" print('Now wait...')\n",
" time.sleep(600)"
]
},
{
"cell_type": "markdown",
"id": "39a4a83d-77c5-4dad-ae2d-867e61000d7a",
"metadata": {},
"source": [
"I run this in the background in a terminal and it updates my data set every ~12 minutes. I have records of all posts within about 12 minutes of them disappearing from /r/all.\n",
"\n",
"Next up: [EDA](EDA.ipynb)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}