{
"cells": [
{
"cell_type": "markdown",
"id": "15017f1a-bfcb-4195-b635-d2138873a9cf",
"metadata": {},
"source": [
"## Anatomy of Scrapey!"
]
},
{
"cell_type": "markdown",
"id": "bac30d6c-6a53-4ee5-ad68-d096c7cc567d",
"metadata": {},
"source": [
"Scrapey takes a snapshot of [Reddit/r/all hot](https://www.reddit.com/r/all), and saves the data to a .csv file including a calculated age for each post about every 12 minutes. Run time is about 2 minutes per iteration and each time adds about 100 unique posts to the list while updating any post it's already seen.\n",
"\n",
"To run it yourself you should create a file ./sekrit with your:\n",
"* client_id token\n",
"* client_secret token\n",
"* username (optional)\n",
"* password (if using username, also optional)\n",
"\n",
"Each value goes on their own line in this order, or you can just hard code them below.<br />\n",
"If you don't want to use a username or password just comment out those lines below"
]
},
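{
"cell_type": "markdown",
"id": "e3f8d21a-9c4b-4f6e-8a17-2d5b0c9e4f31",
"metadata": {},
"source": [
"For example, a `sekrit` file would look something like this (placeholder values, not real tokens):\n",
"```\n",
"AbCdEfGhIjKlMn\n",
"oPqRsTuVwXyZ-0123456789abcd\n",
"your_reddit_username\n",
"your_reddit_password\n",
"```"
]
},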
{
"cell_type": "markdown",
"id": "b85a5ad5-1de3-41aa-a74c-d2a3beb1d1d2",
"metadata": {},
"source": [
"Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb1dacc9-852c-436e-875f-dd5d5bb4f3d7",
"metadata": {},
"outputs": [],
"source": [
"import praw\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"import time\n",
"print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))"
]
},
{
"cell_type": "markdown",
"id": "41de2589-1fd9-41c6-b1ee-6585366cd53b",
"metadata": {},
"source": [
"Load all from current collection"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3459ab26-8640-4116-98da-d48016f93d4b",
"metadata": {},
"outputs": [],
"source": [
"# Connect to DB\n",
"db_name = 'data/startingover.csv'\n",
"# db = pd.DataFrame() # for fresh start\n",
"db = pd.read_csv(db_name)\n",
"print('Connected to DB...')\n",
"print(db.shape)"
]
},
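{
"cell_type": "markdown",
"id": "5a7e9c3d-1f2b-4e8a-9c6d-3b4f5a6e7d8c",
"metadata": {},
"source": [
"An optional variation (a sketch, not part of the original flow): fall back to an empty DataFrame automatically when the .csv doesn't exist yet, instead of toggling the commented line above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c2d4f6a-3e5b-4a7c-9d1e-6f8a0b2c4d5e",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Hypothetical fresh-start guard: load the saved collection if it exists,\n",
"# otherwise begin with an empty DataFrame\n",
"db = pd.read_csv(db_name) if os.path.exists(db_name) else pd.DataFrame()\n",
"print(db.shape)"
]
},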
{
"cell_type": "markdown",
"id": "1b9cb802-b6c6-4380-8026-23710c3624b5",
"metadata": {},
"source": [
"Access Reddit API via [PRAW](https://github.com/praw-dev/praw)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "17d30e19-663a-4c07-847c-ff30f16ea19c",
"metadata": {},
"outputs": [],
"source": [
"# Extremely Confidential\n",
"sekrits = open('sekrit').read().split('\\n')\n",
"\n",
"# Connect to Reddit\n",
"reddit = praw.Reddit(\n",
" client_id = sekrits[0],\n",
" client_secret = sekrits[1],\n",
" username = sekrits[2], # Optional\n",
" password = sekrits[3], # Optional\n",
" redirect_uri= 'http://localhost:8080',\n",
" user_agent = 'totally_not_a_bot', # fool everyone\n",
")\n",
"print('Connected to Reddit...')"
]
},
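{
"cell_type": "markdown",
"id": "2b6e8a0c-4d7f-4b9e-8c1a-5e3f7a9b1c2d",
"metadata": {},
"source": [
"If you skip the username and password entirely, PRAW runs in read-only mode with just the client id and secret, which is enough for reading public listings like /r/all. A minimal sketch (the `reddit_ro` name is just illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d1f3b5a-7c2e-4d8f-a0b4-6c8e0f2a4b6d",
"metadata": {},
"outputs": [],
"source": [
"# Read-only PRAW instance: no account credentials needed for public data\n",
"reddit_ro = praw.Reddit(\n",
"    client_id     = sekrits[0],\n",
"    client_secret = sekrits[1],\n",
"    user_agent    = 'totally_not_a_bot',\n",
")\n",
"print(reddit_ro.read_only)  # True"
]
},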
{
"cell_type": "markdown",
"id": "fd2c83ee-d534-4730-add9-da4e304c1c9d",
"metadata": {},
"source": [
"The following block is a little large but if I split it up it will break the loop and it can't be run from the notebook.\n",
"1. Loop through all posts on /r/all hot at the current moment, and create a dataframe of all of these posts with the listed features\n",
"2. Calculate a current age of the post and add that in its own column.\n",
"3. Append the newly pulled posts to the posts already saved\n",
"4. Overwrite any old records that have the same post id as a new record\n",
"5. Save back to the original .csv, wait 10 minutes, repeat."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49504fca-7beb-45db-b43f-98a9e719bf31",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Grab everything from /r/all hot\n",
"print('Pulling...')\n",
"while True:\n",
" pull = pd.DataFrame({\\\n",
" 'author': post.author,\n",
" # 'comments': post.comments, # takes really long, returns object\n",
" 'created_utc': post.created_utc,\n",
" 'distinguished': post.distinguished,\n",
" 'edited': post.edited,\n",
" 'id': post.id,\n",
" 'is_original_content': post.is_original_content,\n",
" 'is_self': post.is_self,\n",
" 'link_flair_text': post.link_flair_text,\n",
" 'locked': post.locked,\n",
" 'name': post.name,\n",
" 'num_comments': post.num_comments,\n",
" 'over_18': post.over_18,\n",
" 'permalink': post.permalink,\n",
" 'score': post.score,\n",
" 'selftext': post.selftext,\n",
" 'spoiler': post.spoiler,\n",
" 'stickied': post.stickied,\n",
" 'subreddit': post.subreddit,\n",
" 'title': post.title,\n",
" 'upvote_ratio': post.upvote_ratio,\n",
" 'url': post.url,\n",
" 'utc_now': datetime.utcnow().timestamp(),\n",
" 'post_age': (datetime.utcnow().timestamp()-post.created_utc) # Create age col\n",
" } for post in reddit.subreddit('all').hot(limit=None))\n",
"\n",
" # add new list to BOTTOM of old list\n",
" db = pd.concat([db,pull])\n",
" # effectively update post record in place\n",
" db = db.drop_duplicates('id',keep='last')\n",
" # save\n",
" db.to_csv(db_name, index=False)\n",
"\n",
" # stats\n",
" total = db.shape[0]\n",
" haul = pull.shape[0]\n",
" print('Haul: ',pull.shape)\n",
" print('Total:',db.shape)\n",
" print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))\n",
"\n",
" # wait\n",
" print('Now wait...')\n",
" time.sleep(600)"
]
},
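{
"cell_type": "markdown",
"id": "6c8e0a2b-4d6f-4e8a-9b1c-3d5f7a9b0c1d",
"metadata": {},
"source": [
"To make the overwrite-by-id step concrete, here's a tiny toy example (made-up ids and scores, not real Reddit data) of the `concat` + `drop_duplicates(keep='last')` idiom used above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e3a5c7d-9b0f-4c2e-8d4a-6f8b0d2e4a6c",
"metadata": {},
"outputs": [],
"source": [
"# Toy demonstration of updating records in place by id\n",
"old = pd.DataFrame({'id': ['a', 'b'], 'score': [10, 20]})\n",
"new = pd.DataFrame({'id': ['b', 'c'], 'score': [25, 5]})\n",
"\n",
"merged = pd.concat([old, new]).drop_duplicates('id', keep='last')\n",
"print(merged)  # 'b' keeps its newer score (25); 'a' and 'c' are untouched"
]
},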
{
"cell_type": "markdown",
"id": "39a4a83d-77c5-4dad-ae2d-867e61000d7a",
"metadata": {},
"source": [
"I run this in the background in a terminal and it updates my data set every ~12 minutes. I have records of all posts within about 12 minutes of them disappearing from /r/all.\n",
"\n",
"Next up: [EDA](EDA.ipynb)"
]
}
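{
"cell_type": "markdown",
"id": "4f6a8c0e-2b3d-4d5f-9a7b-1c3e5d7f9a0b",
"metadata": {},
"source": [
"One way to do that (a sketch; adjust names and paths to taste) is to export the notebook to a plain script and background it with `nohup`:\n",
"```\n",
"jupyter nbconvert --to script scrapey.ipynb\n",
"nohup python scrapey.py > scrapey.log 2>&1 &\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "7a9b1c3d-5e6f-4a8b-8c0d-2e4f6a8b0c2d",
"metadata": {},
"source": [
"Next up: [EDA](EDA.ipynb)"
]
}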
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}