207 lines
6.4 KiB
Text
207 lines
6.4 KiB
Text
|
{
|
||
|
"cells": [
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "15017f1a-bfcb-4195-b635-d2138873a9cf",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Anatomy of Scrapey!"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "bac30d6c-6a53-4ee5-ad68-d096c7cc567d",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Scrapey takes a snapshot of [Reddit/r/all hot](https://www.reddit.com/r/all), and saves the data to a .csv file including a calculated age for each post about every 12 minutes. Run time is about 2 minutes per iteration and each time adds about 100 unique posts to the list while updating any post it's already seen.\n",
|
||
|
"\n",
|
||
|
"To run it yourself you should create a file ./sekrit with your:\n",
|
||
|
"* client_id token\n",
|
||
|
"* client_secret token\n",
|
||
|
"* username (optional)\n",
|
||
|
"* password (if using username, also optional)\n",
|
||
|
"\n",
|
||
|
"Each value goes on their own line in this order, or you can just hard code them below.<br />\n",
|
||
|
"If you don't want to use a username or password just comment out those lines below"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "b85a5ad5-1de3-41aa-a74c-d2a3beb1d1d2",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Imports"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"id": "cb1dacc9-852c-436e-875f-dd5d5bb4f3d7",
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"import praw\n",
|
||
|
"import pandas as pd\n",
|
||
|
"from datetime import datetime\n",
|
||
|
"import time\n",
|
||
|
"print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "41de2589-1fd9-41c6-b1ee-6585366cd53b",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Load all from current collection"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"id": "3459ab26-8640-4116-98da-d48016f93d4b",
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"# Connect to DB\n",
|
||
|
"db_name = 'data/startingover.csv'\n",
|
||
|
"# db = pd.DataFrame() # for fresh start\n",
|
||
|
"db = pd.read_csv(db_name)\n",
|
||
|
"print('Connected to DB...')\n",
|
||
|
"print(db.shape)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "1b9cb802-b6c6-4380-8026-23710c3624b5",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Access Reddit API via [PRAW](https://github.com/praw-dev/praw)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"id": "17d30e19-663a-4c07-847c-ff30f16ea19c",
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"# Extremely Confidential\n",
|
||
|
"sekrits = open('sekrit').read().split('\\n')\n",
|
||
|
"\n",
|
||
|
"# Connect to Reddit\n",
|
||
|
"reddit = praw.Reddit(\n",
|
||
|
" client_id = sekrits[0],\n",
|
||
|
" client_secret = sekrits[1],\n",
|
||
|
" username = sekrits[2], # Optional\n",
|
||
|
" password = sekrits[3], # Optional\n",
|
||
|
" redirect_uri= 'http://localhost:8080',\n",
|
||
|
" user_agent = 'totally_not_a_bot', # fool everyone\n",
|
||
|
")\n",
|
||
|
"print('Connected to Reddit...')"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "fd2c83ee-d534-4730-add9-da4e304c1c9d",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"The following block is a little large but if I split it up it will break the loop and it can't be run from the notebook.\n",
|
||
|
"1. Loop through all posts on /r/all hot at the current moment, and create a dataframe of all of these posts with the listed features\n",
|
||
|
"2. Calculate a current age of the post and add that in its own column.\n",
|
||
|
"3. Append the newly pulled posts to the posts already saved\n",
|
||
|
"4. Overwrite any old records that have the same post id as a new record\n",
|
||
|
"5. Save back to the original .csv, wait 10 minutes, repeat."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"id": "49504fca-7beb-45db-b43f-98a9e719bf31",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"# Grab everything from /r/all hot\n",
|
||
|
"print('Pulling...')\n",
|
||
|
"while True:\n",
|
||
|
" pull = pd.DataFrame({\\\n",
|
||
|
" 'author': post.author,\n",
|
||
|
" # 'comments': post.comments, # takes really long, returns object\n",
|
||
|
" 'created_utc': post.created_utc,\n",
|
||
|
" 'distinguished': post.distinguished,\n",
|
||
|
" 'edited': post.edited,\n",
|
||
|
" 'id': post.id,\n",
|
||
|
" 'is_original_content': post.is_original_content,\n",
|
||
|
" 'is_self': post.is_self,\n",
|
||
|
" 'link_flair_text': post.link_flair_text,\n",
|
||
|
" 'locked': post.locked,\n",
|
||
|
" 'name': post.name,\n",
|
||
|
" 'num_comments': post.num_comments,\n",
|
||
|
" 'over_18': post.over_18,\n",
|
||
|
" 'permalink': post.permalink,\n",
|
||
|
" 'score': post.score,\n",
|
||
|
" 'selftext': post.selftext,\n",
|
||
|
" 'spoiler': post.spoiler,\n",
|
||
|
" 'stickied': post.stickied,\n",
|
||
|
" 'subreddit': post.subreddit,\n",
|
||
|
" 'title': post.title,\n",
|
||
|
" 'upvote_ratio': post.upvote_ratio,\n",
|
||
|
" 'url': post.url,\n",
|
||
|
" 'utc_now': datetime.utcnow().timestamp(),\n",
|
||
|
" 'post_age': (datetime.utcnow().timestamp()-post.created_utc) # Create age col\n",
|
||
|
" } for post in reddit.subreddit('all').hot(limit=None))\n",
|
||
|
"\n",
|
||
|
" # add new list to BOTTOM of old list\n",
|
||
|
" db = pd.concat([db,pull])\n",
|
||
|
" # effectively update post record in place\n",
|
||
|
" db = db.drop_duplicates('id',keep='last')\n",
|
||
|
" # save\n",
|
||
|
" db.to_csv(db_name, index=False)\n",
|
||
|
"\n",
|
||
|
" # stats\n",
|
||
|
" total = db.shape[0]\n",
|
||
|
" haul = pull.shape[0]\n",
|
||
|
" print('Haul: ',pull.shape)\n",
|
||
|
" print('Total:',db.shape)\n",
|
||
|
" print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))\n",
|
||
|
"\n",
|
||
|
" # wait\n",
|
||
|
" print('Now wait...')\n",
|
||
|
" time.sleep(600)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "39a4a83d-77c5-4dad-ae2d-867e61000d7a",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"I run this in the background in a terminal and it updates my data set every ~12 minutes. I have records of all posts within about 12 minutes of them disappearing from /r/all.\n",
|
||
|
"\n",
|
||
|
"Next up: [EDA](EDA.ipynb)"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"metadata": {
|
||
|
"kernelspec": {
|
||
|
"display_name": "Python 3 (ipykernel)",
|
||
|
"language": "python",
|
||
|
"name": "python3"
|
||
|
},
|
||
|
"language_info": {
|
||
|
"codemirror_mode": {
|
||
|
"name": "ipython",
|
||
|
"version": 3
|
||
|
},
|
||
|
"file_extension": ".py",
|
||
|
"mimetype": "text/x-python",
|
||
|
"name": "python",
|
||
|
"nbconvert_exporter": "python",
|
||
|
"pygments_lexer": "ipython3",
|
||
|
"version": "3.10.5"
|
||
|
}
|
||
|
},
|
||
|
"nbformat": 4,
|
||
|
"nbformat_minor": 5
|
||
|
}
|