reddit/clean.ipynb

291 lines
8.3 KiB
Text
Raw Normal View History

2022-07-27 13:09:15 -04:00
{
"cells": [
{
"cell_type": "markdown",
"id": "7b689e03-f1fa-4a63-b3e2-7342f71b4c1f",
"metadata": {},
"source": [
"# Cleaning/Processing\n",
"## Numerics\n",
"The numeric columns added in the previous step need to be scaled so they don't end up with higher influence than other features."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "85e6ae15-a4d7-474f-918f-a81b790a0ef7",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:19.692651Z",
"iopub.status.busy": "2022-06-07T15:26:19.692290Z",
"iopub.status.idle": "2022-06-07T15:26:19.952952Z",
"shell.execute_reply": "2022-06-07T15:26:19.952223Z",
"shell.execute_reply.started": "2022-06-07T15:26:19.692571Z"
},
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"numerics = pd.read_csv('data/numerics.csv')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "326de842-8fd1-4d1b-8696-6f1f7c6b916f",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:19.954589Z",
"iopub.status.busy": "2022-06-07T15:26:19.954241Z",
"iopub.status.idle": "2022-06-07T15:26:19.960763Z",
"shell.execute_reply": "2022-06-07T15:26:19.960063Z",
"shell.execute_reply.started": "2022-06-07T15:26:19.954558Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score min/max: 45 / 193501\n",
"Post Age min/max: 14846.5434589386 / 101274.16840600967\n",
"Upvote Ratio min/max: 0.51 / 1.0\n",
"\n"
]
}
],
"source": [
"print(f'\\\n",
"Score min/max: {numerics.score.min()} / {numerics.score.max()}\\n\\\n",
"Post Age min/max: {numerics.post_age.min()} / {numerics.post_age.max()}\\n\\\n",
"Upvote Ratio min/max: {numerics.upvote_ratio.min()} / {numerics.upvote_ratio.max()}\\n\\\n",
"')"
]
},
{
"cell_type": "markdown",
"id": "42f191ae-2c1e-4471-9342-b25dea2be7a0",
"metadata": {},
"source": [
"Upvote Ratio is already between 0 and 1 so there's 1/3 of the work out the way for free"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "eb3a9ce5-d4eb-47a8-bdbe-aaf34b1b948b",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:19.963570Z",
"iopub.status.busy": "2022-06-07T15:26:19.963241Z",
"iopub.status.idle": "2022-06-07T15:26:19.970133Z",
"shell.execute_reply": "2022-06-07T15:26:19.969493Z",
"shell.execute_reply.started": "2022-06-07T15:26:19.963543Z"
},
"tags": []
},
"outputs": [],
"source": [
"def normalize_numerics(col):\n",
" col_max = numerics[col].max()\n",
" return [(val/col_max) for val in numerics[col]]"
]
},
{
"cell_type": "markdown",
"id": "8c77c16b-abff-40f3-9621-78ca90096c70",
"metadata": {},
"source": [
"I wasn't sure how often I would need to do this so I wrote a function"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2e83a18d-9a20-4f18-8da5-dd1ad9a35ca1",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:19.971475Z",
"iopub.status.busy": "2022-06-07T15:26:19.971081Z",
"iopub.status.idle": "2022-06-07T15:26:20.010176Z",
"shell.execute_reply": "2022-06-07T15:26:20.009339Z",
"shell.execute_reply.started": "2022-06-07T15:26:19.971448Z"
},
"tags": []
},
"outputs": [],
"source": [
"numerics['norm_score'] = normalize_numerics('score')\n",
"numerics = numerics.drop('score',axis=1) # Prevent name collision with column for word 'score'\n",
"numerics['post_age'] = normalize_numerics('post_age')"
]
},
{
"cell_type": "markdown",
"id": "c042889a-c6ac-4dfa-92ed-28a668957296",
"metadata": {},
"source": [
"So if we check again..."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "17b0baf9-04bf-486c-b94f-5ea9cd801c37",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:20.011272Z",
"iopub.status.busy": "2022-06-07T15:26:20.011063Z",
"iopub.status.idle": "2022-06-07T15:26:20.016210Z",
"shell.execute_reply": "2022-06-07T15:26:20.015608Z",
"shell.execute_reply.started": "2022-06-07T15:26:20.011255Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score min/max: 0.0002325569376902445 / 1.0\n",
"Post Age min/max: 0.14659753511298737 / 1.0\n",
"Upvote Ratio min/max: 0.51 / 1.0\n",
"\n"
]
}
],
"source": [
"print(f'\\\n",
"Score min/max: {numerics.norm_score.min()} / {numerics.norm_score.max()}\\n\\\n",
"Post Age min/max: {numerics.post_age.min()} / {numerics.post_age.max()}\\n\\\n",
"Upvote Ratio min/max: {numerics.upvote_ratio.min()} / {numerics.upvote_ratio.max()}\\n\\\n",
"')"
]
},
{
"cell_type": "markdown",
"id": "25157505-bc47-433e-8b5f-95a73d728d74",
"metadata": {},
"source": [
"It's less human readable but better for the model, who is not human"
]
},
{
"cell_type": "markdown",
"id": "e65e348b-7023-40b7-8e14-1fff5950a6ed",
"metadata": {},
"source": [
"## Titles"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "13180eef-55b3-4faa-adee-7c6f6ce4b81c",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:20.017202Z",
"iopub.status.busy": "2022-06-07T15:26:20.017007Z",
"iopub.status.idle": "2022-06-07T15:26:20.131445Z",
"shell.execute_reply": "2022-06-07T15:26:20.130521Z",
"shell.execute_reply.started": "2022-06-07T15:26:20.017184Z"
},
"tags": []
},
"outputs": [],
"source": [
"df = pd.read_csv('data/workingdf.csv')"
]
},
{
"cell_type": "markdown",
"id": "7eb7f69a-f634-4745-ad02-9ffb4d3a662c",
"metadata": {},
"source": [
"Before beginning tokenizing, the random garbage that will end up producing gibberish will need to be removed. Things like emojis, punctuation and special characters, accents, etc. I've decided to replace underscores and hyphens with whitespace, then remove anything that is not a letter or whitespace, and finally strip all extra whitespace."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "44f5369e-3bb8-4f9d-b176-50ee5f5c07e2",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:20.132495Z",
"iopub.status.busy": "2022-06-07T15:26:20.132306Z",
"iopub.status.idle": "2022-06-07T15:26:20.279374Z",
"shell.execute_reply": "2022-06-07T15:26:20.278619Z",
"shell.execute_reply.started": "2022-06-07T15:26:20.132478Z"
},
"tags": []
},
"outputs": [],
"source": [
"df['title'] = [title.lower().replace('_',' ').replace('-',' ') for title in df.title]\n",
"df['title'] = df.title.replace(\"[^a-zA-Z\\s]\",'',regex=True)\n",
"df['title'] = [title.strip() for title in df.title]\n",
"df.drop(df[df.title==''].index,inplace=True) # drop now empty titles"
]
},
{
"cell_type": "markdown",
"id": "15415269-6a3b-47eb-bfa3-b45f23195b4a",
"metadata": {},
"source": [
"# Save"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "de7c67d6-22a8-456c-8e89-d22e789e0f87",
"metadata": {
"execution": {
"iopub.execute_input": "2022-06-07T15:26:20.280467Z",
"iopub.status.busy": "2022-06-07T15:26:20.280158Z",
"iopub.status.idle": "2022-06-07T15:26:20.649227Z",
"shell.execute_reply": "2022-06-07T15:26:20.648747Z",
"shell.execute_reply.started": "2022-06-07T15:26:20.280449Z"
},
"tags": []
},
"outputs": [],
"source": [
"df.to_csv('data/df_clean.csv',index=False)\n",
"numerics.to_csv('data/numerics_clean.csv',index=False)"
]
},
{
"cell_type": "markdown",
"id": "425bc75b-f459-4f38-96b1-99e3414e4226",
"metadata": {},
"source": [
"Next! [Model](model.ipynb)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}