Adam 2023-02-05 00:30:55 -05:00
parent 836294c229
commit 02a0bd70f2
14 changed files with 347 additions and 44 deletions


@ -0,0 +1,138 @@
<article>
<p className="align-right date">May 29, 2022</p>
<h2 className="title">Predicting Housing Prices</h2>
<p>
A recent project I had for class was to use
<a href="https://scikit-learn.org/stable/index.html" target="new">scikit-learn</a>
to create a regression model that predicts the price of a house based on some of
its features.
</p>
<h3>How?</h3>
<ol>
<li>
Pick out and analyze certain features from the dataset. The dataset used here is the
<a href="https://www.kaggle.com/datasets/marcopale/housing" target="new"
>Ames Iowa Housing Data</a
>
set (a rough code sketch of these first steps follows this list).
</li>
<li>
Do some signal processing to provide a clearer input down the line, improving
accuracy
</li>
<li>Make predictions on sale price</li>
<li>
Compare the predicted prices to recorded actual sale prices and score the results
</li>
</ol>
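<p>
The original notebook isn't shown in this post, so here is only a rough sketch of the
first two steps in pandas/scikit-learn terms, assuming the Kaggle CSV is saved locally
as ames.csv and keeps the original Ames column names:
</p>
<pre>
# Rough sketch of steps 1-2; file and column names are assumptions, not the notebook's code
import pandas as pd

ames = pd.read_csv('ames.csv')

# Derived feature: age of each house at time of sale
ames['Age'] = ames['Yr Sold'] - ames['Year Built']

# Drop the couple of very large living-area outliers
ames = ames[ames['Gr Liv Area'].lt(4000)]

features = ['Gr Liv Area', 'Age', 'Lot Area']
X = ames[features]
y = ames['SalePrice']
</pre>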
<h3>What's important?</h3>
<p>
Well, I don't know much about appraising houses. But I have heard the term "price per
square foot" so we'll start with that:
</p>
<p className="align-center"><img src="https://doordesk.net/pics/livarea_no_outliers.png" /></p>
<p>
There is a feature for 'Above Grade Living Area', meaning floor area that isn't basement.
It looks linear; there were a couple of outliers to take care of, but this should be a
good signal.
</p>
<p>Next I calculated the age of every house at time of sale and plotted it:</p>
<p className="align-center"><img src="https://doordesk.net/pics/age.png" /></p>
<p>
Exactly what I'd expect to see: price drops as age goes up, with a few outliers. We'll
include that in the model.
</p>
<p>Next I chose the area of the lot:</p>
<p className="align-center"><img src="https://doordesk.net/pics/lot_area.png" /></p>
<p>
Lot area positively affects sale price because land has value. Most of the houses here
have similarly sized lots.
</p>
<h3>Pre-Processing</h3>
<div>
<p>
Here is an example where using
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"
target="new"
>StandardScaler()</a
>
just doesn't cut it. The values are all scaled so that they can be compared
to one another, but outliers have a huge effect on the clarity of the signal as a
whole.
</p>
<center>
<img src="https://doordesk.net/pics/age_liv_area_ss.png" />
<img src="https://doordesk.net/pics/age_liv_qt.png" />
</center>
</div>
<p>
In the second figure you should clearly see that an old shed, represented in the top left
corner, will sell for far less than a brand new mansion, represented in the bottom right
corner. This is the result of using the
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html"
target="new"
>QuantileTransformer()</a
>
for scaling.
</p>
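<p>
As a minimal sketch of that comparison (with the column names carried over from the
assumptions above), both scalers come straight from scikit-learn:
</p>
<pre>
# StandardScaler vs QuantileTransformer on the same two features
from sklearn.preprocessing import QuantileTransformer, StandardScaler

cols = ['Age', 'Gr Liv Area']

# Zero mean / unit variance: values become comparable, but outliers still dominate the plot
scaled_std = StandardScaler().fit_transform(ames[cols])

# Rank-based mapping to a uniform distribution: the bulk of the data spreads out
# and the outliers get pulled in toward everything else
scaled_qt = QuantileTransformer().fit_transform(ames[cols])
</pre>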
<h3>The Model</h3>
<p>
A simple
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html"
>LinearRegression()</a
>
should do just fine, with
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html"
target="new"
>QuantileTransformer()</a
>
scaling of course.
</p>
<center>
<img src="https://doordesk.net/pics/mod_out.png" />
</center>
<p>
Predictions were within about $35-$40k on average.<br />
It's a little fuzzy at the higher end of prices, I believe due to the small sample size.
There are a few outliers that could probably be reduced with some deeper cleaning, but
I was worried about going too far and telling a different story. An "ideal" model in
this case would look like a straight line.
</p>
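<p>
A hedged sketch of the model itself: a QuantileTransformer feeding a LinearRegression
inside a scikit-learn Pipeline, scored with mean absolute error on a held-out split. The
split size and random_state are arbitrary choices, not taken from the original notebook.
</p>
<pre>
# LinearRegression with QuantileTransformer scaling, scored on a held-out set
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = make_pipeline(QuantileTransformer(), LinearRegression())
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(mean_absolute_error(y_test, preds))  # average miss, roughly the $35-40k quoted above
</pre>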
<h3>Conclusion</h3>
<p>
This model was designed with a focus on quality and consistency. With some refinement,
the margin of error could be brought down to a reasonable level, at which point reliable,
accurate predictions could be made for any application where there is a need to
assess the value of a property.
</p>
<p>
I think a large limiting factor here is the size of the dataset compared to the quality
of the features provided. There are
<a href="http://jse.amstat.org/v19n3/decock/DataDocumentation.txt">more features</a>
from this dataset that could be included, but I think the largest gains will come from
simply feeding in more data. As you stray from the "low-hanging fruit" features, the
overall quality of your model starts to go down.
</p>
<p>Here's an interesting case, Overall Condition of Property:<br /><br /></p>
<center>
<img src="https://doordesk.net/pics/overall_cond.png" />
</center>
<p>
You would expect sale price to increase with quality, no? Yet it goes down... Why?<br />
I believe it's because a lot of sellers want to say that their house is of the highest
quality, no matter its condition. It seems that most normal people (who aren't liars)
don't care to rate their property and just say it's average. Both of these combined
actually create a negative trend for quality, which definitely won't help predictions!
</p>
<p>
I would like to expand this in the future, maybe scraping websites like Zillow to gather
more data. <br />
We'll see.
</p>
</article>

7 binary image files added (not shown); sizes: 80 KiB, 140 KiB, 252 KiB, 79 KiB, 65 KiB, 128 KiB, 48 KiB


@ -0,0 +1,128 @@
<article>
<p className="align-right date">Jun 14, 2022</p>
<h2 className="title">What Goes Into a Successful Reddit Post?</h2>
<p>
In an attempt to find out what makes a Reddit post successful, I will use some
classification models to try to determine which features have the most influence on
making a correct prediction. In particular I use
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html"
>Random Forest</a
>
and
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html"
>KNeighbors</a
>
classifiers. Then I'll score the results and see what the strongest predictors are.
</p>
<p>
To find what goes into making a successful Reddit post we'll have to do a few things,
first of which is collecting data:
</p>
<h3>Introducing Scrapey!</h3>
<p>
<a href="projects/reddit/scrapey.html">Scrapey</a> is my scraper script that takes a snapshot
of Reddit/r/all hot and saves the data to a .csv file including a calculated age for
each post about every 12 minutes. Run time is about 2 minutes per iteration and each
time adds about 100 unique posts to the list while updating any post it's already seen.
</p>
<p>
I run this in the background in a terminal and it updates my data set every ~12 minutes.
I have records of all posts within about 12 minutes of them disappearing from /r/all.
</p>
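<p>
The script itself is linked above rather than inlined, so what follows is only a guess
at its shape using PRAW; the credentials, field list, and file name are assumptions:
</p>
<pre>
# A guess at the shape of Scrapey using PRAW -- not the actual script linked above
import time

import pandas as pd
import praw

reddit = praw.Reddit(client_id='...', client_secret='...', user_agent='scrapey')

FIELDS = ['id', 'title', 'subreddit', 'over_18', 'is_original_content',
          'is_self', 'spoiler', 'locked', 'stickied', 'num_comments', 'created_utc']

while True:
    # Snapshot of /r/all "hot"
    rows = [[getattr(post, f) for f in FIELDS]
            for post in reddit.subreddit('all').hot(limit=1000)]
    snapshot = pd.DataFrame(rows, columns=FIELDS)
    snapshot['subreddit'] = snapshot['subreddit'].astype(str)
    snapshot['age'] = time.time() - snapshot['created_utc']  # age at snapshot time

    # Merge with existing records, keeping the newest copy of any post already seen
    try:
        old = pd.read_csv('scrapey.csv')
        snapshot = pd.concat([snapshot, old]).drop_duplicates(subset='id', keep='first')
    except FileNotFoundError:
        pass
    snapshot.to_csv('scrapey.csv', index=False)

    time.sleep(12 * 60)  # roughly every 12 minutes
</pre>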
<h3>EDA</h3>
<p>
<a href="projects/reddit/EDA.html">Next I take a quick look to see what looks useful</a>, what
doesn't, and check for outliers that will throw off the model. There were a few outliers
to drop from the num_comments column.
</p>
Chosen Features:
<ul>
<li>Title</li>
<li>Subreddit</li>
<li>Over_18</li>
<li>Is_Original_Content</li>
<li>Is_Self</li>
<li>Spoiler</li>
<li>Locked</li>
<li>Stickied</li>
<li>Num_Comments (Target)</li>
</ul>
<p>
Then I split the data I'm going to use into two dataframes (numeric and non) to prepare
for further processing.
</p>
<h3>Clean</h3>
<p><a href="projects/reddit/clean.html">Cleaning the data further</a> consists of:</p>
<ul>
<li>Scaling numeric features between 0-1</li>
<li>Converting '_' and '-' to whitespace</li>
<li>Removing any non a-z or A-Z or whitespace</li>
<li>Stripping any leftover whitespace</li>
<li>Deleting any titles that were reduced to empty strings</li>
</ul>
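<p>
A minimal sketch of those steps, picking up from the scraper output above; the column
names are assumptions:
</p>
<pre>
# Sketch of the cleaning steps listed above
import re

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

posts = pd.read_csv('scrapey.csv')

# Scale the numeric feature between 0 and 1
posts[['num_comments']] = MinMaxScaler().fit_transform(posts[['num_comments']])

def clean_title(title: str) -> str:
    title = title.replace('_', ' ').replace('-', ' ')  # '_' and '-' to whitespace
    title = re.sub(r'[^a-zA-Z\s]', '', title)          # remove anything not a-z, A-Z, or whitespace
    return title.strip()                               # strip leftover whitespace

posts['title'] = posts['title'].astype(str).apply(clean_title)
posts = posts[posts['title'] != '']  # delete titles reduced to empty strings
</pre>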
<h3>Model</h3>
<p>
If the number of comments on a post is greater than the median number of comments,
it's assigned a 1; otherwise a 0. This is the target column. I then try some
lemmatizing, but it doesn't seem to add much. After that I create and join some dummies,
then split and feed the new dataframe into
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html"
>Random Forest</a
>
and
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html"
>KNeighbors</a
>
classifiers. Both actually scored the same with
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html"
>cross validation</a
>
so I mainly used the forest.
</p>
<p><a href="projects/reddit/model.html">Notebook Here</a></p>
<h3>Conclusion</h3>
<p>Some Predictors from Top 25:</p>
<ul>
<li>Is_Self</li>
<li>Subreddit_Memes</li>
<li>OC</li>
<li>Over_18</li>
<li>Subreddit_Shitposting</li>
<li>Is_Original_Content</li>
<li>Subreddit_Superstonk</li>
</ul>
<p>
Popular words: 'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im',
'dont', and 'love'.
</p>
<p>
People on Reddit (at least in the past few days) like their memes, porn, and talking
about their day. And it's preferred if the content is original and self posted. So yes,
post your memes to memes and shitposting, tag them NSFW, use some words from the list,
and rake in all that sweet karma!
</p>
<p>
But it's not that simple: this is a fairly simple model, with simple data. To go beyond
this I think the comments would have to be analyzed.
<a href="https://en.wikipedia.org/wiki/Lemmatisation">Lemmatisation</a> is what I thought would
be the most influential piece, and I still think that's right in principle. But in this
case it doesn't apply, because there is no real meaning to be had from Reddit post
titles, at least to a computer (or I did something wrong).
</p>
<p>
A human sees a lot more than just the text in the title: there's often an
image attached, most posts reference a recent or current event, and some are an inside
joke of sorts. Some posts have emojis in the title, and depending on their
combination they can take on a meaning completely different from their individual
meanings. The next step from here, I believe, is to analyze the comments section of these
posts, because right now I think that's the easiest way to truly describe the
meaning of a post to a computer. With what was gathered here I'm only able to get about
10% above baseline, and I think that's about all there is to be had; we could probably
tweak out a few more percent, but I don't think there's much left on the table.
</p>
</article>


@ -2,16 +2,9 @@
<p className="align-right date">Jul 01, 2022</p>
<h2 className="title">It's a post about nothing!</h2>
<p>The progress update</p>
<center>
<iframe
src="https://gfycat.com/ifr/DistantUnpleasantHyracotherium"
frameborder="0"
scrolling="no"
allowfullscreen
width="640"
height="535"
></iframe>
</center>
<p className='align-center'>
<img src="https://doordesk.net/pics/plates.gif" />
</p>
<h3>Bots</h3>
<p>
After finding a number of ways not to begin the project formerly known as my capstone,
@ -20,12 +13,12 @@
href="https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows"
>dataset</a
>. The project is about detecting bots, starting with twitter. I've
<a href="projects/bots/docs/debot.pdf">studied</a> a
<a href="projects/bots/docs/botwalk.pdf">few</a>
<a href="projects/bots/docs/smu.pdf">different</a>
<a href="projects/bots/docs/div.pdf">methods</a> of bot detection and particularly like the
<a href="projects/bots/docs/debot.pdf">DeBot</a> and
<a href="projects/bots/docs/botwalk.pdf">BotWalk</a> methods and think I will try to mimic them,
<a href="https://doordesk.net/projects/bots/docs/debot.pdf">studied</a> a
<a href="https://doordesk.net/projects/bots/docs/botwalk.pdf">few</a>
<a href="https://doordesk.net/projects/bots/docs/smu.pdf">different</a>
<a href="https://doordesk.net/projects/bots/docs/div.pdf">methods</a> of bot detection and particularly like the
<a href="https://doordesk.net/projects/bots/docs/debot.pdf">DeBot</a> and
<a href="https://doordesk.net/projects/bots/docs/botwalk.pdf">BotWalk</a> methods and think I will try to mimic them,
in that order.
</p>
<p>


@ -1,22 +1,39 @@
import { Component } from 'react'
import { Component } from 'react'
import './App.css'
import Header from './components/Header.js'
import Blog from './components/Blog.js'
const BLOG_POSTS = [
'blog/000000000-swim.html',
'blog/20220506-change.html'
const FAKE_IT_TIL_YOU_MAKE_IT: string[] = [
'Blog',
'Games',
'Cartman',
'Enigma',
'Notebooks',
]
class App extends Component {
constructor(props) {
interface IAppProps {
}
interface IAppState {
currentPage: string;
}
class App extends Component<IAppProps, IAppState> {
constructor(props: IAppProps) {
super(props)
this.state = {
currentPage: 'Blog'
}
}
render() {
let page;
if (this.state.currentPage === 'Blog') {
page = <Blog />
}
return (
<div className="App">
<Header />
<Blog />
<Header pages={FAKE_IT_TIL_YOU_MAKE_IT} currentPage={this.state.currentPage} />
{page}
</div>
)
}


@ -1,10 +1,21 @@
import { Component } from 'react'
import BlogPost from './BlogPost.js'
const BLOG_URLS: string[] = [
// should render one by one
// make api that has post id, title, date, etc with url to article; then
// distribute to blog posts
const FAKE_IT_TIL_YOU_MAKE_IT: string[] = [
'blog/20220701-progress.html',
'blog/20220614-reddit.html',
'blog/20220602-back.html',
'blog/20220529-housing.html',
'blog/20220520-nvidia.html',
'blog/20220506-change.html',
'blog/000000000-swim.html'
'blog/000000000-swim.html',
]
interface IBlogProps {
}
@ -23,7 +34,7 @@ class Blog extends Component<IBlogProps, IBlogState> {
render() {
return (
<>
{this.renderPosts(BLOG_URLS)}
{this.renderPosts(FAKE_IT_TIL_YOU_MAKE_IT)}
</>
)
}


@ -1,21 +1,32 @@
function Header() {
return (
<div className="header">
<div className="content">
<header>
<h1>DoorDesk</h1>
</header>
<nav>
<p>
<a href="../index.html">[Home]</a> -
<a href="../games">[Games]</a> -
<a href="https://github.com/adoyle0">[GitHub]</a> -
[Cartman]
</p>
</nav>
import { Component } from 'react'
interface IHeaderProps {
pages: string[];
currentPage: string;
}
interface IHeaderState {
}
class Header extends Component<IHeaderProps, IHeaderState> {
constructor(props: IHeaderProps) {
super(props)
}
render() {
return (
<div className="header">
<div className="content">
<header>
<h1>DoorDesk</h1>
</header>
<nav>
<p> {this.props.currentPage} </p>
<p> {this.props.pages} </p>
</nav>
</div>
</div>
</div>
)
)
}
}
export default Header

start_frontend_ghetto Executable file

@ -0,0 +1,5 @@
#!/bin/bash
cd doordesk &&
npm install &&
npm run dev