doordesk-js/slow_react/public/projects/20220529-housing.html
2023-04-12 04:23:27 -04:00

138 lines
5.9 KiB
HTML

<article>
<p className="align-right date">May 29, 2022</p>
<h2 className="title">Predicting Housing Prices</h2>
<p>
A recent project I had for class was to use
<a href="https://scikit-learn.org/stable/index.html" target="new">scikit-learn</a>
to create a regression model that will predict the price of a house based on some
features of that house.
</p>
<h3>How?</h3>
<ol>
<li>
Pick out and analyze certain features from the dataset. Used here is the
<a href="https://www.kaggle.com/datasets/marcopale/housing" target="new"
>Ames Iowa Housing Data</a
>
set.
</li>
<li>
Do some signal processing to provide a clearer input down the line, improving
accuracy
</li>
<li>Make predictions on sale price</li>
<li>
Compare the predicted prices to recorded actual sale prices and score the results
</li>
</ol>
<h3>What's important?</h3>
<p>
Well, I don't know much about appraising houses. But I have heard the term "price per
square foot" so we'll start with that:
</p>
<p className="align-center"><img src="https://doordesk.net/pics/livarea_no_outliers.png" /></p>
<p>
There is a feature for 'Above Grade Living Area' meaning floor area that's not basement.
It looks linear, there were a couple outliers to take care of but this should be a good
signal.
</p>
<p>Next I calculated the age of every house at time of sale and plotted it:</p>
<p className="align-center"><img src="https://doordesk.net/pics/age.png" /></p>
<p>
Exactly what I'd expect to see. Price drops as age goes up, a few outliers. We'll
include that in the model.
</p>
<p>Next I chose the area of the lot:</p>
<p className="align-center"><img src="https://doordesk.net/pics/lot_area.png" /></p>
<p>
Lot area positively affects sale price because land has value. Most of the houses here
have similarly sized lots.
</p>
<h3>Pre-Processing</h3>
<div>
<p>
Here is an example where using
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"
target="new"
>StandardScaler()</a
>
just doesn't cut it. The values are all scaled in a way where they can be compared
to one-another, but outliers have a huge effect on the clarity of the signal as a
whole.
</p>
<span>
<center>
<img src="https://doordesk.net/pics/age_liv_area_ss.png" />
<img src="https://doordesk.net/pics/age_liv_qt.png" />
</center>
</span>
</div>
<p>
You should clearly see in the second figure that an old shed represented in the top left
corner will sell for far less than a brand new mansion represented in the bottom right
corner. This is the result of using the
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html"
target="new"
>QuantileTransformer()</a
>
for scaling.
</p>
<h3>The Model</h3>
<p>
A simple
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html"
>LinearRegression()</a
>
should do just fine, with
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html"
target="new"
>QuantileTransformer()</a
>
scaling of course.
</p>
<center>
<img src="https://doordesk.net/pics/mod_out.png" />
</center>
<p>
Predictions were within about $35-$40k on average.<br />
It's a little fuzzy in the higher end of prices, I believe due to the small sample size.
There are a few outliers that can probably be reduced with some deeper cleaning however
I was worried about going too far and creating a different story. An "ideal" model in
this case would look like a straight line.
</p>
<h3>Conclusion</h3>
<p>
This model was designed with a focus on quality and consistency. With some refinement,
the margin of error should be able to be reduced to a reasonable number and then
reliable, accurate predictions can be made for any application where there is a need to
assess the value of a property.
</p>
<p>
I think a large limiting factor here is the size of the dataset compared to the quality
of the features provided. There are
<a href="http://jse.amstat.org/v19n3/decock/DataDocumentation.txt">more features</a>
from this dataset that can be included but I think the largest gains will be had from
simply feeding in more data. As you stray from the "low hanging fruit" features, the
quality of your model overall starts to go down.
</p>
<p>Here's an interesting case, Overall Condition of Property:<br /><br /></p>
<center>
<img src="https://doordesk.net/pics/overall_cond.png" />
</center>
<p>
You would expect sale price to increase with quality, no? Yet it goes down.. Why?<br />
I believe it's because a lot of sellers want to say that their house is of highest
quality, no matter the condition. It seems that most normal people (who aren't liars)
dont't care to rate their property and just say it's average. Both of these combined
actually create a negative trend for quality which definitely won't help predictions!
</p>
<p>
I would like to expand this in the future, maybe scraping websites like Zillow to gather
more data. <br />
We'll see.
</p>
</article>