<article>
<p>
A recent class project of mine was to use
<a href="https://scikit-learn.org/stable/index.html" target="new">scikit-learn</a>
to build a regression model that predicts the price of a house from a few
features of that house.
</p>
<h3>How?</h3>
<ol>
<li>
Pick out and analyze certain features from the dataset. This post uses the
<a href="https://www.kaggle.com/datasets/marcopale/housing" target="new"
>Ames Iowa Housing Data</a
>
set.
</li>
<li>
Do some preprocessing to give the model a clearer input signal, improving
accuracy.
</li>
<li>Make predictions on sale price.</li>
<li>
Compare the predicted prices to the recorded sale prices and score the results.
</li>
</ol>
<h3>What's important?</h3>
<p>
Well, I don't know much about appraising houses. But I have heard the term "price per
square foot", so we'll start with that:
</p>
<p style="text-align: center;"><img src="https://doordesk.net/pics/livarea_no_outliers.png" /></p>
<p>
There is a feature called 'Above Grade Living Area', meaning floor area that isn't basement.
The relationship looks linear. There were a couple of outliers to take care of, but this
should be a good signal.
</p>
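<p>
A minimal sketch of one way to drop living-area outliers like these. The column names
<code>Gr Liv Area</code> and <code>SalePrice</code> are assumed from the dataset
documentation, the 4,000 square foot cutoff is a hypothetical choice, and the rows are
made up for illustration. It is not necessarily the cleaning that produced the plot above.
</p>

```python
import pandas as pd

# Made-up rows for illustration; column names are assumed from the dataset docs.
homes = pd.DataFrame(
    {
        "Gr Liv Area": [1200, 1656, 2100, 896, 5642, 4676],
        "SalePrice": [142000, 215000, 260000, 105000, 160000, 184750],
    }
)

# Hypothetical cutoff: treat anything above 4,000 sq ft of living area as an outlier.
MAX_LIVING_AREA = 4000

# Series.le keeps the rows whose living area is at or below the cutoff.
trimmed = homes[homes["Gr Liv Area"].le(MAX_LIVING_AREA)].reset_index(drop=True)

print(f"kept {len(trimmed)} of {len(homes)} rows")
```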
<p>Next I calculated the age of every house at the time of sale and plotted it:</p>
<p style="text-align: center;"><img src="https://doordesk.net/pics/age.png" /></p>
<p>
Exactly what I'd expect to see: price drops as age goes up, with a few outliers. We'll
include that in the model.
</p>
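<p>
The age calculation mentioned above can be sketched roughly like this, assuming the
dataset's <code>Year Built</code> and <code>Yr Sold</code> columns (names taken from the
dataset documentation, so treat them as assumptions). The rows are made up.
</p>

```python
import pandas as pd

# Made-up rows; "Year Built" and "Yr Sold" are assumed column names.
homes = pd.DataFrame(
    {
        "Year Built": [1960, 2003, 1915, 2008],
        "Yr Sold": [2010, 2008, 2006, 2008],
        "SalePrice": [135000, 230000, 98000, 310000],
    }
)

# Age at the time of sale is the sale year minus the construction year.
homes["Age"] = homes["Yr Sold"] - homes["Year Built"]

print(homes[["Age", "SalePrice"]])
```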
<p>Next I chose the area of the lot:</p>
<p style="text-align: center;"><img src="https://doordesk.net/pics/lot_area.png" /></p>
<p>
Lot area positively affects sale price because land has value. Most of the houses here
have similarly sized lots.
</p>
<h3>Pre-Processing</h3>
<div>
<p>
Here is an example where using
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"
target="new"
>StandardScaler()</a
>
just doesn't cut it. The values are all scaled so that they can be compared
to one another, but outliers have a huge effect on the clarity of the signal as a
whole.
</p>
<span>
<center>
<img src="https://doordesk.net/pics/age_liv_area_ss.png" />
<img src="https://doordesk.net/pics/age_liv_qt.png" />
</center>
</span>
</div>
<p>
In the second figure you can clearly see that an old shed, represented in the top left
corner, will sell for far less than a brand new mansion, represented in the bottom right
corner. This is the result of using the
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html"
target="new"
>QuantileTransformer()</a
>
for scaling.
</p>
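<p>
To make the difference between the two scalers concrete, here is a small sketch on
synthetic data with one extreme outlier. The data, the <code>n_quantiles</code> value, and
the default uniform output distribution are my assumptions here, not necessarily the
settings behind the figures above.
</p>

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

# Synthetic, right-skewed "lot area" values plus one extreme outlier (made-up data).
rng = np.random.default_rng(0)
lot_area = rng.lognormal(mean=9.0, sigma=0.3, size=(200, 1))
lot_area[0, 0] = 200_000.0

# StandardScaler subtracts the mean and divides by the standard deviation,
# so the single outlier stretches the scale for every other point.
standard = StandardScaler().fit_transform(lot_area)

# QuantileTransformer maps each value to its rank-based quantile,
# so the outlier simply lands at the edge of a bounded 0-to-1 range.
quantile = QuantileTransformer(n_quantiles=200, random_state=0).fit_transform(lot_area)

print("StandardScaler max:", standard.max())
print("QuantileTransformer max:", quantile.max())
```

<p>
With StandardScaler the outlier sits many standard deviations away and squeezes the rest
of the points together. With QuantileTransformer it is just the largest value in the
range.
</p>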
<h3>The Model</h3>
<p>
A simple
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html"
>LinearRegression()</a
>
should do just fine, with
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html"
target="new"
>QuantileTransformer()</a
>
scaling of course.
</p>
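<p>
One way to wire those two pieces together is a scikit-learn pipeline. This is only a
sketch under assumptions: the features are synthetic stand-ins for living area, age, and
lot area, the pricing rule is invented so there is something to fit, and
<code>n_quantiles=100</code> is an arbitrary choice.
</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-ins for living area, age, and lot area (made-up data).
rng = np.random.default_rng(42)
n = 300
living_area = rng.uniform(600, 3500, n)
age = rng.uniform(0, 100, n)
lot_area = rng.lognormal(9.0, 0.4, n)
X = np.column_stack([living_area, age, lot_area])

# Invented pricing rule with noise, only so the example has something to fit.
y = 50_000 + 90 * living_area - 600 * age + 1.5 * lot_area + rng.normal(0, 15_000, n)

# Scale every feature with QuantileTransformer, then fit ordinary least squares.
model = make_pipeline(
    QuantileTransformer(n_quantiles=100, random_state=0),
    LinearRegression(),
)
model.fit(X, y)

print("R^2 on the training data:", model.score(X, y))
```

<p>
Putting the scaler inside the pipeline means it is fit only on the data passed to
<code>fit</code>, so the same transformation gets reused at prediction time.
</p>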
<center>
<img src="https://doordesk.net/pics/mod_out.png" />
</center>
<p>
Predictions were within about $35k-$40k of the actual sale price on average.<br />
It's a little fuzzy at the higher end of prices, I believe due to the small sample size.
There are a few outliers that could probably be reduced with some deeper cleaning, but
I was worried about going too far and creating a different story. An "ideal" model in
this case would look like a straight line.
</p>
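<p>
An "on average" figure like that can be computed with mean absolute error. I'm assuming
that metric for this sketch (the score above may have been computed differently), and the
prices are made up.
</p>

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted sale prices in dollars.
actual = np.array([150_000, 210_000, 320_000, 98_000])
predicted = np.array([172_000, 185_000, 361_000, 130_000])

# Mean absolute error: the average size of the miss, ignoring direction.
mae = mean_absolute_error(actual, predicted)

print(f"Predictions were off by ${mae:,.0f} on average")
```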
<h3>Conclusion</h3>
<p>
This model was designed with a focus on quality and consistency. With some refinement,
the margin of error could be reduced to a reasonable number, making reliable,
accurate predictions possible for any application where there is a need to
assess the value of a property.
</p>
<p>
I think a large limiting factor here is the size of the dataset compared to the quality
of the features provided. There are
<a href="http://jse.amstat.org/v19n3/decock/DataDocumentation.txt">more features</a>
in this dataset that could be included, but I think the largest gains will come from
simply feeding in more data. As you stray from the "low hanging fruit" features, the
quality of your model overall starts to go down.
</p>
<p>Here's an interesting case, Overall Condition of Property:<br /><br /></p>
<center>
<img src="https://doordesk.net/pics/overall_cond.png" />
</center>
<p>
You would expect sale price to increase with quality, no? Yet it goes down. Why?<br />
I believe it's because a lot of sellers want to say that their house is of the highest
quality, no matter the condition. It seems that most normal people (who aren't liars)
don't care to rate their property and just say it's average. Both of these combined
actually create a negative trend for quality, which definitely won't help predictions!
</p>
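<p>
A quick way to check a trend like this is to look at the median price and the number of
houses per rating. This is a sketch with invented numbers that only mimic the shape
described above. <code>Overall Cond</code> and <code>SalePrice</code> are assumed column
names from the dataset documentation.
</p>

```python
import pandas as pd

# Made-up rows; "Overall Cond" and "SalePrice" are assumed column names.
homes = pd.DataFrame(
    {
        "Overall Cond": [5, 5, 5, 5, 7, 7, 9, 9],
        "SalePrice": [210000, 185000, 240000, 199000, 150000, 162000, 140000, 155000],
    }
)

# Median price and row count per rating show both the trend
# and how lopsided the ratings are.
by_condition = homes.groupby("Overall Cond")["SalePrice"].agg(["median", "count"])

print(by_condition)
```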
<p>
I would like to expand this in the future, maybe by scraping websites like Zillow to gather
more data. <br />
We'll see.
</p>
</article>