113 lines
4.7 KiB
Markdown
113 lines
4.7 KiB
Markdown
|
Content_Type: project
|
||
|
Title: Predicting Housing Prices
|
||
|
Date: 2022 5 29
|
||
|
|
||
|
A recent project I had for class was to use [scikit-learn](https://scikit-learn.org/stable/index.html) to create a regression model that will predict the price of a house based on some features of that house.
|
||
|
|
||
|
### How?
|
||
|
|
||
|
1 Pick out and analyze certain features from the dataset. Used here is the [Ames Iowa Housing Data](https://www.kaggle.com/datasets/marcopale/housing) set.
|
||
|
1 Do some signal processing to provide a clearer input down the line, improving accuracy
|
||
|
1 Make predictions on sale price
|
||
|
1 Compare the predicted prices to recorded actual sale prices and score the results
|
||
|
|
||
|
### What's important?
|
||
|
|
||
|
Well, I don't know much about appraising houses. But I have heard the term "price per
|
||
|
square foot" so we'll start with that:
|
||
|
|
||
|
<p style="text-align: center;"><img src="https://old.doordesk.net/pics/livarea_no_outliers.png" /></p>
|
||
|
|
||
|
There is a feature for 'Above Grade Living Area' meaning floor area that's not basement.
|
||
|
It looks linear, there were a couple outliers to take care of but this should be a good
|
||
|
signal.
|
||
|
|
||
|
Next I calculated the age of every house at time of sale and plotted it:
|
||
|
|
||
|
<p style="text-align: center;"><img src="https://old.doordesk.net/pics/age.png" /></p>
|
||
|
|
||
|
Exactly what I'd expect to see. Price drops as age goes up, a few outliers. We'll
|
||
|
include that in the model.
|
||
|
|
||
|
Next I chose the area of the lot:
|
||
|
|
||
|
<p style="text-align: center;"><img src="https://old.doordesk.net/pics/lot_area.png" /></p>
|
||
|
|
||
|
Lot area positively affects sale price because land has value. Most of the houses here
|
||
|
have similarly sized lots.
|
||
|
|
||
|
### Pre-Processing
|
||
|
<div>
|
||
|
<p>
|
||
|
Here is an example where using
|
||
|
<a
|
||
|
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"
|
||
|
target="new"
|
||
|
>StandardScaler()</a
|
||
|
>
|
||
|
just doesn't cut it. The values are all scaled in a way where they can be compared
|
||
|
to one-another, but outliers have a huge effect on the clarity of the signal as a
|
||
|
whole.
|
||
|
</p>
|
||
|
<span>
|
||
|
<center>
|
||
|
<img src="https://old.doordesk.net/pics/age_liv_area_ss.png" />
|
||
|
<img src="https://old.doordesk.net/pics/age_liv_qt.png"
|
||
|
</center>
|
||
|
</span>
|
||
|
</div>
|
||
|
|
||
|
You should clearly see in the second figure that an old shed represented in the top left
|
||
|
corner will sell for far less than a brand new mansion represented in the bottom right
|
||
|
corner. This is the result of using the [QuantileTransformer()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html)
|
||
|
for scaling.
|
||
|
|
||
|
### The Model
|
||
|
|
||
|
A simple [LinearRegression()](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
|
||
|
should do just fine, with [QuantileTransformer()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html)
|
||
|
scaling of course.
|
||
|
|
||
|
<center>
|
||
|
<img src="https://old.doordesk.net/pics/mod_out.png" />
|
||
|
</center>
|
||
|
|
||
|
Predictions were within about $35-$40k on average.
|
||
|
|
||
|
It's a little fuzzy in the higher end of prices, I believe due to the small sample size.
|
||
|
There are a few outliers that can probably be reduced with some deeper cleaning however
|
||
|
I was worried about going too far and creating a different story. An "ideal" model in
|
||
|
this case would look like a straight line.
|
||
|
|
||
|
### Conclusion
|
||
|
|
||
|
This model was designed with a focus on quality and consistency. With some refinement,
|
||
|
the margin of error should be able to be reduced to a reasonable number and then
|
||
|
reliable, accurate predictions can be made for any application where there is a need to
|
||
|
assess the value of a property.
|
||
|
|
||
|
I think a large limiting factor here is the size of the dataset compared to the quality
|
||
|
of the features provided. There are
|
||
|
<a href="http://jse.amstat.org/v19n3/decock/DataDocumentation.txt">more features</a>
|
||
|
from this dataset that can be included but I think the largest gains will be had from
|
||
|
simply feeding in more data. As you stray from the "low hanging fruit" features, the
|
||
|
quality of your model overall starts to go down.
|
||
|
|
||
|
Here's an interesting case, Overall Condition of Property:
|
||
|
|
||
|
<center>
|
||
|
<img src="https://old.doordesk.net/pics/overall_cond.png" />
|
||
|
</center>
|
||
|
|
||
|
You would expect sale price to increase with quality, no? Yet it goes down.. Why?
|
||
|
|
||
|
I believe it's because a lot of sellers want to say that their house is of highest
|
||
|
quality, no matter the condition. It seems that most normal people (who aren't liars)
|
||
|
dont't care to rate their property and just say it's average. Both of these combined
|
||
|
actually create a negative trend for quality which definitely won't help predictions!
|
||
|
|
||
|
I would like to expand this in the future, maybe scraping websites like Zillow to gather
|
||
|
more data.
|
||
|
|
||
|
We'll see.
|