doordesk-js/dennis/static/projects/20220529-housing.html

<article>
    <p>
        A recent project I had for class was to use
        <a href="https://scikit-learn.org/stable/index.html" target="new">scikit-learn</a>
        to create a regression model that will predict the price of a house based on some
        features of that house.
    </p>
    <h3>How?</h3>
    <ol>
        <li>
            Pick out and analyze certain features from the dataset. Used here is the
            <a href="https://www.kaggle.com/datasets/marcopale/housing" target="new"
            >Ames Iowa Housing Data</a
            >
            set.
        </li>
        <li>
            Pre-process (scale) the chosen features to provide a clearer input down the line,
            improving accuracy.
        </li>
        <li>Make predictions on sale price</li>
        <li>
            Compare the predicted prices to recorded actual sale prices and score the results
        </li>
    </ol>
    <h3>What's important?</h3>
    <p>
        Well, I don't know much about appraising houses. But I have heard the term "price per
        square foot" so we'll start with that:
    </p>
    <p style="text-align: center;"><img src="https://doordesk.net/pics/livarea_no_outliers.png" /></p>
    <p>
        There is a feature for 'Above Grade Living Area', meaning floor area that's not basement.
        Its relationship with sale price looks linear. There were a couple of outliers to take
        care of, but this should be a good signal.
    </p>
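    <p>
        As a rough sketch of what "taking care of" those outliers could look like, the snippet
        below drops rows above a living-area cutoff. The column names, the tiny stand-in table,
        and the 4000 square foot cutoff are all assumptions for illustration, not necessarily
        what was used for the plot above.
    </p>

```python
import pandas as pd

# Stand-in rows with made-up values; column names are assumed to match the Ames CSV.
homes = pd.DataFrame({
    "Gr Liv Area": [1200, 1710, 2400, 5642, 4676],
    "SalePrice": [140000, 208500, 310000, 160000, 184750],
})

# Assumed cutoff: treat anything above 4000 sq ft of living area as an outlier.
MAX_LIVING_AREA = 4000

trimmed = homes[homes["Gr Liv Area"] <= MAX_LIVING_AREA]
print(f"kept {len(trimmed)} of {len(homes)} rows")
```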
    <p>Next I calculated the age of every house at time of sale and plotted it:</p>
    <p style="text-align: center;"><img src="https://doordesk.net/pics/age.png" /></p>
    <p>
        Exactly what I'd expect to see: price drops as age goes up, with a few outliers. We'll
        include that in the model.
    </p>
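    <p>
        The age calculation is a simple subtraction. Here is a minimal sketch, assuming the
        dataset exposes the build year and sale year as 'Year Built' and 'Yr Sold' columns (the
        rows below are made up for illustration):
    </p>

```python
import pandas as pd

# Made-up rows; column names are assumed to match the Ames CSV.
homes = pd.DataFrame({
    "Year Built": [1960, 2005, 1921],
    "Yr Sold": [2010, 2006, 2008],
    "SalePrice": [150000, 260000, 98000],
})

# Age of each house at the time of sale, in years.
homes["Age"] = homes["Yr Sold"] - homes["Year Built"]
print(homes[["Age", "SalePrice"]])
```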
    <p>Next I chose the area of the lot:</p>
    <p style="text-align: center;"><img src="https://doordesk.net/pics/lot_area.png" /></p>
    <p>
        Lot area positively affects sale price because land has value. Most of the houses here
        have similarly sized lots.
    </p>
    <h3>Pre-Processing</h3>
    <div>
        <p>
            Here is an example where using
            <a
                href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"
                target="new"
            >StandardScaler()</a
            >
            just doesn't cut it. The values are all scaled so that they can be compared to one
            another, but outliers still have a huge effect on the clarity of the signal as a
            whole.
        </p>
        <span>
            <center>
                <img src="https://doordesk.net/pics/age_liv_area_ss.png" />
                <img src="https://doordesk.net/pics/age_liv_qt.png" />
            </center>
        </span>
    </div>
    <p>
        You should clearly see in the second figure that an old shed represented in the top left
        corner will sell for far less than a brand new mansion represented in the bottom right
        corner. This is the result of using the
        <a
            href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html"
            target="new"
        >QuantileTransformer()</a
        >
        for scaling.
    </p>
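    <p>
        To see why the two scalers behave so differently, here is a small sketch on synthetic
        data (not the Ames set): 99 ordinary living areas plus one extreme outlier. The numbers
        are invented, and QuantileTransformer is left on its default uniform output, which may
        differ from the settings used for the figures above.
    </p>

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

rng = np.random.default_rng(0)

# Synthetic living areas: 99 typical houses plus one huge outlier (made-up numbers).
living_area = np.append(rng.normal(1500, 400, size=99), 12_000).reshape(-1, 1)

standard = StandardScaler().fit_transform(living_area)
quantile = QuantileTransformer(n_quantiles=100).fit_transform(living_area)

# StandardScaler leaves the outlier many standard deviations from everything else,
# which squeezes the ordinary houses into a narrow band.
print("StandardScaler max:", standard.max())

# QuantileTransformer maps values by rank, so the outlier simply lands at the top
# of the [0, 1] range and the ordinary houses stay evenly spread out.
print("QuantileTransformer range:", quantile.min(), quantile.max())
```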
    <h3>The Model</h3>
    <p>
        A simple
        <a
            href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html"
        >LinearRegression()</a
        >
        should do just fine, with
        <a
            href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html"
            target="new"
        >QuantileTransformer()</a
        >
        scaling of course.
    </p>
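    <p>
        A minimal sketch of that setup, chaining the scaler and the regression in a
        scikit-learn pipeline. The data is randomly generated with an invented price formula so
        the example runs on its own; the train/test split, the n_quantiles value, and the
        variable names are assumptions rather than the exact code behind the results below.
    </p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-ins for the three features discussed above.
living_area = rng.uniform(600, 3500, n)
age = rng.uniform(0, 100, n)
lot_area = rng.uniform(3000, 20000, n)

# Invented price relationship plus noise, only so there is something to fit.
price = 50_000 + 90 * living_area - 600 * age + 2 * lot_area + rng.normal(0, 15_000, n)

X = np.column_stack([living_area, age, lot_area])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

# Scale with QuantileTransformer, then fit a plain linear regression.
model = make_pipeline(QuantileTransformer(n_quantiles=100), LinearRegression())
model.fit(X_train, y_train)

predicted = model.predict(X_test)
print("R^2 on held-out data:", model.score(X_test, y_test))
```

    <p>
        Wrapping both steps in one pipeline means the scaler is fitted on the training rows only
        and then reused as-is on the held-out rows.
    </p>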
    <center>
        <img src="https://doordesk.net/pics/mod_out.png" />
    </center>
    <p>
        Predictions were within about $35k-$40k of the actual sale price on average.<br />
        It's a little fuzzy at the higher end of prices, I believe due to the small sample size
        there. There are a few outliers that could probably be reduced with some deeper
        cleaning, but I was worried about going too far and creating a different story. An
        "ideal" model in this case would look like a straight line.
    </p>
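    <p>
        One common way to put a number on "within about X dollars on average" is mean absolute
        error. The sketch below assumes that metric (the write-up doesn't say which one was
        used) and runs on three made-up sales:
    </p>

```python
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted sale prices, for illustration only.
actual = [200_000, 150_000, 320_000]
predicted = [230_000, 140_000, 300_000]

# Average size of the miss, ignoring whether the prediction was high or low.
mae = mean_absolute_error(actual, predicted)
print(f"Predictions were off by ${mae:,.0f} on average")
```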
    <h3>Conclusion</h3>
    <p>
        This model was designed with a focus on quality and consistency. With some refinement,
        it should be possible to reduce the margin of error to a reasonable number, at which
        point the model could make reliable, accurate predictions for any application that
        needs to assess the value of a property.
    </p>
    <p>
        I think a large limiting factor here is the size of the dataset compared to the quality
        of the features provided. There are
        <a href="http://jse.amstat.org/v19n3/decock/DataDocumentation.txt">more features</a>
        from this dataset that can be included, but I think the largest gains will come from
        simply feeding in more data. As you stray from the "low-hanging fruit" features, the
        quality of your model overall starts to go down.
    </p>
    <p>Here's an interesting case, Overall Condition of Property:<br /><br /></p>
    <center>
        <img src="https://doordesk.net/pics/overall_cond.png" />
    </center>
    <p>
        You would expect sale price to increase with condition, no? Yet it goes down. Why?<br />
        I believe it's because a lot of sellers want to say that their house is in the best
        condition, no matter its actual state. It seems that most normal people (who aren't
        liars) don't care to rate their property and just say it's average. Combined, these
        two effects create a negative trend for condition, which definitely won't help
        predictions!
    </p>
    <p>
        I would like to expand this in the future, maybe scraping websites like Zillow to gather
        more data. <br />
        We'll see.
    </p>
</article>