yee
doordesk/public/blog/20220529-housing.html (new file, 138 lines)
@@ -0,0 +1,138 @@
<article>
    <p className="align-right date">May 29, 2022</p>

    <h2 className="title">Predicting Housing Prices</h2>

    <p>
        A recent project I had for class was to use
        <a href="https://scikit-learn.org/stable/index.html" target="new">scikit-learn</a>
        to create a regression model that predicts the price of a house based on some of
        its features.
    </p>
    <h3>How?</h3>

    <ol>
        <li>
            Pick out and analyze certain features from the dataset. Used here is the
            <a href="https://www.kaggle.com/datasets/marcopale/housing" target="new">Ames Iowa Housing Data</a>
            set.
        </li>
        <li>
            Do some signal processing to provide a clearer input down the line, improving
            accuracy.
        </li>
        <li>Make predictions on sale price.</li>
        <li>
            Compare the predicted prices to the recorded actual sale prices and score the
            results.
        </li>
    </ol>

    <h3>What's important?</h3>

    <p>
        Well, I don't know much about appraising houses, but I have heard the term "price per
        square foot", so we'll start with that:
    </p>

    <p className="align-center"><img src="https://doordesk.net/pics/livarea_no_outliers.png" /></p>

    <p>
        There is a feature for 'Above Grade Living Area', meaning floor area that isn't
        basement. It looks linear; there were a couple of outliers to take care of, but this
        should be a good signal.
    </p>
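    <p>
        For reference, a minimal sketch of that cleanup with pandas (the file path, the
        'Gr Liv Area'/'SalePrice' column names, and the 4,000 sq ft cutoff are assumptions
        based on the Ames data documentation, used here only for illustration):
    </p>

    <pre><code>import pandas as pd

# Load the Ames data (path assumed)
df = pd.read_csv('data/AmesHousing.csv')

# Drop the handful of very large homes that skew the living-area trend
df = df[df['Gr Liv Area'].lt(4000)]

# Sanity check on the relationship plotted above
print(df[['Gr Liv Area', 'SalePrice']].corr())
</code></pre>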
    <p>Next I calculated the age of every house at the time of sale and plotted it:</p>

    <p className="align-center"><img src="https://doordesk.net/pics/age.png" /></p>

    <p>
        Exactly what I'd expect to see: price drops as age goes up, with a few outliers.
        We'll include that in the model.
    </p>
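    <p>
        That age column is just the difference of two existing columns (names again assumed
        from the Ames documentation):
    </p>

    <pre><code># Age of the house at the time it sold
df['Age'] = df['Yr Sold'] - df['Year Built']
print(df['Age'].describe())
</code></pre>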
    <p>Next I chose the area of the lot:</p>

    <p className="align-center"><img src="https://doordesk.net/pics/lot_area.png" /></p>

    <p>
        Lot area positively affects sale price because land has value. Most of the houses here
        have similarly sized lots.
    </p>

    <h3>Pre-Processing</h3>

    <div>
        <p>
            Here is an example where using
            <a
                href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"
                target="new"
            >StandardScaler()</a>
            just doesn't cut it. The values are all scaled so that they can be compared
            to one another, but outliers have a huge effect on the clarity of the signal as a
            whole.
        </p>
        <span>
            <center>
                <img src="https://doordesk.net/pics/age_liv_area_ss.png" />
                <img src="https://doordesk.net/pics/age_liv_qt.png" />
            </center>
        </span>
    </div>

    <p>
        You should clearly see in the second figure that an old shed, represented in the top left
        corner, will sell for far less than a brand new mansion, represented in the bottom right
        corner. This is the result of using
        <a
            href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html"
            target="new"
        >QuantileTransformer()</a>
        for scaling.
    </p>
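    <p>
        Roughly, the comparison between the two scalers looks like this (the feature names are
        assumptions carried over from the plots above):
    </p>

    <pre><code>from sklearn.preprocessing import StandardScaler, QuantileTransformer

features = df[['Age', 'Gr Liv Area']]

# StandardScaler only centers and rescales, so outliers keep their influence
scaled_standard = StandardScaler().fit_transform(features)

# QuantileTransformer maps each feature onto its quantiles,
# which flattens the outliers and makes the trend easier to see
scaled_quantile = QuantileTransformer().fit_transform(features)
</code></pre>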
    <h3>The Model</h3>

    <p>
        A simple
        <a
            href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html"
            target="new"
        >LinearRegression()</a>
        should do just fine, with
        <a
            href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html"
            target="new"
        >QuantileTransformer()</a>
        scaling, of course.
    </p>
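    <p>
        As a sketch, that pipeline is only a few lines (the feature columns below are the ones
        plotted above, used as an assumption about what goes into the model):
    </p>

    <pre><code>from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

X = df[['Gr Liv Area', 'Age', 'Lot Area']]
y = df['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = make_pipeline(QuantileTransformer(), LinearRegression())
model.fit(X_train, y_train)

# Compare predictions against the actual recorded sale prices
print(mean_absolute_error(y_test, model.predict(X_test)))
</code></pre>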
    <center>
        <img src="https://doordesk.net/pics/mod_out.png" />
    </center>

    <p>
        Predictions were within about $35-$40k on average.<br />
        It's a little fuzzy at the higher end of prices, I believe due to the small sample size.
        There are a few outliers that could probably be reduced with some deeper cleaning;
        however, I was worried about going too far and creating a different story. An "ideal"
        model in this case would look like a straight line.
    </p>
    <h3>Conclusion</h3>

    <p>
        This model was designed with a focus on quality and consistency. With some refinement,
        the margin of error should come down to a reasonable number, and then reliable,
        accurate predictions could be made for any application where there is a need to
        assess the value of a property.
    </p>

    <p>
        I think a large limiting factor here is the size of the dataset compared to the quality
        of the features provided. There are
        <a href="http://jse.amstat.org/v19n3/decock/DataDocumentation.txt">more features</a>
        from this dataset that could be included, but I think the largest gains will come from
        simply feeding in more data. As you stray from the "low-hanging fruit" features, the
        overall quality of your model starts to go down.
    </p>

    <p>Here's an interesting case: Overall Condition of Property.</p>

    <center>
        <img src="https://doordesk.net/pics/overall_cond.png" />
    </center>

    <p>
        You would expect sale price to increase with quality, no? Yet it goes down. Why?<br />
        I believe it's because a lot of sellers want to say that their house is of the highest
        quality, no matter its condition. It seems that most normal people (who aren't liars)
        don't care to rate their property and just say it's average. Both of these combined
        actually create a negative trend for quality, which definitely won't help predictions!
    </p>

    <p>
        I would like to expand on this in the future, maybe scraping websites like Zillow to
        gather more data.<br />
        We'll see.
    </p>
</article>
BIN  doordesk/public/blog/20220529-housing/pics/age.png (new file, 80 KiB)
BIN  doordesk/public/blog/20220529-housing/pics/age_liv_area_ss.png (new file, 140 KiB)
BIN  doordesk/public/blog/20220529-housing/pics/age_liv_qt.png (new file, 252 KiB)
BIN  (path not shown, 79 KiB)
BIN  doordesk/public/blog/20220529-housing/pics/lot_area.png (new file, 65 KiB)
BIN  doordesk/public/blog/20220529-housing/pics/mod_out.png (new file, 128 KiB)
BIN  doordesk/public/blog/20220529-housing/pics/overall_cond.png (new file, 48 KiB)
doordesk/public/blog/20220614-reddit.html (new file, 128 lines)
@@ -0,0 +1,128 @@
<article>
    <p className="align-right date">Jun 14, 2022</p>

    <h2 className="title">What Goes Into a Successful Reddit Post?</h2>

    <p>
        In an attempt to find out what makes a Reddit post successful, I will use some
        classification models to try to determine which features have the highest influence on
        making a correct prediction. In particular I use
        <a
            href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html"
        >Random Forest</a>
        and
        <a
            href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html"
        >KNeighbors</a>
        classifiers. Then I'll score the results and see what the strongest predictors are.
    </p>
    <p>
        To find what goes into making a successful Reddit post, we'll have to do a few things,
        the first of which is collecting data:
    </p>

    <h3>Introducing Scrapey!</h3>

    <p>
        <a href="projects/reddit/scrapey.html">Scrapey</a> is my scraper script that takes a
        snapshot of Reddit's /r/all hot listing about every 12 minutes and saves the data to a
        .csv file, including a calculated age for each post. Run time is about 2 minutes per
        iteration, and each pass adds about 100 unique posts to the list while updating any
        post it has already seen.
    </p>

    <p>
        I run this in the background in a terminal and it updates my dataset every ~12 minutes.
        I have records of all posts up until about 12 minutes before they disappear from /r/all.
    </p>
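    <p>
        The actual script is linked above; as a rough sketch of the idea using Reddit's public
        JSON listing (the real version also de-duplicates and updates posts it has already
        seen):
    </p>

    <pre><code>import time
import pandas as pd
import requests

def snapshot():
    # Grab the current /r/all hot listing as JSON
    resp = requests.get(
        'https://www.reddit.com/r/all/hot.json?limit=100',
        headers={'User-Agent': 'scrapey-sketch'},
    )
    posts = [child['data'] for child in resp.json()['data']['children']]
    df = pd.DataFrame(posts)
    df['age'] = time.time() - df['created_utc']  # seconds since the post was created
    return df

while True:
    snapshot().to_csv('scrape.csv', mode='a', index=False)
    time.sleep(12 * 60)  # roughly every 12 minutes
</code></pre>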
    <h3>EDA</h3>

    <p>
        <a href="projects/reddit/EDA.html">Next I take a quick look to see what looks useful</a>,
        what doesn't, and check for outliers that would throw off the model. There were a few
        outliers to drop from the num_comments column.
    </p>

    <p>Chosen features:</p>

    <ul>
        <li>Title</li>
        <li>Subreddit</li>
        <li>Over_18</li>
        <li>Is_Original_Content</li>
        <li>Is_Self</li>
        <li>Spoiler</li>
        <li>Locked</li>
        <li>Stickied</li>
        <li>Num_Comments (Target)</li>
    </ul>

    <p>
        Then I split the data I'm going to use into two dataframes (numeric and non-numeric) to
        prepare for further processing.
    </p>
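    <p>That split is roughly a one-liner each with pandas (assuming the working dataframe is df):</p>

    <pre><code># Separate numeric features from text/categorical ones for different processing
numeric = df.select_dtypes(include='number')
non_numeric = df.select_dtypes(exclude='number')
</code></pre>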
    <h3>Clean</h3>

    <p>
        <a href="projects/reddit/clean.html">Cleaning the data further</a> consists of the
        following (a rough sketch follows the list):
    </p>

    <ul>
        <li>Scaling numeric features between 0 and 1</li>
        <li>Converting '_' and '-' to whitespace</li>
        <li>Removing any characters that aren't a-z, A-Z, or whitespace</li>
        <li>Stripping any leftover whitespace</li>
        <li>Deleting any titles that were reduced to empty strings</li>
    </ul>
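    <p>
        Assuming the numeric/non_numeric split from above and a 'title' column, that looks
        roughly like:
    </p>

    <pre><code>import re

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Scale every numeric feature into the 0-1 range
numeric_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(numeric), columns=numeric.columns
)

def clean_title(title: str) -> str:
    title = title.replace('_', ' ').replace('-', ' ')
    title = re.sub(r'[^a-zA-Z\s]', '', title)  # keep only letters and whitespace
    return title.strip()

titles = non_numeric['title'].map(clean_title)
titles = titles[titles.str.len() > 0]  # drop titles reduced to empty strings
</code></pre>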
    <h3>Model</h3>

    <p>
        If the number of comments on a post is greater than the median total number of comments,
        it's assigned a 1, otherwise a 0. This is the target column. I then try some
        lemmatizing, but it doesn't seem to add much. After that I create and join some dummies,
        then split and feed the new dataframe into
        <a
            href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html"
        >Random Forest</a>
        and
        <a
            href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html"
        >KNeighbors</a>
        classifiers. Both actually scored the same with
        <a
            href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html"
        >cross validation</a>,
        so I mainly used the forest.
    </p>
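    <p>
        In sketch form (the column names are assumptions based on the feature list above, and
        the title handling is left out; the full version is in the notebook linked below):
    </p>

    <pre><code>import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Binary target: above-median comment count or not
y = (df['num_comments'] > df['num_comments'].median()).astype(int)

# Dummy out the categoricals and join everything into one feature matrix
X = pd.get_dummies(
    df[['subreddit', 'over_18', 'is_original_content', 'is_self', 'spoiler', 'locked', 'stickied']]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)
knn = KNeighborsClassifier().fit(X_train, y_train)

print(cross_val_score(forest, X, y).mean())
print(cross_val_score(knn, X, y).mean())

# The forest also reports which features it leaned on most
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(25))
</code></pre>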
    <p><a href="projects/reddit/model.html">Notebook Here</a></p>

    <h3>Conclusion</h3>

    <p>Some predictors from the top 25:</p>

    <ul>
        <li>Is_Self</li>
        <li>Subreddit_Memes</li>
        <li>OC</li>
        <li>Over_18</li>
        <li>Subreddit_Shitposting</li>
        <li>Is_Original_Content</li>
        <li>Subreddit_Superstonk</li>
    </ul>

    <p>
        Popular words: 'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im',
        'dont', and 'love'.
    </p>

    <p>
        People on Reddit (at least in the past few days) like their memes, porn, and talking
        about their day. And it's preferred if the content is original and self-posted. So yes,
        post your memes to memes and shitposting, tag them NSFW, use some words from the list,
        and rake in all that sweet karma!
    </p>

    <p>
        But it's not that simple. This is a fairly simple model, with simple data. To go beyond
        this I think the comments would have to be analyzed. I thought
        <a href="https://en.wikipedia.org/wiki/Lemmatisation">lemmatisation</a> would be the
        most influential piece, and I still think that's correct, but in this case it doesn't
        apply because there is no real meaning to be had from Reddit post titles, at least to a
        computer. (Or I did something wrong.)
    </p>

    <p>
        There's a lot more seen by a human than just the text in the title: there's often an
        image attached, most posts reference a recent or current event, or they could be an
        inside joke of sorts. Some posts have emojis in the title, and depending on their
        combination they can take on a meaning completely different from their individual
        meanings. The next step from here, I believe, is to analyze the comments section of
        these posts, because at the moment I think that's the easiest way to truly describe the
        meaning of a post to a computer. With what was gathered here I'm only able to get about
        10% above baseline, and I think that's all there is to be had. We could probably tweak
        out a few more percent, but I don't think there's much left on the table.
    </p>
</article>
@@ -2,16 +2,9 @@
 <p className="align-right date">Jul 01, 2022</p>
 <h2 className="title">It's a post about nothing!</h2>
 <p>The progress update</p>
-<center>
-    <iframe
-        src="https://gfycat.com/ifr/DistantUnpleasantHyracotherium"
-        frameborder="0"
-        scrolling="no"
-        allowfullscreen
-        width="640"
-        height="535"
-    ></iframe>
-</center>
+<p className='align-center'>
+    <img src="https://doordesk.net/pics/plates.gif" />
+</p>
 <h3>Bots</h3>
 <p>
     After finding a number of ways not to begin the project formerly known as my capstone,

@@ -20,12 +13,12 @@
     href="https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows"
     >dataset</a
     >. The project is about detecting bots, starting with twitter. I've
-    <a href="projects/bots/docs/debot.pdf">studied</a> a
-    <a href="projects/bots/docs/botwalk.pdf">few</a>
-    <a href="projects/bots/docs/smu.pdf">different</a>
-    <a href="projects/bots/docs/div.pdf">methods</a> of bot detection and particularly like the
-    <a href="projects/bots/docs/debot.pdf">DeBot</a> and
-    <a href="projects/bots/docs/botwalk.pdf">BotWalk</a> methods and think I will try to mimic them,
+    <a href="https://doordesk.net/projects/bots/docs/debot.pdf">studied</a> a
+    <a href="https://doordesk.net/projects/bots/docs/botwalk.pdf">few</a>
+    <a href="https://doordesk.net/projects/bots/docs/smu.pdf">different</a>
+    <a href="https://doordesk.net/projects/bots/docs/div.pdf">methods</a> of bot detection and particularly like the
+    <a href="https://doordesk.net/projects/bots/docs/debot.pdf">DeBot</a> and
+    <a href="https://doordesk.net/projects/bots/docs/botwalk.pdf">BotWalk</a> methods and think I will try to mimic them,
     in that order.
 </p>
 <p>
@@ -1,22 +1,39 @@
 import { Component } from 'react'
 import './App.css'
 import Header from './components/Header.js'
 import Blog from './components/Blog.js'
 
-const BLOG_POSTS = [
-    'blog/000000000-swim.html',
-    'blog/20220506-change.html'
-]
+const FAKE_IT_TIL_YOU_MAKE_IT: string[] = [
+    'Blog',
+    'Games',
+    'Cartman',
+    'Enigma',
+    'Notebooks',
+]
 
-class App extends Component {
-    constructor(props) {
+interface IAppProps {
+}
+
+interface IAppState {
+    currentPage: string;
+}
+
+class App extends Component<IAppProps, IAppState> {
+    constructor(props: IAppProps) {
         super(props)
+        this.state = {
+            currentPage: 'Blog'
+        }
     }
     render() {
+        let page;
+        if (this.state.currentPage === 'Blog') {
+            page = <Blog />
+        }
         return (
             <div className="App">
-                <Header />
-                <Blog />
+                <Header pages={FAKE_IT_TIL_YOU_MAKE_IT} currentPage={this.state.currentPage} />
+                {page}
             </div>
         )
     }
@@ -1,10 +1,21 @@
 import { Component } from 'react'
 import BlogPost from './BlogPost.js'
 
-const BLOG_URLS: string[] = [
-    'blog/20220506-change.html',
-    'blog/000000000-swim.html'
-]
+// should render one by one
+
+// make api that has post id, title, date, etc with url to article; then
+// distribute to blog posts
+
+const FAKE_IT_TIL_YOU_MAKE_IT: string[] = [
+    'blog/20220701-progress.html',
+    'blog/20220614-reddit.html',
+    'blog/20220602-back.html',
+    'blog/20220529-housing.html',
+    'blog/20220520-nvidia.html',
+    'blog/20220506-change.html',
+    'blog/000000000-swim.html',
+]
 
 interface IBlogProps {
 }

@@ -23,7 +34,7 @@ class Blog extends Component<IBlogProps, IBlogState> {
     render() {
         return (
             <>
-                {this.renderPosts(BLOG_URLS)}
+                {this.renderPosts(FAKE_IT_TIL_YOU_MAKE_IT)}
             </>
         )
     }
@@ -1,21 +1,32 @@
-function Header() {
-    return (
-        <div className="header">
-            <div className="content">
-                <header>
-                    <h1>DoorDesk</h1>
-                </header>
-                <nav>
-                    <p>
-                        <a href="../index.html">[Home]</a> -
-                        <a href="../games">[Games]</a> -
-                        <a href="https://github.com/adoyle0">[GitHub]</a> -
-                        [Cartman]
-                    </p>
-                </nav>
-            </div>
-        </div>
-    )
+import { Component } from 'react'
+
+interface IHeaderProps {
+    pages: string[];
+    currentPage: string;
+}
+
+interface IHeaderState {
+}
+
+class Header extends Component<IHeaderProps, IHeaderState> {
+    constructor(props: IHeaderProps) {
+        super(props)
+    }
+    render() {
+        return (
+            <div className="header">
+                <div className="content">
+                    <header>
+                        <h1>DoorDesk</h1>
+                    </header>
+                    <nav>
+                        <p> {this.props.currentPage} </p>
+                        <p> {this.props.pages} </p>
+                    </nav>
+                </div>
+            </div>
+        )
+    }
 }
 
 export default Header
start_frontend_ghetto (new executable file, 5 lines)
@@ -0,0 +1,5 @@
#!/bin/bash

cd doordesk &&
npm install &&
npm run dev