diff --git a/doordesk/public/blog/20220529-housing.html b/doordesk/public/blog/20220529-housing.html new file mode 100644 index 0000000..515d299 --- /dev/null +++ b/doordesk/public/blog/20220529-housing.html @@ -0,0 +1,138 @@ +
+

May 29, 2022

+

Predicting Housing Prices

+

+ A recent project I had for class was to use + scikit-learn + to create a regression model that will predict the price of a house based on some + features of that house. +

+

How?

+
    +
  1. + Pick out and analyze certain features from the dataset. Used here is the + Ames Iowa Housing Data + set. +
  2. +
  3. + Do some signal processing to provide a clearer input down the line, improving + accuracy. +
  4. +
  5. Make predictions on sale price
  6. +
  7. + Compare the predicted prices to recorded actual sale prices and score the results +
  8. +
+

What's important?

+

+ Well, I don't know much about appraising houses. But I have heard the term "price per + square foot" so we'll start with that: +

+
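+ "Price per square foot" is just sale price divided by living area. A quick pandas sketch (the column names "Gr Liv Area" and "SalePrice" are my assumption about the Ames CSV headers, not taken from the notebook):

```python
import pandas as pd

# Toy frame standing in for the Ames data; the column names
# "Gr Liv Area" and "SalePrice" are assumed, not verified.
df = pd.DataFrame({
    "Gr Liv Area": [1500, 2000, 1200],
    "SalePrice": [150000, 240000, 108000],
})

# Price per square foot of above-grade living area
df["price_per_sqft"] = df["SalePrice"] / df["Gr Liv Area"]
print(df["price_per_sqft"].round(2).tolist())  # [100.0, 120.0, 90.0]
```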

+

+ There is a feature for 'Above Grade Living Area', meaning floor area that isn't basement. + It looks linear; there were a couple of outliers to take care of, but this should be a + good signal. +

+

Next I calculated the age of every house at time of sale and plotted it:

+
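+ Age at time of sale is a simple column difference. A sketch, assuming the Ames columns are named "Yr Sold" and "Year Built" (an assumption on my part):

```python
import pandas as pd

# Hypothetical column names ("Yr Sold", "Year Built") for the Ames CSV
df = pd.DataFrame({"Yr Sold": [2008, 2010], "Year Built": [1976, 2009]})

# Age of each house at the time it sold
df["Age at Sale"] = df["Yr Sold"] - df["Year Built"]
print(df["Age at Sale"].tolist())  # [32, 1]
```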

+

+ Exactly what I'd expect to see: price drops as age goes up, with a few outliers. We'll + include that in the model. +

+

Next I chose the area of the lot:

+

+

+ Lot area positively affects sale price because land has value. Most of the houses here + have similarly sized lots. +

+

Pre-Processing

+
+

+ Here is an example where using + StandardScaler() + just doesn't cut it. The values are all scaled so they can be compared + to one another, but outliers have a huge effect on the clarity of the signal as a + whole. +

+ +
+ + +
+
+
+

+ In the second figure you can clearly see that an old shed, represented in the top left + corner, will sell for far less than a brand-new mansion, represented in the bottom right + corner. This is the result of using the + QuantileTransformer() + for scaling. +

+
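+ The effect can be reproduced in miniature. A sketch on synthetic numbers (not the project's data) showing why one outlier wrecks StandardScaler() while QuantileTransformer() spreads values out by rank:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

# Four ordinary living areas and one exaggerated outlier
X = np.array([[1200.0], [1500.0], [1800.0], [2100.0], [40000.0]])

z = StandardScaler().fit_transform(X)
q = QuantileTransformer(n_quantiles=5, output_distribution="uniform").fit_transform(X)

# The outlier sits ~2 standard deviations out and squashes the rest together
print(z.ravel().round(2))
# Rank-based scaling spreads the points evenly: [0. 0.25 0.5 0.75 1.]
print(q.ravel())
```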

The Model

+

+ A simple + LinearRegression() + should do just fine, with + QuantileTransformer() + scaling of course. +

+
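+ A minimal sketch of such a model, chaining the scaler and the regressor in a pipeline. The data here is a synthetic stand-in for the three chosen features, not the notebook's actual inputs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-ins for living area, age at sale, and lot area
rng = np.random.default_rng(0)
X = rng.uniform([800, 0, 4000], [4000, 120, 20000], size=(200, 3))
y = 50 * X[:, 0] - 500 * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 5000, 200)

# QuantileTransformer scaling in front of a plain LinearRegression
model = make_pipeline(
    QuantileTransformer(n_quantiles=100, output_distribution="normal"),
    LinearRegression(),
)
model.fit(X, y)
print(round(model.score(X, y), 3))  # R^2 on the training data
```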
+ +
+

+ Predictions were within about $35-$40k on average.
+ It's a little fuzzy at the higher end of the price range, I believe due to the small + sample size there. There are a few outliers that could probably be reduced with some + deeper cleaning; however, I was worried about going too far and telling a different + story. An "ideal" model in this case would plot as a straight line. +

+
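+ The "$35-$40k on average" figure reads like a mean absolute error. A quick sketch of how such a score is computed, with toy numbers rather than the project's predictions:

```python
from sklearn.metrics import mean_absolute_error

# Toy recorded vs. predicted sale prices, for illustration only
actual = [150000, 240000, 310000]
predicted = [145000, 280000, 290000]

# Average absolute dollar gap between prediction and reality
print(mean_absolute_error(actual, predicted))  # ~21666.67
```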

Conclusion

+

+ This model was designed with a focus on quality and consistency. With some refinement, + the margin of error could be brought down to a reasonable figure, at which point + reliable, accurate predictions could be made for any application that needs to + assess the value of a property. +

+

+ I think a large limiting factor here is the size of the dataset compared to the quality + of the features provided. There are + more features + from this dataset that could be included, but I think the largest gains will come from + simply feeding in more data. As you stray from the "low-hanging fruit" features, the + overall quality of your model starts to go down. +

+

Here's an interesting case, Overall Condition of Property:

+
+ +
+

+ You would expect sale price to increase with quality, no? Yet it goes down. Why?
+ I believe it's because a lot of sellers want to say that their house is of the highest + quality, no matter its condition. It seems that most normal people (who aren't liars) + don't care to rate their property and just say it's average. Combined, these two + actually create a negative trend for quality, which definitely won't help predictions! +

+

+ I would like to expand this in the future, maybe scraping websites like Zillow to gather + more data.
+ We'll see. +

+
diff --git a/doordesk/public/blog/20220529-housing/pics/age.png b/doordesk/public/blog/20220529-housing/pics/age.png new file mode 100644 index 0000000..318184d Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/age.png differ diff --git a/doordesk/public/blog/20220529-housing/pics/age_liv_area_ss.png b/doordesk/public/blog/20220529-housing/pics/age_liv_area_ss.png new file mode 100644 index 0000000..ffb5739 Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/age_liv_area_ss.png differ diff --git a/doordesk/public/blog/20220529-housing/pics/age_liv_qt.png b/doordesk/public/blog/20220529-housing/pics/age_liv_qt.png new file mode 100644 index 0000000..1f9782a Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/age_liv_qt.png differ diff --git a/doordesk/public/blog/20220529-housing/pics/livarea_no_outliers.png b/doordesk/public/blog/20220529-housing/pics/livarea_no_outliers.png new file mode 100644 index 0000000..520a4a3 Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/livarea_no_outliers.png differ diff --git a/doordesk/public/blog/20220529-housing/pics/lot_area.png b/doordesk/public/blog/20220529-housing/pics/lot_area.png new file mode 100644 index 0000000..f5eb2bc Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/lot_area.png differ diff --git a/doordesk/public/blog/20220529-housing/pics/mod_out.png b/doordesk/public/blog/20220529-housing/pics/mod_out.png new file mode 100644 index 0000000..7bad6cc Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/mod_out.png differ diff --git a/doordesk/public/blog/20220529-housing/pics/overall_cond.png b/doordesk/public/blog/20220529-housing/pics/overall_cond.png new file mode 100644 index 0000000..8141f20 Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/overall_cond.png differ diff --git a/doordesk/public/blog/20220614-reddit.html b/doordesk/public/blog/20220614-reddit.html new 
file mode 100644 index 0000000..830a076 --- /dev/null +++ b/doordesk/public/blog/20220614-reddit.html @@ -0,0 +1,128 @@ +
+

Jun 14, 2022

+

What Goes Into a Successful Reddit Post?

+

+ In an attempt to find out what makes a Reddit post successful, I will use some + classification models to try to determine which features have the highest influence on + making a correct prediction. In particular I use + Random Forest + and + KNeighbors + classifiers. Then I'll score the results and see what the highest predictors are. +

+

+ To find what goes into making a successful Reddit post, we'll have to do a few things, + the first of which is collecting data: +

+

Introducing Scrapey!

+

+ Scrapey is my scraper script. About every 12 minutes + it takes a snapshot of Reddit's /r/all "hot" listing and saves the data to a .csv file, + including a calculated age for each post. Run time is about 2 minutes per iteration, and + each pass adds about 100 unique posts to the list while updating any posts it has + already seen. +

+
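+ As a rough sketch of what a scraper like Scrapey might look like (this is not the actual Scrapey source; it assumes Reddit's public JSON listing endpoint, and the function names are mine):

```python
import time

import requests  # the real script may use a proper API client such as PRAW instead


def age_minutes(created_utc: float, now: float) -> float:
    """Age of a post, in minutes, at snapshot time."""
    return (now - created_utc) / 60.0


def snapshot(limit: int = 100) -> list[dict]:
    """One pass over /r/all 'hot': one row per post."""
    resp = requests.get(
        "https://www.reddit.com/r/all/hot.json",
        params={"limit": limit},
        headers={"User-Agent": "scrapey-sketch"},
        timeout=30,
    )
    resp.raise_for_status()
    now = time.time()
    return [
        {
            "id": child["data"]["id"],
            "title": child["data"]["title"],
            "subreddit": child["data"]["subreddit"],
            "score": child["data"]["score"],
            "num_comments": child["data"]["num_comments"],
            "age_min": age_minutes(child["data"]["created_utc"], now),
        }
        for child in resp.json()["data"]["children"]
    ]


def run_forever() -> None:
    """Snapshot on a ~12-minute cadence; a real run would upsert rows by post id into the .csv."""
    while True:
        rows = snapshot()
        print(f"fetched {len(rows)} posts")
        time.sleep(12 * 60)
```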

+ I run this in the background in a terminal and it updates my data set every ~12 minutes. + I have records of all posts within about 12 minutes of them disappearing from /r/all. +

+

EDA

+

+ Next I take a quick look to see what looks useful, what + doesn't, and check for outliers that will throw off the model. There were a few outliers + to drop from the num_comments column. +

+ Chosen Features: + +

+ Then I split the data I'm going to use into two dataframes (numeric and non) to prepare + for further processing. +

+
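+ The numeric/non-numeric split is one line each with select_dtypes. A toy sketch (the columns here are illustrative, not the full chosen-feature list):

```python
import pandas as pd

# Miniature stand-in for the scraped post data
df = pd.DataFrame({
    "title": ["cat meme", "news update"],
    "subreddit": ["memes", "news"],
    "score": [54000, 1200],
    "num_comments": [300, 45],
})

# Split into numeric and non-numeric frames for separate processing
numeric = df.select_dtypes(include="number")
non_numeric = df.select_dtypes(exclude="number")
print(list(numeric.columns))      # ['score', 'num_comments']
print(list(non_numeric.columns))  # ['title', 'subreddit']
```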

Clean

+

Cleaning the data further consists of:

+ +

Model

+

+ If a post's number of comments is greater than the median number of comments, + then it's assigned a 1, otherwise a 0. This is the target column. I then try some + lemmatizing, but it doesn't seem to add much. After that I create and join some dummies, + then split the new dataframe and feed it into + Random Forest + and + KNeighbors + classifiers. Both actually scored the same with + cross validation + so I mainly used the forest. +

+
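+ A sketch of the target construction and the two classifiers, on synthetic features (the real model used dummies built from the post data, which aren't reproduced here):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for the dummy features and comment counts
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)))
num_comments = rng.integers(0, 500, size=200)

# Above-median comment count -> 1, otherwise 0: the binary target
y = (num_comments > np.median(num_comments)).astype(int)

# Score both classifiers with cross validation
for clf in (RandomForestClassifier(random_state=0), KNeighborsClassifier()):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, scores.mean().round(3))
```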

Notebook Here

+

Conclusion

+

Some Predictors from Top 25:

+ +

+ Popular words: 'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im', + 'dont', and 'love'. +

+

+ People on Reddit (at least over the past few days) like their memes, porn, and talking + about their day. And it's preferred if the content is original and self-posted. So yes, + post your memes to memes and shitposting, tag them NSFW, use some words from the list, + and rake in all that sweet karma! +

+

+ But it's not that simple: this is a fairly simple model, with simple data. To go beyond + it, I think the comments would have to be analyzed. I thought + Lemmatisation + would be the most influential piece, and I still think that's right in general. But in + this case it doesn't apply, because there is no real meaning to be had from Reddit post + titles, at least to a computer (or I did something wrong). +

+

+ A human sees a lot more than just the text in the title: there's often an image + attached, most posts reference a recent or current event, and some could be an inside + joke of sorts. Some titles contain emojis, and depending on their combination they can + take on a meaning completely different from their individual meanings. The next step + from here, I believe, is to analyze the comments sections of these posts, because right + now I think that's the easiest way to truly describe the meaning of a post to a + computer. With what was gathered here I'm only able to get 10% above baseline, and I + think that's about all there is to be had; we could probably tweak out a few more + percent, but I don't think there's much left on the table. +

+
diff --git a/doordesk/public/blog/20220701-progress.html b/doordesk/public/blog/20220701-progress.html index a8b8eaa..8bb7634 100644 --- a/doordesk/public/blog/20220701-progress.html +++ b/doordesk/public/blog/20220701-progress.html @@ -2,16 +2,9 @@

Jul 01, 2022

It's a post about nothing!

The progress update

-
- -
+

+ +

Bots

After finding a number of ways not to begin the project formerly known as my capstone, @@ -20,12 +13,12 @@ href="https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows" >dataset. The project is about detecting bots, starting with twitter. I've - studied a - few - different - methods of bot detection and particularly like the - DeBot and - BotWalk methods and think I will try to mimic them, + studied a + few + different + methods of bot detection and particularly like the + DeBot and + BotWalk methods and think I will try to mimic them, in that order.

diff --git a/doordesk/src/App.tsx b/doordesk/src/App.tsx index 16beded..06b403f 100644 --- a/doordesk/src/App.tsx +++ b/doordesk/src/App.tsx @@ -1,22 +1,39 @@ -import { Component } from 'react' +import { Component } from 'react' import './App.css' import Header from './components/Header.js' import Blog from './components/Blog.js' -const BLOG_POSTS = [ - 'blog/000000000-swim.html', - 'blog/20220506-change.html' +const FAKE_IT_TIL_YOU_MAKE_IT: string[] = [ + 'Blog', + 'Games', + 'Cartman', + 'Enigma', + 'Notebooks', ] -class App extends Component { - constructor(props) { +interface IAppProps { +} + +interface IAppState { + currentPage: string; +} + +class App extends Component { + constructor(props: IAppProps) { super(props) + this.state = { + currentPage: 'Blog' + } } render() { + let page; + if (this.state.currentPage === 'Blog') { + page = + } return (

-
- +
+ {page}
) } diff --git a/doordesk/src/components/Blog.tsx b/doordesk/src/components/Blog.tsx index 5234e70..81e4c59 100644 --- a/doordesk/src/components/Blog.tsx +++ b/doordesk/src/components/Blog.tsx @@ -1,10 +1,21 @@ import { Component } from 'react' import BlogPost from './BlogPost.js' -const BLOG_URLS: string[] = [ +// should render one by one + +// make api that has post id, title, date, etc with url to article; then +// distribute to blog posts + +const FAKE_IT_TIL_YOU_MAKE_IT: string[] = [ + 'blog/20220701-progress.html', + 'blog/20220614-reddit.html', + 'blog/20220602-back.html', + 'blog/20220529-housing.html', + 'blog/20220520-nvidia.html', 'blog/20220506-change.html', - 'blog/000000000-swim.html' + 'blog/000000000-swim.html', ] + interface IBlogProps { } @@ -23,7 +34,7 @@ class Blog extends Component { render() { return ( <> - {this.renderPosts(BLOG_URLS)} + {this.renderPosts(FAKE_IT_TIL_YOU_MAKE_IT)} ) } diff --git a/doordesk/src/components/Header.tsx b/doordesk/src/components/Header.tsx index f341751..91f56fd 100644 --- a/doordesk/src/components/Header.tsx +++ b/doordesk/src/components/Header.tsx @@ -1,21 +1,32 @@ -function Header() { - return ( -
-
-
-

DoorDesk

-
- +import { Component } from 'react' + +interface IHeaderProps { + pages: string[]; + currentPage: string; +} + +interface IHeaderState { +} + +class Header extends Component { + constructor(props: IHeaderProps) { + super(props) + } + render() { + return ( +
+
+
+

DoorDesk

+
+ +
-
- ) + ) + } } export default Header diff --git a/start_frontend_ghetto b/start_frontend_ghetto new file mode 100755 index 0000000..af5e4c8 --- /dev/null +++ b/start_frontend_ghetto @@ -0,0 +1,5 @@ +#!/bin/bash + +cd doordesk && + npm install && + npm run dev