diff --git a/doordesk/public/blog/20220529-housing.html b/doordesk/public/blog/20220529-housing.html
new file mode 100644
index 0000000..515d299
--- /dev/null
+++ b/doordesk/public/blog/20220529-housing.html
@@ -0,0 +1,138 @@
+
+
May 29, 2022
+
Predicting Housing Prices
+
+ A recent project I had for class was to use
+ scikit-learn
+ to create a regression model that predicts the price of a house based on some
+ features of that house.
+
+
How?
+
+
+ Pick out and analyze certain features from the dataset. Used here is the
+ Ames Iowa Housing Data
+ set.
+
+
+ Do some signal processing to provide a clearer input down the line, improving
+ accuracy.
+
+
Make predictions on sale price
+
+ Compare the predicted prices to recorded actual sale prices and score the results
+
+
+
What's important?
+
+ Well, I don't know much about appraising houses. But I have heard the term "price per
+ square foot" so we'll start with that:
+
+
+
+ There is a feature for 'Above Grade Living Area', meaning floor area that isn't basement.
+ It looks linear; there were a couple of outliers to take care of, but this should be a
+ good signal.
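That outlier cleanup can be sketched as below. The column names ("Gr Liv Area", "SalePrice") come from the raw Ames CSV, but the tiny example frame and the 4,000 sq ft cutoff are my assumptions, not the exact values used in the project:

```python
import pandas as pd

# Hypothetical miniature of the Ames data; real column names, made-up rows.
df = pd.DataFrame({
    "Gr Liv Area": [856, 1710, 5642, 1262, 4676],
    "SalePrice": [140000, 208500, 160000, 181500, 184750],
})

# Drop the handful of very large homes that sold for unusually little;
# a cutoff around 4,000 sq ft is a common choice for this dataset.
clean = df[df["Gr Liv Area"] <= 4000]
```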
+
+
Next I calculated the age of every house at time of sale and plotted it:
+
+
+ Exactly what I'd expect to see: price drops as age goes up, with a few outliers. We'll
+ include that in the model.
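The age calculation is a simple column subtraction. "Year Built" and "Yr Sold" are real Ames columns, though the sample rows here are invented:

```python
import pandas as pd

# Hypothetical rows; the Ames data records both the build year and the sale year.
df = pd.DataFrame({
    "Year Built": [1961, 2005, 1930],
    "Yr Sold": [2010, 2007, 2008],
})

# Age of each house at the time it was sold.
df["Age at Sale"] = df["Yr Sold"] - df["Year Built"]
```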
+
+
Next I chose the area of the lot:
+
+
+ Lot area positively affects sale price because land has value. Most of the houses here
+ have similarly sized lots.
+
+
Pre-Processing
+
+
+ Here is an example where using
+ StandardScaler()
+ just doesn't cut it. The values are all scaled in a way where they can be compared
+ to one another, but outliers have a huge effect on the clarity of the signal as a
+ whole.
+
+
+
+
+
+
+
+
+
+ In the second figure you can clearly see that an old shed, represented in the top-left
+ corner, will sell for far less than a brand-new mansion, represented in the bottom-right
+ corner. This is the result of using the
+ QuantileTransformer()
+ for scaling.
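A minimal sketch of the difference on a toy feature (the data is invented purely to exaggerate the effect):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, QuantileTransformer

# One extreme outlier among otherwise close values.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# StandardScaler: the outlier inflates the mean and std,
# squashing the four inliers into a tiny sliver of the scale.
ss = StandardScaler().fit_transform(X)

# QuantileTransformer: rank-based mapping, so the inliers
# stay evenly spread out regardless of the outlier's magnitude.
qt = QuantileTransformer(n_quantiles=5, output_distribution="uniform").fit_transform(X)
```

After StandardScaler the inliers span a few hundredths of a standard deviation; after QuantileTransformer they cover most of the [0, 1] range.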
+
+ Predictions were within about $35-$40k on average.
+ It's a little fuzzy at the higher end of prices, I believe due to the small sample size.
+ There are a few outliers that could probably be reduced with some deeper cleaning;
+ however, I was worried about going too far and creating a different story. An "ideal"
+ model in this case would look like a straight line.
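Scoring that average dollar error is a one-liner in scikit-learn; the prices below are made up just to show the call:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical recorded vs. predicted sale prices.
y_true = [150_000, 200_000, 310_000]
y_pred = [140_000, 215_000, 300_000]

# Average absolute dollar error per prediction.
mae = mean_absolute_error(y_true, y_pred)
```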
+
+
Conclusion
+
+ This model was designed with a focus on quality and consistency. With some refinement,
+ it should be possible to reduce the margin of error to a reasonable number, at which
+ point reliable, accurate predictions could be made for any application where there is a
+ need to assess the value of a property.
+
+
+ I think a large limiting factor here is the size of the dataset compared to the quality
+ of the features provided. There are
+ more features
+ from this dataset that could be included, but I think the largest gains will come from
+ simply feeding in more data. As you stray from the "low-hanging fruit" features, the
+ overall quality of your model starts to go down.
+
+
Here's an interesting case, Overall Condition of Property:
+
+
+
+
+ You would expect sale price to increase with condition, no? Yet it goes down. Why?
+ I believe it's because a lot of sellers want to say that their house is of the highest
+ quality, no matter its condition. It seems that most normal people (who aren't liars)
+ don't care to rate their property and just say it's average. Combined, these two groups
+ actually create a negative trend for condition, which definitely won't help predictions!
+
+
+ I would like to expand this in the future, maybe scraping websites like Zillow to gather
+ more data.
+ We'll see.
+
+
diff --git a/doordesk/public/blog/20220529-housing/pics/age.png b/doordesk/public/blog/20220529-housing/pics/age.png
new file mode 100644
index 0000000..318184d
Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/age.png differ
diff --git a/doordesk/public/blog/20220529-housing/pics/age_liv_area_ss.png b/doordesk/public/blog/20220529-housing/pics/age_liv_area_ss.png
new file mode 100644
index 0000000..ffb5739
Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/age_liv_area_ss.png differ
diff --git a/doordesk/public/blog/20220529-housing/pics/age_liv_qt.png b/doordesk/public/blog/20220529-housing/pics/age_liv_qt.png
new file mode 100644
index 0000000..1f9782a
Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/age_liv_qt.png differ
diff --git a/doordesk/public/blog/20220529-housing/pics/livarea_no_outliers.png b/doordesk/public/blog/20220529-housing/pics/livarea_no_outliers.png
new file mode 100644
index 0000000..520a4a3
Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/livarea_no_outliers.png differ
diff --git a/doordesk/public/blog/20220529-housing/pics/lot_area.png b/doordesk/public/blog/20220529-housing/pics/lot_area.png
new file mode 100644
index 0000000..f5eb2bc
Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/lot_area.png differ
diff --git a/doordesk/public/blog/20220529-housing/pics/mod_out.png b/doordesk/public/blog/20220529-housing/pics/mod_out.png
new file mode 100644
index 0000000..7bad6cc
Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/mod_out.png differ
diff --git a/doordesk/public/blog/20220529-housing/pics/overall_cond.png b/doordesk/public/blog/20220529-housing/pics/overall_cond.png
new file mode 100644
index 0000000..8141f20
Binary files /dev/null and b/doordesk/public/blog/20220529-housing/pics/overall_cond.png differ
diff --git a/doordesk/public/blog/20220614-reddit.html b/doordesk/public/blog/20220614-reddit.html
new file mode 100644
index 0000000..830a076
--- /dev/null
+++ b/doordesk/public/blog/20220614-reddit.html
@@ -0,0 +1,128 @@
+
+
Jun 14, 2022
+
What Goes Into a Successful Reddit Post?
+
+ To find out what makes a Reddit post successful, I will use some
+ classification models to try to determine which features have the highest influence on
+ making a correct prediction. In particular, I use
+ Random Forest
+ and
+ K-Nearest Neighbors
+ classifiers. Then I'll score the results and see what the strongest predictors are.
+
+
+ To find what goes into making a successful Reddit post we'll have to do a few things,
+ first of which is collecting data:
+
+
Introducing Scrapey!
+
+ Scrapey is my scraper script that takes a snapshot
+ of Reddit/r/all hot and saves the data to a .csv file, including a calculated age for
+ each post, about every 12 minutes. Each run takes about 2 minutes and adds about
+ 100 unique posts to the list while updating any posts it has already seen.
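The core of a snapshot like that can be sketched as a function that flattens one Reddit JSON listing (the shape returned by reddit.com's `/r/all/hot.json` endpoint) into CSV-ready rows with a computed age. The field names (`id`, `title`, `created_utc`, etc.) are real Reddit API fields, but this is my reconstruction, not Scrapey's actual code:

```python
def listing_to_rows(listing: dict, now: float) -> list[dict]:
    """Flatten one Reddit JSON listing into rows, adding a
    calculated age in minutes for each post."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        rows.append({
            "id": post["id"],
            "subreddit": post["subreddit"],
            "title": post["title"],
            "score": post["score"],
            "num_comments": post["num_comments"],
            # Age in minutes, from the post's Unix creation timestamp.
            "age_min": (now - post["created_utc"]) / 60.0,
        })
    return rows
```

Keyed on `id`, re-running this every ~12 minutes lets each snapshot update posts it has already seen.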
+
+
+ I run this in the background in a terminal and it updates my data set every ~12 minutes.
+ I have records of all posts within about 12 minutes of them disappearing from /r/all.
+
Deleting any titles that were reduced to empty strings
+
+
Model
+
+ If the number of comments on a post is greater than the median number of comments,
+ it's assigned a 1, otherwise a 0. This is the target column. I then try some
+ lemmatizing; it doesn't seem to add much. After that I create and join some dummies,
+ then split and feed the new dataframe into
+ Random Forest
+ and
+ K-Nearest Neighbors
+ classifiers. Both actually scored the same with
+ cross validation
+ so I mainly used the forest.
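The target construction and forest-with-cross-validation step can be sketched as follows. The six-row dataframe and the bag-of-words featurization are stand-ins of my own, not the post's exact pipeline:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# Hypothetical miniature of the scraped data.
df = pd.DataFrame({
    "title": ["cute cat does a thing", "breaking news today", "my first oc post",
              "ask me anything", "good day at the park", "new game just dropped"],
    "num_comments": [5, 50, 10, 200, 1, 80],
})

# Target: 1 if the post drew more comments than the median, else 0.
df["target"] = (df["num_comments"] > df["num_comments"].median()).astype(int)

# Bag-of-words on the titles, then a random forest scored with cross validation.
X = CountVectorizer().fit_transform(df["title"])
scores = cross_val_score(RandomForestClassifier(random_state=0), X, df["target"], cv=3)
```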
+
+ Popular words: 'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im',
+ 'dont', and 'love'.
+
+
+ People on Reddit (at least in the past few days) like their memes, porn, and talking
+ about their day. And it's preferred if the content is original and self-posted. So yes,
+ post your memes to memes and shitposting, tag them NSFW, use some words from the list,
+ and rake in all that sweet karma!
+
+
+ But it's not that simple: this is a fairly basic model, with basic data. To go beyond
+ this I think the comments would have to be analyzed.
+ I thought lemmatization would
+ be the most influential piece, and I still think that reasoning is sound. But in this
+ case it doesn't apply, because there is no real meaning to be extracted from Reddit post
+ titles, at least to a computer. (Or I did something wrong.)
+
+
+ A human sees a lot more than just the text in the title: there's often an
+ image attached, most posts reference a recent or current event, and some are an inside
+ joke of sorts. Some posts have emojis in the title, and depending on their
+ combination they can take on a meaning completely different from their individual
+ meanings. The next step from here, I believe, is to analyze the comments section of
+ these posts, because at the moment I think that's the easiest way to truly convey the
+ meaning of a post to a computer. With what was gathered here I was only able to get 10%
+ above baseline, and I think that's about all there is to be had. We could probably
+ tweak out a few more percent, but I don't think there's much left on the table.
+
After finding a number of ways not to begin the project formerly known as my capstone,
@@ -20,12 +13,12 @@
href="https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows"
>dataset. The project is about detecting bots, starting with twitter. I've
- studied a
- few
- different
- methods of bot detection and particularly like the
- DeBot and
- BotWalk methods and think I will try to mimic them,
+ studied a
+ few
+ different
+ methods of bot detection and particularly like the
+ DeBot and
+ BotWalk methods and think I will try to mimic them,
in that order.