Adam 2023-02-05 00:30:55 -05:00
parent 836294c229
commit 02a0bd70f2
14 changed files with 347 additions and 44 deletions


@ -0,0 +1,138 @@
<article>
<p className="align-right date">May 29, 2022</p>
<h2 className="title">Predicting Housing Prices</h2>
<p>
A recent project I had for class was to use
<a href="https://scikit-learn.org/stable/index.html" target="new">scikit-learn</a>
to create a regression model that predicts the price of a house based on some of
its features.
</p>
<h3>How?</h3>
<ol>
<li>
Pick out and analyze certain features from the dataset. The dataset used here is the
<a href="https://www.kaggle.com/datasets/marcopale/housing" target="new"
>Ames Iowa Housing Data</a
>
set (a rough code sketch of these first steps follows this list).
</li>
<li>
Do some signal processing to provide a clearer input down the line, improving
accuracy
</li>
<li>Make predictions on sale price</li>
<li>
Compare the predicted prices to recorded actual sale prices and score the results
</li>
</ol>
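<p>
The original notebook isn't shown in this post, so here is only a rough sketch of the
first two steps in pandas/scikit-learn terms, assuming the Kaggle CSV is saved locally
as ames.csv and keeps the original Ames column names:
</p>
<pre>
# Rough sketch of steps 1-2; file and column names are assumptions, not the notebook's code
import pandas as pd

ames = pd.read_csv('ames.csv')

# Derived feature: age of each house at time of sale
ames['Age'] = ames['Yr Sold'] - ames['Year Built']

# Drop the couple of very large living-area outliers
ames = ames[ames['Gr Liv Area'].lt(4000)]

features = ['Gr Liv Area', 'Age', 'Lot Area']
X = ames[features]
y = ames['SalePrice']
</pre>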
<h3>What's important?</h3>
<p>
Well, I don't know much about appraising houses. But I have heard the term "price per
square foot" so we'll start with that:
</p>
<p className="align-center"><img src="https://doordesk.net/pics/livarea_no_outliers.png" /></p>
<p>
There is a feature for 'Above Grade Living Area', meaning floor area that isn't basement.
It looks linear; there were a couple of outliers to take care of, but this should be a
good signal.
</p>
<p>Next I calculated the age of every house at time of sale and plotted it:</p>
<p className="align-center"><img src="https://doordesk.net/pics/age.png" /></p>
<p>
Exactly what I'd expect to see: price drops as age goes up, with a few outliers. We'll
include that in the model.
</p>
<p>Next I chose the area of the lot:</p>
<p className="align-center"><img src="https://doordesk.net/pics/lot_area.png" /></p>
<p>
Lot area positively affects sale price because land has value. Most of the houses here
have similarly sized lots.
</p>
<h3>Pre-Processing</h3>
<div>
<p>
Here is an example where using
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"
target="new"
>StandardScaler()</a
>
just doesn't cut it. The values are all scaled so that they can be compared
to one another, but outliers have a huge effect on the clarity of the signal as a
whole.
</p>
<center>
<img src="https://doordesk.net/pics/age_liv_area_ss.png" />
<img src="https://doordesk.net/pics/age_liv_qt.png" />
</center>
</div>
<p>
In the second figure you should clearly see that an old shed, represented in the top left
corner, will sell for far less than a brand new mansion, represented in the bottom right
corner. This is the result of using the
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html"
target="new"
>QuantileTransformer()</a
>
for scaling.
</p>
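<p>
As a minimal sketch of that comparison (with the column names carried over from the
assumptions above), both scalers come straight from scikit-learn:
</p>
<pre>
# StandardScaler vs QuantileTransformer on the same two features
from sklearn.preprocessing import QuantileTransformer, StandardScaler

cols = ['Age', 'Gr Liv Area']

# Zero mean / unit variance: values become comparable, but outliers still dominate the plot
scaled_std = StandardScaler().fit_transform(ames[cols])

# Rank-based mapping to a uniform distribution: the bulk of the data spreads out
# and the outliers get pulled in toward everything else
scaled_qt = QuantileTransformer().fit_transform(ames[cols])
</pre>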
<h3>The Model</h3>
<p>
A simple
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html"
>LinearRegression()</a
>
should do just fine, with
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html"
target="new"
>QuantileTransformer()</a
>
scaling of course.
</p>
<center>
<img src="https://doordesk.net/pics/mod_out.png" />
</center>
<p>
Predictions were within about $35-$40k on average.<br />
It's a little fuzzy at the higher end of prices, I believe due to the small sample size.
There are a few outliers that could probably be reduced with some deeper cleaning, but
I was worried about going too far and telling a different story. An "ideal" model in
this case would look like a straight line.
</p>
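<p>
A hedged sketch of the model itself: a QuantileTransformer feeding a LinearRegression
inside a scikit-learn Pipeline, scored with mean absolute error on a held-out split. The
split size and random_state are arbitrary choices, not taken from the original notebook.
</p>
<pre>
# LinearRegression with QuantileTransformer scaling, scored on a held-out set
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = make_pipeline(QuantileTransformer(), LinearRegression())
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(mean_absolute_error(y_test, preds))  # average miss, roughly the $35-40k quoted above
</pre>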
<h3>Conclusion</h3>
<p>
This model was designed with a focus on quality and consistency. With some refinement,
the margin of error could be brought down to a reasonable level, at which point reliable,
accurate predictions could be made for any application where there is a need to
assess the value of a property.
</p>
<p>
I think a large limiting factor here is the size of the dataset compared to the quality
of the features provided. There are
<a href="http://jse.amstat.org/v19n3/decock/DataDocumentation.txt">more features</a>
from this dataset that could be included, but I think the largest gains will come from
simply feeding in more data. As you stray from the "low-hanging fruit" features, the
overall quality of your model starts to go down.
</p>
<p>Here's an interesting case, Overall Condition of Property:<br /><br /></p>
<center>
<img src="https://doordesk.net/pics/overall_cond.png" />
</center>
<p>
You would expect sale price to increase with quality, no? Yet it goes down... Why?<br />
I believe it's because a lot of sellers want to say that their house is of the highest
quality, no matter its condition. It seems that most normal people (who aren't liars)
don't care to rate their property and just say it's average. Both of these combined
actually create a negative trend for quality, which definitely won't help predictions!
</p>
<p>
I would like to expand this in the future, maybe scraping websites like Zillow to gather
more data. <br />
We'll see.
</p>
</article>

7 binary image files added (not shown); sizes: 80 KiB, 140 KiB, 252 KiB, 79 KiB, 65 KiB, 128 KiB, 48 KiB


@ -0,0 +1,128 @@
<article>
<p className="align-right date">Jun 14, 2022</p>
<h2 className="title">What Goes Into a Successful Reddit Post?</h2>
<p>
In an attempt to find out what makes a Reddit post successful, I will use some
classification models to try to determine which features have the most influence on
making a correct prediction. In particular I use
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html"
>Random Forest</a
>
and
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html"
>KNeighbors</a
>
classifiers. Then I'll score the results and see what the strongest predictors are.
</p>
<p>
To find what goes into making a successful Reddit post we'll have to do a few things,
first of which is collecting data:
</p>
<h3>Introducing Scrapey!</h3>
<p>
<a href="projects/reddit/scrapey.html">Scrapey</a> is my scraper script that takes a snapshot
of Reddit/r/all hot and saves the data to a .csv file including a calculated age for
each post about every 12 minutes. Run time is about 2 minutes per iteration and each
time adds about 100 unique posts to the list while updating any post it's already seen.
</p>
<p>
I run this in the background in a terminal and it updates my data set every ~12 minutes.
I have records of all posts within about 12 minutes of them disappearing from /r/all.
</p>
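<p>
The script itself is linked above rather than inlined, so what follows is only a guess
at its shape using PRAW; the credentials, field list, and file name are assumptions:
</p>
<pre>
# A guess at the shape of Scrapey using PRAW -- not the actual script linked above
import time

import pandas as pd
import praw

reddit = praw.Reddit(client_id='...', client_secret='...', user_agent='scrapey')

FIELDS = ['id', 'title', 'subreddit', 'over_18', 'is_original_content',
          'is_self', 'spoiler', 'locked', 'stickied', 'num_comments', 'created_utc']

while True:
    # Snapshot of /r/all "hot"
    rows = [[getattr(post, f) for f in FIELDS]
            for post in reddit.subreddit('all').hot(limit=1000)]
    snapshot = pd.DataFrame(rows, columns=FIELDS)
    snapshot['subreddit'] = snapshot['subreddit'].astype(str)
    snapshot['age'] = time.time() - snapshot['created_utc']  # age at snapshot time

    # Merge with existing records, keeping the newest copy of any post already seen
    try:
        old = pd.read_csv('scrapey.csv')
        snapshot = pd.concat([snapshot, old]).drop_duplicates(subset='id', keep='first')
    except FileNotFoundError:
        pass
    snapshot.to_csv('scrapey.csv', index=False)

    time.sleep(12 * 60)  # roughly every 12 minutes
</pre>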
<h3>EDA</h3>
<p>
<a href="projects/reddit/EDA.html">Next I take a quick look to see what looks useful</a>, what
doesn't, and check for outliers that will throw off the model. There were a few outliers
to drop from the num_comments column.
</p>
Chosen Features:
<ul>
<li>Title</li>
<li>Subreddit</li>
<li>Over_18</li>
<li>Is_Original_Content</li>
<li>Is_Self</li>
<li>Spoiler</li>
<li>Locked</li>
<li>Stickied</li>
<li>Num_Comments (Target)</li>
</ul>
<p>
Then I split the data I'm going to use into two dataframes (numeric and non) to prepare
for further processing.
</p>
<h3>Clean</h3>
<p><a href="projects/reddit/clean.html">Cleaning the data further</a> consists of:</p>
<ul>
<li>Scaling numeric features between 0-1</li>
<li>Converting '_' and '-' to whitespace</li>
<li>Removing any non a-z or A-Z or whitespace</li>
<li>Stripping any leftover whitespace</li>
<li>Deleting any titles that were reduced to empty strings</li>
</ul>
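<p>
A minimal sketch of those steps, picking up from the scraper output above; the column
names are assumptions:
</p>
<pre>
# Sketch of the cleaning steps listed above
import re

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

posts = pd.read_csv('scrapey.csv')

# Scale the numeric feature between 0 and 1
posts[['num_comments']] = MinMaxScaler().fit_transform(posts[['num_comments']])

def clean_title(title: str) -> str:
    title = title.replace('_', ' ').replace('-', ' ')  # '_' and '-' to whitespace
    title = re.sub(r'[^a-zA-Z\s]', '', title)          # remove anything not a-z, A-Z, or whitespace
    return title.strip()                               # strip leftover whitespace

posts['title'] = posts['title'].astype(str).apply(clean_title)
posts = posts[posts['title'] != '']  # delete titles reduced to empty strings
</pre>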
<h3>Model</h3>
<p>
If the number of comments on a post is greater than the median number of comments,
it's assigned a 1; otherwise a 0. This is the target column. I then try some
lemmatizing, but it doesn't seem to add much. After that I create and join some dummies,
then split and feed the new dataframe into
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html"
>Random Forest</a
>
and
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html"
>KNeighbors</a
>
classifiers. Both actually scored the same with
<a
href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html"
>cross validation</a
>
so I mainly used the forest.
</p>
<p><a href="projects/reddit/model.html">Notebook Here</a></p>
<h3>Conclusion</h3>
<p>Some Predictors from Top 25:</p>
<ul>
<li>Is_Self</li>
<li>Subreddit_Memes</li>
<li>OC</li>
<li>Over_18</li>
<li>Subreddit_Shitposting</li>
<li>Is_Original_Content</li>
<li>Subreddit_Superstonk</li>
</ul>
<p>
Popular words: 'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im',
'dont', and 'love'.
</p>
<p>
People on Reddit (at least in the past few days) like their memes, porn, and talking
about their day. And it's preferred if the content is original and self posted. So yes,
post your memes to memes and shitposting, tag them NSFW, use some words from the list,
and rake in all that sweet karma!
</p>
<p>
But it's not that simple: this is a fairly simple model, with simple data. To go beyond
this I think the comments would have to be analyzed.
<a href="https://en.wikipedia.org/wiki/Lemmatisation">Lemmatisation</a> is what I thought would
be the most influential piece, and I still think that's right in principle. But in this
case it doesn't apply, because there is no real meaning to be had from Reddit post
titles, at least to a computer (or I did something wrong).
</p>
<p>
A human sees a lot more than just the text in the title: there's often an
image attached, most posts reference a recent or current event, and some are an inside
joke of sorts. Some posts have emojis in the title, and depending on their
combination they can take on a meaning completely different from their individual
meanings. The next step from here, I believe, is to analyze the comments section of these
posts, because right now I think that's the easiest way to truly describe the
meaning of a post to a computer. With what was gathered here I'm only able to get about
10% above baseline, and I think that's about all there is to be had; we could probably
tweak out a few more percent, but I don't think there's much left on the table.
</p>
</article>


@ -2,16 +2,9 @@
<p className="align-right date">Jul 01, 2022</p>
<h2 className="title">It's a post about nothing!</h2>
<p>The progress update</p>
<center>
<iframe
src="https://gfycat.com/ifr/DistantUnpleasantHyracotherium"
frameborder="0"
scrolling="no"
allowfullscreen
width="640"
height="535"
></iframe>
</center>
<p className='align-center'>
<img src="https://doordesk.net/pics/plates.gif" />
</p>
<h3>Bots</h3>
<p>
After finding a number of ways not to begin the project formerly known as my capstone,
@ -20,12 +13,12 @@
href="https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows"
>dataset</a
>. The project is about detecting bots, starting with twitter. I've
<a href="projects/bots/docs/debot.pdf">studied</a> a
<a href="projects/bots/docs/botwalk.pdf">few</a>
<a href="projects/bots/docs/smu.pdf">different</a>
<a href="projects/bots/docs/div.pdf">methods</a> of bot detection and particularly like the
<a href="projects/bots/docs/debot.pdf">DeBot</a> and
<a href="projects/bots/docs/botwalk.pdf">BotWalk</a> methods and think I will try to mimic them,
<a href="https://doordesk.net/projects/bots/docs/debot.pdf">studied</a> a
<a href="https://doordesk.net/projects/bots/docs/botwalk.pdf">few</a>
<a href="https://doordesk.net/projects/bots/docs/smu.pdf">different</a>
<a href="https://doordesk.net/projects/bots/docs/div.pdf">methods</a> of bot detection and particularly like the
<a href="https://doordesk.net/projects/bots/docs/debot.pdf">DeBot</a> and
<a href="https://doordesk.net/projects/bots/docs/botwalk.pdf">BotWalk</a> methods and think I will try to mimic them,
in that order.
</p>
<p>


@ -1,22 +1,39 @@
import { Component } from 'react'
import { Component } from 'react'
import './App.css'
import Header from './components/Header.js'
import Blog from './components/Blog.js'
const BLOG_POSTS = [
'blog/000000000-swim.html',
'blog/20220506-change.html'
const FAKE_IT_TIL_YOU_MAKE_IT: string[] = [
'Blog',
'Games',
'Cartman',
'Enigma',
'Notebooks',
]
class App extends Component {
constructor(props) {
interface IAppProps {
}
interface IAppState {
currentPage: string;
}
class App extends Component<IAppProps, IAppState> {
constructor(props: IAppProps) {
super(props)
this.state = {
currentPage: 'Blog'
}
}
render() {
let page;
if (this.state.currentPage === 'Blog') {
page = <Blog />
}
return (
<div className="App">
<Header />
<Blog />
<Header pages={FAKE_IT_TIL_YOU_MAKE_IT} currentPage={this.state.currentPage} />
{page}
</div>
)
}


@ -1,10 +1,21 @@
import { Component } from 'react'
import BlogPost from './BlogPost.js'
const BLOG_URLS: string[] = [
// should render one by one
// make api that has post id, title, date, etc with url to article; then
// distribute to blog posts
const FAKE_IT_TIL_YOU_MAKE_IT: string[] = [
'blog/20220701-progress.html',
'blog/20220614-reddit.html',
'blog/20220602-back.html',
'blog/20220529-housing.html',
'blog/20220520-nvidia.html',
'blog/20220506-change.html',
'blog/000000000-swim.html'
'blog/000000000-swim.html',
]
interface IBlogProps {
}
@ -23,7 +34,7 @@ class Blog extends Component<IBlogProps, IBlogState> {
render() {
return (
<>
{this.renderPosts(BLOG_URLS)}
{this.renderPosts(FAKE_IT_TIL_YOU_MAKE_IT)}
</>
)
}


@ -1,21 +1,32 @@
function Header() {
return (
<div className="header">
<div className="content">
<header>
<h1>DoorDesk</h1>
</header>
<nav>
<p>
<a href="../index.html">[Home]</a> -
<a href="../games">[Games]</a> -
<a href="https://github.com/adoyle0">[GitHub]</a> -
[Cartman]
</p>
</nav>
import { Component } from 'react'
interface IHeaderProps {
pages: string[];
currentPage: string;
}
interface IHeaderState {
}
class Header extends Component<IHeaderProps, IHeaderState> {
constructor(props: IHeaderProps) {
super(props)
}
render() {
return (
<div className="header">
<div className="content">
<header>
<h1>DoorDesk</h1>
</header>
<nav>
<p> {this.props.currentPage} </p>
<p> {this.props.pages} </p>
</nav>
</div>
</div>
</div>
)
)
}
}
export default Header

start_frontend_ghetto Executable file

@ -0,0 +1,5 @@
#!/bin/bash
cd doordesk &&
npm install &&
npm run dev