48 lines
3.9 KiB
Markdown
48 lines
3.9 KiB
Markdown
# Better Data:
|
|
|
|
https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows
|
|
|
|
# Data:
|
|
|
|
### Russian Troll Tweets
|
|
|
|
Great stuff in here for targets: https://github.com/fivethirtyeight/russian-troll-tweets/
|
|
|
|
Dictionary:
|
|
|
|
| Header | Definition |
|
|
| -------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
|
|
| `external_author_id` | An author account ID from Twitter |
|
|
| `author` | The handle sending the tweet |
|
|
| `content` | The text of the tweet |
|
|
| `region` | A region classification, as [determined by Social Studio](https://help.salesforce.com/articleView?id=000199367&type=1) |
|
|
| `language` | The language of the tweet |
|
|
| `publish_date` | The date and time the tweet was sent |
|
|
| `harvested_date` | The date and time the tweet was collected by Social Studio |
|
|
| `following` | The number of accounts the handle was following at the time of the tweet |
|
|
| `followers` | The number of followers the handle had at the time of the tweet |
|
|
| `updates` | The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes |
|
|
| `post_type` | Indicates if the tweet was a retweet or a quote-tweet |
|
|
| `account_type` | Specific account theme, as coded by Linvill and Warren |
|
|
| `retweet` | A binary indicator of whether or not the tweet is a retweet |
|
|
| `account_category` | General account theme, as coded by Linvill and Warren |
|
|
| `new_june_2018` | A binary indicator of whether the handle was newly listed in June 2018 |
|
|
| `alt_external_id` | Reconstruction of author account ID from Twitter, derived from `article_url` variable and the first list provided to Congress |
|
|
| `tweet_id` | Unique id assigned by twitter to each status update, derived from `article_url` |
|
|
| `article_url` | Link to original tweet. Now redirects to "Account Suspended" page |
|
|
| `tco1_step1` | First redirect for the first http(s)://t.co/ link in a tweet, if it exists |
|
|
| `tco2_step1` | First redirect for the second http(s)://t.co/ link in a tweet, if it exists |
|
|
| `tco3_step1` | First redirect for the third http(s)://t.co/ link in a tweet, if it exists |
|
|
|
|
### now what?
|
|
|
|
Precise:
|
|
|
|
- Find intersting date range from target data
|
|
- Archive everything https://github.com/twintproject/twint
|
|
- Check target/archive duplicates
|
|
|
|
lazy:
|
|
|
|
- grab any other collection of tweets
|
|
- check for duplicates
|