
COVID-19-TweetIDs

This repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2); collection commenced on January 28, 2020. We used Twitter’s search API to gather historical Tweets from the preceding 7 days, so the earliest Tweets in our dataset date back to January 21, 2020. We leveraged Twitter’s streaming API to follow specified accounts and to collect, in real time, Tweets that mention specific keywords. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use.

The paper associated with this repository can be found here: Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set

Due to Twitter's changing policies around their free API, we are unsure of how this will impact academic access to the API. We will continue to collect tweets and update this repository for as long as we can.

Data Organization

The Tweet-IDs are organized as follows:

  • Tweet-ID files are stored in folders that indicate the year and month of the collection (YEAR-MONTH).
  • Individual Tweet-ID files contain a collection of Tweet IDs; the file names all follow the same structure, with the prefix “coronavirus-tweet-id-” followed by YEAR-MONTH-DATE-HOUR.
  • Note that Twitter returns Tweets in UTC, so all Tweet-ID folders and file names are in UTC as well.
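For illustration, the naming convention above can be expressed as a small helper. This is a sketch, not part of the repository's tooling; the `.txt` extension and zero-padded hour are assumed from file names that appear elsewhere in this document.

```python
from datetime import datetime, timezone

def tweet_id_file_path(dt: datetime) -> str:
    """Build the expected Tweet-ID file path for a given UTC hour:
    YEAR-MONTH/coronavirus-tweet-id-YEAR-MONTH-DATE-HOUR.txt"""
    dt = dt.astimezone(timezone.utc)  # folders and file names are in UTC
    return dt.strftime("%Y-%m/coronavirus-tweet-id-%Y-%m-%d-%H.txt")

print(tweet_id_file_path(datetime(2020, 3, 1, 5, tzinfo=timezone.utc)))
# → 2020-03/coronavirus-tweet-id-2020-03-01-05.txt
```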

Notes About the Data

Data Collection Method Migrated to AWS (Release v2.0)

We have recently migrated our data collection to AWS. Thanks to this shift and the accompanying upgrade in computing and network capacity, we are now able to collect (and consequently release) a significantly greater number of Tweet IDs. We will continue to leverage AWS for the foreseeable future; please be aware that from release v2.0 onwards, there is a significant increase in the number of Tweet IDs contained in each hourly file. We are incrementing the major version of the releases to reflect this change in collection infrastructure. No other parameters (e.g., keywords tracked, accounts followed) have changed beyond what was previously documented, and there is no gap in data collection from the switch to AWS, as we ensured that there was an overlap in the hours collected during the migration.

Other Notes

  • We will be continuously maintaining this database for the foreseeable future, with new data being uploaded at least once every 2-3 weeks.
  • There may be a few hours of missing data due to technical difficulties. We have done our best to recover as many Tweets as possible from those time frames using Twitter’s search API.
  • We will keep a running summary of basic statistics as we upload data in each new release.
  • The files keywords.txt and accounts.txt contain the keywords and accounts, respectively, that we tracked in our data collection. Each keyword and account is followed by the date we began tracking it and, if it has since been removed from our tracking list, the date we removed it.
  • Consider using tools such as the Hydrator and Twarc to rehydrate the Tweet IDs. Instructions for both are in the next section.
  • Hydrating may take a while, and Tweets may have been deleted since our initial collection. If that is the case, unfortunately you will not be able to get the deleted Tweets by querying Twitter's API. Ed Summers (edsu) hydrated the Tweets in release v1.0, which took approximately 25 hours to complete, and found that approximately 6% of the Tweets had been deleted by the time of hydration; the final gzipped data size was 6.9 GB.
  • We have seen an increased rate of errors from Twitter's API endpoint, resulting in more missing tweets starting mid-November 2022.

How to Hydrate

Hydrating using Hydrator (GUI)

Navigate to the Hydrator GitHub repository and follow the installation instructions in its README. As this repository contains many separate Tweet-ID files, it may be advisable to first merge the files from your time frames of interest into a single larger file before hydrating the Tweets through the GUI.
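One way to do that merge is a short Python helper like the following. This is a minimal sketch, not part of the repository's tooling; the glob pattern in the usage comment assumes the folder and file naming convention described under Data Organization.

```python
import glob

def merge_id_files(pattern: str, out_path: str) -> int:
    """Concatenate every Tweet-ID file matching `pattern` into one file
    at `out_path`, one ID per line; returns the number of IDs written."""
    count = 0
    with open(out_path, "w") as merged:
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line:  # skip blank lines
                        merged.write(line + "\n")
                        count += 1
    return count

# e.g., merge all hourly files for March 2020 before loading into Hydrator:
# merge_id_files("2020-03/coronavirus-tweet-id-*.txt", "ids-2020-03.txt")
```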

Hydrating using Twarc (CLI)

Many thanks to Ed Summers (edsu) for writing this script that uses Twarc to hydrate all Tweet-IDs stored in their corresponding folders.

First, install Twarc and tqdm:

pip3 install twarc
pip3 install tqdm

Configure Twarc with your Twitter API tokens (note that you must first apply for a Twitter developer account in order to obtain the needed tokens). You can also configure the API tokens in the script if you are unable to configure them through the CLI:

twarc configure

Run the script. The hydrated Tweets will be stored in the same folder as the corresponding Tweet-ID file, saved as a compressed jsonl file:

python3 hydrate.py
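For downstream processing, each hydrated file contains one JSON-encoded Tweet per line; a minimal sketch for iterating over it (assuming gzip compression, as suggested by the "gzipped data size" note above; the file name in the usage comment is illustrative):

```python
import gzip
import json

def iter_tweets(jsonl_gz_path):
    """Yield one decoded Tweet object per line of a gzipped JSONL file."""
    with gzip.open(jsonl_gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# e.g., count English Tweets in one hydrated hour:
# n = sum(1 for t in iter_tweets("coronavirus-tweet-id-2020-03-01-00.jsonl.gz")
#         if t.get("lang") == "en")
```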

Data Usage Agreement / How to Cite

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Chen E, Lerman K, Ferrara E. Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set. JMIR Public Health and Surveillance. 2020;6(2):e19273. DOI: 10.2196/19273. PMID: 32427106.

BibTeX:

@article{chen2020tracking,
  title={Tracking social media discourse about the covid-19 pandemic: Development of a public coronavirus twitter data set},
  author={Chen, Emily and Lerman, Kristina and Ferrara, Emilio},
  journal={JMIR Public Health and Surveillance},
  volume={6},
  number={2},
  pages={e19273},
  year={2020},
  publisher={JMIR Publications Inc., Toronto, Canada}
}

Statistics Summary (v2.106)

Number of Tweets : 2,775,946,436

Language breakdown of top 10 most prevalent languages :

Language     ISO   No. tweets       % total Tweets
English      en    1,785,043,839    64.30%
Spanish      es    307,973,203      11.09%
Portuguese   pt    107,505,532      3.87%
French       fr    102,743,271      3.70%
Undefined    und   75,618,129       2.72%
Indonesian   in    74,180,508       2.67%
German       de    64,650,071       2.33%
Japanese     ja    41,290,208       1.49%
Thai         th    38,024,206       1.37%
Italian      it    31,850,251       1.15%

Known Gaps

Date Time
2/1/2020 4:00 - 9:00 UTC
2/8/2020 6:00 - 7:00 UTC
2/22/2020 21:00 - 24:00 UTC
2/23/2020 0:00 - 24:00 UTC
2/24/2020 0:00 - 4:00 UTC
2/25/2020 0:00 - 3:00 UTC
3/2/2020 Intermittent Internet Connectivity Issues
5/14/2020 7:00 - 8:00 UTC

Inquiries

Please read through the README and the closed issues first to see if your question has already been addressed.

If you have technical questions about the data collection, please contact Emily Chen at echen920[at]usc[dot]edu.

If you have any further questions about this dataset please contact Dr. Emilio Ferrara at emiliofe[at]usc[dot]edu.

Related Papers

covid-19-tweetids's People

Contributors

cogscihubvienna, cristiancantoro, echen102, edsu, emilioferrara, mirkoreul, samsamhuns


covid-19-tweetids's Issues

Updates dataset

Thank you very much for this repo 😊. Will you update it with the latest IDs for March at some point?

twarc issue

This could just be an issue on my side, but

twarc configure

did not work on my system. Instead, I just added the credentials to the Python script directly.

twarc client error

In the most recent update, there appears to be a tweet that belongs to an account that has since been locked (although I'm not sure about this). The hydrate script fails on 2021-10\coronavirus-tweet-id-2021-10-15-00.txt and throws the following exception:

WARNING:twarc:401 Authentication required for https://api.twitter.com/1.1/statuses/lookup.json
 27%|████████████████████▎                                                      | 22665/83672 [03:21<09:01, 112.62it/s]
Traceback (most recent call last):
  File "F:\COVID-19-TweetIDs-pulled\hydrate.py", line 70, in <module>
    main()
  File "F:\COVID-19-TweetIDs-pulled\hydrate.py", line 30, in main
    hydrate(path)
  File "F:\COVID-19-TweetIDs-pulled\hydrate.py", line 61, in hydrate
    for tweet in twarc.hydrate(id_file.open()):
  File "C:\Users\Ryan\AppData\Local\Programs\Python\Python39\lib\site-packages\twarc\client.py", line 606, in hydrate
    resp = self.post(
  File "C:\Users\Ryan\AppData\Local\Programs\Python\Python39\lib\site-packages\twarc\decorators.py", line 31, in new_f
    resp.raise_for_status()
  File "C:\Users\Ryan\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://api.twitter.com/1.1/statuses/lookup.json
This is a protected or locked account, or the credentials provided are no longer valid.

I want to know the number of Korean tweets.

First of all, thank you for organizing a good dataset. I have a question: only the language breakdown of the top 10 most prevalent languages is given in the README, and the share of Korean is not stated. There is too much data for me to count the Korean tweets myself. Could you tell me the share of Korean?

twarc configure and hydrate issue

I ran twarc configure and entered the consumer and API keys that it asks for. Yet when executing python3 hydrate.py, the following error is thrown:
[error screenshot omitted]

geolocations missing

First of all, thank you so much for sharing your work.

I ran the script and retrieved the full content of some Tweets. However, I found that all the geolocations are missing, even though some users have "geo_enable = true". My question is whether there is a way to get the geolocation information.

Thanks again!

Multi-language result

Hi all, thanks for your effort. In your paper, you mention that you use English-related keywords for the search, but in the results you show that there are Japanese tweets. This seems strange; can you explain the reason, and also show some sample IDs from your data? I am quite confused about the multilingual results. Many thanks.

new data

Do you have a general time frame or rate for when new Tweet IDs will be released?

Unexpected higher volume of ids at the end of February

Hello!

First of all, many thanks for releasing this great dataset for public use :).

I'm about to start hydrating the data. However, by doing a quick inspection of the sizes of the Tweet-ID txt files, I realized that the files from the 29th of February and the 1st of March are abnormally big (roughly 4x the average size of an ID file). Is this expected? I want to run some analyses of the COVID conversation on Twitter, so I want to make sure to take all possible biases into consideration.

On a somewhat unrelated point, I was wondering whether using hydrate.py is as fast as using twarc from the terminal. If it's not, do you have any ideas for an easy way to scan through all the directories and .txt files directly from the terminal? Maybe a bash script would do the trick...

Thanks again!

Failed to apply for a developer account

I need some data from last year for my research, but my application for a developer account failed. Is there anyone who can help me, or is there any other way to get these datasets?

Full version of the data

Hi,

Is it possible to release the texts instead of the Tweet IDs? I think I would otherwise still need API access to download/extract the original tweets from the web.

Thanks.

derived analysis

I am teaching some high school students data science, and I think this would make a good sample project. Is there a list of reports or analyses coming out of these data?

I saw the initial post on arXiv, and I am wondering if people have done more since March. Thanks!

Feb 23 Data

Hi,

Thanks so much for making the data available. I am writing a paper that uses your amazing dataset. However, while doing exploratory data analysis, I found that the data for Feb 23, 2020 are missing. Is there any way to recover that data?

Appreciatively,

Nga Than

Data Absence

Hi, thanks for sharing this brilliant dataset!

We noticed some issues with the dataset while processing the data. The Tweet-ID files appear to be missing for several time periods in February: hours 05-08 on Feb. 1, hour 06 on Feb. 8, hours 21-23 on Feb. 22, the whole day on Feb. 23, hours 00-03 on Feb. 24, and hours 00-02 on Feb. 25.

Is there any way you can fix this issue and fill in the missing data (especially the data for Feb. 23)?

Thanks a lot!

Sensitive data in tweets

I just saw a paper rejected from a conference on ethical grounds that used this dataset.

I appreciate that you ask people to abide by Twitter's Terms of Service as a condition of using this dataset. However, people don't seem to be reading them. In particular, Twitter's terms forbid the encoding of sensitive data, including data about people's health. This would include someone's COVID-19 status (even if negative) and their exposure. Twitter also permits children as young as 13 to use the platform, and in many countries they are not legally capable of giving consent for some use cases, including health-related ones.

So, I recommend that you add a more explicit warning that this is very sensitive data.

Does the provided hydrate.py script still work?

Hello,
as per the title, I wonder whether the provided script still works now that Twitter has become "X" (and perhaps its API has changed as well?).

I configured twarc correctly with my consumer_key, consumer_secret, access_token and access_token_secret, but I get the following error:

  File "/Users/fulgor/Desktop/tweets-codiv/hydrate.py", line 82, in <module>
    main()
  File "/Users/fulgor/Desktop/tweets-codiv/hydrate.py", line 43, in main
    hydrate(path)
  File "/Users/fulgor/Desktop/tweets-codiv/hydrate.py", line 75, in hydrate
    for tweet in twarc.hydrate(id_file.open()):
  File "/Users/fulgor/Desktop/tweets-codiv/env/lib/python3.12/site-packages/twarc/client.py", line 641, in hydrate
    resp = self.post(
           ^^^^^^^^^^
  File "/Users/fulgor/Desktop/tweets-codiv/env/lib/python3.12/site-packages/twarc/decorators.py", line 88, in new_f
    resp.raise_for_status()
  File "/Users/fulgor/Desktop/tweets-codiv/env/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://api.twitter.com/1.1/statuses/lookup.json

Any ideas?

Thank you!

Program to get the full dataset

Hello,

Can you please suggest a program or open-source tool that allows crawling a large number of tweets given their Tweet IDs?

I have written a Python script, but it's not smart enough to handle millions of Tweet IDs.

How to hydrate only 1% of all tweet ids

Hi there! Because of the rate limiting, I was wondering how I can randomly hydrate 1% of all the Tweet IDs in the text files. Currently I use a random number generator, and my JSON files contain 1% of the hydrated Tweets; however, in hydrate.py, twarc.hydrate() still passes in every ID because it accepts the text file as a whole, so rate limiting still occurs at the normal rate. Is there a way to modify the code, or all the text files, so that I only hydrate 1% of all Tweets?
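One possible approach (a sketch, not part of the repository's tooling) is to sample the ID lines in Python before handing them to twarc, since twarc's hydrate() accepts an iterable of IDs rather than only a file:

```python
import random

def sample_ids(id_file_path: str, fraction: float = 0.01, seed: int = 42):
    """Return a reproducible random subset of the Tweet IDs in a file."""
    with open(id_file_path) as f:
        ids = [line.strip() for line in f if line.strip()]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, int(len(ids) * fraction))
    return rng.sample(ids, k)

# The sampled list can then be passed to twarc, e.g.:
# for tweet in twarc.hydrate(sample_ids("coronavirus-tweet-id-2020-03-01-00.txt")):
#     ...
```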

Frequency of Updates

Hi there,

How often will the data be updated, and what are the plans for the future of the COVID-19 Tweet collection?

Sampling

Hi there! Happy to see that you are releasing this data.

Since I'm also collecting/sharing this data (see here) I noticed a while back that unfortunately I'm running into "soft" limits (Twitter caps the volume at 1% of the total stream), so I'm unfortunately missing quite a bit of data (I collect roughly 4M tweets/day currently, probably around a fifth of the total, although it could be higher).

The problem is that this subsampling might impact certain types of analyses (say, counts over time, networks, etc.); see here, e.g.

I'm considering merging datasets from different sources in the final analysis (however, granted, it's already a lot of data to process), if I do this I will probably merge various datasets in order to get closer to the full dataset. If I do this, I will of course cite your work.

If you think you face similar sampling limits, this is maybe something that could be mentioned in the README as well, especially for others who are not aware of it.

How to download just one specific month from this repo?

Hello,

I was trying to download the repo as a zip file. It seems something breaks right at the 6.91 GB mark. Has anybody run into this same issue?

Location: https://codeload.github.com/echen102/COVID-19-TweetIDs/zip/refs/heads/master [following]
--2021-07-11 10:07:32--  https://codeload.github.com/echen102/COVID-19-TweetIDs/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.113.9
Connecting to codeload.github.com (codeload.github.com)|140.82.113.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip                                             [                                   <=>                                                                           ]   6.91G  5.99MB/s    in 20m 0s

2021-07-11 10:27:33 (5.90 MB/s) - Read error at byte 7424091751 (Success).Retrying.

--2021-07-11 10:27:34--  (try: 2)  https://codeload.github.com/echen102/COVID-19-TweetIDs/zip/refs/heads/master
Connecting to codeload.github.com (codeload.github.com)|140.82.113.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip                                             [                              <=>                                                                                ]   6.91G  6.19MB/s    in 20m 0s

2021-07-11 10:47:34 (5.89 MB/s) - Read error at byte 7424091751 (Success).Retrying.

Open Source Helps!

Thanks for your work to help people in need! Your site has been added! I currently maintain the Open-Source-COVID-19 page, which collects all open source projects related to COVID-19, including maps, data, news, APIs, analyses, and medical and supply information. Please share it with anyone who might need the information in the list or who might contribute to some of those projects. You are also welcome to recommend more projects.

http://open-source-covid-19.weileizeng.com/

Cheers!
