docnow / twarc-csv
A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.
License: MIT License
Hi! Thanks Igor for the quick reply to my issue #16. I was able to find a workaround to get what I wanted (keep the original tweets that were retweeted but exclude truncated retweets) and preserve German characters and emojis. However, I get more tweets than the account's tweet count. Do you have any idea what I may be missing here? Below is my code (probably very long and inefficient, so apologies!):
# Download full archive
!twarc2 search "(from:volkspartei)" --archive > OVP.jsonl

# Convert JSONL file to CSV
!twarc2 csv OVP.jsonl OVP.csv

# Convert to data frame to delete blank lines
import pandas as pd
OVP = pd.read_csv("OVP.csv")

# Delete rows with blank values in "text"
OVP = OVP.dropna(axis=0, subset=['text'])

# Filter "text" column starting with "RT" to exclude truncated retweets and keep the original RT tweet
OVP = OVP[~OVP.text.str.startswith("RT")]

# Keep only selected columns
OVP = OVP[['author.created_at','author.name','author.username','created_at','text','type',
           'public_metrics.like_count','public_metrics.retweet_count','public_metrics.quote_count',
           'public_metrics.reply_count','id','conversation_id','lang','author.public_metrics.followers_count',
           'author.public_metrics.following_count','author.public_metrics.listed_count',
           'author.public_metrics.tweet_count']]
OVP = OVP.rename(columns={'author.created_at': 'account_created_at', 'author.name': 'account_name',
                          'author.username': 'username', 'public_metrics.like_count': 'like_count',
                          'public_metrics.retweet_count': 'retweet_count', 'public_metrics.quote_count': 'quote_count',
                          'public_metrics.reply_count': 'reply_count',
                          'author.public_metrics.followers_count': 'account_followers',
                          'author.public_metrics.following_count': 'account_following',
                          'author.public_metrics.listed_count': 'account_listed',
                          'author.public_metrics.tweet_count': 'account_tweet_count'})

# Reset index number after dropping NA rows
OVP = OVP.reset_index(drop=True)

# Save back to CSV
OVP.to_csv("OVP_1.csv")
As you can see, the final file has a total of 12186 tweets, whereas the account's tweet count is only 11917. Do you have any idea why I end up with more tweets? If all truncated retweets begin with "RT", then my solution above should work, right?
Thanks a lot!
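One thing worth checking is whether the surplus comes from duplicate rows, e.g. referenced tweets inlined alongside the originals. A quick pandas check on the exported file, sketched here with a tiny stand-in CSV:

```python
import io
import pandas as pd

# Small stand-in for the exported file; in practice read "OVP_1.csv".
csv_data = io.StringIO("id,text\n1,hello\n2,world\n1,hello\n")
df = pd.read_csv(csv_data)

# Rows whose id appears more than once.
dupes = df[df.duplicated(subset="id", keep=False)]
print(len(dupes))  # 2

# Keep one row per tweet id.
deduped = df.drop_duplicates(subset="id", keep="first")
print(len(deduped))  # 2
```

If the deduplicated count is still above the account's tweet counter, note that the counter also reflects deletions, so an exact match isn't guaranteed.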
There are still some encoding issues I'm investigating that break CSVs. If you have encountered any, post a reply here.
Right now it reads the entire result set into memory, which means it can't handle anything larger than a moderately sized dataset.
Pending DocNow/twarc#572
See DocNow/twarc#657
A recurring theme is having to work with "medium data": multi-gigabyte datasets that are challenging to work with on a single machine and definitely too big for standard approaches, but may not warrant a distributed system.
eg: https://twittercommunity.com/t/saving-tweet-to-csv/153357/41?u=igorbrigadir and other cases.
Need to add more documentation / examples of working with these dataset sizes effectively.
When processing retweets, https://github.com/DocNow/twarc-csv/blob/main/dataframe_converter.py#L283-L293
tweet["entities"] = retweeted_tweet.pop("entities", None)
should really be
tweet["entities"] = retweeted_tweet.pop("entities", tweet.pop("entities", None))
to ensure the value doesn't get replaced by None when there was something in tweet but nothing in retweeted_tweet.
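The difference between the two forms can be illustrated with plain dicts (a minimal sketch, not the converter itself):

```python
# Simulate a tweet that has entities while the retweeted tweet does not.
tweet = {"entities": {"hashtags": ["a"]}}
retweeted_tweet = {}

# Current behaviour: the existing value is clobbered with None.
t1 = dict(tweet)
t1["entities"] = dict(retweeted_tweet).pop("entities", None)
print(t1["entities"])  # None

# Proposed behaviour: fall back to the tweet's own entities.
t2 = dict(tweet)
rt = dict(retweeted_tweet)
t2["entities"] = rt.pop("entities", t2.pop("entities", None))
print(t2["entities"])  # {'hashtags': ['a']}
```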
Currently it leaves the referenced_tweets list alone, so the column in the CSV ends up like this:
[{"type": "replied_to", "id": "1380226330034372610"}]
[{"type": "quoted", "id": "1380226330034372610"}]
[{"type": "retweeted", "id": "1261081519566675969"}]
but we could expand this into separate columns:
referenced_tweets.replied_to
referenced_tweets.quoted
referenced_tweets.retweeted
and by extension, the type column should be a list like ["reply"] or ["retweet","reply","quote"] if it's a quote tweet that's a reply to someone and was then retweeted. type should also be renamed __inferred_tweet_type or something to indicate where this field is coming from.
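A rough sketch of how that expansion could work on one parsed referenced_tweets cell (the column names are the proposed ones, not the current output; IDs taken from the examples above):

```python
import json

# One CSV cell as produced today.
cell = '[{"type": "replied_to", "id": "1380226330034372610"}, {"type": "retweeted", "id": "1261081519566675969"}]'
refs = json.loads(cell)

# Proposed per-type columns: referenced_tweets.replied_to / .quoted / .retweeted
columns = {f"referenced_tweets.{r['type']}": r["id"] for r in refs}

# Proposed inferred type list, mapping API reference types to tweet types.
type_names = {"replied_to": "reply", "retweeted": "retweet", "quoted": "quote"}
inferred_types = [type_names[r["type"]] for r in refs]

print(columns)
print(inferred_types)  # ['reply', 'retweet']
```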
Check objects and column order in the CSV vs. in the JSON when using user objects and tweets
eg:
twarc2 followers --limit 10 user | twarc2 csv --input-users-columns --output-columns "id,username,name"
gives:
1347718171470557185,1347718171470557185,2021-01-09T01:34:04.000Z,2021-01-09T01:34:04.000Z,AlexPineapple_,Alex 🍍
16832937,16832937,2008-10-17T23:36:09.000Z,2008-10-17T23:36:09.000Z,ColinNC,Colin
1321944084810997763,1321944084810997763,2020-10-29T22:37:16.000Z,2020-10-29T22:37:16.000Z,Ger11645317,Ger
17673550,17673550,2008-11-27T08:59:57.000Z,2008-11-27T08:59:57.000Z,mikemcc28,Mike
I have a large set of tweets, and I would like to wrangle and write them to file as I go. DataFrameConverter would be ideal for this, but if I try to pass a tweet to it, I get an error message: TypeError: process() missing 1 required positional argument: 'objects'. I followed these instructions:
from twarc_csv import DataFrameConverter
json_objects = [...]
df = DataFrameConverter.process(json_objects)
passing the converter a tweet or a page scraped as described in the examples. What am I doing wrong, or can I not use this at all the way I would like?
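That TypeError usually means process() was called on the class itself rather than an instance, so the json_objects argument was consumed by the object slot. A minimal illustration with a stand-in class (presumably the fix is DataFrameConverter().process(json_objects), with parentheses):

```python
class Converter:
    # Stand-in for DataFrameConverter: process() is an instance method.
    def process(self, objects):
        return len(objects)

json_objects = [{"id": "1"}, {"id": "2"}]

# Calling on the class: json_objects binds to `self`, nothing binds to `objects`.
err = None
try:
    Converter.process(json_objects)
except TypeError as e:
    err = e
print(err)  # process() missing 1 required positional argument: 'objects'

# Calling on an instance works as expected.
print(Converter().process(json_objects))  # 2
```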
I've been downloading thousands of tweets with search and flatten:
twarc2 search 'xxx' --start-time 2018-03-20 --end-time 2018-04-30 --limit 2 --archive --flatten > xxx.jsonl
When I convert to CSV with twarc2 csv:
twarc2 csv xxx.jsonl xxx.csv
things seem to go well, but when I look into the author.username column (csvcut -c author.username xxx.csv) it is full of other strings that are not usernames, such as date-time strings (2016-06-08T10:08:00.000Z), IDs (1604281548), or numbers (1186.0).
If I look at one of the files closely, I see that the problems start at line 6095, where the reply_settings variable is filled with [{"type": "retweeted", "id": "976557335748534273"}], and from there all the variables are mixed up. When I run the same experiment with a few tweets from a short search, this problem does not occur.
Thanks for the upgrade. However, a 2 GB JSONL file I am trying to convert to CSV has been running for over 4 hours without completing, whereas smaller files below 200 MB convert to CSV successfully within 5 minutes. I have upgraded twarc-csv to 2.10; what else could be the problem, and what can I do?
twarc2 search --limit 50 "beauty" tweets.jsonl
0%| | Processed a moment/6 days [00:00<?, 0 tweets total ]
⚡ Client Forbidden
Hello,
would you mind letting me know whether the tweets collected with twarc2 search --limit are a random sample or selected by a specific algorithm?
Hi, I'm trying to run twarc-csv on a JSONL file obtained through the Academic API, using a new MacBook Pro with an M1 chip. I run this command:
twarc2 csv result.jsonl result.csv
It gets stuck at 37% every time with the output below. Is this a known error? Am I doing something wrong? Thank you in advance.
37%|█████▉ | Processed 286M/766M of input file [00:33<00:37, 13.5MB/s]Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/bin/twarc2", line 8, in <module>
sys.exit(twarc2())
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/twarc_csv.py", line 148, in csv
writer.process()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/csv_writer.py", line 81, in process
self._write_output(self.converter.process(batch), first_batch)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/csv_writer.py", line 65, in _write_output
_df.to_csv(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/generic.py", line 3466, in to_csv
return DataFrameRenderer(formatter).to_csv(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1105, in to_csv
csv_formatter.save()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 257, in save
self._save()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 262, in _save
self._save_body()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 300, in _save_body
self._save_chunk(start_i, end_i)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 311, in _save_chunk
libwriters.write_csv_rows(
File "pandas/_libs/writers.pyx", line 72, in pandas._libs.writers.write_csv_rows
_csv.Error: need to escape, but no escapechar set
37%|█████▉ | Processed 286M/766M of input file [00:33<00:56, 8.83MB/s]
I am facing some problems with the twarc2 csv command. I am using it to convert a (large) json file as follows:
twarc2 csv treatment.json treatment.csv
When the progress bar reaches about 89% it returns the error:
"withheld.scope"
to fix, add these with --input-columns. Skipping entire batch of 670325 tweets!
Now I assumed that this means I need to add that option to the command line in order to recover those tweets. I am doing it like this :
twarc2 csv --input-columns "withheld.scope" treatment.json treatment.csv
but it returns an error again. I have updated twarc to its latest version, and the command has always worked properly for me, just not with this file.
Thanks in advance!
Instead of CSVs, append the parsed dataframes to parquet https://stackoverflow.com/a/47839247/11090908
Hello!
Recently I've collected some tweets using twarc2 and, after the end of the retrieval, I converted the output into a '.csv' using twarc-csv.
However, some columns that I need to use store IDs as floats, as shown in the figure. When I try to convert them to integers, the result is sometimes a tweet ID that doesn't correspond to the original post (probably rounding imprecision). Is there a specific/correct way to convert these IDs to integers, or was the information lost during conversion?
Thanks! :)
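Once an ID has been parsed as a 64-bit float, the low digits are gone and converting back to int can't recover them; the usual fix is to stop pandas from inferring numeric types by reading the ID columns as strings. A sketch with a stand-in CSV:

```python
import io
import pandas as pd

# Tweet IDs are larger than float64 can represent exactly.
csv_data = "id,author_id\n1249702384659554308,2344192110\n"

# Default parsing may coerce ID columns (especially ones containing NaNs) to float,
# which silently rounds the value.

# Reading the columns as strings preserves every digit.
exact = pd.read_csv(io.StringIO(csv_data), dtype={"id": str, "author_id": str})
print(exact.loc[0, "id"])  # 1249702384659554308
```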
I noticed that unicode is JSON escaped in the CSV file. I think it should be converted to UTF-8 since it's no longer JSON?
twarc search 👋 --flatten > wave.jsonl
twarc csv wave.jsonl
Currently there are user IDs only, but it would help to also have user names. Not all user columns should be added, as this would make an unreasonably wide dataframe (would need all author columns for each quoted, retweeted, etc user) but just adding names is enough to make things easier.
Additionally, maybe also document extracting and converting user objects.
Hello,
please: I have a very large .jsonl file (870 MB), and I am using Python 3.8 on Ubuntu, but twarc2 csv fails to convert it to CSV:
twarc2 csv results.jsonl tweets_with_attacks_journalists.csv
💔 ERROR: 4 Unexpected items in data!
Are you sure you specified the correct --input-data-type?
If the object type is correct, add extra columns with:
--extra-input-columns "edit_controls.is_edit_eligible,edit_controls.editable_until,edit_history_tweet_ids,edit_controls.edits_remaining"
Skipping entire batch of 9944 tweets!
Is there any other way to convert to CSV?
It appears that referenced tweets don't have their URLs pulled over into their rows in the CSV? This came up in DocNow/twarc#538
$ twarc2 tweet 1438733968287977476 | twarc2 csv - > tweet.csv
$ xsv select id,entities.urls tweet.csv
id,entities.urls
1438486160867745801,
1438733968287977476,"[{""start"": 162, ""end"": 185, ""url"": ""https://t.co/wRfX2O9S7W"", ""expanded_url"": ""https://twitter.com/dereckapurnell/status/1438486160867745801"", ""display_url"": ""twitter.com/dereckapurnell\u2026""}]"
Installing in newer pip / python versions gives a warning:
DEPRECATION: twarc-csv is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
Missing --hide-progress command line switch
Reproducible code:
!twarc2 search "(vaccine OR jab OR vaxine) (-is:retweet) (lang:en)" --archive --start-time 2021-03-01T00:00:00 --end-time 2021-03-03T00:00:00 --limit 300 raw_output.json
!twarc2 flatten 'raw_output.json' 'flattened_output.json'
!twarc2 csv --output-columns "id,created_at,text" 'flattened_output.json' 'outputshort.csv'
I noticed that when I collected a conversation, and then exported to CSV that I ended up with duplicate rows for tweets, presumably because they are included in other tweets. I'm not sure what the solution is here, but it definitely seems problematic from a usability perspective.
twarc2 search conversation_id:1385008025140871168 > results.jsonl
twarc2 csv results.jsonl > results.csv
then:
>>> import pandas
>>> df = pandas.read_csv('results.csv')
>>> df.value_counts('id')
id
1385008025140871168 55
1385012190860611586 3
1385008825418321920 3
1385009576731422726 2
1385017514308956160 2
..
1385015116987453441 1
1385015148184809474 1
1385015948273496068 1
1385018187419377665 1
1385192675549327366 1
Length: 68, dtype: int64
The current output favors preserving as much information as possible from the original json, but there is some duplication, and a bunch of columns can be removed as they're rarely super useful.
The new --optimized mode will generate CSVs that drop a bunch of columns to save space:
edit_controls.edits_remaining
edit_controls.editable_until
entities.cashtags
entities.hashtags
entities.mentions
withheld.scope
withheld.copyright
author.id
author.entities.description.cashtags
author.entities.description.hashtags
author.entities.description.mentions
author.url
author.withheld.scope
author.withheld.copyright
geo.coordinates.coordinates
geo.coordinates.type
geo.country
geo.full_name
geo.geo.type
matching_rules
__twarc.retrieved_at
__twarc.url
__twarc.version
(exact list to be revised later)
These are the columns that are most commonly absent or duplicated: the missing data can be inferred from the remaining columns, or, for cashtags, hashtags, and mentions, re-extracted with twitter-text, for example.
I may change how the lists of hashtags and mentions are output. Currently, the JSON list is preserved as is. I may change this to output a list like ["@one","@two","@three"] as opposed to preserving the start/end indexes, as these are generally not used. Dealing with URLs the same way is possible, but I'm not sure how much processing to do on the URLs (show all t.co and unwound URLs? or show all? etc.).
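The proposed simplification could look something like this on one entities cell (a sketch of the idea, not the converter's code; tags taken from the example tweet later in this page):

```python
import json

# A CSV cell as currently written: the raw entities list with offsets.
cell = json.dumps([
    {"start": 24, "end": 39, "tag": "harrypotterdiy"},
    {"start": 40, "end": 72, "tag": "harrypotterandphilosophersstone"},
])

# Proposed output: just the tags, dropping the start/end indexes.
tags = ["#" + h["tag"] for h in json.loads(cell)]
print(tags)  # ['#harrypotterdiy', '#harrypotterandphilosophersstone']
```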
Hi there,
Is there an easy way to get the quoted tweet text field in the CSV output? The original tweet text is available but not the quoted text, which is very often where the hashtags of interest (used in the original search query) are located.
Thanks!
This bug cropped up where passing a generator to json_normalize would leave out the very first item pandas-dev/pandas#35923 Fixed in 0.3.1 bc418d9
twarc2 search --archive --start-time 2020-01-01 --limit 10 "reliance" tweets_reliance3.jsonl
twarc2 csv tweets_reliance3.jsonl tweets_reliance3.csv
I have used the above commands, and I get an error while converting the JSONL to CSV:
ERROR: Unexpected Data: "author.withheld.scope" to fix, add these with --extra-input-columns. Skipping entire batch of 666 tweets!
Even after using the command twarc2 csv --extra-input-columns "author.witheld.scope" tweets_reliance3.jsonl tweets_reliance5.csv, I get the same error.
Here is the JSONL file.
For example, for a tweet like 1249702384659554308, the field
"geo": {
  "coordinates": {
    "type": "Point",
    "coordinates": [
      42.77810097,
      88.01785747
    ]
  }
}
doesn't get saved in the CSV.
Full tweet json:
{
"data": [
{
"lang": "pl",
"entities": {
"urls": [
{
"start": 212,
"end": 235,
"url": "https://t.co/MZ9QlqGyfA",
"expanded_url": "https://www.instagram.com/p/B-7ItQwBJ93/?igshid=187dqx4b5lu2y",
"display_url": "instagram.com/p/B-7ItQwBJ93/…",
"status": 200,
"unwound_url": "https://www.instagram.com/p/B-7ItQwBJ93/?igshid=187dqx4b5lu2y"
}
],
"hashtags": [
{
"start": 24,
"end": 39,
"tag": "harrypotterdiy"
},
{
"start": 40,
"end": 72,
"tag": "harrypotterandphilosophersstone"
},
{
"start": 73,
"end": 84,
"tag": "potterhead"
},
{
"start": 85,
"end": 97,
"tag": "harrypotter"
},
{
"start": 98,
"end": 115,
"tag": "philosopherstone"
},
{
"start": 116,
"end": 131,
"tag": "czasnaczytanie"
},
{
"start": 132,
"end": 144,
"tag": "zostańwdomu"
},
{
"start": 145,
"end": 157,
"tag": "zostanwdomu"
},
{
"start": 158,
"end": 164,
"tag": "magic"
},
{
"start": 165,
"end": 177,
"tag": "harrypotter"
},
{
"start": 178,
"end": 187,
"tag": "funkopop"
},
{
"start": 188,
"end": 197,
"tag": "bookpile"
},
{
"start": 198,
"end": 210,
"tag": "bookstagram"
}
]
},
"created_at": "2020-04-13T14:14:01.000Z",
"public_metrics": {
"retweet_count": 1,
"reply_count": 0,
"like_count": 0,
"quote_count": 0
},
"reply_settings": "everyone",
"text": "Za co lubicie Harry'ego?#harrypotterdiy #harrypotterandphilosophersstone #potterhead #harrypotter #philosopherstone #czasnaczytanie #zostańwdomu #zostanwdomu #magic #harrypotter #funkopop #bookpile #bookstagram… https://t.co/MZ9QlqGyfA",
"possibly_sensitive": false,
"geo": {
"coordinates": {
"type": "Point",
"coordinates": [
42.77810097,
88.01785747
]
}
},
"id": "1249702384659554308",
"context_annotations": [
{
"domain": {
"id": "66",
"name": "Interests and Hobbies Category",
"description": "A grouping of interests and hobbies entities, like Novelty Food or Destinations"
},
"entity": {
"id": "1206704182717104128",
"name": "Model figures"
}
},
{
"domain": {
"id": "130",
"name": "Multimedia Franchise",
"description": "Franchises which span multiple forms of media like 'Harry Potter'"
},
"entity": {
"id": "933033311844286464",
"name": "Harry Potter",
"description": "This entity includes all conversation about the franchise, as well as any individual installments in the series, if applicable.\t\t\t"
}
}
],
"author_id": "2344192110",
"conversation_id": "1249702384659554308",
"source": "Instagram"
}
],
"includes": {
"users": [
{
"name": "Kama",
"username": "kamanonickname",
"protected": false,
"verified": false,
"public_metrics": {
"followers_count": 64,
"following_count": 152,
"tweet_count": 8743,
"listed_count": 0
},
"created_at": "2014-02-14T22:26:08.000Z",
"description": "There should be bio but Mróz is busy writing his 666th novel",
"id": "2344192110",
"url": "",
"profile_image_url": "https://pbs.twimg.com/profile_images/1422798418645225472/cRbGyIvp_normal.jpg"
}
]
},
"__twarc": {
"url": "https://api.twitter.com/2/tweets?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&media.fields=alt_text%2Cduration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&ids=1249702384659554308",
"version": "2.8.1",
"retrieved_at": "2021-12-14T16:17:36+00:00"
}
}
The CSV is missing the point coordinates.
Check that manually specified columns are valid
When using twarc2 csv tweets.jsonl tweets.csv # convert to CSV
It loads the JSONL, but the script sits at 0%:
0%| | 0.00/2.72M [00:00<?, ?B/s]
Instead of a "no such file exists" error for 0-byte files, it should warn that twarc may not have found any results in a previous command. Seeing the file in a folder while twarc2 csv says it doesn't exist confuses people.
Hi!
I'm new to Python, so forgive my ignorance. I've been downloading tweets with twarc2, including --no-inline-referenced-tweets so I don't get duplicate RT entries when exporting to CSV. I just noticed that the RT lines do not include the full text of the retweets (whereas the original tweet does, if I include it).
My problem is that if I don't include --no-inline-referenced-tweets, then the following code I found here to deal with foreign characters does not work: df['text'] = df['text'].apply(json.loads). Is there any way to get both the RT and original tweet lines (I can then delete the duplicates) and keep the character conversion using .apply(json.loads)?
Below is my code:
!twarc2 search "(from:volkspartei)" --archive > OVP.jsonl
!twarc2 csv --json-encode-text --no-inline-referenced-tweets OVP.jsonl OVP.csv
import pandas as pd
import json
OVP = pd.read_csv("OVP.csv")
OVP['text'] = OVP['text'].apply(json.loads)
OVP[['text','created_at','author.created_at']]
OVP.to_csv("OVP_1.csv")
Thanks!
Alex
Counts are supported by the twarc2 counts command, which can output CSV already, but to make this plugin compatible with all types of API responses, it should also be able to take the JSON-formatted counts and output the CSV format.
For me, pandas 1.4.1 seems to work in Python 3.7 and 3.8, but not on Mac? Python 3.10 also fails.
https://twittercommunity.com/t/trouble-working-with-twarc-csv/167401/3?u=igorbrigadir
Need to test and make sure different versions are compatible.
Would it be possible to make flattening implicit when converting to CSV? Wasn't the edge case you fixed in flattening over on the twarc v2 branch about making flattening a no-op if data was already flat?
If users have already flattened their data and don't want to incur the expense of re-flattening they could pass a --noflat option?
Hello,
I am collecting data for research and have an inquiry; if someone can help me, I would appreciate it. Could you please let me know whether it is possible to pass a list of countries' supreme leaders' and presidents' usernames into twarc2 and collect the data with a specific keyword?
Support passing any type of input file to convert into a CSV or dataframe:
full response pages, one per line: {"data":[...],"meta":.., ...} \n {"data":[...],"meta":.., ...} \n ...
flattened data, 1 tweet object per line: {tweet} \n {tweet} \n {tweet}
Some of these are nice to have but non-essential.
The tool should ideally detect what's being passed to it and act accordingly. It should be possible to do some checks on the file size, first line, or first few JSON tokens to determine this efficiently.
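Detection from the first line could be sketched like this (a guess at the heuristic, not the plugin's actual code):

```python
import json

def detect_format(first_line: str) -> str:
    """Guess whether a JSONL line is an API response page or a flat tweet."""
    obj = json.loads(first_line)
    if "data" in obj and isinstance(obj["data"], list):
        return "response_page"    # {"data": [...], "meta": ..., ...}
    if "id" in obj and "text" in obj:
        return "flattened_tweet"  # {tweet}
    return "unknown"

print(detect_format('{"data": [{"id": "1"}], "meta": {}}'))  # response_page
print(detect_format('{"id": "1", "text": "hi"}'))            # flattened_tweet
```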
Should the text column include the full text of the original retweet instead of the truncated version?
Support passing a parameter to pandas to save as anything it supports.
Since pandas does the CSV saving, we can support anything it supports https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
What i'm thinking of doing is this:
If this is the default command:
twarc2 csv "input.jsonl" "output.csv"
The "advanced" command to save in any format you like can either be:
twarc2 csv --format "to_parquet" --parameters "engine='auto', compression='snappy', index=None" "input.jsonl" "output.csv"
where --format "to_parquet" is the name of the pandas function to call https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html and --parameters "engine='auto'" would define any arguments for pandas.
Or, add a new dataframe command with multiple options https://click.palletsprojects.com/en/7.x/options/#multiple-options like this:
twarc2 dataframe --format "to_parquet" "input.jsonl" "output.parquet"
I'm open to suggestions for exactly how to structure the command line.
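The --format dispatch could be as simple as a getattr lookup on the DataFrame (a sketch of the idea; the real CLI wiring would go through click):

```python
import io
import pandas as pd

def export(df: pd.DataFrame, format_name: str, target, **params):
    # Look up e.g. df.to_parquet / df.to_csv by name and call it with the
    # user-supplied parameters.
    writer = getattr(df, format_name)
    return writer(target, **params)

df = pd.DataFrame({"id": ["1"], "text": ["hello"]})
buf = io.StringIO()
export(df, "to_csv", buf, index=False)
print(buf.getvalue())  # header "id,text" then the row "1,hello"
```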
Currently it only accepts two file handles:
from twarc_csv import CSVConverter
with open("input.json", "r") as infile:
with open("output.csv", "w") as outfile:
converter = CSVConverter(infile=infile, outfile=outfile)
converter.process()
Ideally you should be able to pass it tweet / response objects and get back CSV rows, so it's easier to use in scripts.
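Until such an API exists, in-memory objects can probably be adapted to the two-file-handle interface with io.StringIO. A sketch with a stand-in converter function that just copies JSONL fields to CSV (the real call would be CSVConverter(infile=..., outfile=...) as above):

```python
import csv
import io
import json

def convert(infile, outfile):
    # Stand-in for CSVConverter.process(): one JSON object per input line.
    writer = csv.writer(outfile)
    writer.writerow(["id", "text"])
    for line in infile:
        tweet = json.loads(line)
        writer.writerow([tweet["id"], tweet["text"]])

# Wrap in-memory objects as file handles instead of real files.
tweets = [{"id": "1", "text": "hello"}, {"id": "2", "text": "world"}]
infile = io.StringIO("\n".join(json.dumps(t) for t in tweets))
outfile = io.StringIO()
convert(infile, outfile)
print(outfile.getvalue())
```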
I am trying to search for two different URLs (URL1 OR URL2), but I am not able to make it work or to escape the characters. Is this the right method?
twarc2 search '(https://www.elconfidencial.com/espana/madrid/2021-09-07/universidad-periodismo-complutense-profesores_3218500, OR https://www.infolibre.es/noticias/opinion/columnas/2021/09/08/la_verdad_sobre_caso_quiros_una_cronica_primera_persona_124235_1023.html)' > search_210913.json
⚡ There were errors processing your request: no viable alternative at character '/' (at position 122), no viable alternative at character '/' (at position 8), no viable alternative at character '/' (at position 9), no viable alternative at character '/' (at position 123)
Update docs on: