
docnow / twarc-csv


A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.

License: MIT

Python 100.00%
Topics: csv, dataframe, pandas, pandas-dataframe, twarc, twitter, twitter-api

twarc-csv's People

Contributors

edsu, igorbrigadir, sebastian-nagel


twarc-csv's Issues

More tweets than the account's tweet count

Hi! Thanks Igor for the quick reply to my issue #16. I was able to find a workaround to get what I wanted (keep original tweets which were retweeted but exclude truncated retweets) and get German characters and emojis. However, I get more tweets than the account's tweet count. Do you have any idea what I may be missing here? Below is my code (probably very long and inefficient, so apologies!):

  1. Download full archive:
    !twarc2 search "(from:volkspartei)" --archive > OVP.jsonl

  2. Convert the JSONL file to CSV:
    !twarc2 csv OVP.jsonl OVP.csv

  3. Load into a data frame to delete blank lines:
    OVP = pd.read_csv("OVP.csv")

  4. Delete rows with blank values in "text":
    OVP = OVP.dropna(axis=0, subset=['text'])

  5. Filter out rows whose "text" starts with "RT", to exclude truncated retweets and keep the original retweeted tweet:
    OVP = OVP[~OVP.text.str.startswith("RT")]

  6. Keep only selected columns:
    OVP = OVP[['author.created_at','author.name','author.username','created_at','text','type',
               'public_metrics.like_count','public_metrics.retweet_count','public_metrics.quote_count',
               'public_metrics.reply_count','id','conversation_id','lang','author.public_metrics.followers_count',
               'author.public_metrics.following_count','author.public_metrics.listed_count',
               'author.public_metrics.tweet_count']]

  7. Rename columns:
    OVP = OVP.rename(columns={'author.created_at': 'account_created_at', 'author.name': 'account_name',
                              'author.username': 'username', 'public_metrics.like_count': 'like_count',
                              'public_metrics.retweet_count': 'retweet_count',
                              'public_metrics.quote_count': 'quote_count',
                              'public_metrics.reply_count': 'reply_count',
                              'author.public_metrics.followers_count': 'account_followers',
                              'author.public_metrics.following_count': 'account_following',
                              'author.public_metrics.listed_count': 'account_listed',
                              'author.public_metrics.tweet_count': 'account_tweet_count'})

  8. Reset the index after dropping NA rows:
    OVP = OVP.reset_index(drop=True)

  9. Save back to CSV:
    OVP.to_csv("OVP_1.csv")

As you can see, the final file has a total of 12186 tweets, whereas the account's tweet count is only 11917. Do you have any idea why I end up with more tweets? If all truncated retweets begin with "RT", then my solution above should work, right?

Thanks a lot!

Encoding issues

There are still some encoding issues I'm investigating that break CSVs. If you have encountered any, post a reply here.

Support large files

Right now it reads the entire result set into memory, which means it can't handle anything larger than a moderately sized dataset.
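A possible direction, sketched here as an assumption about the fix rather than the plugin's actual implementation: stream the JSONL input in fixed-size batches and append each converted chunk to the output file, so the full result set never has to fit in memory. The converter.process() call mirrors the DataFrameConverter API mentioned in another issue on this page.

```python
import json

def convert_in_batches(infile, outfile, converter, batch_size=1000):
    # Parse and flush fixed-size batches instead of loading everything at once.
    batch, first_batch = [], True
    for line in infile:
        batch.append(json.loads(line))
        if len(batch) >= batch_size:
            converter.process(batch).to_csv(outfile, header=first_batch, index=False)
            batch, first_batch = [], False
    if batch:  # flush the final partial batch
        converter.process(batch).to_csv(outfile, header=first_batch, index=False)
```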

Expand Referenced Tweets

Currently it leaves the referenced_tweets list alone, so the column in the CSV ends up like this:

[{"type": "replied_to", "id": "1380226330034372610"}]
[{"type": "quoted", "id": "1380226330034372610"}]
[{"type"": "retweeted", "id": "1261081519566675969"}]

but we could expand this into separate columns:

referenced_tweets.replied_to
referenced_tweets.quoted
referenced_tweets.retweeted

and, by extension, the type column should be a list like ["reply"] or ["retweet","reply","quote"] for a quote tweet that is a reply to someone and was then retweeted. The type column should also be renamed to __inferred_tweet_type or similar, to indicate where this field comes from.
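A rough sketch of the expansion, assuming each cell holds the JSON-encoded list shown above (the function name and the choice to join multiple IDs with commas are illustrative):

```python
import json
import pandas as pd

def expand_referenced(df: pd.DataFrame) -> pd.DataFrame:
    """Split the referenced_tweets list into one column per reference type."""
    for ref_type in ("replied_to", "quoted", "retweeted"):
        df[f"referenced_tweets.{ref_type}"] = df["referenced_tweets"].apply(
            lambda cell: ",".join(
                r["id"] for r in json.loads(cell) if r["type"] == ref_type
            )
            if pd.notna(cell)
            else ""
        )
    return df.drop(columns=["referenced_tweets"])
```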

Duplicate columns being created when using --output-columns

e.g.:

twarc2 followers --limit 10 user | twarc2 csv --input-users-columns --output-columns "id,username,name"

gives:

1347718171470557185,1347718171470557185,2021-01-09T01:34:04.000Z,2021-01-09T01:34:04.000Z,AlexPineapple_,Alex 🍍
16832937,16832937,2008-10-17T23:36:09.000Z,2008-10-17T23:36:09.000Z,ColinNC,Colin
1321944084810997763,1321944084810997763,2020-10-29T22:37:16.000Z,2020-10-29T22:37:16.000Z,Ger11645317,Ger
17673550,17673550,2008-11-27T08:59:57.000Z,2008-11-27T08:59:57.000Z,mikemcc28,Mike

DataFrameConverter to single tweet

I have a large set of tweets, and I would like to wrangle and write them to file as I go. DataFrameConverter would be ideal for this, but if I try to pass a tweet to it, I get an error message: TypeError: process() missing 1 required positional argument: 'objects'. I followed these instructions:

from twarc_csv import DataFrameConverter

json_objects = [...]

df = DataFrameConverter.process(json_objects)

passing the converter a tweet, or a page collected as described in the examples.

What am I doing wrong, or can I not use it this way at all?
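For what it's worth, the error message suggests process() was called on the class rather than on an instance; a minimal sketch of the likely fix, assuming process is an instance method that takes a list of tweet objects:

```python
from twarc_csv import DataFrameConverter

converter = DataFrameConverter()  # instantiate first
df = converter.process([tweet])   # tweet: a JSON object you already have;
                                  # pass a list, even for a single tweet
```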

Mixed column variables when converting from flattened search into csv

I've been downloading thousands of tweets with search and flatten:
twarc2 search 'xxx' --start-time 2018-03-20 --end-time 2018-04-30 --limit 2 --archive --flatten > xxx.jsonl

When I convert to csv with twarc2 csv:
twarc2 csv xxx.jsonl xxx.csv

things seem to go well, but when I look into the author.username column, it is full of strings that are not usernames:
csvcut -c author.username xxx.csv

such as date-time strings 2016-06-08T10:08:00.000Z, ids 1604281548 or numbers 1186.0.

If I look closely at one of the files, I see that the problems start at line 6095:

[Screenshot from 2021-04-27 17-14-27]

where the reply_settings variable is filled with [{"type": "retweeted", "id": "976557335748534273"}], and from there on all the variables are mixed up.

When I run the same experiment with a few tweets from a short search, the problem does not occur.

Converting Large Jsonl file to Csv

Thanks for the upgrade. However, a 2 GB JSONL file I am trying to convert to CSV has been running for more than 4 hours without completing, while smaller files below 200 MB convert to CSV within 5 minutes.
I have upgraded twarc-csv to 2.10; what else could be the problem, and what can I do?

twarc2 search --limit inquiries

Hello,
would you mind letting me know: when I use twarc2 search --limit, are the collected tweets selected at random or according to a specific algorithm?

twarc2 csv _csv.Error: need to escape, but no escapechar set

Hi, I'm trying to run twarc-csv on a JSONL file obtained through the Academic API, using a new MacBook Pro with an M1 chip. I run this command:
twarc2 csv result.jsonl result.csv

It gets stuck at 37% every time with the output below. Is this a known error? Am I doing something wrong? Thank you in advance.

37%|█████▉ | Processed 286M/766M of input file [00:33<00:37, 13.5MB/s]
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/bin/twarc2", line 8, in <module>
    sys.exit(twarc2())
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/twarc_csv.py", line 148, in csv
    writer.process()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/csv_writer.py", line 81, in process
    self._write_output(self.converter.process(batch), first_batch)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/csv_writer.py", line 65, in _write_output
    _df.to_csv(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/generic.py", line 3466, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1105, in to_csv
    csv_formatter.save()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 257, in save
    self._save()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 262, in _save
    self._save_body()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 300, in _save_body
    self._save_chunk(start_i, end_i)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 311, in _save_chunk
    libwriters.write_csv_rows(
  File "pandas/_libs/writers.pyx", line 72, in pandas._libs.writers.write_csv_rows
_csv.Error: need to escape, but no escapechar set
37%|█████▉ | Processed 286M/766M of input file [00:33<00:56, 8.83MB/s]

Error skipping batch of tweets

I am facing some problems with the twarc2 csv command. I am using it to convert a (large) JSON file as follows:
twarc2 csv treatment.json treatment.csv
When the progress bar reaches about 89%, it returns the error:

 "withheld.scope"
 to fix, add these with --input-columns. Skipping entire batch of 670325 tweets!

Now I assume this means I need to add that option to the command line in order to recover those tweets. I am doing it like this:
twarc2 csv --input-columns "withheld.scope" treatment.json treatment.csv

but it returns an error again. I have updated twarc to its latest version, and the command has always worked properly for me except on this file.

Thanks in advance!

Float IDs

Hello!

Recently I collected some tweets using twarc2 and, after the retrieval finished, converted the output to a .csv using twarc-csv.

However, some of the columns that I need store IDs as floats, as shown in the figure. When I try to convert them to integers, I sometimes get a tweet ID that doesn't correspond to the original post (probably rounding imprecision). Is there a specific/correct way to convert these IDs to integers, or was the information lost during the conversion?

Thanks! :)

[Screenshot: ID columns displayed as floats]
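A common workaround on the pandas side, independent of twarc-csv: force the ID columns to be read as strings so they are never coerced to float64, which cannot represent 64-bit tweet IDs exactly (column names below are the usual ones; adjust as needed):

```python
import pandas as pd

# Reading IDs as strings preserves them exactly; converting an already
# float-rounded ID back to int cannot recover the lost digits.
df = pd.read_csv(
    "tweets.csv",
    dtype={"id": str, "conversation_id": str, "author.id": str},
)
```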

Escaped Unicode

I noticed that Unicode is JSON-escaped in the CSV file. I think it should be converted to UTF-8, since it's no longer JSON?

twarc search 👋 --flatten > wave.jsonl
twarc csv wave.jsonl
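Until this is changed, a workaround that appears in another issue on this page is to JSON-decode the text column after loading the CSV, assuming the column was written JSON-encoded (e.g. via --json-encode-text; the filename is illustrative):

```python
import json
import pandas as pd

df = pd.read_csv("wave.csv")
df["text"] = df["text"].apply(json.loads)  # decode the JSON-escaped text
```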

Add usernames from expansions to columns

Currently there are user IDs only, but it would help to also have usernames. Not all user columns should be added, as this would make an unreasonably wide dataframe (it would need all author columns for each quoted, retweeted, etc. user), but just adding names is enough to make things easier.

Additionally, perhaps also document extracting and converting user objects.

ERROR: 4 - fails to transform a large JSONL file

Hello,
I have a very large .jsonl file (870 MB), and I am using Python 3.8 on Ubuntu.
But twarc2 csv fails to convert it to CSV:

twarc2 csv results.jsonl tweets_with_attacks_journalists.csv

💔 ERROR: 4 Unexpected items in data!
Are you sure you specified the correct --input-data-type?
If the object type is correct, add extra columns with:
--extra-input-columns "edit_controls.is_edit_eligible,edit_controls.editable_until,edit_history_tweet_ids,edit_controls.edits_remaining"
Skipping entire batch of 9944 tweets!

Is there any other way to convert to CSV?
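The error message itself points at the fix; a sketch of the command with the suggested flag applied (unverified against this particular file):

twarc2 csv --extra-input-columns "edit_controls.is_edit_eligible,edit_controls.editable_until,edit_history_tweet_ids,edit_controls.edits_remaining" results.jsonl tweets_with_attacks_journalists.csv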

entities.urls missing for referenced tweets

It appears that referenced tweets don't have their URLs pulled over into their rows in the CSV? This came up in DocNow/twarc#538

$ twarc2 tweet 1438733968287977476 | twarc2 csv - > tweet.csv
$ xsv select id,entities.urls tweet.csv
id,entities.urls
1438486160867745801,
1438733968287977476,"[{""start"": 162, ""end"": 185, ""url"": ""https://t.co/wRfX2O9S7W"", ""expanded_url"": ""https://twitter.com/dereckapurnell/status/1438486160867745801"", ""display_url"": ""twitter.com/dereckapurnell\u2026""}]"

Add `pyproject.toml`

Installing with newer pip/Python versions gives a warning:

DEPRECATION: twarc-csv is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
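A minimal pyproject.toml sketch that would address the warning, offered as an assumption about the fix: declaring the build backend is enough to avoid the legacy setup.py install path, while the project metadata can stay in setup.py/setup.cfg.

```toml
# Minimal pyproject.toml: declare the build backend only;
# existing metadata in setup.py/setup.cfg keeps working.
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"
```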

Tweet ID with no text

Reproducible code:

!twarc2 search "(vaccine OR jab OR vaxine) (-is:retweet) (lang:en)" --archive --start-time 2021-03-01T00:00:00 --end-time 2021-03-03T00:00:00 --limit 300 raw_output.json

!twarc2 flatten 'raw_output.json' 'flattened_output.json'

!twarc2 csv --output-columns "id,created_at,text" 'flattened_output.json' 'outputshort.csv'

[attachment: outputshort.csv]

Duplicate tweets

I noticed that when I collected a conversation and then exported it to CSV, I ended up with duplicate rows for tweets, presumably because they are included in other tweets. I'm not sure what the solution is here, but it definitely seems problematic from a usability perspective.

twarc2 search conversation_id:1385008025140871168 > results.jsonl
twarc2 csv results.jsonl > results.csv

then:

>>> import pandas
>>> df = pandas.read_csv('results.csv')
>>> df.value_counts('id')
id
1385008025140871168    55
1385012190860611586     3
1385008825418321920     3
1385009576731422726     2
1385017514308956160     2
                       ..
1385015116987453441     1
1385015148184809474     1
1385015948273496068     1
1385018187419377665     1
1385192675549327366     1
Length: 68, dtype: int64
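Until the converter deduplicates upstream, a downstream workaround is straightforward (a sketch of a post-processing step, not the plugin's behaviour):

```python
import pandas as pd

df = pd.read_csv("results.csv")
df = df.drop_duplicates(subset="id")  # keep one row per tweet ID
```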

Add `--optimized` mode

The current output favors preserving as much information as possible from the original JSON, but there is some duplication, and a number of columns can be removed because they are rarely useful.

The new --optimized mode will generate CSVs that drop a set of columns to save space:

edit_controls.edits_remaining
edit_controls.editable_until
entities.cashtags
entities.hashtags
entities.mentions
withheld.scope
withheld.copyright
author.id
author.entities.description.cashtags
author.entities.description.hashtags
author.entities.description.mentions
author.url
author.withheld.scope
author.withheld.copyright
geo.coordinates.coordinates
geo.coordinates.type
geo.country
geo.full_name
geo.geo.type
matching_rules
__twarc.retrieved_at
__twarc.url
__twarc.version

(exact list to be revised later)

These are the columns that are most commonly empty or duplicated: the missing data can be inferred from the remaining columns, or, for the cashtags, hashtags, and mentions, re-extracted with twitter-text, for example.
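Applied to an existing CSV, the effect would be roughly the following (a sketch; the column list is abbreviated and, as noted above, still to be revised):

```python
import pandas as pd

# Subset of the candidate columns listed above.
OPTIMIZED_DROPS = [
    "entities.cashtags", "entities.hashtags", "entities.mentions",
    "withheld.scope", "withheld.copyright",
    "__twarc.retrieved_at", "__twarc.url", "__twarc.version",
]

df = pd.read_csv("tweets.csv")
df = df.drop(columns=OPTIMIZED_DROPS, errors="ignore")  # skip absent columns
```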

Should probably fix #36 and #47 first before this.

Merge / modify entities output

I may change how the lists of hashtags and mentions are output.

Currently, the JSON list is preserved as-is. I may change this to output a list like ["@one","@two","@three"] instead of preserving the start/end indexes, as these are generally not used.

Dealing with URLs the same way is possible, but I'm not sure how much processing to do on them (show the t.co links, the unwound URLs, or both? etc.).
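A sketch of the proposed transformation, using the mention and hashtag field names that appear in the API payloads elsewhere on this page:

```python
def simplify_entities(entities: dict) -> dict:
    """Drop start/end offsets and keep just the readable tokens (sketch)."""
    return {
        "mentions": ["@" + m["username"] for m in entities.get("mentions", [])],
        "hashtags": ["#" + h["tag"] for h in entities.get("hashtags", [])],
    }
```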

Missing retweet (quote) text

Hi there,

Is there an easy way to get the quoted tweet text field in the CSV output? The original tweet text is available but not the quoted text, which is very often where the hashtags of interest (used in the original search query) are located.

Thanks!

Error unexpected data : Problem converting jsonl file to csv file.

twarc2 search --archive --start-time 2020-01-01 --limit 10 "reliance" tweets_reliance3.jsonl

twarc2 csv tweets_reliance3.jsonl tweets_reliance3.csv

I used the following commands and get an error while converting JSONL to CSV:

ERROR: Unexpected Data: "author.withheld.scope" to fix, add these with --extra-input-columns. Skipping entire batch of 666 tweets!

Even after using the command:
twarc2 csv --extra-input-columns "author.witheld.scope" tweets_reliance3.jsonl tweets_reliance5.csv
I get the same error.

[attachment: JSONL file]

Geo Point coordinates not saved

For example, a tweet like 1249702384659554308

"geo": {
        "coordinates": {
          "type": "Point",
          "coordinates": [
            42.77810097,
            88.01785747
          ]
        }

Doesn't get saved in the CSV.

Full tweet json:

{
  "data": [
    {
      "lang": "pl",
      "entities": {
        "urls": [
          {
            "start": 212,
            "end": 235,
            "url": "https://t.co/MZ9QlqGyfA",
            "expanded_url": "https://www.instagram.com/p/B-7ItQwBJ93/?igshid=187dqx4b5lu2y",
            "display_url": "instagram.com/p/B-7ItQwBJ93/…",
            "status": 200,
            "unwound_url": "https://www.instagram.com/p/B-7ItQwBJ93/?igshid=187dqx4b5lu2y"
          }
        ],
        "hashtags": [
          {
            "start": 24,
            "end": 39,
            "tag": "harrypotterdiy"
          },
          {
            "start": 40,
            "end": 72,
            "tag": "harrypotterandphilosophersstone"
          },
          {
            "start": 73,
            "end": 84,
            "tag": "potterhead"
          },
          {
            "start": 85,
            "end": 97,
            "tag": "harrypotter"
          },
          {
            "start": 98,
            "end": 115,
            "tag": "philosopherstone"
          },
          {
            "start": 116,
            "end": 131,
            "tag": "czasnaczytanie"
          },
          {
            "start": 132,
            "end": 144,
            "tag": "zostańwdomu"
          },
          {
            "start": 145,
            "end": 157,
            "tag": "zostanwdomu"
          },
          {
            "start": 158,
            "end": 164,
            "tag": "magic"
          },
          {
            "start": 165,
            "end": 177,
            "tag": "harrypotter"
          },
          {
            "start": 178,
            "end": 187,
            "tag": "funkopop"
          },
          {
            "start": 188,
            "end": 197,
            "tag": "bookpile"
          },
          {
            "start": 198,
            "end": 210,
            "tag": "bookstagram"
          }
        ]
      },
      "created_at": "2020-04-13T14:14:01.000Z",
      "public_metrics": {
        "retweet_count": 1,
        "reply_count": 0,
        "like_count": 0,
        "quote_count": 0
      },
      "reply_settings": "everyone",
      "text": "Za co lubicie Harry'ego?#harrypotterdiy #harrypotterandphilosophersstone #potterhead #harrypotter #philosopherstone #czasnaczytanie #zostańwdomu #zostanwdomu #magic #harrypotter #funkopop #bookpile #bookstagram… https://t.co/MZ9QlqGyfA",
      "possibly_sensitive": false,
      "geo": {
        "coordinates": {
          "type": "Point",
          "coordinates": [
            42.77810097,
            88.01785747
          ]
        }
      },
      "id": "1249702384659554308",
      "context_annotations": [
        {
          "domain": {
            "id": "66",
            "name": "Interests and Hobbies Category",
            "description": "A grouping of interests and hobbies entities, like Novelty Food or Destinations"
          },
          "entity": {
            "id": "1206704182717104128",
            "name": "Model figures"
          }
        },
        {
          "domain": {
            "id": "130",
            "name": "Multimedia Franchise",
            "description": "Franchises which span multiple forms of media like 'Harry Potter'"
          },
          "entity": {
            "id": "933033311844286464",
            "name": "Harry Potter",
            "description": "This entity includes all conversation about the franchise, as well as any individual installments in the series, if applicable.\t\t\t"
          }
        }
      ],
      "author_id": "2344192110",
      "conversation_id": "1249702384659554308",
      "source": "Instagram"
    }
  ],
  "includes": {
    "users": [
      {
        "name": "Kama",
        "username": "kamanonickname",
        "protected": false,
        "verified": false,
        "public_metrics": {
          "followers_count": 64,
          "following_count": 152,
          "tweet_count": 8743,
          "listed_count": 0
        },
        "created_at": "2014-02-14T22:26:08.000Z",
        "description": "There should be bio but Mróz is busy writing his 666th novel",
        "id": "2344192110",
        "url": "",
        "profile_image_url": "https://pbs.twimg.com/profile_images/1422798418645225472/cRbGyIvp_normal.jpg"
      }
    ]
  },
  "__twarc": {
    "url": "https://api.twitter.com/2/tweets?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&media.fields=alt_text%2Cduration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&ids=1249702384659554308",
    "version": "2.8.1",
    "retrieved_at": "2021-12-14T16:17:36+00:00"
  }
}

The CSV is missing the point coordinates.

0% JSON to CSV

When using twarc2 csv tweets.jsonl tweets.csv # convert to CSV

It loads the JSONL, but the progress bar stays at 0%:

0%| | 0.00/2.72M [00:00<?, ?B/s]

0 byte file check warning

Instead of "no such file exists" error for 0 byte files it should warn you about twarc potentially not finding any results in a previous command or something. Seeing the file in a folder but twarc2 csv saying it doesn't exist confuses people.

Retweets and foreign language characters

Hi!

I'm new to Python so forgive my ignorance. I've been downloading tweets with Twarc2. I was including "--no-inline-referenced-tweets" so I don't get duplicate RT entries when exporting to CSV. I just noticed that the RT lines do not include the full text of the RTs (whereas the original Tweet does if I include it).

My problem is that if I don't include "--no-inline-referenced-tweets", then the following code I found here to deal with foreign characters does not work: df['text'] = df['text'].apply(json.loads). Is there any way to get both the RT and original tweet lines (I can then delete the duplicates) and keep the character conversion using .apply(json.loads)?

Below is my code:

Download @volkspartei tweets

!twarc2 search "(from:volkspartei)" --archive > OVP.jsonl

Convert JSONL file to CSV + eliminate RT duplicates

!twarc2 csv --json-encode-text --no-inline-referenced-tweets OVP.jsonl OVP.csv

Convert to data frame to delete blank lines in CSV file

import pandas as pd
import json
OVP = pd.read_csv("OVP.csv")

Convert German characters and emojis

OVP['text'] = OVP['text'].apply(json.loads)

Check data

OVP[['text','created_at','author.created_at']]

Save back to CSV

OVP.to_csv("OVP_1.csv")

Thanks!
Alex

Counts support

Counts are supported by the twarc2 counts command, which can already output CSV, but to make this plugin compatible with all types of API responses, it should also be able to take JSON-formatted counts and output them as CSV.
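A sketch of what that conversion could look like, assuming the counts endpoint's usual payload shape of {"data": [{"start", "end", "tweet_count"}, ...]} per response line:

```python
import json
import pandas as pd

def counts_to_df(jsonl_path: str) -> pd.DataFrame:
    # Flatten one counts response per line into a single DataFrame.
    rows = []
    with open(jsonl_path) as f:
        for line in f:
            rows.extend(json.loads(line).get("data", []))
    return pd.DataFrame(rows, columns=["start", "end", "tweet_count"])
```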

Flattening

Would it be possible to make flattening implicit when converting to CSV? Wasn't the edge case you fixed in flattening over on the twarc v2 branch about making flattening a no-op if data was already flat?

If users have already flattened their data and don't want to incur the expense of re-flattening they could pass a --noflat option?

collecting data based on specific list of usernames.

Hello,
I am collecting data for research and have a question; I would appreciate any help. Could you please let me know whether it is possible to pass a list of supreme leaders' and country presidents' usernames to twarc2 and collect their tweets matching a specific keyword?
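One way this is commonly done with the search syntax already used in other issues on this page: combine from: operators with OR and add the keyword (the usernames and keyword below are placeholders):

twarc2 search "(from:leader_one OR from:leader_two OR from:president_one) keyword" > results.jsonl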

Other input formats

Support passing any type of input file to convert into a CSV or dataframe.

  • 1 original request from api per line {"data":[...],"meta":.., ...} \n {"data":[...],"meta":.., ...} \n ...
  • "flattened" or extracted directly from data, 1 tweet object per line. {tweet} \n {tweet} \n {tweet}
  • Untested: A dataset of users as opposed to tweets.

Some of these are nice to have but non-essential.

The tool should ideally detect what's being passed to it and act accordingly. It should be possible to do some checks on the file size, first line, or first few JSON tokens to determine this efficiently.
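A sketch of the kind of cheap detection described above (the heuristics are assumptions, not the plugin's actual logic):

```python
import json

def detect_input_type(path: str) -> str:
    # Peek at the first line only, so detection stays cheap for large files.
    with open(path) as f:
        first = json.loads(f.readline())
    if "data" in first:      # one full API response per line
        return "api_response"
    if "username" in first:  # looks like a user object
        return "users"
    return "tweets"          # flattened: one tweet object per line
```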

Retweet text truncated

Should the text column include the full text of the original retweet instead of the truncated version?

Advanced Output Options

Support passing a parameter to pandas to save as anything it supports.

Since pandas does the CSV saving, we can support anything it supports https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

What I'm thinking of doing is this:

If this is the default command:

twarc2 csv "input.jsonl" "output.csv"

The "advanced" command to save in any format you like can either be:

twarc2 csv --format "to_parquet" --parameters "engine='auto', compression='snappy', index=None" "input.jsonl" "output.csv"

where --format "to_parquet" is the name of the pandas function to call https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html

and "--parameters "engine='auto'" would define any arguments for pandas.

or, add a new command dataframe like this with multiple options https://click.palletsprojects.com/en/7.x/options/#multiple-options

twarc2 dataframe --format "to_parquet" "input.jsonl" "output.parquet"

I'm open to suggestions for exactly how to structure the command line.

Make CSVConverter easier to use in scripts

Currently it only accepts two file handles:

from twarc_csv import CSVConverter

with open("input.json", "r") as infile:
    with open("output.csv", "w") as outfile:
        converter = CSVConverter(infile=infile, outfile=outfile)
        converter.process()

Ideally you should be able to pass it tweet or response objects and get back CSV rows, so it's easier to use in scripts.

How to search for URLs?

I am trying to search for two different URLs (URL1 OR URL2), but I have not been able to make it work or to escape the characters. Is this the right method?

twarc2 search '(https://www.elconfidencial.com/espana/madrid/2021-09-07/universidad-periodismo-complutense-profesores_3218500, OR https://www.infolibre.es/noticias/opinion/columnas/2021/09/08/la_verdad_sobre_caso_quiros_una_cronica_primera_persona_124235_1023.html)' > search_210913.json

⚡ There were errors processing your request: no viable alternative at character '/' (at position 122), no viable alternative at character '/' (at position 8), no viable alternative at character '/' (at position 9), no viable alternative at character '/' (at position 123)
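One thing worth trying, offered as an assumption rather than confirmed guidance: the search API's url: operator with each URL in double quotes, which keeps the bare / characters out of the query grammar:

twarc2 search '(url:"https://www.elconfidencial.com/espana/madrid/2021-09-07/universidad-periodismo-complutense-profesores_3218500" OR url:"https://www.infolibre.es/noticias/opinion/columnas/2021/09/08/la_verdad_sobre_caso_quiros_una_cronica_primera_persona_124235_1023.html")' > search_210913.json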
