timvw / qv
Quickly view your data
License: Apache License 2.0
File layouts such as mydata/*.parquet and mydata/partition=01/*.parquet are common ways of storing data. Would it be possible to support reading mydata directly in these cases? In theory DataFusion should already have some support for this.
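For reference, a minimal sketch of how reading the whole directory could look through DataFusion's listing support (the table name and paths are illustrative, and whether the partition=01 column shows up as a queryable column depends on the DataFusion version and listing options):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Registering the directory (rather than a single file) lets DataFusion
    // list every parquet file underneath it, including mydata/partition=01/.
    ctx.register_parquet("mydata", "mydata/", ParquetReadOptions::default())
        .await?;

    // Query across all the discovered files as one table.
    ctx.sql("SELECT * FROM mydata LIMIT 10").await?.show().await?;
    Ok(())
}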
Hi,
When I try to view the schema of a CSV file, I get an error about a _delta_log directory. Is this normal?
∴ qv -s ./fixtures/good/usage_data.csv
Error: ObjectStore(Generic { store: "LocalFileSystem", source: UnableToWalkDir { source: Error { depth: 0, inner: Io { path: Some("/home/guda/projects/toki/invoicing/fixtures/good/usage_data.csv/_delta_log"), err: Os { code: 20, kind: NotADirectory, message: "Not a directory" } } } } })
∴ qv -V
qv 0.3.1
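The error suggests the path is first probed as a Delta table (hence the walk into usage_data.csv/_delta_log) before the plain CSV reader gets a chance. A hypothetical sketch of the kind of guard that would avoid this; it assumes nothing about how qv actually dispatches between formats:

use std::path::Path;

/// Treat a local path as a Delta table only when it is a directory that
/// actually contains a _delta_log subdirectory; otherwise fall back to the
/// plain file readers (csv, parquet, json, ...).
fn looks_like_delta_table(path: &Path) -> bool {
    path.is_dir() && path.join("_delta_log").is_dir()
}

fn main() {
    // A single CSV file is never a Delta table.
    assert!(!looks_like_delta_table(Path::new("./fixtures/good/usage_data.csv")));
}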
Hi, I tried your tool but it does not work with special characters in the path:
# qv "/tmp/test@dir#with_special characters.parquet"
Error: ObjectStore(NotFound { path: "/tmp/test@dir%23with_special%20characters.parquet", source: Os { code: 2, kind: NotFound, message: "No such file or directory" } })
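The %23 and %20 in the error hint that the path was percent-encoded (as in a file:// URL) and then used verbatim against the local filesystem. A small sketch with the url crate (already a qv dependency) showing the round trip that would be needed; where the fix belongs inside qv is an assumption:

use std::path::PathBuf;
use url::Url;

fn main() {
    let raw = PathBuf::from("/tmp/test@dir#with_special characters.parquet");

    // Turning the path into a file:// URL percent-encodes '#' and ' '...
    let url = Url::from_file_path(&raw).expect("absolute path");
    println!("{url}"); // file:///tmp/test@dir%23with_special%20characters.parquet

    // ...so it has to be decoded again before touching the local filesystem,
    // otherwise the OS sees the escapes literally and reports NotFound.
    let decoded: PathBuf = url.to_file_path().expect("file URL");
    assert_eq!(decoded, raw);
}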
I noticed the --profile flag wasn't working for some reason, and then realized I was using an out-of-date version of qv installed via cargo install qv.
❯ cargo search qv
qv = "0.1.22" # quickly view your data
❯ cargo install qv
Updating crates.io index
Ignored package `qv v0.1.22` is already installed, use --force to override
It looks like the latest version at https://crates.io/crates/qv/versions is 0.1.22. Would it be possible to publish the more recent versions to crates.io as well?
More details: https://github.com/storj/uplink
Can we extend support to Google Cloud Storage as well?
It would be very nice to have this API generalised for any cloud storage provider, with a simple API for common use cases like fetch, get, etc.
It's really a great tool for this kind of use case (PS: data engineering).
Thanks very much for putting in the effort.
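For what it's worth, the object_store crate that qv already depends on can hand back a generic store for s3://, gs:// or az:// URLs, which is one way the provider handling could be generalised. A rough sketch, assuming the aws/gcp/azure feature flags are enabled and credentials come from the environment:

use futures::StreamExt;
use object_store::parse_url;
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The same code path works for s3://, gs:// and az:// URLs.
    let url = Url::parse("gs://my-bucket/mydata/")?;
    let (store, prefix) = parse_url(&url)?;

    // List the objects under the prefix, whatever the provider is.
    let mut objects = store.list(Some(&prefix));
    while let Some(meta) = objects.next().await {
        println!("{}", meta?.location);
    }
    Ok(())
}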
This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.
These updates have all been created already. Click a checkbox below to force a retry/rebase of any.
Cargo.toml
aws-config 1.2.1
aws-sdk-glue 1.27
aws-types 1.2
aws-credential-types 1.2
chrono 0.4.38
clap 4.5.4
datafusion 35
deltalake 0.17
futures 0.3
glob 0.3
object_store 0.9
regex 1.10
tokio 1
url 2.5
assert_cmd 2.0.14
predicates 3.1
Dockerfile
rust 1.77
.github/workflows/binaries.yml
actions/checkout v4
taiki-e/setup-cross-toolchain-action v1
taiki-e/upload-rust-binary-action v1
.github/workflows/release-plz.yml
actions/checkout v4
MarcoIeni/release-plz-action v0.5
.github/workflows/test_suite.yml
actions/checkout v4
actions-rust-lang/setup-rust-toolchain v1
taiki-e/install-action v2
mikepenz/action-junit-report v4
codecov/codecov-action v4
actions/checkout v4
actions-rust-lang/setup-rust-toolchain v1
actions-rust-lang/rustfmt v1
actions/checkout v4
actions-rust-lang/setup-rust-toolchain v1
Can we have support for the Iceberg table format as well, just like Delta Lake?
Thanks.
Is it in scope for this tool to support saving datasets, or do you want to keep it purely as a viewing tool? It would be useful, for instance, for converting formats, quickly filtering data out of a CSV, and so on. I know the DataFusion CLI exists for that, but a simple tool like AWK with a friendlier syntax would be welcome.
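For context, DataFusion (which qv builds on) can already express that read-filter-write flow in a few lines; a rough sketch, with the file names and filter made up, and with the exact write_parquet signature varying between DataFusion releases:

use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read a CSV and keep only the rows of interest...
    let df = ctx
        .read_csv("usage_data.csv", CsvReadOptions::new())
        .await?
        .filter(col("amount").gt(lit(100)))?;

    // ...then save the result as parquet: format conversion and filtering in one go.
    df.write_parquet("usage_data.parquet", DataFrameWriteOptions::new(), None)
        .await?;
    Ok(())
}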
It might be because of the high number of CPUs (if Arrow starts a thread-per-core threadpool), but reading a 1 MB parquet file (with limits) takes 3s. When running qv table.parquet, only one thread / CPU is likely needed (or one per column at most), since in theory we are reading only a single batch (a few rows).
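If the slowdown really is thread/partition related, these are the two knobs that usually matter; neither is necessarily what qv does today, so treat this as a sketch:

use datafusion::prelude::*;

fn main() -> datafusion::error::Result<()> {
    // Cap the Tokio runtime at a couple of worker threads instead of
    // one-per-core on a large machine...
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .enable_all()
        .build()?;

    runtime.block_on(async {
        // ...and ask DataFusion to plan a single partition, which is plenty
        // for showing the first few rows of a 1 MB parquet file.
        let config = SessionConfig::new().with_target_partitions(1);
        let ctx = SessionContext::new_with_config(config);
        ctx.read_parquet("table.parquet", ParquetReadOptions::default())
            .await?
            .limit(0, Some(10))?
            .show()
            .await
    })
}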
I have a table that reads correctly using Spark + the Delta Lake libraries, but I'm having trouble reading it via qv. Do you know which downstream dependency could be giving me this error?
Error: ArrowError(ExternalError(Execution("Failed to map column projection for field mycolumn. Incompatible data types List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) and List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })")))
I checked the schema from the Delta transaction log and didn't see a hardcoded item or element:
❯ aws s3 cp s3://mybucket/year=2022/month=6/day=9/myprefix/_delta_log/00000000000000000000.json - | head -n 3 | tail -n 1 | jq '.metaData.schemaString | fromjson | .fields[] | select(.name == "mycolumn")'
{
"name": "mycolumn",
"type": {
"type": "array",
"elementType": "string",
"containsNull": true
},
"nullable": true,
"metadata": {}
}
When I look at the schema of a sample parquet file on S3, I do indeed see that the item in the list is called element:
pqrs schema =(s5cmd cat s3://mybucket/year=2022/month=6/day=9/myprefix/_partition=00001/part-00037-cb2e71c3-4f26-4de0-9e9a-18298489ccdc.c000.snappy.parquet)
...
message spark_schema {
...
OPTIONAL group mycolumn (LIST) {
REPEATED group list {
OPTIONAL BYTE_ARRAY element (UTF8);
}
}
...
}
I see this exact error comes from here: https://github.com/apache/arrow-datafusion/blob/aad82fbb32dc1bb4d03e8b36297f8c9a3148df89/datafusion/core/src/physical_plan/file_format/mod.rs#L253
And I also see that element is hardcoded in delta-rs here:
https://github.com/delta-io/delta-rs/blob/83b8296fa5d55ebe050b022ed583dc57152221fe/rust/src/delta_arrow.rs#L38-L48 (pr: delta-io/delta-rs#228)
But I can't seem to find where the schema mismatch is coming from.
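A small illustration of why the two schemas end up incompatible: Arrow's List type carries the name of its child field, so two lists of nullable strings that differ only in that name do not compare equal (the import path assumes the arrow crate re-exported by DataFusion):

use std::sync::Arc;

use datafusion::arrow::datatypes::{DataType, Field};

fn main() {
    // Two logically identical list-of-string types that differ only in the
    // name of the child field, mirroring the two sides of the error message.
    let list_item = DataType::List(Arc::new(Field::new("item", DataType::Utf8, true)));
    let list_element = DataType::List(Arc::new(Field::new("element", DataType::Utf8, true)));

    // The child field name participates in type equality, which is why the
    // projection mapping reports "Incompatible data types".
    assert_ne!(list_item, list_element);
}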
Looks like DataFusion just shipped the ability to read compressed data, such as gzipped JSON:
apache/datafusion#3642 (review)
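A rough sketch of what that could look like from the API side; the module path of FileCompressionType and the NdJsonReadOptions builder methods have moved between DataFusion releases, so treat the exact paths as assumptions:

use datafusion::datasource::file_format::file_compression_type::FileCompressionType;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read newline-delimited JSON that has been gzip-compressed.
    let options = NdJsonReadOptions::default()
        .file_compression_type(FileCompressionType::GZIP);

    ctx.read_json("data.json.gz", options).await?.show().await?;
    Ok(())
}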