timvw / qv
Quickly view your data
License: Apache License 2.0
File layouts such as mydata/*.parquet and mydata/partition=01/*.parquet are common ways of storing data. Would it be possible to support reading mydata directly in these cases? In theory DataFusion should already have some support for this.
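For reference, a minimal sketch of how reading the whole directory could look through DataFusion's listing support (the table name and paths are illustrative, and whether the partition=01 column shows up as a queryable column depends on the DataFusion version and listing options):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Registering the directory (rather than a single file) lets DataFusion
    // list every parquet file underneath it, including mydata/partition=01/.
    ctx.register_parquet("mydata", "mydata/", ParquetReadOptions::default())
        .await?;

    // Query across all the discovered files as one table.
    ctx.sql("SELECT * FROM mydata LIMIT 10").await?.show().await?;
    Ok(())
}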
Hi,
When I try to view the schema of a CSV file, I get an error about a _delta_log directory. Is this normal?
∴ qv -s ./fixtures/good/usage_data.csv
Error: ObjectStore(Generic { store: "LocalFileSystem", source: UnableToWalkDir { source: Error { depth: 0, inner: Io { path: Some("/home/guda/projects/toki/invoicing/fixtures/good/usage_data.csv/_delta_log"), err: Os { code: 20, kind: NotADirectory, message: "Not a directory" } } } } })
∴ qv -V
qv 0.3.1
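The error suggests the path is first probed as a Delta table (hence the walk into usage_data.csv/_delta_log) before the plain CSV reader gets a chance. A hypothetical sketch of the kind of guard that would avoid this; it assumes nothing about how qv actually dispatches between formats:

use std::path::Path;

/// Treat a local path as a Delta table only when it is a directory that
/// actually contains a _delta_log subdirectory; otherwise fall back to the
/// plain file readers (csv, parquet, json, ...).
fn looks_like_delta_table(path: &Path) -> bool {
    path.is_dir() && path.join("_delta_log").is_dir()
}

fn main() {
    // A single CSV file is never a Delta table.
    assert!(!looks_like_delta_table(Path::new("./fixtures/good/usage_data.csv")));
}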
Hi, I tried your tool but it does not work with special characters in the path:
# qv "/tmp/test@dir#with_special characters.parquet"
Error: ObjectStore(NotFound { path: "/tmp/test@dir%23with_special%20characters.parquet", source: Os { code: 2, kind: NotFound, message: "No such file or directory" } })
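The %23 and %20 in the error hint that the path was percent-encoded (as in a file:// URL) and then used verbatim against the local filesystem. A small sketch with the url crate (already a qv dependency) showing the round trip that would be needed; where the fix belongs inside qv is an assumption:

use std::path::PathBuf;
use url::Url;

fn main() {
    let raw = PathBuf::from("/tmp/test@dir#with_special characters.parquet");

    // Turning the path into a file:// URL percent-encodes '#' and ' '...
    let url = Url::from_file_path(&raw).expect("absolute path");
    println!("{url}"); // file:///tmp/test@dir%23with_special%20characters.parquet

    // ...so it has to be decoded again before touching the local filesystem,
    // otherwise the OS sees the escapes literally and reports NotFound.
    let decoded: PathBuf = url.to_file_path().expect("file URL");
    assert_eq!(decoded, raw);
}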
I noticed the --profile flag wasn't working for some reason, and then realized I was using an out-of-date version of qv installed via cargo install qv.
❯ cargo search qv
qv = "0.1.22" # quickly view your data
❯ cargo install qv
Updating crates.io index
Ignored package `qv v0.1.22` is already installed, use --force to override
It looks like the latest version at https://crates.io/crates/qv/versions is 0.1.22. Would it be possible to publish the more recent versions to crates.io as well?
More details: https://github.com/storj/uplink
Can we extend support to Google Cloud Storage as well?
It would be very nice to have this API generalised for any cloud storage provider, with a simple API for common use cases like fetch, get, etc.
It's really a great tool for this kind of use case (PS: data engineering).
Thanks very much for putting in the effort.
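For what it's worth, the object_store crate that qv already depends on can hand back a generic store for s3://, gs:// or az:// URLs, which is one way the provider handling could be generalised. A rough sketch, assuming the aws/gcp/azure feature flags are enabled and credentials come from the environment:

use futures::StreamExt;
use object_store::parse_url;
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The same code path works for s3://, gs:// and az:// URLs.
    let url = Url::parse("gs://my-bucket/mydata/")?;
    let (store, prefix) = parse_url(&url)?;

    // List the objects under the prefix, whatever the provider is.
    let mut objects = store.list(Some(&prefix));
    while let Some(meta) = objects.next().await {
        println!("{}", meta?.location);
    }
    Ok(())
}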
This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.
These updates have all been created already. Click a checkbox below to force a retry/rebase of any.
Cargo.toml
aws-config 1.2.1
aws-sdk-glue 1.27
aws-types 1.2
aws-credential-types 1.2
chrono 0.4.38
clap 4.5.4
datafusion 35
deltalake 0.17
futures 0.3
glob 0.3
object_store 0.9
regex 1.10
tokio 1
url 2.5
assert_cmd 2.0.14
predicates 3.1
Dockerfile
rust 1.77
.github/workflows/binaries.yml
actions/checkout v4
taiki-e/setup-cross-toolchain-action v1
taiki-e/upload-rust-binary-action v1
.github/workflows/release-plz.yml
actions/checkout v4
MarcoIeni/release-plz-action v0.5
.github/workflows/test_suite.yml
actions/checkout v4
actions-rust-lang/setup-rust-toolchain v1
taiki-e/install-action v2
mikepenz/action-junit-report v4
codecov/codecov-action v4
actions/checkout v4
actions-rust-lang/setup-rust-toolchain v1
actions-rust-lang/rustfmt v1
actions/checkout v4
actions-rust-lang/setup-rust-toolchain v1
Can we have support for the Iceberg table format as well, just like Delta Lake?
Thanks.
Is it in scope for this tool to support saving datasets, or do you want to keep it purely as a viewing tool? It would be useful, for instance, for converting formats, quickly filtering data out of a CSV, and so on. I know the DataFusion CLI exists for that, but a simple tool like AWK with a friendlier syntax would be welcome.
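For context, DataFusion (which qv builds on) can already express that read-filter-write flow in a few lines; a rough sketch, with the file names and filter made up, and with the exact write_parquet signature varying between DataFusion releases:

use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read a CSV and keep only the rows of interest...
    let df = ctx
        .read_csv("usage_data.csv", CsvReadOptions::new())
        .await?
        .filter(col("amount").gt(lit(100)))?;

    // ...then save the result as parquet: format conversion and filtering in one go.
    df.write_parquet("usage_data.parquet", DataFrameWriteOptions::new(), None)
        .await?;
    Ok(())
}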
It might be because of the high number of CPUs (if Arrow starts a thread-per-core threadpool), but reading a 1 MB parquet file (with limits) takes 3s. When running qv table.parquet, only one thread / CPU is likely needed (or one per column at most), since in theory we are reading only a single batch (a few rows).
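If the slowdown really is thread/partition related, these are the two knobs that usually matter; neither is necessarily what qv does today, so treat this as a sketch:

use datafusion::prelude::*;

fn main() -> datafusion::error::Result<()> {
    // Cap the Tokio runtime at a couple of worker threads instead of
    // one-per-core on a large machine...
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .enable_all()
        .build()?;

    runtime.block_on(async {
        // ...and ask DataFusion to plan a single partition, which is plenty
        // for showing the first few rows of a 1 MB parquet file.
        let config = SessionConfig::new().with_target_partitions(1);
        let ctx = SessionContext::new_with_config(config);
        ctx.read_parquet("table.parquet", ParquetReadOptions::default())
            .await?
            .limit(0, Some(10))?
            .show()
            .await
    })
}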
I have a table that reads correctly using Spark + the Delta Lake libraries, but I'm having trouble reading it via qv. Do you know which downstream dependency could be giving me this error?
Error: ArrowError(ExternalError(Execution("Failed to map column projection for field mycolumn. Incompatible data types List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) and List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })")))
I checked the schema from the Delta transaction log and didn't see a hardcoded item or element:
❯ aws s3 cp s3://mybucket/year=2022/month=6/day=9/myprefix/_delta_log/00000000000000000000.json - | head -n 3 | tail -n 1 | jq '.metaData.schemaString | fromjson | .fields[] | select(.name == "mycolumn")'
{
"name": "mycolumn",
"type": {
"type": "array",
"elementType": "string",
"containsNull": true
},
"nullable": true,
"metadata": {}
}
When I look at the schema of a sample parquet file on S3, I do indeed see that the item in the list is called element:
pqrs schema =(s5cmd cat s3://mybucket/year=2022/month=6/day=9/myprefix/_partition=00001/part-00037-cb2e71c3-4f26-4de0-9e9a-18298489ccdc.c000.snappy.parquet)
...
message spark_schema {
...
OPTIONAL group mycolumn (LIST) {
REPEATED group list {
OPTIONAL BYTE_ARRAY element (UTF8);
}
}
...
}
I see this exact error comes from here: https://github.com/apache/arrow-datafusion/blob/aad82fbb32dc1bb4d03e8b36297f8c9a3148df89/datafusion/core/src/physical_plan/file_format/mod.rs#L253
And I also see that element is hardcoded in delta-rs here:
https://github.com/delta-io/delta-rs/blob/83b8296fa5d55ebe050b022ed583dc57152221fe/rust/src/delta_arrow.rs#L38-L48 (pr: delta-io/delta-rs#228)
But I can't seem to find where the schema mismatch is coming from.
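A small illustration of why the two schemas end up incompatible: Arrow's List type carries the name of its child field, so two lists of nullable strings that differ only in that name do not compare equal (the import path assumes the arrow crate re-exported by DataFusion):

use std::sync::Arc;

use datafusion::arrow::datatypes::{DataType, Field};

fn main() {
    // Two logically identical list-of-string types that differ only in the
    // name of the child field, mirroring the two sides of the error message.
    let list_item = DataType::List(Arc::new(Field::new("item", DataType::Utf8, true)));
    let list_element = DataType::List(Arc::new(Field::new("element", DataType::Utf8, true)));

    // The child field name participates in type equality, which is why the
    // projection mapping reports "Incompatible data types".
    assert_ne!(list_item, list_element);
}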
Looks like DataFusion just shipped the ability to read compressed data, such as gzipped JSON:
apache/datafusion#3642 (review)
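A rough sketch of what that could look like from the API side; the module path of FileCompressionType and the NdJsonReadOptions builder methods have moved between DataFusion releases, so treat the exact paths as assumptions:

use datafusion::datasource::file_format::file_compression_type::FileCompressionType;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read newline-delimited JSON that has been gzip-compressed.
    let options = NdJsonReadOptions::default()
        .file_compression_type(FileCompressionType::GZIP);

    ctx.read_json("data.json.gz", options).await?.show().await?;
    Ok(())
}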