Comments (6)
The issue is with the way DataFusion breaks the file into pieces.
let session_config = SessionConfig::new().with_repartition_file_scans(false);
let ctx = SessionContext::new_with_config(session_config);
I've found that if repartition is disabled, it works flawlessly.
So I suspect something is wrong here in case of ZStd.
After splitting the file into 10 slices, it decodes some of them but fails on the others.
from arrow-datafusion.
Nice find!
It seems like the current code disables repartitioning for gzip:
datafusion/datafusion/core/src/datasource/physical_plan/json.rs
Lines 157 to 159 in 9f0e016
Maybe we have to do something similar for zstd and other compression types.
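To make the suggestion concrete, here is a minimal sketch of such a guard. It is not DataFusion's actual code: the `Compression` enum, `is_splittable`, and `plan_byte_ranges` are hypothetical names invented for illustration. It assumes that a compressed stream generally has to be decoded from its start, so only uncompressed files are cut into byte ranges.

```rust
/// Hypothetical stand-in for a compression setting (not DataFusion's own type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Compression {
    Uncompressed,
    Gzip,
    Zstd,
}

impl Compression {
    /// Assumption: only uncompressed files can be cut at arbitrary byte
    /// offsets; a compressed stream is decoded from its beginning.
    fn is_splittable(self) -> bool {
        matches!(self, Compression::Uncompressed)
    }
}

/// Split `file_size` bytes into at most `target_partitions` contiguous
/// half-open ranges `(start, end)`. Files whose compression type is not
/// splittable get a single range covering the whole file.
fn plan_byte_ranges(
    file_size: u64,
    target_partitions: u64,
    compression: Compression,
) -> Vec<(u64, u64)> {
    if file_size == 0 {
        return Vec::new();
    }
    if !compression.is_splittable() || target_partitions <= 1 {
        return vec![(0, file_size)];
    }
    // Ceiling division so the ranges cover every byte of the file.
    let chunk = (file_size + target_partitions - 1) / target_partitions;
    (0..target_partitions)
        .map(|i| (i * chunk, ((i + 1) * chunk).min(file_size)))
        // Drop empty ranges that appear when the file is smaller than
        // the number of requested partitions.
        .filter(|(start, end)| start < end)
        .collect()
}
```

With this sketch, `plan_byte_ranges(1000, 10, Compression::Uncompressed)` yields ten ranges, while the same call with `Compression::Zstd` or `Compression::Gzip` yields a single range for the whole file, which mirrors the effect of turning off `with_repartition_file_scans` for that file.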
Thanks for the report -- can you possibly share an example of such a file (or instructions for how to create one)?
Here is an example file: data.zst.json
The code below shows that the file decodes without problems using async_compression, which is what DataFusion uses.
However, the same file cannot be read as a DataFrame.
use arrow::datatypes::{Field, Schema};
use datafusion::common::arrow::datatypes::{DataType, TimeUnit};
use datafusion::datasource::file_format::options::NdJsonReadOptions;
use datafusion::datasource::file_format::file_compression_type::FileCompressionType;
use datafusion::prelude::*;
use std::io::Error;
use datafusion::error::Result;
use async_compression::tokio::bufread::ZstdDecoder;
use tokio::io::AsyncReadExt;
const FILE_PATH: &str = "data.zst";
#[tokio::main]
async fn main() -> Result<(), Error> {
// Read the file with tokio and decompress it with ZstdDecoder
let file = tokio::fs::File::open(FILE_PATH).await?;
let mut reader = ZstdDecoder::new(tokio::io::BufReader::new(file));
let mut buf = vec![];
reader.read_to_end(&mut buf).await?;
println!("Read {} bytes", buf.len());
let schema = Schema::new(vec![
Field::new("OriginalRequest", DataType::Utf8, false),
Field::new(
"RequestStarted",
DataType::Timestamp(TimeUnit::Millisecond, None),
false,
),
]);
// Create context
let ctx = SessionContext::new();
// Read data
let json_options = NdJsonReadOptions::default()
.file_extension("zst")
.file_compression_type(FileCompressionType::ZSTD)
.schema(&schema);
let df = ctx.read_json(FILE_PATH, json_options).await?;
println!("Hello, ZStd issue!");
df.show_limit(10).await?;
Ok(())
}
Thank you @Smotrov
Given that we now have a good reproducer for this issue, I think it is ready for someone to take a look if they have time.