
Comments (6)

Smotrov commented on June 2, 2024

The issue is with the way DataFusion breaks the file into pieces.

let session_config = SessionConfig::new().with_repartition_file_scans(false);
let ctx = SessionContext::new_with_config(session_config);

I've found that if repartition is disabled, it works flawlessly.

So I suspect something is wrong here in the case of zstd:

let repartitioned_file_groups_option = FileGroupPartitioner::new()

After splitting the file into 10 slices, it decodes some of them but fails on the others.
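One plausible reading of this failure is that a zstd stream is only decodable from a frame boundary, so a slice that starts at an arbitrary byte offset gives the decoder nothing it can recognize. The sketch below illustrates that idea with the standard library only. `split_ranges` and `starts_with_zstd_magic` are hypothetical helpers written for this example, not DataFusion functions, and the "file" is a stand-in buffer rather than real compressed data.

```rust
/// Little-endian byte encoding of the Zstandard frame magic number 0xFD2FB528.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Hypothetical helper: split `len` bytes into at most `n` contiguous
/// half-open byte ranges of roughly equal size. Panics if `n` is zero.
fn split_ranges(len: usize, n: usize) -> Vec<(usize, usize)> {
    assert!(n > 0, "n must be at least 1");
    let chunk = (len + n - 1) / n; // ceiling division
    (0..n)
        .map(|i| ((i * chunk).min(len), ((i + 1) * chunk).min(len)))
        .filter(|(start, end)| start < end) // drop empty trailing ranges
        .collect()
}

/// Hypothetical helper: true if the slice begins with a zstd frame magic number.
fn starts_with_zstd_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(&ZSTD_MAGIC)
}

fn main() {
    // Stand-in for a compressed file: the magic number followed by opaque payload bytes.
    let mut file = ZSTD_MAGIC.to_vec();
    file.extend(std::iter::repeat(0xAAu8).take(96));

    for (start, end) in split_ranges(file.len(), 10) {
        let at_header = starts_with_zstd_magic(&file[start..end]);
        println!("slice {start}..{end}: starts at frame header = {at_header}");
    }
}
```

In this toy buffer only the first slice begins at a frame header. A real file made of several concatenated frames could happen to have more than one slice that decodes, which would be consistent with some slices succeeding and others failing.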

from arrow-datafusion.

alamb commented on June 2, 2024

Nice find!

It seems like the current code disables repartitioning for gzip:

if self.file_compression_type == FileCompressionType::GZIP {
    return Ok(None);
}

Maybe we have to do something similar for zstd and other compression types 🤔
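The suggestion above, skipping repartitioning for every compressed format instead of gzip alone, can be sketched in isolation. This is a minimal illustration under stated assumptions: the `Compression` enum and `target_partitions` function are hypothetical stand-ins invented for this example, not DataFusion's `FileCompressionType` or its partitioner API.

```rust
/// Hypothetical stand-in for a file compression setting; not DataFusion's type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Compression {
    Uncompressed,
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

/// Hypothetical helper: how many byte-range slices a single file may be scanned in.
/// Only uncompressed files are split; every compressed format is read as one slice,
/// because a decoder generally cannot start in the middle of a compressed stream.
fn target_partitions(compression: Compression, requested: usize) -> usize {
    match compression {
        Compression::Uncompressed => requested.max(1),
        // Catch-all arm: any compressed format, including ones added later, is not split.
        _ => 1,
    }
}

fn main() {
    let formats = [
        Compression::Uncompressed,
        Compression::Gzip,
        Compression::Bzip2,
        Compression::Xz,
        Compression::Zstd,
    ];
    for compression in formats {
        let n = target_partitions(compression, 10);
        println!("{compression:?}: scan in {n} slice(s)");
    }
}
```

Matching on the uncompressed case and treating everything else as unsplittable is a deliberate choice here: a newly added compressed format then defaults to the safe behavior instead of silently being split.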


alamb commented on June 2, 2024

Thanks for the report -- can you possibly share an example of such a file (or instructions for how to create one)?


Smotrov commented on June 2, 2024

Here is an example file data.zst.json

And here is the code, which shows that the file can be decoded perfectly with async_compression, the crate DataFusion uses. However, the same file cannot be read as a DataFrame.

use arrow::datatypes::{Field, Schema};
use datafusion::common::arrow::datatypes::{DataType, TimeUnit};
use datafusion::datasource::file_format::options::NdJsonReadOptions;
use datafusion::datasource::file_format::file_compression_type::FileCompressionType;
use datafusion::prelude::*;
use std::io::Error;
use datafusion::error::Result;
use async_compression::tokio::bufread::ZstdDecoder;
use tokio::io::AsyncReadExt;

const FILE_PATH: &str = "data.zst";

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Open the file with tokio and wrap it in a streaming zstd decoder
    let file = tokio::fs::File::open(FILE_PATH).await?;
    let mut reader = ZstdDecoder::new(tokio::io::BufReader::new(file));

    let mut buf = vec![];
    reader.read_to_end(&mut buf).await?;
    
    println!("📦 Read {} bytes", buf.len());


    let schema = Schema::new(vec![
        Field::new("OriginalRequest", DataType::Utf8, false),
        Field::new(
            "RequestStarted",
            DataType::Timestamp(TimeUnit::Millisecond, None),
            false,
        ),
    ]);

    // Create context
    let ctx = SessionContext::new();

    // Read data
    let json_options = NdJsonReadOptions::default()
        .file_extension("zst")
        .file_compression_type(FileCompressionType::ZSTD)
        .schema(&schema);
    let df = ctx.read_json(FILE_PATH, json_options).await?;

    println!("🤨 Hello, ZStd issue!");
    df.show_limit(10).await?;
    
    Ok(())
}


alamb commented on June 2, 2024

Thank you @Smotrov 🙏


alamb commented on June 2, 2024

Given that we now have a good reproducer for this issue, I think it is ready for someone to take a look if they have time.

