Comments (15)
Note that you probably want to take dask out of the equation too - it might be where you want the files to be processed eventually, but I think you should be able to find valid offsets without it, simplifying the process (at the expense of no parallelism).
Thank you for this long and thoughtful description of the problem and your attempt to fit it into fsspec-reference-maker. I see that our documentation, such as it is, is clearly substandard and has caused you some misconceptions. In truth, the case is much simpler, and indeed Version 0 is fully acceptable in Version 1 too - no need for any templates or generated keys at all.
Let me give a very simple example with a file from https://github.com/datapackage-examples/sample-csv .
Here are two references in version 1 format:
refs = {
    "version": 1,
    "refs": {
        "file1": ["https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv", 0, 48],
        "file2": ["https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv", 224, 31]
    }
}
(this could be in a JSON file, but works with python dicts too)
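For example, a minimal sketch of saving the dict to a file (the name refs.json is illustrative), which could then be passed as fo="refs.json" instead of the dict:

import json

with open("refs.json", "w") as f:
    json.dump(refs, f)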
Now we can use fsspec to play with this:
import fsspec

fs = fsspec.filesystem("reference", fo=refs, target_protocol="http")
fs.ls("", False)
["file1", "file2]
fs.cat("file1")
b'John,Doe,120 jefferson st.,Riverside, NJ, 08075\n'
fs.cat("file2")
b',Blankman,,SomeTown, SD, 00298\n'
And we could get Dask to read a set of virtual CSVs like
import dask.dataframe as dd

dd.read_csv("reference://*", storage_options={"fo": refs, "target_protocol": "http"})
(plus extra read_csv kwargs, since this dataset doesn't have any column headers, as it happens). Normally, the references would be stored in a JSON file, which all the workers must be able to see. Note that the glob "*" will be expanded to the two virtual files in the references. So the only task you need to accomplish is to find a set of offset/size pairs that can load the data.
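For illustration, here is a minimal serial sketch of that task for an uncompressed file, assuming every newline is a genuine row terminator (i.e. no quoted newlines yet); the function name, default blocksize, and storage handling are all just placeholders:

import fsspec

def find_offsets(url, blocksize=2**25, storage_options=None):
    """Offset/size pairs whose boundaries are snapped to the next newline."""
    pairs = []
    with fsspec.open(url, "rb", **(storage_options or {})) as f:
        size = f.seek(0, 2)       # seek to the end to learn the total file size
        start = 0
        while start < size:
            f.seek(min(start + blocksize, size))
            f.readline()          # advance just past the next newline (or EOF)
            end = min(f.tell(), size)
            pairs.append((start, end - start))
            start = end
    return pairs

# the pairs plug straight into the reference format shown above
url = "https://example.com/data.csv"   # illustrative target
refs = {
    "version": 1,
    "refs": {f"part{i}": [url, off, length]
             for i, (off, length) in enumerate(find_offsets(url))},
}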
(aside: "templates" only exist to reduce the redundancy of including the same URL in the references multiple times)
I would not attempt to solve the compression and parsing issues in one go; it would be better to use an uncompressed target at first, I think.
Would you like to try to encode your method described in dask/dask as a function/class for this repo? It would take a URL/storage-options, maybe a preferred blocksize, and return valid byte offsets for line-ends that are actual row terminators (i.e., not the parsed data at all). In this repo it would run in serial.
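For reference, the core of such a function might be a single serial pass that tracks double-quote parity, so that a newline only counts as a row terminator when it falls outside a quoted field - a sketch with illustrative names, assuming standard CSV quoting:

def valid_line_ends(data: bytes, quote: int = ord('"')) -> list[int]:
    """Offsets just past each newline that is a genuine row terminator."""
    ends = []
    in_quotes = False
    for i, byte in enumerate(data):
        if byte == quote:
            in_quotes = not in_quotes   # escaped "" toggles twice, netting out
        elif byte == 10 and not in_quotes:   # 10 == ord("\n")
            ends.append(i + 1)
    return ends

Run from the start of the file, the parity is exact; the rough blocksize-spaced offsets would then each be snapped to the nearest entry in this list.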
Already on it! 👍
I'm writing a class fsspec_reference_maker.csv.SingleCsvToPartitions, following the hdf.py module's example.
The class has the following method signatures:
def __init__(
    self, csv: BinaryIO, url: str, blocksize: int | None, lineseparator: str, spec=1, sample_rows: int = 10,
):

def translate(self):
    """Translate content of one CSV file into partition offsets format."""
    ...
    return {"version": 1, "templates": {"u": self._uri}, "refs": self.store}

def _visitoffsets(self, callback: Callable):
    """Visit each of the offsets."""

def _translator(self, offset: int):
    """Produce offset metadata for all partitions in the CSV file."""
I suppose that passing a blocksize of None would be the way to override the calculation of evenly spaced offsets.
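To illustrate the intended output shape (all keys, the URL, and the numbers below are invented), the "u" template stands in for the repeated URL, per the version 1 spec:

{
    "version": 1,
    "templates": {"u": "https://example.com/data.csv"},
    "refs": {
        "0": ["{{u}}", 0, 33554432],
        "1": ["{{u}}", 33554432, 33554432]
    }
}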
I don't think it's right to initialise the SingleCsvToPartitions class with the blocksize: IMO it should be initialised with the list of offsets, and then 'edit' those on the dask.Delayed objects, rather than putting CSV-specific routines into the general-purpose bytes.core.read_bytes function.
- From what I can see, you can't edit a dask.Delayed object, so I'd just recreate them
- (Actually, after trying, I think I can; unclear if it will work properly: will test, as it may avoid the need for excessive code reuse)
In this scenario, the out list of lists of dask.Delayed would be edited to have the correct offsets (within read_pandas, after returning from read_bytes), rather than being created with the correct offsets.
Below is the undesirable alternative, which would require using a popped kwarg at the start of read_bytes (putting CSV format-specific code in the general-purpose bytes.core module).
The alternative idea I rejected:
I'm not sure if the dask read_bytes interface should be changed (if a new argument is added to it then it may break existing code that uses it), so a way to keep the API intact would be to pop a kwarg at the very start of read_bytes, defaulting to 0 if not present:
sample_tail_rows = kwargs.pop("sample_tail_rows", 0)
This would be passed only by read_pandas (here), leaving the remaining kwargs valid to be passed as storage_options, as currently.
This variable is needed to change the assignment of the offsets in bytes.core into a conditional:
if sample_tail_rows > 0:
    off, length = zip(*[
        (s["offset"], s["size"])
        for s in SingleCsvToPartitions(...).store
    ])
else:
    off = list(range(0, size, blocksize))
    length = [blocksize] * len(off)
Would this be acceptable? I'm conscious that this introduces format-specific behaviour into the cross-format bytes.core module.
- read_bytes will always assign offsets based on blocksize, so you can't assign them any earlier
- I'd say it's messy to put format-specific routines into a general-purpose module (unless you strongly believe otherwise), so you can't change their assignment during read_bytes either
- the only remaining point to assign offsets therefore must be after they've been computed [and tokenised] within read_bytes

My conclusion is that this is undesirable and SingleCsvToPartitions should instead be used to modify the dask.Delayed objects within the dd.io.csv module; I would appreciate your view.
Some time has passed, @lmmx!
Do you think you can contribute the code you have, which takes one file and rough offsets as inputs, and returns exact offsets of valid, non-quoted, newline characters?
Hey Martin, that it has! New puppy arrived, sorry for the pause 😅 It's still up on my whiteboard, I'll bump it to the top of my weekend to-do list 👍
Will be very glad to see something working here! If you have an example that Dask currently struggles with (I think you did), all the better.
Me too! Yes, it was the WIT dataset, and it can be observed using the dataset's sample file, https://storage.googleapis.com/gresearch/wit/wit_v1.train.all-1percent_sample.tsv.gz , as described in my initial comment.
Weekend came & went but have cleared everything now + getting to grips with this again, will be in touch 👍
I'm using the repo csv-validation-sandbox to work out the minimum viable example. So far I went through my assumptions and found that some of them weren't what I first thought, and I can use the pandas parser (phew), which simplifies things; I just have to enforce the Python engine rather than the C one. The code in that repo is structured in the form of tests to demonstrate said assumptions (I felt my long issue comments were too hard to review and worked some of them out in that repo's issues instead).
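That last constraint is just a keyword argument to pandas; a one-line illustration with a made-up filename:

import pandas as pd

# the default C engine is faster, but here the pure-Python parser is needed
df = pd.read_csv("sample.tsv", sep="\t", engine="python")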
Ah, yes, it had not been on my mind, but one of the scripts I'm working with in testing [in an environment with a locally editable installation of both dask and fsspec, with breakpoints etc. at points of interest] is working with wit_v1.train.all-1percent_sample.tsv, as you suggest.
Yeah, sorry, to clarify: I was working with a local editable dask install to look at certain parameters set within the program which I'm trying to excise into this module (e.g. the default blocksize is 33554432, from 2 ** 25, set as a magic number of sorts).
I am still convinced that you shouldn't need dask at all to find the offsets, and it would make life much simpler to separate out a) generating the offsets (here) and b) using them (with dask).
This has become more interesting with the appearance of indexed_gzip, as a kerchunked read_csv could be the only way to read massive .csv.gz files in parallel with dask, since indexed_gzip allows quasi-random access into gzip streams.
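A minimal sketch of what indexed_gzip enables - seeking into the middle of a gzip stream without decompressing from the beginning each time (the filename and offset are illustrative):

import indexed_gzip

# builds seek points lazily, so later seeks into the stream are cheap
with indexed_gzip.IndexedGzipFile("data.csv.gz") as f:
    f.seek(33554432)       # jump to an offset in the *decompressed* stream
    chunk = f.read(1024)   # read uncompressed bytes from there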
Popping back in to note that just yesterday I came across the chunked CSV reader capability in Arrow (as used in PyArrow), and after checking, it indeed seems to be capable of handling intra-line newlines, so I'm curious how it's done:
- "How can i chunk through a csv using arrow?" https://stackoverflow.com/questions/68555085/how-can-i-chunk-through-a-csv-using-arrow
pyarrow.csv.open_csv
https://arrow.apache.org/docs/python/generated/pyarrow.csv.open_csv.htmlpyarrow.csv.CSVStreamingReader
: https://arrow.apache.org/docs/python/generated/pyarrow.csv.CSVStreamingReader.html#pyarrow.csv.CSVStreamingReader
The code for CSVStreamingReader
features this param
newlines_in_values : bool, optional (default False)
    Whether newline characters are allowed in CSV values.
    Setting this to True reduces the performance of multi-threaded
    CSV reading.
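For the record, a minimal sketch of using that option - newlines_in_values lives on ParseOptions, and open_csv returns the streaming reader (the filename is illustrative):

import pyarrow.csv as pv

# allow newlines inside quoted fields, at the cost of multi-threaded parsing
reader = pv.open_csv(
    "data.csv",
    parse_options=pv.ParseOptions(newlines_in_values=True),
)
for batch in reader:   # the CSVStreamingReader yields RecordBatches
    print(batch.num_rows)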