Use the Redis protocol · dat (12 comments, closed)

dat-ecosystem commented on August 16, 2024
Use the Redis protocol

from dat.

Comments (12)

waldoj commented on August 16, 2024

What's the alternative to Redis? Is it inventing your own framing protocol? Or is it simply not supporting a command line pipe chain?

konklone commented on August 16, 2024

All of your points and upsides make sense to me. Are there any downsides? This doesn't introduce a dependency on Redis, just a framing protocol that shares its name. (Sort of equivalent to making use of MongoDB's BSON, which some projects do.)

You say you don't think it'll add much downside, and I haven't looked at binary-csv's implementation, but is it already investigating the contents of lines to handle cases like newlines inside of quoted strings?

Also, if possible, I would recommend integrating binary-csv into dat so that you can cut out the double pipe. Doing curl http://some-website.com/huge_data.csv | dat is a much simpler introduction for people cutting their teeth on dat and the command line generally.
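For context, the framing protocol under discussion length-prefixes each payload rather than scanning for delimiters. A minimal sketch of such an encoder (illustrative only, not dat's or binary-csv's actual code):

```javascript
// Redis-style framing: "*1" = one bulk item, "$N" = N bytes of payload,
// each header line terminated by CRLF. The receiver reads the length and
// then takes the next N bytes verbatim, never inspecting the payload.
function encodeRecord(buf) {
  return Buffer.concat([
    Buffer.from(`*1\r\n$${buf.length}\r\n`),
    buf,
    Buffer.from('\r\n')
  ]);
}

console.log(JSON.stringify(encodeRecord(Buffer.from('hello')).toString()));
// "*1\r\n$5\r\nhello\r\n"
```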

waldoj commented on August 16, 2024

Also, if possible, I would recommend integrating binary-csv into dat so that you can cut out the double pipe.

👍

brycebaril commented on August 16, 2024

The only downside I can really see is there are features in the protocol it doesn't look like you're using, e.g. multibulk replies & errors.

This weekend I wrote multibuffer for similar purposes. I also based it on the Redis protocol, but changed a couple of things based on experience with working with it. Rather than having to scan through the buffer and read chunks, I optimized it for Node's Buffer (and bops) interactions by using a fixed-width length prefix for each buffer segment.

I left off the initial frame count because I didn't need it, but it would be easy to add in.
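The fixed-width-prefix idea can be sketched in a few lines. This is a hypothetical simplification, not multibuffer's actual implementation (which also handles streaming and bops):

```javascript
// Each segment is [4-byte big-endian length][payload]. A reader can hop
// from segment to segment by arithmetic alone, with no delimiter scan.
function pack(bufs) {
  return Buffer.concat(bufs.map(b => {
    const len = Buffer.alloc(4);
    len.writeUInt32BE(b.length, 0);
    return Buffer.concat([len, b]);
  }));
}

function unpack(packed) {
  const out = [];
  let offset = 0;
  while (offset < packed.length) {
    const len = packed.readUInt32BE(offset);
    out.push(packed.subarray(offset + 4, offset + 4 + len));
    offset += 4 + len;
  }
  return out;
}
```

Because the prefix is fixed-width, payloads are fully binary-safe: newlines, NULs, or anything else can appear in a segment without ambiguity.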

junosuarez commented on August 16, 2024

The only downside that sprang to mind is being cut off from other newline-delimited text processing tools. This could be mitigated by having converters in the pipe chain, but I would be wary of encouraging people to build "plugins" in this ecosystem as opposed to general purpose unix utilities.

mranney commented on August 16, 2024

Like @brycebaril says, Redis has a lot of extra stuff in there that you probably don't need. However, there are already multiple implementations of a Redis client in node, so the extra stuff might not be a burden.

In your Unix pipe use case, the actual Redis protocol includes responses, but Unix pipes are one-way. So perhaps you can use TCP at that layer.

If you are using TCP, something to keep in mind with the Redis protocol is that massive performance gains can be achieved by allowing pipelining. That is, keeping a window of requests that have been sent but whose responses have not yet been read. However, the current Redis protocol does not support responding to commands out of order, or even reliably identifying which command a response is from. The downside then is that, when there are bugs, you might end up mixing up the commands and their replies. If you are doing write-only, then it probably doesn't matter, but as soon as you start reading data with this protocol, the lack of command id is a serious design consideration.
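The pairing problem described above can be sketched as follows; this is a hypothetical client, not node_redis's implementation:

```javascript
// With no command ids in the protocol, a pipelining client keeps a FIFO
// of pending callbacks and pairs each incoming reply with the oldest
// outstanding command.
class PipelinedClient {
  constructor(send) {
    this.send = send;      // function that writes a command to the socket
    this.pending = [];     // callbacks awaiting replies, oldest first
  }
  command(cmd, cb) {
    this.pending.push(cb); // send immediately, don't wait for prior replies
    this.send(cmd);
  }
  onReply(reply) {
    // Replies MUST arrive in command order; one mismatch here silently
    // mispairs every later reply -- exactly the risk described above.
    const cb = this.pending.shift();
    cb(reply);
  }
}
```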

max-mapper commented on August 16, 2024

@konklone @waldoj I am most likely going to put CSV and JSON support into dat core as both of those formats are a) really prevalent and b) non-trivial to parse in a streaming fashion. There will also be a generic newline-separated data parser. I'd like to limit it to these three, though. Anything else can be built on top and use the Redis protocol if it wants to be fast.

This means that you'll have to tell dat what type of data you're importing. The data will be split into rows and the raw rows will be stored. This will make importing super fast. On read the data will be converted to JSON.

Rough ideas for API (optional, longhand and shorthand versions are shown on same line for brevity):

cat data.csv | dat --csv --sep="\n" -s "\n" --delim="," -d ","
cat newline-separated-data | dat --sep="\n" -s "\n" --preview -p
cat stream-of-json | dat --json --path "rows.*"

So basically,

  • JSON or CSV parsers built in to dat
  • Newline separated raw data or redis-protocol delimited raw data parsers built in to dat
  • All data replication/more complex use cases in dat should dogfood the above parsers

@jden Great point. For the case of CSV it's actually impossible to write sane modular command line workflows using newlines because of the special use of newlines within the CSV spec. I think CSV is an outlier though, and shouldn't technically be considered a newline delimited format.
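A quick illustration of why line-based tools break on CSV:

```javascript
// RFC 4180 allows literal newlines inside quoted fields, so a
// line-oriented tool sees two "lines" where a CSV parser sees one row.
const csv = 'name,quote\n"ada","first\nprogrammer"\n';

console.log(csv.split('\n').filter(Boolean).length); // 3 "lines"
// A streaming CSV parser (e.g. binary-csv) would instead yield 2 rows:
// the header plus one record.
```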

@brycebaril @mranney the extra features are a good point to bring up. For a lot of bulk loading use cases I don't actually care about the response. For example, if I am a command line utility piping data to another command line utility on the same machine (from a file into a local database), I just want to send data, not receive confirmation that everything went okay. The responses seem more for remote operations. Do you see any issues posed here?

Also, regarding the Redis protocol: I was trying to figure out, but couldn't find anything conclusive, whether you have to escape newlines in the payload.

For example, will this break?

*1
$9
hey
there

According to the logic of the parser in node_redis it won't break, but I wasn't sure whether other parsers are stricter.

waldoj commented on August 16, 2024

I'd like to limit it to these three, though.

I totally support this decision. Maybe in a few years, JSON will be old and busted, and something else will be the new hotness, but right now, you've got 95% of the bases covered with maybe 20% of the effort that would be required to support something like XML.

mranney commented on August 16, 2024

Redis protocol is binary safe, so if you say that 9 bytes are to follow, they can all be newlines, or a JPEG, or whatever.
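This is why the length prefix matters: after reading the `$9` header, a parser takes the next 9 bytes verbatim, newlines included. A minimal sketch of bulk-string decoding (canonical RESP terminates header lines with CRLF; this is illustrative, not node_redis's parser):

```javascript
// Parse a single bulk string of the form "$<len>\r\n<payload>\r\n".
// The payload is taken by byte count, so embedded newlines are just bytes.
function parseBulk(buf) {
  const headerEnd = buf.indexOf('\r\n');
  const len = parseInt(buf.subarray(1, headerEnd).toString(), 10);
  return buf.subarray(headerEnd + 2, headerEnd + 2 + len);
}

// "hey\nthere" is 9 bytes, one of which happens to be a newline.
const msg = Buffer.from('$9\r\nhey\nthere\r\n');
console.log(JSON.stringify(parseBulk(msg).toString())); // "hey\nthere"
```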

max-mapper commented on August 16, 2024

I'm going to go with this for now: https://github.com/brycebaril/multibuffer-stream

Thanks @brycebaril. I'll keep this issue open as I still want to consider Redis protocol support -- I just don't know yet if I need the response semantics. If I do then it makes sense; if I don't then multibuffer-stream will do just fine.

phred commented on August 16, 2024

Just a quick thought: what about something like find(1)'s -print0 and xargs(1)'s -0 options? That is, using a literal NUL (\0) to separate records in the pipeline. See:

http://en.wikipedia.org/wiki/Xargs#The_separator_problem

Not sure how the CSV spec handles a literal \0, but it might be a simple way to handle this.
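The -print0 trick works because NUL can never appear in a filename, so it is a safe delimiter there. A quick sketch of the idea (hypothetical, and note the caveat: truly binary payloads could contain \0 themselves, which is what a length-prefixed format avoids):

```javascript
// NUL-delimited records, in the style of find -print0 | xargs -0.
// Embedded newlines in a record are harmless because \0 is the separator.
const records = ['a.csv', 'b with\nnewline.csv'];
const stream = records.join('\0') + '\0';

const parsed = stream.split('\0').filter(Boolean);
console.log(parsed.length); // 2 -- the embedded newline did not split a record
```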

max-mapper commented on August 16, 2024

In case anyone is interested, I wrote up a bit on multibuffers, which I'll be referring to within dat as '.buff': https://github.com/maxogden/dat/blob/master/notes.md#the-buff-format

@phred thanks for the link, I didn't know that's how print0 and xargs do it. I think the above buff format will work a little bit better, as it is a nice compromise between a delimited and a framed format.
