
bob's Introduction

BOB

A binary data "streams+" API & implementations via data producers, data consumers, and pull flow.

The name? BLOB (Matteo Collina).

Bytes Over Buffers (Thomas Watson).

This is a Node.js strategic initiative aiming to improve Node.js streaming data interfaces, both within Node.js core internally, and hopefully also as future public APIs.

Flow of data & errors through BOB sinks & sources

Published Modules

The following modules contain usable components (sources, sinks, or transforms) and are published to npm.

The following modules are not published but are 'functional'.

API Reference

The following files serve as the API's reference:

Examples

The composition of the classes looks like this:

const { Stream } = require('bob-streams')

const source = new Source(/* args */)
const xform = new Transform(/* args */)
const sink = new Sink(/* args */)

const stream = new Stream(source, xform, sink)
stream.start(error => {
  // The stream is finished when this is called.
})

An entire passthrough could look like this:

class PassThrough {
  bindSource (source) {
    source.bindSink(this)
    this.source = source
    return this
  }

  bindSink (sink) {
    this.sink = sink
  }

  next (status, error, buffer, bytes) {
    this.sink.next(status, error, buffer, bytes)
  }

  pull (error, buffer) {
    this.source.pull(error, buffer)
  }
}

API Extension Reference

The following files serve as API extension references:

  • extension-stop - Tell a source to stop.
    • Useful for dealing with timeouts on network APIs.

Project Approach

High-level timeline:

  • Prototype separate from core entirely.
  • Move into nodejs org once JS & C++ APIs are significantly prototyped.
  • Begin transitioning Node.js internals once the APIs and perf are proved.
  • If an internal transition works out well, begin planning public APIs.

All of these steps necessitate the buy-in of many stakeholders, both in Node.js core and the greater Node.js ecosystem. This is a long-term project by necessity and design.

Goals

Some collective goals for this initiative.

  • Both performance and ease-of-use are key.
  • Implementable in a performant and usable way for both JS and C++.
  • Browser portability is preferable.

Protocol

As a preface, "protocol" refers to a system with "producer / source" and "consumer / sink" endpoints.

The Protocol itself must be simple:

  • Pull-based: The consumer requests ("pulls") data from the producer.
  • Binary-only: Data is binary buffers only, "object mode" and string encodings are not supported at the protocol level.
  • Stateless: The protocol must not require state to be maintained out-of-band.
    • Non-normative: While the protocol itself does not require out-of-band state, actual operations almost always do.
    • Minimize state assumed between calls.
  • One-to-one: The protocol assumes a one-to-one relationship between producer and consumer.
  • Timing agnostic: The protocol makes no timing (sync or async) assumptions.
  • No buffering: The protocol must not require buffering (although specific implementations might).
    • Non-normative: While the protocol itself does not require buffering, starting sources almost always do (including transforms).
  • In-line errors and EOF: Errors, data, and EOF ("end") should flow through the same call path.
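As a concrete illustration of these rules, here is a minimal sketch of the pull flow in JavaScript. The class names and status constants below are illustrative assumptions, not the published bob-streams API:

```javascript
// Illustrative status constants: positive means more data follows,
// 0 means EOF, negative means error (names are hypothetical).
const STATUS_CONTINUE = 1
const STATUS_END = 0
const STATUS_ERROR = -1

// A toy source that hands out preloaded chunks on each pull().
class ArraySource {
  constructor (chunks) {
    this.chunks = chunks
  }
  bindSink (sink) {
    this.sink = sink
    return this
  }
  pull (error, buffer) {
    if (error) return this.sink.next(STATUS_ERROR, error, buffer, 0)
    if (this.chunks.length === 0) {
      // EOF flows through the same call path as data.
      return this.sink.next(STATUS_END, null, buffer, 0)
    }
    const chunk = this.chunks.shift()
    const bytes = chunk.copy(buffer) // write into the sink-owned buffer
    this.sink.next(STATUS_CONTINUE, null, buffer, bytes)
  }
}

// A toy sink that owns the preallocated buffer and collects all data.
class CollectSink {
  constructor (source, done) {
    this.source = source.bindSink(this)
    this.buffer = Buffer.alloc(16) // the consumer owns preallocated memory
    this.received = []
    this.done = done
  }
  start () {
    this.source.pull(null, this.buffer)
  }
  next (status, error, buffer, bytes) {
    if (status < 0) return this.done(error)
    if (bytes > 0) this.received.push(Buffer.from(buffer.subarray(0, bytes)))
    if (status === STATUS_END) return this.done(null, Buffer.concat(this.received))
    this.source.pull(null, this.buffer) // never more than one pull in flight
  }
}

const source = new ArraySource([Buffer.from('hello '), Buffer.from('world')])
new CollectSink(source, (err, data) => {
  console.log(data.toString()) // 'hello world'
}).start()
```

Note how errors and EOF travel through the same next() call as data, and how the sink owns the single preallocated buffer.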

Consumer

  • Should make no assumptions about the timing of when data will be received (sync or async).
  • Should own any preallocated memory (the buffer).
  • Must never make more than one data request upstream at the same time.
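The "never more than one data request" rule lends itself to a simple runtime check. Here is a hedged sketch of a verifying passthrough that throws on a double pull; the class name and wiring details are assumptions for illustration:

```javascript
// Hypothetical guard passthrough: tracks whether a pull is outstanding
// and throws if the downstream consumer issues a second one.
class VerifyPull {
  bindSource (source) {
    source.bindSink(this)
    this.source = source
    this.pulling = false
    return this
  }
  bindSink (sink) {
    this.sink = sink
  }
  pull (error, buffer) {
    if (this.pulling) {
      throw new Error('protocol violation: pull() while a pull is outstanding')
    }
    this.pulling = true
    this.source.pull(error, buffer)
  }
  next (status, error, buffer, bytes) {
    // Clear the flag before forwarding, so the sink may legally
    // issue its next pull from inside its next() callback.
    this.pulling = false
    this.sink.next(status, error, buffer, bytes)
  }
}
```

Clearing the flag before forwarding is the important detail: a sink that pulls again from within next() is within the rules, since the previous request has completed.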

Performance

Please see performance.md for profiling results & information.

Current results estimate a ~30% reduction in CPU time in unfavorable cases, and up to an 8x reduction in favorable cases. This should correlate with overall throughput but may not map exactly.

Project Layout

API reference examples sit in the top-level directory and are prefixed by reference-. These are functional and tested when practical, notably reference-verify, reference-passthrough, and verify-buffered-transform.

Other helpers, such as Stream(), reside in the /helpers/ and /tests/helpers directories. All useful and usable components in this repo are exported from index.js as the bob-streams npm module.

Functional sources, sinks, and so on can be found in their own npm modules. See [Published Modules](#published-modules).

Development

Tests

npm install && npm test

Building the addons

The addons are presently very out-of-date.

You must have a local install of Node master @ ~ 694ac6de5ba2591c8d3d56017b2423bd3e39f769

npm i node-gyp
node-gyp rebuild --nodedir=your/local/node/dir -C ./addons/passthrough
node-gyp rebuild --nodedir=your/local/node/dir -C ./addons/fs-sink
node-gyp rebuild --nodedir=your/local/node/dir -C ./addons/fs-source

License

MIT Licensed. Contributions via DCO 1.1.

bob's People

Contributors: antsmartian, fishrock123, jasnell


bob's Issues

Progress 23/7/2018 - July

The current status is:

Completed moving sinks / sources to npm modules:

Next up:

  • Make a pull-streaming http / socket "duplex". Needs to use BOB streams as far down as possible.
    • Likely to be based on Matteo's work on a lighter TCP socket + http implementation.

Previous status - 1/6/2018 (Berlin Collab Summit): #11


Buffer allocation hints

As much as I'd like to avoid it, it seems like some kind of hint system would be useful for telling who should allocate buffers and of what size.

Ideally, this would be done outside the regular pull flow (to avoid passing ~7 arguments every time). So, it will probably be connected to another (yet-to-be-made) issue about doing some kind of "construct" flow...

See #30 (comment)

One point of contention in the current design of the sink API is who allocates the buffer. If I have data already in buffers that needs to be written to a socket or disk, it doesn't make sense for the sink to allocate a write buffer and tell me to copy values into it.

Proposal: change binding

Binding is ugly and I bet it is the least comprehensible part right now.

Ideally I guess it would look something like Stream(source, transform, sink).

Non-single-logical-flow (multiple pulls)

Moving out from #23 (comment)

It seems that newer network protocols like QUIC want multiple chunks of data to be in flight at once (not to mention re-sending).

This probably violates these two core design ideas:

  • One-to-one: The protocol assumes a one-to-one relationship between producer and consumer.
  • In-line errors and EOF: Errors, data, and EOF ("end") should flow through the same call path.

It may also unleash zalgo? lol.

Anyways, I think it is possible to still keep things simple and "pretend" that things are multiplexed, by doing slightly more waiting at the network sink end. I'm not really sure that perf would be considerably impacted in most cases?

Edit: See #30 (comment) for updated thoughts.

Open Questions presented to N+JSI collab summit

From my collab summit presentation (#16), here are the open questions I presented at the end:

  • Is this API likeable?
    • It seemed to be well received at the collab summit
  • Do we need a buffer pooling helper?
    • (Probably.)
  • Is it too API-pattern focused?
  • How to enforce single active request?
  • Where to start implementing in core?
  • Libuv pull streams?
    • The libuv folks I catted with at N+JSI seemed open to it.

What are the differences of this approach in comparison to pull-stream?

I'm excited to see this development as I am a heavy user of the pull-stream ecosystem for etl processing. This approach feels and reads extremely similar, but with obvious gains to be made by making it natively supported by node. Do these two efforts align (or differ) in any way? Is bob expected to support existing pull-stream patterns so as to benefit the variety of libraries already available on npm? Could it? πŸ˜„

For reference: https://github.com/pull-stream/pull-stream

Progress 30/3/2018 (2018 week 13)

The current status is:

Next up:

  • Updated profiles
  • Proper C++ error handling / passing / translation
  • Evaluation of API, possible comparison / test implementation of a more 'functional' style JS API for discussion reference.

Previous status - 13/3/2018 (2018 week 11): #9

Progress 11/12/2019 - December

The last update was fairly large (23/07/2019 - July, #40), but this one is much smaller.

Notably I'm no longer employed and paid to continue this kind of work around Node and I don't really do this as my hobby or have much default motivation to continue.

I did merge a pull request that adds WritableSource and ReadableSink for streams3 interoperability: 57a78d1

I am supposed to present this initiative's status again at the Montreal collab summit. I am not quite sure what will come out of that and it may end up being more of a post-mortem.


Progress 23/07/2019 - July

Ok so, a lot of stuff has happened since the last update (15/11/2018 - November).

An unfortunately incomplete list of actions since then:

  • Re-presented at the Berlin June 2019 collab summit.
  • Introduced Stream() composition #31
  • start(cb) required for Sinks #32
  • Automated tests! #38
  • AssertionSink & AssertionSource
  • A Verification passthrough for API enforcement #34
  • Large progress on Streams3 Readable/Writable adaptors #35
  • This repo is now an npm module under the name bob-streams. #33
  • Made crc-transform as proof for an internal live coding demo
    • I don't think I can make the recording public but maybe I can do a public / extended version
  • Many other minor fixes
  • Additional prototyping of interaction with async iterators
  • Discussion about adding an offset to pull() #23
  • Discussion about multi-pull flow #30

Proposal: rename `next()`

Due to potential conflicts with iterator#next(), perhaps this part of the protocol should be renamed. Any thoughts?

Maybe something like give()?

Proposal: Status Code Error Enums

For errors, any negative value should be considered an error, with multiple negative values permitted. That would obviously mean not using an enum for status, and instead using constants... e.g.

#define BOB_ERR_{whatever} -{some integer}
#define BOB_ERR_END 0
#define BOB_ERR_CONTINUE 1
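On the JS side, the same convention could be mirrored with plain constants and a sign check rather than an enum. A hedged sketch (these names are hypothetical, not part of the current API):

```javascript
// Hypothetical JS mirror of the proposed constants: any negative
// status is an error, 0 is end, positive means more data follows.
const BOB_STATUS_END = 0
const BOB_STATUS_CONTINUE = 1
const BOB_ERR_UNKNOWN = -1

// The sign check is the whole point: multiple distinct negative
// values can all be treated as errors without enumerating them.
function isError (status) {
  return status < 0
}

function isEnd (status) {
  return status === BOB_STATUS_END
}
```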

potential future move to nodejs org

Not really super into moving stuff into an official repo atm, but once more progress has been made (see #2) we may want to, since forming a "team" may help gain traction.

idk if that would mean actually moving the repo or just its contents

Progress 15/11/2018 - November

The current status is:

Completed sink / source / "duplex" npm modules:

Next up:

  • Address open questions: #17
  • Write a Streams3 adaptor: #18

Previous status - Progress 5/10/2018 - October: #13


Progress 15/1/2018 (2018 week 3)

The current status is:

  • Goals: almost ironed out, see #1
  • initial js-only fs source/sinks: working in this repo
  • js-only zlib transform: almost working, see aec09c2
  • buffering api: no work yet
  • C++ api: currently out of date, see nodejs/node#16414

I'm thinking of taking a swing at doing a C++ fs after the js-only zlib transform is working.

potential stack overflows

Currently, the bob sink calls this.source.pull() and the source then calls this.sink.next().
However, if the source calls back synchronously, the sink calls pull again synchronously, and the stream moves enough data, this could cause a stack overflow.

I worked around this in pull-stream with this (ugly) code:
https://github.com/pull-stream/pull-stream/blob/master/sinks/drain.js#L12-L37

Basically, it checks whether its next was called synchronously (i.e. whether the last call to pull has returned yet) and, if so, falls out to a loop that calls pull again. If pull() returns before next is called, the source is async, so it exits the loop. This is the most complicated part of pull-streams, and bob streams will need something like it too. You could use setImmediate instead, but that's actually more overhead than the loop, and the loop means a completely sync stream can stay completely sync.

push-stream solves this a much simpler way: sinks have a paused property, which the source can check before it calls write. A sync source can just loop until the sink pauses, then wait until resume is called. This means it uses less stack memory.

Hmm, that wouldn't work with bob streams because of the way the sink allocates the buffer...
I'm not really sure about that, though. (There is also the matter of forbidding object streams, but let's not discuss that in this issue; the stack overflow problem is more important.)
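The "fall out to a loop" workaround described above can be sketched for bob's shape of the protocol. Everything below is an illustrative assumption, not existing bob-streams code: if next() fires synchronously (before pull() has returned), set a flag and let a loop issue the next pull instead of recursing deeper.

```javascript
// Hypothetical sink with a sync-reentry trampoline: a completely sync
// source is drained in a loop at constant stack depth, while an async
// source simply restarts the loop from its next() callback.
class LoopingSink {
  constructor (source) {
    this.source = source.bindSink(this)
    this.buffer = Buffer.alloc(64)
    this.looping = false
    this.calledSync = false
    this.ended = false
  }
  start (done) {
    this.done = done
    this._pullLoop()
  }
  _pullLoop () {
    this.looping = true
    do {
      this.calledSync = false
      this.source.pull(null, this.buffer)
      // If calledSync is still false here, next() has not fired yet:
      // the source is async, so exit and wait for next() to restart us.
    } while (this.calledSync && !this.ended)
    this.looping = false
  }
  next (status, error, buffer, bytes) {
    if (status <= 0) {
      this.ended = true
      this.done(error)
      return
    }
    // ...consume buffer.subarray(0, bytes) here...
    if (this.looping) {
      this.calledSync = true // sync callback: let the loop pull again
    } else {
      this._pullLoop() // async callback: restart the pull loop
    }
  }
}
```

With this shape, a fully synchronous source stays synchronous (no setImmediate overhead) and the recursion depth never grows past one pull/next pair.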

Where do object streams fit in?

One thing I like about node.js streams is composability. It's easy to compose pipelines that mix binary and object streams. For example, parsing a csv file, transforming the rows (objects), then writing back to a file. When raw performance isn't important (it often isn't) then node.js streams are pretty great.

Bob is binary-only, so what will replace object streams? If the answer is "async iterators", how does one compose a pipeline as easily as x.pipe(y).pipe(z)? And if each of those is an async iterator, wouldn't that hurt throughput, as you can no longer transform multiple objects in one tick?

Construct flow

Some kind of construct flow would be very useful for a couple of significant reasons:

  • Presently resources must either be opened upon constructor call, or upon "first pull".
    • The former presents async timing issues
    • The latter is complicated and messy
  • It would be useful to pass buffer allocation hints out-of-flow (#52)

The argument list for doing it all inline would get pretty long (and highly variable), even without counting the first point:

pull(status_type, error, buffer, size, offset)

Maybe?

Idk, maybe separating this all out into multiple flows would be better, similar to Streams3 but just sans the dreaded EventEmitter.

  • (sink calls) -> (source calls)
  • construct(...) -> ready(...)
  • pull(...) -> give(...)
  • destroy(error) -> destroy(error)

Very related to nodejs/node#29314
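The call pairs above could be sketched as a lifecycle. This is purely speculative; every name here is hypothetical and not part of the current reference API:

```javascript
// Speculative multi-flow source: construct/ready happens out of the
// pull flow, pull/give carries data, destroy tears down both ends.
class ConstructedSource {
  bindSink (sink) {
    this.sink = sink
    return this
  }
  construct (allocHint) {
    // Open underlying resources here, outside the pull flow, then
    // signal readiness (possibly echoing a buffer allocation hint).
    this.opened = true
    this.sink.ready(null, allocHint)
  }
  pull (error, buffer) {
    // For brevity this source is empty: signal EOF immediately.
    this.sink.give(0, null, buffer, 0)
  }
  destroy (error) {
    this.opened = false
    this.sink.destroy(error)
  }
}
```

Separating construction from the pull flow would address both the async-open timing issue and the argument-count problem in one place.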

Streams3 adaptor / transform (s)

Heya @mcollina & @mafintosh - I think this is the next required step here, since we'd need this kinda thing to be able to do anything in node core anyways.

I've created a repo for this at https://github.com/Fishrock123/bob-streams3 and invited you both as collaborators. It contains some basic bits but no functioning code.

There are a good bit of docs and examples lying around here & in the linked-to module repos, please holler if you need any of my help.

Blog about this?

I probably have enough material, could be good for visibility?

'bob' performance discussion

So, I finally profiled this on my linux box (macOS is useless because of ___channel_get_opt, good luck).

I have documented the results so far in performance.md. I only really tried doing a very large file and have not yet made cases that make many small streams.

The results are looking good. The HDD is the limiting factor of my linux system, and the profiles show file copying has ~7x less CPU time in JS, and zlib transform has ~33% less CPU time in JS. πŸ’₯ (C++ time does not seem significantly affected for either case.)

cc @jasnell, @mcollina

Proposal: add `offset` to `pull()`

From the collab summit, after talking with @jasnell extensively about QUIC, I think it would be useful to have pull support an offset to ask the source to read from (the source may choose to still return whatever data it chooses).

This should allow for ACKs in a network implementation (i.e. if you need to ACK, you just request the last / desired offset again).

It also would support a level of content-addressability. A sink or downstream intermediate could be the one to inform a file source of where to read from.

Relatedly, this may also make disk-based sources require less state...
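A hedged sketch of what an offset-aware source could look like. The pull(error, buffer, offset) signature and the class below are assumptions for illustration, not an agreed-upon API:

```javascript
// Hypothetical in-memory "file" source where pull() accepts an
// optional offset: if given, the source seeks to that byte position
// before reading, enabling re-requests (e.g. for a network ACK).
class MemFileSource {
  constructor (data) {
    this.data = data
    this.pos = 0
  }
  bindSink (sink) {
    this.sink = sink
    return this
  }
  pull (error, buffer, offset) {
    if (typeof offset === 'number') this.pos = offset // re-request from here
    if (this.pos >= this.data.length) {
      return this.sink.next(0, null, buffer, 0) // EOF
    }
    const end = Math.min(this.pos + buffer.length, this.data.length)
    const bytes = this.data.copy(buffer, 0, this.pos, end)
    this.pos += bytes
    this.sink.next(1, null, buffer, bytes)
  }
}
```

Note that the only extra state is the read position, and an offset-bearing pull simply overwrites it, which is the "less state on disk-based sources" angle mentioned above.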

"Extensions"

I have added a section to the readme in this repo about "extensions": API Extension Reference

So far this has seemed the best way to deal with possible optional additional APIs, such as an explicit start, or a stop for handling timeouts.

Organize a meeting

I think it would be useful to have a voice meeting soon to discuss various unresolved conversations.

Things that should be talked about:

  • Project Approach evaluation
  • Goals evaluation
  • API conventionality (how to enforce this?)
  • Naming e.g. #22, "sinks"
  • Multi-pull flow #30
  • C++ binding detection #29

ccing everyone who has shown interest so far:
@jasnell, @Raynos, @mcollina, @mafintosh, @benjamingr

Progress 13/3/2018 (2018 week 11)

The current status is:

Next up:

  • C++ file source or sink endpoint
  • Evaluation of API, possible comparison / test implementation of a more 'functional' style JS API for discussion reference.

Previous status - 16/2/2018 (2018 week 7): #5

Reconsider `bind{Source|Sink}()`

I'm really not convinced the binding apis are very good.

Might be better to have a helper function like streamline(a, b, c) - ideally so that you can do streamline(streamline(a, b, c), d) too.

Gonna try to work on a PoC module this week...

Managed state

Essentially an extension of #53 with a different focus.

There are a number of reasons (#53, #52, #30, etc) for why some kind of state management that does not need to be reimplemented by every stream would be useful.

There are two primary ways to deal with this (that I can think of):

  • Class inheritance, do everything in a base class and then extend that
  • Some kind of managed state object which is persistent and passed in the pull flow
    • @jasnell's idea
    • Could potentially deal with the issues without requiring class inheritance?

Of course, if we inherit from a class, the obvious thing to do would be to integrate the verify transform into said class, so that guarantees are enforced not merely by convention but by code.

Transform this repo into more of a spec

So I've been realizing a couple related things...

A big one is that I think protocol enforcement needs to be done via classes / helpers somehow, but at the same time I don't think it should be 100% necessary. (Heck it isn't even in streams3)

A more formal specification than just the reference would be useful, in the same way it would be useful to have (or had) that kind of thing for streams3.

This should also allow the project, at the spec and "core classes" level, to be moved into the node org without having to drag every submodule in.
