
bob's Introduction

BOB

A binary data "streams+" API & implementations via data producers, data consumers, and pull flow.

The name? BLOB (Matteo Collina).

Bytes Over Buffers (Thomas Watson).

This is a Node.js strategic initiative aiming to improve Node.js streaming data interfaces, both within Node.js core internally, and hopefully also as future public APIs.

Flow of data & errors through BOB sinks & sources

Published Modules

The following modules contain usable components (sources, sinks, or transforms) and are published to npm.

The following modules are not published but are 'functional'.

API Reference

The following files serve as the API's reference:

Examples

The composition of the classes looks like this:

const { Stream } = require('bob-streams')

const source = new Source(/* args */)
const xform = new Transform(/* args */)
const sink = new Sink(/* args */)

const stream = new Stream(source, xform, sink)
stream.start(error => {
  // The stream is finished when this is called.
})

An entire passthrough could look like this:

class PassThrough {
  bindSource (source) {
    source.bindSink(this)
    this.source = source
    return this
  }

  bindSink (sink) {
    this.sink = sink
  }

  next (status, error, buffer, bytes) {
    this.sink.next(status, error, buffer, bytes)
  }

  pull (error, buffer) {
    this.source.pull(error, buffer)
  }
}

API Extension Reference

The following files serve as API extension references:

  • extension-stop - Tell a source to stop.
    • Useful for dealing with timeouts on network APIs.

Project Approach

High-level timeline:

  • Prototype separate from core entirely.
  • Move into nodejs org once JS & C++ APIs are significantly prototyped.
  • Begin transitioning Node.js internals once the APIs and perf are proved.
  • If an internal transition works out well, begin planning public APIs.

All of these steps necessitate the buy-in of many stakeholders, both in Node.js core and the greater Node.js ecosystem. This is a long-term project by necessity and design.

Goals

Some collective goals for this initiative.

  • Both performance and ease-of-use are key.
  • Implementable in a performant and usable way for both JS and C++.
  • Browser portability is preferable.

Protocol

As a preface, "protocol" refers to a system with "producer / source" and "consumer / sink" endpoints.

The Protocol itself must be simple:

  • Pull-based: The consumer requests ("pulls") data from the producer.
  • Binary-only: Data is binary buffers only, "object mode" and string encodings are not supported at the protocol level.
  • Stateless: The protocol must not require state to be maintained out-of-band.
    • Non-normative: While the protocol itself does not require out-of-band state, actual operations almost always do.
    • Minimize state assumed between calls.
  • One-to-one: The protocol assumes a one-to-one relationship between producer and consumer.
  • Timing agnostic: The protocol makes no timing (sync or async) assumptions.
  • No buffering: The protocol must not require buffering (although specific implementations might).
    • Non-normative: While the protocol itself does not require buffering, starting sources almost always do (including transforms).
  • In-line errors and EOF: Errors, data, and EOF ("end") should flow through the same call path.
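As a concrete illustration of these rules, here is a minimal sketch of the pull flow in JavaScript. The class names and status constants below are illustrative assumptions, not the published bob-streams API:

```javascript
// Illustrative status constants: positive means more data follows,
// 0 means EOF, negative means error (names are hypothetical).
const STATUS_CONTINUE = 1
const STATUS_END = 0
const STATUS_ERROR = -1

// A toy source that hands out preloaded chunks on each pull().
class ArraySource {
  constructor (chunks) {
    this.chunks = chunks
  }
  bindSink (sink) {
    this.sink = sink
    return this
  }
  pull (error, buffer) {
    if (error) return this.sink.next(STATUS_ERROR, error, buffer, 0)
    if (this.chunks.length === 0) {
      // EOF flows through the same call path as data.
      return this.sink.next(STATUS_END, null, buffer, 0)
    }
    const chunk = this.chunks.shift()
    const bytes = chunk.copy(buffer) // write into the sink-owned buffer
    this.sink.next(STATUS_CONTINUE, null, buffer, bytes)
  }
}

// A toy sink that owns the preallocated buffer and collects all data.
class CollectSink {
  constructor (source, done) {
    this.source = source.bindSink(this)
    this.buffer = Buffer.alloc(16) // the consumer owns preallocated memory
    this.received = []
    this.done = done
  }
  start () {
    this.source.pull(null, this.buffer)
  }
  next (status, error, buffer, bytes) {
    if (status < 0) return this.done(error)
    if (bytes > 0) this.received.push(Buffer.from(buffer.subarray(0, bytes)))
    if (status === STATUS_END) return this.done(null, Buffer.concat(this.received))
    this.source.pull(null, this.buffer) // never more than one pull in flight
  }
}

const source = new ArraySource([Buffer.from('hello '), Buffer.from('world')])
new CollectSink(source, (err, data) => {
  console.log(data.toString()) // 'hello world'
}).start()
```

Note how errors and EOF travel through the same next() call as data, and how the sink owns the single preallocated buffer.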

Consumer

  • Should make no assumptions about the timing of when data will be received (sync or async).
  • Should own any preallocated memory (the buffer).
  • Must never make more than one data request upstream at the same time.
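The "never more than one data request" rule lends itself to a simple runtime check. Here is a hedged sketch of a verifying passthrough that throws on a double pull; the class name and wiring details are assumptions for illustration:

```javascript
// Hypothetical guard passthrough: tracks whether a pull is outstanding
// and throws if the downstream consumer issues a second one.
class VerifyPull {
  bindSource (source) {
    source.bindSink(this)
    this.source = source
    this.pulling = false
    return this
  }
  bindSink (sink) {
    this.sink = sink
  }
  pull (error, buffer) {
    if (this.pulling) {
      throw new Error('protocol violation: pull() while a pull is outstanding')
    }
    this.pulling = true
    this.source.pull(error, buffer)
  }
  next (status, error, buffer, bytes) {
    // Clear the flag before forwarding, so the sink may legally
    // issue its next pull from inside its next() callback.
    this.pulling = false
    this.sink.next(status, error, buffer, bytes)
  }
}
```

Clearing the flag before forwarding is the important detail: a sink that pulls again from within next() is within the rules, since the previous request has completed.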

Performance

Please see performance.md for profiling results & information.

Current results estimate a ~30% reduction in CPU time in unfavorable cases, and up to an 8x reduction in favorable cases. This should correlate with overall throughput but may not map exactly.

Project Layout

API reference examples sit in the top-level directory and are prefixed by reference-. These are functional and tested when practical, notably reference-verify, reference-passthrough, and verify-buffered-transform.

Other helpers, such as Stream(), reside in the /helpers/ and /tests/helpers directories. All useful and usable components in this repo are exported from index.js as the bob-streams npm module.

Functional sources, sinks, and so on can be found in their own npm modules. See [Published Modules](#published-modules).

Development

Tests

npm install && npm test

Building the addons

The addons are presently very out-of-date.

You must have a local install of Node master @ ~ 694ac6de5ba2591c8d3d56017b2423bd3e39f769

npm i node-gyp
node-gyp rebuild --nodedir=your/local/node/dir -C ./addons/passthrough
node-gyp rebuild --nodedir=your/local/node/dir -C ./addons/fs-sink
node-gyp rebuild --nodedir=your/local/node/dir -C ./addons/fs-source

License

MIT Licensed. Contributions via DCO 1.1.

bob's People

Contributors: antsmartian, fishrock123, jasnell


bob's Issues

Progress 23/7/2018 - July

The current status is:

Completed moving sinks / sources to npm modules:

Next up:

  • Make a pull-streaming http / socket "duplex". Needs to use BOB streams as far down as possible.
    • Likely to be based on Matteo's work on a lighter TCP socket + http implementation.

Previous status - 1/6/2018 (Berlin Collab Summit): #11


Buffer allocation hints

As much as I'd like to avoid it, it seems like some kind of hint system would be useful for telling who should allocate buffers and of what size.

Ideally, this would be done outside the regular pull flow (to avoid passing ~7 arguments every time). So, it will probably be connected to another (yet-to-be-made) issue about doing some kind of "construct" flow...

See #30 (comment)

One point of contention in the current design of the sink API is who allocates the buffer. If I have data already in buffers that needs to be written to a socket or disk, it doesn't make sense for the sink to allocate a write buffer and tell me to copy values into it.

Proposal: change binding

Binding is ugly and I bet it is the least comprehensible part right now.

Ideally I guess it would look something like Stream(source, transform, sink).

Non-single-logical-flow (multiple pulls)

Moving out from #23 (comment)

It seems that newer network protocols like QUIC want multiple chunks of data to be in flight at once (not to mention re-sending).

This probably violates these two core design ideas:

  • One-to-one: The protocol assumes a one-to-one relationship between producer and consumer.
  • In-line errors and EOF: Errors, data, and EOF ("end") should flow through the same call path.

It may also unleash zalgo? lol.

Anyways, I think it is possible to still keep things simple and "pretend" that things are multiplexed, by doing slightly more waiting at the network sink end. I'm not really sure that perf would be considerably impacted in most cases?

Edit: See #30 (comment) for updated thoughts.

Open Questions presented to N+JSI collab summit

From my collab summit presentation (#16), here are the open questions I presented at the end:

  • Is this API likeable?
    • It seemed to be well received at the collab summit
  • Do we need a buffer pooling helper?
    • (Probably.)
  • Is it too API-pattern focused?
  • How to enforce single active request?
  • Where to start implementing in core?
  • Libuv pull streams?
    • The libuv folks I catted with at N+JSI seemed open to it.

What are the differences of this approach in comparison to pull-stream?

I'm excited to see this development as I am a heavy user of the pull-stream ecosystem for etl processing. This approach feels and reads extremely similar, but with obvious gains to be made by making it natively supported by node. Do these two efforts align (or differ) in any way? Is bob expected to support existing pull-stream patterns so as to benefit the variety of libraries already available on npm? Could it? πŸ˜„

For reference: https://github.com/pull-stream/pull-stream

Progress 30/3/2018 (2018 week 13)

The current status is:

Next up:

  • Updated profiles
  • Proper C++ error handling / passing / translation
  • Evaluation of API, possible comparison / test implementation of a more 'functional' style JS API for discussion reference.

Previous status - 13/3/2018 (2018 week 11): #9

Progress 11/12/2019 - December

The last update was fairly large (23/07/2019 - July, #40), but this one is much smaller.

Notably I'm no longer employed and paid to continue this kind of work around Node and I don't really do this as my hobby or have much default motivation to continue.

I did merge a pull request that adds WritableSource and ReadableSink for streams3 interoperability: 57a78d1

I am supposed to present this initiative's status again at the Montreal collab summit. I am not quite sure what will come out of that and it may end up being more of a post-mortem.


Progress 23/07/2019 - July

Ok so, a lot of stuff has happened since the last update (15/11/2018 - November).

An unfortunately incomplete list of actions since then:

  • Re-presented at the Berlin June 2019 collab summit.
  • Introduced Stream() composition #31
  • start(cb) required for Sinks #32
  • Automated tests! #38
  • AssertionSink & AssertionSource
  • A Verification passthrough for API enforcement #34
  • Large progress on Streams3 Readable/Writable adaptors #35
  • This repo is now an npm module under the name bob-streams. #33
  • Made crc-transform as proof for an internal live coding demo
    • I don't think I can make the recording public but maybe I can do a public / extended version
  • Many other minor fixes
  • Additional prototyping of interaction with async iterators
  • Discussion about adding an offset to pull() #23
  • Discussion about multi-pull flow #30

Proposal: rename `next()`

Due to potential conflicts with iterator#next(), perhaps this part of the protocol should be renamed. Any thoughts?

Maybe something like give()?

Proposal: Status Code Error Enums

For errors, any negative value should be considered an error, with multiple negative values permitted. That would obviously mean not using an enum for status, and instead using constants... e.g.

#define BOB_ERR_{whatever} -{some integer}
#define BOB_ERR_END 0
#define BOB_ERR_CONTINUE 1
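On the JS side, the same convention could be mirrored with plain constants and a sign check rather than an enum. A hedged sketch (these names are hypothetical, not part of the current API):

```javascript
// Hypothetical JS mirror of the proposed constants: any negative
// status is an error, 0 is end, positive means more data follows.
const BOB_STATUS_END = 0
const BOB_STATUS_CONTINUE = 1
const BOB_ERR_UNKNOWN = -1

// The sign check is the whole point: multiple distinct negative
// values can all be treated as errors without enumerating them.
function isError (status) {
  return status < 0
}

function isEnd (status) {
  return status === BOB_STATUS_END
}
```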

potential future move to nodejs org

Not really super into moving stuff into an official repo atm, but once more progress has been made (see #2) we may want to, since forming a "team" may help gain traction.

idk if that would mean actually moving the repo or just its contents

Progress 15/11/2018 - November

The current status is:

Completed sink / source / "duplex" npm modules:

Next up:

  • Address open questions: #17
  • Write a Streams3 adaptor: #18

Previous status - Progress 5/10/2018 - October: #13


Progress 15/1/2018 (2018 week 3)

The current status is:

  • Goals: almost ironed out, see #1
  • initial js-only fs source/sinks: working in this repo
  • js-only zlib transform: almost working, see aec09c2
  • buffering api: no work yet
  • C++ api: currently out of date, see nodejs/node#16414

I'm thinking of taking a swing at doing a C++ fs after the js-only zlib transform is working.

potential stack overflows

Currently, the bob sink calls this.source.pull() and the source then calls this.sink.next().
However, if the source calls back synchronously, the sink calls pull again synchronously, and the stream moves enough data, this could cause a stack overflow.

I worked around this in pull-stream with this (ugly) code:
https://github.com/pull-stream/pull-stream/blob/master/sinks/drain.js#L12-L37

Basically, it checks whether its next was called synchronously (i.e. whether the last call to pull has returned yet) and, if so, falls out to a loop that calls pull again. If pull() returns before next is called, the source is async, so it exits the loop. This is the most complicated part of pull-streams, and bob streams will need something like it too. You could use setImmediate instead, but that's actually more overhead than the loop, and the loop means a completely sync stream can stay completely sync.

push-stream solves this a much simpler way: sinks have a paused property, which the source can check before it calls write. A sync source can just loop until the sink pauses, then wait until resume is called. This means it uses less stack memory.

Hmm, that wouldn't work with bob streams because of the way the sink allocates the buffer...
I'm not really sure about that, though. (There is also the matter of forbidding object streams, but let's not discuss that in this issue; the stack overflow problem is more important.)
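The "fall out to a loop" workaround described above can be sketched for bob's shape of the protocol. Everything below is an illustrative assumption, not existing bob-streams code: if next() fires synchronously (before pull() has returned), set a flag and let a loop issue the next pull instead of recursing deeper.

```javascript
// Hypothetical sink with a sync-reentry trampoline: a completely sync
// source is drained in a loop at constant stack depth, while an async
// source simply restarts the loop from its next() callback.
class LoopingSink {
  constructor (source) {
    this.source = source.bindSink(this)
    this.buffer = Buffer.alloc(64)
    this.looping = false
    this.calledSync = false
    this.ended = false
  }
  start (done) {
    this.done = done
    this._pullLoop()
  }
  _pullLoop () {
    this.looping = true
    do {
      this.calledSync = false
      this.source.pull(null, this.buffer)
      // If calledSync is still false here, next() has not fired yet:
      // the source is async, so exit and wait for next() to restart us.
    } while (this.calledSync && !this.ended)
    this.looping = false
  }
  next (status, error, buffer, bytes) {
    if (status <= 0) {
      this.ended = true
      this.done(error)
      return
    }
    // ...consume buffer.subarray(0, bytes) here...
    if (this.looping) {
      this.calledSync = true // sync callback: let the loop pull again
    } else {
      this._pullLoop() // async callback: restart the pull loop
    }
  }
}
```

With this shape, a fully synchronous source stays synchronous (no setImmediate overhead) and the recursion depth never grows past one pull/next pair.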

Where do object streams fit in?

One thing I like about node.js streams is composability. It's easy to compose pipelines that mix binary and object streams. For example, parsing a csv file, transforming the rows (objects), then writing back to a file. When raw performance isn't important (it often isn't) then node.js streams are pretty great.

Bob is binary-only, so what will replace object streams? If the answer is "async iterators", how does one compose a pipeline as easily as x.pipe(y).pipe(z)? And if each of those is an async iterator, wouldn't that hurt throughput, as you can no longer transform multiple objects in one tick?

Construct flow

Some kind of construct flow would be very useful for a couple of significant reasons:

  • Presently resources must either be opened upon constructor call, or upon "first pull".
    • The former presents async timing issues
    • The latter is complicated and messy
  • It would be useful to pass buffer allocation hints out-of-flow (#52)

The argument list for doing it all inline would get pretty long (and highly variable), even without counting the first point:

pull(status_type, error, buffer, size, offset)

Maybe?

Idk, maybe separating this all out into multiple flows would be better, similar to Streams3 but just sans the dreaded EventEmitter.

  • (sink calls) -> (source calls)
  • construct(...) -> ready(...)
  • pull(...) -> give(...)
  • destroy(error) -> destroy(error)

Very related to nodejs/node#29314
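The call pairs above could be sketched as a lifecycle. This is purely speculative; every name here is hypothetical and not part of the current reference API:

```javascript
// Speculative multi-flow source: construct/ready happens out of the
// pull flow, pull/give carries data, destroy tears down both ends.
class ConstructedSource {
  bindSink (sink) {
    this.sink = sink
    return this
  }
  construct (allocHint) {
    // Open underlying resources here, outside the pull flow, then
    // signal readiness (possibly echoing a buffer allocation hint).
    this.opened = true
    this.sink.ready(null, allocHint)
  }
  pull (error, buffer) {
    // For brevity this source is empty: signal EOF immediately.
    this.sink.give(0, null, buffer, 0)
  }
  destroy (error) {
    this.opened = false
    this.sink.destroy(error)
  }
}
```

Separating construction from the pull flow would address both the async-open timing issue and the argument-count problem in one place.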

Streams3 adaptor / transform (s)

Heya @mcollina & @mafintosh - I think this is the next required step here, since we'd need this kinda thing to be able to do anything in node core anyways.

I've created a repo for this at https://github.com/Fishrock123/bob-streams3 and invited you both as collaborators. It contains some basic bits but no functioning code.

There are a good bit of docs and examples lying around here & in the linked-to module repos, please holler if you need any of my help.

Blog about this?

I probably have enough material, could be good for visibility?

'bob' performance discussion

So, I finally profiled this on my linux box (macOS is useless because of ___channel_get_opt, good luck).

I have documented the results so far in performance.md. I only really tried doing a very large file and have not yet made cases that make many small streams.

The results are looking good. The HDD is the limiting factor of my linux system, and the profiles show file copying has ~7x less CPU time in JS, and zlib transform has ~33% less CPU time in JS. πŸ’₯ (C++ time does not seem significantly affected for either case.)

cc @jasnell, @mcollina

Proposal: add `offset` to `pull()`

From the collab summit, after talking with @jasnell extensively about QUIC, I think it would be useful to have pull support an offset to ask the source to read from (the source may choose to still return whatever data it chooses).

This should allow for ACKs in a network implementation (i.e. if you need to ACK, you just request the last / desired offset again).

It also would support a level of content-addressability. A sink or downstream intermediate could be the one to inform a file source of where to read from.

Relatedly, this may also make disk-based sources require less state...
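A hedged sketch of what an offset-aware source could look like. The pull(error, buffer, offset) signature and the class below are assumptions for illustration, not an agreed-upon API:

```javascript
// Hypothetical in-memory "file" source where pull() accepts an
// optional offset: if given, the source seeks to that byte position
// before reading, enabling re-requests (e.g. for a network ACK).
class MemFileSource {
  constructor (data) {
    this.data = data
    this.pos = 0
  }
  bindSink (sink) {
    this.sink = sink
    return this
  }
  pull (error, buffer, offset) {
    if (typeof offset === 'number') this.pos = offset // re-request from here
    if (this.pos >= this.data.length) {
      return this.sink.next(0, null, buffer, 0) // EOF
    }
    const end = Math.min(this.pos + buffer.length, this.data.length)
    const bytes = this.data.copy(buffer, 0, this.pos, end)
    this.pos += bytes
    this.sink.next(1, null, buffer, bytes)
  }
}
```

Note that the only extra state is the read position, and an offset-bearing pull simply overwrites it, which is the "less state on disk-based sources" angle mentioned above.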

"Extensions"

I have added a section to the readme in this repo about "extensions": API Extension Reference

So far this has seemed the best way to deal with possible optional additional APIs, such as an explicit start, or a stop for handling timeouts.

Organize a meeting

I think it would be useful to have a voice meeting soon to discuss various unresolved conversations.

Things that should be talked about:

  • Project Approach evaluation
  • Goals evaluation
  • API conventionality (how to enforce this?)
  • Naming e.g. #22, "sinks"
  • Multi-pull flow #30
  • C++ binding detection #29

ccing everyone who has shown interest so far:
@jasnell, @Raynos, @mcollina, @mafintosh, @benjamingr

Progress 13/3/2018 (2018 week 11)

The current status is:

Next up:

  • C++ file source or sink endpoint
  • Evaluation of API, possible comparison / test implementation of a more 'functional' style JS API for discussion reference.

Previous status - 16/2/2018 (2018 week 7): #5

Reconsider `bind{Source|Sink}()`

I'm really not convinced the binding apis are very good.

Might be better to have a helper function like streamline(a, b, c) - ideally so that you can do streamline(streamline(a, b, c), d) too.

Gonna try to work on a PoC module this week...

Managed state

Essentially an extension of #53 with a different focus.

There are a number of reasons (#53, #52, #30, etc) for why some kind of state management that does not need to be reimplemented by every stream would be useful.

There are two primary ways to deal with this (that I can think of):

  • Class inheritance, do everything in a base class and then extend that
  • Some kind of managed state object which is persistent and passed in the pull flow
    • @jasnell's idea
    • Could potentially deal with the issues without requiring class inheritance?

Of course, if we inherit from a class, the obvious thing to do would be to integrate the verify transform into said class, so that guarantees are enforced not merely by convention but by code.

Transform this repo into more of a spec

So I've been realizing a couple related things...

A big one is that I think protocol enforcement needs to be done via classes / helpers somehow, but at the same time I don't think it should be 100% necessary. (Heck it isn't even in streams3)

A more formal specification than just the reference would be useful, in the same way it would be useful to have (or had) that kind of thing for streams3.

This should also allow the project, at the spec and "core classes" level, to be moved into the node org without having to drag every submodule in.
