foxglove / mcap

MCAP is a modular, performant, and serialization-agnostic container file format, useful for pub/sub and robotics applications.
Home Page: https://mcap.dev
License: MIT License
It seems like they should either both have a CRC or neither should. My personal leaning is toward only doing CRCs on user data (chunks, attachments) and not on index records.
Having some script that can be run on a .mcap file in each language seems useful. For TypeScript we had this but it was removed in #112.
From team discussion: record_time is ambiguous; log_time was considered more appropriate.
Add test runner for https://github.com/foxglove/mcap/tree/main/tests/conformance/scripts/run-tests/runners which uses child_process.exec() (or similar) to run the golang reader(s).
Alternative names to consider:
The spec says: A channel info record must occur in the file prior to any message that references its Channel ID.
If I write a chunk and know I've written a channel info record before in another chunk - do I need to write the channel info record again?
Should we have a generic-protobuf profile that clarifies how to store protobuf data in an MCAP file?
From #16
Why is there no publish/acquisition time for an attachment? Not all attachment formats will support this metadata natively, so it would be useful to record it here.
My writer is trying to produce a chunk for every second of data. If I've written a few messages and now want to write an attachment record, it seems I have to end my chunk, write out all the message indexes, write the attachment, and then start a new chunk.
The readers and writers should support some kind of Reset functionality, to allow reuse of underlying buffers for compression and chunking across files.
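A minimal sketch of what reset-style buffer reuse could look like in TypeScript (the ChunkBuilder name and internals here are hypothetical, not from any MCAP library):

```typescript
// Hypothetical chunk builder that keeps its scratch buffer across files.
class ChunkBuilder {
  private buffer = new Uint8Array(1024 * 1024); // reusable scratch space
  private offset = 0;

  write(bytes: Uint8Array): void {
    if (this.offset + bytes.length > this.buffer.length) {
      // Grow geometrically; the larger buffer survives reset() below.
      const grown = new Uint8Array(
        Math.max(this.buffer.length * 2, this.offset + bytes.length),
      );
      grown.set(this.buffer.subarray(0, this.offset));
      this.buffer = grown;
    }
    this.buffer.set(bytes, this.offset);
    this.offset += bytes.length;
  }

  finish(): Uint8Array {
    return this.buffer.subarray(0, this.offset);
  }

  // Reset for the next file without discarding the allocated buffer.
  reset(): void {
    this.offset = 0;
  }
}
```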
The compressed size of the chunk.
The uncompressed size of the chunk.
These should say "The {} size of the chunk's records field".
Add test runner for https://github.com/foxglove/mcap/tree/main/tests/conformance/scripts/run-tests/runners which uses child_process.exec() (or similar) to run the C++ reader(s).
This was overlooked in #102. schema_count can be uint16 because schema id 0 is reserved (contingent on #126).
This could be implemented without binary breakage by appending the field to the end of the Statistics record. Or for aesthetic (and fixed-offset) reasons we could put it earlier in the record with a binary breakage.
A metadata_count field should go after the chunk_count field.
A schema version field should be included in channel info. This could be a fixed-width 16 byte field (md5). This field can be used as a cache key by readers, to either avoid parsing (local) or downloading (remote) schemas they have observed in previous requests.
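A sketch of the cache-key usage, assuming the digest is exposed as a hex string (the field and helper names here are hypothetical):

```typescript
// Sketch: using a schema version digest as a cache key, so a reader can
// skip re-parsing (local) or re-downloading (remote) a known schema.
type Schema = { name: string; data: Uint8Array };

const schemaCache = new Map<string, Schema>();

async function resolveSchema(
  schemaVersionHex: string, // hex of the hypothetical 16-byte md5 field
  fetchSchema: () => Promise<Schema>,
): Promise<Schema> {
  const cached = schemaCache.get(schemaVersionHex);
  if (cached) {
    return cached; // seen in a previous request; no fetch or parse needed
  }
  const schema = await fetchSchema();
  schemaCache.set(schemaVersionHex, schema);
  return schema;
}
```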
Description
The spec says the following about string types:
String: a uint32-prefixed UTF8 string
Is the prefix the length of the string (characters) or the number of bytes for the entire UTF8 encoded portion?
Building with go build succeeds. This probably results from the choice of build flags for sqlite.
Add these to glossary
Hi,
I've been reviewing the MCAP spec, and have some feedback which may be of interest.
Attachments
Channel Info
Encryption / Signing
Robustness and Chunks
We specify that the record_time may be relative to an arbitrary epoch. The Unix epoch will be common, but other epochs may also be used. It would be useful for readers to know what epoch the timestamps are relative to in some way; this could inform whether stamps can be displayed as date strings rather than integers.
The count is inferred from the number of records
Which of these is the authoritative one?
Right now, the compression field in Chunk is a variable-length string. This places the actual chunk payload at a variable offset, and requires parsing the compression string to determine the chunk payload length. If compression was instead a fixed-length char[4], we would know the chunk payload size immediately after parsing the record length, and it would avoid an additional allocation for the std::string compression.
uncompressed would be [0x00, 0x00, 0x00, 0x00] or little-endian uint32_t 0
lz4 would be [0x6C, 0x7A, 0x34, 0x00] or little-endian uint32_t 3439212
zstd would be [0x7A, 0x73, 0x74, 0x64] or little-endian uint32_t 1685353338
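These constants can be sanity-checked by interpreting the NUL-padded tag as a little-endian uint32, e.g. in TypeScript:

```typescript
// Compute the little-endian uint32 value for a 4-byte compression tag.
function compressionTag(name: string): number {
  const bytes = new Uint8Array(4); // NUL-padded, e.g. "lz4\0"
  for (let i = 0; i < name.length; i++) {
    bytes[i] = name.charCodeAt(i);
  }
  return new DataView(bytes.buffer).getUint32(0, true /* little-endian */);
}

console.log(compressionTag(""));     // 0          (uncompressed)
console.log(compressionTag("lz4"));  // 3439212
console.log(compressionTag("zstd")); // 1685353338
```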
Let's say I have CBOR messages on channels. I'd like to give each "type" of message a "name" so that Studio can render my data in various panels. I don't want to (or need to) provide a schema since CBOR is self describing. What do I do?
The encoding field within ChannelInfo is a string type with a few examples: ros1, protobuf, cbor. If I select protobuf as the encoding, what should I put for schema? For schema_name? Does selecting an encoding impose any other restrictions?
I'd like to write MCAP reader/writer libraries that interoperate with other tools, but without additional specifications for how to handle encodings I can't be sure what to do.
Some useful encodings to specify:
It's easy to accidentally check in the go conformance check binary since it's not gitignored.
Is it valid to write chunked files without Message Index records following chunks?
Add test runner for https://github.com/foxglove/mcap/tree/main/tests/conformance/scripts/run-tests/runners which uses child_process.exec() (or similar) to run the Python reader(s).
Until we figure this out with more detail I think we should remove it from the docs.
Issue for discussion
Since we have byte-length prefixing for the entire record, having a byte length for array fields feels redundant. On the writer side, I need to serialize the entire array before I know its byte length.
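A sketch of the two-pass requirement this creates for writers (simplified byte-array streams, not a real MCAP writer):

```typescript
// Sketch: writing a byte-length-prefixed array forces two passes —
// serialize the elements first, only then is the prefix value known.
function writeLengthPrefixedArray(
  out: number[], // destination byte stream (simplified)
  elements: Uint8Array[],
): void {
  const scratch: number[] = [];
  for (const el of elements) {
    scratch.push(...el); // pass 1: serialize everything
  }
  // pass 2: the byte length is now known; write prefix, then payload
  const prefix = new Uint8Array(4);
  new DataView(prefix.buffer).setUint32(0, scratch.length, true);
  out.push(...prefix, ...scratch);
}
```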
We have already said we are not making more breaking changes at this time, but if we do, this seems like a nice one to include.
After reading a chunk index record, I know which message index records to read, but I don't know how big my chunk record is, which means I have to first read the size of the record and then read the record itself.
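The extra round trip looks roughly like this in TypeScript (the RemoteFile interface is hypothetical; the 1-byte opcode plus uint64 length framing follows the record serialization in the spec):

```typescript
// Hypothetical remote reader: without a chunk length in the ChunkIndex,
// one read fetches the record's length and a second fetches the record.
interface RemoteFile {
  read(offset: number, length: number): Promise<Uint8Array>;
}

async function readChunkRecord(
  file: RemoteFile,
  chunkOffset: number,
): Promise<Uint8Array> {
  // opcode (1 byte) + record length (uint64 LE) precede the record body
  const header = await file.read(chunkOffset, 9);
  const recordLength = new DataView(header.buffer, header.byteOffset)
    .getBigUint64(1, true);
  // second round trip, now that we know how many bytes to request
  return await file.read(chunkOffset + 9, Number(recordLength));
}
```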
We currently have publish time physically ordered before log time in the message record. This isn't a big deal but introduces a couple inconveniences:
The corrective action would be to swap the ordering in the record.
The specification currently makes a division between "chunked" and "unchunked" files, with each having a mandatory set of fields. Discussions have leaned in the direction of this being too restrictive on at least a couple fronts:
In consideration of these, we are considering making the following changes:
Messages written outside chunks will be readable by a sequential reader, but invisible to a random access reader using the chunk index.
Writers that do not include data in the index section will progressively lose utility from the "fast summarization support". The algorithm for "summary" is roughly,
If the index data section is empty, no statistics will be aggregated. Fallback behavior to a full file read is inadvisable to maintain good support on remote files. Update the explanatory notes section to discuss this a little bit.
When reading the spec, one encounters usage of Array<Tuple<...>> before these terms are defined. We should move the definitions up, add links to the serialization info, or at least mention earlier in the doc that serialization terms are specified later.
With a start and end stamp on chunks, I could cheaply decide whether I need to read or decompress a chunk, based on the time at which I want to read messages. Without the start and end stamp, I have to process a chunk before knowing whether it contains messages for my timestamp.
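A sketch of the filtering this would enable, assuming hypothetical messageStartTime/messageEndTime fields on the chunk index entry:

```typescript
// Sketch: with start/end timestamps on each chunk index entry, a reader
// can skip chunks that cannot contain messages in the requested range.
type ChunkIndexEntry = {
  messageStartTime: bigint; // assumed field name, not from the spec
  messageEndTime: bigint;   // assumed field name, not from the spec
  chunkOffset: bigint;
};

function chunksOverlapping(
  entries: ChunkIndexEntry[],
  start: bigint,
  end: bigint,
): ChunkIndexEntry[] {
  return entries.filter(
    (e) => e.messageEndTime >= start && e.messageStartTime <= end,
  );
}
```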
Rather than requiring the Node.js fs module, allow McapWriter to write to any FileLike-conforming instance. This is a pattern we've used before so readers and writers can function in different I/O environments.
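A minimal sketch of the pattern (the FileLike shape here is illustrative; the real interface may carry more methods, e.g. for position tracking):

```typescript
// Sketch: the writer depends on a minimal writable interface rather
// than the Node.js fs module directly.
interface FileLike {
  write(buffer: Uint8Array): Promise<void>;
}

// In-memory implementation, usable in browsers or tests.
class MemoryWritable implements FileLike {
  private chunks: Uint8Array[] = [];

  async write(buffer: Uint8Array): Promise<void> {
    this.chunks.push(buffer.slice());
  }

  toBytes(): Uint8Array {
    const total = this.chunks.reduce((n, c) => n + c.length, 0);
    const out = new Uint8Array(total);
    let offset = 0;
    for (const c of this.chunks) {
      out.set(c, offset);
      offset += c.length;
    }
    return out;
  }
}
```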
The profile field in the Header says it specifies the "interpretation of channel info user data". Does this mean that a file with protobuf encoding would still be a valid ros1 profile?
It would be useful to expand the scope of what a profile can describe to include any of the open-ended fields (i.e. encoding, schema, schema name, etc).
This would allow creating a ros1 profile that indicates ros1 as the required encoding, .msg text as the required schema format, and the schema naming convention. By specifying all of the requirements within the profile, a library author can definitively say they implement support for a ros1 profile that will interoperate with other tooling producing ros1 profile files.
The total message count is already available once you've parsed the Statistics record by summing up the message counts for each channel. I don't think the read-time speedup of avoiding a reduce() function on a map is worth having another potential source of disagreement in files.
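For reference, the reduce in question is small (the channelMessageCounts field name is assumed here, not taken from the spec):

```typescript
// Summing per-channel counts from a hypothetical Statistics record.
const channelMessageCounts = new Map<number, bigint>([
  [1, 100n],
  [2, 250n],
]);

const totalMessageCount = [...channelMessageCounts.values()].reduce(
  (sum, count) => sum + count,
  0n,
);
console.log(totalMessageCount); // 350n
```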
This would prevent the need to read from the front of the file when doing indexed reads, which would reduce latency for remote reading by some amount (should be tested).
This doesn't have to break backward compatibility, but would require either a new record type, or else overloading the Header. Strictly speaking, we only need the profile, not the library, but perhaps the library would be useful too.
The spec says:
String: a uint32-prefixed UTF8 string
KeyValues<T1, T2>: A uint32 length-prefixed association of key-value pairs, serialized as
For String, is the prefix the length of the string or the number of bytes?
For KeyValues, is it the number of pairs or the number of bytes of the remaining serialized portion?
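If the prefix is a byte length, it diverges from the character count as soon as the string contains multi-byte UTF-8 characters. A quick TypeScript illustration:

```typescript
// Character count and UTF-8 byte count diverge for non-ASCII strings.
const s = "héllo";
console.log(s.length);                           // 5 characters
console.log(new TextEncoder().encode(s).length); // 6 UTF-8 bytes ("é" is 2 bytes)
```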
Here is a simple chunked file layout:
Chunked
-------
Magic
Header
Chunk
ChannelInfo
Message
MessageIndex
[index_offset]
ChannelInfo
ChunkIndex
Statistics
Footer
Magic
And the same file unchunked:
Unchunked
---------
Magic
Header
ChannelInfo
Message
[index_offset]
ChannelInfo
Statistics
Footer
Magic
And here's an empty file:
Empty (Chunked or Unchunked)
----------------------------
Magic
Header
[index_offset]
Statistics
Footer
Magic
How are readers supposed to distinguish between unchunked files and empty files?
https://www.docker.com/blog/docker-github-actions/
https://github.com/marketplace/actions/docker-layer-caching (popular, but not official?)