foxglove / mcap

MCAP is a modular, performant, and serialization-agnostic container file format, useful for pub/sub and robotics applications.
Home Page: https://mcap.dev
License: MIT License
It seems like they should either both have a CRC or neither should. My personal leaning is toward only doing CRCs on user data (chunks, attachments) and not on index records.
Having some script that can be run on a .mcap file in each language seems useful. For TypeScript we had this but it was removed in #112.
From team discussion: record_time is ambiguous; log_time was considered more appropriate.
Add test runner for https://github.com/foxglove/mcap/tree/main/tests/conformance/scripts/run-tests/runners which uses child_process.exec() (or similar) to run the golang reader(s).
Alternative names to consider:
The spec says: A channel info record must occur in the file prior to any message that references its Channel ID.
If I write a chunk and know I've written a channel info record before in another chunk - do I need to write the channel info record again?
Should we have a generic-protobuf profile that clarifies how to store protobuf data in an MCAP file?
From #16
Why is there no publish/acquisition time for an attachment? Not all attachment formats will support this metadata natively, so it would be useful to record it here.
My writer is trying to produce a chunk for every second of data. If I've written a few messages and now want to write an attachment record, it seems I have to end my chunk, write out all the message indexes, write the attachment, and then start a new chunk.
The readers and writers should support some kind of Reset functionality, to allow reuse of underlying buffers for compression and chunking across files.
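A minimal sketch of what reset-style buffer reuse could look like in TypeScript (the ChunkBuilder name and internals here are hypothetical, not from any MCAP library):

```typescript
// Hypothetical chunk builder that keeps its scratch buffer across files.
class ChunkBuilder {
  private buffer = new Uint8Array(1024 * 1024); // reusable scratch space
  private offset = 0;

  write(bytes: Uint8Array): void {
    if (this.offset + bytes.length > this.buffer.length) {
      // Grow geometrically; the larger buffer survives reset() below.
      const grown = new Uint8Array(
        Math.max(this.buffer.length * 2, this.offset + bytes.length),
      );
      grown.set(this.buffer.subarray(0, this.offset));
      this.buffer = grown;
    }
    this.buffer.set(bytes, this.offset);
    this.offset += bytes.length;
  }

  finish(): Uint8Array {
    return this.buffer.subarray(0, this.offset);
  }

  // Reset for the next file without discarding the allocated buffer.
  reset(): void {
    this.offset = 0;
  }
}
```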
The compressed size of the chunk.
The uncompressed size of the chunk.
These should say "The {} size of the chunk's records field".
Add test runner for https://github.com/foxglove/mcap/tree/main/tests/conformance/scripts/run-tests/runners which uses child_process.exec() (or similar) to run the C++ reader(s).
This was overlooked in #102. schema_count can be uint16 because schema id 0 is reserved (contingent on #126).
This could be implemented without binary breakage by appending the field to the end of the Statistics record. Or for aesthetic (and fixed-offset) reasons we could put it earlier in the record with a binary breakage.
A metadata_count field should go after the chunk_count field.
A schema version field should be included in channel info. This could be a fixed-width 16 byte field (md5). This field can be used as a cache key by readers, to either avoid parsing (local) or downloading (remote) schemas they have observed in previous requests.
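A sketch of the cache-key usage, assuming the digest is exposed as a hex string (the field and helper names here are hypothetical):

```typescript
// Sketch: using a schema version digest as a cache key, so a reader can
// skip re-parsing (local) or re-downloading (remote) a known schema.
type Schema = { name: string; data: Uint8Array };

const schemaCache = new Map<string, Schema>();

async function resolveSchema(
  schemaVersionHex: string, // hex of the hypothetical 16-byte md5 field
  fetchSchema: () => Promise<Schema>,
): Promise<Schema> {
  const cached = schemaCache.get(schemaVersionHex);
  if (cached) {
    return cached; // seen in a previous request; no fetch or parse needed
  }
  const schema = await fetchSchema();
  schemaCache.set(schemaVersionHex, schema);
  return schema;
}
```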
Description
The spec says the following about string types:
String: a uint32-prefixed UTF8 string
Is the prefix the length of the string (characters) or the number of bytes for the entire UTF8 encoded portion?
Building with go build succeeds. This probably results from the choice of build flags for sqlite.
Add these to glossary
Hi,
I've been reviewing the MCAP spec, and have some feedback which may be of interest.
Attachments
Channel Info
Encryption / Signing
Robustness and Chunks
We specify that the record_time may be relative to an arbitrary epoch. The Unix epoch will be common, but other epochs may also be used. It would be useful for readers to know what epoch the timestamps are relative to in some way; this could inform whether stamps can be displayed as date strings rather than integers.
The count is inferred from the number of records
Which of these is the authoritative one?
Right now, the compression field in Chunk is a variable-length string. This places the actual chunk payload at a variable offset, and requires parsing the compression string to determine the chunk payload length. If compression was instead a fixed-length char[4], we would know the chunk payload size immediately after parsing the record length, and it would avoid an additional allocation for the std::string compression.
uncompressed would be [0x00, 0x00, 0x00, 0x00] or little-endian uint32_t 0
lz4 would be [0x6C, 0x7A, 0x34, 0x00] or little-endian uint32_t 3439212
zstd would be [0x7A, 0x73, 0x74, 0x64] or little-endian uint32_t 1685353338
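These constants can be sanity-checked by interpreting the NUL-padded tag as a little-endian uint32, e.g. in TypeScript:

```typescript
// Compute the little-endian uint32 value for a 4-byte compression tag.
function compressionTag(name: string): number {
  const bytes = new Uint8Array(4); // NUL-padded, e.g. "lz4\0"
  for (let i = 0; i < name.length; i++) {
    bytes[i] = name.charCodeAt(i);
  }
  return new DataView(bytes.buffer).getUint32(0, true /* little-endian */);
}

console.log(compressionTag(""));     // 0          (uncompressed)
console.log(compressionTag("lz4"));  // 3439212
console.log(compressionTag("zstd")); // 1685353338
```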
Let's say I have CBOR messages on channels. I'd like to give each "type" of message a "name" so that Studio can render my data in various panels. I don't want to (or need to) provide a schema since CBOR is self describing. What do I do?
The encoding field within ChannelInfo is a string type with a few examples: ros1, protobuf, cbor. If I select protobuf as the encoding, what should I put for schema? For schema_name? Does selecting an encoding impose any other restrictions?
I'd like to write MCAP reader/writer libraries that interoperate with other tools, but without additional specifications for how to handle encodings I can't be sure what to do.
Some useful encodings to specify:
It's easy to accidentally check in the go conformance check binary since it's not gitignored.
Is it valid to write chunked files without Message Index records following chunks?
Add test runner for https://github.com/foxglove/mcap/tree/main/tests/conformance/scripts/run-tests/runners which uses child_process.exec() (or similar) to run the Python reader(s).
Until we figure this out with more detail I think we should remove it from the docs.
Issue for discussion
Since we have byte-length prefixing for the entire record, having a byte length for array fields feels redundant. On the writer side, I need to serialize the entire array before I know its byte length.
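A sketch of the two-pass requirement this creates for writers (simplified byte-array streams, not a real MCAP writer):

```typescript
// Sketch: writing a byte-length-prefixed array forces two passes —
// serialize the elements first, only then is the prefix value known.
function writeLengthPrefixedArray(
  out: number[], // destination byte stream (simplified)
  elements: Uint8Array[],
): void {
  const scratch: number[] = [];
  for (const el of elements) {
    scratch.push(...el); // pass 1: serialize everything
  }
  // pass 2: the byte length is now known; write prefix, then payload
  const prefix = new Uint8Array(4);
  new DataView(prefix.buffer).setUint32(0, scratch.length, true);
  out.push(...prefix, ...scratch);
}
```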
We have already said we are not making more breaking changes at this time, but if we do, this seems like a nice one to include.
After reading a chunk index record, I know which message index records to read, but I don't know how big my chunk record is, which means I have to first read the size of the record and then read the record itself.
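The extra round trip looks roughly like this in TypeScript (the RemoteFile interface is hypothetical; the 1-byte opcode plus uint64 length framing follows the record serialization in the spec):

```typescript
// Hypothetical remote reader: without a chunk length in the ChunkIndex,
// one read fetches the record's length and a second fetches the record.
interface RemoteFile {
  read(offset: number, length: number): Promise<Uint8Array>;
}

async function readChunkRecord(
  file: RemoteFile,
  chunkOffset: number,
): Promise<Uint8Array> {
  // opcode (1 byte) + record length (uint64 LE) precede the record body
  const header = await file.read(chunkOffset, 9);
  const recordLength = new DataView(header.buffer, header.byteOffset)
    .getBigUint64(1, true);
  // second round trip, now that we know how many bytes to request
  return await file.read(chunkOffset + 9, Number(recordLength));
}
```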
We currently have publish time physically ordered before log time in the message record. This isn't a big deal but introduces a couple inconveniences:
The corrective action would be to swap the ordering in the record.
The specification currently makes a division between "chunked" and "unchunked" files, with each having a mandatory set of fields. Discussions have leaned in the direction of this being too restrictive on at least a couple fronts:
In consideration of these, we are considering making the following changes:
Messages written outside chunks will be readable by a sequential reader, but invisible to a random access reader using the chunk index.
Writers that do not include data in the index section will progressively lose utility from the "fast summarization support". The algorithm for "summary" is roughly,
If the index data section is empty, no statistics will be aggregated. Fallback behavior to a full file read is inadvisable to maintain good support on remote files. Update the explanatory notes section to discuss this a little bit.
When reading the spec, one encounters usage of Array<Tuple<...>> before these terms are defined. We should move the definitions up, add links to the serialization info, or at least mention earlier in the doc that serialization terms are specified later.
With a start and end stamp on chunks, I could cheaply decide whether I need to read or decompress a chunk, based on the time at which I want to read messages. Without the start and end stamp, I have to process a chunk before knowing whether it contains messages for my timestamp.
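A sketch of the filtering this would enable, assuming hypothetical messageStartTime/messageEndTime fields on the chunk index entry:

```typescript
// Sketch: with start/end timestamps on each chunk index entry, a reader
// can skip chunks that cannot contain messages in the requested range.
type ChunkIndexEntry = {
  messageStartTime: bigint; // assumed field name, not from the spec
  messageEndTime: bigint;   // assumed field name, not from the spec
  chunkOffset: bigint;
};

function chunksOverlapping(
  entries: ChunkIndexEntry[],
  start: bigint,
  end: bigint,
): ChunkIndexEntry[] {
  return entries.filter(
    (e) => e.messageEndTime >= start && e.messageStartTime <= end,
  );
}
```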
Rather than requiring the Node.js fs module, allow McapWriter to write to any FileLike-conforming instance. This is a pattern we've used before so readers and writers can function in different I/O environments.
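A minimal sketch of the pattern (the FileLike shape here is illustrative; the real interface may carry more methods, e.g. for position tracking):

```typescript
// Sketch: the writer depends on a minimal writable interface rather
// than the Node.js fs module directly.
interface FileLike {
  write(buffer: Uint8Array): Promise<void>;
}

// In-memory implementation, usable in browsers or tests.
class MemoryWritable implements FileLike {
  private chunks: Uint8Array[] = [];

  async write(buffer: Uint8Array): Promise<void> {
    this.chunks.push(buffer.slice());
  }

  toBytes(): Uint8Array {
    const total = this.chunks.reduce((n, c) => n + c.length, 0);
    const out = new Uint8Array(total);
    let offset = 0;
    for (const c of this.chunks) {
      out.set(c, offset);
      offset += c.length;
    }
    return out;
  }
}
```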
The profile field in the Header says it specifies the "interpretation of channel info user data". Does this mean that a file with protobuf encoding would still be a valid ros1 profile?
It would be useful to expand the scope of what a profile can describe to include any of the open-ended fields (i.e. encoding, schema, schema name, etc).
This would allow creating a ros1 profile that indicates ros1 as the required encoding, .msg text as the required schema format, and the schema naming convention. By specifying all of the requirements within the profile, a library author can definitively say they implement support for a ros1 profile that will interoperate with other tooling producing ros1 profile files.
The total message count is already available once you've parsed the Statistics record by summing up the message counts for each channel. I don't think the read-time speedup of avoiding a reduce() function on a map is worth having another potential source of disagreement in files.
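For reference, the reduce in question is small (the channelMessageCounts field name is assumed here, not taken from the spec):

```typescript
// Summing per-channel counts from a hypothetical Statistics record.
const channelMessageCounts = new Map<number, bigint>([
  [1, 100n],
  [2, 250n],
]);

const totalMessageCount = [...channelMessageCounts.values()].reduce(
  (sum, count) => sum + count,
  0n,
);
console.log(totalMessageCount); // 350n
```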
This would prevent the need to read from the front of the file when doing indexed reads, which would reduce latency for remote reading by some amount (should be tested).
This doesn't have to break backward compatibility, but would require either a new record type, or else overloading the Header. Strictly speaking, we only need the profile, not the library, but perhaps the library would be useful too.
The spec says:
String: a uint32-prefixed UTF8 string
KeyValues<T1, T2>: A uint32 length-prefixed association of key-value pairs, serialized as
For String, is the prefix the length of the string or the number of bytes?
For KeyValues, is it the number of pairs or the number of bytes of the remaining serialized portion?
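If the prefix is a byte length, it diverges from the character count as soon as the string contains multi-byte UTF-8 characters. A quick TypeScript illustration:

```typescript
// Character count and UTF-8 byte count diverge for non-ASCII strings.
const s = "héllo";
console.log(s.length);                           // 5 characters
console.log(new TextEncoder().encode(s).length); // 6 UTF-8 bytes ("é" is 2 bytes)
```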
Here is a simple chunked file layout:
Chunked
-------
Magic
Header
Chunk
ChannelInfo
Message
MessageIndex
[index_offset]
ChannelInfo
ChunkIndex
Statistics
Footer
Magic
And the same file unchunked:
Unchunked
---------
Magic
Header
ChannelInfo
Message
[index_offset]
ChannelInfo
Statistics
Footer
Magic
And here's an empty file:
Empty (Chunked or Unchunked)
----------------------------
Magic
Header
[index_offset]
Statistics
Footer
Magic
How are readers supposed to distinguish between unchunked files and empty files?
https://www.docker.com/blog/docker-github-actions/
https://github.com/marketplace/actions/docker-layer-caching (popular, but not official?)