Description of the issue
I'd reckon that not everyone would use every stop database, and some of them may never be loaded.
I would still like everything to be offline, so I still need to ship these things.
I think there is some value in compressing these.
I've been running experiments on mdst-compression-experiment, after adding a new compressed field to mdst-license-notice. In mdst-license-notice, I moved the attribution data into a field which is gzip-compressed, and I also lazy-load the station index.
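As a minimal sketch of the gzip-compressed-field idea (the field name and attribution text here are made up, not the actual mdst-license-notice schema): the text is compressed once at build time, and only decompressed when something actually reads it.

```python
import gzip

# Hypothetical attribution text; repetitive license boilerplate
# compresses well with gzip.
attribution = "Station data (c) Example Operator, licensed CC BY-SA 4.0.\n" * 20

# At database build time: store the compressed bytes in the field.
compressed_field = gzip.compress(attribution.encode("utf-8"))

# At read time (only when the user opens the about/license screen):
restored = gzip.decompress(compressed_field).decode("utf-8")

assert restored == attribution
print(len(attribution.encode("utf-8")), "->", len(compressed_field))
```

Since the attribution text is only shown on demand, the decompression cost is paid rarely, while the on-disk saving applies to every shipped database.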
It's probably better to compress entire sections of the file. MdST requires you to load the header block, then the entire index block if you want to reference stations by ID.
Looking at a big database (suica_rail.mdst):
| Mode | header size | stations size | index size |
| --- | --- | --- | --- |
| uncompressed | 18710 | 254315 | 56307 |
| zlib (compressing each station record) | 9888 | 308086 | 24939 |
| brotli (compressing each station record) | 8576 | 277950 | 15694 |
| zlib (solid station records) | 9888 | 138167 | 24939 |
| brotli (solid station records) | 8576 | 106444 | 15694 |
"compressing each station record" means a separate compressed block was emitted for each record in the file. "solid station records" means all station records were concatenated, and then the entire block was compressed at once.
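The gap between the two modes can be demonstrated with synthetic data (these are not the real MdST records, just illustrative ones): many small, similar records compress far better as one solid block, because a per-record block pays header overhead each time and cannot share context between records.

```python
import zlib

# Synthetic stand-ins for station records: short, repetitive strings.
records = [
    f"station-{i:05d},line-{i % 7},region-{i % 13}".encode("utf-8")
    for i in range(1000)
]

# Mode 1: compress each record into its own block.
per_record = sum(len(zlib.compress(r)) for r in records)

# Mode 2: "solid" - concatenate everything, then compress once.
solid = len(zlib.compress(b"".join(records)))

print("per-record:", per_record, "solid:", solid)
```

With data like this, the per-record total can even exceed the uncompressed size (as the zlib row in the table above shows), while the solid block shrinks substantially.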
It should be possible to change the format to allow seeking within a solid station record list. It does mean a bunch of data will need to be decompressed for every lookup, unless the solid blocks can be split into smaller chunks.
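One way to split the solid blocks, sketched below under assumed details that are not part of the current MdST format (chunk size, and storing per-record lengths alongside each chunk): group a fixed number of records per compressed chunk, so a lookup only decompresses one chunk rather than the whole list.

```python
import zlib

RECORDS_PER_CHUNK = 64  # hypothetical tuning knob

def build_chunks(records):
    """Compress records in fixed-size groups, keeping each record's
    length so individual records can be sliced out after decompression."""
    chunks = []
    for i in range(0, len(records), RECORDS_PER_CHUNK):
        group = records[i:i + RECORDS_PER_CHUNK]
        lengths = [len(r) for r in group]
        chunks.append((lengths, zlib.compress(b"".join(group))))
    return chunks

def read_record(chunks, index):
    """Decompress only the chunk containing the requested record."""
    lengths, blob = chunks[index // RECORDS_PER_CHUNK]
    data = zlib.decompress(blob)
    pos = index % RECORDS_PER_CHUNK
    offset = sum(lengths[:pos])
    return data[offset:offset + lengths[pos]]

records = [f"station-{i}".encode("utf-8") for i in range(200)]
chunks = build_chunks(records)
assert read_record(chunks, 123) == b"station-123"
```

The chunk size trades compression ratio against lookup cost: larger chunks get closer to solid-block ratios, smaller chunks decompress less per lookup.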
Even as a minimal change, just compressing the header and the index is generally a pretty good win.