kolmodin / binary
Efficient, pure binary serialisation using ByteStrings in Haskell.
License: Other
The README.md file has been converted to Markdown and been slightly extended.
Here are some further changes I'd like to see:
Not sure if it's still portable to Hugs. We claim it is, but I doubt it.
The section about "Using binary". I think this section is not too helpful. It could present the alternatives in a more structured way, with links to the relevant haddocks.
Deriving binary instance section. This is outdated and misleading. We can now generate instances with GHC's generics. A small example of how to do so, or a link to the haddocks, would be more suitable.
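A minimal sketch of the kind of example the README could carry for Generic-based instances. The Trade record and the roundTrip helper are hypothetical names used only for illustration; the sketch assumes a binary release with GHC.Generics support.

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Binary (Binary, decode, encode)
import Data.Word (Word16, Word32)
import GHC.Generics (Generic)

-- Hypothetical record, used only for illustration.
data Trade = Trade
  { timestamp :: Word32
  , price     :: Word32
  , quantity  :: Word16
  } deriving (Show, Eq, Generic)

-- An empty instance body picks up the Generic-based default put/get.
instance Binary Trade

-- Encode a value and decode it again.
roundTrip :: Trade -> Trade
roundTrip = decode . encode
```

The only per-type boilerplate left is the deriving clause and the empty instance declaration.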
This quickly gets expensive, and can build up a huge stack, if called multiple times, like so:
countTrades :: BL.ByteString -> Int
countTrades input = stepper (0, input) where
  stepper (!count, !buffer)
    | BL.null buffer = count
    | otherwise =
        let (trade, rest, _) = runGetState getTrade buffer 0
        in stepper (count+1, rest)
The Natural type has been officially added to base-4.8.0.0.
We use Int in a lot of places, but the Haskell report only guarantees the range [-2^29, 2^29 - 1] (30 bits including the sign), which limits offsets to 512MB.
We should consider using Int64 instead of relying on Int being wider than that.
Got a request to re-implement lookAheadE.
Apparently it can be used to implement lookAheadM in monad transformer stacks, like in the Hackage package bytes.
binary contains an implementation of a builder just like the one found in blaze-builder. IIRC, this implementation was actually the original source for blaze-builder.
Would it make any sense to swap out the locally-maintained Builder for that one? If not, why not?
Taken from cereal documentation:
isolate :: Int -> Get a -> Get a
Isolate an action to operating within a fixed block of bytes. The action is required to consume all the bytes that it is isolated to.
It's quite a useful function, since the pattern "N, the number of bytes in a chunk, followed by said chunk" is quite common in binary formats.
I propose to add two variants of isolate. One should have the same semantics as cereal's and require the parser to consume all input. The second should only ensure that the parser consumes no more than N bytes, with the rest being discarded. It's useful when you do not want to fully decode the block data.
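A rough sketch of the second, lenient variant, built only from existing Data.Binary.Get functions. The name isolateLenient is hypothetical and not part of binary's API. It slices off the block first and runs the inner parser on that slice, so any unconsumed bytes in the block are dropped, and offsets reported by the inner parser are relative to the start of the block.

```haskell
import Data.Binary.Get (Get, getLazyByteString, getWord8, runGet, runGetOrFail)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word8)

-- Hypothetical name; not part of binary's API.
-- Run the inner parser on exactly the next n bytes and discard whatever
-- part of the block it leaves unconsumed.
isolateLenient :: Int64 -> Get a -> Get a
isolateLenient n inner = do
  block <- getLazyByteString n
  case runGetOrFail inner block of
    Left (_, _, err) -> fail err
    Right (_, _, x)  -> return x

-- Reads one byte from a 4-byte block, skips the other three,
-- then reads the byte that follows the block.
example :: (Word8, Word8)
example =
  runGet ((,) <$> isolateLenient 4 getWord8 <*> getWord8)
         (BL.pack [1, 2, 3, 4, 5])
```

Later releases of binary export an isolate from Data.Binary.Get with the strict, consume-everything semantics; the sketch above only covers the lenient case.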
Is there any way to get the same functionality in binary 0.6?
Citing from http://hydra.cryp.to/build/852686/nixlog/1/raw:
tests/QC.hs:364:28:
Ambiguous occurrence ‘arbitrarySizedNatural’
It could refer to either ‘Test.QuickCheck.arbitrarySizedNatural’,
imported from ‘Test.QuickCheck’ at tests/QC.hs:21:1-32
(and originally defined in ‘Test.QuickCheck.Arbitrary’)
or ‘Arbitrary.arbitrarySizedNatural’,
imported from ‘Arbitrary’ at tests/QC.hs:26:56-76
(and originally defined at tests/Arbitrary.hs:70:1-21)
In the ELF format, there are absolute offsets pointing to positions after the offset.
Thus, calculating this offset requires knowing the length of several Builders, including the length of the Builder that will contain the offset itself.
With the current API and a naive approach that would force the Builders into lazy ByteStrings, you'd end up with <<loop>>.
One simple solution, suggested by Joe Hendrix, is to wrap the Builder in a type that contains the Builder and its length.
data SizedBuilder = SB Int64 Builder
length (SB l _) = l
… + additional methods exported by Data.Binary.Builder
This API could be exposed from Data.Binary.Builder.Sized.
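A minimal sketch of how such a wrapper might look. Everything here other than the Data.Binary.Builder functions is hypothetical (size, word8, word32le, withLengthPrefix), and the sketch assumes a toolchain where Builder has a Semigroup instance.

```haskell
import Data.Binary.Builder (Builder)
import qualified Data.Binary.Builder as B
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word32, Word8)

-- A Builder paired with the number of bytes it will emit.
data SizedBuilder = SB !Int64 Builder

instance Semigroup SizedBuilder where
  SB m a <> SB n b = SB (m + n) (a <> b)

instance Monoid SizedBuilder where
  mempty = SB 0 mempty

-- Length in bytes, available without running the Builder.
size :: SizedBuilder -> Int64
size (SB n _) = n

toLazyByteString :: SizedBuilder -> BL.ByteString
toLazyByteString (SB _ b) = B.toLazyByteString b

word8 :: Word8 -> SizedBuilder
word8 = SB 1 . B.singleton

word32le :: Word32 -> SizedBuilder
word32le = SB 4 . B.putWord32le

-- Prefix a body with a field holding the total size, including the
-- 4-byte field itself. No ByteString is forced to compute it.
withLengthPrefix :: SizedBuilder -> SizedBuilder
withLengthPrefix body = word32le (fromIntegral (4 + size body)) <> body
```

Because the size is tracked alongside the Builder, an offset that depends on the length of the data around it can be computed before anything is serialised.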
Currently Data.Binary.decode and Data.Binary.decodeFile rely on error to fail. They really ought to return Maybe or Either.
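For reference, a small sketch of a total wrapper, assuming a binary release that provides decodeOrFail. The helper name decodeEither is hypothetical.

```haskell
import Data.Binary (Binary, decodeOrFail, encode)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)

-- Hypothetical helper: decode without calling error on bad input.
decodeEither :: Binary a => BL.ByteString -> Either String a
decodeEither input =
  case decodeOrFail input of
    Left (_, offset, err) -> Left (err ++ " (at byte " ++ show offset ++ ")")
    Right (_, _, x)       -> Right x

-- Decoding a well-formed value succeeds; empty input yields a Left.
good, bad :: Either String Word32
good = decodeEither (encode (42 :: Word32))
bad  = decodeEither BL.empty
```

Unlike decode, this ignores any input left over after the value; a stricter variant could reject trailing bytes.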
Basically something like delimit :: Get a -> Get a that resets the bytesRead counter on the inner Get. This would primarily be useful in conjunction with the isolate combinator (which can then act undelimited by default, and we can modify it to be delimited if needed) or the alignment combinator I propose in #50.
I've often wanted a combinator called sized :: Get a -> Get (Int, a)
that will behave the same as its input, except will also tell you how many bytes it consumed.
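A possible sketch in terms of bytesRead. Note one deliberate difference from the signature above: bytesRead returns an Int64, so this version reports the count as Int64 rather than Int.

```haskell
import Data.Binary.Get (Get, bytesRead, getWord16be, getWord8, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word16)

-- Run a decoder and also report how many bytes it consumed.
sized :: Get a -> Get (Int64, a)
sized g = do
  start <- bytesRead
  x     <- g
  end   <- bytesRead
  return (end - start, x)

-- Skips one byte, then measures a two-byte big-endian read.
example :: (Int64, Word16)
example = runGet (getWord8 >> sized getWord16be) (BL.pack [0, 1, 2, 3])
```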
The safecopy package uses this to improve error messages, and the plan is to port safecopy/acid-state to binary.
src/Data/Binary/Put.hs:61:1:
bytestring-0.9.2.0:Data.ByteString can't be safely imported! The module itself isn't safe.
xcabal: Error: some packages failed to install:
binary-0.7.6.0 failed during the building phase. The exception was:
ExitFailure 1
Hi,
I can’t actually find anything in the documentation that discusses the binary compatibility between binary versions, i.e. whether code that successfully parses something with binary 0.5 will also do so with 0.7, but I was optimistically assuming so.
Anyways, it does not seem to be the case. Compare https://s3.amazonaws.com/archive.travis-ci.org/jobs/17619621/log.txt with https://api.travis-ci.org/jobs/17619789/log.txt?deansi=true – identical setups, besides the version of binary, and the tests show that Data.Binary.Get behaves differently.
Is there any documentation of incompatibilities between binary versions? I can’t even find a changelog.
On bytestring < 0.10.0.0 there were no NFData declarations, and we instantiated our own.
Now bytestring >= 0.10.0.0 is shipped with the Haskell Platform - our instances are no longer needed for newer bytestrings.
We have some broken code that checks for GHC version, but it should actually be checking the ByteString version.
This wiki might be a good starting point for what people are looking for:
http://www.haskell.org/haskellwiki/DealingWithBinaryData
When I run the benchmarks on the master branch I get the following error:
Binary (de)serialisation benchmarks:
100MB of Word8 in chunks of 16 ( Host endian): bench: too few bytes. Failed reading at byte position 6553601
This is the command I ran
make -C benchmarks/ clean bench run-bench
I haven't touched the Data.Binary.Get code so I'm not sure what's wrong.
The IEEE "not-a-number" (NaN) value is not encoded properly, since encode uses decodeFloat, which is unspecified for NaN (cf. Prelude).
Examples:
(0 / 0 :: Double) = NaN
(decode (encode (0 / 0 :: Double)) :: Double) = -Infinity
(log (-1) :: Double) = NaN
(decode (encode (log (-1) :: Double)) :: Double) = Infinity
(0 / 0 :: Float) = NaN
(decode (encode (0 / 0 :: Float)) :: Float) = -Infinity
Data.Binary.Put and Data.Binary.Builder provide a variety of Puts for various width Word types. I don't see any reason why they shouldn't include similar functionality for the signed types from Data.Int.
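Until such functions are provided, the signed variants can be approximated on top of the Word ones, because fromIntegral between an IntN and the WordN of the same width preserves the bit pattern. A sketch with hypothetical, primed names chosen so they cannot clash with anything the library exports:

```haskell
import Data.Binary.Get (Get, getWord16be, getWord32le, runGet)
import Data.Binary.Put (Put, putWord16be, putWord32le, runPut)
import Data.Int (Int16, Int32)

-- Hypothetical helpers: signed puts and gets expressed via the Word ones.
putInt16be' :: Int16 -> Put
putInt16be' = putWord16be . fromIntegral

getInt16be' :: Get Int16
getInt16be' = fromIntegral <$> getWord16be

putInt32le' :: Int32 -> Put
putInt32le' = putWord32le . fromIntegral

getInt32le' :: Get Int32
getInt32le' = fromIntegral <$> getWord32le

-- Round trips for a negative value in each width.
roundTrip16 :: Int16 -> Int16
roundTrip16 = runGet getInt16be' . runPut . putInt16be'

roundTrip32 :: Int32 -> Int32
roundTrip32 = runGet getInt32le' . runPut . putInt32le'
```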
It would be useful if this would be exported, as I can then take apart the Put monad and reassemble it, without incurring the cost of running the Builder itself.
I prefer not to ask for convenience functions willy-nilly, but I think refactoring decodeFileOrFail with this signature would be useful:
decodeGetFileOrFail :: Get a -> FilePath -> IO (Either (ByteOffset, String) a)
The reason is that there is a nontrivial chunk of code for running the incremental parser that preferably we'd avoid duplicating. I'll submit a PR soon.
From @simonpj:
In the binary library I’m seeing lots of these warnings:
libraries/binary/src/Data/Binary/Get.hs:420:1: warning:
Rule "getWord16le/readN" may never fire
because ‘getWord16le’ might inline first
Probable fix: add an INLINE[n] or NOINLINE[n] pragma on this function
libraries/binary/src/Data/Binary/Builder/Base.hs:510:1: warning:
Rule "flush/flush" may never fire
because ‘flush’ might inline first
Probable fix: add an INLINE[n] or NOINLINE[n] pragma on this function
The warnings look right to me: currently everything is very fragile and may not work as you intend.
The tutorial in Data.Binary.Get includes the following example:
example2 :: BL.ByteString -> [Trade]
example2 input
  | BL.null input = []
  | otherwise =
      let (trade, rest, _) = runGetState getTrade input 0
      in trade : example2 rest
Unfortunately, runGetState is marked as deprecated, with a suggestion to use runGetIncremental instead. It'd be nice if the tutorial examples showed the recommended usage of the library.
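One way the example might be rewritten without runGetState is sketched below. It uses runGetOrFail rather than runGetIncremental, which is a substitution on my part and not the deprecation notice's suggestion. The Trade record and getTrade are stand-ins for the tutorial's definitions, with a field layout that is an assumption. Unlike example2, this version is not lazy in its result: the whole list is checked before anything is returned.

```haskell
import Data.Binary.Get (Get, getWord16le, getWord32le, runGetOrFail)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word16, Word32)

-- Stand-in for the tutorial's Trade type (layout assumed).
data Trade = Trade
  { timestamp :: !Word32
  , price     :: !Word32
  , qty       :: !Word16
  } deriving (Show, Eq)

getTrade :: Get Trade
getTrade = Trade <$> getWord32le <*> getWord32le <*> getWord16le

-- Decode trades until the input is exhausted. A truncated or malformed
-- record turns the whole result into a Left with the decoder's message.
decodeTrades :: BL.ByteString -> Either String [Trade]
decodeTrades input
  | BL.null input = Right []
  | otherwise =
      case runGetOrFail getTrade input of
        Left (_, _, err)       -> Left err
        Right (rest, _, trade) -> (trade :) <$> decodeTrades rest
```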
Writing decoders in applicative style gives more efficient code, since boundary checks (check whether we have enough remaining input) can be merged.
Update the documentation to reflect this, README.md and haddock.
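For the documentation, a side-by-side sketch of the two styles could look like this. The Header type is hypothetical. Both decoders return the same value; whether the boundary checks are actually merged in the applicative version depends on the library version and optimisation settings, which this sketch does not demonstrate.

```haskell
import Data.Binary.Get (Get, getWord16be, getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word16, Word32)

-- Hypothetical fixed-size header: magic, version, flags.
data Header = Header Word32 Word16 Word16
  deriving (Show, Eq)

-- Monadic style: each step is sequenced through >>=.
getHeaderM :: Get Header
getHeaderM = do
  magic   <- getWord32be
  version <- getWord16be
  flags   <- getWord16be
  return (Header magic version flags)

-- Applicative style: the shape of the decoder is static, which is what
-- allows the input-length checks to be combined.
getHeaderA :: Get Header
getHeaderA = Header <$> getWord32be <*> getWord16be <*> getWord16be

sample :: BL.ByteString
sample = BL.pack [0, 0, 0, 1, 0, 2, 0, 3]
```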
These need to be addressed:
src/Data/Binary/Builder/Internal.hs:3:14: Warning:
‘Data.Binary.Builder.Internal’ is marked as Trustworthy but has been inferred as safe!
src/Data/Binary/Put.hs:3:14: Warning:
‘Data.Binary.Put’ is marked as Trustworthy but has been inferred as safe!
src/Data/Binary/Class.hs:3:14: Warning:
‘Data.Binary.Class’ is marked as Trustworthy but has been inferred as safe!
src/Data/Binary/Generic.hs:2:26: Warning:
‘Data.Binary.Generic’ is marked as Trustworthy but has been inferred as safe!
A closer look at all Safe Haskell use within binary would be good.
I find myself making the same error over and over again. I use get when I mean getByteString or I use put when I mean putByteString. It's occurred to me that the Binary instance for ByteString is actually a very bad idea. Instead, I suggest wrapping ByteString with a newtype that will define the current instance. Although this may break some code, it would likely save more time than it costs overall for the library's end users.
I was just reading through the Binary instance of Integer and stumbled on the roll function:
roll :: (Integral a, Num a, Bits a) => [Word8] -> a
roll = foldr unstep 0
  where
    unstep b a = a `shiftL` 8 .|. fromIntegral b
There's a risk of a stack-overflow here since it's lazily building the result value. Although the list of bytes will usually not be that big I think it would be better to build the value strictly using something like the following (untested):
roll :: (Integral a, Num a, Bits a) => [Word8] -> a
roll = foldl' unstep 0 . reverse -- reverse keeps the byte order of the foldr version
  where
    unstep a b = a `shiftL` 8 .|. fromIntegral b
One of my programs just failed with
too few bytes. Failed reading at byte position 1852252265
It took me a while to figure out that this message was coming from binary. I suggest we always output the module and function name in error messages:
Data.Binary.Get.getBytes: too few bytes. Failed reading at byte position 1852252265
The data in Foreign.C.Types are just newtype wrappers around types which mostly have Binary instances already. Is there a reason the C types don't have Binary instances?
I often find myself needing this function:
listUntilEnd :: (Binary a) => Get [a]
listUntilEnd = do
  done <- isEmpty
  if done then return [] else do
    next <- get
    rest <- listUntilEnd
    return (next:rest)
Hi,
binary (like many other Haskell libraries, unfortunately) does not have a proper changelog file that collects, per release, the user-relevant changes. With hackage now showing links to changelogs, it is a good time to introduce one. It would also prevent me from bothering you with #44...
Thanks,
Joachm
Hi Lennart,
I assume you have the tags corresponding to the versions of 'binary' on hackage. It would be great, if they were also available in your github repo.
best regards,
Simon
Data.Binary is small enough, and exports names that are unique enough, that it can commonly be simply imported wholesale:
import Data.Binary
However, this also happens to re-export Data.Word, which is surprising, and generates a warning from GHC, if code that uses it also imports Data.Word:
import Data.Binary
import Data.Word
yields:
src/Foo.hs:7:1: Warning:
The import of ‘Data.Word’ is redundant
except perhaps to import instances from ‘Data.Word’
To import instances alone, use: import Data.Word()
In a module that uses both Binary and Word explicitly, it makes for poor developer experience to rely on Data.Binary to export the names from Data.Word. If you move the Data.Binary dependent code out of the module, and delete the import - the remaining Data.Word code doesn't compile.
Apparently, binary represents a Double as a tuple of (Integer, Int)? This means that doubles suffer a 3x or larger size explosion, when really you could just record an IEEE floating-point value with the proper endianness. This would also fix #64.
Backwards compatibility might be a concern for fixing this, however.
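A sketch of what a fixed-width IEEE 754 encoding could look like, assuming a base recent enough to provide castDoubleToWord64 and castWord64ToDouble in GHC.Float. The names putDoubleIEEE and getDoubleIEEE are hypothetical, and the big-endian byte order is an arbitrary choice for the example.

```haskell
import Data.Binary.Get (Get, getWord64be, runGet)
import Data.Binary.Put (Put, putWord64be, runPut)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import GHC.Float (castDoubleToWord64, castWord64ToDouble)

-- Hypothetical helpers: write the raw IEEE 754 bit pattern, big-endian.
putDoubleIEEE :: Double -> Put
putDoubleIEEE = putWord64be . castDoubleToWord64

getDoubleIEEE :: Get Double
getDoubleIEEE = castWord64ToDouble <$> getWord64be

roundTrip :: Double -> Double
roundTrip = runGet getDoubleIEEE . runPut . putDoubleIEEE

-- Always 8 bytes, whatever the value.
encodedSize :: Double -> Int64
encodedSize = BL.length . runPut . putDoubleIEEE
```

Since the bit pattern is copied verbatim, NaN and the infinities come back unchanged. The output is not compatible with the existing Binary Double instance, which is where the backwards compatibility concern comes in.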
As pointed out in #70 by @ttuegel we don't do validation of UTF-8 when decoding. This needs to be fixed.
Consider introducing getList in Binary if it makes a big difference for performance. With getList we could use some of the faster UTF-8 validators without having to write our own. See how text does utf8 validation. Our case might be more difficult though as we don't know beforehand whether all input bytes are available.
class Binary a where
  -- ...
  getList :: Get [a]
  getList = getDefaultList
getDefaultList :: Binary a => Get [a]
getDefaultList = get >>= getMany
instance Binary Char where
  -- ...
  getList = -- faster code
I've found myself wanting something like (might be buggy, but you get the idea)
aligned :: Int -> Get a -> Get a
aligned n g = do
  br <- fromIntegral <$> bytesRead
  skip $ (n - br `rem` n) `rem` n -- skip nothing when already aligned
  g
Might it be worth adding to the library, with a Builder/Put counterpart? The Builder side of things would require more changes to make it work than what I wrote above, but it's not all that hard.
Would be useful to provide encode' :: (Binary a) => a -> Data.ByteString.ByteString
Usually I want lazy, but sometimes I do not.
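In the meantime this is a one-liner on top of the lazy encoder, assuming a bytestring version that provides toStrict for lazy ByteStrings. It copies every chunk into one contiguous buffer, so it is best suited to small values.

```haskell
import Data.Binary (Binary, encode)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word16)

-- Strict variant of encode: serialise lazily, then concatenate the chunks.
encode' :: Binary a => a -> BS.ByteString
encode' = BL.toStrict . encode

-- A Word16 is written as two bytes, most significant first.
example :: BS.ByteString
example = encode' (258 :: Word16)
```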
So slow that on Travis CI it doesn't link within 10 minutes and the build gets killed.
I'm using the lazy interface of Data.Binary.Get to parse large binaries from disk, where most parts are skipped on the first pass. When using skip directly, I get a severe performance and space usage problem when the skipped byte count is large (many megabytes). I'm not an expert on lazy bytestring internals, but it seems like the input data is held on for too long before being skipped ("PINNED" memory usage in -hc heap profile). Using this wrapper around skip makes it several orders of magnitude faster and does not explode on the heap (I'm using GHC 7.10.1):
import Control.Monad (replicateM_)
import qualified Data.Binary.Get as G
skipMany :: Int -> G.Get ()
skipMany bytes =
  replicateM_ rep (G.skip cs) >> G.skip rest
  where
    cs = 1024
    (rep, rest) = bytes `quotRem` cs
instance (Binary i, Ix i, Binary e, IArray UArray e) => Binary (UArray i e) where
  get = do
    bs <- get
    n <- get
    xs <- getMany n
    return (listArray bs xs)
getMany is fully strict in the list, since it uses an accumulator and reverses it at the end. The intermediate xs list can be huge in cases where the eventual UArray is much more manageable (eg 28M Booleans).
Two questions:
Is there a known alternative for (un)serializing UArrays to(from) disk? Such an alternative would make this Issue far less important.
Have you considered a version that serializes the bytes directly? I drafted one up; it's tremendously more efficient, though I'm concerned about robustness wrt endianness etc. Furthermore, it requires a base monad that can mutate arrays, which requires an "unsafe" invocation. And lastly it's not portable, using ghc-prim.
HTH. Thanks.
Could be implemented using Alternative.
I heard on IRC that this should never be the case, so I provided a benchmarked counterexample
https://github.com/ghorn/binary-counterexample
stats here: http://ghorn.github.io/binary-counterexample/
The current list encoding seems to blindly assume that lists have a length that can fit in an Int. It seems like there should be a way of encoding larger lists.
The following program encodes and decodes a long list of words. Memory consumption seems 4x bigger than what I'd expect. Results shown below. ghc-7.10.2, binary-0.7.6.1.
import Control.Exception (evaluate)
import Control.Monad (void)
import Data.Binary (encode, decode)
import qualified Data.ByteString.Lazy as BSL
import Data.List (isPrefixOf, foldl')
import Data.Word (Word32)
import GHC.Stats
import System.Mem (performGC)
type T = (Word32,[Word32])
main :: IO ()
main = do
  let sz = 1024 * 1024 * 15
      xs = [ (i,[i]) :: T | i <- [0 .. sz] ]
      bs = encode xs
  void $ evaluate $ sum' $ map (\(x, vs) -> x + sum' vs) xs
  putStrLn "After building the value to encode:"
  printMem
  putStrLn $ "Size of the encoded value: " ++
    show (BSL.length bs `div` (1024 * 1024)) ++ " MB"
  putStrLn ""
  putStrLn "After encoding the value:"
  printMem
  let xs' = decode bs :: [T]
  void $ evaluate $ sum' $ map (\(x, vs) -> x + sum' vs) xs'
  putStrLn "After decoding the value:"
  printMem
  -- retain the original list so it is not GC'ed
  void $ evaluate $ last xs
  -- retain the decoded list so it is not GC'ed
  void $ evaluate $ last xs'
printMem :: IO ()
printMem = do
  readFile "/proc/self/status" >>=
    putStr . unlines . filter (\x -> any (`isPrefixOf` x) ["VmHWM", "VmRSS"])
           . lines
  performGC
  stats <- getGCStats
  putStrLn $ "In use according to GC stats: " ++
    show (currentBytesUsed stats `div` (1024 * 1024)) ++ " MB"
  putStrLn $ "HWM according the GC stats: " ++
    show (maxBytesUsed stats `div` (1024 * 1024)) ++ " MB"
  putStrLn ""
sum' :: Num a => [a] -> a
sum' = foldl' (+) 0
Here are the results:
# time ./test +RTS -T
After building the value to encode:
VmHWM: 1557456 kB
VmRSS: 1557456 kB
In use according to GC stats: 1320 MB
HWM according the GC stats: 1320 MB
Size of the encoded value: 240 MB
After encoding the value:
VmHWM: 2791620 kB
VmRSS: 2791620 kB
In use according to GC stats: 1560 MB
HWM according the GC stats: 1560 MB
After decoding the value:
VmHWM: 6229164 kB
VmRSS: 6229164 kB
In use according to GC stats: 2880 MB
HWM according the GC stats: 2880 MB
real 0m27.143s
user 0m25.112s
sys 0m2.016s
The GC reports mostly what I expect. However, the OS reports a much higher memory usage. The difference seems to grow after decoding.
Any hints appreciated.
[1 of 8] Compiling Data.Binary.Builder.Base ( src/Data/Binary/Builder/Base.hs, dist/build/Data/Binary/Builder/Base.o )
src/Data/Binary/Builder/Base.hs:68:0:
Warning: Module `Data.Word' is imported, but nothing from it is used,
except perhaps instances visible in `Data.Word'
To suppress this warning, use: import Data.Word()
[2 of 8] Compiling Data.Binary.Builder.Internal ( src/Data/Binary/Builder/Internal.hs, dist/build/Data/Binary/Builder/Internal.o )
[3 of 8] Compiling Data.Binary.Builder ( src/Data/Binary/Builder.hs, dist/build/Data/Binary/Builder.o )
[4 of 8] Compiling Data.Binary.Get.Internal ( src/Data/Binary/Get/Internal.hs, dist/build/Data/Binary/Get/Internal.o )
src/Data/Binary/Get/Internal.hs:251:2:
`some' is not a (visible) method of class `Alternative'
src/Data/Binary/Get/Internal.hs:252:2:
`many' is not a (visible) method of class `Alternative'
In the definition of unsafeReadN:
unsafeReadN :: Int -> (B.ByteString -> a) -> Get a
unsafeReadN !n f = C $ \inp ks -> do
  ks (B.unsafeDrop n inp) $! f inp -- strict return
We pass the rest of the input (B.unsafeDrop n inp) to the success continuation without first forcing it. This could lead to us holding on to input longer than necessary.
In practice it's not much of a problem as the success continuation will most likely evaluate the thunk, but I think it's more correct to do:
unsafeReadN :: Int -> (B.ByteString -> a) -> Get a
unsafeReadN !n f = C $ \inp ks -> do
  let !t = B.unsafeDrop n inp
  ks t $! f inp -- strict return
Add description for: flag development