kolmodin / binary

Efficient, pure binary serialisation using ByteStrings in Haskell.

License: Other

Haskell 99.61% C 0.39%

binary's Issues

Improve README.md

The README.md file has been converted to Markdown and been slightly extended.

Here are some further changes I'd like to see:

Not sure if it's still portable to Hugs. We claim it is, but I doubt it.

The section about "Using binary". I think this section is not too helpful. It could be better structured around the alternatives, with links to the relevant Haddocks.

Deriving binary instance section. This is outdated and misleading. We can now generate instances with GHC's generics. A small example of how to do so, or a link to the Haddocks, would be more suitable.
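A minimal sketch of what such a README example could look like. The `Trade` record and `roundTrip` helper are made up for illustration; the only library feature relied on is the Generic-based default for `put` and `get`.

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Binary (Binary, decode, encode)
import Data.Word (Word32)
import GHC.Generics (Generic)

-- Hypothetical record, used only for illustration.
data Trade = Trade
  { tradeId  :: Word32
  , quantity :: Word32
  } deriving (Eq, Show, Generic)

-- An empty instance body picks up the Generic-based default put/get.
instance Binary Trade

-- Encode to a lazy ByteString and decode it again.
roundTrip :: Trade -> Trade
roundTrip = decode . encode
```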

runGetState uses toChunks and fromChunks

This quickly gets expensive when runGetState is called repeatedly, and it can build up a huge stack, like so:

countTrades :: BL.ByteString -> Int
countTrades input = stepper (0, input) where
  stepper (!count, !buffer)
    | BL.null buffer = count
    | otherwise      =
        let (trade, rest, _) = runGetState getTrade buffer 0
        in stepper (count+1, rest)

Code from http://stackoverflow.com/questions/9567040/poor-performance-parsing-binary-file-in-haskell/9573661#9573661

Add instance for `Natural` type

The Natural type has been officially added to base-4.8.0.0

Consider using Int64 throughout the API

We use Int in a lot of places, but the Haskell Report only guarantees that Int covers the range [-2^29, 2^29 - 1], so portable code can only refer to offsets of up to 512 MB.
We should consider using Int64 instead of relying on Int being wider than that.

lookAheadE

Got a request to re-implement lookAheadE.

Apparently it can be used to implement lookAheadM in monad transformer stacks, like in the hackage package bytes.

Relation to blaze-builder

binary contains an implementation of a builder just like found in blaze-builder. IIRC, this implementation was actually the original source for blaze-builder.

Would it make any sense to swap out the locally-maintained Builder for that one? If not, why not?

Add isolate function

Taken from cereal documentation:

isolate :: Int -> Get a -> Get a

Isolate an action to operating within a fixed block of bytes. The action is required to consume all the bytes that it is isolated to.

It's quite a useful function, since the pattern of a byte count N followed by a chunk of N bytes is quite common in binary formats.

I propose adding two variants of isolate. The first should have the same semantics as cereal's and require the parser to consume all of the input. The second should only ensure that the parser consumes no more than N bytes, with the rest discarded. That is useful when you do not want to fully decode the block's data.
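The second, lenient variant could be sketched on top of functions Data.Binary.Get exports in binary 0.7 and later. The name `isolateLenient` is hypothetical, and this version loses the inner parser's byte offset in error messages.

```haskell
import Data.Binary.Get (Get, getLazyByteString, getWord8, runGet, runGetOrFail)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Hypothetical name: run the parser on exactly n bytes and discard
-- whatever part of the block it leaves unconsumed.
isolateLenient :: Int -> Get a -> Get a
isolateLenient n g = do
  block <- getLazyByteString (fromIntegral n)  -- take the whole block up front
  case runGetOrFail g block of
    Left (_, _, msg) -> fail msg               -- inner parser failed
    Right (_, _, a)  -> return a               -- leftover bytes are dropped

-- Reads the first byte of a 3-byte block, then the byte after the block.
example :: (Word8, Word8)
example =
  runGet ((,) <$> isolateLenient 3 getWord8 <*> getWord8) (BL.pack [1, 2, 3, 4])
```

Here `example` evaluates to `(1, 4)`: bytes 2 and 3 belong to the block and are skipped.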

Where did the lookAhead functions disappear to?

Is there any way to get the same functionality in binary 0.6?

binary-0.7.4.0 can't compile its test suite

Citing from http://hydra.cryp.to/build/852686/nixlog/1/raw:

tests/QC.hs:364:28:
    Ambiguous occurrence ‘arbitrarySizedNatural’
    It could refer to either ‘Test.QuickCheck.arbitrarySizedNatural’,
                             imported from ‘Test.QuickCheck’ at tests/QC.hs:21:1-32
                             (and originally defined in ‘Test.QuickCheck.Arbitrary’)
                          or ‘Arbitrary.arbitrarySizedNatural’,
                             imported from ‘Arbitrary’ at tests/QC.hs:26:56-76
                             (and originally defined at tests/Arbitrary.hs:70:1-21)

Calculate the length of a Builder without executing it

In the ELF format, there are absolute offsets pointing to positions after the offset.
Thus, calculating this offset requires knowing the length of several Builders, including the length of the Builder that will contain the offset itself.

With the current API and a naive approach that would force the Builders into lazy ByteStrings, you'd end up with <<loop>>.

One simple solution, suggested by Joe Hendrix, is to wrap the Builder in a type that contains the Builder and its length.

data SizedBuilder = SB Int64 Builder

length (SB l _) = l
… + additional methods exported by Data.Binary.Builder

This API could be exposed from Data.Binary.Builder.Sized.
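A rough sketch of how such a wrapper might look, assuming GHC 8.4 or later (for Semigroup in the Prelude). The names `size`, `word8`, `word32be` and `run` are made up here and are not part of binary.

```haskell
import Data.Binary.Builder (Builder, toLazyByteString)
import qualified Data.Binary.Builder as B
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word32, Word8)

-- A Builder paired with the number of bytes it will emit.
data SizedBuilder = SB !Int64 Builder

-- Appending adds the lengths and appends the underlying Builders.
instance Semigroup SizedBuilder where
  SB l1 b1 <> SB l2 b2 = SB (l1 + l2) (b1 <> b2)

instance Monoid SizedBuilder where
  mempty = SB 0 mempty

-- The length is available without running the Builder.
size :: SizedBuilder -> Int64
size (SB l _) = l

-- Hypothetical wrappers around Data.Binary.Builder primitives.
word8 :: Word8 -> SizedBuilder
word8 w = SB 1 (B.singleton w)

word32be :: Word32 -> SizedBuilder
word32be w = SB 4 (B.putWord32be w)

run :: SizedBuilder -> BL.ByteString
run (SB _ b) = toLazyByteString b
```

Every primitive has to state its own size correctly, which is the main maintenance cost of this design.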

decode should handle failure more gracefully

Currently Data.Binary.decode and Data.Binary.decodeFile rely on error to fail. They really ought to return Maybe or Either.
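For reference, binary 0.7 and later export decodeOrFail, which reports failure in an Either. A small wrapper along these lines (the name `safeDecode` is hypothetical) gives the kind of interface asked for here:

```haskell
import Data.Binary (Binary, decodeOrFail, encode)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)

-- Hypothetical helper: like decode, but returns the error instead of
-- calling `error`.
safeDecode :: Binary a => BL.ByteString -> Either String a
safeDecode bs =
  case decodeOrFail bs of
    Left (_, offset, msg) -> Left (msg ++ " at byte offset " ++ show offset)
    Right (_, _, a)       -> Right a

-- A well-formed input decodes to a Right value.
example :: Either String Word32
example = safeDecode (encode (7 :: Word32))
```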

Implement a delimiter for Get's bytesRead

Basically something like delimit :: Get a -> Get a that resets the bytesRead counter on the inner Get. This would primarily be useful in conjunction with the isolate combinator (which can then act undelimited by default, and we can modify it to be delimited if needed) or the alignment combinator I propose in #50.

A size counter for Get

I've often wanted a combinator called sized :: Get a -> Get (Int, a) that will behave the same as its input, except will also tell you how many bytes it consumed.
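One way this could be sketched with the existing bytesRead function, assuming it reports the number of bytes consumed so far:

```haskell
import Data.Binary.Get (Get, bytesRead, getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)

-- Run a parser and also report how many bytes it consumed.
sized :: Get a -> Get (Int, a)
sized g = do
  start <- bytesRead
  a     <- g
  end   <- bytesRead
  return (fromIntegral (end - start), a)

-- A 32-bit word consumes four bytes.
example :: (Int, Word32)
example = runGet (sized getWord32be) (BL.pack [0, 0, 0, 1])
```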

Add a label function like in cereal

http://hackage.haskell.org/packages/archive/cereal/0.3.5.2/doc/html/src/Data-Serialize-Get.html#label

The safecopy package uses this to improve error messages, and the plan is to port safecopy/acid-state to binary.

GHC-7.2 build failure

src/Data/Binary/Put.hs:61:1:
    bytestring-0.9.2.0:Data.ByteString can't be safely imported! The module itself isn't safe.
xcabal: Error: some packages failed to install:
binary-0.7.6.0 failed during the building phase. The exception was:
ExitFailure 1

Possible binary incompatibility between 0.5 and 0.7

Hi,

I can’t actually find anything in the documentation that discusses the binary compatibility between binary versions, i.e. whether code that successfully parses something with binary 0.5 will also do so with 0.7, but I was optimistically assuming so.

Anyways, it does not seem to be the case. Compare https://s3.amazonaws.com/archive.travis-ci.org/jobs/17619621/log.txt with https://api.travis-ci.org/jobs/17619789/log.txt?deansi=true – identical setups, besides the version of binary, and the tests show that Data.Binary.Get behaves differently.

Is there any documentation of incompatibilities between binary versions? I can’t even find a changelog.

Conditionally instantiate NFData for ByteString and L.ByteString

On bytestring < 0.10.0.0 there were no NFData declarations, and we instantiated our own.

Now bytestring >= 0.10.0.0 is shipped with the Haskell Platform - our instances are no longer needed for newer bytestrings.

We have some broken code that checks for GHC version, but it should actually be checking the ByteString version.

Improve Haddock documentation

This wiki might be a good starting point for what people are looking for:
http://www.haskell.org/haskellwiki/DealingWithBinaryData

Benchmark failing on master branch

When I run the benchmarks on the master branch I get the following error:

Binary (de)serialisation benchmarks:
100MB of Word8  in chunks of 16 (  Host endian): bench: too few bytes. Failed reading at byte position 6553601

This is the command I ran

make -C benchmarks/ clean bench run-bench

I haven't touched the Data.Binary.Get code so I'm not sure what's wrong.

decode (encode NaN) /= NaN

The IEEE "not-a-number" (NaN) value is not encoded properly, since encode uses decodeFloat, which is unspecified for NaN (cf. Prelude).

Examples:

(0 / 0 :: Double) = NaN
(decode (encode (0 / 0 :: Double)) :: Double) = -Infinity
(log (-1) :: Double) = NaN
(decode (encode (log (-1) :: Double)) :: Double) = Infinity
(0 / 0 :: Float) = NaN
(decode (encode (0 / 0 :: Float)) :: Float) = -Infinity

`Data.Binary.Put` lacks `Put`s for `Int` types

Data.Binary.Put and Data.Binary.Builder provide a variety of Puts for various width Word types. I don't see any reason why they shouldn't include similar functionality for the signed types from Data.Int.
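Until such functions exist, the signed variants can be sketched in terms of the Word ones, relying on fromIntegral preserving the two's-complement bit pattern between same-width types. The names below are hypothetical.

```haskell
import Data.Binary.Put (Put, putWord16be, putWord32be, runPut)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int16, Int32)

-- Hypothetical helpers: reinterpret the signed value as an unsigned
-- word of the same width, then reuse the existing Word writers.
putInt16beVia :: Int16 -> Put
putInt16beVia = putWord16be . fromIntegral

putInt32beVia :: Int32 -> Put
putInt32beVia = putWord32be . fromIntegral

-- The big-endian bytes of -2 as a 32-bit integer.
example :: BL.ByteString
example = runPut (putInt32beVia (-2))
```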

Export PairS

It would be useful if this were exported, as I could then take apart the Put monad and reassemble it without incurring the cost of running the Builder itself.

Support decoding from a Get monad

I prefer not to ask for convenience functions willy-nilly, but I think refactoring decodeFileOrFail with this signature would be useful:

decodeGetFileOrFail :: Get a -> FilePath -> IO (Either (ByteOffset, String) a)

The reason is that there is a nontrivial chunk of code for running the incremental parser that we'd prefer not to duplicate. I'll submit a PR soon.

Remove all compilation warnings when compiling binary

From @simonpj:

In the binary library I’m seeing lots of these warnings:

libraries/binary/src/Data/Binary/Get.hs:420:1: warning:

    Rule "getWord16le/readN" may never fire

      because ‘getWord16le’ might inline first

    Probable fix: add an INLINE[n] or NOINLINE[n] pragma on this function

libraries/binary/src/Data/Binary/Builder/Base.hs:510:1: warning:

    Rule "flush/flush" may never fire

      because ‘flush’ might inline first

    Probable fix: add an INLINE[n] or NOINLINE[n] pragma on this function

The warnings look right to me: currently everything is very fragile and may not work as you intend.

Update the example that uses `runGetState` to new API

The tutorial in Data.Binary.Get includes the following example:

 example2 :: BL.ByteString -> [Trade]
 example2 input
   | BL.null input = []
   | otherwise =
      let (trade, rest, _) = runGetState getTrade input 0
      in trade : example2 rest

Unfortunately, runGetState is marked as deprecated, with a suggestion to use runGetIncremental instead. It'd be nice if the tutorial examples showed the recommended usage of the library.
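A possible replacement, sketched with runGetIncremental and assuming the Decoder constructors of binary 0.7 and later. The `Trade` layout is an assumption made so the block is self-contained, and `decodeTrades` is a hypothetical name. Each trade is parsed with a fresh decoder, so the reported offset is relative to the start of the failing trade.

```haskell
import Data.Binary.Get
  (Decoder (..), Get, getWord16le, getWord32le, runGetIncremental)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word16, Word32)

-- Assumed record layout, for illustration only.
data Trade = Trade
  { timestamp :: !Word32
  , price     :: !Word32
  , qty       :: !Word16
  } deriving (Eq, Show)

getTrade :: Get Trade
getTrade = Trade <$> getWord32le <*> getWord32le <*> getWord16le

-- Decode trades until the input is exhausted, feeding strict chunks
-- to the incremental decoder one at a time.
decodeTrades :: BL.ByteString -> Either String [Trade]
decodeTrades = start . BL.toChunks
  where
    start []     = Right []
    start chunks = go (runGetIncremental getTrade) chunks

    go (Done leftover _ trade) chunks =
      (trade :) <$> start (prepend leftover chunks)
    go (Partial k) (c : cs) = go (k (Just c)) cs
    go (Partial k) []       = go (k Nothing) []   -- signal end of input
    go (Fail _ offset msg) _ =
      Left (msg ++ " at byte offset " ++ show offset)

    -- Push unconsumed bytes back in front of the remaining chunks.
    prepend bs cs
      | BS.null bs = cs
      | otherwise  = bs : cs
```

Returning Either makes truncated input visible to the caller, at the cost of not producing the list lazily.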

Explain why the applicative style should be preferred over the monadic style

Writing decoders in applicative style gives more efficient code, since boundary checks (check whether we have enough remaining input) can be merged.

Update the documentation to reflect this, README.md and haddock.
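For the documentation, the two styles could be contrasted with something like the following. The `Header` type is made up, and whether the bounds checks are actually merged depends on the library version and on GHC's optimiser, so treat the performance claim as something to benchmark.

```haskell
import Data.Binary.Get (Get, getWord16be, getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word16, Word32)

-- Hypothetical fixed-size header: version, flags, payload length.
data Header = Header !Word16 !Word16 !Word32
  deriving (Eq, Show)

-- Monadic style: each step is sequenced through >>=.
getHeaderMonadic :: Get Header
getHeaderMonadic = do
  version <- getWord16be
  flags   <- getWord16be
  len     <- getWord32be
  return (Header version flags len)

-- Applicative style: the shape of the parser is static, which is
-- what allows the input checks to be combined.
getHeaderApplicative :: Get Header
getHeaderApplicative =
  Header <$> getWord16be <*> getWord16be <*> getWord32be

sample :: BL.ByteString
sample = BL.pack [0, 1, 0, 2, 0, 0, 0, 3]
```

Both parsers produce the same value on `sample`; only the generated code differs.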

Safe Haskell compilation warnings

These need to be addressed:

src/Data/Binary/Builder/Internal.hs:3:14: Warning:
    ‘Data.Binary.Builder.Internal’ is marked as Trustworthy but has been inferred as safe!
src/Data/Binary/Put.hs:3:14: Warning:
    ‘Data.Binary.Put’ is marked as Trustworthy but has been inferred as safe!
src/Data/Binary/Class.hs:3:14: Warning:
    ‘Data.Binary.Class’ is marked as Trustworthy but has been inferred as safe!
src/Data/Binary/Generic.hs:2:26: Warning:
    ‘Data.Binary.Generic’ is marked as Trustworthy but has been inferred as safe!

A closer look at all Safe Haskell use within binary would be good.

Remove Binary instance for ByteString

I find myself making the same error over and over again: I use get when I mean getByteString, or put when I mean putByteString. It has occurred to me that the Binary instance for ByteString is actually a very bad idea. Instead, I suggest wrapping ByteString in a newtype that carries the current instance. Although this may break some code, it would likely save the library's end users more time overall than it costs.

Risk of stack-overflow in roll

I was just reading through the Binary instance of Integer and stumbled on the roll function:

roll :: (Integral a, Num a, Bits a) => [Word8] -> a
roll   = foldr unstep 0
  where
    unstep b a = a `shiftL` 8 .|. fromIntegral b

There's a risk of a stack-overflow here since it's lazily building the result value. Although the list of bytes will usually not be that big I think it would be better to build the value strictly using something like the following (untested):

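roll :: (Integral a, Num a, Bits a) => [Word8] -> a
-- The bytes arrive least significant first, so reverse them before the
-- strict left fold to get the same result as the foldr version.
roll = foldl' unstep 0 . reverse
  where
    unstep a b = a `shiftL` 8 .|. fromIntegral b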

Include module and function name in error messages

One of my programs just failed with

too few bytes. Failed reading at byte position 1852252265

It took me a while to figure out that this message was coming from binary. I suggest we always output the module and function name in error messages:

Data.Binary.Get.getBytes: too few bytes. Failed reading at byte position 1852252265

Binary instances for Foreign.C.Types

The data in Foreign.C.Types are just newtype wrappers around types which mostly have Binary instances already. Is there a reason the C types don't have Binary instances?

listUntilEnd

I often find myself needing this function:

listUntilEnd :: (Binary a) => Get [a]
listUntilEnd = do
   done <- isEmpty
   if done then return [] else do
      next <- get
      rest <- listUntilEnd
      return (next:rest)

Add changelog

Hi,

binary (like many other Haskell libraries, unfortunately) does not have a proper changelog file that collects, per release, the user-relevant changes. With hackage now showing links to changelogs, it is a good time to introduce one. It would also prevent me from bothering you with #44...

Thanks,
Joachm

Push stable version tags to github

Hi Lennart,

I assume you have the tags corresponding to the versions of 'binary' on hackage. It would be great, if they were also available in your github repo.

best regards,
Simon

don't re-export Data.Word from Data.Binary

Data.Binary is small enough, and exports names that are unique enough, that it can commonly be simply imported wholesale:

import Data.Binary

However, this also happens to re-export Data.Word, which is surprising, and generates a warning from GHC, if code that uses it also imports Data.Word:

import Data.Binary
import Data.Word

yields:

src/Foo.hs:7:1: Warning:
     The import of ‘Data.Word’ is redundant
      except perhaps to import instances from ‘Data.Word’
    To import instances alone, use: import Data.Word()

In a module that uses both Binary and Word explicitly, it makes for poor developer experience to rely on Data.Binary to export the names from Data.Word. If you move the Data.Binary dependent code out of the module, and delete the import - the remaining Data.Word code doesn't compile.

Instance for Double/Float is absolutely batty

Apparently, binary represents a Double as a tuple of (Integer, Int)? This means that doubles suffer a 3x or greater size explosion, when really you could just record an IEEE floating-point value with the proper endianness. This would also fix #64

Backwards compatibility might be a concern for fixing this, however.
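A sketch of the IEEE approach, assuming base 4.10 or later for castDoubleToWord64 and castWord64ToDouble from GHC.Float. The function names are hypothetical, and the format is not compatible with the existing instance.

```haskell
import Data.Binary.Get (Get, getWord64be, runGet)
import Data.Binary.Put (Put, putWord64be, runPut)
import qualified Data.ByteString.Lazy as BL
import GHC.Float (castDoubleToWord64, castWord64ToDouble)

-- Write the raw IEEE 754 bit pattern as eight big-endian bytes.
putDoubleIEEE :: Double -> Put
putDoubleIEEE = putWord64be . castDoubleToWord64

-- Read eight big-endian bytes back as a Double.
getDoubleIEEE :: Get Double
getDoubleIEEE = castWord64ToDouble <$> getWord64be

roundTrip :: Double -> Double
roundTrip = runGet getDoubleIEEE . runPut . putDoubleIEEE

-- Every Double takes exactly eight bytes in this format.
encodedSize :: Double -> Int
encodedSize = fromIntegral . BL.length . runPut . putDoubleIEEE
```

Because the bit pattern is copied verbatim, NaN and the infinities survive the round trip.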

UTF-8 validation when deserializing

As pointed out in #70 by @ttuegel we don't do validation of UTF-8 when decoding. This needs to be fixed.

Consider introducing getList in Binary if it makes a big difference for performance. With getList we could use some of the faster UTF-8 validators without having to write our own. See how text does UTF-8 validation. Our case might be more difficult, though, as we don't know beforehand whether all input bytes are available.

class Binary a where
  -- ...
  getList :: Get [a]
  getList = getDefaultList

-- Read the length prefix, then that many elements.
getDefaultList :: Binary a => Get [a]
getDefaultList = (get :: Get Int) >>= getMany

instance Binary Char where
  -- ...
  getList = -- faster code

Support alignment modifiers

I've found myself wanting something like (might be buggy, but you get the idea)

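aligned :: Int -> Get a -> Get a
aligned n g = do
  br <- fromIntegral <$> bytesRead
  -- Skip only the padding needed to reach the next multiple of n;
  -- the outer `rem` avoids skipping n bytes when already aligned.
  skip $ (n - br `rem` n) `rem` n
  g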

Might it be worth adding to the library, with a Builder/Put counterpart? The Builder side of things would require more changes to make it work than what I wrote above, but it's not all that hard.

encode'

Would be useful to provide encode' :: (Binary a) => a -> Data.ByteString.ByteString

Usually I want lazy, but sometimes I do not.
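In the meantime this is a one-liner on the user's side, assuming bytestring 0.10 or later for toStrict:

```haskell
import Data.Binary (Binary, encode)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)

-- Encode to a lazy ByteString, then copy it into one strict chunk.
encode' :: Binary a => a -> BS.ByteString
encode' = BL.toStrict . encode

-- A Word32 is encoded as four bytes.
example :: Int
example = BS.length (encode' (1 :: Word32))
```

This still builds the lazy chunks first, so a library-provided version could avoid the extra copy.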

Linking 'bench' executable slow

So slow that on Travis CI it doesn't link within 10 minutes and the build gets killed.

Performance issue with skip

I'm using the lazy interface of Data.Binary.Get to parse large binaries from disk, where most parts are skipped on the first pass. When using skip directly, I get a severe performance and space usage problem when the skipped byte count is large (many megabytes). I'm not an expert on lazy bytestring internals, but it seems like the input data is held on to for too long before being skipped ("PINNED" memory usage in the -hc heap profile). Using this wrapper around skip makes it several orders of magnitude faster and does not explode on the heap (I'm using GHC 7.10.1):

import Control.Monad (replicateM_)
import qualified Data.Binary.Get as G

-- Skip in 1024-byte steps instead of one large skip.
skipMany :: Int -> G.Get ()
skipMany bytes =
  replicateM_ rep (G.skip cs) >> G.skip rest
 where
   cs = 1024
   (rep, rest) = bytes `quotRem` cs

get for UArray blows the heap for large arrays

instance (Binary i, Ix i, Binary e, IArray UArray e) => Binary (UArray i e) where
    get = do
        bs <- get
        n  <- get
        xs <- getMany n
        return (listArray bs xs)

getMany is fully strict in the list, since it uses an accumulator and reverses it at the end. The intermediate xs list can be huge in cases where the eventual UArray is much more manageable (eg 28M Booleans).

Two questions:

Is there a known alternative for (un)serializing UArrays to(from) disk? Such an alternative would make this Issue far less important.
Have you considered a version that serializes the bytes directly? I drafted one up; it's tremendously more efficient, though I'm concerned about robustness wrt endianness etc. Furthermore, it requires a base monad that can mutate arrays, which requires an "unsafe" invocation. And lastly it's not portable, using ghc-prim.

HTH. Thanks.

MonadPlus

Could be implemented using Alternative.

binary 4x slower than cereal on my data type

I heard on IRC that this should never be the case, so I provided a benchmarked counterexample

https://github.com/ghorn/binary-counterexample

stats here: http://ghorn.github.io/binary-counterexample/

List encoding does not work for large lists?

The current list encoding seems to blindly assume that lists have a length that can fit in an Int. It seems like there should be a way of encoding larger lists.

The changelog.md file is not available in Hackage

Memory consumption of decoding bigger than expected

The following program encodes and decodes a long list of words. Memory consumption seems 4x bigger than what I'd expect. Results shown below. ghc-7.10.2, binary-0.7.6.1.

import Control.Exception (evaluate)
import Control.Monad (void)
import Data.Binary (encode, decode)
import qualified Data.ByteString.Lazy as BSL
import Data.List (isPrefixOf, foldl')
import Data.Word (Word32)
import GHC.Stats
import System.Mem (performGC)

type T = (Word32,[Word32])

main :: IO ()
main = do
  let sz = 1024 * 1024 * 15
      xs = [ (i,[i]) :: T | i <- [0 .. sz] ]
      bs = encode xs

  void $ evaluate $ sum' $ map (\(x, vs) -> x + sum' vs) xs
  putStrLn "After building the value to encode:"
  printMem

  putStrLn $ "Size of the encoded value: " ++
    show (BSL.length bs `div` (1024 * 1024)) ++ " MB"
  putStrLn ""

  putStrLn "After encoding the value:"
  printMem

  let xs' = decode bs :: [T]
  void $ evaluate $ sum' $ map (\(x, vs) -> x + sum' vs) xs'
  putStrLn "After decoding the value:"
  printMem

  -- retain the original list so it is not GC'ed
  void $ evaluate $ last xs
  -- retain the decoded list so it is not GC'ed
  void $ evaluate $ last xs'

printMem :: IO ()
printMem = do
  readFile "/proc/self/status" >>=
    putStr . unlines . filter (\x -> any (`isPrefixOf` x) ["VmHWM", "VmRSS"])
           . lines
  performGC
  stats <- getGCStats
  putStrLn $ "In use according to GC stats: " ++
    show (currentBytesUsed stats `div` (1024 * 1024)) ++ " MB"
  putStrLn $ "HWM according the GC stats: " ++
    show (maxBytesUsed stats `div` (1024 * 1024)) ++ " MB"
  putStrLn ""

sum' :: Num a => [a] -> a
sum' = foldl' (+) 0

Here are the results:

# time ./test +RTS -T
After building the value to encode:
VmHWM:   1557456 kB
VmRSS:   1557456 kB
In use according to GC stats: 1320 MB
HWM according the GC stats: 1320 MB

Size of the encoded value: 240 MB

After encoding the value:
VmHWM:   2791620 kB
VmRSS:   2791620 kB
In use according to GC stats: 1560 MB
HWM according the GC stats: 1560 MB

After decoding the value:
VmHWM:   6229164 kB
VmRSS:   6229164 kB
In use according to GC stats: 2880 MB
HWM according the GC stats: 2880 MB


real    0m27.143s
user    0m25.112s
sys 0m2.016s

The GC reports mostly what I expect. However, the OS reports much higher memory usage, and the gap seems to widen after decoding.

Any hints appreciated.

0.7.0.2 build broken with ghc 6.10.4

[1 of 8] Compiling Data.Binary.Builder.Base ( src/Data/Binary/Builder/Base.hs, dist/build/Data/Binary/Builder/Base.o )

src/Data/Binary/Builder/Base.hs:68:0:
    Warning: Module `Data.Word' is imported, but nothing from it is used,
               except perhaps instances visible in `Data.Word'
             To suppress this warning, use: import Data.Word()
[2 of 8] Compiling Data.Binary.Builder.Internal ( src/Data/Binary/Builder/Internal.hs, dist/build/Data/Binary/Builder/Internal.o )
[3 of 8] Compiling Data.Binary.Builder ( src/Data/Binary/Builder.hs, dist/build/Data/Binary/Builder.o )
[4 of 8] Compiling Data.Binary.Get.Internal ( src/Data/Binary/Get/Internal.hs, dist/build/Data/Binary/Get/Internal.o )

src/Data/Binary/Get/Internal.hs:251:2:
    `some' is not a (visible) method of class `Alternative'

src/Data/Binary/Get/Internal.hs:252:2:
    `many' is not a (visible) method of class `Alternative'

unsafeReadN holds on to consumed input

In the definition of unsafeReadN:

unsafeReadN :: Int -> (B.ByteString -> a) -> Get a
unsafeReadN !n f = C $ \inp ks -> do
  ks (B.unsafeDrop n inp) $! f inp -- strict return

We pass the rest of the input (B.unsafeDrop n inp) to the success continuation without first forcing it. This could lead to us holding on to input longer than necessary.

In practice it's not much of a problem as the success continuation will most likely evaluate the thunk, but I think it's more correct to do:

unsafeReadN :: Int -> (B.ByteString -> a) -> Get a
unsafeReadN !n f = C $ \inp ks -> do
  let !t = B.unsafeDrop n inp
  ks t $! f inp -- strict return

Add flag description

Add description for:

flag development
