haskell-works / avro Goto Github PK

View Code? Open in Web Editor NEW

82.0 82.0 35.0 767 KB

Haskell Avro Encoding and Decoding Native Support (no RPC)

License: BSD 3-Clause "New" or "Revised" License

Haskell 99.26% Nix 0.37% Shell 0.37%

avro's People

Contributors

Stargazers

Watchers

avro's Issues

Design deriving types for n-ary unions for n > 2

I see in the README and here that deriving data types for n-ary unions is currently not supported. I assume this is because for n > 2 it is not clear what data type to generate.

In Scala's avro4s they use something akin to GHC.Generics :+: to represent n-ary unions, with the common usage being users specifying their own sum type and the compiler ensuring the shape of the sum type matches the generic shape. I was wondering if a similar approach can be done in Haskell?

Decoding of Avros with `codec = deflate` doesn't obey spec

The spec states

The "deflate" codec writes the data block using the deflate algorithm as specified in RFC 1951, and > typically implemented using the zlib library. Note that this format (unlike the "zlib format" in RFC 1950) does not have a checksum.

Now, when I try to decode a container with codec = deflate I get

Header error: Header checksum failed

Which is an error message from pure-zlib. But according to spec neither header nor checksums should be serialized and instead the raw format defined by RFC 1951 should be used.

Are we actually using hpack?

Right now, we have both a .cabal file and a package.yaml file in the repo. This is redundant and can cause problems—for example, #71 only updated the .cabal file, so the repo does not build with hpack any more. This was not caught by our CI which, presumably, just uses the .cabal file directly.

Ideally, we should only have one or the other. I propose getting rid of package.yaml altogether and sticking to a normal .cabal file.

Clarification on encode - encodeContainer vs decode - decodeContainer

Hi there!

I'm trying to get familiar with the library, and as an example I tried to expand on the README example and take a Haskell Data structure, encode it to Avro, write it to file and decode it from the file into the original Data structure again.

My first try was using the encode and decode functions and writing/reading to a file, but it produced invalid Avro (tested with avro-tools.jar)

So I decided to try the encodeContainer / decodeContainer functions instead and that worked, then I started to look into "why" that worked and found the following:

Avro includes a simple object container file format. A file has a schema, and all objects stored in the file must be written according to that schema, using binary encoding. Objects are stored in blocks that may be compressed. Syncronization markers are used between blocks to permit efficient splitting of files for MapReduce processing.

As I've been using Avro only through high level interfaces such as Spark/Hive before I was unaware of the "Object Container Files" specificity, and that records were organized in "blocks".

Does that mean encodeContainer and decodeContainer are the only suitable functions for dealing with files ? (because they deal with blocks), and encode / decode are only suitable for exchange of Avro messages through HTTP requests ?

Decoding maps with blocks that start with a negative number is not handled.

According to the Avro spec, blocks can start with a negative number, which requires special handling:

If a block's count is negative, its absolute value is used, and the count is followed immediately by a long block size indicating the number of bytes in the block. This block size permits fast skipping through data, e.g., when projecting a record to a subset of its fields.

The implementation of getArray deals with this correctly:

getArray :: GetAvro ty => Get [ty]
getArray =
  do nr <- getLong
     if
      | nr == 0 -> return []
      | nr < 0  ->
          do _len <- getLong
             rs <- replicateM (fromIntegral (abs nr)) getAvro
             (rs <>) <$> getArray
      | otherwise ->
          do rs <- replicateM (fromIntegral nr) getAvro
             (rs <>) <$> getArray

getMap, however, doesn't handle this case:

getMap :: GetAvro ty => Get (Map.Map Text ty)
getMap = go Map.empty
 where
 go acc =
  do nr <- getLong
     if nr == 0
       then return acc
       else do m <- Map.fromList <$> replicateM (fromIntegral nr) getKVs
               go (Map.union m acc)
 getKVs = (,) <$> getString <*> getAvro

This is an easy fix, but I don't have time to implement it right now, so I'm opening an issue to keep track of it.

Strange decoding error when deserializing Avro blob

This is more of a request for debugging assistance than a bug report. I think.

I'm trying to decode RDK-B messages that use this schema: https://code.rdkcentral.com/r/plugins/gitiles/rdkb/components/opensource/ccsp/CcspLMLite/+/2f08d4d145699f31310c942199a62eb1e9f4d8b3/config/InterfaceDevicesWifi.avsc.

The actual blobs that I'm trying to decode I can't provide in this report. If it'd help, I might be able to try reproducing the issue with QuickCheck.

When I do a decode on a blob that's supposed to have this schema, I get:
Error "Unexpected value when decoding for 'Double': Record (...), where the (...) is the entire record, AFAICT, that’s been decoded. What’s particularly weird about this is that there’s no use of double anywhere in the schema. I can provide the -ddump-splices TH output, but there's no reference to Double anywhere in the generated code, either.

Where should I start in trying to debug this?

`buildTypeEnvironment` doesn't work properly with maps

buildTypeEnvironment on older versions of the package doesn't handle maps properly.

case ty of
        Record {..} -> mk name aliases namespace ++ concatMap (go . fldType) fields
        Enum {..}   -> mk name aliases namespace
        Union {..}  -> concatMap go options
        Fixed {..}  -> mk name aliases namespace
        Array {..}  -> go item
        _           -> []

This case should also recurse on Map {..}:

    Map {..} -> go values

This leads to errors on schemas that define a named type in a map and then refer to it:

{
  "name" : "Foo",
  "type" : "record",
  "fields" : [
    {
      "name" : "bar",
      "type" : {
        {
          "type" : "map",
          "values" : {
            "name" : "Bar",
            "type" : "record",
            "fields" : []
          }
        }
      }
    },
    {
      "name" : "bar2",
      "type" : "Bar"
    }
  ]
}

Decoding an object with this schema will result in an error about the type "Bar" not being in scope.

It looks like I actually fixed this in my namespace PR (although I don't remember doing that explicitly). The minimum we can do to fix this is release a new version with that PR included (it has already been merged on master.)

This bug isn't specific to namespaces, so it might be worth releasing bug-fixes for older versions of avro on Hackage as well.

fails to build with aeson-2.0

avro                      > /tmp/stack-e0cde8ed4b1cb895/avro-0.5.2.1/src/Data/Avro/Schema/Schema.hs:526:21: error:
avro                      >     • Couldn't match expected type ‘A.Key’ with actual type ‘Text’
avro                      >     • In the second argument of ‘(.:?)’, namely ‘("order" :: Text)’
avro                      >       In the first argument of ‘(.!=)’, namely ‘o .:? ("order" :: Text)’
avro                      >       In a stmt of a 'do' block:
avro                      >         order <- o .:? ("order" :: Text) .!= Just Ascending
avro                      >     |
avro                      > 526 |     order <- o .:? ("order" :: Text)    .!= Just Ascending
avro                      >     |                     ^^^^^^^^^^^^^^^
avro                      > 
avro                      > /tmp/stack-e0cde8ed4b1cb895/avro-0.5.2.1/src/Data/Avro/Schema/Schema.hs:650:37: error:
avro                      >     • Couldn't match type: HashMap Text A.Value
avro                      >                      with: Data.Aeson.KeyMap.KeyMap A.Value
avro                      >       Expected: A.Object
avro                      >         Actual: HashMap Text A.Value
avro                      >     • In the first argument of ‘A.Object’, namely
avro                      >         ‘(HashMap.map toJSON mp)’
avro                      >       In the expression: A.Object (HashMap.map toJSON mp)
avro                      >       In a case alternative: DMap mp -> A.Object (HashMap.map toJSON mp)
avro                      >     |
avro                      > 650 |       DMap mp          -> A.Object (HashMap.map toJSON mp)
avro                      >     |                                     ^^^^^^^^^^^^^^^^^^^^^
avro                      > 
avro                      > /tmp/stack-e0cde8ed4b1cb895/avro-0.5.2.1/src/Data/Avro/Schema/Schema.hs:651:37: error:
avro                      >     • Couldn't match type: HashMap Text A.Value
avro                      >                      with: Data.Aeson.KeyMap.KeyMap A.Value
avro                      >       Expected: A.Object
avro                      >         Actual: HashMap Text A.Value
avro                      >     • In the first argument of ‘A.Object’, namely
avro                      >         ‘(HashMap.map toJSON flds)’
avro                      >       In the expression: A.Object (HashMap.map toJSON flds)
avro                      >       In a case alternative:
avro                      >           DRecord _ flds -> A.Object (HashMap.map toJSON flds)
avro                      >     |
avro                      > 651 |       DRecord _ flds   -> A.Object (HashMap.map toJSON flds)
avro                      >     |                                     ^^^^^^^^^^^^^^^^^^^^^^^
avro                      > 
avro                      > /tmp/stack-e0cde8ed4b1cb895/avro-0.5.2.1/src/Data/Avro/Schema/Schema.hs:653:36: error:
avro                      >     • Couldn't match expected type ‘A.Key’ with actual type ‘Text’
avro                      >     • In the first argument of ‘(.=)’, namely ‘typeName ty’
avro                      >       In the expression: typeName ty .= val
avro                      >       In the first argument of ‘object’, namely ‘[typeName ty .= val]’
avro                      >     |
avro                      > 653 |       DUnion _ ty val  -> object [ typeName ty .= val ]
avro                      >     |                                    ^^^^^^^^^^^
avro                      > 
avro                      > /tmp/stack-e0cde8ed4b1cb895/avro-0.5.2.1/src/Data/Avro/Schema/Schema.hs:777:35: error:
avro                      >     • Couldn't match type ‘Data.Aeson.KeyMap.KeyMap’
avro                      >                      with ‘HashMap Text’
avro                      >       Expected: Result (HashMap Text DefaultValue)
avro                      >         Actual: Result (Data.Aeson.KeyMap.KeyMap DefaultValue)
avro                      >     • In the second argument of ‘(<$>)’, namely
avro                      >         ‘mapM (parseAvroJSON union env mTy) obj’
avro                      >       In the expression: DMap <$> mapM (parseAvroJSON union env mTy) obj
avro                      >       In a case alternative:
avro                      >           Map mTy -> DMap <$> mapM (parseAvroJSON union env mTy) obj
avro                      >     |
avro                      > 777 |           Map mTy     -> DMap <$> mapM (parseAvroJSON union env mTy) obj
avro                      >     |                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
avro                      > 
avro                      > /tmp/stack-e0cde8ed4b1cb895/avro-0.5.2.1/src/Data/Avro/Schema/Schema.hs:780:53: error:
avro                      >     • Couldn't match type: Data.Aeson.KeyMap.KeyMap A.Value
avro                      >                      with: HashMap Text A.Value
avro                      >       Expected: HashMap Text A.Value
avro                      >         Actual: A.Object
avro                      >     • In the second argument of ‘HashMap.lookup’, namely ‘obj’
avro                      >       In the expression: HashMap.lookup (fldName f) obj
avro                      >       In the expression:
avro                      >         case HashMap.lookup (fldName f) obj of
avro                      >           Nothing
avro                      >             -> case fldDefault f of
avro                      >                  Just v -> return v
avro                      >                  Nothing
avro                      >                    -> fail
avro                      >                         $ "Decode failure: No record field '"
avro                      >                             <> T.unpack (fldName f) <> "' and no default in schema."
avro                      >           Just v -> parseAvroJSON union env (fldType f) v
avro                      >     |
avro                      > 780 |                     case HashMap.lookup (fldName f) obj of
avro                      >     |                                                     ^^^

Deconflicting occurs on full name, rather than baseName

I think the avro specification changed around 1.10 to be more flexible in resolution.

It now reads:

both schemas are enums whose (unqualified) names match
both schemas are fixed whose sizes and (unqualified) names match
both schemas are records with the same (unqualified) name

I think the current name checks can be relaxed a little. I would also note that the java implementation doesn't appear to check names at all on deconflicting a top level record.

Don't reexport the orphan Show (a -> b) instance

Is there a good reason why it's exported?

avro/src/Data/Avro/Schema.hs

Line 79 in 2a37013

import Text.Show.Functions ()

Question relative to integer types encoding

Hi,

Just a question: the Quickstart documentation states that the Int and Int64 Haskell types are matched with the long Avro type. However, in Data.Avro.Encode, Int and Int64 use avroInt instead of avroLong for their EncodeAvro instances. Is that normal?

Enable CCI to use orbs

@TomMD Hi Tom,
I'd like to use CircleCI Orb (https://circleci.com/orbs/registry/orb/haskell-works/haskell-build) for building this project.
You can read more about orbs here: https://circleci.com/docs/2.0/using-orbs/

Would it be possible to enable this settings for the Galois organisation in CircleCI?

Does not compile on `ghc-8.8.1`

Building library for avro-0.4.5.2..
[ 1 of 28] Compiling Data.Avro.Codec  ( src/Data/Avro/Codec.hs, dist/build/Data/Avro/Codec.o )
[ 2 of 28] Compiling Data.Avro.Decode.Lazy.LazyValue ( src/Data/Avro/Decode/Lazy/LazyValue.hs, dist/build/Data/Avro/Decode/Lazy/LazyValue.o )
[ 3 of 28] Compiling Data.Avro.Decode.Strict ( src/Data/Avro/Decode/Strict.hs, dist/build/Data/Avro/Decode/Strict.o )
[ 4 of 28] Compiling Data.Avro.Types.Value ( src/Data/Avro/Types/Value.hs, dist/build/Data/Avro/Types/Value.o )
[ 5 of 28] Compiling Data.Avro.Types  ( src/Data/Avro/Types.hs, dist/build/Data/Avro/Types.o )
[ 6 of 28] Compiling Data.Avro.Schema ( src/Data/Avro/Schema.hs, dist/build/Data/Avro/Schema.o )

src/Data/Avro/Schema.hs:523:3: error:
    ‘fail’ is not a (visible) method of class ‘Monad’
    |
523 |   fail = MF.fail
    |   ^^^^

JSON parsing/serialization for Avro types

Is there a convenient way to read/write types with FromAvro/ToAvro instances as JSON rather than Avro's binary format, using the corresponding Avro schema?

This would be particularly nice with the types generated by deriveAvro as it would let me easily support both formats.

If there's no easy way to do this as is, what would adding that to the library look like? Would a good option be to move from Avro.Value to Aeson.Value?

I'm up for implementing a PR if it isn't too tricky :).

'toAvro' does not work well in the presence of 'NamedType'

When implementing toAvro, sometimes you need to inspect the corresponding schema. However, if the original schema uses NamedType to reference other types (for example, in the implementation from HasAvroSchema), this is also what you get as argument. But at that point you have completely lost any context about what that NamedType is!

My suggestion is to make an invariant that toAvro is never called with a NamedType. The algorithm should ensure that any NamedType is resolved (or an error is issued if it cannot be found) before calling toAvro.

Release version of package with JSON module

Do we need anything else before we can release a version of avro to Hackage with the new JSON module exposed?

I believe I made some backwards-incompatible changes to how schemas are parsed (to match the Avro specification), so the new version should be 0.3.0.0 or something.

avro-0.6.0.1 failed to build in Stackage Nightly

/var/stackage/work/unpack-dir/unpacked/avro-0.6.0.1-b2a1f0c83e1ff48aabc88feacfe01f5c014ae710a7a023027dfa8e72b6795259/src/Data/Avro/Deriving/Lift.hs:29:10: error:
    Duplicate instance declarations:
      instance (Lift k, Lift v) => Lift (HashMap.HashMap k v)
        -- Defined at src/Data/Avro/Deriving/Lift.hs:29:10
      instance (Lift k, Lift v) => Lift (HashMap.HashMap k v)
        -- Defined in ‘Data.HashMap.Internal’
   |
29 | instance (Lift k, Lift v) => Lift (HashMap.HashMap k v) where
   |          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Avro and lazy decoding

I am not actually starting any work on this now and opening this issue as a discussion point.

The problem is that the decodeContainer has the following signature:

decodeContainer :: BL.ByteString -> Either String (Schema, [[T.Value Type]])

which means that in order to return the result it must know whether there was an error or a success for the entire structure. Which means that the value in Right is fully traversed before it is returned and that [[T.Value Type]] is never lazy.

It comes all the way from

getContainerWith :: (Schema -> Get a) -> Get (Schema, [[a]])

which decodes and decompresses the bytestring and then calls runGetOrFail once for the whole structure in this block (wrapping the result back to Get to hide its crime):

case runGetOrFail (replicateM nrObj getValue) bytes of
    Right (_,_,x) -> return x
    Left (_,_,s)  -> fail s

I am not very familiar with Get, but If we somehow could call runGetOrFail on each chunk separately, then we could be "lazier".

Or we can go all the way and have an error case as another constructor of Value so that we could lazily decode the structure without forcing the full traversal on it (which Either Error a does).

Does anyone else think it is worth improving and can offer their thoughts on the subject?

@newhoggy , @TikhonJelvis ?

Cannot resolve read/write schemas when NamedType fields are 'renamed'

{
  "type": "record",
  "name": "A",
  "namespace": "blah.contract.v2",
  "fields": [
    { "name": "id",             "type": "long" },
    { "name": "geo",
      "type": ["null", {
        "type": "record",
        "name": "Geo",
        "fields": [
          { "name": "source",
            "type": {
              "type": "record",
              "name": "Endpoint",
              "fields": [
                { "name": "ccs",   "type": {"type": "array", "items": "string"} }
              ]
            }
          },
          { "name": "dest", "type": "Endpoint" }
        ]
      }]
    }
  ]
}

{
  "type": "record",
  "name": "A",
  "namespace": "blah.contract.v2",
  "fields": [
    { "name": "id",             "type": "long" },
    { "name": "geo",
      "type": ["null", {
        "type": "record",
        "name": "Geo",
        "fields": [
          { "name": "dest",
            "type": {
              "type": "record",
              "name": "Endpoint",
              "fields": [
                { "name": "ccs",   "type": {"type": "array", "items": "string"} }
              ]
            }
          }
        ]
      }]
    }
  ]
}

i.e. The original has source and dest but I only want to decode dest so I remove source but leave the type definition in the remaining dest field.

The error produced is:

Unexpected value for 'blah.contract.v2.Endpoint': Error "Can not resolve differing writer and reader schemas: 

(NamedType "blah.contract.v2.Endpoint",
	Record {name = "blah.contract.v2.Endpoint",
		aliases = [],
		doc = Nothing,
		order = Just Ascending,
		fields = [
			Field {fldName = "ccs",
				fldAliases = [],
				fldDoc = Nothing,
				fldOrder = Just Ascending,
				fldType = Array {item = String},
				fldDefault = Nothing
				}
			]
		}
)

I removed some potentially sensitive information from these texts but if there looks like something is missing I can provide full dumps.

{To,From}Avro (Either a b) are Left-biased when (a ~ b)

Avro does not distinguish between Left and Right branch

>>> toAvro (Left "str" :: Either Text Text)
Union (String :| [String]) String (String "str")

>>> toAvro (Right "str" :: Either Text Text)
Union (String :| [String]) String (String "str")

so converting back from Avro gives a Left regardless of the original constructor

>>> fromAvro (toAvro (Left "str" :: Either Text Text)) :: Result (Either Text Text)
Success (Left "str")

>>> fromAvro (toAvro (Right "str" :: Either Text Text)) :: Result (Either Text Text)
Success (Left "str")

API for writing containers incrementally

I am constructing huge AVRO containers that I have to upload in multiple chunks. To help the use case It would be great if we had an API that let's you write containers incrementally:

packContainer
  :: (ToAvro a) 
  => Codec
  -> Schema -- ^ Writer schema
  -> ByteString -- ^ Sync bytes
  -> (Builder -- ^ Container header
       , [a] -> Builder -- ^ A function to feed a's and turn them into valid container blocks
       )

This allows to consume from a streaming source and fill buffers incrementally.

I can contribute if you think it'd be worthwhile.

Nullable unions doesn't compile

I'm trying to express an optional union type, and fail.

Given this schema:

{ 
  "name": "test",
  "type": ["null", "string", {"type": "map", "values": "int"}]
}

It crashes on compile with this error:

Exception when trying to run compile-time code:
      Avro type is not supported: Null
CallStack (from HasCallStack):
  error, called at src/Data/Avro/Deriving.hs:463:25 in avro-0.5.2.0-HIAMSWcivAy8F6u9uD512G:Data.Avro.Deriving
    Code: deriveAvroWithOptions
            withCustomNamespace
            "schemas/my-schema.avsc"

Without "null" in the union type this works fine.

Serialise to JSON

Is JSON encoding for serialisation supported? I've found some JSON related code in the library, but it seems mostly related to schemas. If I understand correctly, the JSON encoding is similar for serialisation and field default values, so maybe there's a way to reuse that? There doesn't seem to be a way to get a DefaultValue via ToAvro however.

The issue tracker link on Hackage does not work.

The issue tracker link on Hackage currently points to https://github.com/GaloisInc/avro.git/issues but should point to https://github.com/GaloisInc/avro/issues instead (avro vs avro.git).

I believe you can submit a revision to fix this—probably a better option than waiting for a new version of the package.

0.3.6.0 API breaking change

Hello! 0.3.6.0 adds a new field to the TN constructor in Data.Avro.Schema.TypeName.

This change breaks downstream users of that constructor - would you be open to making a new release called 0.4 (no code change, just version bump), to match the PVP convention? and maybe even further, marking 0.3.6.0 and 0.3.6.1 as deprecated?

I'm not blocked on this personally, but it would likely help other users.

Fails to build with Eta

I'm trying to build the avro-0.3.2.0 with Eta and it fails with the following error:

[16 of 17] Compiling Data.Avro.Deriving
src/Data/Avro/Deriving.hs:335:74: error:
    Not in scope: ‘e’
    In the Template Haskell quotation
      [| ($(mkText k), $(mkDefaultValue v)) e |]
    |
335 |         mkKVPair (k, v)         = [e| ($(mkText k), $(mkDefaultValue v)) e|]

I've made the following change to the line https://github.com/GaloisInc/avro/blob/master/src/Data/Avro/Deriving.hs#L335 to

mkKVPair (k, v)         = [e| ($(mkText k), $(mkDefaultValue v)) |]

Will this maintain the original intent?

Publish on Hackage

Can it be published on Hackage?

Add a benchmark suite

Performance matters when serializing and deserializing large amounts of data with avro. It would be great to have a benchmark suite to measure and track performance:

how long it takes to serialize/deserialize Avro.Value to binary
how much memory the package uses (including just the memory efficiency of Avro.Values)

This will help us make informed decisions about the performance of the package going forwards.

Defaults for Union types are not encoded/decoded correctly

@TikhonJelvis I have bumped into an issue where default values for union types are not encoded/decoded in Schema in a way that they can be respected by decoding functionality.

For this schema:

{
        "type": "record",
        "name": "Inner",
        "fields": [
          { "name": "id", "type": "int" },
          { "name": "smell", "type": ["null", "string"], "default": null }
        ]
}

the expected Haskell representation would be:

Record 
  { name = "Inner"
  , namespace = Nothing
  , aliases = []
  , doc = Nothing
  , order = Just Ascending
  , fields = 
      [ Field 
          { fldName = "id"
          , fldAliases = []
          , fldDoc = Nothing
          , fldOrder = Just Ascending
          , fldType = Int
          , fldDefault = Nothing
          }
      , Field 
          { fldName = "smell"
          , fldAliases = []
          , fldDoc = Nothing
          , fldOrder = Just Ascending
          , fldType = Union {options = Null :| [String], unionLookup = <function>}, fldDefault = Just (Union (Null :| [String]) Null Null)}
      ]
  }

Note that the fldDefault is annotated with its type of Union Null String.

However the current implementation gives me this:

Record 
  { name = "Inner"
  , namespace = Nothing
  , aliases = []
  , doc = Nothing
  , order = Just Ascending
  , fields = 
      [ Field 
          { fldName = "id"
          , fldAliases = []
          , fldDoc = Nothing
          , fldOrder = Just Ascending
          , fldType = Int
          , fldDefault = Nothing
          }
      , Field 
          { fldName = "smell"
          , fldAliases = []
          , fldDoc = Nothing
          , fldOrder = Just Ascending
          , fldType = Union {options = Null :| [String], unionLookup = <function>}, fldDefault = Just Null}
      ]
  }

Which doesn't annotate the union type for the default value.

Looking at the code I see that it has been done deliberately.
Is there anything that I am missing or should we fix the type annotation back?

I can encode Avro inconsistent with my schema

I just ran into a (more complex) bug which mostly reduces to the fact that if I have:

import qualified Data.Avro.Types as T
import qualified Data.Avro.Schema as S

newtype Foo = Foo Double

instance HasAvroSchema Foo where
  schema = pure S.Long

instance ToAvro Foo where
  toAvro (Foo val) = T.Double val

instance FromAvro Foo where
  ...

The problem is caught at decode time, and in a non-obvious way, not encode time and in an obvious way.

Is this intended, or is it an oversight?

deriveAvro / parse schema error

derriveAvro fails when a primitive type is defined as an object, i.e. :

{
  "name": "CREATETIME",
  "type": {
    "type": "long",
    "connect.version": 1,
    "connect.name": "org.apache.kafka.connect.data.Timestamp"
  }
}

According to Schema Declaration (https://avro.apache.org/docs/current/spec.html#schemas):

A Schema is represented in JSON by one of:

A JSON string, naming a defined type.

A JSON object, of the form:

{"type": "typeName" ...attributes...}

where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.

A JSON array, representing a union of embedded types.

So in the example type given it should be parsed as a primitive long and the additional fields ["connect.version", "connect.name"], captured as metadata or ignored.

It looks like the Nothing case at

avro/src/Data/Avro/Schema.hs

Line 344 in c5484e3

case logicalType of

falls through for non primitives.

Deconflicting record with different orderings

The schema resolution spec reads:

if both are records:

the ordering of fields may be different: fields are matched by name.

...

However, the implementation mandates the reader schema to be "more specified" than the writer schema.

Is this intended? Should I ensure that my schemas always have such bit on?

Add better support for logical types

Currently, logical types [1] are only supported "on the way in". We can correctly parse/decode schemas with logical types, but then the information about logical types is gone.

Ideally, we should preserve logical types in Schema.

[1] https://avro.apache.org/docs/1.8.2/spec.html#Logical+Types

Schemas with default values for fixed and bytes fields do not parse.

A schema containing a record with the following field:

{ "name": "binary",
        "type": {
          "type" : "fixed",
          "name" : "FixedField",
          "size" : 1
        },
        "default": "\u0000"
      }

will fail to parse because parseAvroJSON does not handle parsing JSON strings into Fixed or Bytes values:

Could not resolve type 'string' with expected type: Fixed {name = "FixedField", namespace = Nothing, aliases = [], size = 1}

The same problem for a field with type "bytes":

Could not resolve type 'string' with expected type: Bytes

`deriveAvro` example doesn't work

Hi,

I just started testing this library and the deriveAvro example doesn't seem to work:

person.avsc

{
  "type": "record",
  "name": "Person",
  "fields": [
    {
      "name": "name",
      "type": "string"
    }
  ]
}

MyModule.hs

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DeriveGeneric   #-}

module MyModule where

import Data.Avro.Deriving

deriveAvro "schemas/person.avsc"

Stack resolver 13.26

I get this compilation error:

    • Couldn't match expected type ‘GHC.Exts.Item
                                      (f0 (Data.Text.Internal.Text,
                                           Data.Avro.Types.Value.Value Data.Avro.Schema.Type))’
                  with actual type ‘(Data.Text.Internal.Text,
                                     Data.Avro.Types.Value.Value Data.Avro.Schema.Type)’
      The type variable ‘f0’ is ambiguous
    • In the expression:
        (Data.Text.pack "name" Data.Avro.ToAvro..= p_1_alvH)
      In the second argument of ‘Data.Avro.record’, namely
        ‘[(Data.Text.pack "name" Data.Avro.ToAvro..= p_1_alvH)]’
      In the expression:
        (Data.Avro.record schema'Person)
          [(Data.Text.pack "name" Data.Avro.ToAvro..= p_1_alvH)]
  |
8 | deriveAvro "schemas/person.avsc"
  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

However this compiles fine with deriveFromAvro

Is there something I missed? I may very well have done an error 😅

Cheers!

Why use entropy

Hi @TomMD,

Just a question. I see that you use entropy package for generating "The 16-byte, randomly-generated sync marker".
Can you explain why entropy and why "standard" random cannot be used?

Make logical types extensible

{ "name": "Decimal", "type": {
      "type": "bytes", "logicalType": "decimal", "precision": 7, "scale": 2
}}

As per Avro specification, logical types are optional for codecs to implement.
This library implements "basic" logical types, but they are hard coded into the library.

It would be nice to be able to make logical types extensible so that users of the library would be able to provide their own logical types.

Currently, Java and even .NET implementations allow "registering" new logical types.
We ought to be able to do it, too ;)

Names and namespaces are not handled correctly.

I see three separate problems with how names are currently handled:

a namespace of Nothing is semantically the same as Just ""
name and namespace should be unified into TypeName because we should never compare names without considering the namespaces
namespaces should be inferred following the rather ugly rules in the Avro specification

The first issue is a problem because Nothing /= Just "" even though they mean the same thing. Using Text instead of Maybe Text would remove this weird edgecase while also making the code simpler (including some of the Template Haskell code I'm working on).

The second issue is that equality between names in Avro is defined on the "fullname" (ie name + namespace). The easiest way to make this work in the Haskell code is moving namespace into TypeName:

data TypeName = TN
  { name      :: Text
  , namespace :: Text
  }

The third issue is something that should be handled when parsing or validating schemas: we need to infer namespaces following some tedious rules laid out in the Avro protocol:

In record, enum and fixed definitions, the fullname is determined in one of the following ways:

A name and namespace are both specified. For example, one might use "name": "X", "namespace": "org.foo" to indicate the fullname org.foo.X.

A fullname is specified. If the name specified contains a dot, then it is assumed to be a fullname, and any namespace also specified is ignored. For example, use "name": "org.foo.X" to indicate the fullname org.foo.X.

A name only is specified, i.e., a name that contains no dots. In this case the namespace is taken from the most tightly enclosing schema or protocol. For example, if "name": "X" is specified, and this occurs within a field of the record definition of org.foo.Y, then the fullname is org.foo.X. If there is no enclosing namespace then the null namespace is used.

References to previously defined names are as in the latter two cases above: if they contain a dot they are a fullname, if they do not contain a dot, the namespace is the namespace of the enclosing definition.

Adding code to infer namespaces sounds like a real pain. I think it would be reasonable to fix the first two issues and punt on the third one, at least for now—people can always work around this by always specifying explicit namespaces for named types.

Addressing these issues would involve some pretty broad breaking changes to the library. I'm happy to work on it at some point, but I'd love to hear thoughts about whether it's even worth doing and what the best approach would be.

Add support for blocks compression

Avro spec (https://avro.apache.org/docs/1.8.1/spec.html#Object+Container+Files) allows blocks within containers to be compressed.

This lib already supports such compression one way (when reading containers). It would be nice if compression was supported for both reading and writing containers.

Add toEncoding/fromEncoding methods to ToAvro/FromAvro.

Right now, we always use Value as an intermediate type for encoding and decoding Avro values.

Using an intermediate type like this for larger amounts of data is expensive. On an internal project, I found that using an intermediate representation like Value was about 3x slower and used about 40× more memory than going directly to/from normal Haskell types generated by TH.

Aeson has the same problem, with similar time/space overhead for using Aeson.Value on large inputs. Aeson solves this problem by having a toEncoding function as part of ToJSON that generates a bytestring directly rather than going through Aeson.Value.

We should add similar toEncoding and fromEncoding methods to ToAvro and FromAvro, and generate implementations for these methods for our TH types. I recently implemented this for some slightly different TH types on an internal project at Target and the logic was a lot simpler than I had thought—in fact, the logic to go directly to binary turned out to be simpler than going through an intermediate type! This change also removed a massive memory leak and improved serialization performance significantly.

Types redefined in schema of generated avro

There are some types that are redefined in the header's schema after encoding with encodeContainer

The smallest case I have found to recreate this issue is with the following schema.

{
  "type": "record",
  "name": "TwoBits",
  "fields": 
    [ { "name": "bit0",
        "type": 
          { "type": "enum",
            "name": "Bit",
            "symbols": [ "Zero", "One"]
          }
      }
    , { "name": "bit1",
        "type": "Bit"
      }
    ]
}

Data.Avro sets the following as the writer's schema after encoding with encodeContainer

{
  "name": "TwoBits",
  "type": "record",
  "aliases": [],
  "fields": 
    [ { "name": "bit0"
      , "type": 
          { "name": "Bit"
          , "type": "enum"
          , "symbols": ["Zero", "One"]
          , "aliases": []
          }
      , "aliases": []
      , "order": "ascending"
      }
    , { "name": "bit1"
      , "type": 
          { "name": "Bit"
          , "type": "enum"
          , "symbols": ["Zero", "One"]
          , "aliases": []
          }
      , "aliases": []
      , "order": "ascending"
    }
  ]
}

My understanding is that types should not be redefined due to the following statement in the spec

A schema or protocol may not contain multiple definitions of a fullname. Further, a name must be defined before it is used ("before" in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come "before" the messages attribute.)

HasBigDecimal 0.2 does not provide getScale and getValue

Building library for avro-0.6.1.1..                                                                                                 
[ 8 of 21] Compiling Data.Avro.Schema.Decimal                                                                                       
                                                                                                                                    
/var/stackage/work/unpack-dir/unpacked/avro-0.6.1.1-c8abdd5c2d67341d5f6a1d2d3b53ca282e000be1c281ae35e370b50a65e1b08c/src/Data/Avro/Schema/Decimal.hs:28:23: error:                                                                                                      
    Not in scope: ‘D.getScale’                                                                                                      
    Module ‘Data.BigDecimal’ does not export ‘getScale’.                                                                            
   |                                                                                                                                
28 |         new = if ss > D.getScale d                                                                                             
   |                       ^^^^^^^^^^                                                                                               
                                                                                                                                    
/var/stackage/work/unpack-dir/unpacked/avro-0.6.1.1-c8abdd5c2d67341d5f6a1d2d3b53ca282e000be1c281ae35e370b50a65e1b08c/src/Data/Avro/Schema/Decimal.hs:29:37: error:                                                                                                      
    Not in scope: ‘D.getValue’                                                                                                      
    Module ‘Data.BigDecimal’ does not export ‘getValue’.                                                                            
   |                                                                                                                                
29 |                  then D.BigDecimal (D.getValue d * 10 ^ (ss - D.getScale d)) ss                                                
   |                                     ^^^^^^^^^^                                                                                 
                                                                                                                                    
/var/stackage/work/unpack-dir/unpacked/avro-0.6.1.1-c8abdd5c2d67341d5f6a1d2d3b53ca282e000be1c281ae35e370b50a65e1b08c/src/Data/Avro/Schema/Decimal.hs:29:63: error:                                                                                                      
    Not in scope: ‘D.getScale’                                                                                                      
    Module ‘Data.BigDecimal’ does not export ‘getScale’.                                                                            
   |                                                                                                                                
29 |                  then D.BigDecimal (D.getValue d * 10 ^ (ss - D.getScale d)) ss                                                
   |                                                               ^^^^^^^^^^                                                       
                                                                                                                                    
/var/stackage/work/unpack-dir/unpacked/avro-0.6.1.1-c8abdd5c2d67341d5f6a1d2d3b53ca282e000be1c281ae35e370b50a65e1b08c/src/Data/Avro/Schema/Decimal.hs:33:37: error:                                                                                                      
    Not in scope: ‘D.getValue’                                                                                                      
    Module ‘Data.BigDecimal’ does not export ‘getValue’.                                                                            
   |                                                                                                                                
33 |           else Just $ fromInteger $ D.getValue new                                                                             
   |                                     ^^^^^^^^^^

haskell-works / avro Goto Github PK

avro's People

Contributors

Stargazers

Watchers

Forkers

avro's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs