haskell-works / avro
Haskell Avro Encoding and Decoding Native Support (no RPC)
License: BSD 3-Clause "New" or "Revised" License
I see in the README and here that deriving data types for n-ary unions is currently not supported. I assume this is because for n > 2 it is not clear what data type to generate.
In Scala's avro4s they use something akin to GHC.Generics :+: to represent n-ary unions, with the common usage being users specifying their own sum type and the compiler ensuring the shape of the sum type matches the generic shape. I was wondering if a similar approach can be done in Haskell?
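A minimal sketch of the idea, assuming nothing about this library's API: the user writes their own sum type, and a small GHC.Generics class walks the :+: structure of its representation. Here it only counts the branches, which is the piece needed to check the sum type's shape against a union of the same arity. NullOrStringOrInt, GBranches and branchCount are hypothetical names introduced for illustration.

```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeOperators #-}

import Data.Kind (Type)
import Data.Proxy (Proxy (..))
import GHC.Generics

-- Hypothetical user-written sum type for the union ["null", "string", "int"].
data NullOrStringOrInt
  = IsNull
  | IsString String
  | IsInt Int
  deriving (Show, Generic)

-- Counts the constructors of a generic representation.
class GBranches (f :: Type -> Type) where
  gBranches :: Proxy f -> Int

-- A sum of two representations has the branches of both sides.
instance (GBranches l, GBranches r) => GBranches (l :+: r) where
  gBranches _ = gBranches (Proxy :: Proxy l) + gBranches (Proxy :: Proxy r)

-- A single constructor is one branch, whatever its fields are.
instance GBranches (M1 C c f) where
  gBranches _ = 1

-- The datatype wrapper just delegates to its contents.
instance GBranches f => GBranches (M1 D c f) where
  gBranches _ = gBranches (Proxy :: Proxy f)

-- Number of constructors of any type with a Generic instance.
branchCount :: forall a. (Generic a, GBranches (Rep a)) => Proxy a -> Int
branchCount _ = gBranches (Proxy :: Proxy (Rep a))
```

With this, branchCount (Proxy :: Proxy NullOrStringOrInt) evaluates to 3, which could be compared against the number of options in the schema's union. Matching each constructor's field type against the corresponding union branch would follow the same pattern.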
The spec states
The "deflate" codec writes the data block using the deflate algorithm as specified in RFC 1951, and typically implemented using the zlib library. Note that this format (unlike the "zlib format" in RFC 1950) does not have a checksum.
Now, when I try to decode a container with codec = deflate, I get
Header error: Header checksum failed
which is an error message from pure-zlib. But according to the spec, neither a header nor checksums should be serialized; the raw format defined by RFC 1951 should be used instead.
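For debugging, it can help to check whether a block's first two bytes look like an RFC 1950 zlib header. The sketch below is a heuristic and not part of this library or pure-zlib; looksLikeZlibHeader is a hypothetical name. RFC 1950 requires the low nibble of the first byte (the compression method) to be 8 for deflate, and the two bytes read as a big-endian 16-bit number to be a multiple of 31.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Heuristic: True when the two bytes form a valid RFC 1950 (zlib) header.
-- A raw RFC 1951 stream can pass this test by coincidence, so treat a
-- True result as a hint rather than proof.
looksLikeZlibHeader :: Word8 -> Word8 -> Bool
looksLikeZlibHeader cmf flg =
  cmf .&. 0x0F == 8 && headerWord `mod` 31 == 0
  where
    -- CMF and FLG viewed as one 16-bit value, most significant byte first.
    headerWord :: Int
    headerWord = fromIntegral cmf * 256 + fromIntegral flg
```

For example, the common zlib prefix 0x78 0x9C passes the check, whereas the data blocks of an Avro container written with the deflate codec should normally not start with such a header.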
Right now, we have both a .cabal file and a package.yaml file in the repo. This is redundant and can cause problems: for example, #71 only updated the .cabal file, so the repo does not build with hpack any more. This was not caught by our CI which, presumably, just uses the .cabal file directly.
Ideally, we should only have one or the other. I propose getting rid of package.yaml altogether and sticking to a normal .cabal file.
Hi there!
I'm trying to get familiar with the library, and as an example I tried to expand on the README example and take a Haskell Data structure, encode it to Avro, write it to file and decode it from the file into the original Data structure again.
My first try was using the encode and decode functions and writing/reading to a file, but it produced invalid Avro (tested with avro-tools.jar).
So I decided to try the encodeContainer / decodeContainer functions instead and that worked, then I started to look into "why" that worked and found the following:
Avro includes a simple object container file format. A file has a schema, and all objects stored in the file must be written according to that schema, using binary encoding. Objects are stored in blocks that may be compressed. Synchronization markers are used between blocks to permit efficient splitting of files for MapReduce processing.
As I've been using Avro only through high level interfaces such as Spark/Hive before I was unaware of the "Object Container Files" specificity, and that records were organized in "blocks".
Does that mean encodeContainer and decodeContainer are the only suitable functions for dealing with files (because they deal with blocks), and encode / decode are only suitable for exchanging Avro messages, e.g. through HTTP requests?
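One quick way to tell the two kinds of output apart: per the Avro specification, an object container file starts with the four bytes 'O', 'b', 'j', 1, whereas a bare binary-encoded value has no header at all. A minimal sketch follows; avroMagic and looksLikeContainerFile are hypothetical helpers, not functions from this library.

```haskell
import qualified Data.ByteString as B

-- The container-file magic: ASCII "Obj" followed by the byte 1.
avroMagic :: B.ByteString
avroMagic = B.pack [0x4F, 0x62, 0x6A, 0x01]

-- True when the input starts with the object container file magic.
-- Output that was written as a bare encoded value will normally fail
-- this check, which is why tools expecting a container file reject it.
looksLikeContainerFile :: B.ByteString -> Bool
looksLikeContainerFile bytes = avroMagic `B.isPrefixOf` bytes
```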
According to the Avro spec, blocks can start with a negative number, which requires special handling:
If a block's count is negative, its absolute value is used, and the count is followed immediately by a long block size indicating the number of bytes in the block. This block size permits fast skipping through data, e.g., when projecting a record to a subset of its fields.
The implementation of getArray deals with this correctly:
getArray :: GetAvro ty => Get [ty]
getArray =
  do nr <- getLong
     if
       | nr == 0 -> return []
       | nr < 0 ->
           do _len <- getLong
              rs <- replicateM (fromIntegral (abs nr)) getAvro
              (rs <>) <$> getArray
       | otherwise ->
           do rs <- replicateM (fromIntegral nr) getAvro
              (rs <>) <$> getArray
getMap, however, doesn't handle this case:
getMap :: GetAvro ty => Get (Map.Map Text ty)
getMap = go Map.empty
  where
    go acc =
      do nr <- getLong
         if nr == 0
           then return acc
           else do m <- Map.fromList <$> replicateM (fromIntegral nr) getKVs
                   go (Map.union m acc)
    getKVs = (,) <$> getString <*> getAvro
This is an easy fix, but I don't have time to implement it right now, so I'm opening an issue to keep track of it.
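A self-contained sketch of the block handling that both arrays and maps need, written against the binary package and not against this library's internals. getLong here is a local re-implementation of Avro's zig-zag varint decoding, and getBlocks and decodeLongs are hypothetical helpers. For a map, the item parser would read a key/value pair.

```haskell
import Control.Monad (replicateM, void, when)
import Data.Binary.Get (Get, getWord8, runGet)
import Data.Bits (shiftL, shiftR, testBit, xor, (.&.), (.|.))
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word64)

-- Avro long: base-128 varint (low 7 bits first) holding a zig-zag value.
getLong :: Get Int64
getLong = unZigZag <$> go 0 0
  where
    go :: Int -> Word64 -> Get Word64
    go shift acc = do
      b <- getWord8
      let acc' = acc .|. (fromIntegral (b .&. 0x7F) `shiftL` shift)
      if testBit b 7 then go (shift + 7) acc' else return acc'

    unZigZag :: Word64 -> Int64
    unZigZag w = fromIntegral (w `shiftR` 1) `xor` negate (fromIntegral (w .&. 1))

-- Reads a sequence of blocks terminated by a zero count. A negative
-- count means: use the absolute value, and skip the byte-size long
-- that immediately follows the count.
getBlocks :: Get a -> Get [a]
getBlocks getItem = do
  nr <- getLong
  if nr == 0
    then return []
    else do
      when (nr < 0) (void getLong)
      items <- replicateM (fromIntegral (abs nr)) getItem
      (items ++) <$> getBlocks getItem

-- Usage example: decode an Avro array of longs.
decodeLongs :: BL.ByteString -> [Int64]
decodeLongs = runGet (getBlocks getLong)
```

Writing getMap in terms of such a shared helper would keep the negative-count case from being handled in one place and forgotten in the other.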
This is more of a request for debugging assistance than a bug report. I think.
I'm trying to decode RDK-B messages that use this schema: https://code.rdkcentral.com/r/plugins/gitiles/rdkb/components/opensource/ccsp/CcspLMLite/+/2f08d4d145699f31310c942199a62eb1e9f4d8b3/config/InterfaceDevicesWifi.avsc.
The actual blobs that I'm trying to decode I can't provide in this report. If it'd help, I might be able to try reproducing the issue with QuickCheck.
When I do a decode on a blob that's supposed to have this schema, I get:
Error "Unexpected value when decoding for 'Double': Record (...)
where the (...) is the entire record, AFAICT, that’s been decoded. What’s particularly weird about this is that there’s no use of double anywhere in the schema. I can provide the -ddump-splices TH output, but there's no reference to Double anywhere in the generated code, either.
Where should I start in trying to debug this?
buildTypeEnvironment on older versions of the package doesn't handle maps properly.
case ty of
  Record {..} -> mk name aliases namespace ++ concatMap (go . fldType) fields
  Enum {..} -> mk name aliases namespace
  Union {..} -> concatMap go options
  Fixed {..} -> mk name aliases namespace
  Array {..} -> go item
  _ -> []
This case should also recurse on Map {..}:
  Map {..} -> go values
This leads to errors on schemas that define a named type in a map and then refer to it:
{
  "name" : "Foo",
  "type" : "record",
  "fields" : [
    {
      "name" : "bar",
      "type" : {
        "type" : "map",
        "values" : {
          "name" : "Bar",
          "type" : "record",
          "fields" : []
        }
      }
    },
    {
      "name" : "bar2",
      "type" : "Bar"
    }
  ]
}
Decoding an object with this schema will result in an error about the type "Bar" not being in scope.
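To illustrate why the missing case matters, here is a sketch on a deliberately tiny stand-in for the schema type (Ty and definedNames are hypothetical and much simpler than Data.Avro.Schema). The traversal collects every named type defined in a schema; without the MapOf case, "Bar" would never be found for the schema above.

```haskell
-- A tiny stand-in for the schema type, just enough for the example.
data Ty
  = Str                       -- a primitive
  | Ref String                -- reference to a previously defined name
  | Arr Ty                    -- array with an item type
  | MapOf Ty                  -- map with a value type
  | Rec String [(String, Ty)] -- record with a name and fields

-- Collects the names of all types defined inside a schema, in
-- depth-first, left-to-right order.
definedNames :: Ty -> [String]
definedNames ty = case ty of
  Rec name fields -> name : concatMap (definedNames . snd) fields
  Arr item        -> definedNames item
  MapOf values    -> definedNames values -- the case missing in the issue
  _               -> []

-- The schema from the issue: "Bar" is defined inside a map and then
-- referenced by name from a second field.
fooSchema :: Ty
fooSchema = Rec "Foo" [("bar", MapOf (Rec "Bar" [])), ("bar2", Ref "Bar")]
```

Here definedNames fooSchema yields ["Foo", "Bar"], so the later reference to "Bar" can be resolved.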
It looks like I actually fixed this in my namespace PR (although I don't remember doing that explicitly). The minimum we can do to fix this is release a new version with that PR included (it has already been merged on master.)
This bug isn't specific to namespaces, so it might be worth releasing bug-fixes for older versions of avro on Hackage as well.
avro > /tmp/stack-e0cde8ed4b1cb895/avro-0.5.2.1/src/Data/Avro/Schema/Schema.hs:526:21: error:
avro > • Couldn't match expected type ‘A.Key’ with actual type ‘Text’
avro > • In the second argument of ‘(.:?)’, namely ‘("order" :: Text)’
avro > In the first argument of ‘(.!=)’, namely ‘o .:? ("order" :: Text)’
avro > In a stmt of a 'do' block:
avro > order <- o .:? ("order" :: Text) .!= Just Ascending
avro > |
avro > 526 | order <- o .:? ("order" :: Text) .!= Just Ascending
avro > | ^^^^^^^^^^^^^^^
avro >
avro > /tmp/stack-e0cde8ed4b1cb895/avro-0.5.2.1/src/Data/Avro/Schema/Schema.hs:650:37: error:
avro > • Couldn't match type: HashMap Text A.Value
avro > with: Data.Aeson.KeyMap.KeyMap A.Value
avro > Expected: A.Object
avro > Actual: HashMap Text A.Value
avro > • In the first argument of ‘A.Object’, namely
avro > ‘(HashMap.map toJSON mp)’
avro > In the expression: A.Object (HashMap.map toJSON mp)
avro > In a case alternative: DMap mp -> A.Object (HashMap.map toJSON mp)
avro > |
avro > 650 | DMap mp -> A.Object (HashMap.map toJSON mp)
avro > | ^^^^^^^^^^^^^^^^^^^^^
avro >
avro > /tmp/stack-e0cde8ed4b1cb895/avro-0.5.2.1/src/Data/Avro/Schema/Schema.hs:651:37: error:
avro > • Couldn't match type: HashMap Text A.Value
avro > with: Data.Aeson.KeyMap.KeyMap A.Value
avro > Expected: A.Object
avro > Actual: HashMap Text A.Value
avro > • In the first argument of ‘A.Object’, namely
avro > ‘(HashMap.map toJSON flds)’
avro > In the expression: A.Object (HashMap.map toJSON flds)
avro > In a case alternative:
avro > DRecord _ flds -> A.Object (HashMap.map toJSON flds)
avro > |
avro > 651 | DRecord _ flds -> A.Object (HashMap.map toJSON flds)
avro > | ^^^^^^^^^^^^^^^^^^^^^^^
avro >
avro > /tmp/stack-e0cde8ed4b1cb895/avro-0.5.2.1/src/Data/Avro/Schema/Schema.hs:653:36: error:
avro > • Couldn't match expected type ‘A.Key’ with actual type ‘Text’
avro > • In the first argument of ‘(.=)’, namely ‘typeName ty’
avro > In the expression: typeName ty .= val
avro > In the first argument of ‘object’, namely ‘[typeName ty .= val]’
avro > |
avro > 653 | DUnion _ ty val -> object [ typeName ty .= val ]
avro > | ^^^^^^^^^^^
avro >
avro > /tmp/stack-e0cde8ed4b1cb895/avro-0.5.2.1/src/Data/Avro/Schema/Schema.hs:777:35: error:
avro > • Couldn't match type ‘Data.Aeson.KeyMap.KeyMap’
avro > with ‘HashMap Text’
avro > Expected: Result (HashMap Text DefaultValue)
avro > Actual: Result (Data.Aeson.KeyMap.KeyMap DefaultValue)
avro > • In the second argument of ‘(<$>)’, namely
avro > ‘mapM (parseAvroJSON union env mTy) obj’
avro > In the expression: DMap <$> mapM (parseAvroJSON union env mTy) obj
avro > In a case alternative:
avro > Map mTy -> DMap <$> mapM (parseAvroJSON union env mTy) obj
avro > |
avro > 777 | Map mTy -> DMap <$> mapM (parseAvroJSON union env mTy) obj
avro > | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
avro >
avro > /tmp/stack-e0cde8ed4b1cb895/avro-0.5.2.1/src/Data/Avro/Schema/Schema.hs:780:53: error:
avro > • Couldn't match type: Data.Aeson.KeyMap.KeyMap A.Value
avro > with: HashMap Text A.Value
avro > Expected: HashMap Text A.Value
avro > Actual: A.Object
avro > • In the second argument of ‘HashMap.lookup’, namely ‘obj’
avro > In the expression: HashMap.lookup (fldName f) obj
avro > In the expression:
avro > case HashMap.lookup (fldName f) obj of
avro > Nothing
avro > -> case fldDefault f of
avro > Just v -> return v
avro > Nothing
avro > -> fail
avro > $ "Decode failure: No record field '"
avro > <> T.unpack (fldName f) <> "' and no default in schema."
avro > Just v -> parseAvroJSON union env (fldType f) v
avro > |
avro > 780 | case HashMap.lookup (fldName f) obj of
avro > | ^^^
I think the avro specification changed around 1.10 to be more flexible in resolution.
It now reads:
both schemas are enums whose (unqualified) names match
both schemas are fixed whose sizes and (unqualified) names match
both schemas are records with the same (unqualified) name
I think the current name checks can be relaxed a little. I would also note that the java implementation doesn't appear to check names at all on deconflicting a top level record.
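A relaxed check along these lines could compare only the part of the name after the last dot. This is a sketch under that assumption; unqualified and namesMatch are hypothetical helpers, and aliases are not considered here.

```haskell
-- The part of a (possibly qualified) name after its last dot.
unqualified :: String -> String
unqualified = reverse . takeWhile (/= '.') . reverse

-- Writer and reader names match when their unqualified parts agree,
-- regardless of namespace.
namesMatch :: String -> String -> Bool
namesMatch writer reader = unqualified writer == unqualified reader
```

For example, namesMatch "org.foo.X" "X" is True, while namesMatch "a.X" "b.Y" is False.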
Is there a good reason why it's exported?
Line 79 in 2a37013
Hi,
Just a question: the Quickstart documentation states that the Int and Int64 Haskell types are matched with the long Avro type. However, in Data.Avro.Encode, Int and Int64 use avroInt instead of avroLong for their EncodeAvro instances. Is that normal?
@TomMD Hi Tom,
I'd like to use CircleCI Orb (https://circleci.com/orbs/registry/orb/haskell-works/haskell-build) for building this project.
You can read more about orbs here: https://circleci.com/docs/2.0/using-orbs/
Would it be possible to enable this setting for the Galois organisation in CircleCI?
Building library for avro-0.4.5.2..
[ 1 of 28] Compiling Data.Avro.Codec ( src/Data/Avro/Codec.hs, dist/build/Data/Avro/Codec.o )
[ 2 of 28] Compiling Data.Avro.Decode.Lazy.LazyValue ( src/Data/Avro/Decode/Lazy/LazyValue.hs, dist/build/Data/Avro/Decode/Lazy/LazyValue.o )
[ 3 of 28] Compiling Data.Avro.Decode.Strict ( src/Data/Avro/Decode/Strict.hs, dist/build/Data/Avro/Decode/Strict.o )
[ 4 of 28] Compiling Data.Avro.Types.Value ( src/Data/Avro/Types/Value.hs, dist/build/Data/Avro/Types/Value.o )
[ 5 of 28] Compiling Data.Avro.Types ( src/Data/Avro/Types.hs, dist/build/Data/Avro/Types.o )
[ 6 of 28] Compiling Data.Avro.Schema ( src/Data/Avro/Schema.hs, dist/build/Data/Avro/Schema.o )
src/Data/Avro/Schema.hs:523:3: error:
‘fail’ is not a (visible) method of class ‘Monad’
|
523 | fail = MF.fail
| ^^^^
Is there a convenient way to read/write types with FromAvro/ToAvro instances as JSON rather than Avro's binary format, using the corresponding Avro schema?
This would be particularly nice with the types generated by deriveAvro as it would let me easily support both formats.
If there's no easy way to do this as is, what would adding that to the library look like? Would a good option be to move from Avro.Value to Aeson.Value?
I'm up for implementing a PR if it isn't too tricky :).
When implementing toAvro, sometimes you need to inspect the corresponding schema. However, if the original schema uses NamedType to reference other types (for example, in the implementation from HasAvroSchema), this is also what you get as argument. But at that point you have completely lost any context about what that NamedType is!
My suggestion is to make an invariant that toAvro is never called with a NamedType. The algorithm should ensure that any NamedType is resolved (or an error is issued if it cannot be found) before calling toAvro.
Do we need anything else before we can release a version of avro to Hackage with the new JSON module exposed?
I believe I made some backwards-incompatible changes to how schemas are parsed (to match the Avro specification), so the new version should be 0.3.0.0 or something.
/var/stackage/work/unpack-dir/unpacked/avro-0.6.0.1-b2a1f0c83e1ff48aabc88feacfe01f5c014ae710a7a023027dfa8e72b6795259/src/Data/Avro/Deriving/Lift.hs:29:10: error:
Duplicate instance declarations:
instance (Lift k, Lift v) => Lift (HashMap.HashMap k v)
-- Defined at src/Data/Avro/Deriving/Lift.hs:29:10
instance (Lift k, Lift v) => Lift (HashMap.HashMap k v)
-- Defined in ‘Data.HashMap.Internal’
|
29 | instance (Lift k, Lift v) => Lift (HashMap.HashMap k v) where
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I am not actually starting any work on this now and opening this issue as a discussion point.
The problem is that decodeContainer has the following signature:
decodeContainer :: BL.ByteString -> Either String (Schema, [[T.Value Type]])
which means that in order to return the result it must know whether there was an error or a success for the entire structure. This means that the value in Right is fully traversed before it is returned and that [[T.Value Type]] is never lazy.
It comes all the way from
getContainerWith :: (Schema -> Get a) -> Get (Schema, [[a]])
which decodes and decompresses the bytestring and then calls runGetOrFail once for the whole structure in this block (wrapping the result back to Get to hide its crime):
case runGetOrFail (replicateM nrObj getValue) bytes of
  Right (_,_,x) -> return x
  Left (_,_,s) -> fail s
I am not very familiar with Get, but if we somehow could call runGetOrFail on each chunk separately, then we could be "lazier".
Or we can go all the way and have an error case as another constructor of Value so that we could lazily decode the structure without forcing the full traversal on it (which Either Error a does).
Does anyone else think it is worth improving and can offer their thoughts on the subject?
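As a rough illustration of the "call runGetOrFail on each chunk" idea, here is a sketch using only the binary package. decodeEach is a hypothetical helper, not part of this library. It runs the parser once per item and produces the list lazily, with a decoding error appearing as a final Left element, so nothing has to be known about the whole structure up front.

```haskell
import Data.Binary.Get (Get, getWord16be, runGetOrFail)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word16)

-- Decodes items one at a time. Each list cell is only computed when it
-- is demanded, and a failure ends the list with a single Left.
decodeEach :: Get a -> BL.ByteString -> [Either String a]
decodeEach getItem bytes
  | BL.null bytes = []
  | otherwise =
      case runGetOrFail getItem bytes of
        Left (_, _, err)   -> [Left err]
        Right (rest, _, x) -> Right x : decodeEach getItem rest

-- Usage example: two complete 16-bit values followed by a stray byte,
-- so the result is two Right values and then one Left.
example :: [Either String Word16]
example = decodeEach getWord16be (BL.pack [0, 1, 0, 2, 9])
```

A consumer that only needs the first few values never forces the rest of the input. The cost is that callers must handle errors per element and not once for the whole container.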
{
  "type": "record",
  "name": "A",
  "namespace": "blah.contract.v2",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "geo",
      "type": ["null", {
        "type": "record",
        "name": "Geo",
        "fields": [
          { "name": "source",
            "type": {
              "type": "record",
              "name": "Endpoint",
              "fields": [
                { "name": "ccs", "type": {"type": "array", "items": "string"} }
              ]
            }
          },
          { "name": "dest", "type": "Endpoint" }
        ]
      }]
    }
  ]
}
to
{
  "type": "record",
  "name": "A",
  "namespace": "blah.contract.v2",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "geo",
      "type": ["null", {
        "type": "record",
        "name": "Geo",
        "fields": [
          { "name": "dest",
            "type": {
              "type": "record",
              "name": "Endpoint",
              "fields": [
                { "name": "ccs", "type": {"type": "array", "items": "string"} }
              ]
            }
          }
        ]
      }]
    }
  ]
}
i.e. the original has source and dest, but I only want to decode dest, so I remove source but leave the type definition in the remaining dest field.
The error produced is:
Unexpected value for 'blah.contract.v2.Endpoint': Error "Can not resolve differing writer and reader schemas:
(NamedType "blah.contract.v2.Endpoint",
Record {name = "blah.contract.v2.Endpoint",
aliases = [],
doc = Nothing,
order = Just Ascending,
fields = [
Field {fldName = "ccs",
fldAliases = [],
fldDoc = Nothing,
fldOrder = Just Ascending,
fldType = Array {item = String},
fldDefault = Nothing
}
]
}
)
I removed some potentially sensitive information from these texts, but if it looks like something is missing I can provide full dumps.
Avro does not distinguish between the Left and Right branches:
>>> toAvro (Left "str" :: Either Text Text)
Union (String :| [String]) String (String "str")
>>> toAvro (Right "str" :: Either Text Text)
Union (String :| [String]) String (String "str")
so converting back from Avro gives a Left regardless of the original constructor:
>>> fromAvro (toAvro (Left "str" :: Either Text Text)) :: Result (Either Text Text)
Success (Left "str")
>>> fromAvro (toAvro (Right "str" :: Either Text Text)) :: Result (Either Text Text)
Success (Left "str")
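One way to keep the two constructors apart is to carry the branch position along with the value, which is also how Avro's binary encoding identifies a union branch (by its zero-based position). The sketch below uses a hypothetical Tagged type and is not how this library represents unions. Note also that, as far as I can tell from the spec, a union with two branches of the same unnamed type is not valid Avro in the first place.

```haskell
-- A value together with the zero-based position of its union branch.
data Tagged a = Tagged Int a
  deriving (Eq, Show)

-- Left goes to branch 0, Right to branch 1.
encodeEither :: Either a a -> Tagged a
encodeEither (Left x)  = Tagged 0 x
encodeEither (Right x) = Tagged 1 x

-- The branch position, not the payload's type, selects the constructor.
decodeEither :: Tagged a -> Maybe (Either a a)
decodeEither (Tagged 0 x) = Just (Left x)
decodeEither (Tagged 1 x) = Just (Right x)
decodeEither _            = Nothing
```

With the position preserved, decodeEither (encodeEither (Right "str")) gives back Just (Right "str").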
I am constructing huge Avro containers that I have to upload in multiple chunks. To help with this use case, it would be great if we had an API that lets you write containers incrementally:
packContainer
  :: (ToAvro a)
  => Codec
  -> Schema -- ^ Writer schema
  -> ByteString -- ^ Sync bytes
  -> ( Builder -- ^ Container header
     , [a] -> Builder -- ^ A function to feed a's and turn them into valid container blocks
     )
This allows consuming from a streaming source and filling buffers incrementally.
I can contribute if you think it'd be worthwhile.
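For reference, the block-building half could look roughly like the sketch below. It only frames already-serialized objects without compression, following the block layout in the Avro spec: object count, byte size, the serialized objects, then the 16-byte sync marker. putLong, packBlock and renderBlock are hypothetical helpers written against bytestring, not this library's API.

```haskell
import Data.Bits (shiftL, shiftR, xor, (.&.), (.|.))
import qualified Data.ByteString as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word64, Word8)

-- Avro long: zig-zag encode, then emit as a base-128 varint.
putLong :: Int64 -> BB.Builder
putLong n = go (fromIntegral ((n `shiftL` 1) `xor` (n `shiftR` 63)))
  where
    go :: Word64 -> BB.Builder
    go w
      | w < 0x80  = BB.word8 (fromIntegral w)
      | otherwise = BB.word8 (fromIntegral (w .&. 0x7F) .|. 0x80) <> go (w `shiftR` 7)

-- One container block: count, size in bytes, the objects, sync marker.
-- The objects are assumed to be binary-encoded already.
packBlock :: B.ByteString -> [B.ByteString] -> BB.Builder
packBlock sync objects =
  putLong (fromIntegral (length objects))
    <> putLong (fromIntegral (B.length payload))
    <> BB.byteString payload
    <> BB.byteString sync
  where
    payload = B.concat objects

-- Usage example: render a block to a list of bytes for inspection.
renderBlock :: B.ByteString -> [B.ByteString] -> [Word8]
renderBlock sync objects = BL.unpack (BB.toLazyByteString (packBlock sync objects))
```

Because each call yields an independent Builder, a streaming source can be consumed batch by batch and each block uploaded as soon as it is built.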
I'm trying to express an optional union type, and fail.
Given this schema:
{
  "name": "test",
  "type": ["null", "string", {"type": "map", "values": "int"}]
}
It crashes on compile with this error:
Exception when trying to run compile-time code:
  Avro type is not supported: Null
CallStack (from HasCallStack):
  error, called at src/Data/Avro/Deriving.hs:463:25 in avro-0.5.2.0-HIAMSWcivAy8F6u9uD512G:Data.Avro.Deriving
Code: deriveAvroWithOptions withCustomNamespace "schemas/my-schema.avsc"
Without "null" in the union type this works fine.
Is JSON encoding for serialisation supported? I've found some JSON related code in the library, but it seems mostly related to schemas. If I understand correctly, the JSON encoding is similar for serialisation and field default values, so maybe there's a way to reuse that? There doesn't seem to be a way to get a DefaultValue via ToAvro however.
The issue tracker link on Hackage currently points to https://github.com/GaloisInc/avro.git/issues but should point to https://github.com/GaloisInc/avro/issues instead (avro vs avro.git).
I believe you can submit a revision to fix this, which is probably a better option than waiting for a new version of the package.
Hello! 0.3.6.0 adds a new field to the TN constructor in Data.Avro.Schema.TypeName.
This change breaks downstream users of that constructor - would you be open to making a new release called 0.4 (no code change, just version bump), to match the PVP convention? and maybe even further, marking 0.3.6.0 and 0.3.6.1 as deprecated?
I'm not blocked on this personally, but it would likely help other users.
I'm trying to build the avro-0.3.2.0 with Eta and it fails with the following error:
[16 of 17] Compiling Data.Avro.Deriving
src/Data/Avro/Deriving.hs:335:74: error:
Not in scope: ‘e’
In the Template Haskell quotation
[| ($(mkText k), $(mkDefaultValue v)) e |]
|
335 | mkKVPair (k, v) = [e| ($(mkText k), $(mkDefaultValue v)) e|]
I've made the following change to the line https://github.com/GaloisInc/avro/blob/master/src/Data/Avro/Deriving.hs#L335 to
mkKVPair (k, v) = [e| ($(mkText k), $(mkDefaultValue v)) |]
Will this maintain the original intent?
Can it be published on Hackage?
Performance matters when serializing and deserializing large amounts of data with avro. It would be great to have a benchmark suite to measure and track performance:
- Avro.Value to binary
- binary to Avro.Values
This will help us make informed decisions about the performance of the package going forwards.
@TikhonJelvis I have bumped into an issue where default values for union types are not encoded/decoded in Schema in a way that they can be respected by decoding functionality.
For this schema:
{
  "type": "record",
  "name": "Inner",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "smell", "type": ["null", "string"], "default": null }
  ]
}
the expected Haskell representation would be:
Record
  { name = "Inner"
  , namespace = Nothing
  , aliases = []
  , doc = Nothing
  , order = Just Ascending
  , fields =
      [ Field
          { fldName = "id"
          , fldAliases = []
          , fldDoc = Nothing
          , fldOrder = Just Ascending
          , fldType = Int
          , fldDefault = Nothing
          }
      , Field
          { fldName = "smell"
          , fldAliases = []
          , fldDoc = Nothing
          , fldOrder = Just Ascending
          , fldType = Union {options = Null :| [String], unionLookup = <function>}
          , fldDefault = Just (Union (Null :| [String]) Null Null)
          }
      ]
  }
Note that the fldDefault is annotated with its type of Union Null String.
However the current implementation gives me this:
Record
  { name = "Inner"
  , namespace = Nothing
  , aliases = []
  , doc = Nothing
  , order = Just Ascending
  , fields =
      [ Field
          { fldName = "id"
          , fldAliases = []
          , fldDoc = Nothing
          , fldOrder = Just Ascending
          , fldType = Int
          , fldDefault = Nothing
          }
      , Field
          { fldName = "smell"
          , fldAliases = []
          , fldDoc = Nothing
          , fldOrder = Just Ascending
          , fldType = Union {options = Null :| [String], unionLookup = <function>}
          , fldDefault = Just Null
          }
      ]
  }
Which doesn't annotate the union type for the default value.
Looking at the code I see that it has been done deliberately.
Is there anything that I am missing or should we fix the type annotation back?
I just ran into a (more complex) bug which mostly reduces to the fact that if I have:
import qualified Data.Avro.Types as T
import qualified Data.Avro.Schema as S
newtype Foo = Foo Double
instance HasAvroSchema Foo where
  schema = pure S.Long
instance ToAvro Foo where
  toAvro (Foo val) = T.Double val
instance FromAvro Foo where
  ...
The problem is caught at decode time, and in a non-obvious way, not encode time and in an obvious way.
Is this intended, or is it an oversight?
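If catching this at encode time is desirable, the check itself is small. The sketch below uses tiny stand-in types (Ty, Val, typeOfVal and checkedEncode are hypothetical, not the library's types) to show a value being compared against its declared schema before any bytes are written.

```haskell
-- Tiny stand-ins for schema types and values.
data Ty = TLong | TDouble | TString
  deriving (Eq, Show)

data Val = VLong Integer | VDouble Double | VString String
  deriving (Eq, Show)

-- The schema type that a value actually has.
typeOfVal :: Val -> Ty
typeOfVal (VLong _)   = TLong
typeOfVal (VDouble _) = TDouble
typeOfVal (VString _) = TString

-- Refuses to encode a value whose shape disagrees with the declared
-- schema, so the mismatch is reported where it is introduced.
checkedEncode :: Ty -> Val -> Either String Val
checkedEncode declared val
  | typeOfVal val == declared = Right val
  | otherwise =
      Left ("schema says " ++ show declared ++ " but value is " ++ show (typeOfVal val))
```

With the Foo instances above, the declared schema is a long and the produced value is a double, so a check of this kind would fail immediately with a message naming both.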
deriveAvro fails when a primitive type is defined as an object, i.e.:
{
  "name": "CREATETIME",
  "type": {
    "type": "long",
    "connect.version": 1,
    "connect.name": "org.apache.kafka.connect.data.Timestamp"
  }
}
According to Schema Declaration (https://avro.apache.org/docs/current/spec.html#schemas):
A Schema is represented in JSON by one of:
- A JSON string, naming a defined type.
- A JSON object, of the form:
{"type": "typeName" ...attributes...}
where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
- A JSON array, representing a union of embedded types.
So in the example given, the type should be parsed as a primitive long, with the additional fields ["connect.version", "connect.name"] captured as metadata or ignored.
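The intended behaviour can be sketched on a toy JSON type (Json, Ty, parseTy and byName are hypothetical and only cover primitives, unions and name references; complex types such as records are deliberately left out). An object whose "type" attribute is a string is treated exactly like that string, and any other attributes are ignored as metadata.

```haskell
-- A toy JSON representation, enough for schema snippets.
data Json
  = JString String
  | JNumber Double
  | JArray [Json]
  | JObject [(String, Json)]

data Ty = TLong | TString | TUnion [Ty] | TNamed String
  deriving (Eq, Show)

-- Parses the three JSON forms of a schema listed in the spec.
parseTy :: Json -> Either String Ty
parseTy (JString n)  = Right (byName n)
parseTy (JArray ts)  = TUnion <$> traverse parseTy ts
parseTy (JNumber _)  = Left "a number is not a schema"
parseTy (JObject kvs) =
  case lookup "type" kvs of
    Just (JString n) -> Right (byName n) -- extra attributes are metadata
    Just nested      -> parseTy nested
    Nothing          -> Left "schema object has no \"type\" attribute"

-- Primitive names map to primitives; anything else is a name reference.
byName :: String -> Ty
byName "long"   = TLong
byName "string" = TString
byName other    = TNamed other
```

Under this reading, the CREATETIME type above parses to the same result as the plain string "long".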
It looks like the Nothing case at
Line 344 in c5484e3
The schema resolution spec reads:
if both are records:
- the ordering of fields may be different: fields are matched by name.
- ...
However, the implementation mandates the reader schema to be "more specified" than the writer schema.
Is this intended? Should I ensure that my schemas always have such bit on?
Currently, logical types [1] are only supported "on the way in". We can correctly parse/decode schemas with logical types, but then the information about logical types is gone.
Ideally, we should preserve logical types in Schema.
[1] https://avro.apache.org/docs/1.8.2/spec.html#Logical+Types
A schema containing a record with the following field:
{ "name": "binary",
  "type": {
    "type" : "fixed",
    "name" : "FixedField",
    "size" : 1
  },
  "default": "\u0000"
}
will fail to parse because parseAvroJSON does not handle parsing JSON strings into Fixed or Bytes values:
Could not resolve type 'string' with expected type: Fixed {name = "FixedField", namespace = Nothing, aliases = [], size = 1}
The same problem for a field with type "bytes":
Could not resolve type 'string' with expected type: Bytes
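For context, the Avro spec says that default values for bytes and fixed fields are JSON strings in which the Unicode code points 0-255 stand for the byte values 0-255. A minimal sketch of that conversion follows; defaultBytes is a hypothetical helper operating on an already-unescaped string, not part of parseAvroJSON.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Converts the (already unescaped) JSON string of a bytes/fixed default
-- into bytes. Code points above 255 cannot be represented and are
-- reported as an error.
defaultBytes :: String -> Either String [Word8]
defaultBytes = traverse toByte
  where
    toByte c
      | ord c <= 255 = Right (fromIntegral (ord c))
      | otherwise    = Left ("code point out of byte range: " ++ show c)
```

For the field above, the default "\u0000" becomes the single byte 0, which matches the declared size of 1. A fixed type would additionally need the resulting length checked against its size.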
Hi,
I just started testing this library and the deriveAvro example doesn't seem to work:
person.avsc
{
  "type": "record",
  "name": "Person",
  "fields": [
    {
      "name": "name",
      "type": "string"
    }
  ]
}
MyModule.hs
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DeriveGeneric #-}
module MyModule where
import Data.Avro.Deriving
deriveAvro "schemas/person.avsc"
I get this compilation error:
• Couldn't match expected type ‘GHC.Exts.Item
(f0 (Data.Text.Internal.Text,
Data.Avro.Types.Value.Value Data.Avro.Schema.Type))’
with actual type ‘(Data.Text.Internal.Text,
Data.Avro.Types.Value.Value Data.Avro.Schema.Type)’
The type variable ‘f0’ is ambiguous
• In the expression:
(Data.Text.pack "name" Data.Avro.ToAvro..= p_1_alvH)
In the second argument of ‘Data.Avro.record’, namely
‘[(Data.Text.pack "name" Data.Avro.ToAvro..= p_1_alvH)]’
In the expression:
(Data.Avro.record schema'Person)
[(Data.Text.pack "name" Data.Avro.ToAvro..= p_1_alvH)]
|
8 | deriveAvro "schemas/person.avsc"
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
However, this compiles fine with deriveFromAvro.
Is there something I missed? I may very well have made a mistake 😅
Cheers!
Hi @TomMD,
Just a question. I see that you use the entropy package for generating "The 16-byte, randomly-generated sync marker".
Can you explain why entropy and why "standard" random cannot be used?
{ "name": "Decimal", "type": {
"type": "bytes", "logicalType": "decimal", "precision": 7, "scale": 2
}}
As per Avro specification, logical types are optional for codecs to implement.
This library implements "basic" logical types, but they are hard coded into the library.
It would be nice to be able to make logical types extensible so that users of the library would be able to provide their own logical types.
Currently, Java and even .NET implementations allow "registering" new logical types.
We ought to be able to do it, too ;)
I see three separate problems with how names are currently handled:
- namespace of Nothing is semantically the same as Just ""
- name and namespace should be unified into TypeName because we should never compare names without considering the namespaces
The first issue is a problem because Nothing /= Just "" even though they mean the same thing. Using Text instead of Maybe Text would remove this weird edge case while also making the code simpler (including some of the Template Haskell code I'm working on).
The second issue is that equality between names in Avro is defined on the "fullname" (i.e. name + namespace). The easiest way to make this work in the Haskell code is moving namespace into TypeName:
data TypeName = TN
  { name :: Text
  , namespace :: Text
  }
The third issue is something that should be handled when parsing or validating schemas: we need to infer namespaces following some tedious rules laid out in the Avro protocol:
In record, enum and fixed definitions, the fullname is determined in one of the following ways:
- A name and namespace are both specified. For example, one might use "name": "X", "namespace": "org.foo" to indicate the fullname org.foo.X.
- A fullname is specified. If the name specified contains a dot, then it is assumed to be a fullname, and any namespace also specified is ignored. For example, use "name": "org.foo.X" to indicate the fullname org.foo.X.
- A name only is specified, i.e., a name that contains no dots. In this case the namespace is taken from the most tightly enclosing schema or protocol. For example, if "name": "X" is specified, and this occurs within a field of the record definition of org.foo.Y, then the fullname is org.foo.X. If there is no enclosing namespace then the null namespace is used.
References to previously defined names are as in the latter two cases above: if they contain a dot they are a fullname, if they do not contain a dot, the namespace is the namespace of the enclosing definition.
Adding code to infer namespaces sounds like a real pain. I think it would be reasonable to fix the first two issues and punt on the third one, at least for now—people can always work around this by always specifying explicit namespaces for named types.
Addressing these issues would involve some pretty broad breaking changes to the library. I'm happy to work on it at some point, but I'd love to hear thoughts about whether it's even worth doing and what the best approach would be.
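For what it's worth, the per-name part of the inference rules quoted above is fairly compact once the enclosing namespace is passed down during parsing. A sketch follows, using String in place of Text to stay within base; resolveName is a hypothetical helper, and threading the enclosing namespace through the schema parser is the part not shown.

```haskell
import Data.Maybe (fromMaybe)

-- Name and namespace unified, with "" standing for the null namespace.
data TypeName = TN
  { name :: String
  , namespace :: String
  }
  deriving (Eq, Show)

-- Applies the three fullname rules from the spec:
--   * a dotted name is already a fullname; any namespace is ignored
--   * otherwise an explicit "namespace" attribute is used if present
--   * otherwise the enclosing namespace ("" if there is none) is used
resolveName
  :: String       -- ^ enclosing namespace, "" for none
  -> Maybe String -- ^ explicit "namespace" attribute, if any
  -> String       -- ^ the "name" attribute as written
  -> TypeName
resolveName enclosing explicitNs rawName =
  case break (== '.') (reverse rawName) of
    (revBase, '.' : revNs) -> TN (reverse revBase) (reverse revNs)
    _                      -> TN rawName (fromMaybe enclosing explicitNs)
```

Since Eq is derived on both fields, two names compare equal exactly when their fullnames agree, and the Nothing versus Just "" distinction disappears.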
Avro spec (https://avro.apache.org/docs/1.8.1/spec.html#Object+Container+Files) allows blocks within containers to be compressed.
This lib already supports such compression one way (when reading containers). It would be nice if compression was supported for both reading and writing containers.
Right now, we always use Value as an intermediate type for encoding and decoding Avro values.
Using an intermediate type like this for larger amounts of data is expensive. On an internal project, I found that using an intermediate representation like Value was about 3× slower and used about 40× more memory than going directly to/from normal Haskell types generated by TH.
Aeson has the same problem, with similar time/space overhead for using Aeson.Value on large inputs. Aeson solves this problem by having a toEncoding function as part of ToJSON that generates a bytestring directly rather than going through Aeson.Value.
We should add similar toEncoding and fromEncoding methods to ToAvro and FromAvro, and generate implementations for these methods for our TH types. I recently implemented this for some slightly different TH types on an internal project at Target and the logic was a lot simpler than I had thought; in fact, the logic to go directly to binary turned out to be simpler than going through an intermediate type! This change also removed a massive memory leak and improved serialization performance significantly.
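To give a feel for the shape of this, here is a minimal sketch of a direct encoder built on bytestring's Builder. EncodeDirect, Reading and encodeToBytes are hypothetical names, and this is not the implementation mentioned above. It relies only on facts from the Avro binary encoding: a boolean is one byte (0 or 1), a double is 8 bytes little-endian, and a record is the concatenation of its fields in schema order.

```haskell
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Encodes straight to a Builder, with no intermediate Value.
class EncodeDirect a where
  toEncoding :: a -> BB.Builder

-- Avro boolean: a single byte, 0 or 1.
instance EncodeDirect Bool where
  toEncoding b = BB.word8 (if b then 1 else 0)

-- Avro double: 8 bytes, little-endian IEEE 754.
instance EncodeDirect Double where
  toEncoding = BB.doubleLE

-- A hypothetical record type, as TH might generate it.
data Reading = Reading
  { valid :: Bool
  , value :: Double
  }

-- Avro record: its fields encoded in order, nothing else.
instance EncodeDirect Reading where
  toEncoding (Reading v x) = toEncoding v <> toEncoding x

-- Usage example: run the builder and inspect the bytes.
encodeToBytes :: EncodeDirect a => a -> [Word8]
encodeToBytes = BL.unpack . BB.toLazyByteString . toEncoding
```

The generated instance for a record is just the fields' encoders appended in order, which is part of why the direct route can end up simpler than building a Value first.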
There are some types that are redefined in the header's schema after encoding with encodeContainer.
The smallest case I have found to recreate this issue is with the following schema.
{
  "type": "record",
  "name": "TwoBits",
  "fields":
    [ { "name": "bit0",
        "type":
          { "type": "enum",
            "name": "Bit",
            "symbols": [ "Zero", "One"]
          }
      }
    , { "name": "bit1",
        "type": "Bit"
      }
    ]
}
Data.Avro sets the following as the writer's schema after encoding with encodeContainer:
{
  "name": "TwoBits",
  "type": "record",
  "aliases": [],
  "fields":
    [ { "name": "bit0"
      , "type":
          { "name": "Bit"
          , "type": "enum"
          , "symbols": ["Zero", "One"]
          , "aliases": []
          }
      , "aliases": []
      , "order": "ascending"
      }
    , { "name": "bit1"
      , "type":
          { "name": "Bit"
          , "type": "enum"
          , "symbols": ["Zero", "One"]
          , "aliases": []
          }
      , "aliases": []
      , "order": "ascending"
      }
    ]
}
My understanding is that types should not be redefined due to the following statement in the spec
A schema or protocol may not contain multiple definitions of a fullname. Further, a name must be defined before it is used ("before" in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come "before" the messages attribute.)
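A serializer can respect that rule by remembering which names it has already written and emitting only the name on later occurrences. The sketch below uses a toy schema type and string output (Ty, render and twoBits are hypothetical; Haskell's show is used for quoting, which is close enough to JSON for plain ASCII names).

```haskell
import Data.List (intercalate)

-- A toy schema type: primitives, enums and records.
data Ty
  = Prim String
  | Enum String [String]
  | Record String [(String, Ty)]

-- Renders a schema, threading the list of names defined so far. A named
-- type that was already written is emitted as a bare name reference.
render :: [String] -> Ty -> ([String], String)
render seen (Prim p) = (seen, show p)
render seen (Enum n syms)
  | n `elem` seen = (seen, show n)
  | otherwise =
      ( n : seen
      , "{\"type\":\"enum\",\"name\":" ++ show n
          ++ ",\"symbols\":[" ++ intercalate "," (map show syms) ++ "]}"
      )
render seen (Record n flds)
  | n `elem` seen = (seen, show n)
  | otherwise =
      let step (s, acc) (fname, fty) =
            let (s', j) = render s fty
             in (s', acc ++ ["{\"name\":" ++ show fname ++ ",\"type\":" ++ j ++ "}"])
          (seen', fs) = foldl step (n : seen, []) flds
       in ( seen'
          , "{\"type\":\"record\",\"name\":" ++ show n
              ++ ",\"fields\":[" ++ intercalate "," fs ++ "]}"
          )

-- The fully expanded schema, as it exists after name resolution.
twoBits :: Ty
twoBits =
  Record "TwoBits"
    [ ("bit0", Enum "Bit" ["Zero", "One"])
    , ("bit1", Enum "Bit" ["Zero", "One"])
    ]
```

Rendering twoBits with an empty list of seen names writes the full Bit definition for bit0 and just "Bit" for bit1, which matches the original input schema.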
Building library for avro-0.6.1.1..
[ 8 of 21] Compiling Data.Avro.Schema.Decimal
/var/stackage/work/unpack-dir/unpacked/avro-0.6.1.1-c8abdd5c2d67341d5f6a1d2d3b53ca282e000be1c281ae35e370b50a65e1b08c/src/Data/Avro/Schema/Decimal.hs:28:23: error:
Not in scope: ‘D.getScale’
Module ‘Data.BigDecimal’ does not export ‘getScale’.
|
28 | new = if ss > D.getScale d
| ^^^^^^^^^^
/var/stackage/work/unpack-dir/unpacked/avro-0.6.1.1-c8abdd5c2d67341d5f6a1d2d3b53ca282e000be1c281ae35e370b50a65e1b08c/src/Data/Avro/Schema/Decimal.hs:29:37: error:
Not in scope: ‘D.getValue’
Module ‘Data.BigDecimal’ does not export ‘getValue’.
|
29 | then D.BigDecimal (D.getValue d * 10 ^ (ss - D.getScale d)) ss
| ^^^^^^^^^^
/var/stackage/work/unpack-dir/unpacked/avro-0.6.1.1-c8abdd5c2d67341d5f6a1d2d3b53ca282e000be1c281ae35e370b50a65e1b08c/src/Data/Avro/Schema/Decimal.hs:29:63: error:
Not in scope: ‘D.getScale’
Module ‘Data.BigDecimal’ does not export ‘getScale’.
|
29 | then D.BigDecimal (D.getValue d * 10 ^ (ss - D.getScale d)) ss
| ^^^^^^^^^^
/var/stackage/work/unpack-dir/unpacked/avro-0.6.1.1-c8abdd5c2d67341d5f6a1d2d3b53ca282e000be1c281ae35e370b50a65e1b08c/src/Data/Avro/Schema/Decimal.hs:33:37: error:
Not in scope: ‘D.getValue’
Module ‘Data.BigDecimal’ does not export ‘getValue’.
|
33 | else Just $ fromInteger $ D.getValue new
| ^^^^^^^^^^
I am thinking of dropping support for GHC 7.10.
Please let me know if you think supporting 7.10 is still important.
There is a typical Semigroup/Monoid issue
Also GaloisInc/pure-zlib#8
At least for schemas generated by Deriving