rtbhouse / avro-fastserde
Fast Apache Avro serialization/deserialization library
License: Apache License 2.0
Hi,
Could you possibly add a LICENSE file to the project, so that it is clear which license it can be used under, e.g. MIT, BSD-3-Clause, or Apache License 2.0?
We noticed that some of your other projects have one (though it seems MIT is used in some and ASLv2 in others):
https://github.com/RTBHOUSE/graphite-gw/blob/master/LICENSE
https://github.com/RTBHOUSE/torch-utils/blob/master/LICENSE
Thanks,
Mike
(I'm asking because I want to add this as a dependency, in place of the vanilla Avro serializer, in an Apache-licensed open source project. To do that, it is important that every dependency has a license and that the license is ASL-compatible; the ones I mentioned are.)
We're seeing the following NPE when running on Java 11 using fastserde-1.0.5.
com.rtbhouse.utils.avro.FastDeserializerGeneratorException: java.lang.NullPointerException
at com.rtbhouse.utils.avro.FastDeserializerGenerator.generateDeserializer(FastDeserializerGenerator.java:139) ~[avro-fastserde-1.0.5.jar:?]
at com.rtbhouse.utils.avro.FastSerdeCache.buildSpecificDeserializer(FastSerdeCache.java:319) ~[avro-fastserde-1.0.5.jar:?]
at com.rtbhouse.utils.avro.FastSerdeCache.lambda$getFastSpecificDeserializer$1(FastSerdeCache.java:207) ~[avro-fastserde-1.0.5.jar:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.NullPointerException
at com.rtbhouse.utils.avro.FastDeserializerGenerator.processUnion(FastDeserializerGenerator.java:473) ~[avro-fastserde-1.0.5.jar:?]
at com.rtbhouse.utils.avro.FastDeserializerGenerator.processComplexType(FastDeserializerGenerator.java:157) ~[avro-fastserde-1.0.5.jar:?]
at com.rtbhouse.utils.avro.FastDeserializerGenerator.processRecord(FastDeserializerGenerator.java:264) ~[avro-fastserde-1.0.5.jar:?]
at com.rtbhouse.utils.avro.FastDeserializerGenerator.generateDeserializer(FastDeserializerGenerator.java:102) ~[avro-fastserde-1.0.5.jar:?]
... 6 more
The error is at this line:
Symbol.UnionAdjustAction unionAdjustAction = (Symbol.UnionAdjustAction) alternative.symbols[i].production[0];
Here, production is null.
I think the relevant part of the schema is the "type" in the following:
{
  "name": "name",
  "type": ["null", "string"]
}
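For context on what a deserializer has to handle for a field like this: in Avro's binary encoding, a union value is written as the zero-based branch index (a zigzag-encoded variable-length long), followed by the value of that branch, and the null branch has no payload. Below is a minimal sketch of that encoding using only the JDK. The class and method names are made up for illustration; this is not fastserde or Avro library code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class NullableStringEncodingSketch {

    // Avro writes int/long values as zigzag-encoded variable-length integers.
    static void writeLong(ByteArrayOutputStream out, long value) {
        long z = (value << 1) ^ (value >> 63); // zigzag: small magnitudes map to small unsigned values
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // lower 7 bits, continuation bit set
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Encodes one value of the union ["null", "string"]: branch index first, then the branch value.
    static byte[] encodeNullableString(String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value == null) {
            writeLong(out, 0); // branch 0 is "null", which carries no payload
        } else {
            writeLong(out, 1); // branch 1 is "string"
            byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
            writeLong(out, utf8.length); // byte length, then the UTF-8 bytes
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeNullableString(null))); // [0]
        System.out.println(Arrays.toString(encodeNullableString("ab"))); // [2, 4, 97, 98]
    }
}
```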
I am using Avro to parse messages generated by another vendor (a single fixed schema) and found that the parsing performance is not great. I was looking for ways to improve performance when I came across the avro-fastserde library.
My initial attempt at using the library was not successful. I was able to parse messages, but the performance was identical to the base Avro implementation. I was hoping you might be able to provide some additional insight into what I might be doing wrong, as the documentation does not provide a complete working example.
In my case I have used the Avro schema provided by the vendor to generate a SpecificDatumReader and all supporting classes. The code used to parse the messages looks something like this:
final SpecificDatumReader<HfReadData> avroHfReader = new SpecificDatumReader<HfReadData>(HfReadData.SCHEMA$);
DataFileReader<HfReadData> dataFileReader = new DataFileReader<HfReadData>(byteStream, avroHfReader);
while (dataFileReader.hasNext()) {
    HfReadData hfRead = hfReadQueue.take();
    hfRead = dataFileReader.next(hfRead);
    parseAvroMessage(hfRead);
}
where the parseAvroMessage() method parses the message into various objects before passing them on to the application for processing. Note: the JSON schema is moderately simple, consisting of a single record with an array of 1..n sub-records. The parser method combines sets of sub-records into single objects. This method consumes minimal CPU, as verified by taking Flight Recordings with Java Mission Control.
Here is the FastSerde implementation:
final FastSerdeCache serdeCache = new FastSerdeCache("./build/classes/main/com/serde");
final FastSpecificDatumReader<HfReadData> fastSpecificDatumReader = new FastSpecificDatumReader<HfReadData>(HfReadData.SCHEMA$, HfReadData.SCHEMA$, serdeCache);
DataFileReader<HfReadData> dataFileReader = new DataFileReader<HfReadData>(byteStream, fastSpecificDatumReader);
while (dataFileReader.hasNext()) {
    HfReadData hfRead = hfReadQueue.take();
    hfRead = dataFileReader.next(hfRead);
    parseAvroMessage(hfRead);
}
As noted above, I get exactly the same throughput regardless of whether I use FastSerde or base Avro. On my test setup it takes about 5 minutes to process 100,000 messages, and the flight recordings from each test run show slight differences but are otherwise more or less the same.
You note that the FastSpecificDatumReader will fall back to the SpecificDatumReader if the specific classes are not available. I suspect this is what is happening in my case (hence the identical performance). I feel like I have done something wrong, but I am not sure what.
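One way to narrow this down is to time the read loop on its own, after a warm-up phase. The figures quoted above (about 5 minutes for 100,000 messages) work out to roughly 3 ms per message, or about 333 messages per second, so it may be worth checking whether the time is really spent in deserialization rather than elsewhere in the loop (for example in the blocking hfReadQueue.take() call). The stack traces reported elsewhere on this page also show deserializer generation running inside a CompletableFuture, so the first messages may well be handled by the fallback reader. The sketch below uses only the JDK; ThroughputSketch and its methods are hypothetical names, and handleMessage stands in for whatever read-and-parse step is being measured.

```java
import java.util.function.IntConsumer;

// Hypothetical timing harness, for illustration only.
final class ThroughputSketch {

    // Converts a message count and an elapsed time in nanoseconds to messages per second.
    static double messagesPerSecond(long messages, long elapsedNanos) {
        return messages / (elapsedNanos / 1_000_000_000.0);
    }

    // Runs `warmup` untimed iterations (so that any background code generation and JIT
    // compilation have a chance to finish), then times `measured` iterations.
    static double measure(int warmup, int measured, IntConsumer handleMessage) {
        for (int i = 0; i < warmup; i++) {
            handleMessage.accept(i);
        }
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) {
            handleMessage.accept(warmup + i);
        }
        long elapsed = Math.max(System.nanoTime() - start, 1); // guard against a zero reading
        return messagesPerSecond(measured, elapsed);
    }

    public static void main(String[] args) {
        // The numbers from the report: 100,000 messages in 5 minutes.
        System.out.println(messagesPerSecond(100_000, 300L * 1_000_000_000L));

        // Placeholder workload; replace the lambda body with the real read/parse step.
        long[] sink = new long[1];
        System.out.println(measure(10_000, 100_000, i -> sink[0] += i));
    }
}
```

Running the same harness once with the plain reader and once with the fast reader, with the queue and parsing steps stubbed out, should show whether the two readers actually differ on this schema.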
Very interesting project.
Would you be able to share an example of, or a pointer to, the code in Avro that does the schema analysis and is slower, and show how you hand-rolled a deserializer?
For FastSerdeCache cache = new FastSerdeCache(compileClassPath);
when and why do I need to set this?
Do I need to set this at all? From reading the blog, I thought class generation happens on demand. Is there a way to enable ahead-of-time compilation for cases where I already know the schema I am trying to deserialize?
How can I check that the class has been generated and compiled, and observe in a stack trace that the more efficient deserializer is being invoked?
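To make the question concrete, here is a sketch of the general "use the slow path until the generated one is ready" pattern that on-demand generation implies. It is an assumption about the overall shape, not the library's actual implementation: FallbackUntilReadySketch and isFastPathReady() are hypothetical names, and the only thing taken from this page is that generation appears to run inside a CompletableFuture (as the stack traces in other reports show).

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical wrapper, for illustration only; not the fastserde API.
final class FallbackUntilReadySketch<I, O> {

    private final Function<I, O> fallback;
    private final CompletableFuture<Function<I, O>> fast;

    FallbackUntilReadySketch(Function<I, O> fallback, CompletableFuture<Function<I, O>> fast) {
        this.fallback = fallback;
        this.fast = fast;
    }

    // True only once the background build has finished without an exception.
    boolean isFastPathReady() {
        return fast.isDone() && !fast.isCompletedExceptionally();
    }

    // Uses the generated function when available, otherwise the fallback.
    O apply(I input) {
        Function<I, O> chosen = isFastPathReady() ? fast.join() : fallback;
        return chosen.apply(input);
    }

    public static void main(String[] args) {
        Function<String, String> fastFn = s -> "fast:" + s;
        CompletableFuture<Function<String, String>> generated =
                CompletableFuture.supplyAsync(() -> fastFn); // stands in for code generation
        FallbackUntilReadySketch<String, String> reader =
                new FallbackUntilReadySketch<>(s -> "slow:" + s, generated);

        System.out.println(reader.apply("first")); // either path, depending on timing
        generated.join();                          // wait for the background build
        System.out.println(reader.apply("later")); // fast:later
    }
}
```

With a wrapper like this, a readiness check (or a log line when the future completes) is what would tell you which path is in use; whether the library exposes anything equivalent is exactly the open question above.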
Could this please support stringable classes? Currently it throws a cast exception when these are defined in the Avro schema,
e.g.
{ "name": "id", "type": { "type": "string", "java-class": "java.math.BigInteger" } }
/** Read/write some common builtin classes as strings. Representing these as
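To illustrate the behaviour being asked for: a stringable class is written as its toString() form and rebuilt on read through its single-String constructor. The sketch below shows that round trip for the java.math.BigInteger case from the schema above, using only the JDK. StringableSketch and its methods are hypothetical names, not part of Avro or fastserde.

```java
import java.lang.reflect.Constructor;
import java.math.BigInteger;

// Hypothetical helper, for illustration only.
final class StringableSketch {

    // On write, the value is represented by its string form.
    static String toWire(Object value) {
        return value.toString();
    }

    // On read, the class named by "java-class" is rebuilt through its (String) constructor.
    static <T> T fromWire(String text, Class<T> javaClass) {
        try {
            Constructor<T> ctor = javaClass.getConstructor(String.class);
            return ctor.newInstance(text);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(javaClass + " cannot be built from a String", e);
        }
    }

    public static void main(String[] args) {
        BigInteger id = new BigInteger("123456789012345678901234567890");
        String wire = toWire(id);
        BigInteger back = fromWire(wire, BigInteger.class);
        System.out.println(back.equals(id)); // true
    }
}
```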
We have seen two FastDeserializer generation failure cases -
These two issues have been fixed in the following PR:
linkedin/avro-util#46
Can you take a look at this PR to see whether you need this fix?
For Avro schemas with a nested record structure, FastDeserializerGenerator.generateDeserializer throws an NPE if FastGenericDatumReader is used with a reader schema that differs from the writer schema. It deserializes correctly if the writer and reader schemas are the same.
Sample writer schema (a record with 3 nested records):
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {
      "name": "address",
      "type": {
        "type": "record",
        "name": "address_data",
        "fields": [
          {
            "name": "streetName",
            "type": "string",
            "doc": "name of street"
          },
          {
            "name": "city",
            "type": "string",
            "doc": "name of city"
          }
        ]
      },
      "doc": "user addresses"
    },
    {
      "name": "segment",
      "type": {
        "type": "record",
        "name": "segment_data",
        "fields": [
          {
            "name": "segmentA",
            "type": "string",
            "doc": "name of segment A"
          },
          {
            "name": "segmentB",
            "type": "string",
            "doc": "name of segment B"
          }
        ]
      },
      "doc": "user segments"
    },
    {
      "name": "devices",
      "type": {
        "type": "record",
        "name": "device_data",
        "fields": [
          {
            "name": "deviceA",
            "type": "string",
            "doc": "name of device A"
          },
          {
            "name": "deviceB",
            "type": "string",
            "doc": "name of device B"
          }
        ]
      },
      "doc": "user devices"
    }
  ],
  "doc": "user schema"
}
Sample reader schema (contains one fewer nested record than the writer schema):
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {
      "name": "address",
      "type": {
        "type": "record",
        "name": "address_data",
        "fields": [
          {
            "name": "streetName",
            "type": "string",
            "doc": "name of street"
          },
          {
            "name": "city",
            "type": "string",
            "doc": "name of city"
          }
        ]
      },
      "doc": "user addresses"
    },
    {
      "name": "devices",
      "type": {
        "type": "record",
        "name": "device_data",
        "fields": [
          {
            "name": "deviceA",
            "type": "string",
            "doc": "name of device A"
          },
          {
            "name": "deviceB",
            "type": "string",
            "doc": "name of device B"
          }
        ]
      },
      "doc": "user devices"
    }
  ],
  "doc": "user schema"
}
WARNING: deserializer generation exception
com.rtbhouse.utils.avro.FastDeserializerGeneratorException: java.lang.NullPointerException
at com.rtbhouse.utils.avro.FastDeserializerGenerator.generateDeserializer(FastDeserializerGenerator.java:169)
at com.rtbhouse.utils.avro.FastSerdeCache.buildGenericDeserializer(FastSerdeCache.java:322)
at com.rtbhouse.utils.avro.FastSerdeCache.lambda$getFastGenericDeserializer$4(FastSerdeCache.java:225)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at com.rtbhouse.utils.avro.FastDeserializerGenerator.processRecord(FastDeserializerGenerator.java:227)
at com.rtbhouse.utils.avro.FastDeserializerGenerator.generateDeserializer(FastDeserializerGenerator.java:97)
... 6 more
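For reference, this is the behaviour schema resolution calls for in this case: the data is laid out in writer field order, and a writer field that the reader schema does not declare ("segment" here) still has to be consumed from the stream and then discarded. A minimal sketch of that per-field decision, using only the JDK and plain field names instead of real Avro schemas (FieldResolutionSketch and plan are hypothetical names):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class FieldResolutionSketch {

    // Walks the writer's fields in order and decides, per field, whether its
    // bytes are decoded into the result or merely skipped.
    static List<String> plan(List<String> writerFields, Set<String> readerFields) {
        List<String> steps = new ArrayList<>();
        for (String field : writerFields) {
            steps.add((readerFields.contains(field) ? "READ " : "SKIP ") + field);
        }
        return steps;
    }

    public static void main(String[] args) {
        List<String> writer = Arrays.asList("address", "segment", "devices");
        Set<String> reader = new HashSet<>(Arrays.asList("address", "devices"));
        System.out.println(plan(writer, reader)); // [READ address, SKIP segment, READ devices]
    }
}
```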
The test case is as follows:
@Test
public void testNullElementArray() {
    // given
    Schema arrayRecordSchema = Schema.createArray(
            Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)));
    List<Object> records = new ArrayList<Object>();
    records.add(null);
    records.add(null);
    records.add(null);
    // when
    List<Object> array = decodeRecord(arrayRecordSchema, arrayRecordSchema,
            specificDataAsDecoder(records, arrayRecordSchema));
    // then
    Assert.assertEquals(3, array.size());
    Assert.assertNull(array.get(0));
    Assert.assertNull(array.get(1));
    Assert.assertNull(array.get(2));
}
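As a reference for what the decoder in this test has to consume: an Avro array is written as one or more blocks (an item count, then the items) and is terminated by a zero count, and each item of the [string, null] union starts with its branch index. Written as a single block, three nulls would therefore come out as the bytes 06 02 02 02 00. The sketch below decodes that layout by hand using only the JDK. NullableArrayDecodingSketch is a hypothetical name, and this is not the generated deserializer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical decoder, for illustration only.
final class NullableArrayDecodingSketch {

    // Reads one zigzag-encoded variable-length long.
    static long readLong(ByteBuffer in) {
        long raw = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            raw |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (raw >>> 1) ^ -(raw & 1); // undo zigzag
    }

    // Decodes an array whose items are the union ["string", "null"].
    static List<String> decode(ByteBuffer in) {
        List<String> items = new ArrayList<>();
        for (long count = readLong(in); count != 0; count = readLong(in)) {
            if (count < 0) {   // a negative count means a block byte size follows
                count = -count;
                readLong(in);  // block size in bytes, not needed here
            }
            for (long i = 0; i < count; i++) {
                long branch = readLong(in);
                if (branch == 0) { // branch 0: string (length, then UTF-8 bytes)
                    byte[] utf8 = new byte[(int) readLong(in)];
                    in.get(utf8);
                    items.add(new String(utf8, StandardCharsets.UTF_8));
                } else {           // branch 1: null, no payload
                    items.add(null);
                }
            }
        }
        return items;
    }

    public static void main(String[] args) {
        byte[] threeNulls = {0x06, 0x02, 0x02, 0x02, 0x00};
        System.out.println(decode(ByteBuffer.wrap(threeNulls))); // [null, null, null]
    }
}
```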