hortonworks / registry

Schema Registry

License: Apache License 2.0

Topics: schema-registry, kafka, kinesis, flink, spark-streaming, metadata, schemas, storm

registry's Introduction

Registry

Registry is a framework to build metadata repositories. As part of Registry, we currently have SchemaRegistry repositories.

Follow @schemaregistry on Twitter for updates on the project.

Documentation

Documentation and tutorials can be found on the Registry docs.

Getting Help

Registry users and devs can send a message to the Registry Google Group.

License

Copyright 2016-2022 Cloudera.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


registry's People

Contributors

acsaki, afritz-cloud, akatona84, apsaltis, arunmahadevan, cloudera-releng, csivaguru, gcsaba2, gergowilder, gkomlossi, guruchai, harshach, heartsavior, hmcl, joylyn, kamalcph, koccs, michaelandrepearce, nattilabalint, omkreddy, parth-brahmbhatt, pnagy-cldr, priyank5485, ptgoetz, raju-saravanan, satishd, shahsank3t, urbandan, vesense, viktorsomogyi

registry's Issues

Add Confluent Schema Registry compatible API to enable migration

Many users of Kafka with Avro use the Confluent Schema Registry. To enable an easier migration path, a compatible API is needed, since not all consumers and producers can be changed and released at the same time.

Registry could expose a compatible REST API for apps that are still producing or consuming with the Confluent serdes. Likewise, it could add an option to run with a compatible byte wire protocol, so that apps using the new registry serdes are able to talk to and understand the Confluent protocol.

e.g. schema meta info and version could be referenceable by a single unique id; this id is then sent in the byte[], with the same leading magic/protocol byte (a sketch of this framing follows).
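
A minimal sketch in Java of the framing described above, assuming the Confluent convention of a single magic byte followed by a big-endian int32 schema id (the class and method names here are hypothetical):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical helper: prefix the Avro payload with the magic byte and id.
public final class ConfluentWireFormat {
    private static final byte MAGIC_BYTE = 0x0;

    public static byte[] encode(int schemaId, byte[] avroPayload) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MAGIC_BYTE);                                      // 1 byte: protocol magic
        out.write(ByteBuffer.allocate(4).putInt(schemaId).array()); // 4 bytes: int32 schema id
        out.write(avroPayload);                                     // remainder: Avro-encoded data
        return out.toByteArray();
    }
}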

Listeners to pull in external meta store schemas

We should provide a Scheduled Listener interface that allows plugins to pull external meta-store schemas into the registry. For example, a listener could be scheduled to pull in any schemas or version changes from Confluent's schema registry (a sketch of such an interface follows).
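
A hedged sketch of what such a plugin contract could look like (all names here are hypothetical; nothing like this exists in the codebase yet):

// Hypothetical plugin contract for pulling schemas from an external store.
public interface ScheduledSchemaListener {

    // How often the registry should invoke this listener.
    long pollIntervalMillis();

    // Pull any new schemas or versions and hand them to the registry.
    void poll(ExternalSchemaSink sink);

    // Callback the registry would provide to the listener.
    interface ExternalSchemaSink {
        void addSchemaVersion(String schemaName, String schemaText);
    }
}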

KafkaAvroDeserializer should not require setting the version id.

Currently it seems that the KafkaAvroDeserializer needs to know the version id upfront.

This causes two issues:

  • apps that handle messages generically, such as GenericRecord
  • apps that have the SpecificRecord on the class path/JVM

Ideally it should behave like Confluent's: if no READER_VERSION is present, it simply uses the same id as the incoming data (the GenericRecord case).

If a SpecificRecord is found on the class path, the deserializer should use that record's schema as the reader schema; a sketch of this resolution logic follows.

Currently this blocks migration from Confluent's Schema Registry.
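
A minimal sketch of that resolution logic using the standard Avro APIs (the wrapper class and method are hypothetical):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificData;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificRecord;

public final class ReaderSchemaResolver {
    // Choose the reader schema the way the issue proposes.
    public static DatumReader<?> readerFor(Schema writerSchema, Class<?> specificClass) {
        if (specificClass != null && SpecificRecord.class.isAssignableFrom(specificClass)) {
            // SpecificRecord on the class path: use its generated schema as reader.
            Schema readerSchema = SpecificData.get().getSchema(specificClass);
            return new SpecificDatumReader<>(writerSchema, readerSchema);
        }
        // No READER_VERSION configured: read with the writer's own schema,
        // which yields GenericRecord instances.
        return new GenericDatumReader<>(writerSchema);
    }
}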

Add support for optional evolution for schemas.

This will add a feature to register schema metadata that can have either multiple versions or a single version. Currently a schema always allows multiple versions, and there are cases where users may want a schema to have only one version and to disallow adding more (sketched below).
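
A hedged sketch of how this might surface on the client, assuming an evolve flag on the SchemaMetadata builder; treating evolve(false) as "exactly one version allowed" is the proposed behaviour, not current semantics:

import com.hortonworks.registries.schemaregistry.SchemaMetadata;

// Sketch only: register metadata for a schema that must keep a single version.
SchemaMetadata singleVersion = new SchemaMetadata.Builder("truck-events")
        .type("avro")
        .schemaGroup("kafka")
        .description("schema limited to a single version")
        .evolve(false)   // proposed: disallow registering further versions
        .build();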

Support composition of avro schemas.

Currently Avro requires users to create individual schema documents, and there is no ability to include/import other schema documents. This feature will allow users to include other schemas and refer to their types, so that multiple schemas can be composed, with their respective abstractions, in meaningful ways.

#todo add examples
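
Until those land, a hedged illustration of what composition could look like: a standalone schema document, and a second document referring to its type by full name (resolving such references across documents is the proposed feature; the file names and types are made up):

address.avsc:

{ "type": "record", "name": "Address", "namespace": "com.example", "fields": [ { "name": "street", "type": "string" }, { "name": "city", "type": "string" } ] }

person.avsc:

{ "type": "record", "name": "Person", "namespace": "com.example", "fields": [ { "name": "name", "type": "string" }, { "name": "home", "type": "com.example.Address" } ] }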

Create schemaGroup API to register group names

This is intended to create an admin page on the UI side to register schema group names, instead of the free-form text used today. Once we do this we can enforce rules on schema names; for example, the Kafka group requires suffixing ":v" or ":k" to the schema names. A lot of users trip over this registration naming scheme.
Also, having a predefined schemaGroup name keeps users from making mistakes when registering a schema.

Schema Registry needs to allow users to paste a schema

Previous versions of Schema Registry allowed me to paste a schema in, which was very nice. Now it requires that a file be uploaded. This means that I have to copy the schema, paste it into vi, save, go back to the web app, upload, navigate to the file... While I can appreciate the desire to upload a file if you already have one saved off, it is much more of a pain if you don't have the schema saved in a file. I would like the option to either paste the text into the UI or upload an existing file.

Schema Registry should allow schema registry clients to handle schema identifier and version tracking on their own

The idea is that some well-behaved schema registry clients do not need/want the SR library to do the serialization and deserialization of the data for them; instead, they want the schema registry to focus primarily on the publishing/retrieval of the schemas themselves.
Today the schema registry client takes care of the serialization and deserialization of data, and when it does so, it writes out the identifier and version, then writes out the raw data.
So, for example, the resulting bytes on disk would be
<schema id><schema version><raw data>
NiFi and protocols like JMS, HTTP, etc. have facilities to support context (headers) and content (payload). So we don't need/want serializers to write that bookkeeping for us; we'll handle passing those references around for the objects ourselves (see the sketch below).
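
A hedged sketch of that client pattern, assuming the client's getLatestSchemaVersionInfo lookup (the wrapper class is hypothetical and the registry URL is a placeholder):

import java.util.Collections;
import com.hortonworks.registries.schemaregistry.SchemaVersionInfo;
import com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient;

public final class SchemaLookupOnly {
    public static String fetchSchemaText(String schemaName) throws Exception {
        SchemaRegistryClient client = new SchemaRegistryClient(
                Collections.singletonMap("schema.registry.url",
                        "http://localhost:9090/api/v1"));
        SchemaVersionInfo info = client.getLatestSchemaVersionInfo(schemaName);
        // info.getVersion() is what a well-behaved client would carry in
        // NiFi attributes or JMS/HTTP headers, leaving the payload untouched.
        return info.getSchemaText();
    }
}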

SampleApplicationTest.testApi test failure

testApis(com.hortonworks.registries.schemaregistry.examples.avro.SampleApplicationTest) Time elapsed: 0.708 sec <<< ERROR!
java.lang.RuntimeException: Jar /serdes-examples.jar could not be loaded
at com.hortonworks.registries.schemaregistry.examples.avro.SampleSchemaRegistryClientApp.runCustomSerDesApi(SampleSchemaRegistryClientApp.java:181)
at com.hortonworks.registries.schemaregistry.examples.avro.SampleApplicationTest.testApis(SampleApplicationTest.java:43)

Create an audit log of clients using schemas

We should introduce a clientId to SchemaRegistryClient, which at a configured interval would send a heartbeat to the registry server along with the schema id & version it's using and whether it is a producer or consumer. This will help us build an audit log of clients accessing schemas. This will in turn give Schema Authors an indication of which clients a potential schema change might affect (a sketch of the heartbeat payload follows).
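
A hedged sketch of what the heartbeat payload could carry (entirely hypothetical; no such fields exist in the client today):

// Hypothetical heartbeat sent at a configured interval.
public final class ClientHeartbeat {
    public String clientId;        // stable id configured on SchemaRegistryClient
    public Long schemaMetadataId;  // schema the client is using
    public Integer schemaVersion;  // version the client is using
    public String role;            // "PRODUCER" or "CONSUMER"
    public long timestampMillis;   // heartbeat time
}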

Add support for Kafka Header Registry

As you'll see, KIP-82 got adopted and submitted.

This means that as of Kafka 0.11, Kafka will have headers.

In a Kafka Record the value is simply a byte[], which delegates the handling of what schema applies, and how to decode it, to the consumer; this is exactly what solutions like schema registry provide.

The Kafka Header record introduces a String key and a byte[] value. Following this, it would be useful to support registering the Kafka header value types.

It would be great for the schema registry to support headers whose schema can be a primitive (int8, int16, int32, int64, float32, float64, boolean, bytes, string) or a more complex Avro-like schema.

The idea would be that the subject for lookup is the topic + header key; or, if all values for the same key are uniform within an organisation, the subject could simply be the header key.

Is this possible with the current schema repo APIs? That is, is the mapping agnostic, simply subject = topic + ".key" or subject = topic + ".value", such that we could make subject = topic + ".header." + headerKey? (A sketch of this convention follows.)
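
A minimal sketch of the naming convention floated above (the helper class is hypothetical; the convention is simply subject = topic + suffix):

public final class HeaderSubjects {
    static String keySubject(String topic)   { return topic + ".key"; }
    static String valueSubject(String topic) { return topic + ".value"; }
    static String headerSubject(String topic, String headerKey) {
        return topic + ".header." + headerKey;
    }
}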

Hdfs service should be mandatory for HBase and Hive

If a user creates an environment by picking HBase/Hive from an HDP cluster added via Ambari, then HDFS from the same cluster should automatically be added to the environment by the UI. The reason is that HBase has hbase.rootdir, which uses hdfs-site.xml and core-site.xml to connect to HDFS. Without HDFS present in the environment, the Storm topology fails at runtime, throwing an exception from HBaseClient that it cannot connect to HDFS.

Treat a union type containing null as having a default value of null.

Currently, union types having null as the first type should be treated as a type whose default value is null, as mentioned here.

(Note that when a default value is specified for a record field whose type is a union, the type of the default value must match the first element of the union. Thus, for unions containing "null", the "null" is usually listed first, since the default value of such unions is typically null.)

But the Avro (1.7.x and 1.8.x) implementation does not handle this scenario, and it may take a while to get this fix into Avro. It is better to address this issue in the schema registry while computing the effective or resultant schema (example below).
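
For illustration, a field written per the quoted rule, with "null" listed first so the default null matches the first union branch; the proposal is for the registry to treat such a field as default-null when computing the effective schema, even where Avro 1.7.x/1.8.x does not:

{ "type": "record", "name": "User", "fields": [ { "name": "nickname", "type": [ "null", "string" ], "default": null } ] }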

_orderByField should be handled by the in-memory storage manager as well.

_orderByField query parsing may need to be pushed up to the service instead of being kept at the JDBC storage manager level. Ideally, StorageManager should not try to parse the query params and figure out the orderByField query; instead it should expose find APIs that take a list of OrderByFields as an optional argument (sketched below).
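
A hedged sketch of the API shape being suggested (type and method names are illustrative, not the actual StorageManager interface):

import java.util.Collection;
import java.util.List;
import java.util.Map;

public interface FindableStorageManager<T> {

    // Stands in for a (fieldName, descending?) ordering pair.
    final class OrderByField {
        public final String fieldName;
        public final boolean descending;
        public OrderByField(String fieldName, boolean descending) {
            this.fieldName = fieldName;
            this.descending = descending;
        }
    }

    // The service layer parses _orderByField and passes an explicit ordering.
    Collection<T> find(String namespace, Map<String, String> queryParams);
    Collection<T> find(String namespace, Map<String, String> queryParams,
                       List<OrderByField> orderByFields);
}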

SchemaGroup and enforcing uniqueness within the group or across the SchemaRegistry

Let's say I want to register a schema named "person" with the following schema for Nifi:
{ "name": "person", "namespace": "nifi", "type": "record", "fields": [ { "name": "id", "type": "string" }, { "name": "firstName", "type": "string", "aliases": [ "first_name" ] }, { "name": "lastName", "type": "string", "aliases": [ "last_name" ] }, { "name": "email", "type": "string" }, { "name": "gender", "type": "string" }, { "name": "ipAddress", "type": "string", "aliases": [ "ip_address" ] } ] }
Should we allow users to use the same schemaName under a different group? Say I want to use schemaName "person" under schemaGroup "kafka":
{ "name": "person", "namespace": "nifi", "type": "record", "fields": [ { "name": "id", "type": "string" }, { "name": "firstName", "type": "string", "aliases": [ "first_name" ] }, { "name": "lastName", "type": "string", "aliases": [ "last_name" ] }, { "name": "email", "type": "string" }, { "name": "gender", "type": "string" } ] }
The above request comes back as success, but I don't see the new schema getting registered.
cc @satishd

Store schema ID and version in Kafka message headers instead of payload prefix

The serde currently stores schema identifiers as a payload prefix. This makes the payload incompatible with standard Avro deserializers.

Since Kafka 0.11 supports message headers, it is now possible to store schema identifiers in the headers and leave the payload unchanged.

Nice to have: the serde should be backward compatible, automatically detecting the version of Kafka it is running against and storing the identifiers either in headers or in the payload, depending on what is available. A sketch of the header-based layout follows.
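
A hedged sketch of that layout with the standard Kafka producer API (the header names are hypothetical):

import java.nio.ByteBuffer;
import org.apache.kafka.clients.producer.ProducerRecord;

public final class HeaderSchemaRef {
    // The value stays plain Avro bytes; the schema reference rides in headers.
    public static ProducerRecord<byte[], byte[]> withSchemaRef(
            String topic, byte[] plainAvroPayload, long schemaId, int version) {
        ProducerRecord<byte[], byte[]> record =
                new ProducerRecord<>(topic, null, plainAvroPayload);
        record.headers().add("schema.id",
                ByteBuffer.allocate(8).putLong(schemaId).array());
        record.headers().add("schema.version",
                ByteBuffer.allocate(4).putInt(version).array());
        return record;
    }
}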

UI layout of adding schemas

This issue is discussed in #86 here.

Have we thought about the option of having a 3/4 area for the schema text and a 1/4 area for the left column (name/type/compatibility etc.), with the description as a two-row resizable textbox?

This is mainly about giving the left column a third of the area and the schema text two thirds, with the description in a resizable text box initially two rows high. Currently the left column takes half of the space, which may not really be needed.

@harshach @shahsank3t any thoughts/opinions?

Add postgres sql scripts

We already have storage manager support for Postgres. We need to convert the MySQL scripts to Postgres.

Add support for handling the Confluent Kafka byte wire protocol

To aid migration for users currently on the Confluent platform, where producers are still using the Confluent serdes, or where there is a need to integrate with tooling that at the moment supports only the Confluent serdes (until registry adoption is more widespread).

It would be good to be able to produce a Confluent wire compatible protocol, and likewise consume it, making the serialiser configurable as to which protocol to produce, and the consumer able to simply handle either.

This will need to ensure we check the protocol versions in the byte array; Confluent currently uses a leading 0x0 byte as the protocol magic byte, followed by 4 bytes for the int32 id (see the sketch below).

This would link with the other Confluent compatibility work I've already PR'd, and enable faster adoption.
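
A minimal consumer-side sketch of that check (class and method names are hypothetical):

import java.nio.ByteBuffer;

public final class WireProtocolDetector {
    // Confluent frames start with magic byte 0x0 followed by an int32 schema id.
    public static boolean isConfluentFrame(byte[] payload) {
        return payload.length >= 5 && payload[0] == 0x0;
    }

    public static int confluentSchemaId(byte[] payload) {
        return ByteBuffer.wrap(payload, 1, 4).getInt();
    }
}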

Updating schema description in the UI does not work

To replicate:

  • edit an existing schema; the Add Version dialog is displayed
  • change the description and click OK; "Version added successfully" is displayed

Expected result:

  • schema now has new description

Actual result:

  • description has not changed, even after refreshing the page

Schema registry UI is very slow at initial load

In our environment it takes about 30 seconds for the UI to get populated with all the schemas in the registry.

Initially the screen displays "No data found" and then the schemas start to trickle in slowly.

It would be helpful to display some indication that the schemas are still being loaded. It would also be nice to speed up the loading if possible.

HA Support for schema registry.

Support high availability for a schema registry cluster: allow multiple instances of schema registry to run in the same cluster, where

  • one node in the cluster acts as the master; writes are handled only by this instance, which can also handle read requests
  • all other nodes can take read requests, and write requests are redirected to the master

Need pagination support

Currently, the schema list gets long when many schemas are added. It would be better to provide pagination support, showing 10 or 15 schemas at a time.

Need an API that can return if the schema creation was successful or not

Currently, the POST API returns the new schemaId if the schema was created successfully, or the old schemaId when a user tries to create a schema with the same name. The UI has no way to identify whether the schemaId in the response is new or old, and hence cannot show a proper notification about whether the schema was created.
Having another API, or the same API with a more explicit response, would help in showing proper notifications.
