aiven-open / commons-for-apache-kafka-connect

Shared common functionality among Aiven's connectors for Apache Kafka®

License: Apache License 2.0

Java 100.00%
kafka kafka-connect

commons-for-apache-kafka-connect's People

Contributors

actions-user, ahmedsobeh, anatolypopov, c0urante, dependabot[bot], docemmetbrown, github-actions[bot], giuseppelillo, helenmel, ivanyu, jeqo, jjaakola-aiven, jlprat, juha-aiven, oikarinen, pedrorossi, ryanskraba, snuyanzin, staaldraad, stephen-harris, tvainika, willyborankin

commons-for-apache-kafka-connect's Issues

Option to provide user defined RecordGrouper

I'd really like to group records from multiple topics and partitions, and to group them by event time. I imagine many other users of this plugin would also love to have more control over how records are grouped. I'd propose that the RecordGrouperFactory detect whether a configuration value names a user-supplied class to load (e.g. file.grouper.class), and otherwise fall back to the current implementation.

I might be able to contribute a PR to this effect if you folks were interested. If that doesn't align with the direction or availability of the maintainers, no worries! Never hurts to ask ;-)
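
To make the proposal concrete, here is a minimal sketch of the lookup-with-fallback idea. Everything in it is hypothetical: `Grouper`, `DefaultGrouper`, `EventTimeGrouper` and `PluggableGrouperFactory` are stand-ins invented for illustration and are not the library's actual types; only the `file.grouper.class` key comes from the example above.

```java
import java.util.Map;

// Hypothetical stand-in for the library's grouper abstraction.
interface Grouper {
    String name();
}

// Stand-in for the current, built-in grouping behaviour.
class DefaultGrouper implements Grouper {
    @Override
    public String name() {
        return "default";
    }
}

// Example of what a user-supplied implementation could look like.
class EventTimeGrouper implements Grouper {
    @Override
    public String name() {
        return "event-time";
    }
}

class PluggableGrouperFactory {
    // Assumed config key, taken from the example in this issue.
    static final String GROUPER_CLASS_CONFIG = "file.grouper.class";

    // Loads the configured class if one is set, otherwise falls back to the default.
    static Grouper create(final Map<String, String> config) {
        final String className = config.get(GROUPER_CLASS_CONFIG);
        if (className == null || className.trim().isEmpty()) {
            return new DefaultGrouper();
        }
        try {
            final Class<? extends Grouper> clazz =
                Class.forName(className.trim()).asSubclass(Grouper.class);
            // Requires a no-argument constructor on the user class.
            return clazz.getDeclaredConstructor().newInstance();
        } catch (final ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load grouper class " + className, e);
        }
    }
}
```

With an empty configuration the factory returns the default grouper; setting `file.grouper.class` to the fully qualified name of a class implementing the interface returns an instance of that class, and an unknown or incompatible class name fails fast with an `IllegalArgumentException`.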

`validateKeyFilenameTemplate` is not used

The protected method validateKeyFilenameTemplate, which was added in #39, isn't executed anywhere in this repository, nor is it invoked by the S3/GCS connectors mentioned in the PR.

Should we invoke validateKeyFilenameTemplate() from AivenCommonConfig::validate, or do we expect each connector to invoke it?

Potential security vulnerability in the zstd C library. Can you help upgrade to patch versions?

Hi @ivan-savciuc, @actions-user, I'd like to report a vulnerability issue in io.aiven:aiven-kafka-connect-commons:0.7.0.

Issue Description

I noticed that io.aiven:aiven-kafka-connect-commons:0.7.0 directly depends on com.github.luben:zstd-jni:v1.4.5-12 in the pom, as shown in the following dependency graph. However, com.github.luben:zstd-jni:v1.4.5-12 suffers from a vulnerability exposed by the C library zstd (version 1.4.5): CVE-2021-24032.

Dependency Graph between Java and Shared Libraries

[Image: dependency graph]

Suggested Vulnerability Patch Versions

com.github.luben:zstd-jni:v1.4.9-1 (>=v1.4.9-1) has upgraded this vulnerable C library zstd to the patch version 1.4.9.

Java build tools cannot report vulnerable C libraries, which may expose many downstream Java projects to security issues. Could you please upgrade this vulnerable dependency?

Thanks for your help~
Best regards,
Helen Parr

Add extensions for file format

Currently, extensions are only added for compression type (e.g. gzip, snappy) but not for format type (json, parquet).
To avoid confusion if file format changes over time, extensions should be included.
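
As a rough illustration of the requested behaviour, the sketch below appends a format extension before the compression extension. The class name `FileNameSuffix` and the extension strings are assumptions made for this example; they are not the library's actual constants or API.

```java
import java.util.Locale;

class FileNameSuffix {
    // Builds "<base>.<format>.<compression>", skipping any part that is null or empty.
    // The extension values ("json", "gz", ...) are illustrative, not library constants.
    static String build(final String baseName,
                        final String formatExtension,
                        final String compressionExtension) {
        final StringBuilder sb = new StringBuilder(baseName);
        append(sb, formatExtension);
        append(sb, compressionExtension);
        return sb.toString();
    }

    private static void append(final StringBuilder sb, final String extension) {
        if (extension != null && !extension.isEmpty()) {
            sb.append('.').append(extension.toLowerCase(Locale.ROOT));
        }
    }
}
```

For example, `FileNameSuffix.build("topic-0-0000", "json", "gz")` yields `topic-0-0000.json.gz`, so a later switch from JSON to Parquet stays visible in the object names.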

Support {{topic}} variable with {{key}} in file template pattern

This is a feature request to support the use of {{key}} together with the {{topic}} variable.

Currently two modes of grouping are supported:

  • grouping by the topic, partition, and timestamp;
  • grouping by the key.

In our context, we have a connector responsible for backing up multiple compacted topics to an S3 bucket. While this could be achieved with multiple connectors, given the large number of topics we were hoping to use connectors to provide a logical grouping of topics. However, as keys may exist across topics, we would need the topic name in the S3 key to do this.
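
A minimal sketch of what we have in mind, assuming a simple substitution of `{{name}}` variables. `TemplateSketch` is a hypothetical helper written for this issue, not the library's template engine.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateSketch {
    // Matches {{name}} with optional whitespace inside the braces.
    private static final Pattern VARIABLE = Pattern.compile("\\{\\{\\s*(\\w+)\\s*\\}\\}");

    // Replaces every {{name}} with its value; unknown variables are rejected.
    static String render(final String template, final Map<String, String> values) {
        final Matcher matcher = VARIABLE.matcher(template);
        final StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            final String value = values.get(matcher.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Unknown variable: " + matcher.group(1));
            }
            // quoteReplacement keeps '$' and '\' in keys from being treated as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

With the template `{{topic}}/{{key}}`, a record with key `k1` on topic `orders` would be written to `orders/k1`, so identical keys on different topics no longer collide.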

Compatibility shim library

It's difficult to determine whether some features of the Kafka Connect API are supported by the runtime into which a plugin (e.g., connector, converter, etc.) is deployed. Some features can be detected implicitly based on which methods the runtime invokes on the plugin (such as whether SinkTask::preCommit or SinkTask::flush is invoked), and others can be detected by catching runtime classloading errors (such as those thrown by SinkTaskContext::errantRecordReporter).

This is both inconvenient (catching classloading errors is never fun) and limiting (it's impossible to leverage interfaces introduced in newer APIs with default implementations that make sense in the absence of support for that feature by the Kafka Connect runtime, and there are limited options for when exactly support or lack of support for a given feature can be determined).

We can implement a compatibility shim library that eases some of these pain points and covers up some of the uglier bits for connector developers who want to access newer features in their connectors without sacrificing compatibility with older versions of Kafka Connect.
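
One building block such a shim might use is explicit capability detection, so connector code can ask up front whether a method exists instead of catching errors at the call site. The sketch below does this with reflection. `RuntimeFeatures`, `OldContext` and `NewContext` are hypothetical names for illustration; the two interfaces stand in for an older and a newer version of a runtime interface such as SinkTaskContext and are not the real Kafka Connect types.

```java
import java.lang.reflect.Method;
import java.util.Optional;

class RuntimeFeatures {
    // Returns the public method if the loaded class declares or inherits it,
    // or empty when running against an older API that lacks it.
    static Optional<Method> findMethod(final Class<?> type,
                                       final String name,
                                       final Class<?>... parameterTypes) {
        try {
            return Optional.of(type.getMethod(name, parameterTypes));
        } catch (final NoSuchMethodException | LinkageError e) {
            // LinkageError covers classloading failures for missing signature types.
            return Optional.empty();
        }
    }

    static boolean supports(final Class<?> type,
                            final String name,
                            final Class<?>... parameterTypes) {
        return findMethod(type, name, parameterTypes).isPresent();
    }
}

// Hypothetical stand-ins for an older and a newer version of a runtime interface.
interface OldContext {
    void pause();
}

interface NewContext extends OldContext {
    Object errantRecordReporter();
}
```

Here `RuntimeFeatures.supports(NewContext.class, "errantRecordReporter")` is true while the same check against `OldContext.class` is false, which lets a connector decide once, at start-up, which code path to take.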

Additional TimestampSource Types?

With the introduction of TimestampSource and the single enumerated type of WALLCLOCK, what are the plans to support additional timestamp sources?

I believe there are 3 common cases: WALLCLOCK, EVENT, and EXTRACTED. Ideally, I need to make sure that a timestamp field in the event controls the folder that the event goes to. And if I have messages without a timestamp in them, using the EVENT timestamp is more helpful than wallclock.

I get the sense that this is in the roadmap because of this being added, but was wondering if that is still the plan?
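
To illustrate the three cases, here is a rough sketch of how resolution could work. The names `TimestampKind` and `TimestampResolver`, the flat field map, and the fallback order (EXTRACTED, then EVENT, then wallclock) are all assumptions for the sake of the example, not the library's behaviour.

```java
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical enum mirroring the three cases discussed in this issue.
enum TimestampKind { WALLCLOCK, EVENT, EXTRACTED }

class TimestampResolver {
    // Resolves epoch milliseconds for one record.
    // recordTimestamp may be null; fields is a decoded payload (an assumption of this sketch).
    static long resolve(final TimestampKind kind,
                        final Long recordTimestamp,
                        final Map<String, Object> fields,
                        final String fieldName,
                        final LongSupplier clock) {
        switch (kind) {
            case EXTRACTED: {
                final Object value = fields == null ? null : fields.get(fieldName);
                if (value instanceof Number) {
                    return ((Number) value).longValue();
                }
                // Assumed policy: fall back to the event timestamp, then to wallclock.
                return recordTimestamp != null ? recordTimestamp : clock.getAsLong();
            }
            case EVENT:
                return recordTimestamp != null ? recordTimestamp : clock.getAsLong();
            case WALLCLOCK:
            default:
                return clock.getAsLong();
        }
    }
}
```

Passing the clock as a `LongSupplier` (for example `System::currentTimeMillis`) keeps the wallclock case testable.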

Remove kafka-clients from dependencies

One side effect of #166: kafka-clients (pulled in via schema-registry-client) is included in the output package.

This is probably a non-issue: the connector plugin classpath should be isolated, and it should be OK to ship kafka-clients, since it should be aligned with the schema-registry-client dependency that uses it. Unless schema-registry-client is not needed altogether?

Just want to call out this as it's causing Aiven-Open/gcs-connector-for-apache-kafka#279 to fail (though it is caused by how the integration tests are implemented, see Aiven-Open/gcs-connector-for-apache-kafka#283).

cc @C0urante

What happens if the first record does not have headers but further records in the same batch do?

The output schema (avro/parquet) is based on the first record. This works fine for key, value, etc., but headers may differ between records. Differing header element types can cause issues, but this can be solved by casting header types, using StringConverter or similar for headers.
However, when the first record does not have headers, the schema type is null. We should figure out what happens to subsequent records in the same batch if they do have headers (are the headers missing? does conversion fail?) and provide an approach to infer the schema for headers properly.

See Aiven-Open/gcs-connector-for-apache-kafka#347 (comment)

Add changelog record grouping by field

Currently, changelog record grouping only allows using topic, partition, and offset metadata as part of the filename template.
There is a need to group records by a payload field (e.g. customer-id) to have them all together and facilitate any indexing on top of the object storage.

For instance, the following structure could be possible:

- t1/
  - p0/
    - customer1/
      - 00000000.json
      - 00001000.json
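
A minimal sketch of how such grouping could produce the layout above. The `Rec` type, the `_unknown` placeholder for records missing the field, and naming each object after the offset of its first record are assumptions of this example, not existing library behaviour.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldGroupingSketch {
    // Hypothetical minimal record: metadata plus a flat string payload.
    static final class Rec {
        final String topic;
        final int partition;
        final long offset;
        final Map<String, String> payload;

        Rec(final String topic, final int partition, final long offset,
            final Map<String, String> payload) {
            this.topic = topic;
            this.partition = partition;
            this.offset = offset;
            this.payload = payload;
        }
    }

    // Groups records under "<topic>/p<partition>/<field value>/", preserving arrival order.
    static Map<String, List<Rec>> groupByField(final List<Rec> records, final String field) {
        final Map<String, List<Rec>> groups = new LinkedHashMap<>();
        for (final Rec record : records) {
            // "_unknown" is an assumed placeholder for records that lack the field.
            final String value = record.payload.getOrDefault(field, "_unknown");
            final String prefix = record.topic + "/p" + record.partition + "/" + value + "/";
            groups.computeIfAbsent(prefix, k -> new ArrayList<>()).add(record);
        }
        return groups;
    }

    // Names the object after the offset of the first record in the group.
    static String objectName(final String prefix, final List<Rec> group) {
        return prefix + String.format("%08d.json", group.get(0).offset);
    }
}
```

Grouping records from topic `t1`, partition 0 by `customer-id` would place all records of `customer1` under `t1/p0/customer1/`, with the first object named `00000000.json` if the group starts at offset 0.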
