
conduitio / conduit

Conduit streams data between data stores. Kafka Connect replacement. No JVM required.

Home Page: https://conduit.io

License: Apache License 2.0

Dockerfile 0.03% Makefile 0.14% Go 87.86% Shell 0.23% HTML 0.34% JavaScript 8.20% Handlebars 3.12% CSS 0.09%
conduit data-engineering data-integration data-pipeline data-stream etl go kafka kafkaconnect


conduit's Issues

Pipeline started after being degraded still has the error

Description

A pipeline that is started after it has been degraded still shows the previous error. To reproduce:

  1. Create a pipeline, e.g. with a Kafka source.
  2. Stop the Kafka broker.
  3. Wait for a bit.
  4. The pipeline is shown as degraded -- this is both the expected and the actual behavior.
  5. Start the pipeline.
  6. The pipeline is shown as running, but with an error -- the expected behavior is that no error is shown.

Example:

[
    {
        "id": "65eb595b-57a6-45d7-85ff-21a721664855",
        "state": {
            "status": "STATUS_RUNNING",
            "error": "node cf6a452e-9261-4b64-b2f3-9b7cca6534ef stopped with error:\n    github.com/conduitio/conduit/pkg/pipeline.(*Service).runPipeline.func1\n        /Users/haris/projects/conduit/pkg/pipeline/lifecycle.go:376\n  - error reading from source:\n    github.com/conduitio/conduit/pkg/pipeline/stream.(*SourceNode).Run.func2\n        /Users/haris/projects/conduit/pkg/pipeline/stream/source.go:94\n  - source client read:\n    github.com/conduitio/conduit/pkg/plugins.(*sourceClient).Read\n        /Users/haris/projects/conduit/pkg/plugins/source.go:136\n  - rpc error: code = Unknown desc = source server read: failed getting a message received error from client localhost:9092/1: 1 request(s) timed out: disconnect (after 772261ms in state UP): received error from client localhost:9092/1: 1 request(s) timed out: disconnect (after 772261ms in state UP)"
        },
        "config": {
            "name": "pipeline-name-712442dd-64f3-43de-9483-5844b4f6649c",
            "description": "My new pipeline"
        },
        "connectorIds": [
            "cf6a452e-9261-4b64-b2f3-9b7cca6534ef",
            "7741daaa-6386-45ba-824a-6a5051993570"
        ],
        "processorIds": []
    }
]

Design acceptance tests for plugins

The plugin interface is not only structural but also behavioral (e.g. Open is guaranteed to be called before Read or Write, and certain errors have special meaning). We should provide utilities to run an acceptance test on any plugin to figure out whether it behaves as expected. The goal is to write a design doc for acceptance tests that determine whether a plugin is generally going to work with Conduit.
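
A minimal sketch of what such an acceptance test utility could check, assuming a hypothetical Source interface with Open, Read and Teardown methods (the real plugin interface may differ):

package acceptance

import (
	"context"
	"testing"
)

// Source is a hypothetical plugin interface used for illustration only;
// the actual Conduit plugin interface may look different.
type Source interface {
	Open(ctx context.Context, position []byte) error
	Read(ctx context.Context) ([]byte, error)
	Teardown(ctx context.Context) error
}

// AcceptanceTest runs behavioral checks that any source plugin should pass,
// e.g. Read must fail before Open was called.
func AcceptanceTest(t *testing.T, s Source) {
	ctx := context.Background()

	// Behavioral guarantee: Read is only valid after Open.
	if _, err := s.Read(ctx); err == nil {
		t.Error("expected an error when calling Read before Open")
	}

	if err := s.Open(ctx, nil); err != nil {
		t.Fatalf("Open failed: %v", err)
	}
	defer s.Teardown(ctx)

	// After Open, Read should either return a record or a well-defined error.
	if _, err := s.Read(ctx); err != nil {
		t.Logf("Read returned an error (may be expected for an empty source): %v", err)
	}
}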

Performance Benchmarks

While everyone should do their own benchmarking, we need to give developers somewhere to start. This will anchor expectations, and then we can help folks tune their installations as they progress.

Terraform Provider

Because Conduit is software that is meant to be run as a service, developers will want ways to easily repeat their setup process. Conduit needs to be able to be set up via Terraform.

Add CI action for generated code

We should add a GitHub Action that runs make generate and succeeds if there is no diff. It should run on each push to a PR.

We could do the same with proto files (generate them and make sure there's no diff). However, proto files aren't generated in the repo anymore; we use Buf remote code generation now.

Load pipeline config files on startup

Conduit should by default load all YAML files in the pipelines folder (relative to the Conduit binary) and provision pipelines when it starts. The folder location should be configurable through a CLI flag.

Depends on #493.
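
A rough sketch of the startup loading logic; the flag name, the default folder, and the provisionPipeline function are assumptions for illustration, not the final design:

package main

import (
	"flag"
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Assumed flag name and default; the default is relative to the working
	// directory of the Conduit binary.
	pipelinesPath := flag.String("pipelines.path", "./pipelines", "path to the folder containing pipeline config files")
	flag.Parse()

	entries, err := os.ReadDir(*pipelinesPath)
	if err != nil {
		log.Printf("could not read pipelines folder %q: %v", *pipelinesPath, err)
		return
	}

	for _, e := range entries {
		ext := filepath.Ext(e.Name())
		if e.IsDir() || (ext != ".yaml" && ext != ".yml") {
			continue // only YAML files are provisioned
		}
		raw, err := os.ReadFile(filepath.Join(*pipelinesPath, e.Name()))
		if err != nil {
			log.Printf("skipping %s: %v", e.Name(), err)
			continue
		}
		provisionPipeline(raw)
	}
}

// provisionPipeline is a hypothetical placeholder for parsing the YAML and
// creating the pipeline through the provisioning service.
func provisionPipeline(raw []byte) {
	log.Printf("provisioning pipeline from %d bytes of config", len(raw))
}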

Docker Getting Started

In terms of installation instructions, we need to give developers the ability to install via Docker if need be.

Postgres source: records duplicated in certain cases

If the last saved position in a Pg source is M, and a newly inserted row has ID N, then the newly created record will be returned N-M times by the source. For example, if the last row returned by a Pg source had ID 10 and we then insert a row with ID 30, the new row will be returned 20 times.

Build 0ec0aa6

Steps to reproduce

  1. Set up a table in Pg with one row, where the ID is 100 (for example):
insert into persons (firstname, personid) values ('foo', 100);
  2. Create a pipeline with a Pg source and any destination (a file destination works). The Pg source configuration used:
"settings": {
	"table": "persons",
	"url": "postgresql://meroxauser:meroxapass@localhost/conduit-test-db?sslmode=disable"
}
  3. Run the pipeline.
  4. Expected (and actual) behavior: you see the following in the file destination:
{"firstname":"foo","personid":100}
  5. Insert the following row into Pg:
insert into persons (address, city, firstname, lastname, personid) values ('wall street', 'new york', 'foo', 'bar', 105);
  6. Run the pipeline again.
  7. Expected: you see the following in the file destination:

{"firstname":"foo","personid":105}

Actual behavior:

{"firstname":"foo","personid":105}
{"firstname":"foo","personid":105}
{"firstname":"foo","personid":105}
{"firstname":"foo","personid":105}
{"firstname":"foo","personid":105}

Export Pipeline to Meroxa

If a developer wants to go to the managed platform, they'll have the option of exporting their pipeline to Meroxa.

Extract Connector plugin SDK

Extract all functionality needed for creating a plugin into a separate repository. This means most of the code in package pkg/plugin/sdk should be moved. The newly created repository should not have Conduit as a dependency but should define everything it needs locally (e.g. structures for records, logging etc.). The reason is that Conduit will import the SDK repository and use it to communicate with plugins. The goal is to create a minimal repository that can be imported by plugins and contains everything that's needed to implement a plugin.
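
A sketch of what the standalone SDK could define locally; the type names below are illustrative and not the actual SDK API:

// Package sdk is a sketch of a standalone connector SDK that has no
// dependency on Conduit itself; all names here are illustrative.
package sdk

import "context"

// Record is defined locally so plugins don't need to import Conduit.
type Record struct {
	Position []byte
	Metadata map[string]string
	Key      []byte
	Payload  []byte
}

// Source is implemented by source connector plugins.
type Source interface {
	Configure(ctx context.Context, cfg map[string]string) error
	Open(ctx context.Context, position []byte) error
	Read(ctx context.Context) (Record, error)
	Ack(ctx context.Context, position []byte) error
	Teardown(ctx context.Context) error
}

// Destination is implemented by destination connector plugins.
type Destination interface {
	Configure(ctx context.Context, cfg map[string]string) error
	Open(ctx context.Context) error
	Write(ctx context.Context, r Record) error
	Teardown(ctx context.Context) error
}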

Extract built-in plugins

The Conduit repository currently contains built-in plugins. We should instead move them to separate repositories and make sure they are included in Conduit when it's compiled. This way we can easily add more built-in plugins in the future, even ones that are not built and maintained in-house. The easiest way to include them at compile time is to import them and add them to a global variable containing built-in plugins; a minimal sketch of this follows the checklist below. The drawback of this approach is that we can only include plugins written in Go (which is probably not even a problem in the short term).

Depends on #37.

Once a plugin is extracted, we need to adjust the code so it references the extracted SDK instead of Conduit.

  • S3 plugin
  • Kafka plugin
  • Postgres plugin
  • File plugin
  • Generator plugin
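
A minimal sketch of the compile-time registration idea; the package layout and the Dispenser type are assumptions for illustration only:

// Package builtin sketches how extracted plugins could still be compiled
// into Conduit: each plugin repository is imported and added to a global
// registry.
package builtin

// Dispenser is a placeholder for whatever a built-in plugin exposes.
type Dispenser interface {
	Specification() map[string]string
}

// registry maps plugin names to their dispensers. Imported plugin modules
// would be added here at compile time.
var registry = map[string]Dispenser{
	// "builtin:file":     file.NewDispenser(),     // from the extracted file plugin repo
	// "builtin:kafka":    kafka.NewDispenser(),    // from the extracted kafka plugin repo
	// "builtin:postgres": postgres.NewDispenser(), // from the extracted postgres plugin repo
}

// Get returns a built-in plugin by name.
func Get(name string) (Dispenser, bool) {
	d, ok := registry[name]
	return d, ok
}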

Leveled logger for plugins

Add utility functions for plugins to create a logger with support for leveled output. Log output from plugins is already captured by Conduit and included in its own log stream, but it's currently logged without any level. The goal is to allow the plugin to decide which log level will be used for a message and possibly add structured fields.
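
A sketch of such a utility, under the assumption that Conduit captures the plugin's stderr and parses a level prefix from each line; the format and names are illustrative only:

// Package plugsdk sketches a leveled logger utility for plugins.
package plugsdk

import (
	"fmt"
	"os"
)

type Level string

const (
	LevelDebug Level = "DEBUG"
	LevelInfo  Level = "INFO"
	LevelWarn  Level = "WARN"
	LevelError Level = "ERROR"
)

// Logger writes leveled, optionally structured log lines to stderr, where
// Conduit (by assumption) picks them up and merges them into its own stream.
type Logger struct {
	fields map[string]any
}

func New() *Logger { return &Logger{fields: map[string]any{}} }

// With returns a logger with an additional structured field.
func (l *Logger) With(key string, value any) *Logger {
	f := map[string]any{key: value}
	for k, v := range l.fields {
		f[k] = v
	}
	return &Logger{fields: f}
}

func (l *Logger) log(level Level, msg string) {
	fmt.Fprintf(os.Stderr, "%s: %s %v\n", level, msg, l.fields)
}

func (l *Logger) Debug(msg string) { l.log(LevelDebug, msg) }
func (l *Logger) Info(msg string)  { l.log(LevelInfo, msg) }
func (l *Logger) Warn(msg string)  { l.log(LevelWarn, msg) }
func (l *Logger) Error(msg string) { l.log(LevelError, msg) }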

Error handling: Return proper status codes in API

The gRPC and HTTP APIs should return proper status codes. For example, GET /pipelines/{id} should return a 404 if a pipeline is not found, and the corresponding gRPC endpoint should return status code 5 (NOT_FOUND).

This should probably be done with a middleware that contains a mapping between globally defined error values (see ConduitIO/conduit-old#262) and gRPC/HTTP status codes.

Additionally, we need to make sure to document this in protobuf annotations so that it shows up in the Swagger docs.
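
A sketch of the middleware idea as a gRPC unary interceptor, assuming a locally defined sentinel error (the actual error values depend on the error-handling issue) and assuming the HTTP API is served through a gateway that maps codes.NotFound to 404:

package api

import (
	"context"
	"errors"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// ErrNotFound stands in for a globally defined Conduit error value.
var ErrNotFound = errors.New("entity not found")

// ErrorInterceptor maps internal error values to gRPC status codes; the
// HTTP layer then translates those codes into HTTP status codes.
func ErrorInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (any, error) {
		resp, err := handler(ctx, req)
		switch {
		case err == nil:
			return resp, nil
		case errors.Is(err, ErrNotFound):
			return nil, status.Error(codes.NotFound, err.Error()) // code 5
		default:
			return nil, status.Error(codes.Internal, err.Error())
		}
	}
}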

Use mock connector in tests

Our tests are currently (ab)using the file connector plugin to run pipeline tests. We should instead mock the connector instance to better isolate unit tests, and either remove the current tests or tag them as integration tests.

Plugin documentation workflow

Figure out how we get documentation for plugins from their repositories into the docs repository. (Side note: try to do it with pull, not push, so that we can include third-party plugins in the future.)

Entity locking in orchestration layer

Initially, each service (pipeline, connector, processor service) had its own instance lock. The problem was twofold: the services locked instances only for the duration of the operation inside the service, and they did not lock any related instances. Let's take the creation of a processor as an example: it requires us to get the pipeline, lock it in place so it's not modified in the meantime, create the processor, add it to the pipeline, commit the transaction to the DB, and only then release the lock.

Now that we have the orchestration layer, it should be the responsibility of this layer to lock any entities that will be modified in an operation before making changes. The goal is to make Conduit safe for concurrent use, i.e. multiple requests changing the same resource at the same time.
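
A sketch of how the orchestration layer could lock a pipeline for the whole operation; the per-pipeline mutex and the service interfaces are assumptions for illustration (the real implementation would also tie this to the DB transaction):

package orchestrator

import (
	"context"
	"sync"
)

// Orchestrator locks all affected entities before an operation.
type Orchestrator struct {
	mu        sync.Mutex
	locks     map[string]*sync.Mutex // one lock per pipeline ID
	pipelines PipelineService
	procs     ProcessorService
}

type PipelineService interface {
	AddProcessor(ctx context.Context, pipelineID, processorID string) error
}

type ProcessorService interface {
	Create(ctx context.Context, pipelineID string) (string, error)
}

// lockPipeline returns the mutex guarding a single pipeline.
func (o *Orchestrator) lockPipeline(id string) *sync.Mutex {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.locks == nil {
		o.locks = map[string]*sync.Mutex{}
	}
	if _, ok := o.locks[id]; !ok {
		o.locks[id] = &sync.Mutex{}
	}
	return o.locks[id]
}

// CreateProcessor holds the pipeline lock for the whole operation: create
// the processor, attach it to the pipeline, and only then release the lock.
func (o *Orchestrator) CreateProcessor(ctx context.Context, pipelineID string) (string, error) {
	l := o.lockPipeline(pipelineID)
	l.Lock()
	defer l.Unlock()

	procID, err := o.procs.Create(ctx, pipelineID)
	if err != nil {
		return "", err
	}
	return procID, o.pipelines.AddProcessor(ctx, pipelineID, procID)
}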

CLI configuration

The goal is to make the Conduit CLI configurable. We should support a combination of flags, environment variables, and/or a config file. We need to decide what needs to be configurable (e.g. log level, API ports, path to the config file, path to pipeline configs) and the corresponding flag/env var/config field names. We already have some simple CLI flags; the goal of this issue is to implement this in a way that makes it easy to add new options in the future.
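
A minimal sketch of the flag > env var > default precedence using only the standard library; the flag and environment variable names are illustrative, not the final ones:

package main

import (
	"flag"
	"fmt"
	"os"
)

// stringConfig resolves a setting with the precedence flag > env var > default.
func stringConfig(flagVal, envName, def string) string {
	if flagVal != "" {
		return flagVal
	}
	if v, ok := os.LookupEnv(envName); ok {
		return v
	}
	return def
}

func main() {
	logLevel := flag.String("log.level", "", "log level (debug, info, warn, error)")
	grpcAddr := flag.String("grpc.address", "", "address of the gRPC API")
	flag.Parse()

	// Defaults below are placeholders, not Conduit's actual defaults.
	cfg := struct{ LogLevel, GRPCAddr string }{
		LogLevel: stringConfig(*logLevel, "CONDUIT_LOG_LEVEL", "info"),
		GRPCAddr: stringConfig(*grpcAddr, "CONDUIT_GRPC_ADDRESS", ":8084"),
	}
	fmt.Printf("resolved config: %+v\n", cfg)
}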

Plugins technical documentation

We should create technical documentation for plugins targeted at developers. This should contain information like how to create a new connector plugin, how to debug it, how Conduit will call plugin functions (call order guarantees), which errors should be returned, how logging should be handled, etc.

Make sure the link in the readme points to the correct file after creating this doc.

Acceptance Tests

We need to get our acceptance testing up to an acceptable level. We are looking for a particular percentage of code coverage.

Clustered Conduit

The problem to solve is high availability. Should one node go down, another will need to take its place. The goal is not disaster recovery; we will need to develop a separate solution for that.

Propagate connector persister errors

Right now, connector persister errors are only logged and not propagated back to the Conduit runtime. We need to make sure that an error in the connector persister is sent back to the Conduit runtime, which then initiates a shutdown.

Kafka sink: Asynchronous sends

Currently, when sending messages in the Kafka sink connector, we use synchronous sends. While this is fine for the first version of the connector, we want asynchronous sends, which will increase performance.
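
For illustration, here is what asynchronous sends could look like with segmentio/kafka-go (the client a later issue proposes adopting); this is a sketch, not the current connector code, and the broker address and topic are placeholders:

package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Async writer: WriteMessages returns immediately and delivery results
	// are reported through the Completion callback.
	w := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "conduit-sink",
		Async: true,
		Completion: func(messages []kafka.Message, err error) {
			if err != nil {
				// In the connector this is where records would be nacked
				// instead of just logged.
				log.Printf("failed to deliver %d message(s): %v", len(messages), err)
				return
			}
			log.Printf("delivered %d message(s)", len(messages))
		},
	}
	defer w.Close()

	err := w.WriteMessages(context.Background(),
		kafka.Message{Key: []byte("k1"), Value: []byte("v1")},
		kafka.Message{Key: []byte("k2"), Value: []byte("v2")},
	)
	if err != nil {
		log.Fatalf("enqueueing messages failed: %v", err)
	}
}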

Connector Directory Listing

Once the connector SDK has been developed, we need to help other developers search for and discover new connectors that have been created for Conduit.

GRPC connector state endpoint

Currently we have an endpoint for updating a connector config. We need to either allow updating the state through that endpoint or add a separate one specifically for updating the state (e.g. setting the position). The assignee should figure out which approach to take and implement it.

Cleanup and release v0.1.0

Checklist before releasing:

  • Readme
    • Logo is displayed correctly
    • Links point to the correct pages
      • docs.conduit.io works?
      • conduit.io works?
    • Slogan is updated
  • Repository description is set (same as slogan probably)
  • Repository topics are set
  • Rewrite history by squashing all commits
  • Clean up test releases and Docker images
  • Create tag v0.1.0
  • Make repository public

Checklist after repo is public:

  • Readme
    • Badges are displayed correctly
  • Issue templates display correctly and link to correct places (e.g. ask questions, open documentation issues)
  • Make sure godocs is accessible
  • Make sure forking is enabled
  • Social preview image
  • Make sure goreport is accessible and badge shows up correctly (the page is down on the day of open-sourcing)

Replace confluent-kafka-go with Segment's kafka-go client

The Kafka client we use now is confluent-kafka-go. We initially chose it because it's one of the most used Go clients for Kafka, it's quite configurable, and it has a simpler API (it's possible to read messages from a whole topic, whereas most other clients require messages to be read explicitly from partitions).

However, because it has a dependency on CGo (under the hood it uses librdkafka), we couldn't find a way to build Conduit for a number of platforms and architectures (Windows and Darwin, AMD64, ARM64). A consumer sketch with kafka-go follows the checklist below.

  • Replace the producer (destination connector)
  • Replace the consumer (source connector) #105
  • Handle case where new partitions are added #105
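
A small consumer sketch with kafka-go, where a consumer-group Reader takes care of partition assignment instead of requiring explicit per-partition reads; broker, group ID and topic are placeholders:

package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	// A consumer-group reader: partition assignment and rebalancing are
	// handled by the client for the configured group.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "conduit-source", // placeholder group ID
		Topic:   "example-topic",  // placeholder topic
	})
	defer r.Close()

	ctx := context.Background()
	for i := 0; i < 3; i++ {
		msg, err := r.ReadMessage(ctx) // commits the offset on success
		if err != nil {
			log.Fatalf("read failed: %v", err)
		}
		log.Printf("partition=%d offset=%d key=%s", msg.Partition, msg.Offset, string(msg.Key))
	}
}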

Readme section for sharp edges

We should add a section where we describe pitfalls / sharp edges / future work / known limitations of Conduit to manage expectations. Conduit is still in its infancy and not really meant to be used in production yet; we owe it to our users to be open about this. At the same time, we can use this section to clarify that we know there is more work to do and that we are on it.

This section should talk about Conduit and not plugins, as the limitations of plugins are listed in the readme of each plugin (we can mention this though).

Some things we already identified (not an exhaustive list):

  • no clustering support
  • no open CDC record format support
  • can't provide firm delivery guarantees (we are missing tests to prove it)
  • can't provide performance guarantees (again, no benchmarks)
  • no plugin management (Conduit does not have a list of available plugins, UI hardcodes plugins)

Conduit Project Landing Site

A site for the project. Requirements include:

  • Where to have conversations with the community
  • What the project is about
  • How to get started
  • Project Goals
  • Links to the documentation

We may decide to include the documentation as part of the site itself. We would need to determine the information architecture.

Postgres CDC unexpected behavior

The test for Postgres CDC does not assert the correct behavior. The expected behavior is the following:

  1. If we call Source.Read on an empty table that never had any rows in it we should receive a recoverable error.
  2. The first call to Source.Read should be with the position nil.
  3. All following calls to Source.Read should be with the position of the record that was last returned.
  4. Once we add a row to the table Source.Read should return that record.
  5. If we call Source.Read again it should return a recoverable error, since no change happened in the database so no change should be detected.
  6. Only after updating the record (or deleting it, or inserting a new one) the Source.Read function should return another record.

It would also be great to have tests making sure we detect inserts, updates and deletes.
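
A sketch of a test that encodes this contract, using hypothetical stand-ins (Source, ErrRecoverable, openTestSource) rather than the actual plugin types:

package pgcdc_test

import (
	"context"
	"errors"
	"testing"
)

// ErrRecoverable stands in for the plugin's recoverable "no new changes" error.
var ErrRecoverable = errors.New("no new changes")

type Record struct{ Position []byte }

type Source interface {
	Read(ctx context.Context, pos []byte) (Record, error)
}

func TestCDCReadContract(t *testing.T) {
	ctx := context.Background()
	src, insertRow := openTestSource(t) // hypothetical helper backed by an empty table

	// Steps 1+2: the first Read uses a nil position; on an empty table it
	// must return a recoverable error.
	if _, err := src.Read(ctx, nil); !errors.Is(err, ErrRecoverable) {
		t.Fatalf("expected recoverable error on empty table, got %v", err)
	}

	// Step 4: after inserting a row, Read should return that record.
	insertRow()
	rec, err := src.Read(ctx, nil)
	if err != nil {
		t.Fatalf("expected a record after insert, got %v", err)
	}

	// Steps 3+5: the next Read uses the last record's position and, since
	// nothing changed, must again return a recoverable error.
	if _, err := src.Read(ctx, rec.Position); !errors.Is(err, ErrRecoverable) {
		t.Fatalf("expected recoverable error when no change happened, got %v", err)
	}
}

// openTestSource is a placeholder; wiring up a real Postgres test database
// is out of scope for this sketch.
func openTestSource(t *testing.T) (Source, func()) {
	t.Helper()
	t.Skip("sketch only")
	return nil, nil
}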

Stream Inspector

The goal is to give folks the ability to see what's happening inside the stream. This is not like tail. This is more like peek. Pull out some information from the stream so that a developer can see the data and data types.

End-to-end tests

We need to create end-to-end tests that treat Conduit as a black box, trigger requests to the API, and make sure they produce the expected result. For now they should cover at least the main paths (e.g. creating a pipeline, starting it, stopping it, deleting it). These tests should be easy to run locally without setting up a specific environment (ideally with one make target); additionally, we should trigger them in CI as regression tests before merging any PR.

Plugin management service

Create a service for managing plugins. For now this means that we need a list of pre-defined paths to plugins (built-in plugins); these plugins should be loaded on startup to fetch their specifications, which are then indexed in memory. When a new connector is created, the plugin should be fetched by its name from this plugin manager, not by the path to the plugin as is the case right now. In the future we can add the functionality for adding plugins on the fly, but this is not needed for now.
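
A sketch of such a plugin manager; Specification, the dispense function, and the name-to-path indexing are assumptions about the design, not existing code:

package plugin

import "context"

// Specification is a hypothetical stand-in for a plugin's self-description.
type Specification struct {
	Name    string
	Summary string
}

// dispenseSpecFunc would start the plugin at the given path just long enough
// to fetch its specification; the real mechanism is the plugin RPC.
type dispenseSpecFunc func(ctx context.Context, path string) (Specification, error)

// Service indexes plugin specifications by name so connectors can reference
// plugins by name instead of by path.
type Service struct {
	dispense dispenseSpecFunc
	byName   map[string]string // plugin name -> path
	specs    map[string]Specification
}

func NewService(dispense dispenseSpecFunc) *Service {
	return &Service{
		dispense: dispense,
		byName:   map[string]string{},
		specs:    map[string]Specification{},
	}
}

// Init loads the specifications of all plugins at the pre-defined paths.
func (s *Service) Init(ctx context.Context, paths []string) error {
	for _, p := range paths {
		spec, err := s.dispense(ctx, p)
		if err != nil {
			return err
		}
		s.byName[spec.Name] = p
		s.specs[spec.Name] = spec
	}
	return nil
}

// PathByName resolves a plugin name to its path when creating a connector.
func (s *Service) PathByName(name string) (string, bool) {
	p, ok := s.byName[name]
	return p, ok
}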

Provide pre-built binaries for Windows and Macs with M1 chips

This is a follow-up on ConduitIO/conduit-old#438.

Due to reasons mentioned in the PR, we currently do not have pre-built binaries of Conduit and plugins for Windows or for Macs with M1 chips.

We should investigate our options and what needs to be done so we can provide them as well.

Transform: Metadata extractor

We should create a transform that allows us to extract a value from a record key or payload and insert it into the record metadata. The original field should stay untouched. The metadata field name needs to be configurable as well as the payload/key field.

Example: a metadata extractor configured to extract the value of field foo and insert it as metadata into field bar, applied to the following record:

metadata: {"baz":"irrelevant"}
payload: {"foo":"some value","bar":123}

Should produce:

metadata: {"bar":"some value","baz":"irrelevant"}
payload: {"foo":"some value","bar":123}

Postgres Source fails when handling bytea primary keys

Steps to Reproduce

  • Have a Postgres Source configured to read a table that has a primary key column of type bytea.
  • Configure a pipeline to run that plugin.

Expected Behavior

The Postgres Source should handle this functionality in the same way it would handle any other key column.

Actual Behavior

The Postgres connector cannot handle this query correctly; instead of returning the correct result or a descriptive error, it returns ErrEndData, which implies that the plugin operated correctly and has reached the end of its data.

Health check

We have a health check endpoint that always returns a healthy status. We need to implement an actual health check that makes sure Conduit is running correctly. Right now the only thing to check is whether the DB is working correctly (e.g. ping the DB). Once we tackle clustering, this check should (probably) be improved to definitively indicate whether a node in the cluster is healthy.
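
A sketch of what a DB-backed health check could look like; the handler shape and the *sql.DB store are assumptions (Conduit's actual store may expose a different ping mechanism):

package api

import (
	"context"
	"database/sql"
	"net/http"
	"time"
)

// HealthHandler reports healthy only if the DB responds to a ping within a
// short timeout, instead of unconditionally returning a healthy status.
func HealthHandler(db *sql.DB) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "database unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})
}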
