conduitio / conduit
Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
Home Page: https://conduit.io
License: Apache License 2.0
Description
A pipeline that is started after it has been degraded still shows the previous error. To reproduce:
Example:
[
{
"id": "65eb595b-57a6-45d7-85ff-21a721664855",
"state": {
"status": "STATUS_RUNNING",
"error": "node cf6a452e-9261-4b64-b2f3-9b7cca6534ef stopped with error:\n github.com/conduitio/conduit/pkg/pipeline.(*Service).runPipeline.func1\n /Users/haris/projects/conduit/pkg/pipeline/lifecycle.go:376\n - error reading from source:\n github.com/conduitio/conduit/pkg/pipeline/stream.(*SourceNode).Run.func2\n /Users/haris/projects/conduit/pkg/pipeline/stream/source.go:94\n - source client read:\n github.com/conduitio/conduit/pkg/plugins.(*sourceClient).Read\n /Users/haris/projects/conduit/pkg/plugins/source.go:136\n - rpc error: code = Unknown desc = source server read: failed getting a message received error from client localhost:9092/1: 1 request(s) timed out: disconnect (after 772261ms in state UP): received error from client localhost:9092/1: 1 request(s) timed out: disconnect (after 772261ms in state UP)"
},
"config": {
"name": "pipeline-name-712442dd-64f3-43de-9483-5844b4f6649c",
"description": "My new pipeline"
},
"connectorIds": [
"cf6a452e-9261-4b64-b2f3-9b7cca6534ef",
"7741daaa-6386-45ba-824a-6a5051993570"
],
"processorIds": []
}
]
The plugin interface is not only structural but also behavioral (e.g. Open is guaranteed to be called before Read or Write, and certain errors have special meaning). We should provide utilities to run an acceptance test on any plugin to figure out if it is behaving as expected. The goal is to write a design doc for creating acceptance tests that determine if a plugin is generally going to work with Conduit.
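As a starting point, a minimal sketch of one such behavioral check, written against a hypothetical SourcePlugin interface (the real SDK interface may differ):

package acceptance

import (
	"context"
	"testing"
)

// SourcePlugin is a hypothetical stand-in for the real plugin interface.
type SourcePlugin interface {
	Open(ctx context.Context) error
	Read(ctx context.Context) ([]byte, error)
}

// AcceptanceTest checks one behavioral guarantee: Read must not succeed
// before Open has been called.
func AcceptanceTest(t *testing.T, p SourcePlugin) {
	ctx := context.Background()

	if _, err := p.Read(ctx); err == nil {
		t.Error("expected Read before Open to return an error")
	}
	if err := p.Open(ctx); err != nil {
		t.Fatalf("Open failed: %v", err)
	}
	if _, err := p.Read(ctx); err != nil {
		t.Logf("Read after Open returned: %v", err)
	}
}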
While everyone should do their own benchmarking, we need to give developers somewhere to start. This will anchor expectations, and then we can help folks tune their installations as they progress.
Because Conduit is meant to be run as a service, developers will want an easy way to repeat their setup process. Conduit needs to be able to be set up via Terraform.
We should add a GitHub action that runs make generate and succeeds if there is no diff. It should run on each push to a PR.
We could do the same with proto files (generate them and make sure there's no diff). Update: proto files aren't generated in the repo anymore; we use Buf remote code generation now.
Conduit should by default load all YAML files in the folder pipelines (relative to the conduit binary) and provision pipelines when it starts. The folder location should be configurable through a CLI flag.
Depends on #493.
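For illustration, a hypothetical pipelines/example.yml; the actual schema is to be defined as part of this issue, so every field name below is an assumption:

pipelines:
  - id: example-pipeline
    status: running
    description: provisioned from the pipelines folder at startup
    connectors:
      - id: source-1
        type: source
        plugin: builtin:file
        settings:
          path: /tmp/input.txt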
As for installation instructions, we need to give developers the ability to install via Docker if need be.
If the last saved position in a Pg source is M, and a newly inserted row has ID N, then the newly created record will be returned N-M times by the source. For example, if the last row returned by a Pg source had ID 10, and then we insert a row with the ID 30, then the new row will be returned 20 times.
Build 0ec0aa6
Steps to reproduce:
1. Insert a row:
insert into persons (firstname, personid) values ('foo', 100);
2. Create and run a pipeline with a Pg source configured as follows and a file destination:
"settings": {
"table": "persons",
"url": "postgresql://meroxauser:meroxapass@localhost/conduit-test-db?sslmode=disable"
}
3. The file destination receives:
{"firstname":"foo","personid":100}
4. Insert another row:
insert into persons (address, city, firstname, lastname, personid) values ('wall street', 'new york', 'foo', 'bar', 105);
5. Run the pipeline again.
6. Expected: you see the following in the file destination:
{"firstname":"foo","personid":105}
Actual behavior:
{"firstname":"foo","personid":105}
{"firstname":"foo","personid":105}
{"firstname":"foo","personid":105}
{"firstname":"foo","personid":105}
{"firstname":"foo","personid":105}
Need a way for developers to install Conduit via Homebrew.
If a developer wants to go to the managed platform, they'll have the option of exporting their pipeline to Meroxa.
Extract all functionality needed for creating a plugin into a separate repository. This means most of the code in package pkg/plugin/sdk should be moved. The newly created repository should not have Conduit as a dependency but should define everything it needs locally (e.g. structures for records, logging etc.). The reason is that Conduit will import the SDK repository and use it to communicate with plugins. The goal is to create a minimal repository that can be imported by plugins and contains everything that's needed to implement a plugin.
The S3 source has documentation, but the destination doesn't. (Both should be in pkg/plugins/s3/README.md.)
The conduit repository currently contains built-in plugins. We should rather move them to separate repositories and make sure they are included in Conduit when it's compiled. This way we can easily add more built-in plugins in the future, even ones that are not built and maintained in-house. The easiest way to include them at compile time is to just import them and add them to a global variable containing built-in plugins. The drawback of this approach is that we can only include plugins written in Go (probably not even a problem in the short term).
Depends on #37.
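A minimal sketch of the compile-time registration approach; the Plugin interface and package layout here are stand-ins, not the real ones:

package builtin

// Plugin is a stand-in for the real plugin interface.
type Plugin interface {
	Name() string
}

type filePlugin struct{}

func (filePlugin) Name() string { return "file" }

// DefaultPlugins is the global variable mentioned above. Adding a new
// built-in plugin means importing its package and adding an entry here,
// which is also why only plugins written in Go can be included this way.
var DefaultPlugins = map[string]Plugin{
	"file": filePlugin{},
}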
Once a plugin is extracted, we need to adjust the code so it references the extracted SDK instead of Conduit.
Flatten a nested data structure, generating names for each field by concatenating the field names at each level with a configurable delimiter character.
More info on how kafka does it:
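As a rough illustration of the transform (the names are illustrative, and JSON-like map[string]interface{} payloads are assumed):

package transform

// Flatten concatenates nested field names with the given delimiter, e.g.
// {"a":{"b":1}} with delimiter "." becomes {"a.b":1}.
func Flatten(in map[string]interface{}, delimiter string) map[string]interface{} {
	out := make(map[string]interface{})
	flattenInto(out, "", in, delimiter)
	return out
}

func flattenInto(out map[string]interface{}, prefix string, in map[string]interface{}, delimiter string) {
	for k, v := range in {
		key := k
		if prefix != "" {
			key = prefix + delimiter + k
		}
		if nested, ok := v.(map[string]interface{}); ok {
			flattenInto(out, key, nested, delimiter)
			continue
		}
		out[key] = v
	}
}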
Add utility functions for plugins to create a logger with support for leveled output. Log output from plugins is already captured by Conduit and included in its own log stream, but it's currently logged without any level. The goal is to allow the plugin to decide which log level will be used for a message and possibly add structured fields.
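A minimal sketch of what such a utility could look like; the JSON-over-stderr wire format is an assumption, since how Conduit captures plugin output is not specified here:

package sdk

import (
	"fmt"
	"os"
	"time"
)

type Level string

const (
	LevelDebug Level = "debug"
	LevelInfo  Level = "info"
	LevelWarn  Level = "warn"
	LevelError Level = "error"
)

// Log emits a structured line with an explicit level and optional fields so
// Conduit can map it into its own leveled log stream.
func Log(level Level, msg string, fields map[string]string) {
	fmt.Fprintf(os.Stderr, `{"level":%q,"time":%q,"message":%q`, level, time.Now().Format(time.RFC3339), msg)
	for k, v := range fields {
		fmt.Fprintf(os.Stderr, ",%q:%q", k, v)
	}
	fmt.Fprintln(os.Stderr, "}")
}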
The gRPC and HTTP APIs should return proper status codes. For example, GET /pipelines/{id} should return a 404 if a pipeline is not found, and the corresponding gRPC endpoint should return status code 5 (NOT_FOUND).
This should probably be done with a middleware that contains a mapping between globally defined error values (see ConduitIO/conduit-old#262) and gRPC/HTTP status codes.
Additionally, we need to make sure to document this in protobuf annotations so that it shows up in the swagger docs.
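A minimal sketch of such a middleware as a gRPC unary interceptor; ErrInstanceNotFound is a placeholder for the globally defined error values referenced above:

package api

import (
	"context"
	"errors"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// ErrInstanceNotFound is a placeholder for a globally defined error value.
var ErrInstanceNotFound = errors.New("instance not found")

// ErrorInterceptor maps known error values to gRPC status codes; grpc-gateway
// then translates codes.NotFound (code 5) into HTTP 404 automatically.
func ErrorInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		resp, err := handler(ctx, req)
		switch {
		case err == nil:
			return resp, nil
		case errors.Is(err, ErrInstanceNotFound):
			return nil, status.Error(codes.NotFound, err.Error())
		default:
			return nil, status.Error(codes.Internal, err.Error())
		}
	}
}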
Our tests are currently (ab)using the file connector plugin to run pipeline tests. We should rather mock the connector instance to better isolate unit tests and either remove current tests or tag them as integration tests.
Figure out how we get documentation for plugins from repositories to the docs repository. (Sidenote: try to do it with pull, not push, then we can include 3rd party plugins in the future)
Initially, each service (pipeline, connector, processor service) had its own instance lock. The problem was twofold: the services locked the instances only for the duration of the operation inside the service, and they did not lock any related instances. Let's take the creation of a processor as an example: it requires us to get the pipeline, lock it in place so it's not modified in the meantime, create the processor, add it to the pipeline, commit the transaction to the DB and only then release the lock.
Now that we have the orchestration layer it should be the responsibility of this layer to lock any entities that will be modified in an operation before making changes. The goal is to make Conduit safe for concurrent use, i.e. multiple requests changing the same resource at the same time.
The goal is to make the Conduit CLI configurable. We should support the combination of parsing of flags, environment variables and/or a config file. We need to decide what things need to be configurable (e.g. log level, API ports, path to config file, path to pipeline configs) and the corresponding flag/env var/config field names. We already have some simple CLI flags, the goal of this issue is to implement it in a way that makes it easy to add new options in the future.
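One possible precedence scheme (flag over env var over default), sketched with only the standard library; the flag and env var names are illustrative:

package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	logLevel := flag.String("log.level", "info", "log level (debug, info, warn, error)")
	flag.Parse()

	// An explicit flag wins; otherwise fall back to the environment variable.
	if !isFlagSet("log.level") {
		if v := os.Getenv("CONDUIT_LOG_LEVEL"); v != "" {
			*logLevel = v
		}
	}
	fmt.Println("log level:", *logLevel)
}

// isFlagSet reports whether the named flag was passed on the command line.
func isFlagSet(name string) bool {
	set := false
	flag.Visit(func(f *flag.Flag) {
		if f.Name == name {
			set = true
		}
	})
	return set
}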
We should create technical documentation for plugins targeted at developers. This should contain information like how to create a new connector plugin, how to debug it, how will Conduit call plugin functions (call order guarantees), what errors should be returned, how should logging be handled etc.
Make sure the link in the readme points to the correct file after creating this doc.
We need to get our acceptance testing up to an acceptable level. We are looking for a particular percentage of code coverage.
The problem to solve is High Availability. Should one node go down, another will need to take its place. The goal is not disaster recovery. We will need to develop a separate solution for that.
More to come on this one.
Right now connector persister errors are only logged but not propagated back to the Conduit runtime. We need to make sure that an error in the connector persister is sent back to the Conduit runtime which then initiates a shutdown.
We want to change the format of the timestamp in our logs to match 2021-05-04T14:52:00+00:00.
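Assuming the logs come from zerolog, this is roughly a one-line change; the layout string follows Go's reference-time convention and, unlike time.RFC3339, always renders a numeric offset:

package main

import (
	"github.com/rs/zerolog"
	"github.com/rs/zerolog/log"
)

func main() {
	// Renders UTC as +00:00 rather than Z, matching 2021-05-04T14:52:00+00:00.
	zerolog.TimeFieldFormat = "2006-01-02T15:04:05-07:00"
	log.Info().Msg("timestamp now formatted")
}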
Currently, when sending messages in the Kafka sink connector, we use synchronous sends. While this is fine in the first version of the connector, we want to have asynchronous sends, which will increase performance.
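A minimal sketch of asynchronous sends, assuming the sink keeps using confluent-kafka-go (the client named elsewhere in this list); the topic name and error handling are illustrative:

package main

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
	if err != nil {
		log.Fatal(err)
	}
	defer p.Close()

	// Delivery reports arrive asynchronously on the Events channel, so
	// Produce never blocks waiting for an acknowledgment.
	go func() {
		for e := range p.Events() {
			if m, ok := e.(*kafka.Message); ok && m.TopicPartition.Error != nil {
				log.Printf("delivery failed: %v", m.TopicPartition.Error)
			}
		}
	}()

	topic := "example"
	if err := p.Produce(&kafka.Message{
		TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
		Value:          []byte("hello"),
	}, nil); err != nil {
		log.Printf("produce failed: %v", err)
	}

	p.Flush(5000) // wait up to 5s for outstanding delivery reports
}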
Once the connector SDK has been developed, we need to be able to help other developers search for and discover new connectors that have been created for Conduit.
Currently we have an endpoint for updating a connector config. We need to either allow updating the state through that endpoint or add a separate one specifically for updating the state (e.g. setting the position). The assignee should figure out which approach to take and implement it.
We need a Redshift source connector.
Checklist before releasing:
Checklist after repo is public:
Conduit needs to be able to use MySQL as a source.
The Kafka client we use now is confluent-kafka-go. We initially chose it because it's one of the most used Go clients for Kafka, it's quite configurable, and it has a simpler API (it's possible to read messages from a whole topic, whereas for most other clients messages need to be explicitly read from partitions).
However, because it has a dependency on CGo (under the hood it uses librdkafka), we couldn't find a way to build Conduit for a number of platforms and architectures (Windows and Darwin on AMD64 and ARM64).
We should add a section where we describe pitfalls, sharp edges, future work, and known limitations of Conduit to manage expectations. Conduit is still in its infancy and not really meant to be used in production yet; we owe it to our users to be open about this. At the same time, we can use this section to clarify that we know there is more work to do and we are on it.
This section should talk about Conduit and not plugins, as the limitations of plugins are listed in the readme of each plugin (we can mention this though).
Some things we already identified (not an exhaustive list):
A site for the project. Requirements include:
We may decide to include the documentation as part of the site itself. We would need to determine the information architecture.
The test for Postgres CDC behavior is not expecting the correct behavior:
- When calling Source.Read on an empty table that never had any rows in it, we should receive a recoverable error.
- The first call to Source.Read should be with the position nil.
- Each following call to Source.Read should be with the position of the record that was last returned.
- After a row is inserted, Source.Read should return that record.
- When calling Source.Read again, it should return a recoverable error, since no change happened in the database so no change should be detected.
- After another row is inserted, the Source.Read function should return another record.
It would be great to have tests for making sure we detect inserts, updates and deletes.
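A hedged sketch of how the first expectation could be asserted; the CDCSource interface and the recoverable-error sentinel are hypothetical stand-ins for the real plugin API:

package pg_test

import (
	"context"
	"errors"
	"testing"
)

// ErrRecoverable is a hypothetical sentinel for recoverable source errors.
var ErrRecoverable = errors.New("recoverable: no new changes")

// CDCSource is a hypothetical stand-in for the real source interface.
type CDCSource interface {
	Read(ctx context.Context, pos []byte) (record []byte, newPos []byte, err error)
}

// AssertEmptyTableRead checks that reading from a table that never had rows
// yields a recoverable error on the first (nil-position) call.
func AssertEmptyTableRead(t *testing.T, s CDCSource) {
	_, _, err := s.Read(context.Background(), nil)
	if !errors.Is(err, ErrRecoverable) {
		t.Errorf("expected recoverable error on empty table, got %v", err)
	}
}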
DeVaris commented in https://www.notion.so/Conduit-Testing-48d24d2c15b64b4a82e478a39a5d3527?d=cdfdea66aa0d445f801fd2d53849c545#fbbd714f66e94706b3dff24613108c05
If someone chooses a file connector, we should present a file picker.
We need to create end-to-end tests that treat Conduit as a black box, trigger requests to the API and make sure they produce the expected result. For now they should cover at least the main paths (e.g. creating a pipeline, starting it, stopping it, deleting it). These tests should be easy to run locally without setting up a specific environment (ideally with one make target), additionally we should trigger the tests in the CI as regression tests before merging any PR.
Create a service for managing plugins. For now this means that we need a list of pre-defined paths to plugins (built-in plugins), these plugins should be loaded up on startup to fetch their specifications and indexed in memory. When a new connector is created the plugin should be fetched by its name from this plugin manager, and not by the path to the plugin as is the case right now. In the future we can add the functionality for adding plugins on the fly, but this is not needed for now.
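A minimal sketch of the in-memory index, with a hypothetical fetchSpecification helper standing in for dispensing the plugin and reading its specification:

package plugin

import "fmt"

// Specification is a simplified stand-in for a plugin's self-description.
type Specification struct {
	Name    string
	Summary string
}

// Registry indexes built-in plugin specifications by name at startup so
// connectors can reference plugins by name instead of by binary path.
type Registry struct {
	paths map[string]string
	specs map[string]Specification
}

func NewRegistry(builtinPaths []string) (*Registry, error) {
	r := &Registry{paths: map[string]string{}, specs: map[string]Specification{}}
	for _, path := range builtinPaths {
		spec, err := fetchSpecification(path)
		if err != nil {
			return nil, fmt.Errorf("loading plugin at %s: %w", path, err)
		}
		r.paths[spec.Name] = path
		r.specs[spec.Name] = spec
	}
	return r, nil
}

// Get looks a plugin up by name, the way connector creation would.
func (r *Registry) Get(name string) (Specification, error) {
	spec, ok := r.specs[name]
	if !ok {
		return Specification{}, fmt.Errorf("plugin %q not found", name)
	}
	return spec, nil
}

// fetchSpecification is a hypothetical helper; the real implementation would
// start the plugin binary and request its specification.
func fetchSpecification(path string) (Specification, error) {
	return Specification{Name: path}, nil // stub
}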
This is a follow-up on ConduitIO/conduit-old#438.
Due to reasons mentioned in the PR, we currently do not have pre-built binaries of Conduit and plugins for Windows and also Macs with M1 chips.
We should investigate what our options are and what needs to be done so we get them as well.
We should create a transform that allows us to extract a value from a record key or payload and insert it into the record metadata. The original field should stay untouched. The metadata field name needs to be configurable, as well as the payload/key field.
Example: a metadata extractor configured to extract the value of field foo and insert it as metadata into field bar, applied to the following record:
metadata: {"baz":"irrelevant"}
payload: {"foo":"some value","bar":123}
Should produce:
metadata: {"bar":"some value","baz":"irrelevant"}
payload: {"foo":"some value","bar":123}
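A minimal sketch of the extractor itself, assuming JSON object payloads decoded into maps; the Record shape is a simplified stand-in for Conduit's record type:

package transform

// Record is a simplified stand-in for Conduit's record type.
type Record struct {
	Metadata map[string]string
	Payload  map[string]interface{}
}

// ExtractToMetadata copies Payload[srcField] into Metadata[dstField] and
// leaves the original payload field untouched.
func ExtractToMetadata(r Record, srcField, dstField string) Record {
	if v, ok := r.Payload[srcField]; ok {
		if s, ok := v.(string); ok {
			r.Metadata[dstField] = s
		}
	}
	return r
}

Applied with srcField "foo" and dstField "bar", this produces exactly the example output above.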
Support key columns of type bytea. The Postgres Source should handle this functionality in the same way it would handle any other key column.
The Postgres connector cannot handle this query correctly; instead of returning the correct result or a descriptive error, it returns an ErrEndData, which implies that the plugin operated correctly and is now at its end.
We have a health check endpoint that always returns a healthy status. We need to implement an actual health check that makes sure Conduit is running correctly. Right now the only thing to check is if the DB is working correctly (e.g. ping the DB). Once we tackle clustering this check should (probably) be improved to definitively indicate if a node in the cluster is healthy.
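A minimal sketch of a DB-backed check, assuming a database/sql-style handle; Conduit's actual store may expose a different ping mechanism:

package api

import (
	"context"
	"database/sql"
	"net/http"
	"time"
)

// HealthHandler reports healthy only if the DB answers a ping in time.
func HealthHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "unhealthy: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}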
Document the process of creating, testing, building and using a new plugin to aid developers with the creation of plugins.
Cast fields or the entire key or value to a specific type, e.g. to force an integer field to a smaller width.
More info on how kafka does it:
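A rough sketch of casting a single field, assuming JSON-decoded payloads where numbers arrive as float64; the configuration names are illustrative:

package transform

import "fmt"

// CastField converts the named payload field to the requested type.
func CastField(payload map[string]interface{}, field, toType string) error {
	v, ok := payload[field]
	if !ok {
		return fmt.Errorf("field %q not found", field)
	}
	switch toType {
	case "int32":
		f, ok := v.(float64)
		if !ok {
			return fmt.Errorf("field %q is not a number", field)
		}
		payload[field] = int32(f) // force the integer to a smaller width
	case "string":
		payload[field] = fmt.Sprintf("%v", v)
	default:
		return fmt.Errorf("unsupported target type %q", toType)
	}
	return nil
}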