wepay / kafka-connect-bigquery

DEPRECATED. PLEASE USE https://github.com/confluentinc/kafka-connect-bigquery. A Kafka Connect BigQuery sink connector

License: Apache License 2.0

Languages: Shell 5.13%, Java 94.23%, Dockerfile 0.64%

kafka-connect-bigquery's Introduction

REPOSITORY IS DEPRECATED. PLEASE USE https://github.com/confluentinc/kafka-connect-bigquery.

Kafka Connect BigQuery Connector


This is an implementation of a sink connector from Apache Kafka to Google BigQuery, built on top of Apache Kafka Connect. For a comprehensive list of configuration options, see the Connector Configuration Wiki.

Download

The latest releases are available from the GitHub releases tab, or as tarballs from Maven Central.

Standalone Quickstart

NOTE: You must have the Confluent Platform installed in order to run the example.

Configuration Basics

Firstly, you need to specify configuration settings for your connector. These can be found in the kcbq-connector/quickstart/properties/connector.properties file. Look for this section:

########################################### Fill me in! ###########################################
# The name of the BigQuery project to write to
project=
# The name of the BigQuery dataset to write to (leave the '.*=' at the beginning, enter your
# dataset after it)
datasets=.*=
# The location of a BigQuery service account or user JSON credentials file
# or service account credentials or user credentials in JSON format (non-escaped JSON blob)
keyfile=
# 'FILE' if keyfile is a credentials file, 'JSON' if it's a credentials JSON
keySource=FILE

You'll need to choose a BigQuery project to write to, a dataset within that project, and the location of a JSON key file for a service account with write access to that project/dataset pair. Once you've decided on these properties, fill them in and save the properties file.
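
For example, a filled-in section might look like this (the project name, dataset name, and key file path below are placeholders):

project=my-gcp-project
datasets=.*=kcbq_quickstart
keyfile=/path/to/service-account-key.json
keySource=FILE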

Once you get more familiar with the connector, you might want to revisit the connector.properties file and experiment with tweaking its settings.

Building and Extracting a Tarball

If you haven't already, move into the repository's top-level directory:

$ cd /path/to/kafka-connect-bigquery/

Begin by creating a tarball of the connector with the Confluent Schema Retriever included:

$ ./gradlew clean distTar

And then extract its contents:

$ mkdir -p bin/jar/ && tar -C bin/jar/ -xf kcbq-confluent/build/distributions/kcbq-confluent-*.tar

Setting Up Background Processes

Then move into the quickstart directory:

$ cd kcbq-connector/quickstart/

After that, if your Confluent Platform installation isn't in a sibling directory of the connector, specify its location (and do so in each terminal before starting the subsequent processes):

$ export CONFLUENT_DIR=/path/to/confluent

Then, initialize the background processes necessary for Kafka Connect (one terminal per script): (Taken from http://docs.confluent.io/3.0.0/quickstart.html)

$ ./zookeeper.sh

(wait a little while for it to get on its feet)

$ ./kafka.sh

(wait a little while for it to get on its feet)

$ ./schema-registry.sh

(wait a little while for it to get on its feet)

Initializing the Avro Console Producer

Next, initialize the Avro Console Producer (also in its own terminal):

$ ./avro-console-producer.sh

Give it some data to start off with (type directly into the Avro Console Producer instance):

{"f1":"Testing the Kafka-BigQuery Connector!"}

Running the Connector

Finally, initialize the BigQuery connector (also in its own terminal):

$ ./connector.sh

Piping Data Through the Connector

Now you can enter Avro messages of the schema {"f1": "$SOME_STRING"} into the Avro Console Producer instance, and the pipeline instance should write them to BigQuery.

If you want to get more adventurous, you can experiment with different schemas or topics by adjusting flags given to the Avro Console Producer and tweaking the config settings found in the kcbq-connector/quickstart/properties directory.

Integration Testing the Connector

NOTE: You must have Docker installed and running on your machine in order to run integration tests for the connector.

This all takes place in the kcbq-connector directory.

How Integration Testing Works

Integration tests run by creating Docker instances for Zookeeper, Kafka, Schema Registry, and the BigQuery Connector itself, then verifying the results using a JUnit test.

They use schemas and data that can be found in the test/docker/populate/test_schemas/ directory, and rely on a user-provided JSON key file (like in the quickstart example) to access BigQuery.

The project and dataset they write to, as well as the specific JSON key file they use, can be specified by command-line flag, environment variable, or configuration file — the exact details of each can be found by running the integration test script with the -? flag.

Data Corruption Concerns

In order to ensure the validity of each test, any table that will be written to in the course of integration testing is preemptively deleted before the connector is run. This will only be an issue if you have any tables in your dataset whose names begin with kcbq_test_ and match the sanitized name of any of the test_schema subdirectories. If that is the case, you should probably consider writing to a different project/dataset.

Because Kafka and Schema Registry are run in Docker, there is no risk that running integration tests will corrupt any existing data that is already on your machine, and there is also no need to free up any of your ports that might currently be in use by real instances of the programs that are faked in the process of testing.

Running the Integration Tests

Running the series of integration tests is easy:

$ test/integrationtest.sh

This assumes that the project, dataset, and key file have been specified by variable or configuration file. For more information on how to specify these, run the test script with the --help flag.

NOTE: You must have a recent version of boot2docker, Docker Machine, Docker, etc. installed. Older versions hang when cleaning up containers, and container linking doesn't work properly.

Adding New Integration Tests

Adding an integration test is a little more involved, and consists of two major steps: specifying Avro data to be sent to Kafka, and specifying via JUnit test how to verify that such data made it to BigQuery as expected.

To specify input data, create a new directory under test/resources/test_schemas/ whose name will be used both as the Kafka topic for your test and as the string from which the name of your test's BigQuery table is derived. Then, create two files in that directory:

  • schema.json will contain the Avro schema of the type of data the new test will send through the connector.

  • data.json will contain a series of JSON objects, each representing an Avro record that matches the specified schema. Each object must occupy exactly one line (a limitation of the Avro Console Producer that may be addressed in future commits); see the example below the list.
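
For example, for a hypothetical test directory (and topic) named example-test, the two files might look like this (the field names are made up, but note the row number field required by the note further down):

schema.json:

{
  "type": "record",
  "name": "exampleTestRecord",
  "fields": [
    { "name": "row", "type": "long" },
    { "name": "some_field", "type": "string" }
  ]
}

data.json:

{"row": 1, "some_field": "first value"}
{"row": 2, "some_field": "second value"}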

To specify data verification, add a new JUnit test to the file src/integration-test/java/com/wepay/kafka/connect/bigquery/it/BigQueryConnectorIntegrationTest.java. Rows retrieved from BigQuery in the test are returned only as Lists of Objects; their column names are not tracked. Construct a List of the Objects that you expect to be stored in the test's BigQuery table, retrieve the actual List of Objects stored via a call to readAllRows(), and then compare the two via a call to testRows().

NOTE: Because the order of rows is not guaranteed when reading test results from BigQuery, you must include a row number as the first field of any of your test schemas, and every row of test data must have a unique value for its row number (row numbers are one-indexed).
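
Putting the note above together with the earlier description, a verification method might look like the following sketch (the table name and the exact signatures of readAllRows() and testRows() are assumptions here; check the existing tests in the file for the real ones):

// Sketch of a verification method added to BigQueryConnectorIntegrationTest,
// matching the example-test schema/data shown earlier.
@Test
public void testExampleTest() {
  List<List<Object>> expectedRows = new ArrayList<>();
  // Row numbers are one-indexed and must be the first field of every row.
  expectedRows.add(Arrays.asList((Object) 1L, "first value"));
  expectedRows.add(Arrays.asList((Object) 2L, "second value"));
  testRows(expectedRows, readAllRows("kcbq_test_example_test"));
}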

kafka-connect-bigquery's People

Contributors

anderseriksson, apoorvmittal10, avocader, c0urante, criccomini, darshanmehta10, farmdawgnation, gkstechie, ilanjir, ivanyu, kenji-h, makearl, mkubala, mosheblumbergx, mtagle, nicolasguyomar, odracci, raisejohnbarry, skyzyx, smithakoduri, stoynov96, whynick1, wicknicks, wrp, xuan616


kafka-connect-bigquery's Issues

Add metrics hooks

(Migrated from internal Jira issue DI-408)

We should add metrics information to our BQ connector to track the number of rows read/written, table latency, etc. We should also hook this up to Graphite. We could probably leverage Kafka's metrics library for this.
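
As a rough sketch of what leveraging Kafka's metrics library could look like (the metric and group names here are made up, and this is not code from the connector):

import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;
import org.apache.kafka.common.metrics.stats.Total;

public class KcbqMetricsSketch {
  private final Metrics metrics = new Metrics();
  private final Sensor rowsWritten = metrics.sensor("rows-written");

  public KcbqMetricsSketch() {
    // Track both the cumulative number of rows written and the write rate.
    rowsWritten.add(metrics.metricName("rows-written-total", "kcbq"), new Total());
    rowsWritten.add(metrics.metricName("rows-written-rate", "kcbq"), new Rate());
  }

  // Call after each successful BigQuery insert; a Graphite reporter could then export these.
  public void recordRowsWritten(int numRows) {
    rowsWritten.record(numRows);
  }
}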

Deal with partial BigQuery failures more elegantly

BigQuery write requests can partially fail. Some rows for a request may be successfully written, and others will not be. If this is the case, the whole flush will be considered a failure and we'll end up with a stack trace like this:

Caused by: java.util.concurrent.ExecutionException: com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: table insertion failed for the following rows:
    [row index 3000]: backendError: null
    [row index 3001]: backendError: null
    [row index 3002]: backendError: null

Since the whole flush is considered a failure, Kafka Connect will rebalance and end up re-writing all the rows to BQ. This results in duplicated rows.

This is not a huge issue (BQ views can be written to deduplicate the rows), but, if possible, it would be nice to take advantage of the fact that some rows were successfully written and only attempt to write the unsuccessful rows.
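
A sketch of what retrying only the failed rows could look like, assuming the newer google-cloud-bigquery client where InsertAllResponse exposes per-row errors via getInsertErrors() (this is not the connector's current code):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllRequest.RowToInsert;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PartialFailureRetrySketch {
  // Writes rows and returns only the rows BigQuery reported as failed, for a follow-up retry.
  public static List<RowToInsert> writeAndCollectFailures(
      BigQuery bigQuery, TableId table, List<RowToInsert> rows) {
    InsertAllResponse response =
        bigQuery.insertAll(InsertAllRequest.newBuilder(table).setRows(rows).build());
    if (!response.hasErrors()) {
      return Collections.emptyList();
    }
    // getInsertErrors() maps the index of each failed row to its errors,
    // so every row not present in the map was written successfully.
    List<RowToInsert> failedRows = new ArrayList<>();
    for (Long failedIndex : response.getInsertErrors().keySet()) {
      failedRows.add(rows.get(failedIndex.intValue()));
    }
    return failedRows;
  }
}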

Thread Leak in BigQuerySinkTask.stop()

For operational simplicity we create one sink per topic, with a lot of topics. What we observed is that the Kafka workers regularly run out of threads (> 32K threads), which in turn makes the system itself inaccessible.

In BigQuerySinkTask.stop() the executor.shutdown() method should be called.

This is really bad, since it means that connector restarts and every rebalance increase the thread count.

We fixed this locally and verified that this indeed alleviates the problem.

A general question though: why use a thread pool for every task at all? Are other connectors doing this as well? Why not just increase the tasks.max config and let Kafka Connect handle it?
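
The local fix mentioned above presumably amounts to something like this sketch (it assumes the task keeps its thread pool in an executor field):

@Override
public void stop() {
  // Release the per-task thread pool so threads don't leak across connector restarts and rebalances.
  if (executor != null) {
    executor.shutdown();
  }
}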

More logging when pausing partitions

When KCBQ pauses partitions because bufferSize is exceeded, you see:

Pausing all partitions for topic: foo

We never log when it's unpaused. We also don't log which partitions are being paused. This makes the logs look funny, since you just see a bunch of pausing messages over and over again.

Other asks:

  • Print the buffer size for the topic/table in question, at the time of pausing (since it can exceed bufferSize).
  • Log how long the topic was both paused and unpaused for.

Something like this would be really useful:

Pausing partitions for topic with buffer size 103434 after 17023ms: [foo-0, foo-4, foo-10]
Unpausing partitions for topic after 18304ms: [foo-0, foo-4, foo-10]

Implement exponential backoff while updating schemas

(Migrated from internal Jira issue DI-416)

When dynamic schema updates are enabled and a dynamic update has to happen, the current flow of events is:

  1. Send write request.
  2. Inspect the response and see that errors are present and that all are due to invalid fields (if no errors are present, return; if other kinds of errors are present, throw an exception).
  3. Grab schema for topic from schema registry, convert to BigQuery format, and send table update request to BigQuery.
  4. Go back to step 1.

Since table updates can take as long as two minutes, a single update attempt is not guaranteed to help. However, we should probably have some kind of exponential backoff (with a maximum number of attempts) in the loop so that we don't waste resources constantly sending update requests to BigQuery. It's also probably not necessary to send more than one update request.
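
A generic sketch of such a bounded backoff loop (not the connector's code): the write-and-check-errors step would be the Callable, with the single table update request issued before entering the loop.

import java.util.concurrent.Callable;

/** Generic sketch: retry an action with exponentially increasing waits, up to a maximum. */
public final class ExponentialBackoff {

  public static <T> T retry(Callable<T> action, int maxAttempts, long initialWaitMs)
      throws Exception {
    long waitMs = initialWaitMs;
    Exception lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (Exception failure) {
        lastFailure = failure;
        if (attempt < maxAttempts) {
          Thread.sleep(waitMs);
          waitMs *= 2;   // double the wait before the next attempt
        }
      }
    }
    throw lastFailure;
  }
}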

Error 400 while loading data to Google Bigquery by SingleBatchWriter

Dear wepay Team,

I am trying to test kafka-connect-bigquery with Confluent to load data into Google BigQuery.
However, after starting up and automatically creating the table, it hits an issue:

Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "The destination table's partition demo_comment_topic$2018427 is outside the allowed bounds. You can only stream to partitions within 0 days in the past and 0 days in the future relative to the current date.",
    "reason" : "invalid"
  } ],
  "message" : "The destination table's partition demo_comment_topic$2018427 is outside the allowed bounds. You can only stream to partitions within 0 days in the past and 0 days in the future relative to the current date.",
  "status" : "INVALID_ARGUMENT"
}
        at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
        at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
        at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
        at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1065)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
        at com.google.cloud.bigquery.spi.DefaultBigQueryRpc.insertAll(DefaultBigQueryRpc.java:283)
        ... 10 more

As far as I know, kafka-connect-bigquery is using an outdated Google Cloud API from 2016 (gcloud-java:0.2.7) rather than the current library (google-cloud:0.45.0-alpha).
Could this be the root cause of the issue?
Please help me review this issue!

Thanks,
Tung Nguyen

connector.sh throws "Failed to find any class that implements Connector"

When running connector.sh, it stops with the following error:

ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone:108) java.util.concurrent.ExecutionException: org.apache.kafka.connect.errors.ConnectException: Failed to find any class that implements Connector and which name matches com.wepay.kafka.connect.bigquery.BigQuerySinkConnector, available connectors are: PluginDesc{klass=class org.apache.kafka.connect.file.FileStreamSinkConnector, name='org.apache.kafka.connect.file.FileStreamSinkConnector', version='1.0.0-cp1', encodedVersion=1.0.0-cp1, type=sink, typeName='sink', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.file.FileStreamSourceConnector, name='org.apache.kafka.connect.file.FileStreamSourceConnector', version='1.0.0-cp1', encodedVersion=1.0.0-cp1, type=source, typeName='source', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.tools.MockConnector, name='org.apache.kafka.connect.tools.MockConnector', version='1.0.0-cp1', encodedVersion=1.0.0-cp1, type=connector, typeName='connector', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.tools.MockSinkConnector, name='org.apache.kafka.connect.tools.MockSinkConnector', version='1.0.0-cp1', encodedVersion=1.0.0-cp1, type=sink, typeName='sink', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.tools.MockSourceConnector, name='org.apache.kafka.connect.tools.MockSourceConnector', version='1.0.0-cp1', encodedVersion=1.0.0-cp1, type=source, typeName='source', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.tools.SchemaSourceConnector, name='org.apache.kafka.connect.tools.SchemaSourceConnector', version='1.0.0-cp1', encodedVersion=1.0.0-cp1, type=source, typeName='source', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.tools.VerifiableSinkConnector, name='org.apache.kafka.connect.tools.VerifiableSinkConnector', version='1.0.0-cp1', encodedVersion=1.0.0-cp1, type=source, typeName='source', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.tools.VerifiableSourceConnector, name='org.apache.kafka.connect.tools.VerifiableSourceConnector', version='1.0.0-cp1', encodedVersion=1.0.0-cp1, type=source, typeName='source', location='classpath'} at org.apache.kafka.connect.util.ConvertingFutureCallback.result(ConvertingFutureCallback.java:79) at org.apache.kafka.connect.util.ConvertingFutureCallback.get(ConvertingFutureCallback.java:66) at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:105)

I've installed Confluent Kafka V4.0 to /usr/bin

I've added kafka-connect-bigquery jars to /usr/share/java/kafka-connect-bigquery/ and added this directory to variables CLASSPATH and CONNECT_PLUGIN_PATH.

As I'm new to Kafka (Connect) I'm afraid I'm overlooking something simple..

Wildcard topic

For now we provide topic names in a properties file as follows:

topics=kcbq-quickstart,app.click

If there are only a few topics to deal with, it's OK to list them all in a comma-separated way.
But when you have to deal with hundreds or thousands of topics, that doesn't help. In those situations, we would like to use a wildcard expression to specify topics.

How about implementing a method in the SchemaRegistrySchemaRetriever class to retrieve all the topic names matching a given wildcard expression?

Trigger flushes on put

KCBQ currently waits until flush() is called before it starts sending messages to BigQuery. This happens every 30s by default. The flush() method then sends the messages to BigQuery in a series of threads and waits for all of them to respond. This sometimes takes 30s (or longer) when there are a large number of messages in the buffer.

I believe that adding logic in the put() method to more aggressively begin sending messages to BigQuery BEFORE flush() has been called could significantly speed up writes, since we won't be sitting idle for 30s before doing the write. If we have a 30s flush interval, and it takes 30s to write messages to BigQuery, flushing during put could increase performance by as much as 2x, since the two 30s intervals would overlap.

flush() should just flush any outstanding data in the buffer, and then sync on all futures (including those that had been invoked during put() methods).

Some thought should be put into how this will impact the adaptive batch sizes.

Connector fails to start due to CLASSPATH issues on current master [MACOS]

Hi,

When using the current master with the instructions from the readme, the connector does not start on MacOs. It seems to be specifically due to the way that

mkdir bin/jar/ && tar -C bin/jar/ -xf kcbq-confluent/build/distributions/kcbq-confluent-*.tar

extracts the content into bin/jar/kcbq-confluent-1.1.0-SNAPSHOT/ instead of into bin/jar/ directly. The exception thrown is:

[2018-09-28 12:21:32,123] ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone:100)
java.util.concurrent.ExecutionException: org.apache.kafka.connect.errors.ConnectException: Failed to find any class that implements Connector and which name matches com.wepay.kafka.connect.bigquery.BigQuerySinkConnector available connectors are: io.confluent.connect.hdfs.HdfsSinkConnector, org.apache.kafka.connect.source.SourceConnector, org.apache.kafka.connect.tools.VerifiableSourceConnector, org.apache.kafka.connect.file.FileStreamSinkConnector, org.apache.kafka.connect.file.FileStreamSourceConnector, org.apache.kafka.connect.tools.VerifiableSinkConnector, io.confluent.connect.hdfs.tools.SchemaSourceConnector, org.apache.kafka.connect.sink.SinkConnector, io.confluent.connect.jdbc.JdbcSourceConnector
	at org.apache.kafka.connect.util.ConvertingFutureCallback.result(ConvertingFutureCallback.java:80)
	at org.apache.kafka.connect.util.ConvertingFutureCallback.get(ConvertingFutureCallback.java:67)
	at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:97)
Caused by: org.apache.kafka.connect.errors.ConnectException: Failed to find any class that implements Connector and which name matches com.wepay.kafka.connect.bigquery.BigQuerySinkConnector available connectors are: io.confluent.connect.hdfs.HdfsSinkConnector, org.apache.kafka.connect.source.SourceConnector, org.apache.kafka.connect.tools.VerifiableSourceConnector, org.apache.kafka.connect.file.FileStreamSinkConnector, org.apache.kafka.connect.file.FileStreamSourceConnector, org.apache.kafka.connect.tools.VerifiableSinkConnector, io.confluent.connect.hdfs.tools.SchemaSourceConnector, org.apache.kafka.connect.sink.SinkConnector, io.confluent.connect.jdbc.JdbcSourceConnector
	at org.apache.kafka.connect.runtime.Worker.getConnectorClass(Worker.java:226)
	at org.apache.kafka.connect.runtime.Worker.startConnector(Worker.java:166)
	at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.startConnector(StandaloneHerder.java:250)
	at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.putConnectorConfig(StandaloneHerder.java:164)
	at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:94)

When I add that specific directory to my classpath manually (export CLASSPATH="$CLASSPATH:$BASE_DIR/../../bin/jar/kcbq-confluent-1.1.0-SNAPSHOT/*"), the connector continues, but fails with a different error:

Caused by: org.apache.kafka.common.config.ConfigException: Invalid value com.wepay.kafka.connect.bigquery.schemaregistry.schemaretriever.SchemaRegistrySchemaRetriever for configuration schemaRetriever: Class com.wepay.kafka.connect.bigquery.schemaregistry.schemaretriever.SchemaRegistrySchemaRetriever could not be found.

Is the above a known issue, or am I doing something wrong? Thanks in advance!

Upgrade Confluent platform to 3.1.1 release

Good news everyone! Confluent 3.1.1 is out!

After bumping the version, we should also include a new integration test to make sure that identical schemas with different names no longer cause issues when using the Confluent Avro Converter. However, Confluent hasn't yet released a new Docker image for the 3.1.1 release of Schema Registry (issue created here), so no integration tests (new or existing) can be run with the new version of the platform just yet. We should probably block for now on them releasing updated Docker images before updating to 3.1.1 so that we can have some kind of reliable sanity check to make sure that everything works as expected.

BigQuery write error invalid partition decorator `$YYYYMDD` instead of `$YYYYMMDD`

I've been having this problem with my connector:

[2018-02-26 11:34:16,607] INFO Decreased batch size to 1 for {datasetId=myproj_staging, tableId=dbz_myproj_mycompany_public_skills$2018226} (com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter:277)
[2018-02-26 11:34:16,791] ERROR WorkerSinkTask{id=kcbq-myproj-mycompany-0} Offset commit failed, rewinding to last committed offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:351)
com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: Exception occurred while executing write threads
	at com.wepay.kafka.connect.bigquery.BigQuerySinkTask.flush(BigQuerySinkTask.java:238)
	at org.apache.kafka.connect.sink.SinkTask.preCommit(SinkTask.java:117)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:345)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:166)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: Attempted to decrease batchSize below 1
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at com.wepay.kafka.connect.bigquery.BigQuerySinkTask.flush(BigQuerySinkTask.java:233)
	... 11 more
Caused by: com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: Attempted to decrease batchSize below 1
	at com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter.decreaseBatchSize(DynamicBatchWriter.java:274)
	at com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter.seekingWriteAll(DynamicBatchWriter.java:152)
	at com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter.writeAll(DynamicBatchWriter.java:101)
	at com.wepay.kafka.connect.bigquery.BigQuerySinkTask$TableWriter.call(BigQuerySinkTask.java:154)
	at com.wepay.kafka.connect.bigquery.BigQuerySinkTask$TableWriter.call(BigQuerySinkTask.java:131)
	... 4 more

My configuration is

    {
      "name": "kcbq-myproj-mycompany",
      "config": {
        "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
        "tasks.max": "1",
        "topics": "dbz_myproj_mycompany.public.skills",
        "autoCreateTables": "true",
        "autoUpdateSchemas": "true",
        "sanitizeTopics": "true",
        "schemaRetriever": "com.wepay.kafka.connect.bigquery.schemaregistry.schemaretriever.SchemaRegistrySchemaRetriever",
        "schemaRegistryLocation": "http://mc-schema-registry:80",
        "bufferSize": "100000",
        "maxWriteSize": "10000",
        "tableWriteWait": "1000",
        "project": "mycompany-platform",
        "datasets": ".*=myproj_staging",
        "keyfile": "/etc/gcloud/creds/mc_kafka.json"
      }
    }

I think it has something to do with the tableId it's trying to write to: tableId=dbz_myproj_mycompany_public_skills$2018226. The partition decorator looks like it's missing a zero; it should be 20180226 instead of 2018226.
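
For reference, the decorator should come from a zero-padded date format; the difference is easy to reproduce in plain Java (this is just an illustration, not the connector's code):

import java.text.SimpleDateFormat;
import java.util.Calendar;

public class PartitionDecoratorExample {
  public static void main(String[] args) {
    Calendar date = Calendar.getInstance();
    date.set(2018, Calendar.FEBRUARY, 26);

    // Zero-padded, as BigQuery expects: table$20180226
    String padded = new SimpleDateFormat("yyyyMMdd").format(date.getTime());

    // Concatenating calendar fields drops the padding: table$2018226
    String unpadded = "" + date.get(Calendar.YEAR)
        + (date.get(Calendar.MONTH) + 1)
        + date.get(Calendar.DAY_OF_MONTH);

    System.out.println("table$" + padded + " vs table$" + unpadded);
  }
}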

Handle null values

Kafka log compacted topics can receive null value messages, which signify deletes for a given key. The connector should handle these messages in some way. Either ignore them or something.
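
Ignoring them, for example, could be as simple as filtering the batch before buffering it, as in this sketch (not the connector's current behavior):

import java.util.List;
import java.util.stream.Collectors;
import org.apache.kafka.connect.sink.SinkRecord;

public class TombstoneFilterSketch {
  // Drops tombstone records (null value) from a batch, e.g. before buffering in put().
  public static List<SinkRecord> dropTombstones(List<SinkRecord> records) {
    return records.stream()
        .filter(record -> record.value() != null)   // a null value marks a delete on a compacted topic
        .collect(Collectors.toList());
  }
}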

Handle nested array of structs in record values

A record with the following value will fail to write to BigQuery.

{
    'name': 'foo',
    'foos': [
        {'foo_id': 1},
        {'foo_id': 2}
    ]
}

The current resulting table will look like:

name | foos.foo_id
foo  | null
     | null

Instead, it should look like:

name | foos.foo_id
foo  | 1
     | 2

I believe this has to do with the BigQueryRecordConverter not converting each record in the field value's array.

Big Query Connect Exception ..... not able to write data to BigQuery

I am getting this error when I send the updated Avro schema to the BigQuery connector:

[2018-07-02 13:23:45,421] ERROR Task failed with com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException error: Failed to write rows after BQ schema update within 5 attempts for: {datasetId=intern_practice, tableId=full_public_cons} (com.wepay.kafka.connect.bigquery.write.batch.KCBQThreadPoolExecutor:66)
Exception in thread "pool-5-thread-3" com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: Failed to write rows after BQ schema update within 5 attempts for: {datasetId=intern_practice, tableId=full_public_cons}
	at com.wepay.kafka.connect.bigquery.write.row.AdaptiveBigQueryWriter.performWriteRequest(AdaptiveBigQueryWriter.java:120)
	at com.wepay.kafka.connect.bigquery.write.row.BigQueryWriter.writeRows(BigQueryWriter.java:117)
	at com.wepay.kafka.connect.bigquery.write.batch.TableWriter.run(TableWriter.java:77)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Changing avro schema in schemaRetriever

I am accessing Postgres data in Kafka Connect and I only want to send part of the data to BigQuery. I want to know where your code takes data from the Kafka topic and gives the schema to the sink connector. Can someone tell me which files I can modify to change the actual schema and send the updated schema to Google BigQuery?

Current HEAD is broken

Commit #107 broke the build due to backwards-incompatible changes in google-cloud between 0.10.0-alpha and 0.25.0-alpha. We are aware of the issue and are planning to fix forward in the next few days.

Add space in autoUpdateSchemas log line

Log line currently says:

WARN You may want to enable auto schema updates by specifyingautoUpdateSchemas=true in the properties file

Should say:

WARN You may want to enable auto schema updates by specifying autoUpdateSchemas=true in the properties file

running in distributed mode and cp 4.0.0

Does anybody have this running successfully with a distributed connect worker?

I have a standalone worker successfully loading data into my BQ table. When I try to do the same with a distributed worker I get the following exception:

ERROR WorkerSinkTask{id=...} Commit of offsets threw an unexpected exception for sequence number 1: null (org.apache.kafka.connect.runtime.WorkerSinkTask:233)
java.lang.ClassCastException: org.apache.kafka.clients.consumer.OffsetAndMetadata cannot be cast to org.apache.kafka.clients.consumer.OffsetAndMetadata
at com.wepay.kafka.connect.bigquery.BigQuerySinkTask.updateOffsets(BigQuerySinkTask.java:112)
at com.wepay.kafka.connect.bigquery.BigQuerySinkTask.flush(BigQuerySinkTask.java:101)
at org.apache.kafka.connect.sink.SinkTask.preCommit(SinkTask.java:117)

This seems like some sort of version mismatch issue, though I have tried it with confluent platform 3.0.0 as referenced in the docs, as well as the newer 4.0.0-3.

Any pointers on getting this running in distributed mode or dealing with these versioning issues in general?

Remove hard dependency on Schema Registry

Automatic table creation and schema updates are currently impossible without Schema Registry integration. We should find a way to allow for this functionality by leveraging the Kafka Connect Schema class's version field instead of just relying on being able to fetch the latest schema for a topic from Schema Registry.

The only complication with this new approach would be storing the version of the schema that is used for the current BigQuery table. Two approaches that have been proposed would be storing the version as a field in the table itself (which would potentially take up a lot of extra space) or storing the version as part of the BigQuery table's description (which would be pretty frail, since there's no guarantee the description wouldn't already be in use for other purposes and/or changed accidentally over the lifetime of the table).

If it's possible, we'd like to find a better approach than either of those two; however, if not, we should still go ahead with one of the two, since a slight complication in functionality for people using Schema Registry is worth a huge gain in functionality for people who aren't.

Exception in thread "main" java.lang.NoClassDefFoundError: com/google/cloud/bigquery/BigQuery

Hello,

I am trying to run kafka-connect-bigquery and getting the following error:
[2017-10-27 20:07:07,792] INFO Loading plugin from: /usr/share/java/kafka-connect-bigquery/kcbq-connector-1.0.0-SNAPSHOT.jar (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:176)
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/cloud/bigquery/BigQuery
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
at java.lang.Class.getConstructor0(Class.java:3075)
at java.lang.Class.newInstance(Class.java:412)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.getPluginDesc(DelegatingClassLoader.java:242)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanPluginPath(DelegatingClassLoader.java:223)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanUrlsAndAddPlugins(DelegatingClassLoader.java:198)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.registerPlugin(DelegatingClassLoader.java:190)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initLoaders(DelegatingClassLoader.java:150)
at org.apache.kafka.connect.runtime.isolation.Plugins.(Plugins.java:47)
at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:63)
Caused by: java.lang.ClassNotFoundException: com.google.cloud.bigquery.BigQuery
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.kafka.connect.runtime.isolation.PluginClassLoader.loadClass(PluginClassLoader.java:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 11 more

I am using the following Dockerfile to build the image. I have also tried including the tarball directly instead of building the code, but I get the same error.

FROM solsson/kafka-connect-jdbc@sha256:a6108f094eaef52469c8ca7d3e60b2523cadd6f5283f481c96308033b26fc92e
WORKDIR /usr/src/app

RUN ["mkdir", "app"]

RUN ["mkdir", "/usr/logs"]
RUN ["chmod", "a+rwx", "/usr/logs"]

RUN ["mkdir", "/usr/share/java/kafka-connect-bigquery"]
RUN ["chmod", "a+rwx", "/usr/share/java/kafka-connect-bigquery"]

COPY . .
RUN ["./gradlew", "clean", "confluentTarBall"]
RUN ["tar", "-C", "/usr/share/java/kafka-connect-bigquery", "-xf", "/usr/src/app/bin/tar/kcbq-connector-1.0.0-SNAPSHOT-confluent-dist.tar"]

RUN ["rm", "-r", "app"]

WORKDIR /opt/kafka

Any ideas on what may be wrong? I tried googling for the error, but didn't find much.

./connector.sh giving a NoSuchMethodError

On running the ./connector.sh command, I get this error:

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.collect.Sets$SetView.iterator()Lcom/google/common/collect/UnmodifiableIterator;
        at org.reflections.Reflections.expandSuperTypes(Reflections.java:380)
        at org.reflections.Reflections.<init>(Reflections.java:126)
        at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanPluginPath(DelegatingClassLoader.java:258)
        at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanUrlsAndAddPlugins(DelegatingClassLoader.java:201)
        at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initLoaders(DelegatingClassLoader.java:162)
        at org.apache.kafka.connect.runtime.isolation.Plugins.<init>(Plugins.java:47)
        at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:75)

Question re: HttpClient

The httpclient that gets compiled here is pretty old (httpclient-4.0.1.jar); the current release is httpclient-4.5.3.jar. This normally wouldn't be a big issue, but Kafka Connect doesn't isolate classpaths until 3.3, so Confluent 3.2.1/3.2.2 has trouble running the S3 connector and the BigQuery connector simultaneously.

I was considering a PR to bring the Gradle build's HttpClient version up to date, but was wondering about your thoughts on this.

Add insert time option

We currently have a kafkaData RECORD that contains topic, partition, and offset of the row that's inserted. BigQuery does not include an insert time for rows (nearest is _PARTITIONTIME, which is accurate to the day, and only for date partitioned tables). We recently hit an issue where it would have been useful to sort records into BigQuery by their insert time.

I propose adding a kcbqData RECORD (NULLABLE) with a single field containing the time the row is inserted (System.currentTimeMillis()).
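
Concretely, the proposal amounts to attaching something like this to each converted row before it's written (the field name insertTime is just a suggestion, and this is only a sketch):

import java.util.HashMap;
import java.util.Map;

public class InsertTimeSketch {
  // Adds a nested kcbqData record holding the insert timestamp to an already-converted row.
  public static Map<String, Object> withInsertTime(Map<String, Object> convertedRow) {
    Map<String, Object> kcbqData = new HashMap<>();
    kcbqData.put("insertTime", System.currentTimeMillis());   // hypothetical field name
    convertedRow.put("kcbqData", kcbqData);
    return convertedRow;
  }
}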

Support BigQuery partitioning in Kafka connector

BigQuery supports YYYYMMDD partitioning tables. This feature is documented here. This will be a required feature when we start loading tracking data from Kafka into BigQuery, as we'll want to time-partition it, and implement a retention policy.

It seems like we might want this always enabled, simply by creating all tables with date partitions in the connector. We'll probably also want a regex-based config to be able to define if a field in the incoming message should be used as the partition date (e.g. a 'produced_ts' field).

Reconcile bufferSize and maxWriteSize in KC BQ

(Migrated from internal Jira issue DI-447)

The bufferSize config is currently per-task. The maxWriteSize config is currently per-worker. If I have 10 tasks (tasks.max=10) and:

bufferSize=100000
maxWriteSize=50000

I get writes to BQ that are partitioned into chunks of 5000, but each task's buffer has 100000+ elements in it. This is really confusing. The problem is in BigQuerySinkConnector:

int maxWritePerTask = config.getInt(config.MAX_WRITE_CONFIG) / maxTasks;

My vote would be to have both per-task (get rid of the above code), and change the config to include a task prefix (taskBufferSize, and taskMaxWriteSize).

Regex table names

I'd like the ability to rename tables based on a regex in KCBQ.

We have topics in this format:

db.<connector>.<cluster>.<database>.<table>

These topics are being fed data from Kafka Connect sources. We're loading this data into BigQuery. When doing so, the dataset is:

db.<connector>.<cluster>.<database>

But the tables loaded inside the dataset are still:

db.<connector>.<cluster>.<database>.<table>

I'd prefer if they were just <table>. Being able to define a regex in config that maps topics in a dataset to table names would be helpful.
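
As an illustration of the kind of mapping being asked for, a pattern for the topic format above could capture the trailing segment like this (an example only, not a proposed config syntax):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopicToTableExample {
  // Capture the trailing <table> segment of db.<connector>.<cluster>.<database>.<table>.
  private static final Pattern TOPIC_PATTERN =
      Pattern.compile("^db\\.[^.]+\\.[^.]+\\.[^.]+\\.([^.]+)$");

  public static String tableFor(String topic) {
    Matcher m = TOPIC_PATTERN.matcher(topic);
    return m.matches() ? m.group(1) : topic;   // fall back to the full topic name
  }

  public static void main(String[] args) {
    System.out.println(tableFor("db.debezium.cluster1.mydb.users"));   // prints "users"
  }
}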

Version should not be set in allprojects closure

Right now, the version number for the project is set via the allprojects closure in the build.gradle file. This should change, preferably to being set in the gradle.properties file; however, doing so generates a Javadocs error when I try to build. Someone with Gradle experience should probably take a look and see if they can figure out what's going on and/or a more idiomatic approach to storing version information in the project.

de-duplication support

BigQuery supports de-duplication. Does kafka-connect-bigquery support this feature?

https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency

To help ensure data consistency, you can supply insertId for each inserted row.
BigQuery remembers this ID for at least one minute. If you try to stream the same
set of rows within that time period and the insertId property is set, BigQuery uses
the insertId property to de-duplicate your data on a best effort basis.

You might have to retry an insert because there's no way to determine the state
of a streaming insert under certain error conditions, such as network errors between
your system and BigQuery or internal errors within BigQuery. If you retry an insert,
use the same insertId for the same set of rows so that BigQuery can attempt to
de-duplicate your data. For more information, see troubleshooting streaming inserts.

In the rare instance of a Google datacenter losing connectivity unexpectedly, 
automatic deduplication may not be possible.

https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaminginsertexamples

Here is sample code for the de-duplication feature using the BigQuery client library. Just specify insertId on the addRow method. I can not find it in the wiki nor in the code.

InsertAllResponse response =
    bigquery.insertAll(
        InsertAllRequest.newBuilder(tableId)
            .addRow("rowId", rowContent)
            // More rows can be added in the same RPC by invoking .addRow() on the builder
            .build());

Class com.wepay.kafka.connect.bigquery.schemaregistry.schemaretriever.SchemaRegistrySchemaRetriever could not be found.

Hi there,

We're experiencing an issue with the distTar Gradle command. It seems that running ./gradlew clean distTar does create the tar with all required jars, but is missing the kcbq-confluent jar (which should include the SchemaRegistrySchemaRetriever?). When I run the connector, the following exception is thrown:


ERROR Error while starting connector bigquery-connector (org.apache.kafka.connect.runtime.WorkerConnector:109)
com.wepay.kafka.connect.bigquery.exception.SinkConfigConnectException: Couldn't start BigQuerySinkConnector due to configuration error
	at com.wepay.kafka.connect.bigquery.BigQuerySinkConnector.start(BigQuerySinkConnector.java:152)
	at org.apache.kafka.connect.runtime.WorkerConnector.doStart(WorkerConnector.java:101)
	at org.apache.kafka.connect.runtime.WorkerConnector.start(WorkerConnector.java:126)
	at org.apache.kafka.connect.runtime.WorkerConnector.transitionTo(WorkerConnector.java:183)
	at org.apache.kafka.connect.runtime.Worker.startConnector(Worker.java:178)
	at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.startConnector(StandaloneHerder.java:250)
	at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.putConnectorConfig(StandaloneHerder.java:164)
	at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:94)
Caused by: org.apache.kafka.common.config.ConfigException: Invalid value com.wepay.kafka.connect.bigquery.schemaregistry.schemaretriever.SchemaRegistrySchemaRetriever for configuration schemaRetriever: Class com.wepay.kafka.connect.bigquery.schemaregistry.schemaretriever.SchemaRegistrySchemaRetriever could not be found.
	at org.apache.kafka.common.config.ConfigDef.parseType(ConfigDef.java:724)
	at org.apache.kafka.common.config.ConfigDef.parseValue(ConfigDef.java:469)
	at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:462)
	at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:62)
	at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:75)
	at com.wepay.kafka.connect.bigquery.config.BigQuerySinkConfig.<init>(BigQuerySinkConfig.java:517)
	at com.wepay.kafka.connect.bigquery.config.BigQuerySinkConnectorConfig.<init>(BigQuerySinkConnectorConfig.java:79)
	at com.wepay.kafka.connect.bigquery.BigQuerySinkConnector.start(BigQuerySinkConnector.java:150)
	... 7 more

Any ideas?

Thanks!
-patrick

quota exceeded retries may not be working as expected

I recently saw this error while testing something locally:

[2016-11-10 13:29:42,733] ERROR Commit of WorkerSinkTask{id=bootstrap-test-03-3} offsets threw an unexpected exception:  (org.apache.kafka.connect.runtime.WorkerSinkTask:180)
com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: Exception occurred while executing write threads
	at com.wepay.kafka.connect.bigquery.BigQuerySinkTask.flush(BigQuerySinkTask.java:234)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:275)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:155)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:142)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: Failed to write to BigQuery table {datasetId=u_moirat, tableId=pocdb_debezium_apilogger_log_api_log$20161110}
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at com.wepay.kafka.connect.bigquery.BigQuerySinkTask.flush(BigQuerySinkTask.java:229)
	... 10 more
Caused by: com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: Failed to write to BigQuery table {datasetId=u_moirat, tableId=pocdb_debezium_apilogger_log_api_log$20161110}
	at com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter.seekingWriteAll(DynamicBatchWriter.java:165)
	at com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter.writeAll(DynamicBatchWriter.java:101)
	at com.wepay.kafka.connect.bigquery.BigQuerySinkTask$TableWriter.call(BigQuerySinkTask.java:155)
	at com.wepay.kafka.connect.bigquery.BigQuerySinkTask$TableWriter.call(BigQuerySinkTask.java:132)
	... 4 more
Caused by: com.google.cloud.bigquery.BigQueryException: Exceeded rate limits: Your table exceeded quota for table.insert or table.update per table. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
	at com.google.cloud.bigquery.spi.DefaultBigQueryRpc.translate(DefaultBigQueryRpc.java:93)
	at com.google.cloud.bigquery.spi.DefaultBigQueryRpc.patch(DefaultBigQueryRpc.java:218)
	at com.google.cloud.bigquery.BigQueryImpl$10.call(BigQueryImpl.java:329)
	at com.google.cloud.bigquery.BigQueryImpl$10.call(BigQueryImpl.java:326)
	at com.google.cloud.RetryHelper.doRetry(RetryHelper.java:179)
	at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:244)
	at com.google.cloud.bigquery.BigQueryImpl.update(BigQueryImpl.java:325)
	at com.wepay.kafka.connect.bigquery.SchemaManager.updateSchema(SchemaManager.java:57)
	at com.wepay.kafka.connect.bigquery.write.row.AdaptiveBigQueryWriter.performWriteRequest(AdaptiveBigQueryWriter.java:86)
	at com.wepay.kafka.connect.bigquery.write.row.BigQueryWriter.writeRows(BigQueryWriter.java:158)
	at com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter.seekingWriteAll(DynamicBatchWriter.java:131)
	... 7 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "location" : "table.write",
    "locationType" : "other",
    "message" : "Exceeded rate limits: Your table exceeded quota for table.insert or table.update per table. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors",
    "reason" : "rateLimitExceeded"
  } ],
  "message" : "Exceeded rate limits: Your table exceeded quota for table.insert or table.update per table. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors"
}
	at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1065)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
	at com.google.cloud.bigquery.spi.DefaultBigQueryRpc.patch(DefaultBigQueryRpc.java:213)
	... 16 more

This stack trace seems to suggest that retries for 403 errors are not working as I would expect.

Retry on 502 errors

KCBQ should retry with exponential backoff when it encounters a 502:

[2016-11-02 21:55:01,816] ERROR Commit of WorkerSinkTask{id=bq-0} offsets threw an unexpected exception:  (org.apache.kafka.connect.runtime.WorkerSinkTask)
com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: Exception occurred while executing write threads
#011at com.wepay.kafka.connect.bigquery.BigQuerySinkTask.flush(BigQuerySinkTask.java:238)
#011at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:275)
#011at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:155)
#011at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:142)
#011at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
#011at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
#011at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
#011at java.util.concurrent.FutureTask.run(FutureTask.java:266)
#011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
#011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
#011at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: Failed to write to BigQuery table {datasetId=foo, tableId=bar$20161102}
#011at java.util.concurrent.FutureTask.report(FutureTask.java:122)
#011at java.util.concurrent.FutureTask.get(FutureTask.java:192)
#011at com.wepay.kafka.connect.bigquery.BigQuerySinkTask.flush(BigQuerySinkTask.java:233)
#011at java.util.concurrent.FutureTask.report(FutureTask.java:122)
#011at java.util.concurrent.FutureTask.get(FutureTask.java:192)
#011at com.wepay.kafka.connect.bigquery.BigQuerySinkTask.flush(BigQuerySinkTask.java:233)
#011... 10 more
Caused by: com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: Failed to write to BigQuery table {datasetId=foo, tableId=bar$20161102}
#011at com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter.establishedWriteAll(DynamicBatchWriter.java:227)
#011at com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter.writeAll(DynamicBatchWriter.java:103)
#011at com.wepay.kafka.connect.bigquery.BigQuerySinkTask$TableWriter.call(BigQuerySinkTask.java:151)
#011... 10 more
Caused by: com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException: Failed to write to BigQuery table {datasetId=foo, tableId=bar$20161102}
#011at com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter.establishedWriteAll(DynamicBatchWriter.java:227)
#011at com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter.writeAll(DynamicBatchWriter.java:103)
#011at com.wepay.kafka.connect.bigquery.BigQuerySinkTask$TableWriter.call(BigQuerySinkTask.java:151)
#011at com.wepay.kafka.connect.bigquery.BigQuerySinkTask$TableWriter.call(BigQuerySinkTask.java:128)
#011... 4 more
Caused by: com.google.cloud.bigquery.BigQueryException: 502 Bad Gateway
<!DOCTYPE html>
<html lang=en>
 <meta charset=utf-8>
#011at com.wepay.kafka.connect.bigquery.BigQuerySinkTask$TableWriter.call(BigQuerySinkTask.java:128)
#011... 4 more
Caused by: com.google.cloud.bigquery.BigQueryException: 502 Bad Gateway
<!DOCTYPE html>
<html lang=en>
 <meta charset=utf-8>
 <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
 <title>Error 502 (Server Error)!!1</title>
 <style>
 <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
 <title>Error 502 (Server Error)!!1</title>
   *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
 </style>
 <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
 <style>
   *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
 </style>
 <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
 <p><b>502.</b> <ins>That’s an error.</ins>
 <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>
#011at com.google.cloud.bigquery.spi.DefaultBigQueryRpc.translate(DefaultBigQueryRpc.java:93)
#011at com.google.cloud.bigquery.spi.DefaultBigQueryRpc.insertAll(DefaultBigQueryRpc.java:287)
#011at com.google.cloud.bigquery.BigQueryImpl.insertAll(BigQueryImpl.java:410)
#011at com.wepay.kafka.connect.bigquery.write.row.SimpleBigQueryWriter.performWriteRequest(SimpleBigQueryWriter.java:66)
#011at com.wepay.kafka.connect.bigquery.write.row.BigQueryWriter.writeRows(BigQueryWriter.java:142)
#011at com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter.establishedWriteAll(DynamicBatchWriter.java:216)
#011... 7 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 502 Bad Gateway
    at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1065)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
    at com.google.cloud.bigquery.spi.DefaultBigQueryRpc.insertAll(DefaultBigQueryRpc.java:283)
    ... 11 more

Deserialization exception: Unknown magic byte!

Hi, I'm using version 1.1.0 of this connector and I'm trying to export some data from a Kafka topic to Google BigQuery. I've verified that I have a correct schema in my Schema Registry and that I can consume the topic with the kafka-avro-console-consumer (from the Confluent Platform).

When starting the connector, however, I'm getting the following exceptions in the log:
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!

The table is created in BigQuery and the fields look correct, but no rows are being written as a result of the exception.

Any pointers?
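
For context, here is a minimal sketch of the Kafka Connect converter settings this error usually involves; "Unknown magic byte!" typically means the topic's records were not produced with the Confluent Avro wire format that the AvroConverter expects. The registry URL below is a placeholder, not taken from the report:

# Worker/connector converter settings assumed for an Avro topic (illustrative)
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081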

Add a config to set all columns to nullable

KCBQ currently translates the Kafka Connect required/nullable settings for fields into BigQuery field modes as-is. That is, if a field is required in the Kafka Connect schema, it will be REQUIRED in BigQuery; the same is true for NULLABLE fields.

It would be useful to be able to tell KCBQ to ignore these settings and always mark BigQuery fields as NULLABLE. This would give us a bit more flexibility in the BigQuery pipeline, since it's difficult to migrate tables from REQUIRED to NULLABLE in BigQuery, and upstream systems occasionally want to do this. I acknowledge that when upstream systems do this, it's an incompatible operation for downstream consumers, but in cases where KCBQ is the only consumer it would be worth making it a bit more resilient to the change (vs. having to re-bootstrap the data to change the schema).

I suggest exposing this option via a config. To keep things compatible for existing connectors that are already running, it would be nice if the config only set NULLABLE on fields the connector doesn't already know about. That is, if a table already exists in BQ with required fields, it leaves those as they are, but any new fields that are added are always nullable.
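
A hypothetical sketch of what such an option could look like in connector.properties; the property name is illustrative and not part of the connector at the time of writing:

# Illustrative only: when true, relax all generated BigQuery column modes to NULLABLE,
# leaving any columns that already exist as REQUIRED untouched.
allBQFieldsNullable=true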

Add backoff and retry when KCBQ hits the quota limit

(Migrated from internal Jira issue DI-448)

We currently have a hard-coded sleep of 1000ms (BigQuerySinkTask.TABLE_WRITE_INTERVAL) for writes to each table. This is a pretty expensive performance penalty when bootstrapping a lot of data (i.e. when we're behind in the log). During load testing, I was seeing 30,000 rows (212 bytes each) take > 3 seconds to write to BigQuery using the streaming API. Single-writer performance with the BQ streaming API seems to be in the 1-2 MB/sec range.

Eliminating this sleep will expose us to quota issues with BigQuery, which only allows 100,000 rows/sec/table. It's going to be really difficult to tune this properly through configuration in a distributed environment, since we'll have some number of writers spread across multiple machines. I think the right approach is to add backoff and retry logic when we receive a quota-exceeded error. That would let the tasks handle quota errors automatically and allow us to go full throttle when bootstrapping data.
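
A rough sketch of the proposed behavior, assuming the google-cloud-bigquery client the connector uses; the attempt limit, backoff values, and the "quotaExceeded" reason check are illustrative assumptions, not the connector's actual code:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;

/**
 * Sketch: retry a streaming insert with exponential backoff when BigQuery
 * reports a quota error, instead of sleeping a fixed interval between every
 * table write.
 */
public class QuotaAwareWriterSketch {
  private static final int MAX_ATTEMPTS = 5;          // assumption
  private static final long INITIAL_BACKOFF_MS = 500; // assumption

  public InsertAllResponse writeWithRetry(BigQuery bigQuery, InsertAllRequest request)
      throws InterruptedException {
    long backoffMs = INITIAL_BACKOFF_MS;
    for (int attempt = 1; ; attempt++) {
      try {
        return bigQuery.insertAll(request);
      } catch (BigQueryException e) {
        // Assumed reason string for streaming quota errors; anything else is rethrown.
        boolean quotaError = "quotaExceeded".equals(e.getReason());
        if (!quotaError || attempt >= MAX_ATTEMPTS) {
          throw e;
        }
        Thread.sleep(backoffMs);
        backoffMs *= 2; // exponential backoff before the next attempt
      }
    }
  }
}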

More descriptive batch logging

The current batch logging that we get is:

INFO Increased batch size to 16000 (com.wepay.kafka.connect.bigquery.write.batch.DynamicBatchWriter:268)

This is a bit too vague. It'd be nice to know the table.
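
For example, a minimal sketch of the suggested log line (assuming an SLF4J logger; the variable names are illustrative, not the connector's actual code):

logger.info("Increased batch size to {} for table {}", newBatchSize, table);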

Using the connector with a JSON-based topic

Hi guys,

I've successfully used the connector to stream data from an Avro-based topic to BigQuery.
Does the connector support streaming JSON-based topics as well?

Thanks.
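
For reference, a sketch of the worker converter settings a JSON-based topic would typically use; these values are illustrative and not a confirmation that the connector supports them:

value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true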

Add a Google Cloud Storage BQ writer

KCBQ streams all writes into BigQuery. This is actually less efficient than the GCS->BQ loading style when bulk loading data into BigQuery. We should investigate adding the ability to bulk load data by writing it into a GCS bucket and then loading it into BigQuery.

This will be a bit complicated, as GCS is eventually consistent, and uploaded files might not (and often do not) appear for > 5 minutes after upload. We can't block the flush() method for that long, so there might have to be two phases: 1) upload to GCS, 2) load from GCS into BQ. Phase 2 can happen asynchronously, but must be tolerant to failures (i.e. if KCBQ is restarted, it should still load the remaining GCS files that were not yet bulk loaded into BQ).

Another thing to consider is dynamically switching back and forth depending on the KCBQ throughput (i.e. if you get more than 1 million rows in N seconds, switch to GCS/bulk load mode).

This is a long-term thing, but something to think about.
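
A rough sketch of the two phases described above, using the google-cloud Storage and BigQuery client libraries; the bucket, object, and table names are placeholders, and a real implementation would also need to track which GCS objects have already been loaded so a restarted connector can finish phase 2:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.nio.charset.StandardCharsets;

public class GcsBatchLoadSketch {
  public static void main(String[] args) throws InterruptedException {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();

    // Phase 1: write a batch of newline-delimited JSON rows to a GCS object.
    String bucket = "my-kcbq-staging-bucket";     // placeholder
    String object = "batches/batch-000001.json";  // placeholder
    String rows = "{\"f1\":\"Testing the Kafka-BigQuery Connector!\"}\n";
    storage.create(BlobInfo.newBuilder(bucket, object).build(),
        rows.getBytes(StandardCharsets.UTF_8));

    // Phase 2 (could run asynchronously): start a BigQuery load job from GCS.
    TableId table = TableId.of("my_dataset", "my_table");  // placeholder
    LoadJobConfiguration load = LoadJobConfiguration
        .newBuilder(table, "gs://" + bucket + "/" + object)
        .setFormatOptions(FormatOptions.json())
        .build();
    Job job = bigQuery.create(JobInfo.of(load));
    job.waitFor();  // the connector would track completion rather than block here
  }
}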

Add a CHANGELOG.md

I would recommend maintaining a CHANGELOG for changes between versions. We already do this for some of our other packages. (e.g., CHANGELOG.md)

This format applies the methodology from http://keepachangelog.com, and uses Chag to commit Git tags with the changelog as an extended description — which is handy.

It's also easy to wrap up this process in a Makefile (or whatever tool you choose) when scripting build steps. (e.g., Makefile)

Duplicated records

I am getting duplicated records in BigQuery compared to the offsets in Kafka for the same topic. Is this an expected result? Thanks!
