googlecloudplatform / dataflowtemplates

Cloud Dataflow Google-provided templates for solving in-Cloud data tasks

Home Page: https://cloud.google.com/dataflow/docs/guides/templates/provided-templates

License: Apache License 2.0

Java 88.72% JavaScript 0.32% Dockerfile 0.03% Python 0.21% PureBasic 0.01% Go 0.77% FreeMarker 0.01% HCL 9.94%
apache-beam dataflow-templates google-cloud-dataflow google-cloud-storage google-cloud-spanner bigquery bigtable

dataflowtemplates's Introduction

Google Cloud Dataflow Template Pipelines

These Dataflow templates are an effort to solve simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. The technology under the hood which makes these operations possible is the Google Cloud Dataflow service combined with a set of Apache Beam SDK templated pipelines.

Google is providing this collection of pre-implemented Dataflow templates as a reference and to provide easy customization for developers wanting to extend their functionality.


Note on Default Branch

As of November 18, 2021, our default branch is now named "main". This does not affect forks. If you would like your fork and its local clone to reflect these changes you can follow GitHub's branch renaming guide.

Building

Maven commands should be run on the parent POM. An example would be:

mvn clean package -pl v2/pubsub-binary-to-bigquery -am

Template Pipelines

For documentation on each template's usage and parameters, please see the official docs.

Getting Started

Requirements

  • Java 11
  • Maven 3

Building the Project

Build the entire project using the maven compile command.

mvn clean compile

Building/Testing from IntelliJ

IntelliJ, by default, will often skip necessary Maven goals, leading to build failures. You can fix these in the Maven view by going to Module_Name > Plugins > Plugin_Name, where Module_Name and Plugin_Name are the names of the module and plugin that define the rule. From there, right-click the rule and select "Execute Before Build".

The list of known rules that require this are:

  • common > Plugins > protobuf > protobuf:compile
  • common > Plugins > protobuf > protobuf:test-compile

Formatting Code

From either the root directory or v2/ directory, run:

mvn spotless:apply

This will format the code and add a license header. To verify that the code is formatted correctly, run:

mvn spotless:check

Executing a Template File

Once the template is staged on Google Cloud Storage, it can then be executed using the gcloud CLI tool. Please check Running classic templates or Using Flex Templates for more information.
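
For example, a classic template staged at a placeholder gs:// path could be launched with a command along these lines (job name, paths, and parameters below are illustrative placeholders, not values taken from this repository):

gcloud dataflow jobs run pubsub-to-gcs-text \
  --gcs-location gs://{bucketName}/templates/Cloud_PubSub_to_GCS_Text \
  --region us-central1 \
  --parameters inputTopic=projects/{projectId}/topics/{topicName},outputDirectory=gs://{bucketName}/out

Flex templates are launched with gcloud dataflow flex-template run and --template-file-gcs-location instead; see the linked guides for the exact flags.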

Developing/Contributing Templates

Templates Plugin

The Templates plugin was created to make the workflow of creating, testing, and releasing templates easier.

Before using the plugin, please make sure that the gcloud CLI is installed and up-to-date, and that the client is properly authenticated using:

gcloud init
gcloud auth application-default login

After authenticating, install the plugin into your local repository:

mvn clean install -pl plugins/templates-maven-plugin -am

Staging (Deploying) Templates

To stage a Template, it is necessary to upload the images to Artifact Registry (for Flex templates) and copy the template to Cloud Storage.

Although the exact steps depend on the kind of template being developed, the plugin allows a template to be staged using the following single command:

mvn clean package -PtemplatesStage  \
  -DskipTests \
  -DprojectId="{projectId}" \
  -DbucketName="{bucketName}" \
  -DstagePrefix="images/$(date +%Y_%m_%d)_01" \
  -DtemplateName="Cloud_PubSub_to_GCS_Text_Flex" \
  -pl v2/googlecloud-to-googlecloud -am

Notes:

  • Change -pl v2/googlecloud-to-googlecloud and -DtemplateName to point to the specific Maven module where your template is located. Even though -pl is not required, it allows the command to run considerably faster.
  • If -DtemplateName is not specified, all templates for the module will be staged.

Generating Template's Terraform module

This repository can generate a Terraform module that prompts users for template-specific parameters and launches a Dataflow job. To generate a template-specific Terraform module, see the instructions for classic and flex templates below.

Plugin artifact dependencies

The required plugin artifact dependencies are outputs from the cicd/cmd/run-terraform-schema tool. See cicd/cmd/run-terraform-schema/README.md for further details.

For Classic templates:

mvn clean prepare-package \
  -DskipTests \
  -PtemplatesTerraform \
  -pl v1 -am

Next, terraform fmt the modules after generating:

terraform fmt -recursive v1

The resulting terraform modules are generated in v1/terraform.

For Flex templates:

mvn clean prepare-package \
  -DskipTests \
  -PtemplatesTerraform \
  -pl v2/googlecloud-to-googlecloud -am

Next, terraform fmt the modules after generating:

terraform fmt -recursive v2

The resulting terraform modules are generated in v2/<source>-to-<sink>/terraform, for example v2/bigquery-to-bigtable/terraform.
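
As a rough sketch of exercising one of these generated modules locally (assuming the bigquery-to-bigtable example above; the -var names here are hypothetical and should be taken from the generated module's variables file, not from this README):

cd v2/bigquery-to-bigtable/terraform
terraform init
terraform plan \
  -var "project=my-project" \
  -var "region=us-central1"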

Notes:

  • Change -pl v2/googlecloud-to-googlecloud to point to the specific Maven module where your template is located.

Running a Template

A template can also be executed on Dataflow, directly from the command line. The command line is similar to the one used for staging a template, but you must also specify -Dparameters with the parameters that will be used when launching the template. For example:

mvn clean package -PtemplatesRun \
  -DskipTests \
  -DprojectId="{projectId}" \
  -DbucketName="{bucketName}" \
  -Dregion="us-central1" \
  -DtemplateName="Cloud_PubSub_to_GCS_Text_Flex" \
  -Dparameters="inputTopic=projects/{projectId}/topics/{topicName},windowDuration=15s,outputDirectory=gs://{outputDirectory}/out,outputFilenamePrefix=output-,outputFilenameSuffix=.txt" \
  -pl v2/googlecloud-to-googlecloud -am

Notes:

  • When running a template, -DtemplateName is mandatory, as -Dparameters= differ across templates.
  • -PtemplatesRun is self-contained, i.e., it is not required to run the Staging (Deploying) Templates step first. If you want to run a previously staged template, the existing path can be provided as -DspecPath=gs://.../path (see the sketch after these notes).
  • -DjobName="{name}" may be specified if a specific name is desired (optional).
  • If you encounter the error Template run failed: File too large, try adding -DskipShade to the mvn args.
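
As a rough sketch of the -DspecPath note above, launching a previously staged template could look like the following; the spec path is only a placeholder, not an object that exists in this repository:

mvn clean package -PtemplatesRun \
  -DskipTests \
  -DprojectId="{projectId}" \
  -DbucketName="{bucketName}" \
  -Dregion="us-central1" \
  -DtemplateName="Cloud_PubSub_to_GCS_Text_Flex" \
  -DspecPath="gs://{bucketName}/{stagePrefix}/flex/Cloud_PubSub_to_GCS_Text_Flex" \
  -Dparameters="inputTopic=projects/{projectId}/topics/{topicName},windowDuration=15s,outputDirectory=gs://{outputDirectory}/out,outputFilenamePrefix=output-,outputFilenameSuffix=.txt" \
  -pl v2/googlecloud-to-googlecloud -am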

Running Integration Tests

To run integration tests, the developer plugin can also be used to stage templates on demand (in case the parameter -DspecPath= is not specified).

For example, to run all the integration tests in a specific module (in the example below, v2/googlecloud-to-googlecloud):

mvn clean verify \
  -PtemplatesIntegrationTests \
  -Dproject="{project}" \
  -DartifactBucket="{bucketName}" \
  -Dregion=us-central1 \
  -pl v2/googlecloud-to-googlecloud -am

The parameter -Dtest= can be given to test a single class (e.g., -Dtest=PubsubToTextIT) or single test case (e.g., -Dtest=PubsubToTextIT#testTopicToGcs).
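
For instance, reusing the module and test names from the examples above, a single test case could be run with:

mvn clean verify \
  -PtemplatesIntegrationTests \
  -Dproject="{project}" \
  -DartifactBucket="{bucketName}" \
  -Dregion=us-central1 \
  -Dtest=PubsubToTextIT#testTopicToGcs \
  -pl v2/googlecloud-to-googlecloud -am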

The same applies when tests are executed from an IDE; just make sure to add the parameters -Dproject=, -DartifactBucket=, and -Dregion= as program or VM arguments.

Metadata Annotations

A template requires more information than just a name and description. For example, in order to be used from the Dataflow UI, parameters need a longer help text to guide users, as well as proper types and validations to make sure parameters are being passed correctly.

We introduced annotations to have the source code as a single source of truth, along with a set of utilities / plugins to generate template-accompanying artifacts (such as command specs, parameter specs).

@Template Annotation

Every template must be annotated with @Template. Existing templates can be used for reference, but the structure is as follows:

@Template(
    name = "BigQuery_to_Elasticsearch",
    category = TemplateCategory.BATCH,
    displayName = "BigQuery to Elasticsearch",
    description = "A pipeline which sends BigQuery records into an Elasticsearch instance as JSON documents.",
    optionsClass = BigQueryToElasticsearchOptions.class,
    flexContainerName = "bigquery-to-elasticsearch")
public class BigQueryToElasticsearch {

@TemplateParameter Annotation

A set of @TemplateParameter.{Type} annotations was created to allow the definition of options for a template, their proper rendering in the UI, and validation by the template launch service. Examples can be found in the repository, but the general structure is as follows:

@TemplateParameter.Text(
    order = 2,
    optional = false,
    regexes = {"[,a-zA-Z0-9._-]+"},
    description = "Kafka topic(s) to read the input from",
    helpText = "Kafka topic(s) to read the input from.",
    example = "topic1,topic2")
@Validation.Required
String getInputTopics();
@TemplateParameter.GcsReadFile(
    order = 1,
    description = "Cloud Storage Input File(s)",
    helpText = "Path of the file pattern glob to read from.",
    example = "gs://your-bucket/path/*.csv")
String getInputFilePattern();
@TemplateParameter.Boolean(
    order = 11,
    optional = true,
    description = "Whether to use column alias to map the rows.",
    helpText = "If enabled (set to true) the pipeline will consider column alias (\"AS\") instead of the column name to map the rows to BigQuery.")
@Default.Boolean(false)
Boolean getUseColumnAlias();
@TemplateParameter.Enum(
    order = 21,
    enumOptions = {"INDEX", "CREATE"},
    optional = true,
    description = "Build insert method",
    helpText = "Whether to use INDEX (index, allows upsert) or CREATE (create, errors on duplicate _id) with Elasticsearch bulk requests.")
@Default.Enum("CREATE")
BulkInsertMethodOptions getBulkInsertMethod();

Note: order is relevant for templates that can be used from the UI, and specifies the relative order of parameters.

@TemplateIntegrationTest Annotation

This annotation should be used by classes that implement integration tests for templates. It wires a specific IT class to a template, and allows environment preparation / proper template staging before tests are executed on Dataflow.

Template tests have to follow this general format (please note the @TemplateIntegrationTest annotation and the TemplateTestBase super-class):

@TemplateIntegrationTest(PubsubToText.class)
@RunWith(JUnit4.class)
public final class PubsubToTextIT extends TemplateTestBase {

Please refer to Templates Plugin to use and validate such annotations.

Using UDFs

User-defined functions (UDFs) allow you to customize a template's functionality by providing a short JavaScript function, without having to maintain the entire codebase. This is useful in situations where you'd like to rename fields, filter values, or even transform data formats before output to the destination. All UDFs are executed by providing the payload of the element as a string to the JavaScript function. You can then use JavaScript's built-in JSON parser or other system functions to transform the data prior to the pipeline's output. The return statement of a UDF specifies the payload to pass forward in the pipeline, and it should always return a string value. If no value is returned or the function returns undefined, the incoming record will be filtered from the output.

UDF Function Specification

  • Datastore Bulk Delete: input (String) is a JSON string of the entity; output (String) is a JSON string of the entity to delete (filter entities by returning undefined).
  • Datastore to Pub/Sub: input (String) is a JSON string of the entity; output (String) is the payload to publish to Pub/Sub.
  • Datastore to GCS Text: input (String) is a JSON string of the entity; output (String) is a single line within the output file.
  • GCS Text to BigQuery: input (String) is a single line within the input file; output (String) is a JSON string which matches the destination table's schema.
  • Pub/Sub to BigQuery: input (String) is a string representation of the incoming payload; output (String) is a JSON string which matches the destination table's schema.
  • Pub/Sub to Datastore: input (String) is a string representation of the incoming payload; output (String) is a JSON string of the entity to write to Datastore.
  • Pub/Sub to Splunk: input (String) is a string representation of the incoming payload; output (String) is the event data to be sent to the Splunk HEC events endpoint (must be a string or a stringified JSON object).

UDF Examples

For a comprehensive list of samples, please check our udf-samples folder.

Adding fields

/**
 * A transform which adds a field to the incoming data.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  obj.dataFeed = "Real-time Transactions";
  obj.dataSource = "POS";
  return JSON.stringify(obj);
}

Filtering records

/**
 * A transform function which only accepts 42 as the answer to life.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  // only output objects which have an answer to life of 42.
  if (obj.hasOwnProperty('answerToLife') && obj.answerToLife === 42) {
    return JSON.stringify(obj);
  }
}

Generated Documentation

This repository contains generated documentation, which contains a list of parameters and instructions on how to customize and/or build every template.

To generate the documentation for all templates, the following command can be used:

mvn clean prepare-package \
  -DskipTests \
  -PtemplatesSpec

Release Process

Templates are released on a weekly basis (best-effort) as part of the effort to keep Google-provided templates updated with the latest fixes and improvements.

If desired, you can stage and use your own changes using the Staging (Deploying) Templates steps.

To release multiple templates, we provide a single Maven command, which is a shortcut to stage all templates while running additional validations.

mvn clean verify -PtemplatesRelease \
  -DprojectId="{projectId}" \
  -DbucketName="{bucketName}" \
  -DlibrariesBucketName="{bucketName}-libraries" \
  -DstagePrefix="$(date +%Y_%m_%d)-00_RC00"

Maven artifacts

As part of the templates development process, we release snapshots of the common artifacts to Maven Central, but not the modules that contain finalized templates. This allows users to consume those resources and modules without forking the entire project, while keeping artifacts at a reasonable size.

In order to release artifacts, ~/.m2/settings.xml should be configured to contain Sonatype's username and password:

<servers>
  <server>
    <id>ossrh</id>
    <username>(user)</username>
    <password>(password)</password>
  </server>
</servers>

And the command to release (for example, the development plugin and Spanner together):

mvn clean deploy -am -Prelease \
  -pl plugins/templates-maven-plugin \
  -pl v2/spanner-common

If you intend to use those resources in an external project, your pom.xml should include:

<repositories>
  <repository>
    <id>ossrh</id>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
  </repository>
</repositories>
<pluginRepositories>
  <pluginRepository>
    <id>ossrh</id>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
  </pluginRepository>
</pluginRepositories>

More Information

  • Dataflow Templates - basic template concepts.
  • Google-provided Templates - official documentation for templates provided by Google (the source code is in this repository).
  • Dataflow Cookbook: Blog, GitHub Repository - pipeline examples and practical solutions to common data processing challenges.
  • Dataflow Metrics Collector - CLI tool to collect Dataflow resource & execution metrics and export them to either BigQuery or Google Cloud Storage. Useful for comparison and visualization of the metrics while benchmarking Dataflow pipelines using various data formats, resource configurations, etc.
  • Apache Beam
    • Overview
    • Quickstart: Java, Python, Go
    • Tour of Beam - an interactive tour with learning topics covering core Beam concepts from simple ones to more advanced ones.
    • Beam Playground - an interactive environment to try out Beam transforms and examples without having to install Apache Beam.
    • Beam College - hands-on training and practical tips, including video recordings of Apache Beam and Dataflow Templates lessons.
    • Getting Started with Apache Beam - Quest - A 5 lab series that provides a Google Cloud certified badge upon completion.

dataflowtemplates's People

Contributors

abacn, adrw-google, aksharauke, alexeykukuku, ali-ince, allenpradeep, an2x, andreigurau, ash-ddog, billyjacobson, biswanag, bvolpato, cherepushko, cloud-teleport, damondouglas, deep1998, dhercher, dippatel98, drumcircle, fbiville, fozzie15, georgecma, melbrodrigues, oleg-semenov, pabloem, polber, pranavbhandari24, shreyakhajanchi, theshanbhag, zhoufek

dataflowtemplates's Issues

Pubsub to BigQuery - Accumulated errors make jobs crash

Hi

I've been using Google Dataflow Templates to send messages from pub/sub to BigQuery based on this: https://cloud.google.com/dataflow/docs/templates/provided-templates#cloudpubsubtobigquery

Since I've launched the dataflow job in streaming mode, the job has started to generate errors and finally crash based on the way Dataflow exceptions are handled:
https://cloud.google.com/dataflow/faq#how-are-java-exceptions-handled-in-cloud-dataflow

Here is the error:
java.lang.RuntimeException: java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"","message":"Repeated record added outside of an array.","reason":"invalid"}],"index":0}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":1}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":2}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":3} .......]
org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:125)
org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:94)

As I understand, this kind of behaviour makes this template not useful for data streaming.

Is there any way to configure the template so that exceptions are not thrown but are still sent to Stackdriver?

Thanks

Record type not supported in TextIOToBigQuery template

Hi guys,

The BigQuery schema file that is used in this template cannot have a RECORD type defined, otherwise you get the error below. Looking at the code, it does not look like it accommodates a nested/recursive build-up of the table schema in the withSchema() block to deal with RECORD schema definitions. I would be happy to code it up if you like.

Error: Field xxxx is type RECORD but has no schema.

DatastoreToBigQuery CREATE_IF_NEEDED problem

Hi,

I'm trying to compile the DatastoreToBigQuery template. But I get this error

An exception occurred while executing the Java class. CreateDisposition is CREATE_IF_NEEDED, however no schema was provided

I tried to compile the DatastoreToText and TextToBigQuery templates and both worked, but somehow DatastoreToBigQuery isn't working.

Here is the whole Log:

    at org.apache.beam.vendor.guava.v20_0.com.google.common.base.Preconditions.checkArgument (Preconditions.java:122)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expandTyped (BigQueryIO.java:2122)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand (BigQueryIO.java:2099)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand (BigQueryIO.java:1445)
    at org.apache.beam.sdk.Pipeline.applyInternal (Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform (Pipeline.java:488)
    at org.apache.beam.sdk.values.PCollection.apply (PCollection.java:370)
    at com.google.cloud.teleport.templates.DatastoreToBigQuery.main (DatastoreToBigQuery.java:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
    at java.lang.Thread.run (Thread.java:748)

Maybe someone had the same problem and can help me. 

DataFlow Template : Sequential BigQuery Table Insertion issue

@sabhyankar

Hi, I am trying to create a Dataflow template where it needs to write to two BigQuery tables sequentially.

E.g., first it writes to the "Patient" table. Once that's done, it writes to the "Statistics" table. But I am facing this issue: once it writes to the Patient table, it does not execute any code after that.

If you can please give some suggestions, it would be really helpful.

Thanks!

BigQuery to PubSub

Would BigQuery to PubSub be a good addition to these templates?
In the Dataflow Codelab, the NYC Taxi BigQuery dataset is used to publish to a topic. I think developers can utilize the existing, well-maintained BigQuery datasets to test out various streaming solutions through PubSub.

Use same group.id in consumer properties

Setting the group.id in .updateConsumerProperties() still makes the reader start reading at offset 0 for all jobs.

The setup

        Map<String, Object> props = new HashMap<>();
        props.put("group.id", "dataflow-reader");
        props.put("auto.offset.reset", "earliest");

        PCollection<KafkaRecord<String, String>> pcol = p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers(options.getBootstrapServers())
            .withTopics(topics)
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withNumSplits(1)
            .updateConsumerProperties(props));

When I start a new job, this is logged in the console:

Reader-0: reading from name-of-topic-0 starting at offset 0
ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [xxxx]
check.crcs = true
client.id =
connections.max.idle.ms = 540000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = Reader-0_offset_consumer_778069295_dataflow-reader

And it looks like that happens here

Is it possible to make the reader not start from offset 0 for each new dataflow job instance?

Pub/Sub to BQ fails to serialize json

I am trying to get going with the Dataflow template for Pub/Sub subscription to BigQuery. All of the messages end up in the error records table. The stack trace says: "java.lang.RuntimeException: Failed to serialize json to table row", and it fails for the actual message body.

The body of each message is a JSON array like:
[{"itemNo":"00050330","itemType":"Sales","itemDesc":"A Table","quantity":4.0,"extendedAmount":120.0,"orderId":null,"itemGroup":"0421","originalDocumentNo":null}]

Any help would be appreciated!

V2 Flex Templates are broken for Python

When a flex template is created using both my own templates and the provided flex wordcount template, the Dataflow pipeline fails to build. The issue causes the Dataflow pipeline not to build correctly once the API call is made.

I believe the issue arises from the base docker images found at: gcr.io/dataflow-templates-base

When I have tested these with:

docker run --interactive --tty gcr.io/dataflow-templates-base/java8-template-launcher-base bash

I get the following error before it kicks me from the container and shuts it down:

2019/11/13 09:50:10 proto: duplicate proto type registered: google.api.Http
2019/11/13 09:50:10 proto: duplicate proto type registered: google.api.HttpRule
2019/11/13 09:50:10 proto: duplicate proto type registered: google.api.CustomHttpPattern
2019/11/13 09:50:10 proto: duplicate proto type registered: google.protobuf.FieldMask
Created new fluentd log writer for: /var/log/dataflow/template_launcher/runner-json.log

Custom shard templates with YYYY/MM/dd/HH/mm replacements

Hi all,

We are trying to set up a custom shard template like gs://bucket/YYYY/MM/dd/HH/W/P-SS-of-NN, so that the bucket can still be easily browsed manually.

We are having issues with that custom shard template, as it seems like those replacements are not supported. The short description listed when creating the job does not say anything about the likes of year/month/date/hour replacements. It just says:

The shard template defines the unique/dynamic portion of each windowed file. Recommended to use the default (W-P-SS-of-NN). At runtime, 'W' is replaced with the window date range and 'P' is replaced with the pane info.

Searching this repo we found these references, but we are not sure if these are available as part of the custom shard template.

Questions:

  • Are YYYY, MM, dd, HH available to the custom shard template?

  • What are the supported replacements?

Any pointers will be greatly appreciated.

Javascript UDF error when parsing JSON

I have used the Pub/Sub to BigQuery template to stream JSON data that are sent to a Pub/Sub topic. Through Dataflow I want to flatten the data to match the BigQuery schema and stream them.

Here is the Javascript UDF for the Dataflow process:

function transform(inJson) {
    var obj = JSON.parse(inJson);
    // variable declarations
    // ... 
    data['domain'] = obj['data']['domain']; // line 18
    ...
    return JSON.stringify(data);
}

I've also tried:

data.domain = obj.data.domain;

I've just copied the example from this repo and extended it to flatten the JSON data.

Here is the error message:

TypeError: Cannot read property "domain" from undefined in <eval> at line number 18

and here is the stack trace:

javax.script.ScriptException: TypeError: Cannot read property "domain" from undefined in <eval> at line number 18
    at jdk.nashorn.api.scripting.NashornScriptEngine.throwAsScriptException(NashornScriptEngine.java:470)
    at jdk.nashorn.api.scripting.NashornScriptEngine.invokeImpl(NashornScriptEngine.java:392)
    at jdk.nashorn.api.scripting.NashornScriptEngine.invokeFunction(NashornScriptEngine.java:190)
    at com.google.cloud.teleport.templates.common.JavascriptTextTransformer$JavascriptRuntime.invoke(JavascriptTextTransformer.java:156)
    at com.google.cloud.teleport.templates.common.JavascriptTextTransformer$FailsafeJavascriptUdf$1.processElement(JavascriptTextTransformer.java:315)
    at com.google.cloud.teleport.templates.common.JavascriptTextTransformer$FailsafeJavascriptUdf$1$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:272)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:309)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:77)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:621)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:609)
    at com.google.cloud.teleport.templates.PubSubToBigQuery$PubsubMessageToFailsafeElementFn.processElement(PubSubToBigQuery.java:412)
    at com.google.cloud.teleport.templates.PubSubToBigQuery$PubsubMessageToFailsafeElementFn$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:272)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:309)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:77)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:621)
    at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:71)
    at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:122)
    at org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:76)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1233)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:144)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:972)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: <eval>:18 TypeError: Cannot read property "domain" from undefined
    at jdk.nashorn.internal.runtime.ECMAErrors.error(ECMAErrors.java:57)
    at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:213)
    at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:185)
    at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:172)
    at jdk.nashorn.internal.runtime.Undefined.get(Undefined.java:157)
    at jdk.nashorn.internal.scripts.Script$Recompilation$1$7667A$\^eval\_.transform(<eval>:18)
    at jdk.nashorn.internal.runtime.ScriptFunctionData.invoke(ScriptFunctionData.java:639)
    at jdk.nashorn.internal.runtime.ScriptFunction.invoke(ScriptFunction.java:494)
    at jdk.nashorn.internal.runtime.ScriptRuntime.apply(ScriptRuntime.java:393)
    at jdk.nashorn.api.scripting.ScriptObjectMirror.callMember(ScriptObjectMirror.java:199)
    at jdk.nashorn.api.scripting.NashornScriptEngine.invokeImpl(NashornScriptEngine.java:386)
    ... 42 more

When I try the JavaScript locally by passing some sample data, it works as expected without any errors.

GroupByKey Exception in common/DatastoreConverters.java

Trying to use the Pub/Sub to Datastore template I ran into the error described in this StackOverflow thread. I tried the recommendation posted there, but then got the following exception:

java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
at org.apache.beam.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:173)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:204)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:120)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at com.google.cloud.teleport.templates.common.DatastoreConverters$CheckSameKey.expand(DatastoreConverters.java:210)
at com.google.cloud.teleport.templates.common.DatastoreConverters$CheckSameKey.expand(DatastoreConverters.java:172)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at com.google.cloud.teleport.templates.common.DatastoreConverters$WriteJsonEntities.expand(DatastoreConverters.java:158)
at com.google.cloud.teleport.templates.common.DatastoreConverters$WriteJsonEntities.expand(DatastoreConverters.java:134)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at com.google.cloud.teleport.templates.PubsubToDatastore.main(PubsubToDatastore.java:69)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
at java.lang.Thread.run(Thread.java:748)

There is a workaround for this in another StackOverflow thread that bypasses this issue, but it wouldn't hurt if someone could take a look at it. Thanks.

PubSub to Bigquery #Null Pointer Exception

After the changes made to PubSubToBigQuery on April 2nd (d240b96), I am unable to build the Dataflow template and am getting a NullPointerException.

Even the error information doesn't help me locate which params/options are missing while running through mvn in debug mode. Please fix.

mvn compile exec:java -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery -Dexec.cleanupDaemonThreads=false -Dexec.args="--project=${PROJECT_ID} --stagingLocation=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery/staging --tempLocation=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery/temp --templateLocation=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery/template --runner=DataflowRunner"

[WARNING]
java.lang.NullPointerException
at com.google.cloud.teleport.templates.PubSubToBigQuery.run (PubSubToBigQuery.java:226)
at com.google.cloud.teleport.templates.PubSubToBigQuery.main (PubSubToBigQuery.java:191)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)

ClassCastException importing integer array from BigQuery Avro export

When importing an Avro dump of a BigQuery table with a repeated integer column (array of integers), we see a class cast exception:

Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
    at com.google.cloud.teleport.spanner.AvroRecordConverter.lambda$readInt64Array$11(AvroRecordConverter.java:359)

This looks like a line number from the previous commit of this code: 42ce71e

I have experienced a similar problem when trying to write the output of BigQuery with a nullable INTEGER column into a Spanner INT64, and I had to use a function like:

private Long longVal(Object v, Long defaultValue) {
  return v == null ? defaultValue : Long.parseLong(v.toString());
}

So I think the fix is probably to replace

value.stream().map(x -> x == null ? null : (long) x).collect(Collectors.toList()));

with

value.stream().map(x -> x == null ? null : Long.parseLong(x.toString())).collect(Collectors.toList()));

I have not had a chance to try this fix or develop a test + PR but thought I would share the bug report in case other people are Googling for the answer to those pesky exceptions like I was.

Publishing Template

Just pulled the project & published it with this mvn command:

mvn compile exec:java \
  -Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToAvro \
  -Dexec.cleanupDaemonThreads=false \
  -Dexec.args=" \
  --project=${gcp_project_id} \
  --stagingLocation=gs://${bucket_name}/staging \
  --tempLocation=gs://${bucket_name}/temp \
  --templateLocation=gs://${bucket_name}/templates/default-pubsub-to-avro.json \
  --runner=DataflowRunner"

and create a job with the default template from Dataflow & see this warning:

No metadata file found for this template.

Support for ORC to BigTable

Is there a supported template for converting ORC on GCS to Bigtable? (I know it is a bit of an eccentric conversion...) I am trying to implement a batch dataflow for migrating some Hive table data (originally located in AWS S3, being transferred to GCS) to Bigtable.

If not, if there is any workaround/example to this, please leave a comment here.

PubSub Subscription to BQ unable to set number of max workers

Hi,

I created a Dataflow job from the PubSub_Subscription_to_BigQuery template and set the max number of workers to 10. The job was created successfully. However, in the job details in the GCP console, maxNumWorkers is always reverted back to 3. Is there any workaround for this? Thanks!

PS: I have tried to create the job from the UI and the gcloud CLI and got the same result.

GCS Avro to BigQuery

This is not an issue, but a question. In the absence of any other method, I am asking my question in the form of an issue. Sorry for that. Here goes the question:

Will the GCS Avro to BigTable code work for BQ also? Is there anything I should take care of before applying it to a BQ case?

Thanks much
Pramod

AvroToBigtable

I am trying to import and run this project. The Avro to Bigtable conversion is missing BigTableRow.java and BigTableCell.java, and hence I am getting compilation errors on the example.

Multiple BootstrapServers for KafkaToBigQuery parameters

Hi,

I notice there is a parameter called bootstrapServers in https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/KafkaToBigQuery.java. I am trying to use 3 Kafka brokers to sync data from Kafka to BigQuery. I cannot use a comma "," to separate the brokers in the parameter.

Could you tell me how to set the bootstrap servers so the dataflow would ingest from multiple Kafka brokers?

Thanks in advance.

Unable to create template

C:\Users\anirusharma\poc\DataflowTemplates>mvn compile exec:java -Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToPubsub -Dexec.cleanupDaemonThreads=false -Dexec.args="--project=testbatch-211413 --stagingLocation=gs://templates_test_as/staging --tempLocation=gs://templates_test_as/temp --templateLocation=gs://template_data_as/templates/PubsubToPubsub.json --filesToStage=gs://templates_test_as/staging2 --runner=DataflowRunner"
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Detecting the operating system and CPU architecture
[INFO] ------------------------------------------------------------------------
[INFO] os.detected.name: windows
[INFO] os.detected.arch: x86_64
[INFO] os.detected.version: 10.0
[INFO] os.detected.version.major: 10
[INFO] os.detected.version.minor: 0
[INFO] os.detected.classifier: windows-x86_64
[INFO]
[INFO] --------< com.google.cloud.teleport:google-cloud-teleport-java >--------
[INFO] Building Google Cloud Teleport 0.1-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from central: https://repo.maven.apache.org/maven2/io/grpc/grpc-core/maven-metadata.xml
Downloaded from central: https://repo.maven.apache.org/maven2/io/grpc/grpc-core/maven-metadata.xml (1.8 kB at 1.4 kB/s)
Downloading from central: https://repo.maven.apache.org/maven2/io/netty/netty-codec-http2/maven-metadata.xml
Downloaded from central: https://repo.maven.apache.org/maven2/io/netty/netty-codec-http2/maven-metadata.xml (2.0 kB at 20 kB/s)
Downloading from Apache Snapshots Repository: https://repository.apache.org/content/repositories/snapshots/io/grpc/grpc-core/maven-metadata.xml
Downloading from Sonatype Snapshots Repository: https://oss.sonatype.org/content/repositories/snapshots/io/grpc/grpc-core/maven-metadata.xml
Downloaded from Sonatype Snapshots Repository: https://oss.sonatype.org/content/repositories/snapshots/io/grpc/grpc-core/maven-metadata.xml (802 B at 850 B/s)
Downloading from Sonatype Snapshots Repository: https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-codec-http2/maven-metadata.xml
Downloading from Apache Snapshots Repository: https://repository.apache.org/content/repositories/snapshots/io/netty/netty-codec-http2/maven-metadata.xml
Downloaded from Sonatype Snapshots Repository: https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-codec-http2/maven-metadata.xml (1.5 kB at 6.7 kB/s)
[INFO]
[INFO] --- maven-enforcer-plugin:3.0.0-M1:enforce (enforce) @ google-cloud-teleport-java ---
[INFO] artifact io.grpc:grpc-core: checking for updates from central
[INFO] artifact io.netty:netty-codec-http2: checking for updates from central
[INFO]
[INFO] --- maven-enforcer-plugin:3.0.0-M1:enforce (enforce-banned-dependencies) @ google-cloud-teleport-java ---
[INFO]
[INFO] --- protobuf-maven-plugin:0.5.1:compile (default) @ google-cloud-teleport-java ---
[INFO] Compiling 1 proto file(s) to C:\Users\anirusharma\poc\DataflowTemplates\target\generated-sources\protobuf\java
[INFO]
[INFO] --- protobuf-maven-plugin:0.5.1:compile-custom (default) @ google-cloud-teleport-java ---
[INFO] Compiling 1 proto file(s) to C:\Users\anirusharma\poc\DataflowTemplates\target\generated-sources\protobuf\grpc-java
[INFO]
[INFO] --- avro-maven-plugin:1.8.2:schema (default) @ google-cloud-teleport-java ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ google-cloud-teleport-java ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO] Copying 1 resource
[INFO] Copying 1 resource
[INFO]
[INFO] --- maven-compiler-plugin:3.6.2:compile (default-compile) @ google-cloud-teleport-java ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 93 source files to C:\Users\anirusharma\poc\DataflowTemplates\target\classes
[INFO] /C:/Users/anirusharma/poc/DataflowTemplates/src/main/java/com/google/cloud/teleport/templates/common/BigQueryConverters.java: Some input files use or override a deprecated API.
[INFO] /C:/Users/anirusharma/poc/DataflowTemplates/src/main/java/com/google/cloud/teleport/templates/common/BigQueryConverters.java: Recompile with -Xlint:deprecation for details.
[INFO] /C:/Users/anirusharma/poc/DataflowTemplates/src/main/java/com/google/cloud/teleport/kafka/connector/KafkaRecordCoder.java: Some input files use unchecked or unsafe operations.
[INFO] /C:/Users/anirusharma/poc/DataflowTemplates/src/main/java/com/google/cloud/teleport/kafka/connector/KafkaRecordCoder.java: Recompile with -Xlint:unchecked for details.
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ google-cloud-teleport-java ---
[WARNING]
java.lang.RuntimeException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:224)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:155)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:55)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:145)
at com.google.cloud.teleport.templates.PubsubToPubsub.run (PubsubToPubsub.java:149)
at com.google.cloud.teleport.templates.PubsubToPubsub.main (PubsubToPubsub.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:214)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:155)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:55)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:145)
at com.google.cloud.teleport.templates.PubsubToPubsub.run (PubsubToPubsub.java:149)
at com.google.cloud.teleport.templates.PubsubToPubsub.main (PubsubToPubsub.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)
Caused by: java.lang.RuntimeException: Unable to get application default credentials. Please see https://developers.google.com/accounts/docs/application-default-credentials for details on how to specify credentials. This version of the SDK is dependent on the gcloud core component version 2015.02.05 or newer to be able to get credentials from the currently authorized user via gcloud auth.
at org.apache.beam.sdk.extensions.gcp.auth.NullCredentialInitializer.throwNullCredentialException (NullCredentialInitializer.java:60)
at org.apache.beam.runners.dataflow.util.DataflowTransport.chainHttpRequestInitializer (DataflowTransport.java:99)
at org.apache.beam.runners.dataflow.util.DataflowTransport.newDataflowClient (DataflowTransport.java:76)
at org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions$DataflowClientFactory.create (DataflowPipelineDebugOptions.java:134)
at org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions$DataflowClientFactory.create (DataflowPipelineDebugOptions.java:131)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper (ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault (ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke (ProxyInvocationHandler.java:158)
at com.sun.proxy.$Proxy42.getDataflowClient (Unknown Source)
at org.apache.beam.runners.dataflow.DataflowClient.create (DataflowClient.java:41)
at org.apache.beam.runners.dataflow.DataflowRunner. (DataflowRunner.java:338)
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions (DataflowRunner.java:332)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:214)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:155)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:55)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:145)
at com.google.cloud.teleport.templates.PubsubToPubsub.run (PubsubToPubsub.java:149)
at com.google.cloud.teleport.templates.PubsubToPubsub.main (PubsubToPubsub.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:40 min
[INFO] Finished at: 2018-12-18T00:01:47-05:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project google-cloud-teleport-java: An exception occured while executing the Java class. Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions): InvocationTargetException: Unable to get application default credentials. Please see https://developers.google.com/accounts/docs/application-default-credentials for details on how to specify credentials. This version of the SDK is dependent on the gcloud core component version 2015.02.05 or newer to be able to get credentials from the currently authorized user via gcloud auth. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Create Template to bulk import from Datastore/Firestore exported files

Cloud Firestore and Cloud Datastore share the same leveldb export format.

Feature request to support bulk processing of Datastore/Firestore export files.

reference:

I can confirm that the following Python snippet reads and displays the export files.

#!/usr/bin/python
# virtualenv env
# source env/bin/activate

import sys
sys.path.append('/apps/google-cloud-sdk/platform/google_appengine/')
from google.appengine.api.files import records
from google.appengine.datastore import entity_pb
from google.appengine.api import datastore

raw = open('2018-11-05T17_49_44_60804_all_namespaces_all_kinds_output-0', 'r')
reader = records.RecordsReader(raw)
for record in reader:
    entity = datastore.Entity.FromPb(entity_pb.EntityProto(contents=record))
    print entity

java.lang.RuntimeException: Failed to serialize json to table row:

Actually I am trying to store a JSON string in a BigQuery table, so when I tried to publish the simple JSON below in the publish message box, records were not inserted in the BigQuery table. I checked the logs, and they give the above error.

{
"childName":"Aijaz_Google555",
"present":"Gift_Google555",
"JsonObject":[{"Actions":"test actions","CreatedBy": "test created by", "CreatedTimestamp": "test","Extended": "test extended"}]
}

Child Name, Present, and Json object are 3 string-type columns; in the 3rd column, JsonObject, I want to store a JSON object as a string. Kindly help.

Also, from C# code, I am trying to build a string like below:

string content = "{ "childName":"Aijaz_Google666","present":"Gift_Google666","JsonObject":"{"Actions": "test actions","CreatedBy": "test created by", "CreatedTimestamp": "2015 - 10 - 28T10: 15:30(ISO Date Time Format)","Extended": "test extended"}"}";

Thanks in advance.

BQ to GCS Avro or Parquet

Any plans to support this? I wanted to be able to have an incremental unload from BQ to GCS to use with Apache Spark.

Rename repository

Since Dataflow supports both Java and Python, it's misleading to name it as just DataflowTemplates and only keep Java examples in there.

I suggest that it should use the preferred nomenclature of DataflowTemplates-Java the same way google-cloud SDK does.

I can help add some templates for python. Here's a sample directory which I have been keeping some samples in: https://github.com/VikramTiwari/dataflow-samples

PS: I know, even I am guilty of not using proper nomenclature, but I am not Google :)

PubSub to BQ deadletter table

Seeing the following error, which seems to be an uncaught exception that I believe is skipping the deadletter table:

java.lang.RuntimeException: java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"location.latitude","message":"Cannot convert value to floating point (bad value):","reason":"invalid"}],"index":0}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":1}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":2}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":3}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":4}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":5}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":6}, {"errors":
...[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":209}]
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:142)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:103)
Caused by: java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"location.latitude","message":"Cannot convert value to floating point (bad value):","reason":"invalid"}],"index":0}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":1}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":2}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":3}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":4}, {"errors": ...

ZIP compression in Bulk Compress Cloud Storage Files template

I would like to use the ZIP compression type in the Bulk Compress Cloud Storage Files template, which the documentation states is a viable option.

However, it is not available as an option when running the template through the GCP Dataflow console.

Are there plans to make this an option? If not, the documentation should be updated.

Thank you!

GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger

I am new to running pipeline jobs on Google Cloud and I am running into this issue with the Pub/Sub to Datastore template. 'mvn clean && mvn compile' worked, but the command to create the template fails.
--Command
mvn compile exec:java \
  -Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToDatastore \
  -Dexec.cleanupDaemonThreads=false \
  -Dexec.args=" \
  --project=PROJECT_ID \
  --pubsubReadTopic=projects/PROJECT_ID/topics/topic \
  --javascriptTextTransformGcsPath=gs://PROJECT_ID/*.js. \
  --javascriptTextTransformFunctionName=transform \
  --stagingLocation=gs://PROJECT_ID/staging \
  --tempLocation=gs://PROJECT_ID/temp \
  --templateLocation=gs://<PROJECT_ID>/templates/PubSub_to_Datastore.json \
  --runner=DataflowRunner"

-- javascript
/**
 * A transform which adds a field to the incoming data.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(line) {
  var values = line.split(',');

  var obj = new Object();
  obj._description = values[0];
  obj._east = values[1];
  obj._last_updt = values[2];
  obj._north = values[3];
  obj._region_id = values[4];
  obj._south = values[5];
  obj._west = values[6];
  obj.current_speed = values[7];
  obj.region = values[8];
  var jsonString = JSON.stringify(obj);

  return jsonString;
}

-- Datastore Schema

{
  "Datastore Schema": [
    { "name": "_description", "type": "STRING" },
    { "name": "_east", "type": "FLOAT" },
    { "name": "_last_updt", "type": "TIMESTAMP" },
    { "name": "_north", "type": "FLOAT" },
    { "name": "_region_id", "type": "INTEGER" },
    { "name": "_south", "type": "FLOAT" },
    { "name": "_west", "type": "FLOAT" },
    { "name": "current_speed", "type": "FLOAT" },
    { "name": "region", "type": "STRING" }
  ]
}
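As general context (not a fix for this specific command): the error in the title usually means an unbounded source feeds a GroupByKey in the global window with no trigger. A minimal, hedged sketch of the common workaround, windowing before the grouping step (the class and names are illustrative):

import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowingSketch {
  /** Windows an unbounded collection so downstream GroupByKey/Combine steps are allowed. */
  public static PCollection<String> windowForGrouping(PCollection<String> messages) {
    return messages.apply(
        "WindowBeforeGroupByKey",
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));
  }
}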

Add format transformation for json-like input for BigQueryConverters

Hi there,

Got the issue below (missing double quote at the beginning of a field name) when creating a Kafka → Pub/Sub → BigQuery pipeline using the Pub/Sub to BigQuery template.

Failed to serialize json to table row: {SPEED=71.2, FREEWAY_ID=163, TIMESTAMP=2008-11-01 00:00:00, LONGITUDE=-117.155519, FREEWAY_DIR=S, LATITUDE=32.749679, LANE=3}
.....
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('S' (code 83)): was expecting double-quote to start field name
 at [Source: {SPEED=71.2, FREEWAY_ID=163, TIMESTAMP=2008-11-01 00:00:00, LONGITUDE=-117.155519, FREEWAY_DIR=S, LATITUDE=32.749679, LANE=3}; line: 1, column: 3]

I think the error comes from the code below, which cannot deal with the missing double quotes or other non-standard, JSON-like formats (such as the "key=value" format) produced by the upstream pipeline. I am wondering if you could add something similar to https://www.mkyong.com/java/jackson-was-expecting-double-quote-to-start-field-name/ to the source code for our case: input format validation, transformation and then exception handling. When an upstream system is supposed to send standard JSON but feeds non-standard, JSON-like payloads into Pub/Sub and then into this Dataflow template, it would be better to have this handled.

https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/common/BigQueryConverters.java

/**
  * Converts a JSON string to a {@link TableRow} object. If the data fails to convert, a {@link
  * RuntimeException} will be thrown.
  *
  * @param json The JSON string to parse.
  * @return The parsed {@link TableRow} object.
  */
 private static TableRow convertJsonToTableRow(String json) {
   TableRow row;
   // Parse the JSON into a {@link TableRow} object.
   try (InputStream inputStream =
       new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8))) {
     row = TableRowJsonCoder.of().decode(inputStream, Context.OUTER);

   } catch (IOException e) {
     throw new RuntimeException("Failed to serialize json to table row: " + json, e);
   }

   return row;
 }
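As an illustration only, a lenient fallback along the lines requested could pre-parse key=value style payloads before (or instead of) the strict Jackson decode. A hedged sketch, not part of the template, with all names hypothetical:

import com.google.api.services.bigquery.model.TableRow;

public class LenientPayloadSketch {
  /**
   * Hypothetical lenient fallback for map-toString style payloads such as
   * {SPEED=71.2, FREEWAY_ID=163, LANE=3}. It splits on commas and '=' only, so it
   * does not handle values that themselves contain commas or braces.
   */
  public static TableRow parseKeyEqualsValue(String payload) {
    TableRow row = new TableRow();
    String body = payload.trim();
    if (body.startsWith("{") && body.endsWith("}")) {
      body = body.substring(1, body.length() - 1);
    }
    for (String pair : body.split(",")) {
      int idx = pair.indexOf('=');
      if (idx > 0) {
        row.set(pair.substring(0, idx).trim(), pair.substring(idx + 1).trim());
      }
    }
    return row;
  }
}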

Thank you!

Specify project id for spanner to avro export

I'd like to run the "Cloud Spanner to Cloud Storage Avro" template in a different project than the one my Spanner instance lives in.
Currently I cannot specify the project in the template:

SpannerConfig spannerConfig =
    SpannerConfig.create()
        .withHost(options.getSpannerHost())
        .withInstanceId(options.getInstanceId())
        .withDatabaseId(options.getDatabaseId());

It is however possible in the spanner to text export:
SpannerConfig spannerConfig =
    SpannerConfig.create()
        .withProjectId(options.getSpannerProjectId())
        .withInstanceId(options.getSpannerInstanceId())
        .withDatabaseId(options.getSpannerDatabaseId());

Is this something that can be added?

Add support for importing ARRAY, BYTES and STRUCT types into Spanner

A comment in https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/spanner/TextImportPipeline.java mentions the following:

NOTE: BYTES, ARRAY, STRUCT types are not supported.

It is impossible to import tables containing any of these types; the following exception is thrown.

"Unrecognized or unsupported column data type: ARRAY<STRING(20)>"

Can we please add support for these three column types?

SpannerIO: Support read write transactions

Is it possible to perform a read-write transaction in the Spanner connector for Dataflow/Beam?
I have a use case which is currently implemented in a Java App Engine flex app, but I would like to see if I can do it in Dataflow.

I have gone through grouped mutations, but I am not sure whether I can perform reads inside one.
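SpannerIO itself only groups and writes mutations; if a true read-write transaction is required, one option is to call the Cloud Spanner client library directly from a DoFn. A hedged sketch only, with the project/instance/table names purely illustrative and no claim that this is the recommended pattern:

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.spanner.TransactionRunner.TransactionCallable;
import org.apache.beam.sdk.transforms.DoFn;

/** Hypothetical DoFn running a read-write transaction with the Spanner client library. */
public class ReadWriteTxnFn extends DoFn<String, Void> {
  private transient Spanner spanner;
  private transient DatabaseClient client;

  @Setup
  public void setup() {
    spanner = SpannerOptions.newBuilder().setProjectId("my-project").build().getService();
    client = spanner.getDatabaseClient(DatabaseId.of("my-project", "my-instance", "my-db"));
  }

  @ProcessElement
  public void processElement(@Element String key) {
    client
        .readWriteTransaction()
        .run(
            (TransactionCallable<Void>)
                txn -> {
                  // Reads and buffered mutations inside this callable share one transaction.
                  txn.buffer(
                      Mutation.newInsertOrUpdateBuilder("MyTable").set("Id").to(key).build());
                  return null;
                });
  }

  @Teardown
  public void teardown() {
    if (spanner != null) {
      spanner.close();
    }
  }
}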

PubSub To PubSub

Hi there,

Got a situation with the Pub/Sub to Pub/Sub Dataflow template.
I followed the documentation: https://cloud.google.com/dataflow/docs/guides/templates/provided-templates#cloudpubsubtocloudpubsub

I created two topics.
For one of them I created a subscription to pass as the inputSubscription parameter.

For some reason it doesn't work: the Dataflow job appears to read no messages, but when I execute the following command in the terminal it returns messages from the topic:

gcloud pubsub subscriptions pull projects/test-project/subscriptions/testtopic --limit 100 --format="json"

Pub/Sub to BigQuery Errors PayloadString is not valid JSON

The errors from the Cloud Pub/Sub Subscription to BigQuery template aren't saved as valid JSON in the PayloadString column and I am unable to replay them.

It looks like this:
{event={userId=1234, sessionEvent={sessionId=DSFG, ua=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36, browser={name=Chrome, version=74.0.3729.131, major=74}}}

If this is intended, how do I convert it and push it back to Pub/Sub?

Support dumping multiple Spanner databases to Avro

To be able to use Cloud Scheduler effectively with the Spanner->Avro template, it would be ideal if https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/spanner/ExportPipeline.java allowed specifying multiple Database IDs (instead of a single one, as happens currently)

The current template already creates a subdirectory for the exported database in the GCS output directory: if multiple databases were specified multiple subdirectories would be created, one for each database.

As an extension, it would even be useful to make the Database ID optional, in which case the pipeline would enumerate the databases in the specified Spanner instance and export all of them.

The goal is to be able to trigger an export of one, multiple or all databases on a spanner instance from a cloud scheduler job.
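For the "export all databases" variant, the enumeration piece could look roughly like this, using the Cloud Spanner admin client (a sketch only; the class name is hypothetical and how the IDs feed the export pipeline is left out):

import com.google.cloud.spanner.Database;
import com.google.cloud.spanner.DatabaseAdminClient;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import java.util.ArrayList;
import java.util.List;

public class ListDatabasesSketch {
  /** Returns all database IDs on an instance so each one could be exported in turn. */
  public static List<String> listDatabaseIds(String projectId, String instanceId) {
    Spanner spanner = SpannerOptions.newBuilder().setProjectId(projectId).build().getService();
    try {
      DatabaseAdminClient admin = spanner.getDatabaseAdminClient();
      List<String> ids = new ArrayList<>();
      for (Database database : admin.listDatabases(instanceId).iterateAll()) {
        ids.add(database.getId().getDatabase());
      }
      return ids;
    } finally {
      spanner.close();
    }
  }
}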

PubSub to BigQuery Javascript UDF destroys attributes

Hi,
We're using the Pub/Sub Subscription to BigQuery template. We have data in both the PubsubMessage attributes and the body. Our body contains an array without a field name, i.e.

[
 {"id": "item1"},
 {"id": "item2"}
]

The template had issues parsing this, so we added a simple UDF:

function process(str) {
    var arrayOfItems = JSON.parse(str);
    var outObject = {items: arrayOfItems};
    return JSON.stringify(outObject);
}

When this template runs it seems like the attributes are discarded after the UDF step.

I'm not that well versed in Beam, but it seems that when the InvokeUDF step is built, it discards everything but the message payload:

 PCollectionTuple udfOut =
          input
              // Map the incoming messages into FailsafeElements so we can recover from failures
              // across multiple transforms.
              .apply("MapToRecord", ParDo.of(new PubsubMessageToFailsafeElementFn()))
              .apply(
                  "InvokeUDF",
                  FailsafeJavascriptUdf.<PubsubMessage>newBuilder()
                      .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                      .setFunctionName(options.getJavascriptTextTransformFunctionName())
                      .setSuccessTag(UDF_OUT)
                      .setFailureTag(UDF_DEADLETTER_OUT)
                      .build());

The PubsubMessageToFailsafeElementFn looks like this

 static class PubsubMessageToFailsafeElementFn
      extends DoFn<PubsubMessage, FailsafeElement<PubsubMessage, String>> {
    @ProcessElement
    public void processElement(ProcessContext context) {
      PubsubMessage message = context.element();
      context.output(
          FailsafeElement.of(message, new String(message.getPayload(), StandardCharsets.UTF_8)));
    }
  }

It calls message.getPayload(), which is probably what causes the issue.

So my question is: am I doing something wrong? Is there some way of getting both the attributes and the payload through the UDF, or do I have to modify the Java template?
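Not an official answer, but one way to keep both is to merge the attributes into the JSON handed to the UDF before the InvokeUDF step. A hedged sketch (the class name is hypothetical, the FailsafeElement package is assumed from this repo, and the UDF plus the downstream mapping would need to understand the wrapped shape):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import com.google.cloud.teleport.values.FailsafeElement;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;

/** Hypothetical variant that keeps attributes visible to the UDF by merging them into the payload. */
public class PubsubMessageWithAttributesToFailsafeElementFn
    extends DoFn<PubsubMessage, FailsafeElement<PubsubMessage, String>> {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  @ProcessElement
  public void processElement(ProcessContext context) throws Exception {
    PubsubMessage message = context.element();
    // Wrap the payload and the attributes into one JSON document so both survive the UDF step.
    ObjectNode wrapper = MAPPER.createObjectNode();
    wrapper.set(
        "payload", MAPPER.readTree(new String(message.getPayload(), StandardCharsets.UTF_8)));
    wrapper.set("attributes", MAPPER.valueToTree(message.getAttributeMap()));
    context.output(FailsafeElement.of(message, MAPPER.writeValueAsString(wrapper)));
  }
}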

Thanks in advance!

Cannot rebuild template as it is

I pulled the repo and followed the instructions to build the TextToBigQueryStreaming template as-is, but I got the following error:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 45.236 s
[INFO] Finished at: 2019-08-28T10:30:13-04:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project google-cloud-teleport-java: An exception occured while executing the Java class. Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.repackaged.core.net.bytebuddy.mirror.AccessibleObject -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project google-cloud-teleport-java: An exception occured while executing the Java class. Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.repackaged.core.net.bytebuddy.mirror.AccessibleObject
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215) at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156) at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81) at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128) at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305) at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192) at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105) at org.apache.maven.cli.MavenCli.execute (MavenCli.java:956) at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:288) at org.apache.maven.cli.MavenCli.main (MavenCli.java:192) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke (Method.java:567) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282) at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406) at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: org.apache.maven.plugin.MojoExecutionException: An exception occured while executing the Java class. Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.repackaged.core.net.bytebuddy.mirror.AccessibleObject
    at org.codehaus.mojo.exec.ExecJavaMojo.execute (ExecJavaMojo.java:339) at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137) at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210) at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156) at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81) at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128) at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305) at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192) at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105) at org.apache.maven.cli.MavenCli.execute (MavenCli.java:956) at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:288) at org.apache.maven.cli.MavenCli.main (MavenCli.java:192) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke (Method.java:567) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282) at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406) at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: java.lang.UnsupportedOperationException: Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.repackaged.core.net.bytebuddy.mirror.AccessibleObject
    at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection$Dispatcher$Initializable$Unavailable.defineClass (ClassInjector.java:410) at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection.injectRaw (ClassInjector.java:235) at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassInjector$AbstractBase.inject (ClassInjector.java:111) at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default$InjectionDispatcher.load (ClassLoadingStrategy.java:232) at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default.load (ClassLoadingStrategy.java:143) at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.TypeResolutionStrategy$Passive.initialize (TypeResolutionStrategy.java:100) at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.DynamicType$Default$Unloaded.load (DynamicType.java:5623) at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.generateInvokerClass (ByteBuddyDoFnInvokerFactory.java:351) at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.getByteBuddyInvokerConstructor (ByteBuddyDoFnInvokerFactory.java:247) at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.newByteBuddyInvoker (ByteBuddyDoFnInvokerFactory.java:220) at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.newByteBuddyInvoker (ByteBuddyDoFnInvokerFactory.java:151) at org.apache.beam.sdk.transforms.reflect.DoFnInvokers.invokerFor (DoFnInvokers.java:35) at org.apache.beam.runners.core.construction.SplittableParDo.expand (SplittableParDo.java:170) at org.apache.beam.runners.core.construction.SplittableParDo.expand (SplittableParDo.java:87) at org.apache.beam.sdk.Pipeline.applyReplacement (Pipeline.java:564) at org.apache.beam.sdk.Pipeline.replace (Pipeline.java:290) at org.apache.beam.sdk.Pipeline.replaceAll (Pipeline.java:208) at org.apache.beam.runners.dataflow.DataflowRunner.replaceTransforms (DataflowRunner.java:995) at org.apache.beam.runners.dataflow.DataflowRunner.run (DataflowRunner.java:712) at org.apache.beam.runners.dataflow.DataflowRunner.run (DataflowRunner.java:179) at org.apache.beam.sdk.Pipeline.run (Pipeline.java:313) at org.apache.beam.sdk.Pipeline.run (Pipeline.java:299) at com.google.cloud.teleport.templates.TextToBigQueryStreaming.run (TextToBigQueryStreaming.java:255) at com.google.cloud.teleport.templates.TextToBigQueryStreaming.main (TextToBigQueryStreaming.java:136) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke (Method.java:567) at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282) at java.lang.Thread.run (Thread.java:835)
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

The command I ran was:
mvn -X compile exec:java \
  -Dexec.mainClass=com.google.cloud.teleport.templates.TextToBigQueryStreaming \
  -Dexec.cleanupDaemonThreads=false \
  -Dexec.args=" \
  --project=[project] \
  --stagingLocation=gs://[bucket]/staging \
  --tempLocation=gs://[bucket]/temp \
  --templateLocation=gs://[bucket]/templates/text_to_bq_streaming.json \
  --runner=DataflowRunner"

The output of 'mvn --version' is
Apache Maven 3.6.1 (d66c9c0b3152b2e69ee9bac180bb8fcc8e6af555; 2019-04-04T15:00:29-04:00)
Maven home: [HOME]/apache/maven/apache-maven-3.6.1
Java version: 12.0.2, vendor: Oracle Corporation, runtime: [HOME]/jdk/jdk-12.0.2
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.19.37-5+deb10u1rodete2-amd64", arch: "amd64", family: "unix"

AutoValue_DynamicJdbcIO_DynamicRead.Builder

Hi, I have configured Google Cloud Tools for Eclipse and am facing this issue after cloning DataflowTemplates. I have added all of these dependencies: auto-service-1.0-rc1.jar, guava-16.0.1.jar, jsr-305-2.0.3.jar and auto-value-1.0-rc1.jar, but could not resolve the issue: AutoValue_DynamicJdbcIO_DynamicRead.Builder could not be resolved to a type.

BulkDecompressor -- Error writing failures CSV file

When BulkDecompressor tries to write an error CSV file, I see this stack trace:

Caused by: java.lang.IllegalArgumentException: No quotes mode set but no escape character is set
	at org.apache.commons.csv.CSVFormat.validate(CSVFormat.java:1397)
	at org.apache.commons.csv.CSVFormat.<init>(CSVFormat.java:647)
	at org.apache.commons.csv.CSVFormat.withQuoteMode(CSVFormat.java:1832)
	at com.google.cloud.teleport.templates.BulkDecompressor.lambda$run$9962e4b6$1(BulkDecompressor.java:234)
	at org.apache.beam.sdk.transforms.Contextful.lambda$fn$36334a93$1(Contextful.java:112)
	at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:123)

From reading the CSVFormat code, it looks like you need to call withEscape() before you call withQuoteMode(). I will have a pull request to fix this shortly.
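For reference, a minimal sketch of the ordering fix described above, using the commons-csv API (the wrapper class and method name are illustrative):

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.QuoteMode;

public class CsvFormatSketch {
  /** Builds a CSVFormat where the escape character is set before QuoteMode.NONE is applied. */
  public static CSVFormat failureFileFormat() {
    return CSVFormat.DEFAULT
        .withEscape('\\')               // must be set first...
        .withQuoteMode(QuoteMode.NONE); // ...so validate() accepts QuoteMode.NONE
  }
}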

Getting Exception when Creating New Template

Hi,

I am trying to create a new template (specifically, modifying one of the existing templates into a new one). But when I run the Dataflow job, I get an exception that I cannot trace, because none of the frames point to files in this repo. Could you give me advice on tracing and solving this issue?

fyi, the template that I am trying to modify is KafkaToBigQuery.java.

java.lang.NullPointerException
        org.apache.beam.vendor.guava.v20_0.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:770)
        org.apache.beam.sdk.util.WindowedValue$TimestampedWindowedValue.<init>(WindowedValue.java:263)
        org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow.<init>(WindowedValue.java:278)
        org.apache.beam.sdk.util.WindowedValue.timestampedValueInGlobalWindow(WindowedValue.java:117)
        org.apache.beam.runners.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.getCurrent(WorkerCustomSources.java:827)
        org.apache.beam.runners.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.getCurrent(WorkerCustomSources.java:759)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.getCurrent(ReadOperation.java:394)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
        org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
        org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1287)
        org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:149)
        org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:1024)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        java.lang.Thread.run(Thread.java:745)

Thanks in advance

Unable to dump unbounded PubSub content to gs:// bucket

Hi there,

I'm completely new to Apache Beam and its programming model is quite surprising. While trying to put together a Parquet writer that reads from Pub/Sub, I can't wrap my head around the following...

I am taking https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToAvro.java as the base template.

It seems that AvroIO supports windowed writes to paths such as gs://my-bucket/YYYY/MM/DD, with the 'YYYY'-style variables filled in automatically at runtime by the AvroIO handler.

Is there any way to achieve this using ParquetIO? The only bits of Parquet I've seen are the following ones, but none of them write by date...

I tried a first approach and, even though the code compiles and processes one event after another, I can't get it to run locally with the DirectRunner against a bucket. The code I've got so far is the following.
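For anyone landing here, a hedged sketch of one way to write windowed Parquet output from an unbounded source with FileIO plus ParquetIO (the class name, window size and paths are illustrative, and this does not reproduce AvroIO's date-templated paths):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ParquetWindowedWriteSketch {
  /** Windows an unbounded stream of GenericRecords and writes each window as Parquet files. */
  public static void writeWindowedParquet(
      PCollection<GenericRecord> records, Schema schema, String outputDir) {
    records
        .apply(
            "FixedWindows",
            Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply(
            "WriteParquet",
            FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(schema))
                .to(outputDir)        // e.g. gs://my-bucket/output/
                .withNumShards(1)     // unbounded input needs an explicit shard count
                .withSuffix(".parquet"));
    // Date-based paths like gs://bucket/YYYY/MM/DD would additionally need a
    // custom FileIO.Write.FileNaming (or writeDynamic) keyed on the window.
  }
}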

SOLVED
