
pulsar-flink

The Pulsar Flink connector provides elastic data processing with Apache Pulsar and Apache Flink. This repository is a fork of streamnative/pulsar-flink.

Prerequisites

  • Java 8 or later
  • Flink 1.9.0 or later
  • Pulsar 2.4.0 or later

Preparations

Link

Client library

For Scala/Java applications using SBT/Maven project definitions, link your application with the following artifact:

    groupId = io.streamnative.connectors
    artifactId = pulsar-flink-connector_{{SCALA_BINARY_VERSION}}
    version = {{PULSAR_FLINK_VERSION}}
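For example, in a Maven pom.xml this corresponds to a dependency entry like the following, keeping the same placeholders for the Scala binary version and connector version:

  <dependency>
    <groupId>io.streamnative.connectors</groupId>
    <artifactId>pulsar-flink-connector_{{SCALA_BINARY_VERSION}}</artifactId>
    <version>{{PULSAR_FLINK_VERSION}}</version>
  </dependency>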

Currently, the artifact is available in the Bintray Maven repository of StreamNative. For a Maven project, you can add the repository to your pom.xml as follows:

  <repositories>
    <repository>
      <id>central</id>
      <layout>default</layout>
      <url>https://repo1.maven.org/maven2</url>
    </repository>
    <repository>
      <id>bintray-streamnative-maven</id>
      <name>bintray</name>
      <url>https://dl.bintray.com/streamnative/maven</url>
    </repository>
  </repositories>
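The sbt equivalent is a resolver plus a dependency. A sketch using the same placeholders (the %% operator appends the Scala binary version to the artifact name):

resolvers += "bintray-streamnative-maven" at "https://dl.bintray.com/streamnative/maven"

libraryDependencies += "io.streamnative.connectors" %% "pulsar-flink-connector" % "{{PULSAR_FLINK_VERSION}}"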

To build an application JAR that contains all dependencies required by your libraries and the Pulsar Flink connector, you can use the following shade plugin definition template:

<plugin>
  <!-- Shade all the dependencies to avoid conflicts -->
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>${maven-shade-plugin.version}</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <createDependencyReducedPom>true</createDependencyReducedPom>
        <promoteTransitiveDependencies>true</promoteTransitiveDependencies>
        <minimizeJar>false</minimizeJar>

        <artifactSet>
          <includes>
            <include>io.streamnative.connectors:*</include>
            <!-- more libs to include here -->
          </includes>
        </artifactSet>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
          <transformer implementation="org.apache.maven.plugins.shade.resource.PluginXmlResourceTransformer" />
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

Deploy

Client library

As with any Flink application, ./bin/flink run is used to compile and launch your application.
If you have already built a fat JAR using the shade plugin above, your JAR can be added to flink run using --classpath.

Note

The path must include a protocol (for example, file://), and it should be accessible on all nodes.

Example

$ ./bin/flink run \
  -c com.example.entry.point.ClassName file://path/to/jars/your_fat_jar.jar \
  ...

Scala REPL

For experimenting in the interactive Scala shell (bin/start-scala-shell.sh), you can use --addclasspath to add pulsar-flink-connector_{{SCALA_BINARY_VERSION}}-{{PULSAR_FLINK_VERSION}}.jar directly.

Example

$ ./bin/start-scala-shell.sh remote <hostname> <portnumber> \
  --addclasspath pulsar-flink-connector_{{SCALA_BINARY_VERSION}}-{{PULSAR_FLINK_VERSION}}.jar

For more information about submitting applications with the CLI, see Command-Line Interface.

Usage

Read data from Pulsar

Create a Pulsar source for streaming queries

The following examples are in Scala.

import java.util.Properties

// Imports assume the connector's default package layout.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.pulsar.FlinkPulsarSource

val env = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("service.url", "pulsar://...")
props.setProperty("admin.url", "http://...")
props.setProperty("partitionDiscoveryIntervalMillis", "5000")
props.setProperty("startingOffsets", "earliest")
props.setProperty("topic", "test-source-topic")
val source = new FlinkPulsarSource(props)
// no need to provide type information to addSource since FlinkPulsarSource is ResultTypeQueryable
val dataStream = env.addSource(source)(null)

// chain operations on dataStream of Row and sink the output
// end method chaining

env.execute()

Register topics in Pulsar as streaming tables

The following examples are in Scala.

import java.util.Properties

// Imports assume the connector's default package layout.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.table.descriptors.Pulsar

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)
val props = new Properties()
props.setProperty("service.url", "pulsar://...")
props.setProperty("admin.url", "http://...")
props.setProperty("partitionDiscoveryIntervalMillis", "5000")
props.setProperty("startingOffsets", "earliest")
props.setProperty("topic", "test-source-topic")
tEnv
  .connect(new Pulsar().properties(props))
  .inAppendMode()
  .registerTableSource("pulsar-test-table")

The following options must be set for the Pulsar source.

| Option | Value | Description |
|--------|-------|-------------|
| `topic` | A topic name string | The topic to be consumed. Only one of `topic`, `topics`, or `topicsPattern` can be specified for the Pulsar source. |
| `topics` | A comma-separated list of topics | The topic list to be consumed. Only one of `topic`, `topics`, or `topicsPattern` can be specified for the Pulsar source. |
| `topicsPattern` | A Java regex string | The pattern used to subscribe to topic(s). Only one of `topic`, `topics`, or `topicsPattern` can be specified for the Pulsar source. |
| `service.url` | A service URL of your Pulsar cluster | The Pulsar `serviceUrl` configuration. |
| `admin.url` | A service HTTP URL of your Pulsar cluster | The Pulsar `serviceHttpUrl` configuration. |
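For instance, to consume several topics at once, set topics instead of topic (the topic names and pattern below are illustrative):

props.setProperty("topics", "topic-a,topic-b")
// or subscribe by regex; only one of the three topic options may be set:
// props.setProperty("topicsPattern", "topic-.*")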

The following configurations are optional.

`startingOffsets`

  • Value, one of the following:
      • "earliest" (streaming and batch queries)
      • "latest" (streaming query)
      • A JSON string, for example:

        """ {"topic-1":[8,11,16,101,24,1,32,1],"topic-5":[8,15,16,105,24,5,32,5]} """
  • Default: "latest"
  • Description: the startingOffsets option controls where a consumer starts reading data.
      • "earliest": when there is no valid offset, the consumer reads all the data in the partition, starting from the very beginning.
      • "latest": when there is no valid offset, the consumer reads only records written after the consumer starts running.
      • A JSON string: specifies a starting offset for each topic. You can use org.apache.flink.pulsar.JsonUtils.topicOffsets(Map[String, MessageId]) to convert message offsets to a JSON string.

    Note: "latest" only applies when a new query is started; resuming always picks up from where the query left off. Partitions newly discovered during a query start at "earliest".

`partitionDiscoveryIntervalMillis`

  • Value: a long value, or a string that can be converted to long
  • Default: -1
  • Description: the partitionDiscoveryIntervalMillis option controls whether the source discovers newly added topics or partitions that match the topic options while the streaming job is running. A positive value l triggers discovery every l milliseconds; a negative value turns topic and partition discovery off.
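As a minimal sketch of the JSON form, the JsonUtils helper mentioned above can build the startingOffsets value from per-topic MessageIds (topic names are illustrative, and the helper is assumed to return the JSON string directly):

import org.apache.pulsar.client.api.MessageId
import org.apache.flink.pulsar.JsonUtils

// Start topic-1 from the earliest message and topic-5 from the latest.
val offsets = JsonUtils.topicOffsets(Map[String, MessageId](
  "topic-1" -> MessageId.earliest,
  "topic-5" -> MessageId.latest))
props.setProperty("startingOffsets", offsets)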

Schema of Pulsar source

  • For topics without schema or with a primitive schema in Pulsar, the message payload is loaded to a `value` column with the type corresponding to the Pulsar schema.

  • For topics with Avro or JSON schema, their field names and field types are kept in the result rows.

In addition, each row in the source has the following metadata fields.

| Column | Type |
|--------|------|
| `__key` | Bytes |
| `__topic` | String |
| `__messageId` | Bytes |
| `__publishTime` | Timestamp |
| `__eventTime` | Timestamp |

Example

A Pulsar topic with the Avro schema s (example 1), when converted to a Flink table, has the following schema (example 2).

Example 1

  import scala.beans.BeanProperty
  import org.apache.pulsar.client.api.Schema

  case class Foo(@BeanProperty i: Int, @BeanProperty f: Float, @BeanProperty bar: Bar)
  case class Bar(@BeanProperty b: Boolean, @BeanProperty s: String)
  val s = Schema.AVRO(classOf[Foo]) // classOf[Foo], not Foo.getClass (the companion object's class)

Example 2

root
 |-- i: INT
 |-- f: FLOAT
 |-- bar: ROW<`b` BOOLEAN, `s` STRING>
 |-- __key: BYTES
 |-- __topic: STRING
 |-- __messageId: BYTES
 |-- __publishTime: TIMESTAMP(3)
 |-- __eventTime: TIMESTAMP(3)

The following is the schema of a Pulsar topic with Schema.DOUBLE:

root
|-- value: DOUBLE
|-- __key: BYTES
|-- __topic: STRING
|-- __messageId: BYTES
|-- __publishTime: TIMESTAMP(3)
|-- __eventTime: TIMESTAMP(3)
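Metadata columns can be queried like any other column once the topic is registered as a table. A minimal sketch, reusing the `pulsar-test-table` registration from earlier and assuming that topic carries a primitive schema such as the DOUBLE example above:

val result = tEnv.sqlQuery(
  "SELECT `value`, `__topic`, `__publishTime` FROM `pulsar-test-table`")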

Write data to Pulsar

The DataStream written to Pulsar can have an arbitrary type.

For DataStream[Row], the __topic field identifies the topic each message is sent to, __key is encoded as metadata of the Pulsar message, and all other fields are grouped, encoded using Avro, and put in value():

producer.newMessage().key(__key).value(avro_encoded_fields)

For DataStream[T] where T is a POJO type, each record in the data stream is encoded using Avro and put in the Pulsar message's value(). Optionally, you can provide an extra topicKeyExtractor that identifies the topic and key for each record, as sketched below.
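As an illustration only, assuming the connector's TopicKeyExtractor exposes getTopic and serializeKey methods per record (an assumption about the interface shape; check the connector source for the exact signature), a routing extractor might look like this:

import java.nio.charset.StandardCharsets

// ASSUMPTION: the import path, method names, and signatures below are
// inferred from the sink example in this README, not confirmed.
import org.apache.flink.streaming.connectors.pulsar.TopicKeyExtractor

case class User(id: String, country: String)

class UserTopicKeyExtractor extends TopicKeyExtractor[User] {
  // Route each record to a per-country topic (illustrative naming).
  override def getTopic(user: User): String = s"users-${user.country}"
  // Use the user id as the Pulsar message key.
  override def serializeKey(user: User): Array[Byte] =
    user.id.getBytes(StandardCharsets.UTF_8)
}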

Create a Pulsar sink for streaming queries

The following examples are in Scala.

import java.util.Properties

// Imports assume the connector's default package layout.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.pulsar.FlinkPulsarSink

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream = .....

val prop = new Properties()
prop.setProperty("service.url", serviceUrl)
prop.setProperty("admin.url", adminUrl)
prop.setProperty("flushOnCheckpoint", "true")
prop.setProperty("failOnWrite", "true")
prop.setProperty("topic", "test-sink-topic")

stream.addSink(new FlinkPulsarSink(prop, DummyTopicKeyExtractor))
env.execute()

Write a streaming table to Pulsar

The following examples are in Scala.

import java.util.Properties

// Imports assume the connector's default package layout.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.table.descriptors.Pulsar

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

val prop = new Properties()
prop.setProperty("service.url", serviceUrl)
prop.setProperty("admin.url", adminUrl)
prop.setProperty("flushOnCheckpoint", "true")
prop.setProperty("failOnWrite", "true")
prop.setProperty("topic", "test-sink-topic")

tEnv
  .connect(new Pulsar().properties(prop))
  .inAppendMode()
  .registerTableSink("sink-table")

val sql = "INSERT INTO `sink-table` ....."
tEnv.sqlUpdate(sql)
env.execute()

The following options must be set for a Pulsar sink.

| Option | Value | Description |
|--------|-------|-------------|
| `service.url` | A service URL of your Pulsar cluster | The Pulsar `serviceUrl` configuration. |
| `admin.url` | A service HTTP URL of your Pulsar cluster | The Pulsar `serviceHttpUrl` configuration. |

The following configurations are optional.

`topic`

  • Value: a topic name string
  • Default: None
  • Description: the topic to write to. If this option is not set, DataStreams or tables written to Pulsar must contain a TopicKeyExtractor that returns non-null topics, or a `__topic` field.

`flushOnCheckpoint`

  • Value: whether to flush all records written so far at each checkpoint and wait for confirmations
  • Default: true
  • Description: at-least-once semantics are achieved when flushOnCheckpoint is set to true and checkpointing is enabled on the execution environment. Otherwise, you get no write guarantee.

`failOnWrite`

  • Value: whether to fail the sink when sending records to Pulsar fails
  • Default: false
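A minimal sketch of the at-least-once setup described above (the checkpoint interval is illustrative):

import java.util.Properties
import org.apache.flink.streaming.api.scala._

// Enable checkpointing; together with flushOnCheckpoint=true this
// gives at-least-once delivery to Pulsar.
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(5000) // checkpoint every 5 seconds

val prop = new Properties()
prop.setProperty("flushOnCheckpoint", "true")
prop.setProperty("failOnWrite", "true")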

Limitations

Currently, we provide at-least-once semantics when flushOnCheckpoint is set to true. Consequently, when writing streams to Pulsar, some records may be duplicated. We will provide exactly-once sink semantics once Pulsar supports transactions.

Pulsar specific configurations

Client, producer, and consumer configurations of Pulsar can be set in properties with the pulsar.client., pulsar.producer., and pulsar.consumer. prefixes, respectively.

Example

prop.setProperty("pulsar.consumer.ackTimeoutMillis", "10000")

For possible Pulsar parameters, see Pulsar client libraries.

Build Pulsar Flink Connector

If you want to build the Pulsar Flink connector, which reads data from Pulsar and writes results to Pulsar, follow the steps below.

  1. Check out the source code.

    $ git clone https://github.com/streamnative/pulsar-flink.git
    $ cd pulsar-flink
  2. Install Docker.

    The Pulsar Flink connector uses Testcontainers for integration tests. To run the integration tests, make sure you have installed Docker.

  3. Set a Scala version.

    Change scala.version and scala.binary.version in pom.xml.

    Note

    The Scala version should be consistent with the Scala version of the Flink distribution you use (see the sketch after this list for a command-line alternative to editing pom.xml).

  4. Build the project.

    $ mvn clean install -DskipTests
  5. Run the tests.

    $ mvn clean install

Once the installation is finished, a fat JAR is generated under both the local Maven repository and the target directory.
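For step 3, as an alternative to editing pom.xml, the two properties can likely be overridden on the Maven command line, since user-supplied -D properties take precedence over pom properties (the version numbers below are illustrative):

$ mvn clean install -DskipTests \
    -Dscala.version=2.12.10 -Dscala.binary.version=2.12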
