streamnative / pulsar-rs

Rust Client library for Apache Pulsar

pulsar-rs's Introduction

pulsar-rs: Future-based Rust client for Apache Pulsar

This is a pure Rust client for Apache Pulsar that does not depend on the C++ Pulsar library. It provides an async/await based API, compatible with Tokio and async-std.

Features:

URL based (pulsar:// and pulsar+ssl://) connections with DNS lookup;
Multi topic consumers (based on a regex or list);
TLS connection;
Configurable executor (Tokio or async-std);
Automatic reconnection with exponential backoff;
Message batching;
Compression with LZ4, zlib, zstd or Snappy (can be deactivated with Cargo features);
Telemetry using tracing crate (can be activated with Cargo features).

Getting Started

Add the following dependencies in your Cargo.toml:

futures = "0.3"
pulsar = "5.1"
tokio = "1.0"

Try out examples:

Project Maintainers

Contribution

This project welcomes your PR and issues. For example, refactoring, adding features, correcting English, etc.

Thanks to all the people who already contributed!

License

This library is licensed under the terms of both the MIT license and the Apache License (Version 2.0), and may include packages written by third parties which carry their own copyright notices and license terms.

See LICENSE-APACHE, LICENSE-MIT, and COPYRIGHT for details.

History

This project was originally created by @stearnsc and others at Wyyerd in 2018. Later, in 2022, the original creators decided to transfer the repository to StreamNative.

Currently, this project is actively maintained under the StreamNative organization by a diverse group of maintainers.

About StreamNative

Founded in 2019 by the original creators of Apache Pulsar, StreamNative is one of the leading contributors to the open-source Apache Pulsar project. We have helped engineering teams worldwide make the move to Pulsar with StreamNative Cloud, a fully managed service to help teams accelerate time-to-production.

pulsar-rs's Issues

changing the acknowledgement interface

while exploring #51 and #64, and looking at other clients (Java, Go, C++), I realized the current acknowledgement system cannot work properly.

If we want to close a consumer, it will have no impact on currently existing Ack instances, and if we use acknowledgement timeout, we might even try to trigger redelivering on a consumer that does not exist. Also, calling close_consumer() from the Drop implementation apparently does not work in all cases.

Proposal

Move acknowledgement methods to Consumer

Instead of having the ack(), cumulative_ack() and nack() methods on Ack, it would make more sense to put those methods on the consumer, as is done in clients for other languages. We would then write consumer.acknowledge(&message).await instead of calling the current Ack::ack() method, which is not async.

Send Ack and Nack directly from Consumer methods when possible

The current AckHandler implementation is complex, because it implements acknowledgement timeout (which is discouraged in the documentation). It could be simplified by sending acks directly from Consumer::ack(), and nacks directly from Consumer::nack() if we're not using the timeout. The AckHandler would only be used if we require acknowledgement timeout.

Properly closing the consumer

Async drop is a known problem, but we can have a solution here:

the consumer holds a oneshot::Sender<()>
we spawn an async block that holds a clone of the Arc<Connection> of the consumer, and awaits on the corresponding Receiver
once the consumer is dropped, the sender is dropped, the receiver returns Err(Canceled)
the async block will then call close_consumer()
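
The steps above can be sketched with the standard library only. This is a synchronous analogue, not the proposed implementation: a thread and std::sync::mpsc stand in for the async block and the oneshot channel, and Connection, Consumer and spawn_consumer are simplified, hypothetical stand-ins for the real types.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Stand-in for the real connection; it only records that the close command was issued.
#[derive(Default)]
struct Connection {
    consumer_closed: AtomicBool,
}

impl Connection {
    fn close_consumer(&self) {
        self.consumer_closed.store(true, Ordering::SeqCst);
    }
}

struct Consumer {
    // Never used to send anything: it exists only so that dropping the
    // consumer disconnects the channel.
    _drop_signal: Sender<()>,
}

/// Creates a consumer together with the watcher that closes it once it is dropped.
fn spawn_consumer(conn: Arc<Connection>) -> (Consumer, JoinHandle<()>) {
    let (tx, rx) = channel::<()>();
    let watcher = thread::spawn(move || {
        // recv() returns Err as soon as the only Sender has been dropped.
        while rx.recv().is_ok() {}
        conn.close_consumer();
    });
    (Consumer { _drop_signal: tx }, watcher)
}
```

The watcher keeps its own clone of the Arc<Connection>, so the connection stays alive long enough to send the close command even after the consumer is gone.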

ping @stearnsc any opinions on this?

support hostname as address

Hi,

Currently the client only accepts an IP address to reach Pulsar.

It would be great to support hostname URLs: resolve the hostname and pick a random IP, and if the client stops because that IP becomes unreachable, try switching to another one, and so on, for resiliency.
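
A minimal sketch of that idea, using only the standard library. The names resolve and connect_with_failover are hypothetical, the connection attempt is passed in as a closure, and the starting index is supplied by the caller because std has no random number generator.

```rust
use std::net::{SocketAddr, ToSocketAddrs};

/// Resolves "host:port" into every address the resolver returns.
fn resolve(host_and_port: &str) -> std::io::Result<Vec<SocketAddr>> {
    Ok(host_and_port.to_socket_addrs()?.collect())
}

/// Tries each address once, starting at `start` and wrapping around,
/// and returns the first connection that succeeds.
fn connect_with_failover<C, E, F>(addrs: &[SocketAddr], start: usize, mut try_connect: F) -> Option<C>
where
    F: FnMut(&SocketAddr) -> Result<C, E>,
{
    for offset in 0..addrs.len() {
        let addr = &addrs[(start + offset) % addrs.len()];
        if let Ok(conn) = try_connect(addr) {
            return Some(conn);
        }
    }
    None
}
```

A caller would pass a closure that opens the actual connection (for example with TcpStream::connect) and call the function again with a different start index after a disconnect.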

TLS support?

Is there a plan to upgrade to the new futures?

Hello, I'm just wondering whether there is a plan to upgrade to futures 0.3 and tokio 0.2?

Example for token-based authentication

Where can I find documentation or examples of how to use the authentication parameter? Trying to connect to my Pulsar instance using a token.

https://pulsar.apache.org/docs/en/security-jwt/

Message encryption support

https://pulsar.apache.org/docs/en/security-encryption/

message extraction does not handle errors

I've seen some cases like this one: https://github.com/wyyerd/pulsar-rs/blob/master/src/connection.rs#L233

where we're waiting for the success response but an error response can come. See both options for CommandSend here: https://github.com/wyyerd/pulsar-rs/blob/master/PulsarApi.proto#L877-L878

TLS Support

I need TLS support in order to connect to my deployed Pulsar server. I was able to get it working by modifying the pulsar-rs source to use tokio-native-tls in connection.rs but wanted to defer to the project maintainers as to the best next steps here.

Handling batched messages

The current implementation does not deal with batched messages. When pulsar sends a batch of messages, the message payload will contain a sequence of SingleMessageMetadata followed by the individual message payload, for each message.

Due to this, a naive implementation of DeserializeMessage will fail when receiving a batched message, as it has to read the SingleMessageMetadata before processing the payload.

This also requires #36 to be resolved, as the entire batch must be decompressed.
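
To illustrate why the metadata has to be read first, here is a toy sketch of splitting an already decompressed batch. It assumes each entry is framed as a 4-byte big-endian metadata length, the metadata, then the payload. The real metadata is a protobuf SingleMessageMetadata; here it is replaced by a hypothetical 4-byte big-endian payload size so the example needs no protobuf decoder. BatchEntry, read_u32 and split_batch are illustrative names.

```rust
/// One entry extracted from a batch.
#[derive(Debug, PartialEq)]
struct BatchEntry {
    metadata: Vec<u8>,
    payload: Vec<u8>,
}

/// Reads a big-endian u32 at `pos` and advances `pos`, or returns None if the buffer is too short.
fn read_u32(buf: &[u8], pos: &mut usize) -> Option<u32> {
    let bytes = buf.get(*pos..*pos + 4)?;
    *pos += 4;
    Some(u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Splits a batch into its entries; returns None if the buffer is truncated.
fn split_batch(buf: &[u8], num_messages: usize) -> Option<Vec<BatchEntry>> {
    let mut pos = 0;
    let mut entries = Vec::with_capacity(num_messages);
    for _ in 0..num_messages {
        let metadata_len = read_u32(buf, &mut pos)? as usize;
        let metadata = buf.get(pos..pos + metadata_len)?.to_vec();
        pos += metadata_len;
        // Toy metadata (assumption): the first 4 bytes hold the payload size.
        let mut metadata_pos = 0;
        let payload_len = read_u32(&metadata, &mut metadata_pos)? as usize;
        let payload = buf.get(pos..pos + payload_len)?.to_vec();
        pos += payload_len;
        entries.push(BatchEntry { metadata, payload });
    }
    Some(entries)
}
```

Each payload can then be handed to DeserializeMessage individually, since its boundaries are only known once the preceding metadata has been decoded.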

Write a README

Simple how-to-use

1.0 Release tracking issue

This is a tracking issue for our upcoming pulsar 1.0 release.

Features we want included:

Upgrade to futures 0.3
Clean, stable consumer/producer API[1]
Transparent reconnect for producers and consumers[2]
Consumer message batching
Producer message batching (currently possible, but awkward)
Compression
Dead-letter topic
Finalized message serialization/deserialization API
<What others?>

[1]: Right now we duplicate creation with builders and constructors. To make future changes non-breaking, moving to a builder-only pattern might be desirable.

[2]: This is tricky and requires some thought. For pulsar, the concept of a producer is connection-based (unique ID within a connection, etc). It's not clear to me what use cases exist for exposing that to the user - if possible, I'd like to abstract that away, and transparently handle generating a new producer on a new connection when an existing connection fails. If so, this should be configurable, and expose an API to check whether the client is connected.

Build a "Reader"

Implement Reader (https://pulsar.apache.org/docs/en/concepts-clients/#reader-interface)

Consumers should be able to subscribe to multiple topics

This means a breaking change to the consumer API, as we'll need to pass topic info to the stream consumer

Correctly handle broker-discovery

Currently we drop "LookupTopic" responses on the floor

Expose Pulsar healthcheck

vectordotdev/vector#2475 (comment)

Producer::send can't send Vec<u8> or [u8]

As far as I can tell, it's not possible to send binary data to a Producer.

send requires:

<T: SerializeMessage>(...message: &T)

There is a SerializeMessage impl for [u8], but none for Vec<u8>.

If you try to send a [u8], rustc will complain that the value is potentially unsized:

error[E0277]: the size for values of type `[u8]` cannot be known at compilation time
   --> src/sinks/pulsar.rs:100:58
    |
100 |         let fut = self.producer.send(self.topic.clone(), &message[..]);
    |                                                          ^^^^^^^^^^^^ doesn't have a size known at compile-time

If you try to send a Vec<u8> or &Vec<u8>, it won't work because there is no SerializeMessage impl for it.

fix: Adding a bound like T: SerializeMessage + ?Sized is a quick fix to get [u8] data working. Vec<u8> should probably also be implemented. In the long run, it may be worth looking at the interplay between message: &T and serialize_message(input: &Self).
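
A minimal sketch of the proposed fix. The trait below is a simplified stand-in for the crate's SerializeMessage (its Vec<u8> return type is an assumption), and send is reduced to a free function that takes only the message.

```rust
/// Simplified stand-in for the crate's trait.
trait SerializeMessage {
    fn serialize_message(input: &Self) -> Vec<u8>;
}

impl SerializeMessage for [u8] {
    fn serialize_message(input: &Self) -> Vec<u8> {
        input.to_vec()
    }
}

impl SerializeMessage for Vec<u8> {
    fn serialize_message(input: &Self) -> Vec<u8> {
        input.clone()
    }
}

/// Without `?Sized`, `T` carries an implicit `Sized` bound and `T = [u8]`
/// is rejected with E0277.
fn send<T: SerializeMessage + ?Sized>(message: &T) -> Vec<u8> {
    T::serialize_message(message)
}
```

With both impls in place, send(&bytes[..]) and send(&bytes) compile for a bytes: Vec<u8>.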

Depending on what you're looking for, I can make one of these changes.

the consumer should provide the `Ack` even for deserialization errors

The consumer provides a Stream of Result<(T, Ack), ConsumerError>. If there is a deserialization error, no Ack is returned, so the message cannot be acknowledged and will still be present in the topic.
I'd propose that the consumer instead provides a Stream of (Result<T, ConsumerError>, Ack), so that the client can decide to acknowledge the message and drop it in case of error.
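
The difference can be sketched without any async machinery. In this hypothetical example a slice of raw messages stands in for the Stream, and Ack, ConsumerError, deserialize and drain are simplified illustrations rather than the crate's types.

```rust
#[derive(Debug, PartialEq)]
struct ConsumerError(String);

/// Hypothetical acknowledgement handle; acking records the message id.
struct Ack {
    message_id: u64,
}

impl Ack {
    fn ack(self, acked: &mut Vec<u64>) {
        acked.push(self.message_id);
    }
}

/// Proposed item shape: the Ack is available even when deserialization fails.
type Item = (Result<String, ConsumerError>, Ack);

fn deserialize(message_id: u64, raw: &[u8]) -> Item {
    let payload = String::from_utf8(raw.to_vec()).map_err(|e| ConsumerError(e.to_string()));
    (payload, Ack { message_id })
}

/// Acks every message, including undecodable ones, and returns the decoded payloads.
fn drain(raw_messages: &[(u64, Vec<u8>)], acked: &mut Vec<u64>) -> Vec<String> {
    let mut decoded = Vec::new();
    for (message_id, raw) in raw_messages {
        let (payload, ack) = deserialize(*message_id, raw);
        if let Ok(text) = payload {
            decoded.push(text);
        }
        // With Result<(T, Ack), E> this line would be unreachable for bad messages.
        ack.ack(acked);
    }
    decoded
}
```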

Producers and Consumers should reconnect transparently

Right now, each producer or consumer has a single connection, and if that connection disconnects, the producer or consumer must be recreated. It would be nice to make this happen transparently, possibly with a configurable timeout.

Some issues that need to be resolved:

(1) What's the best abstraction for the producers / consumers to use in creating a new connection? Each could contain a full Pulsar client and use that to reconnect (easy to accomplish if we limit creating to being done via the client). This has the benefit of allowing better connection re-use (if each client references the same connection manager, they can get references to existing connections without difficulty).

(2) How do we handle enqueued messages? Currently the producer and the connection tasks communicate using futures::sync::mpsc channels; in the event of a reconnect, any messages enqueued would be dropped. Is there a clean way to recover them and re-send? (Could have a "return messages" queue or something that gets filled with any pending messages in a Drop impl, but this starts adding complexity that makes me nervous).

(3) The machinery around swapping out a connection is nontrivial in the producer itself; reconnecting will resolve with a Future of Connection, but will need some ability to actually mutate the consumer or producer, which implies some sort of sync mutability, but I don't really like the idea of every connection being behind a lock & refcell, so possibly there are some architecture improvements that could be made there to make that easier?

Creating a Producer from a dedicated function always results in a 'Disconnected' state

Hi!

I'm having trouble creating a Producer from a dedicated function. I always get a disconnect when returning a Producer. Here's an example that reproduces the bug:

    pub fn new_basic_producer() -> Result<Producer, ConnectionError> {
        let runtime = tokio::runtime::Runtime::new().unwrap();
        Producer::new(String::from("127.0.0.1:6650"), "topic", None, None, None, runtime.executor()).wait()
    }

    #[test]
    fn cloneable() {
        let producer = new_basic_producer().unwrap();
        let serialized = vec![240, 159, 146, 150];
        let send_1 = producer.send_raw(serialized, None).wait().unwrap();
    }

thread 'tests::cloneable' panicked at 'called Result::unwrap() on an Err value: Disconnected', src/libcore/result.rs:999:5

If I create the Producer inline instead, everything works. The same issue occurs when creating a pool using r2d2::Pool.

send failed because receiver is gone

Hello, until this library supports futures 0.3, I am using it with futures 0.3 through compat, with code like this:

	consumer.compat().for_each(move |msg|{
		if msg.is_err() {
			return future::ready(());
		}
		let msg = msg.unwrap();
		msg.ack.ack();

But when I run this code, the ack does not succeed; in fact, an error occurs: "send failed because receiver is gone".
Can anyone help me see how to make it work correctly? Thanks.

logging and error management

currently it's a bit annoying to understand failures, since most of them will result in an Error::Disconnected without any context.
I propose two things:

adding the log crate and integrating logs here and there
error structures with more context, like which consumer id generated it, which topics. I'm not fond of using crates like error-chain or failure, it's easy enough to make the error structures manually

a consumer should not be closed until all current messages are dropped or acknowledged

Here's some code to show the issue:

In this case we use for_each on the consumer, so the consumer is still available whenever we call msg.ack.ack():

    Pulsar::new(
        pulsar_addr.parse().unwrap(),
        auth,
        runtime.executor(),
    )
    .and_then(|client| {
        client.create_consumer::<Data,_,_>(topic, "my_subscriber", SubType::Shared, None, None, None)
    })
    .and_then(|mut c| {
        info!("created consumer");
        c.for_each(move |msg: pulsar::Message<Result<Data, ConsumerError>>| {
            info!("got message: {:?}", msg.payload);
            msg.ack.ack();
            Ok(())
        })
    })
    .map_err(|e| {
        error!("got error: {:?}", e);
        e
    })
    .wait()
    .unwrap();

Here we take a message from the consumer, pass it to another part of the code, while the consumer is dropped. If we try to ack the message after the consumer was dropped, the message is never acknowledged.

    Pulsar::new(
        pulsar_addr.parse().unwrap(),
        auth,
        runtime.executor(),
    )
    .and_then(|client| {
        client.create_consumer::<Data,_,_>(topic, "my_subscriber", SubType::Shared, None, None, None)
    })
    .and_then(|mut c| {
        info!("created consumer");
        c.take(1).collect()
    })
    .and_then(|mut msgs: Vec<pulsar::Message<Result<Data, ConsumerError>>>| {
        let msg = msgs.remove(0);
        info!("got message: {:?}", msg.payload);
        msg.ack.ack();
        info!("acked");
        Ok(())
    })
    .map_err(|e| {
        error!("got error: {:?}", e);
        e
    })
    .wait()
    .unwrap();

Upgrade to futures 0.3

I've started work on the futures 0.3 upgrade, but I think this is probably also a good time to actually solve some of the other major issues I've been thinking about, as they will likely involve breaking changes to the API.

(1) How we handle reconnects. Right now, a disconnect will fail a producer or consumer (at least the single-topic variety), requiring users to manually handle reconnects. In particular, since pulsar is frequently used in cloud applications, I'd like to be resilient to e.g. DNS changes (if a user redeploys their pulsar cluster, it would be nice for existing applications to transparently reconnect to the new cluster). Currently, I'm thinking having a single separate task that handles all of the connections, producers, and consumers might be simplest, but it would be worth verifying that this doesn't negatively affect throughput.

(2) How to handle slowly adding pulsar features without breaking changes each time. Right now, as we slowly build out full functionality, we keep needing to add configurable options to constructors. Right now, each of these requires a minor version bump. One option would be to hide everything behind builders, and just add more with_<option>(...) methods. Others?
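
As an illustration of the builder option in (2), here is a minimal sketch. ConsumerOptions, ConsumerBuilder and the specific fields are hypothetical; the point is that a new option only adds a with_<option>(...) method and leaves existing call sites untouched.

```rust
#[derive(Debug, Clone, PartialEq)]
struct ConsumerOptions {
    topic: String,
    subscription: String,
    consumer_name: Option<String>,
    batch_size: Option<u32>,
}

struct ConsumerBuilder {
    options: ConsumerOptions,
}

impl ConsumerBuilder {
    /// Only the mandatory settings are constructor arguments.
    fn new(topic: &str, subscription: &str) -> Self {
        ConsumerBuilder {
            options: ConsumerOptions {
                topic: topic.to_string(),
                subscription: subscription.to_string(),
                consumer_name: None,
                batch_size: None,
            },
        }
    }

    fn with_consumer_name(mut self, name: &str) -> Self {
        self.options.consumer_name = Some(name.to_string());
        self
    }

    /// A later release can add methods like this one without breaking callers.
    fn with_batch_size(mut self, batch_size: u32) -> Self {
        self.options.batch_size = Some(batch_size);
        self
    }

    fn build(self) -> ConsumerOptions {
        self.options
    }
}
```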

(3) How to handle spawning tasks for multiple executors. With the increasing popularity of async-std, it seems important to not be tied to the tokio runtime. futures exposes a Spawn (or really, SpawnExt) trait that we can wrap to pass around our own executor, but right now it appears that async-std doesn't actually expose anything that implements it, so that might not be a workable solution. Alternatively, we could have client creation return an "engine" future that callers are required to spawn themselves, and internally use channels to pass anything that needs to be handled to that future. We could also use feature-flags to specify runtime. Right now, there are only two major ones (tokio and async-std), but if the ecosystem grows, this could quickly become untenable.

The `batch_size` option in the producer is disconnected

In struct ProducerOptions, there is a batch_size option, however this option is not actually used anywhere. In fact, I deleted the line from the struct locally, and no build errors are produced.

Is there a plan to implement this option for producers?

handling remote Java null pointer exceptions from local safe Rust

I regularly get the following error Err(ServiceDiscovery(Query("java.lang.NullPointerException"))) when testing automatic reconnection with a standalone Pulsar. This message comes from the CommandLookupTopicResponse message field: https://github.com/wyyerd/pulsar-rs/blob/master/PulsarApi.proto#L393

I do not know yet how I should handle this. I guess the client should just return an error? Can we recover from it? (Usually it means that the standalone Pulsar has not completely started yet.)

handling acknowledgement across reconnections

we should verify how acknowledgements work if we reconnect before sending them, if we can retry them, etc. I think the unacked messages will be redelivered automatically to the new consumer anyway, but we should check this

Support other executors

Since the executor is already hidden behind a trait, it should be possible to support async-std and other executors

configurable TLS support

cf https://pulsar.apache.org/api/client/2.3.0/ for API ideas

We should allow specifying a custom root certificate, and changing the algorithms. Algorithms should also use a safe default

Handle auth

new Client and ConnectionManager interfaces

while working on the service discovery feature, I got an idea of how to make an API that's easy to use.

Connection manager

Handling connections to a Pulsar cluster is a lot of work. Through a single client, we should be able to connect to multiple brokers, either directly, or through a proxy, in TCP plaintext or over TLS. A broker connection could be shared between multiple producers and consumers. Also, handling authentication.

I explored a way to handle it in the service discovery PR: a function that takes a broker address, and returns a Future of Arc<Connection>. If the connection exists, we get it immediately, otherwise we try to connect to that broker.

I'd like to extract that from the SD code, and have a ConnectionManager struct, that would be created from the following information:

IP and port of either a broker or the proxy
authentication data (if available)
activate TLS or not

This connection manager would handle connecting to other brokers, passing along authentication data, creating TLS connections, and answering queries for Connection objects.
I'm not sure it should reconnect automatically when connections drop, because consumers using them would need to be created again, but it should allow reconnecting if asked to.

When used through SD, the connection manager would be automatically populated with the discovered brokers.
Consumers and producers could be created with connections coming from the manager (they already use an Arc<Connection> internally).
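
A rough, synchronous sketch of the lookup-or-connect behaviour described in this section. The real function would return a Future of Arc<Connection>; here connecting is reduced to building a struct, and the field names and get_connection are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for a live broker connection.
#[derive(Debug)]
struct Connection {
    broker: String,
    tls: bool,
    authenticated: bool,
}

struct ConnectionManager {
    auth: Option<String>,
    tls: bool,
    connections: Mutex<HashMap<String, Arc<Connection>>>,
}

impl ConnectionManager {
    fn new(auth: Option<String>, tls: bool) -> Self {
        ConnectionManager {
            auth,
            tls,
            connections: Mutex::new(HashMap::new()),
        }
    }

    /// Returns the existing connection to `broker`, or creates and caches a new one.
    fn get_connection(&self, broker: &str) -> Arc<Connection> {
        let mut connections = self.connections.lock().unwrap();
        let entry = connections.entry(broker.to_string()).or_insert_with(|| {
            Arc::new(Connection {
                broker: broker.to_string(),
                tls: self.tls,
                authenticated: self.auth.is_some(),
            })
        });
        Arc::clone(entry)
    }
}
```

Because the manager hands out Arc<Connection> clones, several producers and consumers can share one broker connection.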

pulsar::Client

This would be a wrapping object that can connect, perform lookups, create producers and consumers, while hiding the protocol details (the current Connection, Consumer, Producer and ServiceDiscovery APIs should still be available if people want to get a bit lower in the stack).

It would be created from connection info like the connection manager, and would have methods like these:

lookup_topic/lookup_partitioned_topic
create_consumer/create_producer from a broker address and a topic name
consume_topic/consume_partitioned_topic, possibly creating one consumer per partitioned topic?
send_message to send a single message to a topic

This Client interface would hide a lot of boilerplate to implement common use cases.

Introduce releases

Hi @stearnsc,

Can you introduce releases for crates.io?

Regressions: unacked message redelivery and multitopic test

I broke those tests while refactoring. Fixing the NackHandler would require registering currently in-flight messages with the NackHandler, and removing them from that list if they're acknowledged (essentially returning to the previous code), but there's a way to rewrite the handler with an async block, which I plan to add this week.
For the multitopic consumer, I do not know yet what is causing the issue, but I'm planning to rewrite it as an async block anyway

update prost 5 in Cargo.toml

Hoy hoy,
Prost is unstable in versions lower than 6 (nasty stack overflow).

Link to more info:

https://rust.firosolutions.com/paste/mC9weAfR5UMKX4Ic9K08hd9nTGZkozrkKjObQajJbd0

Please update your Cargo.toml file.

the Producer is actually a ProducerFactory and it's confusing

a Consumer is created with a link to a topic, as follows:

let consumer: Consumer<TestData> = pulsar
    .consumer()
    .with_topic("test")
    .with_consumer_name("test_consumer")
    .with_subscription_type(SubType::Exclusive)
    .with_subscription("test_subscription")
    .build()
    .await
    .unwrap();

We can then use it directly as a stream.

But for producer, we first create a Producer with let producer = pulsar.producer(None);, the only argument being ProducerOptions, to setup encryption, batching, schema and various metadata.
We only select the topic when sending: producer.send("topic", message).await.unwrap()

This relies on a complicated machinery (with a spawned ProducerEngine, etc) to create on the fly a TopicProducer with the same options as Producer, then send a message on that topic.

What we actually want in most cases is a TopicProducer (that we can already create with Pulsar::create_producer), so I propose that we remove the Producer entirely, and replace it with TopicProducer.

This will be consistent with the official Java client:

Producer<byte[]> producer = client.newProducer()
    .topic("my-topic")
    .batchingMaxPublishDelay(10, TimeUnit.MILLISECONDS)
    .sendTimeout(10, TimeUnit.SECONDS)
    .blockIfQueueFull(true)
    .create();

pass message metadata to DeserializeMessage

it stores some information about the message, like the schema or the compression algorithm, that we might need when deserializing

Cursor control for consumers

Both starting message and Seek

send_message should care about which producer is handling the request

in send_message, we request from the producer engine that it sends a message on a topic without specifying which producer should do it (we could have multiple producers for one topic): https://github.com/wyyerd/pulsar-rs/blob/master/client/src/producer.rs#L97-L100
we should have some kind of internal identifier to decide which producer should send the message

Feature-flag out serde dependency

Right now we're too-strongly tied to serde; it would be nice to either remove the dependency entirely or else hide it behind a feature flag.

Add type-safe producer wrapper

Right now, the producer just takes the topic name and whatever data you want to send, which makes it really easy to send data that the consumer isn't expecting. We should expose a good way of specifying, for a given topic, what data the producer can accept.
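
One possible shape for such a wrapper, as a sketch only: the producer is bound to a topic and a message type at construction, so sending any other type is a compile error. TypedProducer, Temperature and this reduced SerializeMessage trait are hypothetical, and sent messages are simply collected in a Vec instead of going to a broker.

```rust
use std::marker::PhantomData;

/// Reduced stand-in for the crate's serialization trait.
trait SerializeMessage {
    fn serialize(&self) -> Vec<u8>;
}

/// A producer tied to one topic and one message type.
struct TypedProducer<T: SerializeMessage> {
    topic: String,
    sent: Vec<Vec<u8>>,
    _message: PhantomData<T>,
}

impl<T: SerializeMessage> TypedProducer<T> {
    fn new(topic: &str) -> Self {
        TypedProducer {
            topic: topic.to_string(),
            sent: Vec::new(),
            _message: PhantomData,
        }
    }

    fn topic(&self) -> &str {
        &self.topic
    }

    /// Only values of type `T` are accepted for this topic.
    fn send(&mut self, message: &T) {
        self.sent.push(message.serialize());
    }
}

struct Temperature {
    celsius: i32,
}

impl SerializeMessage for Temperature {
    fn serialize(&self) -> Vec<u8> {
        self.celsius.to_be_bytes().to_vec()
    }
}
```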

Publish to crates.io

Need to figure out what we want done before doing so

Improve / abstract serialization and deserialization

Currently the serde and serde_json crates are dependencies, but it would be nice for them to be optional (feature-gated), and provide a better / more unified experience for people using other solutions.

In the spirit of making getting started as easy as possible, I think a good outcome would be for people who use serde to be able to send/receive types that impl Serialize + Deserialize without having to do any manual setup (or at least as minimal as possible), and for people doing something else to be able to "drop in" a replacement with a small amount of configuration (impl'ing a trait or two seems reasonable).

handling message format

pulsar messages can specify a format (cf https://pulsar.apache.org/docs/en/concepts-schema-registry/#supported-schema-formats )
I don't know yet how we could connect to the format registry to download Avro specifications (and the Rust Avro library did not seem very easy to use)

keep messages in the producer until we get the receipt

in case of reconnections, messages previously sent might have been lost, so we should keep them in a queue until we get the send receipt, and if we reconnected, send them again. Before implementing this, we should look at the Java client's implementation and make sure to follow the same semantics
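
A small sketch of such a queue, under the assumption that every message gets a sequence id and that a receipt names the id it confirms. PendingQueue and its methods are hypothetical and make no claim to match the Java client's semantics.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
struct PendingMessage {
    sequence_id: u64,
    payload: Vec<u8>,
}

/// Keeps every sent message until its send receipt arrives.
#[derive(Default)]
struct PendingQueue {
    next_sequence_id: u64,
    pending: VecDeque<PendingMessage>,
}

impl PendingQueue {
    /// Records a message as sent and returns its sequence id.
    fn enqueue(&mut self, payload: Vec<u8>) -> u64 {
        let sequence_id = self.next_sequence_id;
        self.next_sequence_id += 1;
        self.pending.push_back(PendingMessage { sequence_id, payload });
        sequence_id
    }

    /// Called when the send receipt for `sequence_id` is received.
    fn on_receipt(&mut self, sequence_id: u64) {
        self.pending.retain(|m| m.sequence_id != sequence_id);
    }

    /// Messages to send again after a reconnection, in their original order.
    fn unconfirmed(&self) -> Vec<PendingMessage> {
        self.pending.iter().cloned().collect()
    }
}
```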

API cleanup

the available methods are not consistent between Pulsar, Producer, TopicProducer, Consumer and MultiConsumer:

we're still exposing send_raw methods while most members of a message should be defined by the producer
Producer has send_all but not TopicProducer nor Pulsar
we should add an options() method to Producer and TopicProducer to expose the ProducerOptions
MultiTopicConsumer should have a topics() method to return the list of topics currently followed
MultiTopicConsumer should have nack() and cumulative_nack() methods

handling compression

messages can be compressed with various algorithms, and the algorithm in use is indicated in metadata

Producer and Consumer options

there are various options that could be activated on producers and consumers, like priority levels, metadata, schema. I'd like to add ProducerOptions and ConsumerOptions arguments at creation, which will map to the optional elements of CommandSubscribe and CommandProducer. Making separate structures allows easier documentation, and adding more logic when converting to the protobuf structures

Host docs

add an example of producer batching usage

I just implemented message batching in TopicProducer in 182d450 and I have found that using it can be a bit tricky. Calling send() on a producer returns a future of CommandSendReceipt. When batching, that future will not resolve until the batch has been sent, so we cannot do producer.send(...).await in a loop, as is done in the round trip example, because we would be stuck at the first send(): https://github.com/wyyerd/pulsar-rs/blob/182d45071c02e6f5885dcfa8293bd2a299178d6c/examples/round_trip.rs#L74-L81
Instead, we need to collect the receipt futures, and await on them all at once:

use futures::future::join_all;

let producer = pulsar
    .create_producer(
        "test",
        Some("my-producer".to_string()),
        producer::ProducerOptions {
            batch_size: Some(5),
            ..Default::default()
        },
    )
    .await
    .unwrap();

// Collect the receipt futures instead of awaiting each send() in turn.
let mut receipts = Vec::new();
let mut counter = 0usize;
loop {
    let receipt = async {
        producer
            .send(TestData {
                data: "data".to_string(),
            })
            .await
    };
    receipts.push(receipt);
    counter += 1;
    if counter % 5 == 0 {
        println!("sent {} messages", counter);
        break;
    }
}

// The receipts only resolve once the whole batch of 5 has been sent.
println!("received receipts: {:?}", join_all(receipts).await);

Maybe there's a better way to represent this in the API?

Handle topic partitioning

Related to #1: both producers and consumers should use service discovery correctly to handle partitioned topics

https://pulsar.apache.org/docs/en/concepts-messaging/#partitioned-topics

client close method

we have ways to close a producer or consumer, but not the main client instance, and since that one relies on a background task, we should have an explicit method to stop everything