
Comments (21)

lmontrieux avatar lmontrieux commented on May 25, 2024 1

I think we need to come up with proper admin documentation - that would be the right place to answer these questions, and also to highlight configuration options.

from nakadi.

adyach avatar adyach commented on May 25, 2024

Hi @wsalembi, thank you for your interest in Nakadi!

There are no open-source documents on how to operate Nakadi, nor forums to discuss it, but the team is happy to collaborate to clarify any questions you have, and maybe to open source (at least partially) what we have on the topic. Feel free to ask in this issue or file another one. I will cover your points in this issue.

  • High-availability setup
    Nakadi's availability is covered by the availability of the Apache Kafka cluster, which is configurable up to your requirements. Nakadi publishes events in batches. In order to preserve ordering within a partition, Nakadi does not retry publishing messages to Kafka, even if only one event in a batch failed to be published: it rejects the whole publishing request.

    Our team hosts Kafka on AWS across 3 different AZs with a replication factor of 3 (2 in-sync replicas) in rack-aware mode, so it is possible to lose one AZ completely.
    There are also SLOs defined for Nakadi; if you need more on that topic, let me know.

  • Capacity management
    Nakadi itself is a stateless application, so scaling Nakadi in and out is not a problem, especially if you run in a cloud environment as we do on AWS. Nakadi is CPU-bound due to the processing of events, such as schema validation and compression, which is why it scales out very well based on CPU usage.

    Nakadi is backed by Apache Kafka, which, as a stateful application, requires effort to scale up in advance. Although Kafka is network-bound, the main concern is disk utilisation: a machine can be exchanged for a bigger one (more network, more CPU) in a couple of minutes if you use some kind of attachable/detachable volumes to avoid data replication, but once you have to extend disk capacity you have to introduce new nodes into the cluster, which requires a rebalance of the cluster in order to distribute the load. We keep ~40% of free disk space per broker to be able to add more if load grows unexpectedly. One of our inventions is the Kafka supervisor bubuku, which helps move data between brokers in different ways.

    Apache Zookeeper is used to maintain distributed acknowledgement of consumed events (Subscriptions API). The usual setup for Nakadi keeps a 3-node Zookeeper ensemble, which scales only vertically, by replacing machines with more powerful ones. It is quite over-provisioned on CPU; keeping its usage around 40%-60% works well for us.

    Postgres is used as the EventTypes metadata store. The hottest paths are cached, so it does not really require capacity planning, since its CPU and disk usage are low. The current cluster runs with 1 master and 2 replicas.

  • Backup/restore strategies
    This should be covered by Apache Kafka and Apache Zookeeper setup.

  • Release management (upgrade strategy)
    Nakadi can be rolled out in several different ways: switching traffic between stacks, or blue-green deployment. Basically, it is a stateless application, so there is nothing special about upgrading it.

    The Apache Kafka cluster is upgraded using bubuku and the upgrade recommendations from the Apache Kafka documentation.
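To make the all-or-nothing batch semantics above concrete, here is a minimal sketch of building a publishing request against Nakadi's events endpoint. The base URL and the event type name are hypothetical; only the endpoint shape and the atomic-batch behaviour come from the description above.

```python
import json

# Hypothetical Nakadi base URL; adjust for your deployment.
NAKADI_URL = "https://nakadi.example.org"

def build_publish_request(event_type, events):
    """Build the (url, body) pair for Nakadi's batch publishing endpoint.

    Nakadi treats each request as an atomic batch: if any event in the
    batch fails validation or publishing, the whole request is rejected,
    which preserves ordering within a partition (no partial retries).
    """
    url = f"{NAKADI_URL}/event-types/{event_type}/events"
    body = json.dumps(events)
    return url, body

url, body = build_publish_request(
    "order.created",
    [{"metadata": {"eid": "6f4c2e0e-0000-0000-0000-000000000001",
                   "occurred_at": "2024-05-25T12:00:00Z"},
      "order_id": "42"}],
)
```

On failure of any single event, the publisher is expected to retry the whole batch, since Nakadi will not have written any of it.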


lenalebt avatar lenalebt commented on May 25, 2024

Thanks for your post here, it helped me better understand it.

Since I know Zalando is running quite a few services in k8s (besides the vintage DC, and maybe some STUPS services still around?), I wonder if you have experience running Nakadi (maybe together with Kafka?) in k8s, or together with a managed Kafka such as Amazon MSK.

Besides that, are there others running Nakadi you know of outside of Zalando? I am happy to read names or numbers, but a simple "yes/no" would help as well.


adyach avatar adyach commented on May 25, 2024

@lenalebt unfortunately, the answer is no to your questions for now


lenalebt avatar lenalebt commented on May 25, 2024

Too bad! But thanks for the update :).


mateimicu avatar mateimicu commented on May 25, 2024

I am also interested in this subject.

@lenalebt We are now trying to run Nakadi in EKS using MSK (and the Zookeeper it provides)


lmontrieux avatar lmontrieux commented on May 25, 2024

@mateimicu that's very interesting, how is it working for you? We'd love to know more about your efforts, and any issues you may be running into!


mateimicu avatar mateimicu commented on May 25, 2024

@lmontrieux it works, and we can use it. The problem we had is ensuring backups for it, i.e. keeping all of Nakadi's data stores in sync (MSK, Zookeeper, DB).

We are now trying to understand what we should back up and how to do it. We want to use it for a mission-critical system; we can't afford to lose messages.

Also, we need to be able to do disaster recovery and migrations (maybe even keep a hot-replica in another AWS region).


lmontrieux avatar lmontrieux commented on May 25, 2024

We keep zookeeper and DB backups separately - the two are not synchronized, but that should not be an issue. In the extremely unlikely event that we would lose both the entire zookeeper ensemble and the database (and all its replicas) at the same time, we can recover from the backups and start again from there.

Regarding Kafka, we have replication across 3 brokers, each in a different availability zone, and acks=all to make sure that Nakadi does not confirm publishing until the data is on all brokers. The Kafka brokers use EBS volumes, so the risk of data loss is very, very small. But we also have an application that consumes all data from all Nakadi event types and persists it to S3 for our data lake, so we have another copy in case everything goes south.
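The durability settings described above can be sketched as configuration values. The numbers mirror the setup in this thread (3 replicas, one per AZ, acks=all) but are illustrative, not an exact production configuration:

```python
# Topic-level durability settings, one replica per availability zone.
TOPIC_CONFIG = {
    "replication.factor": 3,    # three brokers, three AZs
    "min.insync.replicas": 2,   # survive the loss of one AZ
}

# Producer-level setting: don't confirm a publish until all
# in-sync replicas have the data.
PRODUCER_CONFIG = {
    "acks": "all",
}

def can_ack(in_sync_replicas, topic_config):
    """With acks=all, a write is acknowledged only while enough
    replicas are in sync; otherwise the broker rejects it."""
    return in_sync_replicas >= topic_config["min.insync.replicas"]
```

With these values, losing one AZ leaves two in-sync replicas and publishing continues; losing two would make writes fail rather than risk data loss.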


lmontrieux avatar lmontrieux commented on May 25, 2024

re: keeping a hot-replica in another AWS region: we don't do it, but I guess you could use Kafka MM to replicate to another cluster, or even an application that reads from Nakadi and writes to another Nakadi. I see potential issues around offsets and timelines, though.


eupestov avatar eupestov commented on May 25, 2024

Thanks for sharing @lmontrieux

If the DB and ZK backups are not synchronised, you will probably need to do some 'magic' after you restore from the backups to deal with the data which is stored in both, e.g. subscriptions and event types?

Is it possible to create a subscription which will get all the events, even as new event types get registered? Or do you update that 'sink' subscription explicitly with every new event type?


lmontrieux avatar lmontrieux commented on May 25, 2024

Yes, you'd probably have to do some cleanup, as there could be subscriptions in the DB that aren't in Zookeeper, or the other way around. But since subscription or event type creation is a relatively rare event, we take it as an acceptable risk that, in the case of a complete meltdown, we'll have to clean up a few subscriptions or event types.

It isn't possible to create a subscription which gets all events from all event types, for 2 reasons:

  • the set of event types in a subscription is fixed, so you'd have to re-create the subscription every time a new event type is created
  • there should be a reasonable maximum number of partitions to which a subscription is registered, as Nakadi will not be able to handle too many. We set our maximum to 100 partitions per subscription

To archive everything, we wrote an application (not yet open source, unfortunately) that periodically lists all the event types, and then creates subscriptions to read events from them.
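A minimal sketch of that archiving approach, assuming Nakadi's standard Subscriptions API; the application and consumer group names are hypothetical:

```python
def subscription_payloads(event_types):
    """Given the event type names returned by GET /event-types,
    build one POST /subscriptions payload per event type, working
    around the per-subscription limits described above."""
    payloads = []
    for et in event_types:
        payloads.append({
            "owning_application": "archiver",  # hypothetical app name
            "event_types": [et],               # one subscription per event type
            "consumer_group": "archiver",      # hypothetical consumer group
            "read_from": "begin",              # archive from the oldest data
        })
    return payloads

payloads = subscription_payloads(["order.created", "order.shipped"])
```

Running this periodically picks up newly created event types, which is what makes the "subscribe to everything" goal achievable despite the fixed event type list per subscription.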


mateimicu avatar mateimicu commented on May 25, 2024

Hi @lmontrieux,

I see a line mentioning SLO monitoring support in the documentation, but I can't find more information. Is there another place I should look?


lmontrieux avatar lmontrieux commented on May 25, 2024

Hi @mateimicu

For SLO monitoring, we use Nakadi itself. Basically, Nakadi produces its own access log to an event type, nakadi.access.log. It also logs some stats to nakadi.batch.published and nakadi.data.streamed.
We use these for our monitoring, including SLOs.


mateimicu avatar mateimicu commented on May 25, 2024

@lmontrieux sounds like a good idea.

Another question: who creates nakadi.access.log and the other event types? I tried to enable the kpi_collection and audit_log_collection features, but got the following error:

Error occurred while publishing events to nakadi.data.streamed, EventType "nakadi.data.streamed" does not exist.


mateimicu avatar mateimicu commented on May 25, 2024

I think I figured this out: just starting a new process worked :).


lmontrieux avatar lmontrieux commented on May 25, 2024

:)


eupestov avatar eupestov commented on May 25, 2024

If you do not mind, I'll ask here instead of starting a new thread:

We have two timelines for event type X:
A (original): e1, e2, e3, ... eN
B (new): eN+1, ... eM

What should happen if we create a new subscription for X with read_from=begin? Should it start streaming from e1 or eN+1? It looks like it starts from eN+1 in my case, which is not what I expected.


lmontrieux avatar lmontrieux commented on May 25, 2024

It should start at the oldest available offset of the oldest available timeline, if you read from begin.
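A toy model of that expectation, assuming timelines are ordered oldest first and fully cleaned-up timelines have already been dropped from the list:

```python
def begin_offset(timelines):
    """timelines: list of (timeline_name, offsets) pairs, oldest
    timeline first, with fully cleaned-up timelines already removed.

    With read_from=begin, streaming should start at the oldest
    available offset of the oldest available timeline.
    """
    oldest_name, offsets = timelines[0]
    return f"{oldest_name}:{offsets[0]}"

# Timelines A (e1..e3) and B (e4): begin should point at A's e1,
# not at the first offset of the newest timeline.
start = begin_offset([("A", ["e1", "e2", "e3"]), ("B", ["e4"])])
```

So if a subscription starts at eN+1 while timeline A is still retained, something has removed timeline A prematurely, which is exactly what the retention-time issue below turned out to be.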


eupestov avatar eupestov commented on May 25, 2024

Thanks. In my case, the retention time for the event type was set to -1, and because of that the old topic was cleaned up almost immediately, because of:

final Date cleanupDate = new Date(System.currentTimeMillis() + retentionTime);

Is this a bug or a feature? =)
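For context, the Java line above adds the retention time to the current time, so a retention of -1 yields a cleanup date one millisecond in the past and the old timeline's topic becomes eligible for deletion immediately. A restatement of the arithmetic (in Python, with a fixed clock for clarity):

```python
def cleanup_timestamp_ms(now_ms, retention_time_ms):
    # Mirrors: new Date(System.currentTimeMillis() + retentionTime)
    return now_ms + retention_time_ms

NOW_MS = 1_716_638_400_000  # an arbitrary "current time" in epoch millis

# With a sensible retention (e.g. 4 days) the cleanup date lies in the future...
future = cleanup_timestamp_ms(NOW_MS, 4 * 24 * 3600 * 1000) > NOW_MS
# ...but with retention -1 the cleanup date is already in the past.
past = cleanup_timestamp_ms(NOW_MS, -1) < NOW_MS
```

Treating -1 as "unlimited retention" would require special-casing it before this addition, rather than feeding it straight into the cleanup date.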


lmontrieux avatar lmontrieux commented on May 25, 2024

Looks like a bug. Could you please open an issue (and maybe a PR if you feel like it) ?

