raystack / firehose

Firehose is an extensible, no-code, and cloud-native service to load real-time streaming data from Kafka to data stores, data lakes, and analytical storage systems.

Home Page: https://raystack.github.io/firehose/

License: Apache License 2.0

Languages: Java 99.93%, Dockerfile 0.07%
Topics: kafka, sink, streaming, firehose, dataops, bigquery, postgresql, influxdb, prometheus, apache-kafka

firehose's People

Contributors

akhildv, ankittw, anukin, arujit, deepakmarathe, eyeofvinay, fdgod09, fzrvic, gauravsinghania, h4rikris, hmoniaga2, jesrypandawa, kaiwren, kevinbheda, kevinbhedag, kn-sumanth, lavkesh, mahendrakariya, mayurgubrele, njhaveri, nncrawler, prakharmathur82, pyadav, rajarammallya, ravisuhag, rohilsurana, shreyansh228, sumitaich1998, tygrash, vaish3496

firehose's Issues

Add instrumentation and logging to BQ sink.

Acceptance criteria:

  • Analysis of metrics for BQ sink.
  • Implementation.

Discussion:

Insert time (see the sketch below this list).
Counter for successes/failures of insert messages.
Number of error messages (deserialization and response errors from BigQuery) or DLQ'd messages.
Log the offset info when an error happens.
Table/dataset creation logging and metrics.
Stencil proto update logging and metrics (log exceptions etc.).
Think about completeness/freshness/deduplication.
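
A sketch of how the insert-time and success/failure metrics discussed above could be captured, assuming the Instrumentation helper seen elsewhere in the codebase (the metric names and the captureDurationSince usage here are assumptions, not the final design):

    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.InsertAllResponse;
    import java.time.Instant;

    // Hypothetical metric names; the actual names come out of the analysis task above.
    private static final String SINK_BQ_INSERT_TIME = "firehose_sink_bq_insert_time_milliseconds";
    private static final String SINK_BQ_INSERT_TOTAL = "firehose_sink_bq_insert_total";

    private void insertWithMetrics(InsertAllRequest request) {
        Instant start = Instant.now();
        try {
            // `bigquery` is an assumed BigQuery client field on the sink.
            InsertAllResponse response = bigquery.insertAll(request);
            // Counter for successes/failures of insert messages.
            getInstrumentation().captureCount(SINK_BQ_INSERT_TOTAL, 1, "success=" + !response.hasErrors());
        } finally {
            // Insert time.
            getInstrumentation().captureDurationSince(SINK_BQ_INSERT_TIME, start);
        }
    }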

Allow ignoring unknown values in the Firehose BQ client

There's an issue when starting Firehose when the proto schema does not match the BQ table schema. A sample error looks like this:
Provided Schema does not match Table xxx. Field yyy is missing in new schema

We need a feature to ignore this error and fill in the default value if the proto schema is incomplete.
This can be done by enabling ignoreUnknownValues when building inserts on the BQ client.
source: https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll#request-body
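
A minimal sketch of what enabling this could look like with the google-cloud-bigquery Java client (dataset, table, and row contents are placeholders):

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.InsertAllResponse;
    import com.google.cloud.bigquery.TableId;
    import java.util.Map;

    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    Map<String, Object> rowContent = Map.of("yyy", "default"); // placeholder row
    InsertAllRequest request = InsertAllRequest.newBuilder(TableId.of("dataset", "table"))
            .addRow(rowContent)
            // Skip row values that do not match the table schema instead of
            // rejecting the whole insert (per the insertAll docs linked above).
            .setIgnoreUnknownValues(true)
            .build();
    InsertAllResponse response = bigquery.insertAll(request);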

Improve documentation

  • Add changelog
  • Update roadmap

Guides

  • Filters - How to use

Concepts

  • Overview
  • Add glossary
  • Add structure - to cover the code structure
  • Add details about monitoring
  • Details about filters
  • Explain in detail about templating

Reference

  • Metric names
  • FAQS

Contribute

  • Release process
  • Development Guide - Add details about local setup

Add support for JSON data for BQ sink

Currently, only the Elasticsearch and MongoDB sinks support JSON messages.
Let's add support for parsing JSON messages in the other sinks as well.

Extend OffsetManager to use for BQ sink

Description :
The AsyncConsumer in the BQ sink sends a list of futures to the OffsetManager. The OffsetManager was implemented with the CloudSink in mind; as part of this story we extend it to be used with the BQ sink as well.

Configuration :

Acceptance Criteria :

  1. Implement a commit strategy for the asynchronous consumer (see the sketch after this list)
  2. Firehose Log sink works with Consumer Mode = Async
  3. No data loss
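
One possible commit strategy is to track offsets per partition and commit only up to the first still-running message; a minimal sketch (class and method names are hypothetical, not the actual OffsetManager API):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    // Tracks per-partition offsets and computes the highest offset that is
    // safe to commit: everything below it has finished processing.
    class CommittableOffsets {
        private final Map<Integer, TreeMap<Long, Boolean>> partitions = new HashMap<>();

        synchronized void track(int partition, long offset) {
            partitions.computeIfAbsent(partition, p -> new TreeMap<>()).put(offset, false);
        }

        synchronized void markDone(int partition, long offset) {
            partitions.computeIfAbsent(partition, p -> new TreeMap<>()).put(offset, true);
        }

        // Highest committable offset: one past the last contiguous completed
        // offset, or -1 if the lowest tracked offset is still in flight.
        synchronized long committable(int partition) {
            long commit = -1;
            for (Map.Entry<Long, Boolean> e : partitions.getOrDefault(partition, new TreeMap<>()).entrySet()) {
                if (!e.getValue()) break;   // stop at the first incomplete offset
                commit = e.getKey() + 1;    // Kafka commits the next offset to read
            }
            return commit;
        }
    }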

Dependency Scan Vulnerabilities - Snyk

Below is the list of vulnerabilities reported by the dependency scan.

Summary

Tested 195 dependencies for known issues, found 127 issues, 479 vulnerable paths.

Issues to fix by upgrading:

The full list of issues is attached in the report below.
scan report.zip

If there is an exact replica of this repo on source.golabs.io, I can also help by raising an MR to fix all of these dependencies; that would make it easier for you to review.
For some reason I am not able to do this in GitLab.

firehose_sink_http_response_code_total metric sends url tag. The tag values can be unbounded.

๐Ÿ› Bug Report

private void captureHttpStatusCount(HttpEntityEnclosingRequestBase httpRequestMethod, HttpResponse response) {
        String urlTag = "url=" + httpRequestMethod.getURI().getPath();
        String statusCode = statusCode(response);
        String httpCodeTag = statusCode.equals("null") ? "status_code=" : "status_code=" + statusCode;
        getInstrumentation().captureCount(SINK_HTTP_RESPONSE_CODE_TOTAL, 1, httpCodeTag, urlTag);
    }

The URL path can be unbounded, which can result in a large number of series.

Expected Behavior

URL tag should not be sent.
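
A minimal sketch of the expected fix, dropping the unbounded url tag so only the bounded status_code tag is reported (a suggestion, not the merged change):

    private void captureHttpStatusCount(HttpResponse response) {
        String statusCode = statusCode(response);
        String httpCodeTag = statusCode.equals("null") ? "status_code=" : "status_code=" + statusCode;
        // The url tag is no longer sent, so the number of series stays bounded.
        getInstrumentation().captureCount(SINK_HTTP_RESPONSE_CODE_TOTAL, 1, httpCodeTag);
    }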

Steps to Reproduce

Steps to reproduce the behavior.

  1. Run firehose
  2. Have a Cortex series limit configured.
  3. Cortex will throw a 400 when trying to push metrics after the series limit is reached.

Environment

Kubernetes

For JDBC sink, connections are recreated with every push

๐Ÿ› Bug Report

The connections are reset after every write to the DB, leading to new connection creation and deletion with every write.

Expected Behavior

The connection should be recycled: the pool should reuse it rather than creating and destroying it on every write.

Steps to Reproduce

  1. Sink the data to a Postgres database with debug logs
  2. You will see logs like this (a sketch of the expected pooling behavior follows)
    [null connection adder] DEBUG com.zaxxer.hikari.pool.HikariPool - null - Added connection org.postgresql.jdbc.PgConnection@5211101e
    [pool-2-thread-1] DEBUG io.odpf.firehose.sink.jdbc.JdbcSink - DB response: [1]
    [pool-2-thread-1] INFO io.odpf.firehose.sink.jdbc.JdbcSink - Pushed 1 messages to db.
    [null connection closer] DEBUG com.zaxxer.hikari.pool.PoolBase - null - Closing connection org.postgresql.jdbc.PgConnection@5211101e: (connection evicted by user)
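
For reference, the expected behavior is to borrow a connection from the pool and return it on close rather than evicting it; a minimal HikariCP sketch (connection parameters and table are placeholders):

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class PooledWriter {
        private final HikariDataSource pool;

        PooledWriter(String jdbcUrl, String user, String password) {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl(jdbcUrl);   // e.g. jdbc:postgresql://localhost:5432/mydb
            config.setUsername(user);
            config.setPassword(password);
            this.pool = new HikariDataSource(config);
        }

        void write(String value) throws SQLException {
            // try-with-resources returns the connection to the pool on close();
            // the next write should reuse it instead of opening a new one.
            try (Connection conn = pool.getConnection();
                 PreparedStatement stmt = conn.prepareStatement("INSERT INTO t (c) VALUES (?)")) {
                stmt.setString(1, value);
                stmt.executeUpdate();
            }
        }
    }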

BQ sink to create/update the destination table/dataset.

Description :
The table is to be created based on the proto. The table is to be created/updated during startup, and schema updates are to be synchronous to avoid errors from BQ rate limiting. Updates can happen from multiple workers in parallel.
Configurations :
Configurations for the destination BQ table.
BQ labels to be applied on the table.
Acceptance Criteria :
Metrics to be captured whenever the schema is updated.
Fail if there are backward-incompatible changes with the destination, and capture metrics for the same.
Table labels to be updated appropriately.

Support asynchronous consumers with automated offset management

Description :
As part of this story, a user should be able to configure an asynchronous mode of consuming messages for the supported sinks. When the configured parallelism for sink processing is insufficient, consumers should wait.

Configurations :
Configuration of the consumption mode.
Number of threads processing batches in parallel.
Time to wait for an empty slot in the executor pool.

Acceptance Criteria :

  1. There's an AsyncConsumer that maintains futures per partition and sends the futures list to the OffsetManager (see the sketch after this list)
  2. If the pool is full, the consumer should wait for a slot to free up.
  3. No data loss.
    Note: The offset manager is out of scope for this story.
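
A minimal sketch of the pool-full wait described in point 2, using a semaphore in front of a fixed thread pool (class and config names are hypothetical, not the actual Firehose consumer):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    class BoundedSinkExecutor {
        private final ExecutorService pool;
        private final Semaphore slots;
        private final long slotWaitMillis;

        BoundedSinkExecutor(int threads, long slotWaitMillis) {
            this.pool = Executors.newFixedThreadPool(threads);
            this.slots = new Semaphore(threads);
            this.slotWaitMillis = slotWaitMillis;
        }

        // Blocks (up to the configured wait) until a slot is free, so the
        // consumer naturally backs off when parallelism is exhausted.
        Future<?> submit(Runnable batch) throws InterruptedException {
            if (!slots.tryAcquire(slotWaitMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("no free slot in executor pool");
            }
            return pool.submit(() -> {
                try {
                    batch.run();
                } finally {
                    slots.release();
                }
            });
        }
    }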

Prometheus/Cortex sink

Create a new Firehose sink that can push metrics in the Prometheus format. The sink we are going to support is Prometheus/Cortex.

Prometheus Remote Write:
Prometheus remote write allows transparently sending samples to a remote endpoint. This is primarily intended for long-term storage.

Why Cortex/Prometheus?

  • Cortex is an open-source time-series database and monitoring system for applications and microservices. Based on Prometheus, Cortex adds horizontal scaling and virtually indefinite data retention.
  • It supports Prometheus write API that can push metrics so we can use it in Firehose.
  • It supports Amazon DynamoDB, Google Bigtable, Cassandra, S3, GCS, and Microsoft Azure for long-term storage of metric data.
  • It offers a global view of Prometheus time-series data that includes data in long-term storage, greatly expanding the usefulness of PromQL for analytical purposes.
  • It can isolate data and queries from multiple different independent Prometheus sources in a single cluster, allowing untrusted parties to share the same cluster.

Support for BigQuery sink Table Clustering

Currently, Firehose already supports table partitioning in the BigQuery sink.

There are cases where partitioning alone is not enough to improve query performance on BigQuery. To cover those cases, we need a feature that also supports clustering on BigQuery tables.

Expected Behaviour (see the sketch after this list):

  • BQ sink should be able to create partitioned and clustered table
  • BQ sink should be able to create clustered tables without partitioned table
  • BQ sink should be able to modify the clustered table
  • BQ sink should be able to modify the non-clustered table
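
A minimal sketch of creating a partitioned and clustered table with the google-cloud-bigquery Java client (dataset, table, and field names are placeholders):

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.Clustering;
    import com.google.cloud.bigquery.Field;
    import com.google.cloud.bigquery.Schema;
    import com.google.cloud.bigquery.StandardSQLTypeName;
    import com.google.cloud.bigquery.StandardTableDefinition;
    import com.google.cloud.bigquery.TableId;
    import com.google.cloud.bigquery.TableInfo;
    import com.google.cloud.bigquery.TimePartitioning;
    import java.util.List;

    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    Schema schema = Schema.of(
            Field.of("event_timestamp", StandardSQLTypeName.TIMESTAMP),
            Field.of("customer_id", StandardSQLTypeName.STRING));
    StandardTableDefinition definition = StandardTableDefinition.newBuilder()
            .setSchema(schema)
            // Daily partitioning on the timestamp column.
            .setTimePartitioning(TimePartitioning.newBuilder(TimePartitioning.Type.DAY)
                    .setField("event_timestamp").build())
            // Cluster rows within each partition by customer_id.
            .setClustering(Clustering.newBuilder().setFields(List.of("customer_id")).build())
            .build();
    bigquery.create(TableInfo.of(TableId.of("my_dataset", "my_table"), definition));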

GCS sink to handle all error scenarios in message/sink errors

Description :
Implement error handling; it needs to be backward compatible: if error info is not configured (is null), the messages need to be retried.

All input message routing needs to happen as per the configuration. The error list should be part of the configuration.

We need to list some errors that are generic enough for all sinks.

Error need to be handled:

  • Deserialization Errors need to be logged
  • Unknown Fields need to be logged

Configurations :
Fail on Deserialization Errors : Default true
Fail on Unknown Fields : Default true

Acceptance Criteria :
Metrics for timeouts to be captured.
Metrics for errors to be captured.
Skipped messages to be captured, but only if the batch is successfully processed (a routing sketch follows).
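
A sketch of the configurable routing described above, assuming a generic error enum (all names here are hypothetical, not the actual Firehose types):

    import java.util.Set;

    enum ErrorType { DESERIALIZATION_ERROR, UNKNOWN_FIELDS_ERROR, SINK_UNKNOWN_ERROR }
    enum Decision { RETRY, DLQ, FAIL }

    class ErrorRouter {
        // Backward compatible: when no error info is attached (null), retry.
        static Decision route(ErrorType error, Set<ErrorType> retryable, Set<ErrorType> dlq) {
            if (error == null) return Decision.RETRY;
            if (retryable.contains(error)) return Decision.RETRY;
            if (dlq.contains(error)) return Decision.DLQ;
            return Decision.FAIL;
        }
    }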

Cloud Storage sink can be configured to write to GCS

Description :
The CloudStorage sink, after converting the proto input to Parquet files, writes them to the configured destination. The cloud storage provider is configurable.

Configuration :
Destination configurations.
Partition field

Acceptance Criteria :
Parquet files to be written to GCS filesystem.

NullPointerException on using repeated field for JSON body template in HTTP sink

๐Ÿ› Bug Report

For the HTTP sink's templatized JSON body, if a repeated field is used for JSON body creation, Firehose silently fails with a NullPointerException while reporting metrics:

java.lang.NullPointerException: null
	at java.time.Duration.between(Duration.java:473)
	at io.odpf.firehose.metrics.StatsDReporter.captureDurationSince(StatsDReporter.java:53)
	at io.odpf.firehose.metrics.Instrumentation.captureSinkExecutionTelemetry(Instrumentation.java:165)
	at io.odpf.firehose.sink.AbstractSink.pushMessage(AbstractSink.java:56)
	at io.odpf.firehose.sinkdecorator.SinkDecorator.pushMessage(SinkDecorator.java:28)

Expected Behavior

Firehose should throw proper errors for unsupported field types.

Steps to Reproduce

Steps to reproduce the behavior.

  1. Use an input proto that has a repeated field
  2. Use that repeated field in the JSON body template
  3. Run firehose

Need Clarification on Protobuf, Kafka and Prometheus interaction

Hello,

Apologies in advance; I was unable to access the Slack linked in the bug report menu. I am trying to connect Kafka to a Prometheus instance and I am confused about how index mapping works with Protobuf.

From the guide:

SINK_PROM_METRIC_NAME_PROTO_INDEX_MAPPING

The mapping of fields and the corresponding proto index which will be set as the metric name on Cortex. This is a JSON field.

Example value: {"2":"tip_amount","1":"feedback_ratings"}
The proto field value with index 2 will be stored as a metric named tip_amount in Cortex, and so on.
Type: required
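
For illustration, the indexes in this mapping refer to proto field numbers; a hypothetical message matching the example value would be:

    syntax = "proto3";

    message FeedbackLog {
      // Field number 1 -> metric "feedback_ratings" per the mapping above.
      float feedback_ratings = 1;
      // Field number 2 -> metric "tip_amount" per the mapping above.
      float tip_amount = 2;
    }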

Would "tip_amount" be a record header? Can firehose handle Kafka messages with variable record list lengths?

Thank you,
Liam

GCS as DLQ sink to be supported & be configurable.

Description : As part of this, multiple DLQ sinks are to be supported: Log sink and GCS, along with the Kafka DLQ sink.
Configuration :
DLQ sink & the corresponding configurations.
Acceptance Criteria :
The GCS DLQ sink should write a single file for every batch & follow the same contract.
Design Consideration :
Change in Sink Response can be carried out as part of this.

BigTable Sink using Depot

Let's add a BigTable sink in Firehose using the BigTable sink connector implementation in the ODPF Depot library.

Tasks -

  • Implement Bigtable sink using Depot library
  • Bigtable sink documentation

Add Troubleshooting section in docs

Add a Troubleshooting section in Firehose documentation containing common runtime problems and their solutions.

Sink-specific troubleshooting as well as generic issues (related to the Stencil client, Kafka consumer, etc.) need to be covered in this section.

Firehose to gRPC sink job failure with error `Uncaught exception in the SynchronizationContext. Panic! java.lang.IllegalStateException: Could not find policy 'pick_first'.`

Nov 14, 2022 5:57:37 PM io.grpc.internal.ManagedChannelImpl$2 uncaughtException
SEVERE: [Channel<1>: (127.0.0.1:6565)] Uncaught exception in the SynchronizationContext. Panic!
java.lang.IllegalStateException: Could not find policy 'pick_first'. Make sure its implementation is either registered to LoadBalancerRegistry 
        or included in META-INF/services/io.grpc.LoadBalancerProvider from your jar files.
        at io.grpc.internal.AutoConfiguredLoadBalancerFactory$AutoConfiguredLoadBalancer.<init>(AutoConfiguredLoadBalancerFactory.java:92)
        at io.grpc.internal.AutoConfiguredLoadBalancerFactory.newLoadBalancer(AutoConfiguredLoadBalancerFactory.java:63)
        at io.grpc.internal.ManagedChannelImpl.exitIdleMode(ManagedChannelImpl.java:406)
        at io.grpc.internal.ManagedChannelImpl$RealChannel$2.run(ManagedChannelImpl.java:972)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.internal.ManagedChannelImpl$RealChannel.newCall(ManagedChannelImpl.java:969)
        at io.grpc.internal.ManagedChannelImpl.newCall(ManagedChannelImpl.java:911)
        at io.grpc.internal.ForwardingManagedChannel.newCall(ForwardingManagedChannel.java:63)
        at io.grpc.stub.MetadataUtils$HeaderAttachingClientInterceptor.interceptCall(MetadataUtils.java:74)
        at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:142)
        at io.odpf.firehose.sink.grpc.client.GrpcClient.execute(GrpcClient.java:59)
        at io.odpf.firehose.sink.grpc.GrpcSink.execute(GrpcSink.java:38)
        at io.odpf.firehose.sink.AbstractSink.pushMessage(AbstractSink.java:46)
        at io.odpf.firehose.sinkdecorator.SinkDecorator.pushMessage(SinkDecorator.java:28)
        at io.odpf.firehose.sinkdecorator.SinkWithFailHandler.pushMessage(SinkWithFailHandler.java:34)
        at io.odpf.firehose.sinkdecorator.SinkDecorator.pushMessage(SinkDecorator.java:28)
        at io.odpf.firehose.sinkdecorator.SinkWithRetry.pushMessage(SinkWithRetry.java:54)
        at io.odpf.firehose.sinkdecorator.SinkDecorator.pushMessage(SinkDecorator.java:28)
        at io.odpf.firehose.sinkdecorator.SinkFinal.pushMessage(SinkFinal.java:28)
        at io.odpf.firehose.consumer.FirehoseSyncConsumer.process(FirehoseSyncConsumer.java:43)
        at io.odpf.firehose.launch.Main.lambda$multiThreadedConsumers$0(Main.java:65)
        at io.odpf.firehose.launch.Task.lambda$run$0(Task.java:49)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Expected Behavior

The Firehose job needs to interact with the gRPC API and receive the response.

Steps to Reproduce

  1. Proto used to reproduce the scenario:
syntax = "proto3";
package io.odpf.dagger.consumer;
option java_multiple_files = true;
option java_package = "io.odpf.dagger.consumer";
option java_outer_classname = "SampleGrpcServerProto";

service TestServer {
  rpc TestRpcMethod (TestGrpcRequest) returns (TestGrpcResponse) {}
}
message TestGrpcRequest {
  string field1 = 1;
  string field2 = 2;
}
message TestGrpcResponse {
  bool success = 1;
  repeated Error error = 2;
  string field3 = 3;
  string field4 = 4;
}
message Error {
  string code = 1;
  string entity = 2;
}
  2. Write a simple gRPC API that expects two fields and returns the response as is.

  3. Run a Firehose job locally with the below properties, which consumes data from local Kafka and uses gRPC as the sink.

java -jar build/libs/firehose-0.4.2.jar

KAFKA_RECORD_PARSER_MODE=message
SINK_TYPE=grpc
INPUT_SCHEMA_PROTO_CLASS=io.odpf.dagger.consumer.TestGrpcRequest
SCHEMA_REGISTRY_STENCIL_ENABLE=false
SOURCE_KAFKA_BROKERS=127.0.0.1:9092
SOURCE_KAFKA_TOPIC=test-grpc-request
SOURCE_KAFKA_CONSUMER_GROUP_ID=sample-grpc-group-id2
SINK_GRPC_SERVICE_HOST=127.0.0.1
SINK_GRPC_SERVICE_PORT=6565
SINK_GRPC_METHOD_URL=io.odpf.dagger.consumer.TestServer/TestRpcMethod
SINK_GRPC_RESPONSE_SCHEMA_PROTO_CLASS=io.odpf.dagger.consumer.TestGrpcResponse

The job fails with the above-mentioned error.

Analysis:
In the current implementation, the gRPC client chooses the default LoadBalancerProvider ('pick_first') and the default NameResolverProvider (DNS), but their implementation classes, PickFirstLoadBalancerProvider and DnsNameResolverProvider respectively, are missing from the classpath.

We were able to solve the issue by registering the implementation classes through the service provider mechanism: create a META-INF/services folder, add a file named io.grpc.LoadBalancerProvider containing io.grpc.internal.PickFirstLoadBalancerProvider, and another file io.grpc.NameResolverProvider containing io.grpc.internal.DnsNameResolverProvider.
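
Concretely, the two provider-configuration files described above would contain (assuming the standard Gradle/Maven resources layout; ServiceLoader files allow '#' comments):

    # src/main/resources/META-INF/services/io.grpc.LoadBalancerProvider
    io.grpc.internal.PickFirstLoadBalancerProvider

    # src/main/resources/META-INF/services/io.grpc.NameResolverProvider
    io.grpc.internal.DnsNameResolverProvider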

Also, if we provide only the io.grpc.LoadBalancerProvider service file and miss the other, we get the below error:

Failed to resolve name. status=Status{code=UNAVAILABLE, description=Failed to initialize xDS, 
cause=io.grpc.xds.XdsInitializationException: Cannot find bootstrap configuration
Environment variables searched:
- GRPC_XDS_BOOTSTRAP
- GRPC_XDS_BOOTSTRAP_CONFIG

Java System Properties searched:
- io.grpc.xds.bootstrap
- io.grpc.xds.bootstrapConfig
        at io.grpc.xds.BootstrapperImpl.bootstrap(BootstrapperImpl.java:101)
        at io.grpc.xds.SharedXdsClientPoolProvider.getOrCreate(SharedXdsClientPoolProvider.java:90)
        at io.grpc.xds.XdsNameResolver.start(XdsNameResolver.java:155)
        at io.grpc.internal.ManagedChannelImpl.exitIdleMode(ManagedChannelImpl.java:412)
        at io.grpc.internal.ManagedChannelImpl$RealChannel$2.run(ManagedChannelImpl.java:972)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.internal.ManagedChannelImpl$RealChannel.newCall(ManagedChannelImpl.java:969)
        at io.grpc.internal.ManagedChannelImpl.newCall(ManagedChannelImpl.java:911)
        at io.grpc.internal.ForwardingManagedChannel.newCall(ForwardingManagedChannel.java:63)
        at io.grpc.stub.MetadataUtils$HeaderAttachingClientInterceptor.interceptCall(MetadataUtils.java:74)
        at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:142)
        at io.odpf.firehose.sink.grpc.client.GrpcClient.execute(GrpcClient.java:59)
        at io.odpf.firehose.sink.grpc.GrpcSink.execute(GrpcSink.java:38)
        at io.odpf.firehose.sink.AbstractSink.pushMessage(AbstractSink.java:46)
        at io.odpf.firehose.sinkdecorator.SinkDecorator.pushMessage(SinkDecorator.java:28)
        at io.odpf.firehose.sinkdecorator.SinkWithFailHandler.pushMessage(SinkWithFailHandler.java:34)
        at io.odpf.firehose.sinkdecorator.SinkDecorator.pushMessage(SinkDecorator.java:28)
        at io.odpf.firehose.sinkdecorator.SinkWithRetry.pushMessage(SinkWithRetry.java:54)
        at io.odpf.firehose.sinkdecorator.SinkDecorator.pushMessage(SinkDecorator.java:28)
        at io.odpf.firehose.sinkdecorator.SinkFinal.pushMessage(SinkFinal.java:28)
        at io.odpf.firehose.consumer.FirehoseSyncConsumer.process(FirehoseSyncConsumer.java:43)
        at io.odpf.firehose.launch.Main.lambda$multiThreadedConsumers$0(Main.java:65)
        at io.odpf.firehose.launch.Task.lambda$run$0(Task.java:49)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Redis TTL is not set for keyvalue and hashset data structures

๐Ÿ› Bug Report

TTL is not set for the hashset and keyvalue Redis data types.

Expected Behavior

TTL should be set

Steps to Reproduce

Steps to reproduce the behavior.
  1. Set SINK_REDIS_TTL_TYPE: DURATION|EXACT_TIME
  2. Set SINK_REDIS_TTL_VALUE: 1000 (in seconds) | <unix_timestamp_in_future>
  3. Run the Firehose (a sketch of the expected TTL behavior follows)
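
For reference, the expected behavior is to apply an expiry right after the write; a minimal Jedis sketch (key, field, and TTL values are placeholders):

    import redis.clients.jedis.Jedis;

    try (Jedis jedis = new Jedis("localhost", 6379)) {
        // Hashset entry written by the sink...
        jedis.hset("driver:42", "rating", "4.8");
        // ...should be followed by an expiry matching SINK_REDIS_TTL_VALUE.
        jedis.expire("driver:42", 1000L);   // DURATION: TTL in seconds
        // For EXACT_TIME, an absolute unix timestamp would be used instead:
        // jedis.expireAt("driver:42", unixTimestampInFuture);
    }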

BQ sink to write to the configured destination.

Description : As part of this, all records are to be written to the destination table, with error handling.
Acceptance criteria :

  • Records should be written to BQ without any data loss if there are no errors.
  • Check what exceptions should be thrown.
  • All errors should be handled and set in the response, to be retried or DLQ'd.

Cloud Storage Sink in Firehose to write Parquet Files

Description :
As part of this we introduce a Cloud Storage (CS) sink in Firehose which creates parquet-mr files and writes them to the local filesystem, to be rotated based on the configured size/time.

Configuration :
Destination configurations.
Partition field

Acceptance Criteria :

  1. Parquet files to be written with the metadata {topic, offset, partition}.
  2. Parquet files to be rotated based on size, defaulting to 256 MB.
  3. Parquet files to be rotated based on time as well, defaulting to one hour (see the rotation sketch after this list).
  4. No cleanup is necessary for Parquet files.
  5. Parquet files to be created in the configured destination path.
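
A minimal sketch of the size/time rotation check (thresholds mirror the defaults above; the class name is hypothetical):

    import java.time.Duration;
    import java.time.Instant;

    class RotationPolicy {
        private static final long MAX_BYTES = 256L * 1024 * 1024;     // 256 MB default
        private static final Duration MAX_AGE = Duration.ofHours(1);  // hourly default

        // A file is rotated when it crosses the size threshold OR has been
        // open longer than the configured duration, whichever comes first.
        static boolean shouldRotate(long currentBytes, Instant openedAt) {
            return currentBytes >= MAX_BYTES
                    || Duration.between(openedAt, Instant.now()).compareTo(MAX_AGE) >= 0;
        }
    }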

GCS sink to be optimized by parallelizing the parquet-creation / uploads.

Description :
As part of this, the GCS sink is to be optimized so that creation of Parquet files and uploading to GCS happen in parallel, without blocking each other.
Acceptance Criteria :
GCS sinks to handle commits by themselves.
Design Considerations :
SinkFactory to create sinks and fail if the async consumer is configured.
SinkFactory to decide whether commits are auto-managed by the sink.
Refactor SinkFactory & FirehoseConsumerFactory.

End to end verification of no data loss in GCS sinks.

  1. A given message is missing.
  2. The number of messages is lower on a given day.

Pre-requisite :

  1. Code fixes are done from the review comments.

Outcome:

  1. Capture any metrics that will help us answer/confirm that there is no issue in the deployment.
  2. SOPs to verify there are no missing records.
  3. Dashboards to check that metrics are coming in.
  4. Deploy on Kubernetes.
  5. Load testing.

Deprecate jaeger tracing in Firehose

WHAT ?

Remove dependency of jaeger-client from Firehose along with usages anywhere in the code.

WHY ?

As mentioned on their GitHub page, the Jaeger clients are being deprecated and users are recommended to move to the OpenTelemetry APIs and SDKs.

Announcement on Github handle
Announcement on their documentation
Issue on Github for the deprecation

Firehose, as of release 0.2, has a dependency on jaeger-client. However, tracing is not used actively in Firehose in production and hence this dependency can be removed safely.

Is there a deadline ?

As per the notice on jaeger-tracing :

We plan to continue accepting pull requests and making new releases of Jaeger clients through the end of 2021. In January 2022 we will enter a code freeze period for 6 months, during which we will no longer accept pull requests with new features, with the exception of security-related fixes. After that we will archive the client library repositories and will no longer accept new changes.

Add instrumentation and logging to GCS sink.

Acceptance criteria:

  1. Analysis of metrics for GCS sink.
  2. Implementation.
  3. Tracing.

Metrics:

  1. record_write_count, tags: filename (partition+uuid)
  2. file_open_total
  3. file_closed_total, tags: success (true/false)
  4. file_closing_time_milliseconds
  5. file_size_bytes_total
  6. file_upload_total, tags: success (true/false)
  7. file_upload_time_milliseconds
  8. file_upload_bytes
Discussion:

  1. Distribution of the file size.
  2. Upload time.
  3. Successes/failures of uploads.
  4. How many open files there are.
  5. Time taken to close Parquet files.
  6. Messages read / messages per Parquet file.
  7. Number of error messages or DLQ'd messages.
  8. Think about completeness/freshness/deduplication.

Deprecate config SINK_HTTP_PARAMETER_SCHEMA_PROTO_CLASS

WHAT ?

Deprecate config SINK_HTTP_PARAMETER_SCHEMA_PROTO_CLASS

WHY ?

For a Firehose HTTP sink configured with a header or query parameter source (that is, SINK_HTTP_PARAMETER_SOURCE != disabled), the proto class used for parsing the incoming Kafka message during request creation is configured via SINK_HTTP_PARAMETER_SCHEMA_PROTO_CLASS.

This is confusing, as there is already a config INPUT_SCHEMA_PROTO_CLASS which specifies the proto class used for parsing the incoming Kafka message.

Ideally, we would like to keep a single variable which denotes this.
