hurence / logisland

Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink being in the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready to use processors, data sources and sinks are available.

Home Page: https://logisland.github.io

License: Other

Scala 3.58% Shell 0.56% Java 39.57% Makefile 0.05% Python 30.77% Roff 23.68% HTML 0.26% CSS 0.10% JavaScript 0.76% XSLT 0.55% Dockerfile 0.05% Clojure 0.07%

big-data stream-processing kafka spark analytics complex-event-processing pattern-recognition kafka-streams elasticsearch cassandra influxdb solr

logisland's Introduction

Logisland

Download the latest release build and chat with us on gitter

LogIsland is an event mining scalable platform designed to handle a high throughput of events.

It is heavily inspired by dataflow programming tools such as Apache NiFi, but with a highly scalable architecture.

LogIsland is completely open source and free even for commercial use. Hurence provides support if required.

Event mining Workflow

Here is an example of a typical event mining pipeline.

Raw events (sensor data, logs, user click streams, ...) are sent to Kafka topics by a NiFi / Logstash / *Beats / Flume / Collectd (or any other) agent.
Raw events are structured into Logisland Records, then processed and eventually pushed back to another Kafka topic by a Logisland streaming job.
Records are sent to external short-lived storage (Elasticsearch, Solr, Couchbase, ...) for online analytics.
Records are sent to external long-lived storage (HBase, HDFS, ...) for offline analytics (aggregated reports or ML models).
Logisland Processors handle Records to produce Alerts and Information from ML models.

Online documentation

You can find the latest Logisland documentation, including a programming guide, on the project web page.

It is also available on this website.

This README file only contains basic setup instructions.

Browse the Java API documentation for more information.

You can follow a getting started guide through the Apache log indexing tutorial.

Building Logisland

To build from source, clone the repository and package it with Maven (Logisland requires Maven 3.5.2 or later).

git clone https://github.com/Hurence/logisland.git
cd logisland
mvn clean package

The final package is available at logisland-assembly/target/logisland-1.4.1-full-bin.tar.gz

You can also download the latest release build

If you want to build with opencv support, please install OpenCV first and then

mvn clean package -Dopencv

Quick start

Local Setup

Alternatively, you can deploy Logisland on any Linux server from which Kafka and Spark are available.

In the commands below, replace every version placeholder with the version you need (Spark version, Logisland version for a specific HDP version, Kafka Scala version, Kafka version, etc.).

The Kafka distributions are available at this address: <https://kafka.apache.org/downloads>

The last tested Scala version for Kafka is 2.11, with Kafka release 0.10.2.2 preferred.

Last tested version of Spark is: 2.3.1 on Hadoop version: 2.7

But you should choose the Spark version that is compatible with your environment and hadoop installation if you have one (for example Spark 2.1.0 on hadoop 2.7). Note that hadoop 2.7 can run Spark 2.4.x, 2.3.x, 2.2.x, 2.1.x. Check at this URL what is available : http://d3kbcqa49mib13.cloudfront.net/

# install Kafka & start a zookeeper node + a broker
curl -s https://www-us.apache.org/dist/kafka/<kafka_release>/kafka_<scala_version>-<kafka_version>.tgz | tar -xz -C /usr/local/
cd /usr/local/kafka_<scala_version>-<kafka_version>
nohup bin/zookeeper-server-start.sh config/zookeeper.properties > zookeeper.log 2>&1 &
JMX_PORT=10101 nohup bin/kafka-server-start.sh config/server.properties > kafka.log 2>&1 &

# install Spark (choose the spark version compatible with your hadoop distrib if you have one)
curl -s http://d3kbcqa49mib13.cloudfront.net/spark-<spark-version>-bin-hadoop<hadoop-version>.tgz | tar -xz -C /usr/local/
export SPARK_HOME=/usr/local/spark-<spark-version>-bin-hadoop<hadoop-version>

# install Logisland (example URL; use the release tag and archive name matching your version)
curl -sL https://github.com/Hurence/logisland/releases/download/v1.0.0-RC2/logisland-1.0.0-RC2-bin.tar.gz | tar -xz -C /usr/local/
cd /usr/local/logisland-1.4.1

# launch a logisland job
bin/logisland.sh --conf conf/index-apache-logs.yml

You can find some Logisland job configuration samples under the $LOGISLAND_HOME/conf folder.

Docker setup

The easiest way to start is to launch a Docker Compose stack.

# launch logisland environment
cd /tmp
curl -s https://raw.githubusercontent.com/Hurence/logisland/master/logisland-framework/logisland-resources/src/main/resources/conf/docker-compose.yml > docker-compose.yml
docker-compose up

# sample execution of a logisland job
docker exec -i -t logisland bin/logisland.sh --conf conf/index-apache-logs.yml

Hadoop distribution setup

Launching Logisland streaming apps is as easy as unarchiving the Logisland distribution on an edge node, editing a config with YARN parameters and submitting the job.

# install Logisland (example URL; use the release tag and archive name matching your version and HDP distribution)
curl -sL https://github.com/Hurence/logisland/releases/download/v0.10.0/logisland-1.4.1-bin-hdp2.5.tar.gz | tar -xz -C /usr/local/
cd /usr/local/logisland-1.4.1
bin/logisland.sh --conf conf/index-apache-logs.yml

Start a stream processing job

A Logisland stream processing job is made of several components: at least one streaming engine and one or more stream processors. You set them up in a YAML configuration file.

Please note that events are serialized against an Avro schema while transiting through any Kafka topic. Every spark.streaming.batchDuration (time window), each processor handles its batch of Records and may generate new Records to the output topic.

The following configuration.yml file contains a sample job that parses raw Apache logs and sends them to Elasticsearch.

The first part is the ProcessingEngine configuration (here a Spark streaming engine)

version: 1.4.1
documentation: LogIsland job config file
engine:
  component: com.hurence.logisland.engine.spark.KafkaStreamProcessingEngine
  type: engine
  documentation: Index some apache logs with logisland
  configuration:
    spark.app.name: IndexApacheLogsDemo
    spark.master: yarn-cluster
    spark.driver.memory: 1G
    spark.driver.cores: 1
    spark.executor.memory: 2G
    spark.executor.instances: 4
    spark.executor.cores: 2
    spark.yarn.queue: default
    spark.yarn.maxAppAttempts: 4
    spark.yarn.am.attemptFailuresValidityInterval: 1h
    spark.yarn.max.executor.failures: 20
    spark.yarn.executor.failuresValidityInterval: 1h
    spark.task.maxFailures: 8
    spark.serializer: org.apache.spark.serializer.KryoSerializer
    spark.streaming.batchDuration: 4000
    spark.streaming.backpressure.enabled: false
    spark.streaming.unpersist: false
    spark.streaming.blockInterval: 500
    spark.streaming.kafka.maxRatePerPartition: 3000
    spark.streaming.timeout: -1
    spark.streaming.kafka.maxRetries: 3
    spark.streaming.ui.retainedBatches: 200
    spark.streaming.receiver.writeAheadLog.enable: false
    spark.ui.port: 4050
  controllerServiceConfigurations:

Then comes a list of ControllerService entries, which are the shared components that interact with the outside world (Elasticsearch, HBase, ...)

- controllerService: datastore_service
  component: com.hurence.logisland.service.elasticsearch.Elasticsearch_6_6_2_ClientService
  type: service
  documentation: elasticsearch service
  configuration:
    hosts: sandbox:9200
    batch.size: 5000

Then comes a list of RecordStream entries, each of which routes the input batch of Records through a pipeline of Processors to the output topic

streamConfigurations:
  - stream: parsing_stream
    component: com.hurence.logisland.stream.spark.KafkaRecordStreamParallelProcessing
    type: stream
    documentation: a processor that converts raw apache logs into structured log records
    configuration:
      kafka.input.topics: logisland_raw
      kafka.output.topics: logisland_events
      kafka.error.topics: logisland_errors
      kafka.input.topics.serializer: none
      kafka.output.topics.serializer: com.hurence.logisland.serializer.KryoSerializer
      kafka.error.topics.serializer: com.hurence.logisland.serializer.JsonSerializer
      kafka.metadata.broker.list: sandbox:9092
      kafka.zookeeper.quorum: sandbox:2181
      kafka.topic.autoCreate: true
      kafka.topic.default.partitions: 4
      kafka.topic.default.replicationFactor: 1

Then come the configurations of all the Processors in the pipeline. Each Record will go through these components. Here we first parse raw Apache logs and then we add those records to Elasticsearch. Please note that the datastore processor makes use of the previously defined ControllerService.

processorConfigurations:

  - processor: apache_parser
    component: com.hurence.logisland.processor.SplitText
    type: parser
    documentation: a parser that produce records from an apache log REGEX
    configuration:
      record.type: apache_log
      value.regex: (\S+)\s+(\S+)\s+(\S+)\s+\[([\w:\/]+\s[+\-]\d{4})\]\s+"(\S+)\s+(\S+)\s*(\S*)"\s+(\S+)\s+(\S+)
      value.fields: src_ip,identd,user,record_time,http_method,http_query,http_version,http_status,bytes_out

  - processor: es_publisher
    component: com.hurence.logisland.processor.datastore.BulkPut
    type: processor
    documentation: a processor that indexes processed events in elasticsearch
    configuration:
      datastore.client.service: datastore_service
      default.collection: logisland
      default.type: event
      timebased.collection: yesterday
      collection.field: search_index
      type.field: record_type

Once you've edited your configuration file, you can submit it to the execution engine with the following command:

bin/logisland.sh --conf conf/job-configuration.yml

You should then jump to the tutorials section of the documentation, and continue with the components documentation.

Contributing

To contribute, please follow the Git HubFlow workflow: https://datasift.github.io/gitflow/TheHubFlowTools.html

Please review the Contribution to Logisland guide for information on how to get started contributing to the project.

logisland's Issues

store consumed offsets in Kafka instead of Zookeeper

can be Zookeeper, HBase, ES, Couchbase

much better than the current checkpointing

add processor documentation generation

add event key management in kafka topics

This will be useful for components like HDFSBurner: since all events are in the same topics, we can filter processing on event key characteristics (a groupBy on the RDD, for example).
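
As a rough illustration of the idea, the following Python sketch groups keyed events coming from a shared topic so that a component can process only the keys it cares about. The (key, value) tuples, the key names and the group_by_key helper are hypothetical; in Spark the equivalent would be a groupBy on the RDD.

```python
from collections import defaultdict


def group_by_key(events):
    """Group (key, value) pairs by key, preserving arrival order per key."""
    groups = defaultdict(list)
    for key, value in events:
        groups[key].append(value)
    return dict(groups)


# Hypothetical events sharing one topic, keyed by record type.
events = [
    ("apache_log", "GET /index.html"),
    ("syslog", "sshd: session opened"),
    ("apache_log", "POST /login"),
]
grouped = group_by_key(events)
# A component interested only in apache_log events picks its own group.
print(grouped["apache_log"])
```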

add a remote debugging tutorial

add plugin directory to class path

avoid putting plugins in lib folder

add configuration validation option in logisland.sh

suppress the verbosity of logs in yarn mode

there are too many useless log lines

add a RESTful API for components live update

a REST API will help to monitor and update components properties for parsers, processors and engines.

design the API with Raml
implement it with VertX
or for embedding into Ambari view just implement with JAX-RS (https://github.com/mulesoft/raml-for-jax-rs)

POST component/<COMPONENT_ID>/status?state=RUNNING
POST component/<COMPONENT_ID>/status?state=PAUSE
GET component/<COMPONENT_ID>/status
GET component/<COMPONENT_ID>/metrics
GET component/<COMPONENT_ID>/configuration
POST component/<COMPONENT_ID>/configuration?<PARAM_NAME>=<PARAM_VALUE>
PUT component
...
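
Before picking a framework, the semantics of these endpoints can be prototyped as a plain in-memory model. The Python sketch below is only that: the ComponentRegistry class, its method names and the initial PAUSE state are assumptions for illustration, not an existing Logisland API.

```python
class ComponentRegistry:
    """In-memory model of the proposed endpoints; all names are hypothetical."""

    STATES = {"RUNNING", "PAUSE"}

    def __init__(self):
        self._components = {}

    def put_component(self, component_id, configuration):
        # Models "PUT component"; new components start paused (an assumption).
        self._components[component_id] = {
            "state": "PAUSE",
            "configuration": dict(configuration),
        }

    def set_status(self, component_id, state):
        # Models the POST endpoint that takes ?state=RUNNING or ?state=PAUSE.
        if state not in self.STATES:
            raise ValueError(f"unknown state: {state}")
        self._components[component_id]["state"] = state

    def get_status(self, component_id):
        # Models the GET endpoint returning the component state.
        return self._components[component_id]["state"]

    def set_parameter(self, component_id, name, value):
        # Models POST .../configuration?<PARAM_NAME>=<PARAM_VALUE>.
        self._components[component_id]["configuration"][name] = value

    def get_configuration(self, component_id):
        # Models GET .../configuration; returns a copy to avoid outside mutation.
        return dict(self._components[component_id]["configuration"])


registry = ComponentRegistry()
registry.put_component("apache_parser", {"record.type": "apache_log"})
registry.set_status("apache_parser", "RUNNING")
registry.set_parameter("apache_parser", "record.type", "nginx_log")
print(registry.get_status("apache_parser"), registry.get_configuration("apache_parser"))
```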

add a processor chain class

this class can chain together multiple processors

like GeoIPProcessor => HostAnonymizerProcessor, ...
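
A minimal Python sketch of such a chain follows, assuming each processor is a callable that takes a list of records and returns a list of records. ProcessorChain, tag_country and anonymize_host are hypothetical stand-ins, not Logisland classes.

```python
class ProcessorChain:
    """Hypothetical chain: feeds the output of each processor to the next."""

    def __init__(self, *processors):
        self.processors = processors

    def process(self, records):
        for processor in self.processors:
            records = processor(records)
        return records


def tag_country(records):
    # Stand-in for a GeoIP lookup; a real processor would query a GeoIP database.
    return [dict(r, country="unknown") for r in records]


def anonymize_host(records):
    # Stand-in for host anonymization: mask the host field.
    return [dict(r, host="***") for r in records]


chain = ProcessorChain(tag_country, anonymize_host)
result = chain.process([{"host": "web-01", "src_ip": "10.0.0.1"}])
print(result)
```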

add a full integrated test for components

2 levels

Docker container with all the components
Embedded Kafka server + embedded Elasticsearch

add a check for in/out topics parameters validity

if input or output topic is not set in yaml conf, NullPointerException is thrown
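
One way to fail early with a readable message instead of a NullPointerException is to validate the stream configuration before starting the job. The Python sketch below assumes the configuration is available as a flat dictionary using the kafka.input.topics and kafka.output.topics keys from the sample YAML; validate_topics is a hypothetical helper.

```python
REQUIRED_TOPIC_KEYS = ("kafka.input.topics", "kafka.output.topics")


def validate_topics(stream_configuration):
    """Return a list of error messages; an empty list means the topics are set."""
    errors = []
    for key in REQUIRED_TOPIC_KEYS:
        value = stream_configuration.get(key)
        # Treat both a missing key and a blank value as an error.
        if value is None or not str(value).strip():
            errors.append(f"missing required property: {key}")
    return errors


config = {"kafka.input.topics": "logisland_raw"}
errors = validate_topics(config)
print(errors)
```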

add an autoscaler daemon

Log-island should handle all the scalability burden in background

autoscale kafka partition
manage spark executor-cores and memory in an elastic way

add retention duration to PutElasticsearch

Spark job parameters

Spark job parameters should be handled via a configuration file. For instance, LogParserJob could read its parameters from a config file log-parser.yml located in the conf directory.

merge elasticsearch-shaded and elasticsearch-plugin

add Nifi EL support

Expression language is really powerful for expressing programmatic values for fields

add a config file to auto start parser and indexer jobs

add field auto extractor processor

Many unstructured String records contain structured information that could be automatically inferred by a processor:

JSON blocks
key/value fields in the form of "this is an unstructured field with fieldA=valueA and some other stuff fieldB=valueB"
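
A first approximation of such an extractor can be sketched in Python with the standard json and re modules. This sketch assumes a record is either a whole JSON object or free text containing key=value tokens; the extract_fields helper and the sample strings are hypothetical, and JSON embedded in the middle of a longer string is not handled.

```python
import json
import re

KEY_VALUE = re.compile(r"(\w+)=(\S+)")


def extract_fields(text):
    """Hypothetical extractor: try JSON first, then fall back to key=value pairs."""
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return parsed
    except ValueError:
        pass  # not JSON; fall through to key=value scanning
    return dict(KEY_VALUE.findall(text))


kv = extract_fields("an unstructured field with fieldA=valueA and some other stuff fieldB=valueB")
js = extract_fields('{"user": "frank", "status": 200}')
print(kv, js)
```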

kafka.common.OffsetOutOfRangeException

Testing on ... use case (usr log & parser), it crashes after a while with the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4279.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4279.0 (TID 4279, localhost): kafka.common.OffsetOutOfRangeException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at java.lang.Class.newInstance(Unknown Source)
    at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:282)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:920)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
    at com.hurence.logisland.job.LogParserJob$$anonfun$main$2.apply(LogParserJob.scala:100)
    at com.hurence.logisland.job.LogParserJob$$anonfun$main$2.apply(LogParserJob.scala:98)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: kafka.common.OffsetOutOfRangeException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at java.lang.Class.newInstance(Unknown Source)
    at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:282)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    ... 3 more

add R interpreter processor

R code interpreted in java

add a PutKafka processor

add HDFS burner component

This processor takes all records and sends them to HDFS. Parameters are:

partitioning strategy
compression level
hdfs block size
output format (with serializer) => Avro, CSV, Parquet, ORC ...

add a global logisland.properties

to set the configuration of all external services (Kafka, Spark, HDFS, ...)

handle partitioning with hostid

Add Kafka streams support

For now Logisland only handles the Spark stream processing engine, but Kafka Streams, coming with Kafka 0.10, should simplify dependency management and scalability.

write a LogIsland NIFI MQtt tutorial

add a Documentation generator for plugins

SplitText and Multiline processor should return Record with only raw_content field if there's no REGEX match

can be an optional behavior
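
The requested behavior can be sketched in a few lines of Python. This is an illustration of the proposal only, not the SplitText or Multiline code; the split_text function, its keep_unmatched flag and the toy pattern are assumptions.

```python
import re


def split_text(line, pattern, fields, keep_unmatched=True):
    """Hypothetical sketch of the proposed behavior, not the SplitText processor itself."""
    match = re.match(pattern, line)
    if match:
        return dict(zip(fields, match.groups()))
    if keep_unmatched:
        # Proposed optional behavior: keep the line in a record holding only raw_content.
        return {"raw_content": line}
    return None


pattern = r"(\S+)\s+(\d+)"
fields = ["host", "status"]
matched = split_text("web-01 200", pattern, fields)
unmatched = split_text("garbage", pattern, fields)
dropped = split_text("garbage", pattern, fields, keep_unmatched=False)
print(matched, unmatched, dropped)
```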

add extension/plugin manager

this should be able to isolate a classloader that loads a shaded plugin

add a geoip processor

move docker monolithic container to Docker Compose one

Write a NIFI connector for Spark Streaming job

https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark

Write a tutorial on how to write a plugin

move elastic search indexer outside of this project to the plugins folders

it's difficult to follow elastic search versions movement. And this compatibility should be deferred to a plugin which will follow the ES version branch by branch

migrate to Kafka 0.9

add sampling processor

this processor is able to extract only a few relevant data from a bucket of values

add Python processor

typo on architecture diagram on README

"while they appear" is "while the appear"

add adapter for Nifi plugins

Logisland plugins should be usable in NiFi dataflows

add type checking for SplitText component

deploy artefacts to maven central

use sonatype account

embedded Kafka server leaves a remaining java process after unit tests

typo in logisland-docs\_static\logisland-architecture.png

I guess it should be "while they appear" rather than "while the appear"

add creation of output topics, and ensure the jobs work with a list of output topics

For the moment, only the input topics are created if they do not exist.

Add a foreach for injection in each output topic.

integrate QueryMatcherProcessor

QueryMatcherProcessorTest makes use of DocumentPublisher, which doesn't seem to react as expected: it fails with a timeout exception.

=> test has been commented

generify MultilineSplitBloc component

Ensure source files have a licence header

We could use something like this: https://github.com/sbt/sbt-header to ensure all source files have licences.

EventIndexerJob [IndexAlreadyExistsException]

The index creation happens inside a foreachPartition, so multiple nodes observe that the index does not exist at roughly the same time and each one tries to create it, resulting in an 'IndexAlreadyExistsException'.
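
One possible remedy is to make index creation idempotent: every partition attempts the creation and treats "already exists" as success. The Python sketch below simulates the race with threads against an in-memory stand-in; FakeCluster, IndexAlreadyExists and ensure_index are hypothetical names, not the Elasticsearch client API. Creating the index once on the driver before the foreachPartition would be an alternative.

```python
import threading


class IndexAlreadyExists(Exception):
    """Stand-in for Elasticsearch's IndexAlreadyExistsException."""


class FakeCluster:
    """In-memory stand-in for an Elasticsearch cluster (assumption for this sketch)."""

    def __init__(self):
        self._indices = set()
        self._lock = threading.Lock()

    def create_index(self, name):
        with self._lock:
            if name in self._indices:
                raise IndexAlreadyExists(name)
            self._indices.add(name)

    def indices(self):
        return set(self._indices)


def ensure_index(cluster, name):
    """Create the index, treating 'already exists' as success."""
    try:
        cluster.create_index(name)
        return True   # this caller created it
    except IndexAlreadyExists:
        return False  # another partition won the race; nothing to do


cluster = FakeCluster()
results = []
results_lock = threading.Lock()


def worker():
    created = ensure_index(cluster, "logisland")
    with results_lock:
        results.append(created)


# Eight threads play the role of eight partitions racing to create the index.
threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results.count(True), cluster.indices())
```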

ElasticsearchEventIndexer, bulkLoad function write a confusing log

In the afterBulk function, logger.info(response.buildFailureMessage()) is called unconditionally, writing a confusing message that can lead the reader to think that there was a problem during bulk processing. The message is the following: 'Bulk processor failed: failure in bulk execution:', even when there are no errors.
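
A likely fix is to guard the log call with a failure check (the Elasticsearch Java BulkResponse exposes hasFailures() for this purpose). The Python sketch below shows the intended control flow with a stand-in response class; FakeBulkResponse and after_bulk are hypothetical and only mirror the method names.

```python
import logging

logger = logging.getLogger("bulk")


class FakeBulkResponse:
    """Minimal stand-in for a bulk response; not the Elasticsearch class."""

    def __init__(self, failures):
        self._failures = list(failures)

    def has_failures(self):
        return bool(self._failures)

    def build_failure_message(self):
        return "failure in bulk execution: " + "; ".join(self._failures)


def after_bulk(response):
    """Log a failure message only when the bulk request actually failed."""
    if response.has_failures():
        logger.warning(response.build_failure_message())
        return False
    logger.debug("bulk request completed without failures")
    return True


ok = after_bulk(FakeBulkResponse([]))
failed = after_bulk(FakeBulkResponse(["doc 3: mapping error"]))
print(ok, failed)
```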

add kafka checkpointing

when there's a Driver failure, the job should be able to restart processing at the latest offset
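
The principle can be sketched independently of Kafka and Spark: persist the last processed offset of each partition after every successful batch, and read it back on restart. The Python sketch below uses a local JSON file as the store; the OffsetCheckpoint class, the file name and the offsets are assumptions for illustration, and a real job would use a shared store rather than a local file.

```python
import json
import os
import tempfile


class OffsetCheckpoint:
    """Hypothetical file-backed store of the last processed offset per partition."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            # JSON object keys are strings; convert partitions back to int.
            return {int(p): o for p, o in json.load(f).items()}

    def commit(self, offsets):
        # Write to a temp file, then rename, so a crash never leaves a half-written file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(offsets, f)
        os.replace(tmp, self.path)


path = os.path.join(tempfile.mkdtemp(), "offsets.json")
checkpoint = OffsetCheckpoint(path)
checkpoint.commit({0: 41, 1: 17})          # after a successfully processed batch
restored = OffsetCheckpoint(path).load()   # what a restarted driver would read
resume_from = {p: o + 1 for p, o in restored.items()}
print(resume_from)
```

Committing only after a batch has been fully processed gives at-least-once behavior: a crash between processing and commit replays that batch.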
