
mailmahee / flume_filtering

This project forked from ashrithr/flume_filtering


Filter HTTP log events based on status_codes using Interceptors and ChannelSelectors


# Forwarding log events to HDFS using Flume

This use case filters HTTP log events received from an Apache web server (httpd). It does the following:

  • Reads logs from web servers using an exec source
  • Tags each log event with its status code (200, 404, 503) using interceptors, which add the status code to the header of the Flume event
  • Routes events to different channels based on the event's status_code header using a multiplexing channel selector; if no matching status code is found, the agent falls back to the default channel
  • Sinks pick up events from their assigned channels and forward them to the respective directories in HDFS using the HDFS sink

## About Flume

Flume is a distributed service for efficiently collecting, aggregating, and moving large amounts of data to a centralised data store. Its architecture is based on streaming data flows, and it uses a simple, extensible data model that allows for online analytic applications. It is robust and fault tolerant, with tuneable reliability mechanisms and many failover and recovery mechanisms.

A unit of data in Flume is called an event, and events flow through one or more Flume agents to reach their destination. An event has a byte payload and an optional set of string attributes. An agent is a Java process that hosts the components through which events flow. The components are a combination of sources, channels, and sinks.

A Flume source consumes events delivered to it by an external source. When a source receives an event, it stores it in one or more Flume channels. A channel is a passive store that keeps the event until it is consumed by a Flume sink. The sink removes the event from the channel and puts it into an external repository (e.g. HDFS or HBase) or forwards it to the source of the next agent in the flow. The source and sink within a given agent run asynchronously, with the events staged in the channel.
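
The staging role of the channel can be illustrated with a small, self-contained sketch. It does not use the Flume libraries: a bounded `BlockingQueue` stands in for the channel, and two threads stand in for the source and the sink. The class name `AgentFlowSketch`, the capacity of 2, and the in-memory "repository" list are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustration only: a bounded queue plays the role of a Flume channel.
class AgentFlowSketch {

    // Moves every line from a "source" thread to a "sink" thread through the channel.
    static List<String> run(List<String> lines) {
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(2); // hypothetical capacity
        List<String> repository = new ArrayList<>();                 // stands in for HDFS

        // Source: blocks on put() when the channel is full.
        Thread source = new Thread(() -> {
            try {
                for (String line : lines) {
                    channel.put(line);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Sink: blocks on take() when the channel is empty.
        Thread sink = new Thread(() -> {
            try {
                for (int i = 0; i < lines.size(); i++) {
                    repository.add(channel.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        sink.start();
        try {
            source.join();
            sink.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the flow", e);
        }
        return repository;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("event-1", "event-2", "event-3")));
    }
}
```

Because the source and sink only meet at the queue, either side can run faster or slower than the other without losing events, which is the decoupling the paragraph above describes.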

Flume agents can be chained together to form multi-hop flows. This allows flows to fan-out and fan-in, and for contextual routing and backup routes to be configured.

For more information, see the Apache Flume User Guide.

### Flume Interceptors

Interceptors are part of Flume's extensibility model. They allow events to be inspected as they pass between a source and a channel, and the developer is free to modify or drop events as required. Interceptors can be chained together to form a processing pipeline.

Interceptors are classes that implement the org.apache.flume.interceptor.Interceptor interface and they are defined as part of a source's configuration.

  • Built-in interceptors allow adding headers such as timestamps, hostname, static markers.
  • Custom interceptors can inspect event payload to create specific headers where necessary.
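
As a rough sketch of what a payload-inspecting interceptor does, the following self-contained example copies the HTTP status code from a log line into a `status_code` header. It deliberately does not implement the real `org.apache.flume.interceptor.Interceptor` interface, so that it runs without Flume on the classpath. The `SimpleEvent` class, the class name, and the regular expression (which assumes the Apache common/combined log format) are assumptions for illustration, not the project's actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: mimics the header-tagging step without depending on Flume.
class StatusCodeInterceptorSketch {

    // Hypothetical stand-in for a Flume event: string headers plus a byte payload.
    static final class SimpleEvent {
        final Map<String, String> headers = new HashMap<>();
        final byte[] body;

        SimpleEvent(String line) {
            this.body = line.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Assumes the status code follows the quoted request line,
    // as in: "GET /index.html HTTP/1.0" 200 2326
    private static final Pattern STATUS = Pattern.compile("\"\\s(\\d{3})\\s");

    // Copies the status code from the payload into the "status_code" header.
    // Events without a recognizable status code pass through unchanged.
    static SimpleEvent intercept(SimpleEvent event) {
        String line = new String(event.body, StandardCharsets.UTF_8);
        Matcher m = STATUS.matcher(line);
        if (m.find()) {
            event.headers.put("status_code", m.group(1));
        }
        return event;
    }

    public static void main(String[] args) {
        SimpleEvent e = intercept(new SimpleEvent(
            "127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] \"GET /index.html HTTP/1.0\" 404 2326"));
        System.out.println(e.headers); // {status_code=404}
    }
}
```

A real interceptor would apply the same logic inside the interface's `intercept` methods and be registered in the source's configuration.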

For more information, see Flume Interceptors.

### Flume Channel Selectors

A channel selector chooses one or more channels from all configured channels, based on preset criteria.

Built-in Channel Selectors:

  • Replicating: for duplicating the events
  • Multiplexing: for routing based on the event headers (added by interceptors)
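
The difference between the two selectors can be sketched in a few lines. This is a self-contained illustration that does not use Flume's classes. The channel names and the mapping table are hypothetical; only the `status_code` header name and the 200/404/503 codes come from this use case.

```java
import java.util.List;
import java.util.Map;

// Illustration only: models the routing decisions, not Flume's implementation.
class ChannelSelectorSketch {

    // Hypothetical channel names, one per status code of interest.
    static final Map<String, String> MAPPING = Map.of(
        "200", "ch_success",
        "404", "ch_not_found",
        "503", "ch_unavailable");

    static final String DEFAULT_CHANNEL = "ch_default";

    static final List<String> ALL_CHANNELS =
        List.of("ch_success", "ch_not_found", "ch_unavailable", "ch_default");

    // Replicating: every event is copied to every configured channel.
    static List<String> replicating(Map<String, String> headers) {
        return ALL_CHANNELS;
    }

    // Multiplexing: the header value picks the channel; unmapped or missing
    // values fall back to the default channel.
    static List<String> multiplexing(Map<String, String> headers) {
        String code = headers.get("status_code");
        String channel = (code == null)
            ? DEFAULT_CHANNEL
            : MAPPING.getOrDefault(code, DEFAULT_CHANNEL);
        return List.of(channel);
    }

    public static void main(String[] args) {
        System.out.println(multiplexing(Map.of("status_code", "404"))); // [ch_not_found]
        System.out.println(multiplexing(Map.of("status_code", "301"))); // [ch_default]
        System.out.println(replicating(Map.of("status_code", "404")).size()); // 4
    }
}
```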

### Flume Sink Processors

A sink processor is responsible for invoking one sink from a specified group of sinks. Sink processors can be used to provide load balancing over all sinks in the group, or to fail over from one sink to another in case of a temporary failure.

Built-in Sink Processors:

  • Load Balancing Sink Processor: provides the ability to load-balance flow over multiple sinks. It maintains an indexed list of active sinks over which the load is distributed. The implementation supports distributing load using either round_robin or random selection.
  • Failover Sink Processor: maintains a prioritized list of sinks, guaranteeing that as long as one is available, events will be processed (delivered).
  • Default Sink Processor: accepts only a single sink. The user is not forced to create a processor (sink group) for a single sink and can instead follow the source -> channel -> sink pattern.
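
The two group strategies can be sketched as follows. This is a self-contained illustration rather than Flume's code: the `Sink` class, its `available` flag, and the sink names are hypothetical, and the sketch assumes that a larger priority value means a more preferred sink.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustration only: models sink selection, not Flume's sink processor classes.
class SinkProcessorSketch {

    // Hypothetical sink with a priority and an availability flag.
    static final class Sink {
        final String name;
        final int priority;
        boolean available = true;

        Sink(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    // Failover: pick the highest-priority sink that is currently available.
    static Optional<Sink> failover(List<Sink> sinks) {
        return sinks.stream()
            .filter(s -> s.available)
            .max(Comparator.comparingInt((Sink s) -> s.priority));
    }

    // Load balancing (round robin): rotate through the sinks, skipping
    // any that are currently unavailable.
    static final class RoundRobin {
        private int next = 0;

        Optional<Sink> select(List<Sink> sinks) {
            for (int i = 0; i < sinks.size(); i++) {
                Sink candidate = sinks.get((next + i) % sinks.size());
                if (candidate.available) {
                    next = (next + i + 1) % sinks.size();
                    return Optional.of(candidate);
                }
            }
            return Optional.empty(); // no sink can accept events right now
        }
    }

    public static void main(String[] args) {
        Sink primary = new Sink("primary", 10);
        Sink backup = new Sink("backup", 5);
        List<Sink> group = List.of(primary, backup);

        System.out.println(failover(group).get().name); // primary
        primary.available = false;
        System.out.println(failover(group).get().name); // backup
    }
}
```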

## Testing it out

### Get some logs

Use the http-events generator from cloudwick-labs to generate logs to a path:

    cd ~ && git clone --recursive https://github.com/cloudwicklabs/datagenerators.git
    cd datagenerators/http_events
    mkdir /var/logs.flume
    ruby random_log_gen.rb -f /var/logs.flume/apache.log

Create an HDFS directory for Flume event storage:

    hadoop fs -mkdir /flume
    hadoop fs -chown [USERNAME] /flume

where USERNAME is the user running the Flume agent.

In the Flume configuration file webserver.conf, change NAMNODE_HOST to the FQDN of your NameNode, then start the Flume agent:

    /usr/lib/flume-ng/bin/flume-ng agent \
      -c /etc/flume-ng/conf/ \
      -f /etc/flume-ng/conf/webserver.conf \
      -n webserver \
      -Dflume.root.logger=DEBUG,console &> logs/flume.log

Contributors: ashrithr
