acquia / fifo2kinesis

Continuously reads data from a named pipe and publishes it to a Kinesis stream.

License: MIT License

Languages: Go 98.55%, Shell 1.45%

FIFO to Kinesis Pipeline

This app continuously reads data from a named pipe (FIFO) and publishes it to a Kinesis stream.

fifo2kinesis cli demo

Why?

FIFOs are a great way to send data from one application to another. Having an open pipe that ships data to Kinesis facilitates a lot of interesting use cases. One such example is using the named pipe support in rsyslog and syslog-ng to send log streams to Kinesis.

Admittedly, a handful of lines of bash using the AWS CLI could achieve the same result. However, fifo2kinesis is designed to reliably handle large volumes of data: it makes good use of Go's concurrency primitives, buffers and batch-publishes the data read from the FIFO, and handles failures in a way that can tolerate network and AWS outages.

Installation

Either download the latest binary for your platform, or run the following command in the project's root to build the fifo2kinesis binary from source:

GOPATH=$PWD go build -o ./bin/fifo2kinesis fifo2kinesis

Usage

Create a named pipe:

mkfifo ./kinesis.pipe

Run the app:

./bin/fifo2kinesis --fifo-name=$(pwd)/kinesis.pipe --stream-name=my-stream

Write to the FIFO:

echo "Streamed at $(date)" > kinesis.pipe

The line will be published to the my-stream Kinesis stream within the default flush interval of 5 seconds.

Quick start for the impatient among us

If you are impatient like me and want your oompa loompa now, modify the --buffer-queue-limit, --flush-interval, and --flush-handler options so that what you send to the FIFO is written to STDOUT immediately instead of a buffered write to Kinesis. This doesn't do much, but it provides immediate gratification and shows how the app works when you play with the options.

./bin/fifo2kinesis --fifo-name=$(pwd)/kinesis.pipe --buffer-queue-limit=1 --flush-interval=0 --flush-handler=logger

Configuration

Configuration is read from command line options and environment variables in that order of precedence. The following options and env variables are available:

  • --fifo-name, FIFO2KINESIS_FIFO_NAME: The absolute path of the named pipe.
  • --stream-name, FIFO2KINESIS_STREAM_NAME: The name of the Kinesis stream.
  • --partition-key, FIFO2KINESIS_PARTITION_KEY: The partition key, a random string if omitted.
  • --buffer-queue-limit, FIFO2KINESIS_BUFFER_QUEUE_LIMIT: The number of items that trigger a buffer flush.
  • --failed-attempts-dir, FIFO2KINESIS_FAILED_ATTEMPTS_DIR: The directory that logs failed attempts for retry.
  • --flush-interval, FIFO2KINESIS_FLUSH_INTERVAL: The number of seconds before the buffer is flushed.
  • --flush-handler, FIFO2KINESIS_FLUSH_HANDLER: Defaults to "kinesis", use "logger" for debugging.
  • --region, FIFO2KINESIS_REGION: The AWS region that the Kinesis stream is provisioned in.
  • --role-arn, FIFO2KINESIS_ROLE_ARN: The ARN of the AWS role being assumed.
  • --role-session-name, FIFO2KINESIS_ROLE_SESSION_NAME: The session name used when assuming a role.
  • --debug, FIFO2KINESIS_DEBUG: Show debug level log messages.

The application also requires credentials to publish to the specified Kinesis stream. It uses the same configuration mechanism as the AWS CLI tool, minus the command line options.
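For example, assuming the default credential chain, a minimal ~/.aws/credentials file (with placeholder values) is picked up automatically:

```ini
# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```

Environment variables such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY work as well, just as they do for the AWS CLI.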

Running With Upstart

Use Upstart to start fifo2kinesis during boot and supervise it while the system is running. Add a file to /etc/init with the following contents, replacing /path/to and my-stream according to your environment.

description "FIFO to Kinesis Pipeline"
start on runlevel [2345]

respawn
respawn limit 3 30
post-stop exec sleep 5

exec /path/to/fifo2kinesis --fifo-name=/path/to/named.pipe --stream-name=my-stream --region=us-east-1

Publishing Logs From Syslog NG

NOTE: You might also want to check out fluentd and the Amazon Kinesis Agent. You won't find an argument in this README as to why you should choose one over the other; I just want to make sure you have all the options in front of you so that you can make the best decision for your specific use case.

Syslog NG provides the capability to use a named pipe as a destination. Use fifo2kinesis to read log messages from the FIFO and publish them to Kinesis.

Make a FIFO:

mkfifo /var/syslog.pipe

Modify the syslog-ng configuration to send logs to the named pipe. For example, on Ubuntu 14.04 create a file named /etc/syslog-ng/conf.d/01-kinesis.conf with the following configuration:

destination d_pipe { pipe("/var/syslog.pipe"); };
log { source(s_src); destination(d_pipe); };

Start the app:

./bin/fifo2kinesis --fifo-name=/var/syslog.pipe --stream-name=my-stream

Restart syslog-ng:

service syslog-ng restart

The log stream will now be published to Kinesis.

Development

fifo2kinesis uses Glide to manage dependencies.

Tests

Run the following commands to run tests and generate a coverage report:

GOPATH=$PWD go test -coverprofile=build/coverage.out fifo2kinesis
GOPATH=$PWD go tool cover -html=build/coverage.out

fifo2kinesis's People

Contributors: cpliakas, beejeebus

fifo2kinesis's Issues

Add an assumed role option

Sometimes the Kinesis stream you want to publish to is in a different AWS account. You therefore need to assume a role in order to access it. This app should provide an option to be able to assume a role in order to support this use case.

App hangs when it encounters a read error

The app should shut down since it can no longer read from the fifo; instead it logs a CRIT and then hangs until ctrl+c is pressed. To replicate, pass in a non-existent fifo.

Implement a file buffer

We currently only have a memory buffer. A file buffer would consume more resources, but it would have less chance of losing data if the app or system crashes before the buffer is flushed.

Implement a buffer size limit

This app reads from the fifo as fast as it can so that applications writing to it aren't blocked. The channel that stores buffer chunks has a buffer size of 100, so theoretically we store up to ~ 500MB in memory, assuming that the fifo is flooded with the maximum size of requests (5MB max for 500 records, see #10 for more details on how we get these numbers).

There should be a --buffer-size-limit option, a multiple of 5MB, that makes this value configurable.

Add the ability to fetch the stack name from a URL

Because you cannot increase the number of shards in a Kinesis stream, scaling up requires creating a new resource. In order to facilitate autodiscovery of the new resource, it might help to be able to specify a URL that fifo2kinesis can check periodically to get the stream name. Thinking of a --stream-name-url option or something similar.

Flush interval values other than 0 and 5 don't behave as expected

Here is the relevant code snippet:

if w.FlushInterval > 0 {
    go func() {
        for {
            time.Sleep(time.Second * 5)
            forceFlush <- true

            // Send a flush command to unblock the fifo read in case no
            // lines are being written to the fifo. This command is
            // ignored below, the forceFlush channel is what matters.
            w.Fifo.SendCommand("flush")
        }
    }()
}

Notice the time.Sleep(time.Second * 5).

Define and document error handling

The docs say:

The application exits immediately with a non-zero exit code on all AWS errors

This is no longer true. We should define the error handling policy and document accordingly.

Make separate write requests once we have 5MB of data

Follow-up to #2. The buffer will flush when 500 lines have been received, however the data could exceed 5MB which would make the request fail. This is an unlikely scenario for our use case, however we should keep it on our radar to harden this app.

Add duplicate file detection when creating retry files

The filename is constructed from a timestamp and a random string, so collisions should practically never happen. However, it is pretty easy to loop a fixed number of times and re-generate the random string to ensure that the file is created. This is similar to how Go creates temp files.

Add a --region option

You can configure the region via the AWS_REGION environment variable, but it would be good to be able to configure it via the command line as the environment's region might not match the region that the stream is in.

Implement a partition key strategy

Right now we have a dummy partition key for testing, but we should implement some strategy so that this library can be used with multiple shards. The primary use case that spawned this library is to send log messages to Kinesis, so maybe the following logs would split into 3 partition keys:

Jan 11 17:26:04 localhost jenkins: INFO: plexus-rds-hourly-backup #1448 main build action completed: SUCCESS
Jan 11 17:27:19 localhost dhclient: DHCPREQUEST of 10.40.10.4 on eth0 to 10.40.10.1 port 67 (xid=0x7697c2c5)
Jan 11 17:27:19 localhost dhclient: DHCPACK of 10.40.10.4 from 10.40.10.1
Jan 11 17:27:19 localhost dhclient: bound to 10.40.10.4 -- renewal in 1620 seconds.
Jan 11 17:29:10 localhost sudo: pam_unix(sudo:session): session closed for user root

localhost:jenkins, localhost:dhclient, and localhost:sudo. Obviously localhost needs to change.

Determine how to handle lines that exceed 1MB

The maximum size for a record (plus partition key) is 1MB. We should figure out how to handle lines that exceed 1MB so that we don't send a big request that we know will fail (and thus be captured in our retry system).

Implement a write buffer

Right now, one message = one put request to Kinesis. This is fine for our use case, but as the volume of logs increases we will likely need to buffer put requests so that we can send them in batch.

A better way to close your bufio.Scanner

I noticed this:

https://github.com/acquia/fifo2kinesis/blob/master/src/fifo2kinesis/fifo.go#L72-L78

So, the way I've handled this is to open a file and launch a goroutine to which you pass that filehandle. The bufio.Scanner is then created around that filehandle (e.g. scanner := bufio.NewScanner(filehandle)) inside the child goroutine. When the parent routine sees that it's time to shut down (via context.Context or whatever), the parent routine closes the filehandle, and that causes the scanner.Scan() in the child goroutine to abort.

Something like this:

file, err := os.OpenFile(f.Name, os.O_RDONLY, os.ModeNamedPipe)
if err != nil {
	return err
}

sigs := make(chan os.Signal, 1)
readerDone := make(chan struct{}, 1)
signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

go readLog(file, readerDone)

select {
case <-sigs:
	// We received a SIGINT or SIGTERM, so we close the filehandle to
	// abort the scanner, cancel our context (cancel comes from a
	// context.WithCancel set up earlier), and break out of this loop
	// to wait for the workers to finish.
	log.Println("Exiting...")
	file.Close()
	cancel()
case <-readerDone:
	// Our reader finished up (hit EOF or got an error), so we cancel
	// our context and break out of the loop to wait for workers to
	// finish.
	cancel()
}

func readLog(f *os.File, readerDone chan struct{}) {
	defer close(readerDone)
	scanner := bufio.NewScanner(f)

	for scanner.Scan() {
		line := scanner.Text()
		// Do something with the line.
		_ = line
	}

	if err := scanner.Err(); err != nil {
		log.Println("error reading:", err)
	}
}

Add better error handling to retry mechanism

For example:

  • What if writing certain lines back to the fifo fails?
  • What if there are scanner errors, and the file was only partially read?
  • What if we cannot remove the retry file?

Marking as an enhancement, since we made a conscious decision to accept these risks: the reward of having retry handling, even without covering these what-ifs, is worth it.
