ohnosequences / loquat


Another taste of nispero

Home Page: https://github.com/ohnosequences/nispero

License: GNU Affero General Public License v3.0

Scala 100.00%
scala scala-library framework cloud-computing cloud-infrastructure aws amazon-web-services aws-ec2 aws-s3 aws-sqs aws-sns aws-architecture data-analysis scalability pipeline-framework pipelines-as-code


loquat's People

Contributors: eparejatobes, laughedelic

Forkers: digideskio

loquat's Issues

DataMappings revamp

Here are some ideas on how to simplify and improve DataMapping:

1. Simplifying data mapping

We don't need an explicit map of inputs to outputs; instead we can have

  • a set of input records
  • a function from "task" (or input record) to output record

An alternative is to have a list of "tasks" (e.g. samples in MG7) and a function from task to an input dataset (or a set of datasets). This is probably better, because it's more flexible. Think of the MG7 pipeline:

  • currently it defines the inputs for the BLAST step in a very clumsy way, because they are the chunks produced by the split step: we go to the chunks S3 folder, enumerate all objects there and generate the input tasks from them
  • this would be much more straightforward with the proposed input datasets definition (task/sample -> set of inputs)
  • which leads to easier composability of loquats: split + blast piping becomes quite easy (see the sketch after this list):
    • the split loquat has an output mapping: sample -> chunks S3 folder
    • the blast loquat needs to define an input mapping: sample -> set of chunks, which is just a composition of the split output mapping and S3 listObjects.
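
A minimal sketch of that composition (the Sample and S3 types and the listObjects wrapper are hypothetical, not the actual loquat or aws-scala-tools API):

case class Sample(id: String)
case class S3Folder(bucket: String, key: String)
case class S3Object(bucket: String, key: String)

// split loquat: output mapping, sample -> chunks S3 folder
def splitOutput(sample: Sample): S3Folder =
  S3Folder("some-bucket", s"chunks/${sample.id}/")

// hypothetical wrapper over the S3 ListObjects call
def listObjects(folder: S3Folder): List[S3Object] = ???

// blast loquat: input mapping is just the composition of the two
def blastInputs(sample: Sample): List[S3Object] =
  listObjects(splitOutput(sample))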

2. Simplifying output dataset usage

The output record is currently used in the dataProcessing to return all output files corresponding to each record field. This brings two problems:

  • we intend to use the output record with values of two types: S3 location and local file, which is strange and difficult due to the cosas-record limitations
  • returning a record with files is a source of copy-paste mistakes: it does guarantee that you return all output files, but it's too easy to mix up which file goes to which data-key.

We can drop it and use output files in a similar fashion to how input files are used currently: we have the data-keys and the files context, which can tell us where each file is expected to be. So whenever we write to an output file, we just pass the file obtained from the context with the data-key.
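
A rough sketch of that idea (names are hypothetical, not the current loquat API): the context resolves the expected local file for a data-key, so the processing code never builds output paths itself.

import java.io.File

trait OutputContext {
  // resolve the expected local file for a given output data-key
  def file(dataKeyLabel: String): File
}

def writeStats(ctx: OutputContext): Unit = {
  val statsFile = ctx.file("stats")  // the context knows where this file should be
  // ... write the results directly to statsFile
}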

3. Simplifying input dataset usage

The only reason we're using the datasets library now is that we want a common type for two kinds of input "resources": an S3 address and a simple message/string. I think it's quite easy to achieve this without the inconvenient datasets dependency:

  • the input data-key has two apply methods (as a Type or Data): one that takes a String, another that takes an AnyS3Address, which is just converted to its URL string.
  • on receiving such an input, the worker tries to parse an S3 address and either downloads from it, or just writes the string to a file (see the sketch below).

This is also needed for #70.
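
A rough sketch of the two apply methods and the worker-side handling (the types and names are hypothetical; only the URL-string convention from above is assumed):

case class S3Address(bucket: String, key: String) {
  def url: String = s"s3://${bucket}/${key}"
}

case class InputKey(label: String) {
  def apply(message: String): (String, String)    = label -> message
  def apply(address: S3Address): (String, String) = label -> address.url
}

// on the worker: if the value parses as an S3 URL, download it;
// otherwise just write the plain string to the input file
def resolveInput(value: String): Either[String, S3Address] =
  if (value.startsWith("s3://")) {
    val (bucket, key) = value.stripPrefix("s3://").span(_ != '/')
    Right(S3Address(bucket, key.stripPrefix("/")))
  } else Left(value)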


Taking all this into account, we can use normal records of data-keys (which are just Type[String] for input and Type[AnyS3Address] for output) to check that the data mapping is correct:

  • input datasets contain all keys
  • the output mapping function (from 1.) returns all keys

Use unique filenames for the input data

Unique filenames can be useful precisely because they are unique. A separate concern is when the processing part needs files to be in a certain place, but I think that's out of scope here.

This would make #42 unnecessary, btw.

LogUploader failure blocks worker

Got this:

15:55:23.833 [ForkJoinPool-1-worker-1] INFO  ohnosequences.loquat.DataProcessor - Processing data in: /media/ephemeral0/applicator/loquat
Mar 02, 2016 4:04:49 PM com.amazonaws.http.AmazonHttpClient executeHelper
INFO: Unable to execute HTTP request: Remote host closed connection during handshake
javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:992)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
...
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3649)
        at com.amazonaws.services.s3.AmazonS3Client.createBucket(AmazonS3Client.java:824)
        at com.amazonaws.services.s3.AmazonS3Client.createBucket(AmazonS3Client.java:760)
        at ohnosequences.awstools.s3.S3.ohnosequences$awstools$s3$S3$$createBucketAction$1(S3.scala:183)
        at ohnosequences.awstools.s3.S3$$anonfun$createBucket$1.apply(S3.scala:190)
        at ohnosequences.awstools.s3.S3$$anonfun$createBucket$1.apply(S3.scala:190)
        at ohnosequences.awstools.s3.S3.tryAction(S3.scala:170)
        at ohnosequences.awstools.s3.S3.createBucket(S3.scala:190)
        at ohnosequences.awstools.s3.S3$$anonfun$uploadFile$1.apply$mcV$sp(S3.scala:265)
        at ohnosequences.awstools.s3.S3$$anonfun$uploadFile$1.apply(S3.scala:264)
        at ohnosequences.awstools.s3.S3$$anonfun$uploadFile$1.apply(S3.scala:264)
        at scala.util.Try$.apply(Try.scala:192)
        at ohnosequences.awstools.s3.S3.uploadFile(S3.scala:264)
        at ohnosequences.loquat.LogUploaderBundle$$anonfun$uploadLog$3.apply(logger.scala:31)
        at ohnosequences.loquat.LogUploaderBundle$$anonfun$uploadLog$3.apply(logger.scala:30)
        at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
        at scala.util.Try$.apply(Try.scala:192)
        at scala.util.Success.map(Try.scala:237)
        at ohnosequences.loquat.LogUploaderBundle.uploadLog(logger.scala:30)
        at ohnosequences.loquat.LogUploaderBundle$$anonfun$instructions$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(logger.scala:42)
        at ohnosequences.loquat.utils$Scheduler$$anon$1.run(utils.scala:68)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: SSL peer shut down incorrectly
        at sun.security.ssl.InputRecord.read(InputRecord.java:505)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
        ... 46 more

After that the worker doesn't react.

Generating IAM policies

Minimal policies/roles/profiles

  • for launching a loquat
  • for running it (i.e. manager's permissions)

It would be nice to generate the second policy automatically and create it for the newly launched loquat.

Data flow diagram

I was looking for web apps to make an AWS data flow diagram for MG7 (ohnosequences/mg7#91), and to try something out I first made a diagram of the Loquat architecture:

[Loquat architecture diagram: loquat2]

Here's a link to the editable draft.

@ohnosequences/docs what do you think of such a diagram in general and in the context of MG7?

Planning v2.0 release

Some essential things that have to be done:

  • #52: Check that datamappings have all data keys defined
  • #42: Check that all input data-keys have distinct labels
  • #53: Stabilize/update dependencies before the release

The rest of the items are not essential for the v2.0 release; some of them are important but can be implemented in v2.1.

The plan is:

  • 1. fix the issues under the v2.0 milestone
  • 2. release statika-2.0-RC1, then release loquat-2.0-RC1
  • 3. test it with some real project; if there are new issues/bugs, add them to the milestone
  • 4. fix them, then release RC2 with these fixes and updates of all dependencies (#53)
  • 5. test again; if no major problems are detected, release v2.0

Improve email subscription messages

  • add links to the results in the success-finish message
  • add what-to-do instructions in the failure-finish message
  • fix temp-link in the worker failure message

Suggestions are welcome.

Better workers logging

Some ideas:

  • the current logUploader bundle uploads the log to S3 only once, with the information about instance initialization (before the "applying" state)
    • this is still useful for situations when initialization fails and kills the instance: you still have a log in S3 to inspect manually
  • then there is a separate log for each task processing cycle
    • it's flushed at the beginning of each cycle
    • it can be uploaded (?) to S3 at the end of the cycle
    • it's probably better to upload it to S3 or send it to the SQS queue only on failures
    • send an SNS notification on fatal failures (see #21) with this task-related log

Some ideas for capturing stdout for logging: StackExchange.
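
One common approach (a sketch, not necessarily what loquat should adopt): redirect stdout for a block of code to a file, so the task output can later be uploaded as part of the log.

import java.io.{File, FileOutputStream, PrintStream}

def withStdoutTo[A](logFile: File)(block: => A): A = {
  val out = new PrintStream(new FileOutputStream(logFile, true)) // append mode
  try Console.withOut(out)(block)
  finally out.close()
}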

Improve logging

  • more logging for debug
  • adjust logging levels
  • create loggers explicitly and set the default logging level and a more sensible name
  • ...?

Chaining loquats

General ideas for implementing this feature:

  • change termination daemon to "progress monitor"
  • provide a simple way to define certain behaviour patterns/rules
  • the progress monitor will check the given conditions and follow the corresponding instructions when some of them are met

This way, in addition to the termination conditions, the monitor will be able, for example, to launch a subsequent loquat as soon as there is at least one successful result in the output queue (which is the input queue of that next loquat).

Note this will require the manager to have permissions to launch another loquat, so #32 is related.
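
A minimal sketch of such behaviour rules (all names are hypothetical): a rule pairs a condition with an action, and the progress monitor evaluates the rules on each check cycle.

case class Rule(name: String, condition: () => Boolean, action: () => Unit)

def checkRules(rules: Seq[Rule]): Unit =
  rules.filter(_.condition()).foreach { rule =>
    println(s"rule '${rule.name}' triggered")
    rule.action()
  }

// e.g. chaining: launch the next loquat as soon as the first result appears
def successfulResults: Int = ???     // query the output queue
def launchNextLoquat(): Unit = ???   // needs extra permissions, see #32

val chainRule = Rule("chain-next", () => successfulResults >= 1, () => launchNextLoquat())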

Add events notification topic

It could be useful for #33: one loquat emits events like "N results ready" or "everything is done/failed"; another loquat subscribes and waits/starts when needed.

Bug in the input objects check

When checking the existence of S3 objects (not folders), the check should be stricter. Currently it checks by prefix, which may report an object as existing when only another object with a longer name matches that prefix.
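
A stricter check could list by prefix but then compare keys exactly; a sketch using the AWS Java SDK S3 client (the helper itself is hypothetical):

import com.amazonaws.services.s3.AmazonS3Client
import scala.collection.JavaConverters._

def objectExists(s3: AmazonS3Client, bucket: String, key: String): Boolean =
  s3.listObjects(bucket, key)
    .getObjectSummaries.asScala
    .exists(_.getKey == key)   // require an exact key match, not just a shared prefix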

Stabilize/update dependencies before the release

  • Release v2.0 statika finally
  • Consider updating better-files (from 2.13 to 2.15)
  • Consider updating aws-scala-tools (to use the latest aws-java-sdk).
    This shouldn't cause any changes, but will require a new release of datasets.

The rest of the pending updates:

[info]   ch.qos.logback:logback-classic    : 1.1.3  -> 1.1.5
[info]   com.github.pathikrit:better-files : 2.13.0          -> 2.15.0
[info]   com.lihaoyi:upickle               : 0.3.6  -> 0.3.8
[info]   org.scalatest:scalatest:test      : 2.2.5  -> 2.2.6

Workers are not prolonging messages' visibility timeout

One case from the project I'm working on:

  • 1904 chunks, 1000 reads each
  • processing each task (chunk) takes ~40min
  • no matter how many workers I launch, only 5 to 10 messages are in-flight

So I think that workers are just processing the same tasks at the same time, because they fail to prolong their messages' initial visibility timeout (then those tasks return to the queue and other workers start doing them again).
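
If that's the case, the fix is for the worker to periodically extend the visibility timeout of the message it is working on; a sketch using the SQS ChangeMessageVisibility call (the scheduling part and names are hypothetical):

import java.util.concurrent.{Executors, TimeUnit}
import com.amazonaws.services.sqs.AmazonSQSClient

// push the message's visibility timeout another 10 minutes into the future
def extendVisibility(sqs: AmazonSQSClient, queueUrl: String, receiptHandle: String): Unit =
  sqs.changeMessageVisibility(queueUrl, receiptHandle, 600)

// while a task is being processed, re-extend the timeout every 5 minutes:
// val scheduler = Executors.newSingleThreadScheduledExecutor()
// scheduler.scheduleAtFixedRate(
//   new Runnable { def run(): Unit = extendVisibility(sqs, queueUrl, receiptHandle) },
//   0, 5, TimeUnit.MINUTES)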

Worker failures handling

After the changes in #39, failures in the worker are basically ignored. This has to be fixed:

  • the worker's data processing cycle recovers from non-critical failures and sends an SQS error message
  • the manager checks the SQS error queue and, if there are too many messages (>10?), sends an SNS notification to the user and undeploys

The difficulty here is in providing useful debug information in the SNS notification, but I have some ideas on how to solve it.
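
A sketch of the manager-side check (the threshold and the notify/undeploy hooks are hypothetical):

import com.amazonaws.services.sqs.AmazonSQSClient
import scala.collection.JavaConverters._

def tooManyErrors(sqs: AmazonSQSClient, errorQueueUrl: String, threshold: Int = 10): Boolean = {
  val attrs = sqs
    .getQueueAttributes(errorQueueUrl, List("ApproximateNumberOfMessages").asJava)
    .getAttributes.asScala
  attrs.get("ApproximateNumberOfMessages").exists(_.toInt > threshold)
}

// if (tooManyErrors(sqs, errorQueueUrl)) { notifyUserViaSNS(); undeploy() }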

Logger fails to upload when log changes too fast

When something is writing to the log continuously, the logger fails with this message:

ERROR ohnosequences.loquat.LogUploaderBundle - Couldn't upload log to S3: {}
com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified did not match what we received. (Service: Amazon S3; Status Code: 400; Error Code: BadDigest; Request ID: 7EAF2583A79D66FF)
...

The mismatch happens because the file keeps changing between the moment the checksum is computed and the moment the upload completes. This is not a big deal, but the log will be missing in S3.

Could be solved by #67.
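
One way around it (a sketch; not necessarily how #67 addresses it): upload a stable snapshot of the log instead of the live file, so the bytes can't change mid-upload.

import java.io.File
import java.nio.file.{Files, StandardCopyOption}

def snapshot(logFile: File): File = {
  val tmp = Files.createTempFile("log-snapshot-", ".log")
  Files.copy(logFile.toPath, tmp, StandardCopyOption.REPLACE_EXISTING)
  tmp.toFile
}

// then upload snapshot(logFile) with the usual S3 client call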
