ohnosequences / loquat


Another taste of nispero

Home Page: https://github.com/ohnosequences/nispero

License: GNU Affero General Public License v3.0

Scala 100.00%
scala scala-library framework cloud-computing cloud-infrastructure aws amazon-web-services aws-ec2 aws-s3 aws-sqs aws-sns aws-architecture data-analysis scalability pipeline-framework pipelines-as-code


loquat's People

Contributors: eparejatobes, laughedelic

Forkers: digideskio

loquat's Issues

DataMappings revamp

Here are some ideas on how to simplify and improve DataMapping:

1. Simplifying data mapping

We don't need an explicit map of inputs to outputs; instead we can have

  • a set of input records
  • a function from "task" (or input record) to output record

An alternative is to have a list of "tasks" (e.g. samples in MG7) and a function from task to an input dataset (or a set of datasets). This is probably better, because it's more flexible. Think of the MG7 pipeline:

  • currently it defines the inputs for the BLAST step in a very clumsy way, because they are the chunks produced by the split step: we go to the chunks S3 folder, enumerate all objects there and generate the input tasks from them
  • this would be much more straightforward with the proposed input datasets definition (task/sample -> set of inputs)
  • which leads to easier composability of loquats: split + blast piping becomes quite easy (see the sketch after this list):
    • the split loquat has an output mapping: sample -> chunks S3 folder
    • the blast loquat needs to define an input mapping: sample -> set of chunks, which is just a composition of the split output mapping and S3 listObjects.
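
A minimal sketch of that composition (the Sample and S3 types and the listObjects wrapper are hypothetical, not the actual loquat or aws-scala-tools API):

case class Sample(id: String)
case class S3Folder(bucket: String, key: String)
case class S3Object(bucket: String, key: String)

// split loquat: output mapping, sample -> chunks S3 folder
def splitOutput(sample: Sample): S3Folder =
  S3Folder("some-bucket", s"chunks/${sample.id}/")

// hypothetical wrapper over the S3 ListObjects call
def listObjects(folder: S3Folder): List[S3Object] = ???

// blast loquat: input mapping is just the composition of the two
def blastInputs(sample: Sample): List[S3Object] =
  listObjects(splitOutput(sample))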

2. Simplifying output dataset usage

The output record is currently used in the dataProcessing to return all output files corresponding to each record field. This brings two problems:

  • we intend to use the output record with values of two types: S3 location and local file, which is strange and difficult due to the cosas-record limitations
  • returning a record with files is a source of copy-paste mistakes: it does guarantee that you return all output files, but it's too easy to mix up which file goes to which data-key.

We can drop it and use output files in a similar fashion to how input files are used currently: we have the data-keys and the files context, which can tell us where each file is expected to be. So whenever we write to an output file, we just pass the file obtained from the context with the data-key.
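
A rough sketch of that idea (names are hypothetical, not the current loquat API): the context resolves the expected local file for a data-key, so the processing code never builds output paths itself.

import java.io.File

trait OutputContext {
  // resolve the expected local file for a given output data-key
  def file(dataKeyLabel: String): File
}

def writeStats(ctx: OutputContext): Unit = {
  val statsFile = ctx.file("stats")  // the context knows where this file should be
  // ... write the results directly to statsFile
}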

3. Simplifying input dataset usage

The only reason we're using the datasets library now is that we want a common type for two kinds of input "resources": an S3 address and a simple message/string. I think it's quite easy to achieve this without the inconvenient datasets dependency:

  • the input data-key has two apply methods (as a Type or Data): one that takes a String, another that takes an AnyS3Address, which is just converted to its URL string.
  • on receiving such an input, the worker tries to parse an S3 address and either downloads from it, or just writes the string to a file (see the sketch below).

This is also needed for #70.
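
A rough sketch of the two apply methods and the worker-side handling (the types and names are hypothetical; only the URL-string convention from above is assumed):

case class S3Address(bucket: String, key: String) {
  def url: String = s"s3://${bucket}/${key}"
}

case class InputKey(label: String) {
  def apply(message: String): (String, String)    = label -> message
  def apply(address: S3Address): (String, String) = label -> address.url
}

// on the worker: if the value parses as an S3 URL, download it;
// otherwise just write the plain string to the input file
def resolveInput(value: String): Either[String, S3Address] =
  if (value.startsWith("s3://")) {
    val (bucket, key) = value.stripPrefix("s3://").span(_ != '/')
    Right(S3Address(bucket, key.stripPrefix("/")))
  } else Left(value)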


Taking all this into account, we can use normal records of data-keys (which are just Type[String] for input and Type[AnyS3Address] for output) to check that the data mapping is correct:

  • input datasets contain all keys
  • the output mapping function (from 1.) returns all keys

Use unique filenames for the input data

Unique filenames can be useful precisely because they are unique. A separate concern is when the processing part needs files to be in a certain place, but I think that's out of scope here.

This would make #42 unnecessary, btw.

LogUploader failure blocks worker

Got this:

15:55:23.833 [ForkJoinPool-1-worker-1] INFO  ohnosequences.loquat.DataProcessor - Processing data in: /media/ephemeral0/applicator/loquat
Mar 02, 2016 4:04:49 PM com.amazonaws.http.AmazonHttpClient executeHelper
INFO: Unable to execute HTTP request: Remote host closed connection during handshake
javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:992)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
...
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3649)
        at com.amazonaws.services.s3.AmazonS3Client.createBucket(AmazonS3Client.java:824)
        at com.amazonaws.services.s3.AmazonS3Client.createBucket(AmazonS3Client.java:760)
        at ohnosequences.awstools.s3.S3.ohnosequences$awstools$s3$S3$$createBucketAction$1(S3.scala:183)
        at ohnosequences.awstools.s3.S3$$anonfun$createBucket$1.apply(S3.scala:190)
        at ohnosequences.awstools.s3.S3$$anonfun$createBucket$1.apply(S3.scala:190)
        at ohnosequences.awstools.s3.S3.tryAction(S3.scala:170)
        at ohnosequences.awstools.s3.S3.createBucket(S3.scala:190)
        at ohnosequences.awstools.s3.S3$$anonfun$uploadFile$1.apply$mcV$sp(S3.scala:265)
        at ohnosequences.awstools.s3.S3$$anonfun$uploadFile$1.apply(S3.scala:264)
        at ohnosequences.awstools.s3.S3$$anonfun$uploadFile$1.apply(S3.scala:264)
        at scala.util.Try$.apply(Try.scala:192)
        at ohnosequences.awstools.s3.S3.uploadFile(S3.scala:264)
        at ohnosequences.loquat.LogUploaderBundle$$anonfun$uploadLog$3.apply(logger.scala:31)
        at ohnosequences.loquat.LogUploaderBundle$$anonfun$uploadLog$3.apply(logger.scala:30)
        at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
        at scala.util.Try$.apply(Try.scala:192)
        at scala.util.Success.map(Try.scala:237)
        at ohnosequences.loquat.LogUploaderBundle.uploadLog(logger.scala:30)
        at ohnosequences.loquat.LogUploaderBundle$$anonfun$instructions$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(logger.scala:42)
        at ohnosequences.loquat.utils$Scheduler$$anon$1.run(utils.scala:68)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: SSL peer shut down incorrectly
        at sun.security.ssl.InputRecord.read(InputRecord.java:505)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
        ... 46 more

After that the worker doesn't react.

Generating IAM policies

Minimal policies/roles/profiles

  • for launching a loquat
  • for running it (i.e. manager's permissions)

It would be nice to generate the second policy automatically and create it for the newly launched loquat.

Data flow diagram

I was looking for web apps to make an AWS data flow diagram for MG7 (ohnosequences/mg7#91), and to try something out I first made a diagram of the Loquat architecture:

[Loquat architecture diagram: loquat2]

Here's a link to the editable draft.

@ohnosequences/docs what do you think of such a diagram in general and in the context of MG7?

Planning v2.0 release

Some essential things that have to be done:

  • #52: Check that datamappings have all data keys defined
  • #42: Check that all input data-keys have distinct labels
  • #53: Stabilize/update dependencies before the release

The rest of the items are not essential for the v2.0 release; some of them are important but can be implemented in v2.1.

The plan is:

  • 1. fix the issues under the v2.0 milestone
  • 2. release statika-2.0-RC1, then release loquat-2.0-RC1
  • 3. test it with some real project; if there are new issues/bugs, add them to the milestone
  • 4. fix them, then release RC2 with these fixes and updates of all dependencies (#53)
  • 5. test again; if no major problems are detected, release v2.0

Improve email subscription messages

  • add links to the results in the success-finish message
  • add what-to-do instructions in the failure-finish message
  • fix temp-link in the worker failure message

Suggestions are welcome.

Better workers logging

Some ideas:

  • the current logUploader bundle uploads the log to S3 only once, with the information about instance initialization (before the "applying" state)
    • this is still useful for situations when initialization fails and kills the instance: you still have a log in S3 to inspect manually
  • then there is a separate log for each task processing cycle
    • it's flushed at the beginning of each cycle
    • it can be uploaded (?) to S3 at the end of the cycle
    • it's probably better to upload it to S3 or send it to the SQS queue only on failures
    • send an SNS notification on fatal failures (see #21) with this task-related log

Some ideas for capturing stdout for logging: StackExchange.
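
One common approach (a sketch, not necessarily what loquat should adopt): redirect stdout for a block of code to a file, so the task output can later be uploaded as part of the log.

import java.io.{File, FileOutputStream, PrintStream}

def withStdoutTo[A](logFile: File)(block: => A): A = {
  val out = new PrintStream(new FileOutputStream(logFile, true)) // append mode
  try Console.withOut(out)(block)
  finally out.close()
}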

Improve logging

  • more logging for debug
  • adjust logging levels
  • create loggers explicitly and set the default logging level and a more sensible name
  • ...?

Chaining loquats

General ideas for implementing this feature:

  • change termination daemon to "progress monitor"
  • provide a simple way to define certain behaviour patterns/rules
  • the progress monitor will check the given conditions and follow the corresponding instructions when some of them are met

This way, in addition to the termination conditions, the monitor will be able, for example, to launch a subsequent loquat as soon as there is at least one successful result in the output queue (which is the input queue of that next loquat).

Note this will require the manager to have permissions to launch another loquat, so #32 is related.
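
A minimal sketch of such behaviour rules (all names are hypothetical): a rule pairs a condition with an action, and the progress monitor evaluates the rules on each check cycle.

case class Rule(name: String, condition: () => Boolean, action: () => Unit)

def checkRules(rules: Seq[Rule]): Unit =
  rules.filter(_.condition()).foreach { rule =>
    println(s"rule '${rule.name}' triggered")
    rule.action()
  }

// e.g. chaining: launch the next loquat as soon as the first result appears
def successfulResults: Int = ???     // query the output queue
def launchNextLoquat(): Unit = ???   // needs extra permissions, see #32

val chainRule = Rule("chain-next", () => successfulResults >= 1, () => launchNextLoquat())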

Add events notification topic

It could be useful for #33: one loquat emits events like "N results ready" or "everything is done/failed"; another loquat subscribes and waits/starts when needed.

Bug in the input objects check

When checking the existence of S3 objects (not folders), the check should be stricter. Currently it checks by prefix, which may report an object as existing when only another object with a longer name matches that prefix.
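
A stricter check could list by prefix but then compare keys exactly; a sketch using the AWS Java SDK S3 client (the helper itself is hypothetical):

import com.amazonaws.services.s3.AmazonS3Client
import scala.collection.JavaConverters._

def objectExists(s3: AmazonS3Client, bucket: String, key: String): Boolean =
  s3.listObjects(bucket, key)
    .getObjectSummaries.asScala
    .exists(_.getKey == key)   // require an exact key match, not just a shared prefix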

Stabilize/update dependencies before the release

  • Release v2.0 statika finally
  • Consider updating better-files (from 2.13 to 2.15)
  • Consider updating aws-scala-tools (to use the latest aws-java-sdk).
    This shouldn't cause any changes, but will require a new release of datasets.

The rest of the pending updates:

[info]   ch.qos.logback:logback-classic    : 1.1.3  -> 1.1.5
[info]   com.github.pathikrit:better-files : 2.13.0          -> 2.15.0
[info]   com.lihaoyi:upickle               : 0.3.6  -> 0.3.8
[info]   org.scalatest:scalatest:test      : 2.2.5  -> 2.2.6

Workers are not prolonging messages' visibility timeout

One case from the project I'm working on:

  • 1904 chunks, 1000 reads each
  • processing each task (chunk) takes ~40min
  • no matter how many workers I launch, only 5 to 10 messages are in-flight

So I think that workers are just processing the same tasks at the same time, because they fail to prolong their messages' initial visibility timeout (then those tasks return to the queue and other workers start doing them again).
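
If that's the case, the fix is for the worker to periodically extend the visibility timeout of the message it is working on; a sketch using the SQS ChangeMessageVisibility call (the scheduling part and names are hypothetical):

import java.util.concurrent.{Executors, TimeUnit}
import com.amazonaws.services.sqs.AmazonSQSClient

// push the message's visibility timeout another 10 minutes into the future
def extendVisibility(sqs: AmazonSQSClient, queueUrl: String, receiptHandle: String): Unit =
  sqs.changeMessageVisibility(queueUrl, receiptHandle, 600)

// while a task is being processed, re-extend the timeout every 5 minutes:
// val scheduler = Executors.newSingleThreadScheduledExecutor()
// scheduler.scheduleAtFixedRate(
//   new Runnable { def run(): Unit = extendVisibility(sqs, queueUrl, receiptHandle) },
//   0, 5, TimeUnit.MINUTES)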

Worker failures handling

After the changes in #39, failures in the worker are basically ignored. This has to be fixed:

  • the worker's data processing cycle recovers from non-critical failures and sends an SQS error message
  • the manager checks the SQS error queue and, if there are too many messages (>10?), sends an SNS notification to the user and undeploys

The difficulty here is in providing useful debug information in the SNS notification, but I have some ideas on how to solve it.
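
A sketch of the manager-side check (the threshold and the notify/undeploy hooks are hypothetical):

import com.amazonaws.services.sqs.AmazonSQSClient
import scala.collection.JavaConverters._

def tooManyErrors(sqs: AmazonSQSClient, errorQueueUrl: String, threshold: Int = 10): Boolean = {
  val attrs = sqs
    .getQueueAttributes(errorQueueUrl, List("ApproximateNumberOfMessages").asJava)
    .getAttributes.asScala
  attrs.get("ApproximateNumberOfMessages").exists(_.toInt > threshold)
}

// if (tooManyErrors(sqs, errorQueueUrl)) { notifyUserViaSNS(); undeploy() }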

Logger fails to upload when log changes too fast

When something is writing to the log continuously, the logger fails with this message:

ERROR ohnosequences.loquat.LogUploaderBundle - Couldn't upload log to S3: {}
com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified did not match what we received. (Service: Amazon S3; Status Code: 400; Error Code: BadDigest; Request ID: 7EAF2583A79D66FF)
...

The mismatch happens because the file keeps changing between the moment the checksum is computed and the moment the upload completes. This is not a big deal, but the log will be missing in S3.

Could be solved by #67.
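
One way around it (a sketch; not necessarily how #67 addresses it): upload a stable snapshot of the log instead of the live file, so the bytes can't change mid-upload.

import java.io.File
import java.nio.file.{Files, StandardCopyOption}

def snapshot(logFile: File): File = {
  val tmp = Files.createTempFile("log-snapshot-", ".log")
  Files.copy(logFile.toPath, tmp, StandardCopyOption.REPLACE_EXISTING)
  tmp.toFile
}

// then upload snapshot(logFile) with the usual S3 client call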
