pywren's Introduction

PyWren

The wrens are mostly small, brownish passerine birds in the mainly New World family Troglodytidae. ... Most wrens are small and rather inconspicuous, except for their loud and often complex songs. - Wikipedia

PyWren -- it's like a mini condor, in the cloud, for often-complex calls. You can get up to 40 TFLOPS peak from AWS Lambda:

(Benchmark plot)

This is the development site. Learn more at pywren.io.

pywren's People

Contributors

ericmjonas, jrk, ooq, sean-smith, shivaram, tdhopper, vaishaal


pywren's Issues

How to handle retries

Right now if jobs die for various reasons, we don't retry them. This is especially problematic for some of the stand-alone workers.
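
A minimal sketch of what a retry wrapper around the invocation path could look like; invoke_job and the retry/backoff parameters here are hypothetical, not part of the current codebase:

```python
import time

def with_retries(fn, max_retries=3, backoff=2.0):
    """Call fn; on failure, retry up to max_retries times with exponential backoff."""
    def wrapped(*args, **kwargs):
        for attempt in range(max_retries + 1):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_retries:
                    raise
                time.sleep(backoff ** attempt)
    return wrapped

# Hypothetical usage: invoke_job stands in for whatever actually fires the job.
# invoke_job_with_retries = with_retries(invoke_job, max_retries=3)
```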

Faster backend for HTTP

Is there a tasklet / greenlet version of urllib that will work better / be a faster backend to botocore? Right now it seems we're limited by our ability to dispatch / launch those jobs. Or is this an AWS rate-limiting issue? Performance seems to have really slowed down when we switched to the S3-backed job synchronization.

Provide cost estimation along with result?

This could be a nice user-facing feature.
This could be a nice user-facing feature.
Lambda cost can be estimated easily by accumulating each call's runtime, rounded up to 100 ms increments and multiplied by the 100 ms unit cost. Invocation cost can be included but is generally minimal. S3 reads/writes are free for Lambda.
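
For illustration, a rough sketch of that estimate, assuming the executor can report each call's runtime in seconds; the per-GB-second and per-request prices below are assumptions based on public Lambda pricing, not values taken from pywren:

```python
import math

GB_SECOND_PRICE = 0.00001667   # assumed Lambda price per GB-second
REQUEST_PRICE = 0.20 / 1e6     # assumed price per invocation

def estimate_cost(runtimes_sec, memory_mb=1536):
    """Estimate Lambda cost for a list of per-call runtimes, in seconds."""
    gb = memory_mb / 1024.0
    # Each call is billed in 100 ms increments, rounded up.
    billed_sec = sum(math.ceil(t / 0.1) * 0.1 for t in runtimes_sec)
    return billed_sec * gb * GB_SECOND_PRICE + len(runtimes_sec) * REQUEST_PRICE

print(estimate_cost([0.25, 1.03, 12.7]))
```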

PyWren IAM role should use correct AWS id

The IAM role installed by pywren_exec_role doesn't have the right permissions for the CloudWatch logging actions. We need to get the user's AWS account ID from boto to get this right.
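
For reference, the account ID is available from the STS API via boto3, so the role policy ARNs can be templated with the caller's own account; a minimal sketch:

```python
import boto3

# Ask STS who we are and pull out the 12-digit account ID.
account_id = boto3.client("sts").get_caller_identity()["Account"]
print(account_id)
```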

Reduce setup time

Sometimes we find old runtimes in /tmp; I think this is because AWS recycles my Lambda containers. This could potentially be used to cache the runtime and dramatically reduce setup time.
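
A minimal sketch of what reusing a runtime left in /tmp might look like inside the handler; the cache path and the download step are hypothetical:

```python
import os

RUNTIME_CACHE = "/tmp/condaruntime"  # hypothetical location in the recycled container

def ensure_runtime(download_and_unpack):
    """Reuse the runtime from a previous invocation if the container was recycled."""
    if not os.path.isdir(RUNTIME_CACHE):
        # Hypothetical helper: fetch the runtime tarball from S3 and unpack it here.
        download_and_unpack(RUNTIME_CACHE)
    return RUNTIME_CACHE
```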

ImportError: No module named 'wren'

pywren create_config failed for me because it couldn't import wren.

(pywren-test) ✔ ~/repos/pywren-test
12:44 $ ipython --no-banner

In [1]: import pywren
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-4be033962a5f> in <module>()
----> 1 import pywren

/Users/tdhopper/repos/pywren-test/pywren/pywren/__init__.py in <module>()
----> 1 import wren
      2 from wren import default_executor, wait

ImportError: No module named 'wren'

python3 support

Both the client-side and the server-side runtime would need a little bit of work for python3 support.

Server-side

  • the fabfile fabfile_builder.py handles spinning up an EC2 instance, downloading anaconda, configuring it, tarballing the result, and cramming it back into S3 as the run environment. You'd just need to download the python3 miniconda -- everything there should basically work ok.
  • jobrunner.py would need to be python3 friendly

Client-side

  • As @tdhopper has pointed out in some issues, there's a host of non-Python-3-friendly code and idioms in there. All of them could be fixed, and I think some of the more complex parts (cloudpickle?) already have Python 3 support.

Additionally, we'd want to properly set up a build matrix on Travis for testing.

See also issues #21 #19 #20

Rate-limiting executors with a max number of concurrent workers

Right now, the executor fires all the invocations at once. This approach has the following downsides (a sketch of one way to cap concurrency follows the list):

  • If the invoked function is accessing other services, e.g., S3, too many concurrent invocations can result in throttling from those external services.
  • A user might have a much lower limit on concurrent executions from AWS (the default is 100). Firing off, say, 10,000 invocations might cause AWS to throttle those invocations.
  • In the above scenario, our progress-tracking code on the host will track all 10,000 invocations, which is unnecessary, as only 100 will execute at the same time.
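
One possible shape for such a cap, sketched with a host-side thread pool so that at most max_workers invocations are in flight at any time (assuming concurrent.futures, i.e. the futures backport on Python 2); invoke here stands in for whatever actually calls Lambda:

```python
from concurrent.futures import ThreadPoolExecutor

def rate_limited_map(invoke, calls, max_workers=100):
    """Apply invoke() to every call, but keep at most max_workers in flight at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(invoke, calls))
```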

Why do we use multiprocess instead of multiprocessing

This seems like an unnecessary dependency given that (1) we only use it in stand-alone mode and (2) I think multiprocessing provides all the same functionality. I'm still not sure what the multiprocess fork actually does.

Add documentation on permissions

I use an IAM user with all S3 permissions, which allowed me to successfully call pywren create_config --bucket_name YOUR_S3_BUCKET_NAME. Then I tried to run create_role. I got

20:10 $ pywren create_role
config= {'s3': {'bucket': 'BUCKET', 'pywren_prefix': 'pywren.jobs'}, 'account': {'aws_account_id': ID, 'aws_region': 'us-west-2', 'aws_lambda_role': 'pywren_exec_role'}, 'runtime': {'s3_key': 'condaruntime.nomkl_sklearn.tar.gz', 's3_bucket': 'ericmjonas-public'}, 'lambda': {'memory': 1536, 'timeout': 300, 'function_name': 'pywren1'}}
Traceback (most recent call last):
  File "/Users/tdhopper/miniconda2/envs/pywren/bin/pywren", line 11, in <module>
    load_entry_point('pywren', 'console_scripts', 'pywren')()
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Users/tdhopper/repos/pywren/pywren/scripts/pywrencli.py", line 87, in create_role
    AssumeRolePolicyDocument=json_policy)
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/boto3/resources/factory.py", line 520, in do_action
    response = action(self, *args, **kwargs)
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(**params)
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/botocore/client.py", line 251, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/botocore/client.py", line 537, in _make_api_call
    raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the CreateRole operation: User: arn:aws:iam::X:user/Y is not authorized to perform: iam:CreateRole on resource: arn:aws:iam::X:role/pywren_exec_role

I attached the IAMFullAccess policy to my user and then that command ran successfully. Then I tried to run deploy_lambda and I got

20:17 $ pywren deploy_lambda
Traceback (most recent call last):
  File "/Users/tdhopper/miniconda2/envs/pywren/bin/pywren", line 11, in <module>
    load_entry_point('pywren', 'console_scripts', 'pywren')()
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Users/tdhopper/repos/pywren/pywren/scripts/pywrencli.py", line 124, in deploy_lambda
    b = lambclient.list_functions()
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/botocore/client.py", line 251, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/tdhopper/miniconda2/envs/pywren/lib/python2.7/site-packages/botocore/client.py", line 537, in _make_api_call
    raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the ListFunctions operation: User: arn:aws:iam::X:user/Y is not authorized to perform: lambda:ListFunctions

I haven't yet figured out how to get around this. :)

We probably need to add some docs explaining how users might configure AWS permissions.
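
As a starting point for those docs, here is a hedged sketch of granting the two missing actions to the user with boto3; this inline policy is purely illustrative and broader than a vetted least-privilege policy would be:

```python
import json
import boto3

# Illustrative inline policy covering the two AccessDenied errors shown above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["iam:CreateRole"], "Resource": "*"},
        {"Effect": "Allow", "Action": ["lambda:ListFunctions"], "Resource": "*"},
    ],
}

boto3.client("iam").put_user_policy(
    UserName="Y",  # the IAM user that runs the pywren commands
    PolicyName="pywren-setup",
    PolicyDocument=json.dumps(policy),
)
```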

S3 Key to Key wrapper

It seems a common idiom for running pywren on large datasets is mapping an S3 key to a new (processed) S3 key. Right now, the way we do this is to put the S3 reading/writing code inside the pywren function, which is kind of cumbersome. It would be nice to have a native interface for this functionality.
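
A minimal sketch of what such a wrapper might look like, using boto3 inside the mapped function; the process callback and key layout are hypothetical:

```python
import boto3

def key_to_key(process, bucket, src_key, dst_key):
    """Read src_key, apply process() to its bytes, and write the result to dst_key."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=src_key)["Body"].read()
    s3.put_object(Bucket=bucket, Key=dst_key, Body=process(body))
    return dst_key
```

With a native interface, the executor could accept (src_key, dst_key) pairs directly, so every user function wouldn't have to repeat this boilerplate.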

figure out exception issue

Should we distinguish between exceptions raised by the remote code invocation and exceptions triggered by the calling code?

Integration with ElastiCache

While S3 is useful for bulk writes, it would be good to have a way to integrate with ElastiCache for small writes/reads across Lambdas.
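
Assuming a Redis-backed ElastiCache cluster reachable from the Lambda functions, small values could be exchanged roughly like this (the endpoint below is a placeholder):

```python
import redis

# Placeholder endpoint; a real ElastiCache Redis endpoint and VPC config would be needed.
r = redis.StrictRedis(host="my-cache.abc123.use1.cache.amazonaws.com", port=6379)

r.set("job:42:partial", b"small intermediate result")
value = r.get("job:42:partial")
```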

Create S3 bucket if it doesn't exist

When a new user tries out pywren with a bucket name that doesn't exist, they currently get an error message. We could auto-create the bucket to make it easier for users.
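
A minimal sketch of that auto-create step with boto3; note that regions other than us-east-1 need an explicit LocationConstraint:

```python
import boto3
from botocore.exceptions import ClientError

def ensure_bucket(bucket, region="us-west-2"):
    """Create the bucket if a HEAD request shows we can't reach it."""
    s3 = boto3.client("s3", region_name=region)
    try:
        s3.head_bucket(Bucket=bucket)
    except ClientError:
        if region == "us-east-1":
            s3.create_bucket(Bucket=bucket)
        else:
            s3.create_bucket(
                Bucket=bucket,
                CreateBucketConfiguration={"LocationConstraint": region})
```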

create pypi package

  • Make sure we have docs on how to release in DEVNOTES.md
  • Figure out if cloudwatch pinning works ok?

The runtime situation is a mess

  • There's no programmatic way of knowing which runtime you're using.
  • There's no good way of making sure you don't accidentally try to run a Python 2 job with the Python 3 runtime.
  • It's easy to accidentally deploy a broken runtime.

Migrate to github org

Migrate to the new pywren github organization

  • Change travis crypto keys
  • Make sure travis builds still work
  • Update slack integrations
  • Update links in build files (like setup.py)

How to handle upgrades

We should add versioning to the runtime so that people don't accidentally invoke wrong / old / incompatible functions.

Create interactive getting started script

Create a script for new users, like Evan suggested, along these lines (a sketch of such a click command follows the outline):

pywren init
  1. validate Boto config / ask them to type in aws keys and create their boto file for them
  2. prompt for an S3 bucket or create an empty one for them
  3. set up all the deploy scripts etc.
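
Since the CLI already uses click, the interactive flow could be sketched as another click command; the prompts and defaults below are illustrative only:

```python
import click

@click.command()
def init():
    """Hypothetical interactive setup for new users."""
    aws_key = click.prompt("AWS access key id")
    aws_secret = click.prompt("AWS secret access key", hide_input=True)
    bucket = click.prompt("S3 bucket for pywren job data", default="pywren-jobs")
    region = click.prompt("AWS region", default="us-west-2")
    # ...write the boto credentials and ~/.pywren_config here, then run the
    # existing create_role / deploy_lambda steps with these values.
    click.echo("pywren configured for bucket %s in %s" % (bucket, region))
```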

Document install procedure

I was able to get this to work on a (non-borked) Python install, but the README is a little off.

  1. There aren't yet docs about how to install the package. (Just a line that says python setup.py install or something would be helpful.)
  2. After running pywren create_config, I needed to modify my ~/.pywren_config in order to run pywren create_role and pywren deploy_lambda. The install instructions indicate that modifying the file happens after all three steps.
  3. It may make sense to have a wrapper script called pywren init which does all of this with interactive prompts for bucket names and regions. That way there's no chance of having a partially broken ~/.pywren_config anywhere and no magic values to worry about checking.

Mapping local working directory to remote

A user's local Python program might access data or call binary executables in a local working directory. Ideally, we want such Python programs to work on the remote side as well.
To enable this, a user can specify the local directory that needs to be mapped, and we transfer the files inside that directory to the remote side through S3.
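
A minimal sketch of the host-side half: tar up the local directory and stage it in S3 so the remote handler can unpack it into its own working directory (bucket and key names are placeholders):

```python
import tarfile
import boto3

def stage_local_dir(local_dir, bucket, key="pywren.jobs/workdir.tar.gz"):
    """Tar up local_dir and upload it to S3 for the remote side to unpack."""
    archive = "/tmp/workdir.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(local_dir, arcname=".")
    boto3.client("s3").upload_file(archive, bucket, key)
    return key
```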

Create package versioning and check with runtime

We would like to make sure that the Lambda-side wrenhandler.py and the locally-installed library are always the same version.

Create a __version__ for the package, and have the lambda runtime compare its version with that.
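
A minimal sketch of the check, assuming the job payload carries the client's pywren.__version__ and the handler compares it with its own copy:

```python
HANDLER_VERSION = "0.1"  # placeholder; would come from the packaged pywren.__version__

def check_version(job):
    """Refuse jobs whose client library version doesn't match this runtime."""
    client_version = job.get("pywren_version")
    if client_version != HANDLER_VERSION:
        raise RuntimeError("client pywren %s does not match runtime %s"
                           % (client_version, HANDLER_VERSION))
```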

New user experience

How do they create and deploy the function to begin with? What's the flow for deploying to Lambda?
