elastiq's Introduction

elastiq

elastiq is a lightweight Python daemon that allows a cluster of virtual machines running a batch system to scale up and down automatically.

Scale up. elastiq monitors the batch system's queue. If too many jobs are waiting, it requests new virtual machines.

Scale down. elastiq monitors the cluster's virtual machines. If some machines stay idle for some time, it turns them off.

EC2. elastiq communicates with the cloud via the ubiquitous EC2 interface. The boto library is used for that.

Quotas. elastiq supports a quota for a minimum and maximum number of virtual machines. It will always ensure that a minimum number of virtual machines are running, and it will never run too many virtual machines.
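
The scale-up and quota rules described above can be sketched as a small clamping function. This is only an illustration under stated assumptions: the function name, its parameters and the jobs-per-VM ratio are hypothetical and are not taken from elastiq's code or configuration.

```python
def vms_to_request(waiting_jobs, running_vms, min_vms, max_vms, jobs_per_vm=1):
    """Return how many new VMs to request (hypothetical helper)."""
    # VMs needed to absorb the waiting jobs (ceiling division).
    wanted = -(-waiting_jobs // jobs_per_vm)
    # Never stay below the minimum quota...
    target = max(running_vms + wanted, min_vms)
    # ...and never go above the maximum quota.
    target = min(target, max_vms)
    # A negative difference means there is nothing to launch.
    return max(target - running_vms, 0)
```

For instance, with 10 waiting jobs, 2 running VMs and a quota of 1 to 5 VMs, the sketch asks for 3 more machines rather than 10.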

Plugins. elastiq can support several batch systems via plugins. It already comes with support for HTCondor.

IaaS embedded elasticity. elastiq lets you run an entire IaaS cluster that scales itself without relying on tools running outside the virtual cluster. Run it on the head node of your virtual cluster and it will scale the cluster on any cloud exposing an EC2 interface.

Requirements

  • Python 2.6 or greater
  • boto

Installation

CentOS/RHEL

Pick a release and an RPM from the releases page, and install it with:

yum localinstall python-elastiq-<ver>.rpm

Debian/Ubuntu

Pick a release and a deb from the releases page, and install it with:

gdebi python-elastiq-<ver>.deb

Run in foreground

Syntax:

elastiq-real.py --config=<configfile> [--logdir=<logdir>]

Where:

  • <configfile> is the configuration file (mandatory)
  • <logdir> is the directory where log files are written; they are rotated periodically

If run like this, it will stay in the foreground. It is also possible to run it as a system service: a script for running it in background is provided.

Run in background

To run it in the background use the elastiqctl command. It is recommended to run it as an unprivileged user.

Syntax:

elastiqctl [start|stop|status|restart|log|conf]

Where:

  • start, stop and restart are self-explanatory;
  • status tells whether the daemon is running;
  • log shows the log of elastiq in real time (quit with Ctrl-C);
  • conf shows some configuration information like the Python version, configuration file in use and log files directory.

Default configuration

When running as an unprivileged user:

  • log directory: ~/.elastiq/log
  • configuration file: ~/.elastiq/elastiq.conf

When running as root:

  • log directory: /var/log/elastiq
  • configuration file: /etc/elastiq.conf
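
The defaults listed above amount to a simple lookup on whether the daemon runs as root. A minimal sketch, assuming a hypothetical helper named default_paths (elastiqctl's real logic may differ):

```python
import os

def default_paths(is_root, home):
    """Return (log_dir, config_file) following the defaults listed above."""
    if is_root:
        return "/var/log/elastiq", "/etc/elastiq.conf"
    # Unprivileged users keep everything under ~/.elastiq
    base = os.path.join(home, ".elastiq")
    return os.path.join(base, "log"), os.path.join(base, "elastiq.conf")
```

On a POSIX system it could be called as default_paths(os.geteuid() == 0, os.path.expanduser("~")).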

Configuration

See the provided example elastiq.conf.example under the elastiq installation directory.

Plugins

See the htcondor.py plugin provided as an example.

elastiq's People

Contributors

dberzano


elastiq's Issues

subprocess.wait() hangs if output is too large

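As per the documentation, Popen.wait() in the subprocess module can hang forever when the child's output goes to a pipe and the child fills the pipe buffer: the child blocks on writing while the parent blocks on waiting.

Popen.communicate(), on the other hand, reads the output pipe until end-of-file and then waits for the process to terminate.

wait() should therefore be replaced by communicate().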

For the record, this is the real cause of the behavior shown in #3.
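
A minimal sketch of the proposed replacement, using only the standard subprocess module. The helper name run_and_capture is hypothetical and is not elastiq's robust_cmd:

```python
import subprocess
import sys

def run_and_capture(argv):
    """Run argv and return (exit_code, stdout_bytes) without deadlocking."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    # communicate() drains stdout until EOF, then waits for the child,
    # so a full pipe buffer cannot block the child forever.
    out, _ = proc.communicate()
    return proc.returncode, out

if __name__ == "__main__":
    # A child that writes far more than a typical pipe buffer holds.
    code, out = run_and_capture(
        [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1000000)"])
    print(code, len(out))
```

Calling proc.wait() in place of communicate() in the same helper is the pattern that can stall on large outputs.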

Log level not correctly set

It seems that the log level is always set to "debug", regardless of the log_level configuration variable: the code calls "self.logctl.setLevel(logging.DEBUG)".
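
One possible fix is to translate the configured string into a logging constant before calling setLevel(). The sketch below assumes that log_level holds a level name such as "debug" or "warning"; the helper name and the fallback to INFO are assumptions, not elastiq's actual behavior.

```python
import logging

def level_from_config(name, default=logging.INFO):
    """Map a level name such as "debug" or "WARNING" to a logging constant."""
    level = getattr(logging, str(name).upper(), None)
    # Fall back to the default when the name is not a known numeric level.
    return level if isinstance(level, int) else default
```

The hard-coded call would then become something like self.logctl.setLevel(level_from_config(configured_value)), where configured_value stands for whatever the configuration parser returns for log_level.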

Get rid of screen

screen is currently used for background processing, a leftover from the initial debugging phase. Get rid of it: it is also a "hidden" dependency!

External commands might hang forever

External commands (handled by robust_cmd) might hang forever and never recover. We should work around it.

Last log lines for reference:

2014-06-27 02:25:17 root DEBUG [__init__.main] Sleeping 5 seconds
2014-06-27 02:25:22 root DEBUG [__init__.main] Event 1/11 in queue: action=change_vms_allegedly_running when=1403828887 (-164) params=[-1, u'i-00043315']
2014-06-27 02:25:22 root DEBUG [__init__.main] Event 2/11 in queue: action=check_owned_instance when=1403828887 (-164) params=[u'i-00043315']
2014-06-27 02:25:22 root DEBUG [__init__.main] Event 3/11 in queue: action=change_vms_allegedly_running when=1403828887 (-164) params=[-1, u'i-00043316']
2014-06-27 02:25:22 root DEBUG [__init__.main] Event 4/11 in queue: action=check_owned_instance when=1403828887 (-164) params=[u'i-00043316']
2014-06-27 02:25:22 root DEBUG [__init__.main] Event 5/11 in queue: action=change_vms_allegedly_running when=1403828979 (-256) params=[-1, u'i-00043317']
2014-06-27 02:25:22 root DEBUG [__init__.main] Event 6/11 in queue: action=check_owned_instance when=1403828979 (-256) params=[u'i-00043317']
2014-06-27 02:25:22 root DEBUG [__init__.main] Event 7/11 in queue: action=change_vms_allegedly_running when=1403828979 (-256) params=[-1, u'i-00043318']
2014-06-27 02:25:22 root DEBUG [__init__.main] Event 8/11 in queue: action=check_owned_instance when=1403828979 (-256) params=[u'i-00043318']
2014-06-27 02:25:22 root DEBUG [__init__.main] Event 9/11 in queue: action=check_vms when=1403828721 (0) params=[]
2014-06-27 02:25:22 root INFO [__init__.check_vms] Checking batch system's VMs...
2014-06-27 02:25:24 root DEBUG [__init__.ec2_running_instances] No input hostnames given

Execution stalls after that line.
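
One way to work around a hanging external command is to put a time limit on it. The sketch below relies on subprocess.run() and its timeout parameter, which exist only in Python 3.5 and later, so it would not run as-is on the Python 2.6 that elastiq targets. The helper name run_with_timeout is hypothetical and is not the project's robust_cmd.

```python
import subprocess
import sys

def run_with_timeout(argv, timeout_s):
    """Return the exit code, or None if the command exceeded timeout_s."""
    try:
        # On timeout, run() kills the child before raising TimeoutExpired.
        done = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return done.returncode

if __name__ == "__main__":
    # Stand-in command: a Python child that exits immediately.
    print(run_with_timeout([sys.executable, "-c", "pass"], 10))
```

A caller can treat None as a failed attempt and retry or skip the check, instead of stalling the main loop.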

elastiq does not work with Ubuntu 12.04

Due to an old boto version (v2.2.2) we obtain:

Can't get list of owned EC2 instances in error: 'EC2Connection' object has no attribute 'get_only_instances'

Currently at least boto v2.13 is required. The code should be adapted to also work with such old boto versions.
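
A possible compatibility shim is sketched below. It assumes that, as in boto 2, the older get_all_instances() call returns reservations, each exposing an instances list. The helper name list_instances is hypothetical.

```python
def list_instances(conn):
    """Return a flat list of instances from a boto 2 EC2 connection."""
    if hasattr(conn, "get_only_instances"):
        # Newer boto 2 releases return the instances directly.
        return conn.get_only_instances()
    # Older releases only return reservations: flatten them.
    return [inst for res in conn.get_all_instances() for inst in res.instances]
```

The check uses hasattr() rather than comparing version strings, so it does not depend on knowing exactly which release introduced the newer method.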

elastiq does not work when multiple networks are possible

If a tenant has more than one network (for example, a LAN and a WAN), VM creation fails. The error message in the log is:

root ERROR [__init__.ec2_scale_up] Cannot run instance via EC2: check your "hard" quota

(which is not very clear), and the traceback is:

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/elastiq/__init__.py", line 291, in ec2_scale_up
    instance_type=cf['ec2']['flavour']
  File "/usr/lib/python2.6/site-packages/boto/ec2/image.py", line 325, in run
    tenancy=tenancy, dry_run=dry_run)
  File "/usr/lib/python2.6/site-packages/boto/ec2/connection.py", line 935, in run_instances
    verb='POST')
  File "/usr/lib/python2.6/site-packages/boto/connection.py", line 1177, in get_object
    raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0"?>
<Response><Errors><Error><Code>NetworkAmbiguous</Code><Message>Multiple possible networks found, use a Network ID to be more specific.</Message></Error></Errors><RequestID>req-48643f37-7819-42c3-9acb-8833ee1c3602</RequestID></Response>
2016-07-06 15:37:05 root INFO [__init__.ec2_scale_up] VM launch fail. Requested: 1/2 | Success: 0 | Failed: 1
2016-07-06 15:37:10 root ERROR [__init__.ec2_scale_up] Cannot run instance via EC2: check your "hard" quota

Change allegedly running VMs sooner if appropriate

  • Run check_owned_instance() many times, more granularly and use the deployment time as a timeout: if it hasn't been reached, reschedule the check; if it has been reached, terminate the VM.
  • Also, do not schedule change_vms_allegedly_running(), but use check_owned_instance() to change it, both on success (i.e. VM is running) and on failure (i.e. timeout reached).
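
The decision proposed in the two points above can be sketched as a pure function. The function name, its parameters and the returned action labels are all hypothetical; they only illustrate the "reschedule until the deployment timeout, then terminate" logic.

```python
def next_step(is_running, launched_at, now, deploy_timeout_s):
    """Decide what to do with a VM launched at launched_at (Unix seconds)."""
    if is_running:
        # Success: the VM can be counted as running right away.
        return "mark_running"
    if now - launched_at < deploy_timeout_s:
        # Deployment time not reached yet: check again later.
        return "reschedule_check"
    # Timeout reached without the VM coming up.
    return "terminate"
```

Each call to the check would evaluate this once and then either update the count of allegedly running VMs, put itself back in the event queue, or terminate the instance.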
