etsy / skyline Goto Github PK

View Code? Open in Web Editor NEW

2.1K 175.0 336.0 1.83 MB

It'll detect your anomalies! Part of the Kale stack.

Home Page: http://codeascraft.com/2013/06/11/introducing-kale/

License: Other

Python 79.13% CSS 5.33% JavaScript 8.37% Shell 5.05% HTML 2.12%

non-sox

skyline's Introduction

Skyline is an Archived Project

Skyline is no longer actively maintained. Your mileage with patches may vary.

Skyline

Skyline is a real-time* anomaly detection* system*, built to enable passive monitoring of hundreds of thousands of metrics, without the need to configure a model/thresholds for each one, as you might do with Nagios. It is designed to be used wherever there are a large quantity of high-resolution timeseries which need constant monitoring. Once a metrics stream is set up (from StatsD or Graphite or other source), additional metrics are automatically added to Skyline for analysis. Skyline's easily extendible algorithms automatically detect what it means for each metric to be anomalous. After Skyline detects an anomalous metric, it surfaces the entire timeseries to the webapp, where the anomaly can be viewed and acted upon.

Read the details in the wiki.

Install

sudo pip install -r requirements.txt for the easy bits
Install numpy, scipy, pandas, patsy, statsmodels, msgpack_python in that order.
You may have trouble with SciPy. If you're on a Mac, try:

sudo port install gcc48
sudo ln -s /opt/local/bin/gfortran-mp-4.8 /opt/local/bin/gfortran
sudo pip install scipy

On Debian, apt-get works well for Numpy and SciPy. On Centos, yum should do the trick. If not, hit the Googles, yo.

cp src/settings.py.example src/settings.py
Add directories:

sudo mkdir /var/log/skyline
sudo mkdir /var/run/skyline
sudo mkdir /var/log/redis
sudo mkdir /var/dump/

Download and install the latest Redis release
Start 'er up

cd skyline/bin
sudo redis-server redis.conf
sudo ./horizon.d start
sudo ./analyzer.d start
sudo ./webapp.d start

By default, the webapp is served on port 1500.

Check the log files to ensure things are running.

Debian + Vagrant specific, if you prefer

Gotchas

If you already have a Redis instance running, it's recommended to kill it and restart using the configuration settings provided in bin/redis.conf
Be sure to create the log directories.

Hey! Nothing's happening!

Of course not. You've got no data! For a quick and easy test of what you've got, run this:

cd utils
python seed_data.py

This will ensure that the Horizon service is properly set up and can receive data. For real data, you have some options - see wiki

Once you get real data flowing through your system, the Analyzer will be able start analyzing for anomalies!

Alerts

Skyline can alert you! In your settings.py, add any alerts you want to the ALERTS list, according to the schema (metric keyword, strategy, expiration seconds) where strategy is one of smtp, hipchat, or pagerduty. You can also add your own alerting strategies. For every anomalous metric, Skyline will search for the given keyword and trigger the corresponding alert(s). To prevent alert fatigue, Skyline will only alert once every for any given metric/strategy combination. To enable Hipchat integration, uncomment the python-simple-hipchat line in the requirements.txt file.

How do you actually detect anomalies?

An ensemble of algorithms vote. Majority rules. Batteries kind of included. See wiki

Architecture

See the rest of the wiki

Contributions

Clone your fork
Hack away
If you are adding new functionality, document it in the README or wiki
If necessary, rebase your commits into logical chunks, without errors
Verfiy your code by running the test suite and pep8, adding additional tests if able.
Push the branch up to GitHub
Send a pull request to the etsy/skyline project.

We actively welcome contributions. If you don't know where to start, try checking out the issue list and fixing up the place. Or, you can add an algorithm - a goal of this project is to have a very robust set of algorithms to choose from.

Also, feel free to join the skyline-dev mailing list for support and discussions of new features.

(*depending on your data throughput, *you might need to write your own algorithms to handle your exact data, *it runs on one box)

skyline's People

Contributors

Stargazers

Watchers

Forkers

ftdysa shjain oxtopus edouard-lopez gitlisted nvdnkpr zborboa-tesla ssgelm blotter123 tzuryby jonathandmello gutefrage jaredstehler johndorn ericds lucciano ifwe khouse tksteph12 which07 parkan tcopple hetyhuka mabrek cdoru oztc kryptx scalextremeinc goncalopereira doowb robzienert wayfair-archive jumping jametong mydalon ifixit draco2003 github4bhavin sirsepp mcchang nevins-b ami-linsa vinicius0026 jof oscil8 wwfalcon scopely anthroprose presto53 lananhbk168 thegreymatter jxwr ybrs somenathd andy-pham lloucas-imvu cekstam lxfontes amoseev mikemian fzylogic carl-ellis jsnyderjsnyderper akaplan-tagged xbglowx fauxcult skynet hltbra spadalkar jbmunro4 arreyder yujinqiu gnosek mynameismeerkat salanki maxgabriel ronmb waytai davideme banzayats danavilacon bradleyjs maxmetagravity ugurarpaci payamsabz albertobaselga m-kiselev rugger74 zsol darkcrawler01 freshteapot xujack astanway knowbody jkkruse leoyzen viveksck b-rich marccardinal nrmjba

skyline's Issues

GRAPHITE_URL is undefined in webapp, even though it is set in settings.py

This is causing all links to be broken.

make SKIP_LIST regexs

I'd like to be able to exclude metrics that end in "foo" without worrying about metrics that happen to have "foo" somewhere else in their name.

Skyline analyzer Crashed on 2k Metric's

Skyline is crashing every time it reaches 2k Metrics. Analyzer.log stops recording any metrics. While horizon logs are working fine. Please advise on how can i fix these issues and are there any alternatives available in the market that I can look into.

Why such an old version of simplejson?

All the other dependencies seem to be fairly current.

analyzer.d: statsmodels not listed as dependency

[root@skyline-n01 bin]# ./analyzer.d start
Traceback (most recent call last):
File "../src/analyzer/analyzer-agent.py", line 12, in
from analyzer import Analyzer
File "/root/skyline-master/src/analyzer/analyzer.py", line 14, in
from algorithms import run_selected_algorithm
File "/root/skyline-master/src/analyzer/algorithms.py", line 4, in
import statsmodels.api as sm
ImportError: No module named statsmodels.api
failed to start analyzer-agent

Fun part is that its not packed in Fedora yet :(

Make the Roombas spawn the same way as the Workers do.

Or vice versa, as long as it's consistent.

Uncaught Exceptions in algorithms.py

Hi all,

I had a look at Skyline for a few days now and only can get it to work partially. The most annoying thing are the hundreds of Tracebacks in the analyzer logfile like this one

ERROR:root:Algorithm error: Traceback (most recent call last):
  File "/home/christian/Source/git/github/skyline/src/analyzer/algorithms.py", line 289, in run_selected_algorithm
    ensemble = [globals()[algorithm](timeseries) for algorithm in ALGORITHMS]
  File "/home/christian/Source/git/github/skyline/src/analyzer/algorithms.py", line 103, in first_hour_average
    t = tail_avg(timeseries)
  File "/home/christian/Source/git/github/skyline/src/analyzer/algorithms.py", line 45, in tail_avg
    t = (timeseries[-1][1] + timeseries[-2][1] + timeseries[-3][1]) / 3
TypeError: unsupported operand type(s) for /: 'str' and 'int'

Can someone give me a hint whats going wrong? Because I have no idea what to do at this point.

add TAIL_INTERVAL configuration setting to use in tail_avg

Checking last N datapoints gives different results on metrics with different resolutions. If anomaly is detected on 1 last datapoint or even 3 last datapoints in a metric with 2 seconds resolution that anomaly might disappear in 10 seconds. If metric has resolution of 5 minutes then there is quite a lot of time for human to notice detected anomaly.
Metrics with a different resolutions might be present in the same environment so single size (like 3 used in tail_avg) won't fit them all.
New configuration setting (TAIL_INTERVAL measured in seconds) needs to be added so that algorithms would compare datapoints from the last TAIL_INTERVAL seconds with the rest.

See the discussion at #43

Provide a real build system

Installation of skyline currently is quite cumbersome and requires a lot of manual tasks. From a user perspective, having a common build system with e.g. setuptools would be a huge improvement for the usability. Especially, there should not be any assumption that skyline is installed with sudo permissions and custom installation prefixes need to work correctly.

Properly subclass Listen()

It's not very easy to extend the Listener while it remains a single class.

centos cannot import calc_lwork from scipy.linalg?

Hi, I use pip to install scipy,and the version is 0.15.0.
when i start the analyzer, I got this error.

File "/usr/local/lib/python2.7/site-packages/statsmodels/api.py", line 15, in <module>
    from .tsa import api as tsa
  File "/usr/local/lib/python2.7/site-packages/statsmodels/tsa/api.py", line 5, in <module>
    from .vector_ar.var_model import VAR
  File "/usr/local/lib/python2.7/site-packages/statsmodels/tsa/vector_ar/var_model.py", line 23, in <module>
    from statsmodels.tools.linalg import logdet_symm
  File "/usr/local/lib/python2.7/site-packages/statsmodels/tools/linalg.py", line 23, in <module>
    from scipy.linalg import calc_lwork
ImportError: cannot import name calc_lwork
failed to start analyzer-agent

I had tried google, but can' t. Thanks.

analyzer.d: pandas not listed as dependency

[root@skyline-n01 bin]# ./analyzer.d start
rm: cannot remove ‘../src/analyzer/*.pyc’: No such file or directory
Traceback (most recent call last):
File "../src/analyzer/analyzer-agent.py", line 12, in
from analyzer import Analyzer
File "/root/skyline-master/src/analyzer/analyzer.py", line 14, in
from algorithms import run_selected_algorithm
File "/root/skyline-master/src/analyzer/algorithms.py", line 1, in
import pandas
ImportError: No module named pandas

"Anomalous datapoint" causes confusion

Need a clearer way to communicate that "anomalous datapoint" means, the actual datapoint that is anomalous as opposed to the number of anomalous datapoints in the series.

Don't truncate logs on restart

Continuity.py should accept a command line argument to easily check any metric's health

Add ability to filter by metric on front end

Not detecting a dropped connection

When the graphite server drops the connection it's not detected. Returning if not buf forces the exception from unpack.

def read_all(self, sock, n):
"""
Read n bytes from a stream
"""
data = ''
while n > 0:
buf = sock.recv(n)
if not buf:
return data
n -= len(buf)
data += buf
return data

dependecies missing from requirements.txt

This needs to be added to the requriemetns.txt:

numpy
msgpack-python
pandas
scipy
patsy
statsmodels

Potentially interesting data sets could be discarded as boring.

Given this block of code to determine if a data set is boring (https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py#L177):

# Get rid of boring series
total = sum(item[1] for item in timeseries[-MAX_TOLERABLE_BOREDOM:])
if total == 0:
    raise Boring()

If, in the rare chance the net movement over a window of time is 0, for example, if the data set toggles between positive and negative values, but contains an anomaly, it could be discarded as being "boring".

For example, the following contrived example contains an anomaly, but would be seen as boring:

[-1, -1, -1, -1, -1, -1, -1, -1, -1, 10, -1]

Page layout messed up when no anomalous metrics are present.

Problem getting data into skyline

Hi everyone,

I made and update on my skyline, and now data are no more integrated on redis.

I used a version of skyline downloaded by zip, dont know which revision exactly, but certainly the state of 20/02/2014 skyline-master.

when I start OLD horizon.d I have the following in log :

started with pid 2644
2014-06-16 11:45:10 :: 2644 :: starting horizon agent
2014-06-16 11:45:10 :: 2645 :: started worker
2014-06-16 11:45:10 :: 2644 :: started roomba
2014-06-16 11:45:10 :: 2648 :: started listener
2014-06-16 11:45:10 :: 2648 :: listening over udp for messagepack on 2025
2014-06-16 11:45:10 :: 2647 :: started listener
2014-06-16 11:45:10 :: 2647 :: listening over tcp for pickles on 2024
2014-06-16 11:45:10 :: 2646 :: started worker
2014-06-16 11:45:10 :: 2650 :: operated on metrics. in 0.235132 seconds
2014-06-16 11:45:10 :: 2650 :: metrics. keyspace is 1003
2014-06-16 11:45:10 :: 2650 :: blocked 0 times
2014-06-16 11:45:10 :: 2650 :: euthanized 0 geriatric keys
2014-06-16 11:45:10 :: 2650 :: sleeping due to low run time...
2014-06-16 11:45:11 :: 2647 :: connection from xxxxxxxxxxx:2024
2014-06-16 11:45:11 :: 2645 :: queue size at 134
2014-06-16 11:45:11 :: 2645 :: queue size at 326
2014-06-16 11:45:11 :: 2645 :: queue size at 488
...

when I start CURRENT horizon.d (with exactly same settings.py) I have the following in log :

started with pid 2702
2014-06-16 11:46:56 :: 2702 :: starting horizon agent
2014-06-16 11:46:56 :: 2703 :: started worker
2014-06-16 11:46:56 :: 2705 :: started listener
2014-06-16 11:46:56 :: 2702 :: started roomba
2014-06-16 11:46:56 :: 2705 :: listening over tcp for pickles on 2024
2014-06-16 11:46:56 :: 2706 :: started listener
2014-06-16 11:46:56 :: 2706 :: listening over udp for messagepack on 2025
2014-06-16 11:46:56 :: 2704 :: started worker
2014-06-16 11:46:57 :: 2708 :: operated on metrics. in 0.263879 seconds
2014-06-16 11:46:57 :: 2708 :: metrics. keyspace is 1007
2014-06-16 11:46:57 :: 2708 :: blocked 0 times
2014-06-16 11:46:57 :: 2708 :: euthanized 0 geriatric keys
2014-06-16 11:46:57 :: 2708 :: sleeping due to low run time...
2014-06-16 11:46:58 :: 2705 :: connection from xxxxxxxxxx:2024
2014-06-16 11:46:58 :: 2705 :: global name 'StringIO' is not defined
2014-06-16 11:46:58 :: 2705 :: incoming connection dropped, attempting to reconnect
2014-06-16 11:46:58 :: 2705 :: listening over tcp for pickles on 2024
2014-06-16 11:47:07 :: 2713 :: operated on metrics. in 0.252194 seconds
2014-06-16 11:47:07 :: 2713 :: metrics. keyspace is 1007
2014-06-16 11:47:07 :: 2713 :: blocked 0 times
2014-06-16 11:47:07 :: 2713 :: euthanized 0 geriatric keys
2014-06-16 11:47:07 :: 2713 :: sleeping due to low run time...
2014-06-16 11:47:11 :: 2704 :: worker queue is empty and timed out
2014-06-16 11:47:11 :: 2703 :: worker queue is empty and timed out
...

And no data is sent to redis.

Should I come back to my old working skyline or there is some changes on horizon.d that explain the change of behaviour concerning TCP PICKLES data transfers ?

Add UI to indicate missing data in timeseries

The UDP MessagePack format description in the wiki is wrong

Hi,

Currently, https://github.com/etsy/skyline/wiki/Getting-Data-Into-Skyline#udp-messagepack says that:

UDP messagepack

Horizon also accepts metrics in the form of messagepack encoded strings over UDP, on port 2025. > The format is <metric name> <timestamp> <value>. Simply encode your metrics as messagepack and send them on their way.

That's not the case. It actually is [<metric name>, [<timestamp>, <value>]]. I haven't changed the wiki directly, because the format has always been like this (as far as I could find on git's history), so I wonder if the issue was that I didn't interpret it correctly. If that's the case, it would be good to make it clearer, as others might have the same issue.

Thanks!

support metric whitelists

I'd like to be able to supply a whitelist of metrics (as an alternative to the SKIP_LIST blacklist).

Currently skyline usually finds around 2k anomalies out of about 500k metrics. That is totally awesome to look through but too much for 'important things' to jump out. I'd like to run a parallel skyline that only looks at a smaller set of metrics that are tied to existing fault-detection/dashboards.

Analyzer DictProxy / KeyError

Please see this Gist for all of the information: https://gist.github.com/bflad/5863991

Add algorithms for seasonal metrics

Process timeseries that are less than FULL_DURATION anyway...

...and add a visual indicator warning that the data is not complete.

Missing "pipe.execute()" in Roomba when purging timeseries

Hi guys, here is the problem I encountered:

I hava some really old datapoints in redis. After I started Roomba, the whole timeseries should be purged as the last timestamp of it is much older than FULL_DURATION(+ ROOMBA_GRACE_TIME) seconds ago, but actually it is not and the variable "euthanized" somehow increased by 1. After going through the source I found that the reason may be the redis transaction is not submitted:

                #### Line 94
                 # Check if the last value is too old and purge
                if timeseries[-1][0] < now - duration:
                    pipe.delete(key)
                    pipe.srem(namespace + 'unique_metrics', key)
                    # ====== maybe we should call "pipe.execute()" here before "continue" ?
                    euthanized += 1
                    continue

and it is the same situation when purging timeseries which contains only 1 datapoint :

                #### Line 75
                # There's one value. Purge if it's too old
                try:
                    if not isinstance(timeseries[0], TupleType):
                        if timeseries[0] < now - duration:
                            pipe.delete(key)
                            pipe.srem(namespace + 'unique_metrics', key)
                            # ====== maybe we should call "pipe.execute()" here before "continue" ?
                            euthanized += 1
                        continue
                except IndexError:
                    continue

'Cause I'm new to Python or Redis, am I missing something ?

Make Oculus optional

If a user does not want to use Oculus, and gets rid of the Oculus settings, the front end will break. See #22 and #21.

It would also be wise to stop writing to mini. in Redis if Oculus is turned off, and to render MINI_DURATION views on the front end from the regular FULL_DURATION keys.

Created puppet module

I've created a puppet module and modified the wiki, mind checking and approving?

IOError for PID lockfile after ungraceful analyzer shutdown

Prevents analyzer from starting without knowing to remove PID lockfile.

# analyzer crashes here and leaves lockfile still around
ls -l /var/run/skyline/analyzer.pid
-rw-r--r-- 1 skyline skyline 6 Jun 25 20:36 /var/run/skyline/analyzer.pid
service skyline-analyzer start
cat /var/log/skyline/analyzer.log

Traceback (most recent call last):
  File "../src/analyzer/analyzer-agent.py", line 53, in <module>
    daemon_runner.do_action()
  File "/usr/lib/python2.6/site-packages/daemon/runner.py", line 189, in do_action
    func(self)
  File "/usr/lib/python2.6/site-packages/daemon/runner.py", line 124, in _start
    self.daemon_context.open()
  File "/usr/lib/python2.6/site-packages/daemon/daemon.py", line 346, in open
    self.pidfile.__enter__()
  File "/usr/lib/python2.6/site-packages/lockfile/__init__.py", line 226, in __enter__
    self.acquire()
  File "/usr/lib/python2.6/site-packages/daemon/pidfile.py", line 42, in acquire
    super(TimeoutPIDLockFile, self).acquire(timeout, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/lockfile/pidlockfile.py", line 85, in acquire
    raise LockTimeout
lockfile.LockTimeout
close failed in file object destructor:
IOError: [Errno 9] Bad file descriptor

# remove PID file - init script is just calling ./analyzer.d stop here
# could also just rm -f /var/run/skyline/analyzer.pid
service skyline-analyzer stop
service skyline-analyzer start
# starts fine. :)

Series with periodic

For example, a series with periodic: 1 day, data at 12:00 is a peak(i.e 1000), and at 0:00 is 10, so, 1000 at 12:00 should be normal, and 10 at 12:00 should be anomalous.

But skyline thinks 10 is normal.

Any data in webapp

Hello,
Why webapp doesn't show any data after running seed_data.py ?
"
Loading data over UDP via Horizon...
Connecting to Redis...
Congratulations! The data made it in. The Horizon pipeline seems to be working.
"

Thanks,

analyzer error

ERROR:root:Algorithm error: Traceback (most recent call last):
File "/opt/skyline/src/analyzer/algorithms.py", line 233, in run_selected_algorithm
ensemble = [globals()algorithm for algorithm in ALGORITHMS]
File "/opt/skyline/src/analyzer/algorithms.py", line 205, in ks_test
adf = sm.tsa.stattools.adfuller(reference, 10)
File "/usr/local/lib/python2.7/dist-packages/statsmodels/tsa/stattools.py", line 221, in adfuller
maxlag, autolag)
File "/usr/local/lib/python2.7/dist-packages/statsmodels/tsa/stattools.py", line 64, in _autolag
mod_instance = mod(endog, exog[:,:lag], *modargs)
File "/usr/local/lib/python2.7/dist-packages/statsmodels/regression/linear_model.py", line 479, in init
hasconst=hasconst)
File "/usr/local/lib/python2.7/dist-packages/statsmodels/regression/linear_model.py", line 381, in init
weights=weights, hasconst=hasconst)
File "/usr/local/lib/python2.7/dist-packages/statsmodels/regression/linear_model.py", line 79, in init
super(RegressionModel, self).init(endog, exog, *_kwargs)
File "/usr/local/lib/python2.7/dist-packages/statsmodels/base/model.py", line 136, in init
super(LikelihoodModel, self).init(endog, exog, *_kwargs)
File "/usr/local/lib/python2.7/dist-packages/statsmodels/base/model.py", line 52, in init
self.data = handle_data(endog, exog, missing, hasconst, *_kwargs)
File "/usr/local/lib/python2.7/dist-packages/statsmodels/base/data.py", line 397, in handle_data
return klass(endog, exog=exog, missing=missing, hasconst=hasconst, *_kwargs)
File "/usr/local/lib/python2.7/dist-packages/statsmodels/base/data.py", line 78, in init
self._check_integrity()
File "/usr/local/lib/python2.7/dist-packages/statsmodels/base/data.py", line 246, in _check_integrity
if len(self.exog) != len(self.endog):
TypeError: len() of unsized object

After running around 24 hours, analyzer threw out bunch of this kind of errors. Does anyone know the reason for this?

horizon-agent exit port 2025 if non-msgpack data recieved

I careless write a msgpack data with "\n" to port 2025, then horizon-agent exit the 2025 listen.
I know this is my mistake.But maybe better to ignore such data without exit?

alternatives to redis?

Hi people,
I'm currently working on a quite similar problem (trying to tame a gazilion metrics) and get some kind of anomaly detection into the mess.

So I wanted to ask why redis is backend? It seems a strange choice to me, being in memory it makes it limited by memory rather then disk (and you usually have a hell lot more disk). It also seems to require storing the timestamp with each metric effectively doubling (or more then given msgpack overhead) the storage consumption. And last but not least I can't see how the append operation is considered O(1) when it needs to relocat the whole data every time the size doubles it sounds like O(sqrt(n)) given the size of the data is always the same.

What I could not find is how long historical data is preserved given I know that I'm producing about 8g a day I can see to run out of memory in about 16 days not taking overhead into account probably 8 or less with that.

So I was wondering if you would be up for a discussion of an alternative backend(s), I currently ended up using Cassandra behind KairosDB (easier to write to and nice for aggregation) which so far works quite well and has a very sound storage mechanism with Cassandars Column based storage.

Cheers,
Heinz

horizon-agent sending duplicate graphite data per worker?

Was looking into another issue and noticed this in the process tree:

27026 ? S 0:07 _ python /pkg/skyline/src/horizon/horizon-agent.py start
18928 ? S 0:00 | _ sh -c echo skyline.horizon.queue_size 1083 1381505318 | nc -w 3 graphite 2003
18946 ? S 0:00 | _ nc -w 3 graphite 2003
27027 ? S 0:07 _ python /pkg/skyline/src/horizon/horizon-agent.py start
18921 ? S 0:00 | _ sh -c echo skyline.horizon.queue_size 1083 1381505318 | nc -w 3 graphite 2003
18931 ? S 0:00 | _ nc -w 3 graphite 2003
27028 ? S 0:07 _ python /pkg/skyline/src/horizon/horizon-agent.py start
18933 ? S 0:00 | _ sh -c echo skyline.horizon.queue_size 1083 1381505318 | nc -w 3 graphite 2003
18950 ? S 0:00 | _ nc -w 3 graphite 2003

Allow non-socket connections to redis

It'd be nice to be able to specify a host and a port instead of being forced to use redis.sock.

Happy to give implementing this a go if you'd like

The DataFlows does not exist

https://github.com/etsy/skyline/wiki/Data-Flows does not exist, I clicked it from https://github.com/etsy/skyline/wiki/Horizon.

Consistent-hash and skyline

I am just researching this and trying to understand if it is possible to use this with Graphite Consistent-hash relays? We have 5 relays running on the same machine and we use HAproxy to load balance between them. Each array uses Consistent-hash algorithm - do you know if it is still possible to forward all traffic to Skyline somehow?

Thanks,

Andrew

Pickling is insecure in Skyline listener

Skyline doesn't block arbitrary code in it's pickle implementation. This can be considered a security risk.
The graphite project has resolved this in their code:
https://github.com/graphite-project/carbon/blob/master/lib/carbon/util.py#L112

Can this be addressed in skyline?

horizon.d: msgpack not listest as a depencency

[root@skyline-n01 bin]# ./horizon.d start
rm: cannot remove ‘../src/horizon/*.pyc’: No such file or directory
Traceback (most recent call last):
File "../src/horizon/horizon-agent.py", line 13, in
from listen import Listen
File "/root/skyline-master/src/horizon/listen.py", line 6, in
from msgpack import unpackb
ImportError: No module named msgpack
failed to start horizon-agent

GRAPHITE_HOST_PORT

It would be handy to have a GRAPHITE_HOST_PORT setting so that if graphite runs on a different port to the standard port 80, links within the skyline frontend can be created appropriately. Just appending the port breaks analyzer.py:

GRAPHITE_HOST = 'http://graphite.example.com:81'

analyzer.py's nc post to graphite does not work with a port declared as it is only stripping the 'http://' and not the port, this means you can have both valid links and skyline posting to carbon (it seems).

Documentation for some algorithms is wrong?

In https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py in the documentation for functions stddev_from_average, stddev_from_moving_average, and mean_subtraction_cumulation the threshold is stated as one standard deviation. However in the implementations it looks like three standard deviations is the threshold. Am I missing something?

Flagging as an anomaly even though no algorithms failed

Just started testing out skyline and I started seeing today that skyline was flagging one of my metrics as an anomaly even though none of the algorithms failed. See highlighted below. I did have the CONSENSUS = 1 just increase the sensitivity. Any ideas on why this is happening?

2014-07-31 09:59:08 :: 29238 :: seconds to run :: 0.11
2014-07-31 09:59:08 :: 29238 :: total metrics :: 3
2014-07-31 09:59:08 :: 29238 :: total analyzed :: 0
2014-07-31 09:59:08 :: 29238 :: total anomalies :: 0
2014-07-31 09:59:08 :: 29238 :: exception stats :: {'Stale': 3}
2014-07-31 09:59:08 :: 29238 :: anomaly breakdown :: {}
2014-07-31 09:59:08 :: 29238 :: sleeping due to low run time...
2014-07-31 09:59:18 :: 29238 :: WARNING: skyline is set for more cores than needed.
2014-07-31 09:59:18 :: 29238 :: seconds to run :: 0.08
2014-07-31 09:59:18 :: 29238 :: total metrics :: 3
2014-07-31 09:59:18 :: 29238 :: total analyzed :: 1
2014-07-31 09:59:18 :: 29238 :: total anomalies :: 1
2014-07-31 09:59:18 :: 29238 :: exception stats :: {'Stale': 2}
2014-07-31 09:59:18 :: 29238 :: anomaly breakdown :: {}
2014-07-31 09:59:18 :: 29238 :: sleeping due to low run time...
2014-07-31 09:59:28 :: 29238 :: WARNING: skyline is set for more cores than needed.
2014-07-31 09:59:28 :: 29238 :: seconds to run :: 0.08
2014-07-31 09:59:28 :: 29238 :: total metrics :: 3
2014-07-31 09:59:28 :: 29238 :: total analyzed :: 1
2014-07-31 09:59:28 :: 29238 :: total anomalies :: 1
:

ERROR:root:Algorithm error: Traceback (most recent call last):
  File "/home/astanway/skyline/src/analyzer/algorithms.py", line 263, in run_selected_algorithm
    ensemble = [globals()[algorithm](timeseries) for algorithm in ALGORITHMS]
  File "/home/astanway/skyline/src/analyzer/algorithms.py", line 206, in ks_test
    adf = sm.tsa.stattools.adfuller(reference, 10)
  File "/usr/lib64/python2.6/site-packages/statsmodels/tsa/stattools.py", line 201, in adfuller
    xdall = lagmat(xdiff[:,None], maxlag, trim='both', original='in')
  File "/usr/lib64/python2.6/site-packages/statsmodels/tsa/tsatools.py", line 305, in lagmat
    raise ValueError("maxlag should be < nobs")
ValueError: maxlag should be < nobs

Any clues? cc @mabrek

Re: f886000