boundary / folsom

Expose Erlang Events and Metrics

License: Apache License 2.0

Erlang 100.00%

folsom's Introduction

folsom

Folsom is an Erlang-based metrics system inspired by Coda Hale's metrics (https://github.com/dropwizard/metrics). The purpose of the metrics API is to collect real-time metrics from your Erlang applications and publish them via Erlang APIs and output plugins. folsom is not a persistent store. There are 6 types of metrics: counters, gauges, histograms (and timers), histories, meter_readers and meters. Metrics can be created, read and updated via the folsom_metrics module.

Building and running

If you use folsom and folsom_webmachine together, make sure you have compatible versions of each by using code from the same version tags; e.g., folsom 0.5 is known to work with folsom_webmachine 0.5. HEAD on each repo may have broken API compatibility.

You need a (preferably recent) version of Erlang installed but that should be it.

   ./rebar get-deps compile

folsom can be run standalone or embedded in an Erlang application.

   $ erl -pa ebin deps/*/ebin

   > folsom:start(). % this creates the needed ETS tables and starts a gen_server

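You can also start it as an application. folsom depends on bear, so start bear first; otherwise application:start(folsom) returns {error,{not_started,bear}}:

   $ erl -pa ebin deps/*/ebin
   > application:start(bear).
   > application:start(folsom).

or from the command line:

   $ erl -pa ebin deps/*/ebin -s folsom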

The application can be configured to create individual metrics or lists of metrics at startup, either on the command line or in an application config file:

   $ erl -pa ebin deps/*/ebin -s folsom \
      -folsom history '[hist1,hist2]' \
      -folsom gauge gauge1

   $ echo '[{folsom, [{history, [hist1, hist2]}, {gauge, gauge1}]}].' \
      > myapp.config
   $ erl -pa ebin deps/*/ebin -config myapp.config -s folsom

Metrics API

folsom_metrics.erl is the API module you will need to use most of the time.

Retrieve a list of currently installed metrics:

  > folsom_metrics:get_metrics().

Query a specific metric:

  > folsom_metrics:get_metric_value(Name).

Generally names of metrics are atoms or binaries.

Counters

Counter metrics provide increment and decrement capabilities for a single scalar value.

  > folsom_metrics:new_counter(Name).
  > folsom_metrics:notify({Name, {inc, Value}}).
  > folsom_metrics:notify({Name, {dec, Value}}).
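Counters map naturally onto ETS: the race-condition report in the issues below notes that folsom_metrics_counter is safe because ets:update_counter is atomic. The following is a minimal sketch of that idea, not folsom's code; the module counter_sketch and its functions are hypothetical names chosen for illustration.

```erlang
%% counter_sketch.erl: hypothetical module illustrating an atomic ETS counter.
-module(counter_sketch).
-export([new/1, inc/3, dec/3, value/2]).

%% Create a public set table holding one {Name, Count} row, starting at 0.
new(Name) ->
    Tab = ets:new(counter_sketch, [set, public]),
    true = ets:insert(Tab, {Name, 0}),
    Tab.

%% ets:update_counter/3 does the read-modify-write atomically and returns
%% the new value, so concurrent callers cannot lose increments.
inc(Tab, Name, By) -> ets:update_counter(Tab, Name, By).

%% A decrement is simply an update with a negative increment.
dec(Tab, Name, By) -> ets:update_counter(Tab, Name, -By).

%% Read the current value of the counter.
value(Tab, Name) ->
    [{Name, Count}] = ets:lookup(Tab, Name),
    Count.
```

Because every change goes through a single atomic ETS operation, no process needs to serialize updates for this metric type.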

Gauges

Gauges are point-in-time single value metrics.

  > folsom_metrics:new_gauge(Name).
  > folsom_metrics:notify({Name, Value}).

Histograms (and Timers)

Histograms are collections of values that have statistical analysis done to them, such as mean, min, max, kurtosis and percentile. They can be used like "timers" as well with the timed update functions.

  > folsom_metrics:new_histogram(Name).
  > folsom_metrics:histogram_timed_update(Name, Mod, Fun, Args).
  > folsom_metrics:histogram_timed_update(Name, Fun, Args).
  > folsom_metrics:histogram_timed_update(Name, Fun).
  > folsom_metrics:notify({Name, Value}).
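To make "statistical analysis" and "timed update" concrete, here is a hypothetical sketch (module hist_sketch) that times a function with timer:tc/1 and summarizes a list of readings. folsom itself delegates the statistics to its bear dependency; this sketch uses a simple nearest-rank percentile, which may not match bear's definition exactly.

```erlang
%% hist_sketch.erl: hypothetical module, not folsom's or bear's code.
-module(hist_sketch).
-export([timed/2, stats/1, percentile/2]).

%% Run Fun, prepend its elapsed wall-clock time in microseconds
%% (as returned by timer:tc/1) to Samples, and return {Result, NewSamples}.
timed(Fun, Samples) ->
    {Micros, Result} = timer:tc(Fun),
    {Result, [Micros | Samples]}.

%% Summary statistics over a non-empty list of numbers.
stats(Values) ->
    N = length(Values),
    [{min, lists:min(Values)},
     {max, lists:max(Values)},
     {mean, lists:sum(Values) / N},
     {p50, percentile(Values, 0.50)},
     {p95, percentile(Values, 0.95)}].

%% Nearest-rank percentile: the smallest value such that at least
%% P * N of the values are less than or equal to it.
percentile(Values, P) when P > 0, P =< 1 ->
    Sorted = lists:sort(Values),
    Rank = max(1, ceil(P * length(Sorted))),
    lists:nth(Rank, Sorted).
```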

Histogram sample types

Each histogram draws its values from a reservoir of readings. You can select a sample type for a histogram by passing the name of the sample type as an atom when you create a new histogram. Some sample types take further arguments. The purpose of a sample type is to control the size and characteristics of the reservoir of readings the histogram performs analysis upon.

Folsom currently provides the following sample types:

`uniform`

This is a random uniform sample over the stream of readings. This is the default sample type, bounded in size to 1028 readings. Once Size readings have been taken, new readings replace older readings in the reservoir at random. You can set the sample size at creation time:

  > folsom_metrics:new_histogram(Name, uniform, Size::integer()).

Be sure you understand why before you do this.
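The uniform sample type is a reservoir sample. Below is a sketch of classic reservoir sampling (Algorithm R, as presented in Vitter's paper on random sampling with a reservoir), in which every reading seen so far has the same chance, Size / N, of being in the reservoir. It illustrates the technique under that assumption and is not a copy of folsom_sample_uniform; the module name reservoir_r is hypothetical.

```erlang
%% reservoir_r.erl: hypothetical Algorithm R reservoir sampler.
-module(reservoir_r).
-export([new/1, update/2, values/1]).

%% State: {Size, Count, Slots}, where Slots maps slot 1..Size to a reading
%% and Count is the number of readings seen so far.
new(Size) when is_integer(Size), Size > 0 -> {Size, 0, #{}}.

%% While fewer than Size readings have been seen, keep every reading.
update({Size, Count, Slots}, Value) when Count < Size ->
    N = Count + 1,
    {Size, N, Slots#{N => Value}};
%% Afterwards, reading number N is kept with probability Size / N:
%% draw R uniformly from 1..N and overwrite slot R only if R =< Size.
update({Size, Count, Slots}, Value) ->
    N = Count + 1,
    case rand:uniform(N) of
        R when R =< Size -> {Size, N, Slots#{R => Value}};
        _ -> {Size, N, Slots}
    end.

values({_Size, _Count, Slots}) -> maps:values(Slots).
```

Feeding it lists:seq(1, 50000) with a reservoir of 5000 should give a sample mean close to 25000, which is the same check one of the issues below applies to folsom_sample_uniform.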

`exdec`

This is a sample that exponentially decays less significant readings over time so as to give greater significance to newer readings. Read more in the Forward Decay paper (http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf). Again, you can change the defaults at creation time if you think you need to:

  > folsom_metrics:new_histogram(Name, exdec, Size::integer(), Alpha::float()).
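A sketch of forward-decay priority sampling in the style of the Forward Decay paper: each reading at time T gets weight exp(Alpha * (T - L)) relative to a fixed landmark L, its priority is that weight divided by a uniform random number, and the Size highest-priority readings are kept. This is an assumption about the general technique, not folsom_sample_exdec's code; fwd_decay and its functions are hypothetical. Re-sorting the whole list on every update keeps the sketch short but costs O(Size log Size) per reading.

```erlang
%% fwd_decay.erl: hypothetical forward-decay reservoir, for illustration only.
-module(fwd_decay).
-export([new/3, update/3, values/1, weight/3]).

%% State: {Size, Alpha, Landmark, Items}; Items is a list of
%% {Priority, Value} pairs, at most Size long.
new(Size, Alpha, Landmark) -> {Size, Alpha, Landmark, []}.

%% Forward-decay weight of a reading at time T relative to landmark L.
weight(Alpha, Landmark, T) -> math:exp(Alpha * (T - Landmark)).

%% Priority = weight / U with U uniform in (0.0, 1.0]; keep the Size
%% readings with the highest priorities, so newer readings tend to survive.
update({Size, Alpha, Landmark, Items}, T, Value) ->
    U = 1.0 - rand:uniform(),
    Priority = weight(Alpha, Landmark, T) / U,
    Sorted = lists:reverse(lists:keysort(1, [{Priority, Value} | Items])),
    {Size, Alpha, Landmark, lists:sublist(Sorted, Size)}.

values({_Size, _Alpha, _Landmark, Items}) -> [V || {_P, V} <- Items].
```

Because the weight grows without bound as T moves away from L, a long-running implementation has to move the landmark forward and rescale the stored priorities from time to time.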

`slide`

This is a sliding window in time over a stream of readings. The default window size is 60 seconds: every reading that occurs within the sliding 60-second window is stored, and older readings are discarded. If you have a lot of readings per minute, the reservoir may get large and calculating statistics will take correspondingly longer. You can set the window size by providing a number of seconds:

  > folsom_metrics:new_histogram(Name, slide, Seconds::integer()).
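A minimal ETS sketch of a sliding window. Readings are keyed by {Second, Ref} and trimmed with ets:select_delete/2 using the same match-spec shape that appears in the folsom_sample_slide:trim/2 crash log in the issues below; everything else (the module slide_sketch and its functions) is hypothetical.

```erlang
%% slide_sketch.erl: hypothetical sliding-window reservoir in ETS.
-module(slide_sketch).
-export([new/0, add/3, trim/3, values/1]).

%% Each reading is stored as {{Second, Ref}, Value}; the unique ref lets
%% several readings share the same second in a set table.
new() -> ets:new(slide_sketch, [set, public]).

add(Tab, Second, Value) ->
    true = ets:insert(Tab, {{Second, make_ref()}, Value}).

%% Delete every reading older than Now - Window seconds and return how
%% many readings were removed.
trim(Tab, Now, Window) ->
    Oldest = Now - Window,
    ets:select_delete(Tab, [{{{'$1', '_'}, '_'}, [{'<', '$1', Oldest}], [true]}]).

values(Tab) -> [V || {_Key, V} <- ets:tab2list(Tab)].
```

Calling trim/3 periodically (for example once per second) bounds the table to roughly one window's worth of readings.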

`slide_uniform`

This is a sliding window in time over a stream of readings with a random uniform sample per second, to bound the total number of readings. The maximum size of the reservoir is window size * sample size; with the defaults of a 60-second window and a sample size of 1028, that is at most 60 * 1028 = 61,680 readings. Again, you can change these at creation time:

  > folsom_metrics:new_histogram(Name, slide_uniform, {Secs::integer(), Size::integer()}).

Histories

Histories are a collection of past events, such as errors or log messages.

  > folsom_metrics:new_history(Name).
  > folsom_metrics:get_history_values(Name, Count). % get more than the default number of history items back
  > folsom_metrics:notify({Name, Value}).

Meters

Meters are increment-only counters with mean rates and exponentially weighted moving averages applied to them, similar to a Unix load average.

  > folsom_metrics:new_meter(Name).
  > folsom_metrics:notify({Name, Value}).
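A sketch of one exponentially weighted moving average, following the formulation in Coda Hale's metrics library (which folsom cites as its inspiration): events accumulate between ticks, and each tick folds the instantaneous rate into the average with Alpha = 1 - exp(-Interval / (60 * Minutes)). The module ewma_sketch is hypothetical, and whether folsom uses the same tick interval is not confirmed here.

```erlang
%% ewma_sketch.erl: hypothetical EWMA rate, for illustration only.
-module(ewma_sketch).
-export([new/2, update/2, tick/1, rate/1]).

-record(ewma, {alpha, interval, uncounted = 0, rate = undefined}).

%% Alpha for an average over Minutes, ticked every IntervalSecs seconds
%% (the same decay form as the Unix load average).
new(Minutes, IntervalSecs) ->
    Alpha = 1 - math:exp(-IntervalSecs / (60 * Minutes)),
    #ewma{alpha = Alpha, interval = IntervalSecs}.

%% Record N events since the last tick.
update(E = #ewma{uncounted = U}, N) -> E#ewma{uncounted = U + N}.

%% Fold the events-per-second seen since the last tick into the average;
%% the very first tick just adopts the instantaneous rate.
tick(E = #ewma{alpha = A, interval = I, uncounted = U, rate = R}) ->
    Instant = U / I,
    NewRate = case R of
                  undefined -> Instant;
                  _ -> R + A * (Instant - R)
              end,
    E#ewma{uncounted = 0, rate = NewRate}.

rate(#ewma{rate = R}) -> R.
```

With Minutes = 1 and a 5-second tick, a burst of 50 events gives a rate of 10.0 events per second, which then decays by a factor of exp(-5/60) on each idle tick.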

`Spiral` meter

A spiral is a type of meter that has a one-minute sliding window count. The meter tracks an increment-only counter and a total for the last minute. This is a sliding count, with older readings dropping off per second.

  > folsom_metrics:new_spiral(Name).
  > folsom_metrics:notify({Name, Count}).
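A sketch of the spiral's bookkeeping, assuming one-second buckets keyed by Unix second: a running total plus a sum over the buckets from the last 60 seconds. spiral_sketch is a hypothetical module, not folsom's implementation.

```erlang
%% spiral_sketch.erl: hypothetical one-minute sliding count.
-module(spiral_sketch).
-export([new/0, incr/3, one/2, count/1]).

%% State: {Total, Buckets}; Buckets maps a Unix second to the count in it.
new() -> {0, #{}}.

%% Add N at Second, dropping buckets that have left the 60-second window.
incr({Total, Buckets}, Second, N) ->
    Live = maps:filter(fun(S, _C) -> S > Second - 60 end, Buckets),
    {Total + N, maps:update_with(Second, fun(C) -> C + N end, N, Live)}.

%% Count over the 60 seconds ending at Now, i.e. seconds Now-59 .. Now.
one({_Total, Buckets}, Now) ->
    lists:sum([C || {S, C} <- maps:to_list(Buckets), S > Now - 60, S =< Now]).

%% The increment-only total across all time.
count({Total, _Buckets}) -> Total.
```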

Meter Reader

Meter readers are like meters, except that the values passed to them are monotonically increasing readings, e.g., from a water or gas meter, CPU jiffies, or an I/O operation count.

  > folsom_metrics:new_meter_reader(Name).
  > folsom_metrics:notify({Name, Value}).
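The core step for monotonically increasing readings is turning consecutive readings into a rate. A minimal sketch, assuming readings arrive as {Second, Reading} pairs with strictly increasing timestamps; reader_sketch is a hypothetical module, not folsom's implementation.

```erlang
%% reader_sketch.erl: hypothetical conversion of cumulative readings to rates.
-module(reader_sketch).
-export([rates/1]).

%% Given [{Second, Reading}] in time order, where Reading only ever grows
%% (bytes from /proc/net/dev, CPU jiffies, ...), return [{Second, Rate}]
%% with the per-second rate between each pair of consecutive readings.
%% Timestamps must be strictly increasing.
rates([{T0, R0}, {T1, R1} = Next | Rest]) ->
    [{T1, (R1 - R0) / (T1 - T0)} | rates([Next | Rest])];
rates(_) ->
    [].
```

For a network counter measured in bytes, multiplying each rate by 8 gives bits per second.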

Metrics groups/tags

Certain users might want to group and query metrics monitoring a common task. In order to do so, they can tag metrics:

  > folsom_metrics:tag_metric(Name, Tag).

and untag metrics:

  > folsom_metrics:untag_metric(Name, Tag).

Users can query a list of tuples [{Name, Value}] of all metrics with a given tag:

  > folsom_metrics:get_metrics_value(Tag).

If only a certain type of metrics from a given group is desired, one can specify so:

  > folsom_metrics:get_metrics_value(Tag, Type).

where Type is one of counter, gauge, histogram, history, meter, meter_reader, duration or spiral.

Erlang VM

folsom also produces Erlang VM statistics.

The result of erlang:memory/0:

   > folsom_vm_metrics:get_memory().

The result of erlang:system_info/1:

   > folsom_vm_metrics:get_system_info().

The result of erlang:statistics/1:

   > folsom_vm_metrics:get_statistics().

The result of erlang:process_info/1:

   > folsom_vm_metrics:get_process_info(). %% use with caution

The result of inet:getstat/1, prim_inet:getstatus/1, erlang:port_info/1, prim_inet:gettype/1, inet:getopts/1, inet:sockname/1:

   > folsom_vm_metrics:get_port_info(). %% use with caution

The result of ets:info/1 and dets:info/1 across all tables:

   > folsom_vm_metrics:get_ets_info().
   > folsom_vm_metrics:get_dets_info().

folsom's Issues

Folsom fails to start for host in MongooseIM

I'm not sure if this is an issue with MongooseIM or folsom, but I get this error when running master, whereas I don't get it when running tag 0.8.1. I don't have any more info than that, unfortunately.

2014-08-29 22:03:55.473 [critical] <0.138.0>@gen_mod:start_module:84 Problem starting the module mod_metrics for host <<"admin">>
 options: []
 error: badarg
[{ets,member,[folsom,{<<"admin">>,sessionSuccessfulLogins}],[]},
 {folsom_ets,handler_exists,1,[{file,"src/folsom_ets.erl"},{line,96}]},
 {folsom_ets,add_handler,2,[{file,"src/folsom_ets.erl"},{line,64}]},
 {mod_metrics,'-init_folsom/1-fun-0-',2,
              [{file,"src/mod_metrics.erl"},{line,36}]},
 {lists,foreach,2,[{file,"lists.erl"},{line,1323}]},
 {mod_metrics,init_folsom,1,[{file,"src/mod_metrics.erl"},{line,35}]},
 {mod_metrics,start,2,[{file,"src/mod_metrics.erl"},{line,22}]},
 {gen_mod,start_module,3,[{file,"src/gen_mod.erl"},{line,73}]}]
2014-08-29 22:03:55.473 [critical] <0.138.0>@gen_mod:start_module:89 ejabberd initialization was aborted because a module start failed.
The trace is [{ets,member,[folsom,{<<"admin">>,sessionSuccessfulLogins}],[]},{folsom_ets,handler_exists,1,[{file,"src/folsom_ets.erl"},{line,96}]},{folsom_ets,add_handler,2,[{file,"src/folsom_ets.erl"},{line,64}]},{mod_metrics,'-init_folsom/1-fun-0-',2,[{file,"src/mod_metrics.erl"},{line,36}]},{lists,foreach,2,[{file,"lists.erl"},{line,1323}]},{mod_metrics,init_folsom,1,[{file,"src/mod_metrics.erl"},{line,35}]},{mod_metrics,start,2,[{file,"src/mod_metrics.erl"},{line,22}]},{gen_mod,start_module,3,[{file,"src/gen_mod.erl"},{line,73}]}].

Tag 0.8.2

We want to use meck 0.8.2, but folsom 0.8.1 depends on meck 0.8.1, so the two conflict. Could you tag folsom with 0.8.2?
Thanks.

Traffic measurement

I'm collecting network stats from /proc/net/dev and netstat -ib. These logs show monotonically growing numbers. I need to remember the previous value for each interface and, once per second (for example), subtract it to calculate the per-second bitrate on that interface.

Then I want to push the data into a history and keep the history for the last hour.

What are the proper tools to use in folsom for my task?

Maybe I should post this question to some mailing list?

update README.md: application:start(bear) is required before application:start(folsom)

In the third block of README.md, I would suggest adding application:start(bear) before application:start(folsom). Otherwise the user will get this error:
{error,{not_started,bear}}

Grouped metrics

Hi,

I'm not sure if I've just missed how to do this, but is it possible to group metrics together? For example, say I have 10 TCP connections that are sending and receiving messages. I'm currently capturing the number of messages a second with folsom and it's working perfectly.

If I want to make this available to say a web browser for viewing, it would be useful to be able to retrieve those metrics as a group. For example if I have a page like:

/metrics/tcp/

It would be great to be able to easily pull back all the TCP statistics to display on that page. Right now I guess the best way is to simply name them appropriately, but being able to add them to groups (ideally multiple groups) would be really awesome.

Another area where this would be useful is when there is a set of different metrics for the same logical entity. Then on the page I can request everything in that entity's group and easily get all the metrics I'd need to display it.

If it's not currently doable, I'd be interested in offering a bounty to someone who could add such a feature and get it accepted to core :)

Cheers,

Pete

Folsom supervisor dies every time the Erlang VM gets any exception/error

I think this should not happen:

tmr@gersemi:~/src/tmp/folsom$ erl -pa deps/bear/ebin -pa ebin -sname foo
Erlang R15B (erts-5.9) [source] [64-bit] [smp:8:8] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9  (abort with ^G)
(foo@gersemi)1> 
(foo@gersemi)1> 
(foo@gersemi)1> 
(foo@gersemi)1> folsom_sup:start_link ().
{ok,<0.42.0>}
(foo@gersemi)2> 
(foo@gersemi)2> a=b.
** exception error: no match of right hand side value b
(foo@gersemi)3> 
=ERROR REPORT==== 24-Apr-2012::22:26:46 ===
** Generic server folsom_sup terminating 
** Last message in was {'EXIT',<0.40.0>,{{badmatch,b},[{erl_eval,expr,3,[]}]}}
** When Server state == {state,
                            {local,folsom_sup},
                            one_for_one,
                            [{child,<0.44.0>,folsom_metrics_histogram_ets,
                                 {folsom_metrics_histogram_ets,start_link,[]},
                                 permanent,2000,worker,
                                 [folsom_metrics_histogram_ets]},
                             {child,<0.43.0>,folsom_meter_timer_server,
                                 {folsom_meter_timer_server,start_link,[]},
                                 permanent,2000,worker,
                                 [folsom_meter_timer_server]}],
                            undefined,1000,3600,[],folsom_sup,[]}
** Reason for termination == 
** {{badmatch,b},[{erl_eval,expr,3,[]}]}

(foo@gersemi)3>

Am I doing something wrong?

Tom

"no function clause" error when calling folsom_vm_metrics:get_system_info()

I'm getting the error below when calling folsom_vm_metrics:get_system_info() on our servers:

** exception error: no function clause matching
                folsom_vm_metrics:'-convert_system_info/1-lc$^1/1-1-'({logical,
                                                                       0}) (src/folsom_vm_metrics.erl, line 102)
    in function  folsom_vm_metrics:'-get_system_info/0-lc$^0/1-0-'/1 (src/folsom_vm_metrics.erl, line 52)
    in call from folsom_vm_metrics:'-get_system_info/0-lc$^0/1-0-'/1 (src/folsom_vm_metrics.erl, line 52)

The same call works fine on my local machine. I believe the problem is related to what's being returned by erlang:system_info(cpu_topology). On my local machine, the value is:

[{processor,{logical,0}},{processor,{logical,1}}]

But on the server it's just:

{logical,0}

All of our servers are hosted on Amazon EC2 instances, so I expect that may account for the difference, but shouldn't folsom be able to handle both cases?
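A defensive sketch, assuming only the shapes reported in this issue plus the documented undefined case: normalize whatever erlang:system_info(cpu_topology) returns into a list before iterating over it. The module name is hypothetical and this is not necessarily the fix folsom adopted.

```erlang
%% cpu_topology_sketch.erl: hypothetical normalizer for cpu_topology output.
-module(cpu_topology_sketch).
-export([normalize/1]).

%% erlang:system_info(cpu_topology) may return undefined, a list of
%% topology entries, or (as in the EC2 report) a bare {logical, Id} tuple.
%% Wrap the non-list cases so callers can always iterate over a list.
normalize(undefined) -> [];
normalize(List) when is_list(List) -> List;
normalize(Single) when is_tuple(Single) -> [Single].
```

A caller would then iterate over normalize(erlang:system_info(cpu_topology)) instead of the raw value.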

A race condition can stop trimming of slide servers

If SampleMod:trim/2 gets called by a folsom_sample_slide_server after the ETS table the server is trimming has gone away, it will go into a restart loop and eventually bring down the folsom_sample_slide_sup supervisor.

Pull request #99 fixes this.

A log section that shows this happening:

2015-10-27 12:40:03.889 [info] <0.15943.9>@ws_handler:websocket_terminate:278 Authenticated client(...) from ... disconnected
2015-10-27 12:40:03.889 [error] <0.15942.9> gen_server <0.15942.9> terminated with reason: bad argument in call to ets:select_delete(2230132366, [{{{'$1','_'},'_'},[{'<','$1',1445949598}],[true]}]) in folsom_sample_slide:trim/2 line 64
2015-10-27 12:40:03.890 [error] <0.15942.9> CRASH REPORT Process <0.15942.9> with 0 neighbours exited with reason: bad argument in call to ets:select_delete(2230132366, [{{{'$1','_'},'_'},[{'<','$1',1445949598}],[true]}]) in folsom_sample_slide:trim/2 line 64
2015-10-27 12:40:03.890 [error] <0.7667.3> Supervisor folsom_sample_slide_sup had child undefined started with folsom_sample_slide_server:start_link(folsom_sample_slide, 2230132366, 5) at <0.15942.9> exit with reason bad argument in call to ets:select_delete(2230132366, [{{{'$1','_'},'_'},[{'<','$1',1445949598}],[true]}]) in folsom_sample_slide:trim/2 line 64 in context child_terminated
2015-10-27 12:40:06.394 [error] <0.16357.9> gen_server <0.16357.9> terminated with reason: bad argument in call to ets:select_delete(2230132366, [{{{'$1','_'},'_'},[{'<','$1',1445949601}],[true]}]) in folsom_sample_slide:trim/2 line 64
2015-10-27 12:40:06.394 [error] <0.16357.9> CRASH REPORT Process <0.16357.9> with 0 neighbours exited with reason: bad argument in call to ets:select_delete(2230132366, [{{{'$1','_'},'_'},[{'<','$1',1445949601}],[true]}]) in folsom_sample_slide:trim/2 line 64
2015-10-27 12:40:06.394 [error] <0.7667.3> Supervisor folsom_sample_slide_sup had child undefined started with folsom_sample_slide_server:start_link(folsom_sample_slide, 2230132366, 5) at <0.16357.9> exit with reason bad argument in call to ets:select_delete(2230132366, [{{{'$1','_'},'_'},[{'<','$1',1445949601}],[true]}]) in folsom_sample_slide:trim/2 line 64 in context child_terminated
2015-10-27 12:40:08.895 [error] <0.16365.9> gen_server <0.16365.9> terminated with reason: bad argument in call to ets:select_delete(2230132366, [{{{'$1','_'},'_'},[{'<','$1',1445949603}],[true]}]) in folsom_sample_slide:trim/2 line 64
2015-10-27 12:40:08.895 [error] <0.16365.9> CRASH REPORT Process <0.16365.9> with 0 neighbours exited with reason: bad argument in call to ets:select_delete(2230132366, [{{{'$1','_'},'_'},[{'<','$1',1445949603}],[true]}]) in folsom_sample_slide:trim/2 line 64
2015-10-27 12:40:08.895 [error] <0.7667.3> Supervisor folsom_sample_slide_sup had child undefined started with folsom_sample_slide_server:start_link(folsom_sample_slide, 2230132366, 5) at <0.16365.9> exit with reason bad argument in call to ets:select_delete(2230132366, [{{{'$1','_'},'_'},[{'<','$1',1445949603}],[true]}]) in folsom_sample_slide:trim/2 line 64 in context child_terminated
2015-10-27 12:40:11.396 [error] <0.16368.9> gen_server <0.16368.9> terminated with reason: bad argument in call to ets:select_delete(2230132366, [{{{'$1','_'},'_'},[{'<','$1',1445949606}],[true]}]) in folsom_sample_slide:trim/2 line 64
2015-10-27 12:40:11.396 [error] <0.16368.9> CRASH REPORT Process <0.16368.9> with 0 neighbours exited with reason: bad argument in call to ets:select_delete(2230132366, [{{{'$1','_'},'_'},[{'<','$1',1445949606}],[true]}]) in folsom_sample_slide:trim/2 line 64
2015-10-27 12:40:11.396 [error] <0.7667.3> Supervisor folsom_sample_slide_sup had child undefined started with folsom_sample_slide_server:start_link(folsom_sample_slide, 2230132366, 5) at <0.16368.9> exit with reason bad argument in call to ets:select_delete(2230132366, [{{{'$1','_'},'_'},[{'<','$1',1445949606}],[true]}]) in folsom_sample_slide:trim/2 line 64 in context child_terminated
2015-10-27 12:40:11.397 [error] <0.7667.3> Supervisor folsom_sample_slide_sup had child undefined started with folsom_sample_slide_server:start_link(folsom_sample_slide, 2230132366, 5) at <0.16368.9> exit with reason reached_max_restart_intensity in context shutdown
2015-10-27 12:40:11.397 [error] <0.485.0> Supervisor folsom_sup had child folsom_sample_slide_sup started with folsom_sample_slide_sup:start_link() at <0.7667.3> exit with reason shutdown in context child_terminated
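The change in pull request #99 is not shown here. As a hypothetical sketch of one way to avoid the restart loop, a trim can treat a vanished table as a reportable condition instead of a crash, so the server could stop normally rather than be restarted until the supervisor gives up. The module safe_trim_sketch is illustrative only.

```erlang
%% safe_trim_sketch.erl: hypothetical trim that tolerates a deleted table.
-module(safe_trim_sketch).
-export([trim/3]).

%% Same deletion as a sliding-window trim, but if the table is already gone
%% ets:select_delete/2 raises badarg, which is turned into {error, table_gone}.
trim(Tab, Now, Window) ->
    Oldest = Now - Window,
    try ets:select_delete(Tab, [{{{'$1', '_'}, '_'}, [{'<', '$1', Oldest}], [true]}]) of
        Deleted -> {ok, Deleted}
    catch
        error:badarg -> {error, table_gone}
    end.
```

Note that catching badarg this broadly would also hide a malformed match spec; checking ets:info(Tab) =:= undefined in the catch clause would narrow it.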

bear is not listed in .app.src dependencies

Therefore, running folsom from an Erlang release results in an error because bear is not packaged.

Dynamic atoms creation

I wonder why folsom creates atoms dynamically in pid_port_fun_to_atom.

The original commit message says "better string hinting and ip parsing", but I wonder whether that is more important than having your Erlang app crash due to atom table exhaustion.

Yes, I'd read the README and saw the caution:

folsom_vm_metrics:get_process_info(). %% use with caution
folsom_vm_metrics:get_port_info(). %% use with caution

but that isn't a serious answer, since it makes it pointless to provide metrics that you should avoid at all costs.

Folsom has moved

This repo is forked and likely unmaintained.

I no longer have access to the folsom-related projects, and it seems likely I won't in the near future. To continue maintaining them, I have moved the project to https://github.com/folsom-project. Please update your projects.

crash when deleting ets table.

I get this CRASH when deleting an ets table:

CRASH REPORT Process <0.4972.42> with 0 neighbours crashed with reason: bad argument in call to ets:delete(11567156) in folsom_sample_exdec:delete_and_rescale/4 line 122

Absolutely zero specs for most things

Hi ladies and gents,

Would appreciate if there could be some specs on some of this stuff. Helps the discoverability of your API (is that an atom, a binary or a string argument??).

Any plans for that? I'd love to help, but without prior knowledge of some of this stuff the only way to find out is through extensive examples.

Is the uniform histogram type wrong?

Please take a look at the code at

https://github.com/boundary/folsom/blob/master/src/folsom_sample_uniform.erl#L50

which updates a sample uniformly in the histogram reservoir. The L46 clause is hit whenever we have fewer than 1028 samples and we insert a new sample in the table. Once we have 1028 samples, we look at N. Suppose N is 2056 since we have taken that many samples. We take a random value, which could be 1768, and then maybe update the reservoir. In half the cases we won't be bumping the reservoir here, depending on the random outcome.

I have reservoirs with N > 1000_1000_1000. They will almost never update the reservoir. Is this intended behavior of the uniform sample type? I am afraid some of the logic is wrong and we never replace entries in the reservoir for large N.

I could change to slide_uniform to fix this, but I want to make sure I understand how this is supposed to work.

Temporal Counter

I am doing Zotonic web page statistics and rather than outsource them I'd like to track them in Folsom, with periodic dumps to external store. One of the things I am doing is a metric such as page requests. I have this as a set of counters with these names:
{nodal, page_req, Path}
{nodal, page_req, Path, Year, all, all, all}
{nodal, page_req, Path, Year, Month, all, all}
{nodal, page_req, Path, Year, Month, Day, all}
{nodal, page_req, Path, Year, Month, Day, Hour}

All but the first are created as needed, using notify/3.

What happens is that a page request comes in, the time components are gotten from the current date and time, and an {inc, 1} event is sent to each of the above counters.

The behavior I'd like is in three parts.

I) There is a type temporal_counter (or some other name) that I can declare {nodal, page_req, Path} as, and have Folsom automatically create the following counters (I changed format from above as it makes more sense I think and matches erlang:localtime()):
{{nodal, page_req, Path}, {{Year,all,all},{all,all,all}}}
{{nodal, page_req, Path}, {{Year,Month,all},{all,all,all}}}
{{nodal, page_req, Path}, {{Year,Month,Day},{all,all,all}}}
{{nodal, page_req, Path}, {{Year,Month,Day},{Hour,all,all}}}

Generically this is {Name, erlang:localtime-like pseudo match spec}. I used all rather than '_' because I thought storing '_' in ETS might be problematic.

II) When an {inc, N} or {dec, N} event is sent for a temporal counter Folsom gets the current date and time and parses it into parts. This is ok since we're tracking at the hour level. If we were tracking the stats at the second level then we should have the time sent from the client, but that's a different use case. The event is then also applied to the corresponding temporal counters, creating them if they did not already exist.

III) Getting the individual counters would not be changed, so to get the above you would use the full name. The following additional function gets the data associated with this type of counter:
-spec get_temporal_metrics(name()) -> [{name(), date_spec()}] | [].
get_temporal_metrics(Name) -> ...

If this issue is approved I'd be happy to work it, and contribute the result back to Folsom.
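Part I of the proposal boils down to deriving the four time-bucketed names from a base name and the current local time. A minimal sketch of that step, using the name format proposed above; the module temporal_names is hypothetical and nothing like it exists in folsom today.

```erlang
%% temporal_names.erl: hypothetical helper for the proposed temporal_counter.
-module(temporal_names).
-export([names/2]).

%% For a base counter name and an erlang:localtime()-style timestamp, build
%% the year, month, day and hour counter names, using the atom all for the
%% fields each counter aggregates over.
names(Name, {{Y, Mo, D}, {H, _Min, _Sec}}) ->
    [{Name, {{Y, all, all}, {all, all, all}}},
     {Name, {{Y, Mo, all}, {all, all, all}}},
     {Name, {{Y, Mo, D}, {all, all, all}}},
     {Name, {{Y, Mo, D}, {H, all, all}}}].
```

Each generated name would then receive the same {inc, N} event, creating the counter on first use as the proposal already does with notify/3.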

export lcnt info if available

this stuff:

http://www.erlang.org/doc/apps/tools/lcnt_chapter.html

support tags in configuration file

Hi,

Since in most cases metrics are already predefined in different groups, it would be preferable to initialize metrics together with their tags.

Could folsom support tags in the configuration file? For example:

   $ echo '[{folsom, [{history, [{hist1, [tag1, tag2]}, {hist2, [tag2, tag3]} ]}, {gauge, gauge1}]}].' \
      > myapp.config
   $ erl -pa ebin deps/*/ebin -config myapp.config -s folsom

The init code would change as follows:

configure_metric(New, Spec) when is_list(Spec) ->
    apply(folsom_metrics, New, Spec);
configure_metric(New, Spec) ->
    case Spec of
        {Name, Tags} ->
            folsom_metrics:New(Name),
            [folsom_metrics:tag_metric(Name, Tag) || Tag <- Tags];
        Name ->
            folsom_metrics:New(Name)
    end.

If this is acceptable, can I send a pull request?

forward decay paper has moved

The README.md exdec section has a broken link to att.com for "Forward decay". The link is preserved on archive.org, and a web search on the title suggests the new home is:

http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf

Table ownership differences can leave folsom inconsistent

The history metric creates a new ETS table when a new metric is created. The owner of that table is the process that called folsom_metrics:new_history(Name). However, the folsom table is owned by the folsom supervisor. If the process that owns the history table exits, the table is deleted along with it, but the entry in the folsom metrics table remains.

Folsom is then in an inconsistent state. Using folsom_metrics_histogram_ets to create (and therefore own) the table would probably help. Ideally folsom should have a single process that owns all ets tables so that there is consistency (a crash takes them all away, they're insulated from calling process crashes.) Better still would be to implement something like the strategy in this article http://steve.vinoski.net/blog/2011/03/23/dont-lose-your-ets-tables/

I'm raising this as a request for comments before I factor such a strategy into folsom. Opinions?

EWMA vs mean unit of measure inconsistency in meters

Meters can be used for a number of things, but consider this example:
We want to measure the number of events per second, where an event is folsom_metrics:notify(Name, 1).

So we do folsom_metrics:get_metric_value(Name). What we get in response is

[{count,...}, {one, ...}, {five, ...}, {fifteen, ...}, {day, ...}, {mean, ...}, {acceleration, ...}]

where one, five, fifteen and day are measured in events per second, but mean is measured in events per _micro_ second. Furthermore, acceleration is measured in events per second squared.

I believe it would be more consistent if mean were measured in events per second as well.

Suggested fix: divide this value by 1000000. May need to do something similar for meter reader as well.

folsom_sample_uniform sampling error

folsom_sample_uniform does not generate a true random sampling of the values.
Once the reservoir is full, each new value replaces an existing value in the reservoir.
This causes a bias for later values.
e.g.

S = lists:foldl(fun(V,A) -> folsom_sample_uniform:update(A,V) end, folsom_sample_uniform:new(5000), lists:seq(1,50000)).
lists:sum(folsom_sample_uniform:get_values(S))/5000.

gives an arithmetic mean of ~45000 instead of the expected 25000

Exactly which algorithm in the Vitter paper is supposed to be implemented here?

Folsom tag 0.6 version is 0.6-51-gdaa75cb

I'm curious why in the .app file the version for folsom in the tag 0.6 is 0.6-51-gdaa75cb instead of 0.6.

This breaks build tools like sinan that require a dependency's app version to match the version it claims in its path. I easily worked around this by changing the tag in agner to 0.6-51-gdaa75cb, but it would still be nice to have the version numbers match up.

Thanks!

Race condition in some metrics (counter and gauge are fine)

These metrics are ok:

folsom_metrics_counter (update_counter is atomic)
folsom_metrics_gauge (insert is atomic)

These metrics can drop data due to race conditions (interleaved get_value / insert):

folsom_metrics_histogram
folsom_metrics_meter
folsom_metrics_meter_reader

And this one can grow and shrink incorrectly (interleaved ets:info(…, size) / delete / insert):

folsom_metrics_history

There are some similar problems with folsom_sample_exdec, folsom_sample_none, folsom_sample_uniform but those problems would go away if folsom_metrics_histogram usage was serialized by key.

The most straightforward way out of this predicament is to set up a process per metric (of these types) for update serialization. The metrics are basically determined at compile time and their count should be smallish (hundreds maybe), so a fixed size pool isn't necessary. Most reads can still go directly to the table since it's updated atomically, but some of the histogram implementations may need some kind of get_values serialization.

folsom_sample_exdec: can delete a table before you get its contents, maybe ets:delete_all_objects/1 would be preferable to cycling the table. This would change worst case to returning somewhere between 0 and Size samples instead of crashing. Inserting a list instead of calling insert in a loop would change it such that 0 or Size samples could be returned, since inserting a list is atomic.
folsom_sample_none: could be in a state where there are Size - 1 items in the pool (this is likely acceptable)
folsom_sample_uniform: this looks like it would always be in a consistent state, so definitely acceptable.

pids and ets ids leaks

3> f(Pid1), f(Pid2), [Pid1, Pid2] = [erlang:spawn_link(fun() -> receive _ -> ok = folsom_metrics:new_histogram(<<"sorry">>, slide) end end) || _ <- lists:seq(1, 2)].
[<0.45.0>,<0.46.0>]
4>  Pid1 ! Pid2 ! sorry.
sorry
5> supervisor:which_children(folsom_sample_slide_sup).
[{undefined,<0.49.0>,worker,[folsom_sample_slide_server]},
 {undefined,<0.50.0>,worker,[folsom_sample_slide_server]}]
7> ets:tab2list(folsom_histograms).
[{<<"sorry">>,
  {histogram,slide,{slide,1028,20506,<0.50.0>}}}]

I expected that there would be only one child of folsom_sample_slide_sup, so the second child and its ETS table are leaked.

supervisor:count_children(folsom_sample_slide_sup) =/= erlang:length(ets:tab2list(folsom_histograms)).

metrics with the same leaks:

history
spiral
duration

A crash of `folsom_metrics_histogram_ets` breaks all `spiral` metrics

Due to the ordering here https://github.com/boundary/folsom/blob/master/src/folsom_ets.erl#L300 of deletes, a crashed spiral ets table means a spiral metric can never be deleted, and therefore re-created. Such a crash leaves any app that updates a spiral broken until folsom is restarted.