Stuff Goes Bad: Erlang in Anger

Global View

For a view of the VM in the large, it’s useful to track statistics and metrics general to the VM, regardless of the code running on it. Moreover, you should aim for a solution that allows long-term views of each metric: some problems show up as a very slow accumulation over weeks that couldn’t be detected over small time windows. Good examples of issues exposed by a long-term view include memory or process leaks, but also regular or irregular spikes in activity relative to the time of day or week, which can often require months of data to confirm.
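As a rough illustration of what "metrics general to the VM" can mean, here is a minimal sketch that takes one snapshot of a few node-wide values using only built-in functions. The module name `vm_snapshot` and the choice of keys are assumptions made for this example, not something prescribed by the text; a real setup would ship such snapshots to a store that keeps months of history.

```erlang
-module(vm_snapshot).
-export([take/0]).

%% Collect a handful of VM-wide metrics that do not depend on the
%% application code running on the node. The map keys are
%% illustrative names chosen for this sketch.
take() ->
    #{timestamp     => erlang:system_time(second),
      memory_total  => erlang:memory(total),
      memory_procs  => erlang:memory(processes),
      memory_binary => erlang:memory(binary),
      memory_ets    => erlang:memory(ets),
      process_count => erlang:system_info(process_count),
      run_queue     => erlang:statistics(run_queue)}.
```

Calling `vm_snapshot:take()` at a fixed interval and plotting `memory_total` or `process_count` over weeks is the kind of view in which a slow leak becomes visible.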

For these cases, using existing Erlang metrics applications is useful. Common options are:
 • folsom [3] to store metrics in memory within the VM, whether global or app-specific.
 • vmstats [4] and statsderl [5], sending node metrics over to graphite through statsd [6].
 • exometer [7], a fancy-pants metrics system that can integrate with folsom (among other things), and a variety of back-ends (graphite, collectd, statsd, Riak, SNMP, etc.). It’s the newest player in town.
 • ehmon [8] for output done directly to standard output, to be grabbed later through specific agents, splunk, and so on.
 • custom hand-rolled solutions, generally using ETS tables and processes periodically dumping the data [9].
 • or if you have nothing and are in trouble, a function printing stuff in a loop in a shell [10].
It is generally a good idea to explore them a bit, pick one, and get a persistence layer that will let you look through your metrics over time.
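To make the hand-rolled option more concrete, the following is a minimal sketch of counters kept in an ETS table, with an owner process that periodically dumps them to standard output. The module name `handrolled_metrics`, its function names, and the output format are all hypothetical choices for this example; it is not how folsom, exometer, or ectr are implemented, and it has no supervision or persistence.

```erlang
-module(handrolled_metrics).
-export([start/1, incr/1, incr/2, snapshot/0]).

-define(TAB, handrolled_metrics).

%% Start the owner process. It creates the named table, then dumps
%% its contents every IntervalMs milliseconds until it gets 'stop'.
%% spawn_link is used so a failure to create the table (for example
%% if it already exists) crashes the caller instead of hanging it.
start(IntervalMs) ->
    Parent = self(),
    Pid = spawn_link(fun() ->
        ets:new(?TAB, [named_table, public, set,
                       {write_concurrency, true}]),
        Parent ! {self(), ready},
        loop(IntervalMs)
    end),
    receive {Pid, ready} -> {ok, Pid} end.

incr(Name) -> incr(Name, 1).

%% Atomically bump a counter; the default object {Name, 0} is
%% inserted first if the counter does not exist yet.
%% Returns the new value.
incr(Name, N) ->
    ets:update_counter(?TAB, Name, N, {Name, 0}).

%% Current counters as a sorted list of {Name, Value} tuples.
snapshot() ->
    lists:sort(ets:tab2list(?TAB)).

loop(IntervalMs) ->
    receive
        stop -> ok
    after IntervalMs ->
        %% Dump to standard output, to be collected by an agent.
        io:format("~p ~p~n", [erlang:system_time(second), snapshot()]),
        loop(IntervalMs)
    end.
```

A possible shell session would be `handrolled_metrics:start(10000).` followed by calls such as `handrolled_metrics:incr(requests).` from application code. The table is public so any process can update it directly, which avoids funneling every increment through a single process.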

[3] https://github.com/boundary/folsom
[4] https://github.com/ferd/vmstats
[5] https://github.com/lpgauth/statsderl
[6] https://github.com/etsy/statsd/
[7] https://github.com/Feuerlabs/exometer
[8] https://github.com/heroku/ehmon
[9] Common patterns may fit the ectr application, at https://github.com/heroku/ectr
[10] The recon application has the function recon:node_stats_print/2 to do this if you’re in a pinch.
