Stuff Goes Bad: Erlang in Anger


Unfortunately for Erlang developers, CPU is very hard to profile. There are a few reasons for this:
 • The VM does a lot of work unrelated to processes when it comes to scheduling: a high amount of scheduling work and a high amount of work done by the Erlang processes are both hard to characterize.
 • The VM internally uses a model based on reductions, which represent an arbitrary number of work actions. Every function call, including BIFs, will increment a process reduction counter. After a given number of reductions, the process gets descheduled.
 • To avoid going to sleep when work is low, the threads that control the Erlang schedulers will do busy looping. This ensures the lowest latency possible for sudden load spikes. The VM flag +sbwt none|very_short|short|medium|long|very_long can be used to change this value.
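 The reduction counter described in the list above can be observed from inside a process. The following is a minimal sketch, not a profiling tool: the module name reds_demo and the function reductions_used/1 are made up for illustration, and only the standard erlang:process_info/2 call is assumed. The number it returns is in reductions, an arbitrary unit of work, so it is only useful for comparing one piece of code against another on the same VM.

```erlang
-module(reds_demo).
-export([reductions_used/1]).

%% Hypothetical helper: runs Fun in the calling process and returns
%% roughly how many reductions the process was charged while doing so.
%% The figure includes the small overhead of the measurement itself.
reductions_used(Fun) when is_function(Fun, 0) ->
    {reductions, Before} = erlang:process_info(self(), reductions),
    _ = Fun(),
    {reductions, After} = erlang:process_info(self(), reductions),
    After - Before.
```

 For example, reds_demo:reductions_used(fun() -> lists:seq(1, 1000) end) returns a positive integer whose exact value depends on the VM version.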
 These factors combine to make it fairly hard to find a good absolute measure of how busy your CPU actually is running Erlang code. It is common for Erlang nodes in production to do a moderate amount of work while using a lot of CPU, and yet still fit a lot more work into the remaining capacity when the workload increases.
 The most accurate representation for this data is the scheduler wall time. It’s an optional metric that needs to be turned on by hand on a node, and polled at regular intervals. It will reveal the time percentage a scheduler has been running processes and normal Erlang code, NIFs, BIFs, garbage collection, and so on, versus the amount of time it has spent idling or trying to schedule processes.
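 As a rough illustration of turning the metric on and polling it by hand, the sketch below takes two snapshots of erlang:statistics(scheduler_wall_time) and divides the change in active time by the change in total time for each scheduler. The module name sched_util and both function names are hypothetical; the calculation follows the approach documented for the scheduler_wall_time statistic, and depending on the OTP release the snapshots may also include dirty schedulers.

```erlang
-module(sched_util).
-export([sample/1, utilization/2]).

%% Pure helper: given two snapshots sorted by scheduler id, each a list
%% of {SchedulerId, ActiveTime, TotalTime}, return [{SchedulerId, Ratio}].
utilization(Ts0, Ts1) ->
    [{I, ratio(A1 - A0, T1 - T0)}
     || {{I, A0, T0}, {I, A1, T1}} <- lists:zip(Ts0, Ts1)].

%% Guard against a zero-length interval.
ratio(_Active, 0) -> 0.0;
ratio(Active, Total) -> Active / Total.

%% Enable the metric, take two snapshots IntervalMs apart, then put the
%% flag back to the state it was in before the call.
sample(IntervalMs) ->
    Old = erlang:system_flag(scheduler_wall_time, true),
    Ts0 = lists:sort(erlang:statistics(scheduler_wall_time)),
    timer:sleep(IntervalMs),
    Ts1 = lists:sort(erlang:statistics(scheduler_wall_time)),
    erlang:system_flag(scheduler_wall_time, Old),
    utilization(Ts0, Ts1).
```

 Calling sched_util:sample(1000) returns one {SchedulerId, Ratio} pair per scheduler, where a ratio near 1.0 means the scheduler spent almost the whole interval doing actual work.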


 The value here represents scheduler utilization rather than CPU utilization. The higher the ratio, the higher the workload.
 While the basic usage is explained in the Erlang/OTP reference manual 13, the value can be obtained by calling recon:

1> recon:scheduler_usage(1000).
 The function recon:scheduler_usage(N) will poll for N milliseconds (here, 1 second) and output the value of each scheduler. In this case, the VM has two very loaded schedulers (at 99.2% and 93.7% respectively), and two mostly unused ones at far below 1%. Yet, a tool like htop would report something closer to this for each core:


1 [||||||||||||||||||||||||| 70.4%]
2 [||||||| 20.6%]
3 [|||||||||||||||||||||||||||||100.0%]
4 [|||||||||||||||| 40.2%]
 The result is that a decent chunk of CPU usage that would be mostly free for scheduling actual Erlang work (assuming the schedulers are busy-waiting more than trying to select tasks to run) is being reported as busy by the OS.
 Another interesting possible behaviour is that the scheduler usage may show a higher rate (1.0) than what the OS reports. Schedulers waiting for OS resources are considered utilized, as they cannot handle more work. If the OS itself is held up on non-CPU tasks, Erlang’s schedulers may still be unable to do more work and will report a full ratio.
 These behaviours may especially be important to consider when doing capacity planning, and can be better indicators of headroom than looking at CPU usage or load.
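 For capacity planning, the per-scheduler ratios can be condensed into a single headroom estimate. The sketch below assumes its input has the [{SchedulerId, Ratio}] shape returned by recon:scheduler_usage/1; the module name sched_headroom, the function summary/1, and the definition of headroom as one minus the mean ratio are illustrative choices rather than anything prescribed by recon or the VM.

```erlang
-module(sched_headroom).
-export([summary/1]).

%% Hypothetical helper: summarize a non-empty list of
%% {SchedulerId, Ratio} pairs, with each Ratio between 0.0 and 1.0.
summary([_ | _] = Usage) ->
    Ratios = [R || {_Id, R} <- Usage],
    N = length(Ratios),
    Mean = lists:sum(Ratios) / N,
    #{schedulers => N,
      mean_usage => Mean,
      max_usage  => lists:max(Ratios),
      %% Assumption: headroom is the share of scheduler time left unused.
      headroom   => 1.0 - Mean}.
```

 With recon available on the node, sched_headroom:summary(recon:scheduler_usage(1000)) gives a quick view of how much scheduler time is still free, which can then be compared against what htop or the OS load average suggests.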