Stuff Goes Bad: Erlang in Anger

Chapter 3 Planning for Overload


By far, the most common cause of failure I’ve encountered in real-world scenarios is the node running out of memory, and it is usually related to message queues going out of bounds.[1]
There are plenty of ways to deal with this, but knowing which one to use will require a decent understanding of the system you’re working on.
To oversimplify things, most of the projects I end up working on can be visualized as a very large bathroom sink. User and data input are flowing from the faucet.
The Erlang system itself is the sink and the pipes, and wherever the output goes (whether it’s a database, an external API or service, and so on) is the sewer system.


When an Erlang node dies because of a queue overflowing, figuring out who to blame is crucial.
Did someone put too much water in the sink? Are the sewer systems backing up? Did you just design too small a pipe?


Determining what queue blew up is not necessarily hard. This is information that can be found in a crash dump.
Finding out why it blew up is trickier. Based on the role of the process or run-time inspection, it’s possible to figure out whether causes include fast flooding, blocked processes that won’t process messages fast enough, and so on.
The most difficult part is to decide how to fix it. When the sink gets clogged up by too much waste, we will usually start by trying to make the bathroom sink itself larger (the part of our program that crashed, at the edge).
Then we figure out the sink’s drain is too small, and optimize that. Then we find out the pipes themselves are too narrow, and optimize that.
The overload gets pushed further down the system, until the sewers can’t take it anymore. At that point, we may try to add sinks or add bathrooms to help with the global input level.
Then there’s a point where things can’t be improved anymore at the bathroom’s level. There are too many logs sent around, there’s a bottleneck on databases that need the consistency, or there’s simply not enough knowledge or manpower in your organization to improve things there.


By finding that point, we identify the true bottleneck of the system; all the prior optimization was nice (and likely expensive), but it was more or less in vain.
We need to be more clever, and so things are moved back up a level. We try to massage the information going into the system to make it lighter, whether through compression, better algorithms and data representation, caching, and so on.
Even then, there are times where the overload will be too much, and we have to make the hard decisions between restricting the input to the system, discarding it, or accepting that the system will reduce its quality of service up to the point it will crash.
These mechanisms fall into two broad strategies: back-pressure and load-shedding.
We’ll explore them in this chapter, along with common events that end up causing Erlang systems to blow up.
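The difference between the two strategies can be sketched with a bounded queue standing in for a process mailbox. This is a language-agnostic illustration, not an Erlang API: in a back-pressure scheme the producer is slowed down (or fails) when the queue is full, while in a load-shedding scheme excess messages are simply dropped so the queue never grows unbounded. All names below are made up for the example.

```python
import queue

# Bounded buffer playing the role of the "sink": it can only hold so much.
mailbox = queue.Queue(maxsize=5)

def backpressure_send(msg, timeout=0.01):
    """Back-pressure: block (briefly) when the queue is full, pushing the
    slowdown back onto the producer, which must wait or retry."""
    try:
        mailbox.put(msg, timeout=timeout)
        return True
    except queue.Full:
        return False  # producer is told to slow down

def loadshed_send(msg):
    """Load-shedding: never block the producer; when the queue is full,
    the message is discarded and the system keeps a bounded footprint."""
    try:
        mailbox.put_nowait(msg)
        return True
    except queue.Full:
        return False  # message dropped on the floor

# Push 8 messages at a queue that holds 5: the first 5 are accepted,
# the remaining 3 are shed.
accepted = sum(loadshed_send(n) for n in range(8))
```

Either way, the system makes an explicit choice about what happens past its capacity, instead of letting the mailbox grow until the node runs out of memory.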


[1] Figuring out that a message queue is the problem is explained in Chapter 6, specifically in Section 6.2.