Stuff Goes Bad: Erlang in Anger

Locks and Blocking Operations


Locking and blocking operations will often be problematic when they’re taking unexpectedly long to execute in a process that’s constantly receiving new tasks.
One of the most common examples I’ve seen is a process blocking while accepting a connection or waiting for messages with TCP sockets.
During blocking operations of this kind, messages are free to pile up in the message queue.
One particularly bad example was in a pool manager for HTTP connections that I had written in a fork of the lhttpc library. It worked fine in most test cases we had, and we even had a connection timeout set to 10 milliseconds to be sure it never took too long [3].
After a few weeks of perfect uptime, the HTTP client pool caused an outage when one of the remote servers went down.
The reason behind this degradation was that when the remote server went down, every connection attempt suddenly took at least 10 milliseconds, the time after which a connection attempt is abandoned. With around 9,000 messages per second going to the central process, each usually taking under 5 milliseconds, the time spent per message at least doubled. The impact became similar to roughly 18,000 messages a second, and things got out of hand.
The solution we came up with was to leave the task of connecting to the caller process, and enforce the limits as if the manager had done it on its own. The blocking operations were now distributed to all users of the library, and even less work was required to be done by the manager, now free to accept more requests.
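
To make the idea concrete, here is a minimal sketch of that split. It is not the actual code from the lhttpc fork. The module name conn_limiter, the checkout/checkin messages, and the assumption that "the limits" means a maximum number of simultaneous connections are all hypothetical. The manager only counts slots, and the possibly slow gen_tcp:connect/4 call runs in the caller.

```erlang
-module(conn_limiter).
-behaviour(gen_server).
-export([start_link/1, with_connection/4]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Hypothetical manager: it only counts slots and never touches the network.
start_link(Max) ->
    gen_server:start_link(?MODULE, Max, []).

%% Runs in the caller's process. The connect (10 ms timeout, as in the
%% text) happens here, so a dead remote server stalls this caller only.
with_connection(Manager, Host, Port, Fun) ->
    case gen_server:call(Manager, checkout) of
        ok ->
            try
                case gen_tcp:connect(Host, Port, [binary, {active, false}], 10) of
                    {ok, Socket} ->
                        try Fun(Socket) after gen_tcp:close(Socket) end;
                    {error, Reason} ->
                        {error, Reason}
                end
            after
                %% Always give the slot back, even if Fun raised.
                gen_server:cast(Manager, checkin)
            end;
        {error, busy} ->
            {error, busy}
    end.

init(Max) ->
    {ok, #{max => Max, used => 0}}.

%% Cheap bookkeeping only: grant a slot if one is free, refuse otherwise.
handle_call(checkout, _From, State = #{max := Max, used := Used}) when Used < Max ->
    {reply, ok, State#{used := Used + 1}};
handle_call(checkout, _From, State) ->
    {reply, {error, busy}, State}.

handle_cast(checkin, State = #{used := Used}) ->
    {noreply, State#{used := Used - 1}}.

handle_info(_Msg, State) ->
    {noreply, State}.
```

A caller would use it as conn_limiter:with_connection(Manager, Host, Port, fun(Socket) -> ... end). This sketch leaks a slot if a caller is killed while holding one. A real version would likely monitor callers and release the slot when one dies.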


When any point of your program ends up being a central hub for receiving messages, lengthy tasks should be moved out of there if possible. Handling predictable overload [4] situations by adding more processes, which either handle the blocking operations or instead act as a buffer while the "main" process blocks, is often a good idea.
There will be increased complexity in managing more processes for activities that aren’t intrinsically concurrent, so make sure you need them before programming defensively.
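
As an illustration of the first variant (extra processes that handle the blocking operations), here is a small sketch under stated assumptions. The module name hub and the {run, Fun} request shape are hypothetical, and Fun stands in for whatever blocking operation the hub would otherwise run itself. The hub hands each request to a short-lived worker and returns to its mailbox right away. The worker answers the caller directly with gen_server:reply/2.

```erlang
-module(hub).
-behaviour(gen_server).
-export([start_link/0, request/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Fun is a placeholder for any blocking operation (hypothetical).
request(Hub, Fun) ->
    gen_server:call(Hub, {run, Fun}).

init([]) ->
    {ok, nostate}.

handle_call({run, Fun}, From, State) ->
    %% Do not block here: a worker runs the slow part and replies to the
    %% original caller itself. Failures are turned into an error tuple
    %% so the caller is not left waiting for its call timeout.
    spawn(fun() ->
        Reply = try {ok, Fun()}
                catch Class:Reason -> {error, {Class, Reason}}
                end,
        gen_server:reply(From, Reply)
    end),
    {noreply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Msg, State) ->
    {noreply, State}.
```

Spawning one worker per request is unbounded, so under heavy load this would still need some limit on the number of workers. That is part of the extra complexity mentioned above.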


Another option is to transform the blocking task into an asynchronous one. If the type of work allows it, start the long-running job and keep a token that identifies it uniquely, along with the original requester you’re doing work for. When the resource is available, have it send a message back to the server with the aforementioned token. The server will eventually get the message, match the token to the requester, and answer back, without being blocked by other requests in the meantime [5].
This option tends to be more obscure than using many processes and can quickly devolve into callback hell, but may use fewer resources.
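
The token approach can be sketched roughly as follows. This is not how redo_block is implemented. The module name async_hub and the message shapes {job, ReplyTo, Token, Job} and {done, Token, Result} are assumptions made for the example, as is the idea that the resource is a separate process that accepts them. The server stores the requester under a unique reference, returns noreply so it can keep serving, and answers once the matching done message arrives.

```erlang
-module(async_hub).
-behaviour(gen_server).
-export([start_link/1, request/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Resource is the pid of a hypothetical process that accepts
%% {job, ReplyTo, Token, Job} and later sends {done, Token, Result}.
start_link(Resource) ->
    gen_server:start_link(?MODULE, Resource, []).

request(Server, Job) ->
    gen_server:call(Server, {job, Job}).

init(Resource) ->
    {ok, #{resource => Resource, pending => #{}}}.

handle_call({job, Job}, From, State = #{resource := Resource, pending := Pending}) ->
    %% The token uniquely identifies this job; remember who asked for it.
    Token = make_ref(),
    Resource ! {job, self(), Token, Job},
    {noreply, State#{pending := Pending#{Token => From}}}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info({done, Token, Result}, State = #{pending := Pending}) ->
    %% Match the token back to its requester and answer it.
    case maps:take(Token, Pending) of
        {From, Rest} ->
            gen_server:reply(From, Result),
            {noreply, State#{pending := Rest}};
        error ->
            %% Unknown or already answered token: ignore it.
            {noreply, State}
    end;
handle_info(_Msg, State) ->
    {noreply, State}.
```

Entries stay in the pending map until the resource answers, so a real server would also need to expire tokens whose requester has already timed out.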


[3] 10 milliseconds is very short, but was fine for collocated servers used for real-time bidding.
[4] Something you know for a fact gets overloaded in production.
[5] The redo application is an example of a library doing this, in its redo_block module. The (underdocumented) module turns a pipelined connection into a blocking one, but does so while maintaining pipeline aspects to the caller. This allows the caller to know that only one call failed when a timeout occurs, not all of the in-transit ones, without having the server stop accepting requests.
