That's correct. I should have included a chart explicitly measuring the I/O activity of the two containers, but I can assure you there was essentially no I/O activity: a dozen file opens per second is negligible throughput. The bottleneck was solely in the cache.
Yep, the old version of this was 'why is tar/find/rsync/etc hosing my server even when I've [io]niced it to hell'. Except now (like everything else) with containers!
The solution is very simple: as mentioned in the article, just use a newer kernel and always set memory limits for containers. The blog post is based on an older kernel (2.6.32) that quite a few people irresponsibly still use in containerized environments, mostly because EL6 is so popular among enterprises.
In newer kernels, allocations from object pools are tied to the limits of the memory cgroups that requested them from userspace, if any, so you wouldn't run into this specific issue: a container would effectively be unable to use more than X MB of dentry cache entries (although there are probably other minor issues, for example related to sharing global kernel mutexes and such).
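To make the failure mode concrete, here's a minimal Python sketch of the kind of workload that pollutes the dentry cache (purely illustrative; the path and iteration count are made up, not taken from the article): repeatedly stat-ing files that don't exist creates negative dentries, and on an old kernel without per-cgroup slab accounting those allocations are charged to global kernel memory rather than to the container's limit.

    import os

    # Hypothetical reproduction sketch: stat-ing many distinct non-existent
    # paths makes the kernel allocate a negative dentry for each lookup.
    # Without per-cgroup slab accounting these dentries are charged to global
    # kernel memory, not to this process's (or container's) memory limit.
    MISSING_DIR = "/tmp/missing"  # assumed path, purely illustrative

    def pollute_dentry_cache(iterations=1000000):
        for i in range(iterations):
            try:
                os.stat(os.path.join(MISSING_DIR, "file-%d" % i))
            except FileNotFoundError:
                pass  # expected: no file exists, but a negative dentry remains

    if __name__ == "__main__":
        pollute_dentry_cache()

On a newer kernel with kernel memory accounting and a memory limit set on the container's cgroup, these same allocations would be charged to the container instead of to the host as a whole.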
I couldn't understand two things from the article:
1. If one of the two containers caused the issue, then why did you need both containers to produce it? Why wasn't running just the offending one enough?
My guess is that the "worker" container requested those non-existent files from a volume mounted by the other container; is that right?
2. Kernel hash table implementation. The whole point of a hash table is that its size is O(N), where N is the number of elements it holds.
Capping the hash table size at some constant and putting all the excess elements into its per-bucket linked lists makes it perform like a linked list divided by that constant, no surprise. So it sounds like there's a bug in the dentry hash table implementation -- it should either grow along with the element count, or stop accepting new entries/evict old ones.
> 1. If one of the two containers caused the issue, then why did you need both containers to produce it? Why wasn't running just the offending one enough?
Running just the offending one would actually have been enough, since its effects would have caused the same increased latency for every other process in the system (including itself). However, using a second container to observe the performance degradation proves the point that one container is able to affect another, which is sort of the gist of the article, since too many people think containers provide much more isolation than they actually do.
> My guess is that the "worker" container requested those non-existent files from a volume mounted by the other container; is that right?
No, the containers didn't share any volume. The dentry cache is effectively a singleton within the kernel, so even if the sets of volumes don't overlap, all processes in the system will see the performance degradation, regardless of where the files being accessed reside.
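If you want to see that global state for yourself, here's a quick Python sketch (assuming a Linux host; it just reads the standard /proc/sys/fs/dentry-state counters) showing that there is one set of dentry counters for the whole host, not one per container.

    # Read the system-wide dentry cache counters from procfs (Linux only).
    # The first two fields are nr_dentry (total dentries) and nr_unused
    # (unused dentries sitting on the LRU). These are global numbers:
    # every container on the host shares the same cache.
    with open("/proc/sys/fs/dentry-state") as f:
        fields = [int(x) for x in f.read().split()]

    nr_dentry, nr_unused = fields[0], fields[1]
    print("dentries: %d total, %d unused" % (nr_dentry, nr_unused))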
> 2. Kernel hash table implementation. The whole point of a hash table is that its size is O(N), where N is the number of elements it holds.
Your speculation is correct; however, there are sound reasons for doing it this way in the kernel (and for not letting the main array of the hash table dynamically expand/shrink), so I wouldn't consider it a bug per se. I'll refer you to this excellent comment: https://news.ycombinator.com/item?id=14660954
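If it helps, here's a toy Python sketch (purely illustrative, nothing like the kernel's actual data structures) of a fixed-size hash table with chaining; once the element count dwarfs the bucket count, every lookup effectively walks a linked list of length N divided by the number of buckets, which is exactly the degradation being described.

    class FixedHashTable:
        """Toy fixed-size hash table with separate chaining (illustration only)."""

        def __init__(self, num_buckets=1024):
            self.buckets = [[] for _ in range(num_buckets)]

        def insert(self, key, value):
            self.buckets[hash(key) % len(self.buckets)].append((key, value))

        def lookup(self, key):
            # Average chain length is N / num_buckets: with a fixed bucket
            # count, lookup cost grows linearly with N instead of staying O(1).
            for k, v in self.buckets[hash(key) % len(self.buckets)]:
                if k == key:
                    return v
            return None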
> It's not irresponsible to use a perfectly fine OS.
The first problem with this statement is the idea that there's such a thing as a "perfectly fine OS". We don't even need to consider containers: the longer an OS has been in the wild, the longer attackers have had to find and exploit its vulnerabilities.
Windows XP is a perfectly fine OS; using it nowadays is irresponsible.
> What is irresponsible is for Docker to purposefully avoid to mention that it has endless issues on these widely used OS.
That responsibility doesn't and should never fall on the developers of an application. The extent of one's responsibility as a developer is to define the recommendations for its use. Anything beyond that is entirely on the user.
One would go insane if one had to worry about every single operating system someone decided to run one's application on.
> It is not a 2.6 kernel by the way, redhat is backporting tons of stuff from the 3 and 4 branches.
"Backporting stuff" doesn't make it not the 2.6 Kernel, it very much is.
> We don't even need to consider containers: the longer an OS has been in the wild, the longer attackers have had to find and exploit its vulnerabilities.
I challenge you to find exploitable bugs in its kernel. Windows XP is not supported anymore, while RHEL 6 is.
Definitely a feasible approach. Let's just say that reality has a bit more color, and we get some other practical advantages from controlling the exact moment when we disconnect a particular client :)
Both approaches are reasonable (and there's also a third one: ship your application in containers and replace containers instead of instances).
We update existing instances because in our test environment we deploy at every single new commit (we absolutely love that), and we have hundreds (or more) a day. At that pace, replacing instances would be more time-consuming (again, for our specific use case) and less cost-efficient.
Plus, updating existing instances is handled automatically by AWS CodeDeploy, which provides a very good deployment pipeline that you can control using the aws cli tool.
There are other minor advantages but those are the two main ones.
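For what it's worth, here's a rough Python sketch of what kicking off one of those deployments could look like through boto3 (the application name, deployment group, and repository are all placeholders, and we actually drive the real pipeline through the aws cli, so treat this purely as an illustration of the API).

    import boto3

    # Hypothetical sketch: start a CodeDeploy deployment for a new commit.
    codedeploy = boto3.client("codedeploy", region_name="us-east-1")  # assumed region

    response = codedeploy.create_deployment(
        applicationName="my-app",                # placeholder application name
        deploymentGroupName="test-environment",  # placeholder deployment group
        revision={
            "revisionType": "GitHub",
            "gitHubLocation": {
                "repository": "my-org/my-app",   # placeholder repository
                "commitId": "abc123",            # the commit being deployed
            },
        },
        description="Deploy latest commit to the test environment",
    )
    print("Started deployment:", response["deploymentId"])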
We definitely observed such drops, which we attributed to presumably internal ELB scaling activity, but they happen rarely enough that for the moment they haven't been a real issue, as opposed to the one described in the article, which happened consistently at every deployment in our test environment.
Yeah, we've decided to live with the internal ELB scaling risks for the moment as well. We had the exact same situation where a deployment without gradual connection draining (even if we kept an instance in service in every AZ) would cause the ELBs to scale and drop all of our connections every time once we were at a certain scale. It definitely caused us a fair amount of confusion, as it would happen minutes after the deploy, when everything seemed to have calmed down again.
The author said he needed at least 2 instances in an AZ to avoid the bug, and used that as his workaround in the meantime while Amazon works on the bug.
To be fair, the scenario is less common because it happens only when the drained connections are terminated in a certain pattern (as shown in the charts). Still, it's definitely common enough that it can be easily replicated and cause real trouble :)
The LD_PRELOAD workaround is meant to "fix" the shellshock bug; sysdig doesn't do that, it just passively monitors every injection attempt up to the point where the injected function is actually read by a newly spawned bash.
With the LD_PRELOAD trick you can definitely secure your system, but you won't be able to see whether a new service is being used as an attack vector (for your own curiosity). With sysdig you can, and if you capture a trace file you can also follow the exact process chain that caused the propagation of the environment variable.
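To make that propagation concrete, here's a tiny Python sketch (illustrative only; on a patched bash the payload is inert and nothing interesting happens) showing how a shellshock-style payload travels through the environment into a newly spawned bash, which is exactly the chain of events you can watch with sysdig.

    import subprocess

    # A shellshock-style function definition hidden in an environment variable.
    # On a patched bash this is harmless; on a vulnerable one the trailing
    # command would run as soon as bash imports the environment.
    payload = "() { :; }; echo vulnerable"

    # The variable name is arbitrary: any exported variable reaches the child.
    env = {"EVIL": payload, "PATH": "/usr/bin:/bin"}

    # Spawning bash with that environment is the moment the injected function
    # gets read by the newly spawned shell.
    subprocess.run(["bash", "-c", "echo child bash started"], env=env)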
- At this point sysdig is estimated to have tens of thousands of users, and we haven't had a kernel bug in a while, with people (us included) regularly using it heavily in production. Of course, I see the irony of mentioning this in a "shellshock" thread
- The dkms packaging should completely hide all the complexities of maintaining a kernel module
- Part of the kernel code, if you look at the contributors, has been written/reviewed by gregkh, so we like to think the quality is "high enough"
- There might be plans at some point to try and propose a merge of the code to mainline
It's respectable, but in the end it doesn't matter. When you have to run this on thousands of systems that have not been tested with that LKM, the LKM can potentially destroy everything.
It's not as if gregkh's code were bug-free; there are a lot of bugs being fixed daily in the kernel as well.
Additionally, the kernel distribution path has better verification than sysdig's, and sorry, I'll trust that more than a few guys. It doesn't make your work any less valuable; it's just the way it is.
We just released 0.1.89 (a special release to include the shellshock chisel) a few hours ago, so distribution maintainers aren't that fast: https://github.com/draios/sysdig/releases
Assuming you have a C/C++ compiler installed (it comes with Xcode), it really takes about 2 minutes.
Lazy alternative: Homebrew should be updated in a couple of days at most; unfortunately that doesn't depend on us.
Also, notice that sysdig for OSX doesn't (yet) have live capture, so you'll just be able to run the chisel on a trace file that you previously created on a Linux host.
Yes, we most definitely can publish a private brew tap. I'm no expert as I mainly use Linux, but my understanding is that I'd need to create a brand new draios/homebrew-sysdig repo. I'll try to find the time to look into this and update the documentation. If, instead, just a PR to our main repo would suffice, feel free to send it over and we'll merge it in no time :)