That's correct. I should have included a chart explicitly measuring the I/O activity of the two containers, but I can assure you there was essentially no I/O activity: a dozen file opens per second is negligible throughput. The bottleneck was solely in the cache.
Yep, the old version of this was 'why is tar/find/rsync/etc hosing my server even when I've [io]niced it to hell'. Except now (like everything else) with containers!
The solution is very simple: as mentioned in the article, just use a newer kernel and always set memory limits for containers. The blog post is based on an older kernel (2.6.32) that quite a few people irresponsibly still use in containerized environments, mostly because EL6 is so popular among enterprises.
In newer kernels, allocations from object pools are tied to the limits of the memory cgroups that requested them from userspace, if any, so you wouldn't run into this specific issue: a container would effectively be unable to use more than X MB of dentry cache entries (although there are probably other minor issues, for example related to sharing global kernel mutexes and such).
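To make the failure mode concrete, here's a minimal Python sketch of the kind of workload that pollutes the dentry cache (purely illustrative; the path and iteration count are made up, not taken from the article): repeatedly stat-ing files that don't exist creates negative dentries, and on an old kernel without per-cgroup slab accounting those allocations are charged to global kernel memory rather than to the container's limit.

    import os

    # Hypothetical reproduction sketch: stat-ing many distinct non-existent
    # paths makes the kernel allocate a negative dentry for each lookup.
    # Without per-cgroup slab accounting these dentries are charged to global
    # kernel memory, not to this process's (or container's) memory limit.
    MISSING_DIR = "/tmp/missing"  # assumed path, purely illustrative

    def pollute_dentry_cache(iterations=1000000):
        for i in range(iterations):
            try:
                os.stat(os.path.join(MISSING_DIR, "file-%d" % i))
            except FileNotFoundError:
                pass  # expected: no file exists, but a negative dentry remains

    if __name__ == "__main__":
        pollute_dentry_cache()

On a newer kernel with kernel memory accounting and a memory limit set on the container's cgroup, these same allocations would be charged to the container instead of to the host as a whole.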
I couldn't understand two things from the article:
1. If one of the two containers caused the issue, then why did you need both containers to produce it? Why wasn't running just the offending one enough?
My guess is that the "worker" container requested those non-existent files from a volume mounted by the other container; is that right?
2. Kernel hash table implementation. The whole point of a hash table is that its size is O(N), where N is the number of elements it holds.
Capping the hash table size at some constant and putting all the excess elements into its per-bucket linked lists makes it perform like a linked list divided by that constant, no surprise. So it sounds like there's a bug in the dentry hash table implementation -- it should either grow along with the element count, or stop accepting new entries/evict old ones.
> 1. If one of the two containers caused the issue, then why did you need both containers to produce it? Why wasn't running just the offending one enough?
Running just the offending one would actually have been enough, since its effects would have caused the same increased latency for every other process in the system (including itself). However, using a second container to observe the performance degradation proves the point that one container is able to affect another, which is sort of the gist of the article, since too many people think containers provide much more isolation than they actually do.
> My guess is that the "worker" container requested those non-existent files from a volume mounted by the other container; is that right?
No, the containers didn't share any volume. The dentry cache is effectively a singleton within the kernel, so even if the sets of volumes don't overlap, all processes in the system will see the performance degradation, regardless of where the files being accessed reside.
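If you want to see that global state for yourself, here's a quick Python sketch (assuming a Linux host; it just reads the standard /proc/sys/fs/dentry-state counters) showing that there is one set of dentry counters for the whole host, not one per container.

    # Read the system-wide dentry cache counters from procfs (Linux only).
    # The first two fields are nr_dentry (total dentries) and nr_unused
    # (unused dentries sitting on the LRU). These are global numbers:
    # every container on the host shares the same cache.
    with open("/proc/sys/fs/dentry-state") as f:
        fields = [int(x) for x in f.read().split()]

    nr_dentry, nr_unused = fields[0], fields[1]
    print("dentries: %d total, %d unused" % (nr_dentry, nr_unused))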
> 2. Kernel hash table implementation. The whole point of a hash table is that its size is O(N), where N is the number of elements it holds.
Your speculation is correct; however, there are sound reasons for doing it this way in the kernel (and for not letting the main array of the hash table dynamically expand/shrink), so I wouldn't consider it a bug per se. I'll refer you to this excellent comment: https://news.ycombinator.com/item?id=14660954
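If it helps, here's a toy Python sketch (purely illustrative, nothing like the kernel's actual data structures) of a fixed-size hash table with chaining; once the element count dwarfs the bucket count, every lookup effectively walks a linked list of length N divided by the number of buckets, which is exactly the degradation being described.

    class FixedHashTable:
        """Toy fixed-size hash table with separate chaining (illustration only)."""

        def __init__(self, num_buckets=1024):
            self.buckets = [[] for _ in range(num_buckets)]

        def insert(self, key, value):
            self.buckets[hash(key) % len(self.buckets)].append((key, value))

        def lookup(self, key):
            # Average chain length is N / num_buckets: with a fixed bucket
            # count, lookup cost grows linearly with N instead of staying O(1).
            for k, v in self.buckets[hash(key) % len(self.buckets)]:
                if k == key:
                    return v
            return None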
> It's not irresponsible to use a perfectly fine OS.
The first problem with this statement is the idea that there's such a thing as a "perfectly fine OS". We don't even need to consider containers: the longer an OS has been in the wild, the longer attackers have had to find and exploit its vulnerabilities.
Windows XP is a perfectly fine OS; using it nowadays is irresponsible.
> What is irresponsible is for Docker to purposefully avoid to mention that it has endless issues on these widely used OS.
That responsibility doesn't and should never fall on the developers of an application. The extent of one's responsibility as a developer is to define the recommendations for its use. Anything beyond that is entirely on the user.
One would go insane if one had to worry about every single operating system someone decided to run one's application on.
> It is not a 2.6 kernel by the way, redhat is backporting tons of stuff from the 3 and 4 branches.
"Backporting stuff" doesn't make it not the 2.6 Kernel, it very much is.
> We don't even need to consider containers: the longer an OS has been in the wild, the longer attackers have had to find and exploit its vulnerabilities.
I challenge you to find exploitable bugs in its kernel. Windows XP is not supported anymore, while RHEL 6 is.
Definitely a feasible approach. Let's just say that reality has a bit more color, and we get some other practical advantages from controlling the exact moment when we disconnect a particular client :)
Both approaches are reasonable (and there's also a third one: ship your application in containers and replace containers instead of instances).
We update existing instances because in our test environment we deploy at every single new commit (we absolutely love that), and we have hundreds (or more) a day. At that pace, replacing instances would be more time-consuming (again, for our specific use case) and less cost-efficient.
Plus, updating existing instances is handled automatically by AWS CodeDeploy, which provides a very good deployment pipeline that you can control using the aws cli tool.
There are other minor advantages but those are the two main ones.
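For what it's worth, here's a rough Python sketch of what kicking off one of those deployments could look like through boto3 (the application name, deployment group, and repository are all placeholders, and we actually drive the real pipeline through the aws cli, so treat this purely as an illustration of the API).

    import boto3

    # Hypothetical sketch: start a CodeDeploy deployment for a new commit.
    codedeploy = boto3.client("codedeploy", region_name="us-east-1")  # assumed region

    response = codedeploy.create_deployment(
        applicationName="my-app",                # placeholder application name
        deploymentGroupName="test-environment",  # placeholder deployment group
        revision={
            "revisionType": "GitHub",
            "gitHubLocation": {
                "repository": "my-org/my-app",   # placeholder repository
                "commitId": "abc123",            # the commit being deployed
            },
        },
        description="Deploy latest commit to the test environment",
    )
    print("Started deployment:", response["deploymentId"])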
We definitely observed such drops, which we attributed to presumably internal ELB scaling activity, but they happen rarely enough that for the moment they haven't been a real issue, as opposed to the one described in the article, which happened consistently at every deployment in our test environment.
Yeah, we've decided to live with the internal ELB scaling risks for the moment as well. We had the exact same situation where a deployment without gradual connection draining (even if we kept an instance in service in every AZ) would cause the ELBs to scale and drop all of our connections every time once we were at a certain scale. It definitely caused us a fair amount of confusion, as it would happen minutes after the deploy, when everything seemed to have calmed down again.
The author said he needed at least 2 instances in an AZ to avoid the bug, and used that as his workaround in the meantime while Amazon works on the bug.
To be fair, the scenario is less common because it happens only when the drained connections are terminated in a certain pattern (as shown in the charts). Still, it's definitely common enough that it can be easily replicated and cause real trouble :)
The LD_PRELOAD workaround is meant to "fix" the shellshock bug; sysdig doesn't do that, it just passively monitors every injection attempt up to the point where the injected function is actually read by a newly spawned bash.
With the LD_PRELOAD trick you can definitely secure your system, but you won't be able to see whether a new service is being used as an attack vector (for your own curiosity). With sysdig you can, and if you capture a trace file you can also follow the exact process chain that caused the propagation of the environment variable.
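To make that propagation concrete, here's a tiny Python sketch (illustrative only; on a patched bash the payload is inert and nothing interesting happens) showing how a shellshock-style payload travels through the environment into a newly spawned bash, which is exactly the chain of events you can watch with sysdig.

    import subprocess

    # A shellshock-style function definition hidden in an environment variable.
    # On a patched bash this is harmless; on a vulnerable one the trailing
    # command would run as soon as bash imports the environment.
    payload = "() { :; }; echo vulnerable"

    # The variable name is arbitrary: any exported variable reaches the child.
    env = {"EVIL": payload, "PATH": "/usr/bin:/bin"}

    # Spawning bash with that environment is the moment the injected function
    # gets read by the newly spawned shell.
    subprocess.run(["bash", "-c", "echo child bash started"], env=env)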
- At this point sysdig is estimated to have tens of thousands of users, and we haven't had a kernel bug in a while, with people (us included) regularly using it heavily in production. Of course, I see the irony of mentioning this in a "shellshock" thread
- The dkms packaging should completely hide all the complexities of maintaining a kernel module
- Part of the kernel code, if you look at the contributors, has been written/reviewed by gregkh, so we like to think the quality is "high enough"
- There might be plans at some point to try and propose a merge of the code to mainline
It's respectable, but in the end it doesn't matter. When you have to run this on thousands of systems that have not been tested with that LKM, the LKM can potentially destroy everything.
It's not as if gregkh's code were bug-free; there are a lot of bugs being fixed daily in the kernel as well.
Additionally, the kernel distribution path has better verification than sysdig's, and sorry, I'll trust that more than a few guys. It doesn't make your work any less valuable; it's just the way it is.
We just released 0.1.89 (a special release to include the shellshock chisel) a few hours ago, so distribution maintainers aren't that fast: https://github.com/draios/sysdig/releases
Assuming you have a C/C++ compiler installed (it comes with Xcode), it really takes about 2 minutes.
Lazy alternative: Homebrew should be updated in a couple of days at most; unfortunately that doesn't depend on us.
Also, notice that sysdig for OSX doesn't (yet) have live capture, so you'll just be able to run the chisel on a trace file that you previously created on a Linux host.
Yes, we most definitely can publish a private brew tap. I'm no expert as I mainly use Linux, but my understanding is that I'd need to create a brand new draios/homebrew-sysdig repo. I'll try to find the time to look into this and update the documentation. If, instead, just a PR to our main repo would suffice, feel free to send it over and we'll merge it in no time :)