High on %wa from top command, is there any way to constrain it?

Jason Zhu asked:

Here’s the last top output I captured before the server got stuck:

top - 18:26:10 up 238 days,  5:43,  3 users,  load average: 1782.01, 1824.47, 1680.36
Tasks: 1938 total,   1 running, 1937 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.4%us,  3.0%sy,  0.0%ni,  0.0%id, 94.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  65923016k total, 65698400k used,   224616k free,    13828k buffers
Swap: 33030136k total, 17799704k used, 15230432k free,   157316k cached

As you can see, since I launched about 2000 processes running the hadoop get command, %wa has been very high. I already limit memory and CPU with cgroups; would it help to limit disk I/O, too? If so, could anyone give me an idea of how to do that in cgroups? Thanks in advance.

My answer:

You don’t have enough RAM to run these 2000 processes.

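The top output shows that you have used essentially all of your 64GB of RAM (only about 220MB is free, and buffers and cache together hold about 170MB), and are also using an additional 17GB of swap. Your server is thrashing, trying to swap data in and out, valiantly trying to let each of those 2000 processes do something. The CPUs sit idle waiting on that swap I/O, which is why %wa is at 94.5%.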
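As a rough check on those figures, this Python sketch converts the numbers from the top output above into GiB. The per-task estimate is my own assumption, not something top reports: it divides all used RAM and swap evenly across the 1938 tasks, ignoring kernel memory and the fact that tasks differ in size.

```python
# Values copied from the Mem:, Swap: and Tasks: lines of the top output (in KiB).
MEM_TOTAL_KIB = 65923016
MEM_USED_KIB = 65698400
SWAP_USED_KIB = 17799704
TASKS_TOTAL = 1938

KIB_PER_GIB = 1024 ** 2


def to_gib(kib):
    """Convert a KiB count, as printed by top, to GiB."""
    return kib / KIB_PER_GIB


ram_total_gib = to_gib(MEM_TOTAL_KIB)
swap_used_gib = to_gib(SWAP_USED_KIB)
ram_used_fraction = MEM_USED_KIB / MEM_TOTAL_KIB

# ASSUMPTION: memory is spread evenly over every task. This is only a
# back-of-the-envelope figure for sizing, not a measurement.
per_task_kib = (MEM_USED_KIB + SWAP_USED_KIB) / TASKS_TOTAL
per_task_mib = per_task_kib / 1024
estimate_1500_gib = to_gib(1500 * per_task_kib)

print(f"RAM total:            {ram_total_gib:.1f} GiB")
print(f"RAM in use:           {ram_used_fraction:.1%}")
print(f"Swap in use:          {swap_used_gib:.1f} GiB")
print(f"Rough usage per task: {per_task_mib:.0f} MiB")
print(f"Rough need, 1500:     {estimate_1500_gib:.1f} GiB")
```

Under that crude even-split assumption, each task accounts for roughly 42 MiB, so the current workload needs far more than the physical RAM, while about 1500 tasks would land just under it. Measuring the real per-process footprint would give a more trustworthy number.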

But of course it’s not working.

There are only two solutions here:

  1. Start fewer processes, so that you do not run out of RAM. (Try 1500.)
  2. Add more RAM to the server, so that it can run all of the processes.
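One way to apply the first option is to keep the full list of transfers but cap how many run at once. The sketch below is a minimal illustration in Python, not the asker's actual setup: `run_bounded` and `build_get_commands` are hypothetical helper names, the paths are placeholders, and the right value for `max_workers` has to be found by watching memory and swap use.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_get_commands(sources, local_dir):
    """Build one `hadoop fs -get <src> <localdst>` command per source path."""
    return [["hadoop", "fs", "-get", src, local_dir] for src in sources]


def run_bounded(commands, max_workers):
    """Run each command as a subprocess, never more than max_workers at once.

    Returns the exit codes in the same order as the commands.
    """
    def run_one(cmd):
        # Each worker thread blocks on one child process, so the pool size
        # is also the cap on concurrently running children.
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


# Example usage (placeholder paths, not taken from the question):
#   sources = ["/data/part-%05d" % i for i in range(2000)]
#   codes = run_bounded(build_get_commands(sources, "/local/dest/"),
#                       max_workers=1500)
#   print(sum(1 for c in codes if c != 0), "transfers failed", file=sys.stderr)
```

All 2000 transfers still complete, but only `max_workers` of them occupy memory at any moment, so the limit can be lowered until the server stops dipping into swap.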

View the full question and answer on Server Fault.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.