Here’s the last top output I got before the machine hung:
top - 18:26:10 up 238 days, 5:43, 3 users, load average: 1782.01, 1824.47, 1680.36
Tasks: 1938 total, 1 running, 1937 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.4%us, 3.0%sy, 0.0%ni, 0.0%id, 94.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 65923016k total, 65698400k used, 224616k free, 13828k buffers
Swap: 33030136k total, 17799704k used, 15230432k free, 157316k cached
As you can see, since I launched about 2000 processes running the
hadoop get command, %wa has been very high. I already limit memory and CPU with
cgroups; would it help to limit disk I/O as well? If so, could anyone give me an idea of how to do that with
cgroups? Thanks in advance.
You don’t have enough RAM to run these 2000 processes.
The Mem line shows that nearly all of your 64GB of RAM is in use (only about 220MB is free), and the Swap line shows a further 17GB swapped out. Your server is thrashing, swapping data in and out while valiantly trying to let each of those 2000 processes do something. That constant swap traffic is also why %wa sits at 94.5%.
But of course it’s not working.
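As a rough sanity check, the figures in the top output can be turned into a per-process estimate. This is only a back-of-the-envelope sketch: it assumes every task has about the same footprint and that all used RAM and swap belongs to those tasks, neither of which is exactly true.

```python
# Figures copied from the top output in the question (all in KiB).
mem_total_k = 65923016
mem_used_k = 65698400
swap_used_k = 17799704
tasks = 1938

# Assumption: every task has the same footprint, and all RAM + swap in use
# is attributable to those tasks (buffers/cache are small here, so ignored).
per_task_k = (mem_used_k + swap_used_k) / tasks
fits_in_ram = int(mem_total_k / per_task_k)

print(f"approx. {per_task_k / 1024:.0f} MiB per task")
print(f"approx. {fits_in_ram} tasks fit in RAM without swapping")
```

Under those assumptions the estimate comes out at a little over 1,500 tasks, with no headroom left for the rest of the system.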
There are only two solutions here:
- Start fewer processes, so that you do not run out of RAM. (Try 1500.)
- Add more RAM to the server, so that it can run all of the processes.
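For the first option, one way to cap how many processes run at once is a small launcher with a bounded worker pool. This is a minimal sketch, not the only way to do it: `run_limited` is a hypothetical helper name, and the paths in the usage comment are placeholders.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_limited(commands, max_parallel):
    """Run each command as a child process, at most max_parallel at a time.

    Returns the exit codes in the same order as the commands.
    """
    def run_one(cmd):
        # Each worker thread blocks here until its child process exits.
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))


# Example usage (placeholder paths, adjust to your own layout):
#   jobs = [["hadoop", "fs", "-get", f"/data/part-{i}", f"/local/part-{i}"]
#           for i in range(2000)]
#   codes = run_limited(jobs, max_parallel=1500)
```

All 2000 jobs still get done, but only `max_parallel` of them hold memory at any moment; lower the limit further if the machine still swaps.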