A post after almost 2 years!
One common question I get asked is, "What reference do you follow when troubleshooting an issue at hand?" I can rarely answer that directly, because most of the time I won't have even a single reference material handy. This is not a self-boasting article; it's an article about how the knowledge we gather at random places helps during an issue.
Let's dissect a memory usage issue in Linux I faced recently and see how the triage shaped up. One of our processes was repeatedly getting ENOMEM from malloc, even though the box had plenty of unused RAM.
Here is how it went:
- I didn't understand in my Operating Systems course what virtual memory is. I convinced myself that virtual memory is physical memory + swap (correct in a way, but not completely).
- In a 2013 interview, the Director of the division asked me: when you call malloc, do you get a physical memory address or a virtual memory address? I said a virtual memory address; I had it in the back of my mind that arrays are contiguous in virtual memory, not in physical memory, so malloc must return a virtual address. He then shot a follow-up: can two programs get the same address from malloc? This is where I fumbled (the sketch further below shows the answer is yes).
- The first learning is that virtual memory lets each program think it's the only program running on the system. Each program assumes the entire RAM + swap is for itself. malloc allocates pages for the request in virtual memory, but these pages are not backed by RAM until a page fault happens, which avoids wasting RAM (this too is visible in the sketch below).
- Since each program thinks it's the only program running on the system, more than one program can have the same virtual memory address. This is why each context switch involves a TLB flush. The TLB is a per-CPU buffer that caches virtual-to-physical address mappings; a TLB miss causes a page walk, which is one of the reasons context switches are expensive. This became a topic of discussion again during the Meltdown/Spectre fixes, where the PCID feature of the TLB is leveraged to avoid full TLB flushes when the same program switches between kernel space and user space.
- With this knowledge we can be pretty sure malloc is never going to allocate real memory, so why does it return ENOMEM? While debugging previous issues related to the page cache I had gone through the sysctl documentation of the vm subsystem many times. An interesting parameter there is overcommit_memory:
When this flag is 0, the kernel attempts to estimate the amount of free memory left when userspace requests more memory. When this flag is 1, the kernel pretends there is always enough memory until it actually runs out. When this flag is 2, the kernel uses a "never overcommit" policy that attempts to prevent any overcommit of memory. Note that user_reserve_kbytes affects this policy. This feature can be very useful because there are a lot of programs that malloc() huge amounts of memory "just-in-case" and don't use much of it.
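Before looking at how this parameter played out for us, the two earlier points are easy to verify first-hand. Below is a minimal sketch, not production code: it assumes a Linux box with /proc, trims error handling, and uses fork so that two processes can show the same virtual address at once.

```c
/* Sketch only (assumes Linux /proc; error handling trimmed):
 * (a) pages from malloc become resident only on first touch,
 * (b) parent and child see the very same virtual address. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

static void print_rss(const char *tag) {
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%-14s %s", tag, line);  /* line keeps its newline */
    fclose(f);
}

int main(void) {
    size_t sz = 512UL * 1024 * 1024;        /* 512 MB */
    print_rss("before malloc");
    char *p = malloc(sz);                   /* virtual pages only   */
    print_rss("after malloc");              /* RSS barely moves     */
    memset(p, 1, sz);                       /* faults make it real  */
    print_rss("after touch");               /* RSS jumps by ~512 MB */

    if (fork() == 0) {                      /* same address in both */
        printf("child  sees p = %p\n", (void *)p);
        _exit(0);
    }
    printf("parent sees p = %p\n", (void *)p);
    wait(NULL);
    return 0;
}
```

Both processes print the same pointer value, each with its own virtual-to-physical mapping underneath, which is exactly why the TLB has to be flushed (or tagged via PCID) across context switches.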
Now let's see how this parameter is useful:
- Our systems had vm.overcommit_memory set to 2, where the OS acts honourably: it doesn't let the combined virtual memory usage of all processes go beyond RAM (subject to other parameters like overcommit_ratio) and swap put together. This doesn't work great if the processes use a lot of COW pages. My discussion on COW happened with a colleague, probably in a bus depot, while we were discussing Redis (lifeless folks outside work).
- When a process using a huge amount of memory forks, only the virtual memory space is duplicated for the child; both virtual address spaces point to the same physical memory region until either process tries to modify a page. Hence COW can make virtual memory usage way higher than actual memory usage if a heavy process forks child processes often. If we are sure the chances of COW using up all the RAM are low, we can stop the OS from being honourable by toggling vm.overcommit_memory to 1. As a result the OS will pass all malloc requests without counting how much it has committed. And if eventually all these processes page-fault and ask for real memory, the OS will wake up its bloody OOM killer to reap some processes and let the world move on (a sketch of the difference follows this list).
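To see the difference in behaviour, here is another hedged sketch: it asks malloc for more memory than the machine can ever back. The 64 GB figure is my assumption for illustration; pick anything larger than your RAM + swap.

```c
/* Sketch: ask for more memory than the box can ever back.
 * 64 GB is an assumed figure; use anything > RAM + swap. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    size_t sz = 64UL * 1024 * 1024 * 1024;   /* 64 GB */
    void *p = malloc(sz);
    if (p == NULL)
        /* vm.overcommit_memory=2: fails here with ENOMEM,
         * even though not a single page has been touched */
        printf("malloc failed: %s\n", strerror(errno));
    else
        /* vm.overcommit_memory=1: the kernel grants the address
         * space without checking whether it can ever back it */
        printf("malloc of 64 GB succeeded without touching a page\n");
    free(p);
    return 0;
}
```

Run it once with vm.overcommit_memory set to 2 and once with 1: mode 2 refuses up front, mode 1 happily hands out the address space.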
All we had to do in our case to get rid of the crashes due to ENOMEM was to set vm.overcommit_memory to 1. The triage was straightforward because we had the foundations right. Though there is no straightforward reference material for the solution, applying accumulated knowledge gives a sense of satisfaction that the learning is indeed correct.
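If you want to confirm the same diagnosis on your own systems before flipping the knob, comparing Committed_AS against CommitLimit in /proc/meminfo is a quick check; a minimal sketch, assuming Linux's /proc/meminfo format:

```c
/* Diagnostic sketch: under vm.overcommit_memory=2, malloc starts
 * failing once Committed_AS approaches CommitLimit, regardless of
 * how much RAM is actually free. Assumes Linux /proc/meminfo. */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }
    char line[256];
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "CommitLimit:", 12) == 0 ||
            strncmp(line, "Committed_AS:", 13) == 0)
            fputs(line, stdout);
    fclose(f);
    return 0;
}
```

Committed_AS sitting near CommitLimit while free RAM is plentiful matches exactly the symptom described above.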
I will try to take you through a few more recent triages in a similar fashion in my upcoming blog posts.