More on Memory

 A post after almost 2 years!

One question I often get asked is, "What reference do you follow when troubleshooting an issue?" I can never answer it directly because, most of the time, I don't have even a single reference material handy. This is not a self-boasting article; it describes how knowledge gathered in random places helps during an issue.

Let's dissect a Linux memory usage issue I faced recently and see how the triage shaped up. One of our processes was repeatedly getting ENOMEM from malloc even though the box had plenty of unused RAM.

Let's see how the triage went:

  1. In my Operating Systems course I never really understood what virtual memory is. I convinced myself that virtual memory is physical memory + swap (partially correct, but not the full picture).
  2. In a 2013 interview, the Director of the division asked me whether malloc returns a physical memory address or a virtual memory address. I said virtual, since I had it in the back of my mind that arrays are contiguous in virtual memory but not in physical memory, so malloc must hand out virtual addresses. Then he shot a follow-up: can two programs get the same address from malloc? This is where I fumbled.
    1.  The first learning: virtual memory lets each program think it's the only program running on the system. Each program assumes the entire RAM + swap is for itself. malloc allocates pages for the request in virtual memory, but to avoid wasting RAM those pages are not backed by physical memory until a page fault happens (demonstrated in the sketch after the documentation excerpt below).
    2.  Since each program thinks it's the only one running on the system, more than one program can have the same virtual memory address. So each context switch involves a TLB flush. The TLB is a per-CPU buffer that caches virtual-to-physical address mappings; a TLB miss causes a page walk, which is one of the reasons context switches are expensive. This became a topic of discussion again during the Meltdown/Spectre vulnerability fixes, where the PCID feature of the TLB is leveraged to avoid full TLB flushes (and the resulting misses) when the same program switches between kernel space and user space.
  3. With this knowledge we are pretty sure malloc never allocates real memory, so why does it return ENOMEM? While debugging previous page-cache issues I had gone through the sysctl documentation of the vm subsystem many times. An interesting parameter there is overcommit_memory:


When this flag is 0, the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.

When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.

When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.
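
To see the lazy allocation in action, here is a minimal sketch of my own (not from the kernel docs): it mallocs 1 GiB and prints VmSize and VmRSS from /proc/self/status before and after touching the pages. VmSize jumps as soon as malloc returns, while VmRSS barely moves until the memset faults the pages in. Under overcommit_memory set to 2, the malloc call itself is what can fail with ENOMEM.

/* A sketch: malloc reserves virtual address space only; RSS grows
 * when pages are touched and page faults map in physical frames. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void print_mem(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    printf("-- %s --\n", label);
    while (fgets(line, sizeof line, f))
        if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
            fputs(line, stdout);
    fclose(f);
}

int main(void)
{
    size_t sz = 1UL << 30;                  /* ask for 1 GiB */
    print_mem("before malloc");
    char *p = malloc(sz);
    if (!p) {
        perror("malloc");                   /* where ENOMEM showed up for us */
        return 1;
    }
    print_mem("after malloc, untouched");   /* VmSize up ~1 GiB, VmRSS unchanged */
    memset(p, 1, sz);                       /* fault every page in */
    print_mem("after touching pages");      /* VmRSS catches up with VmSize */
    free(p);
    return 0;
}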

 Now let's see how this parameter is useful:

  1. Our systems had vm.overcommit_memory set to 2, where the OS acts honourably: it doesn't let the combined virtual memory usage of all processes go beyond RAM and swap put together (subject to other parameters like overcommit_ratio). This doesn't work well if the processes use a lot of COW pages. My learning about COW came from a discussion with a colleague, probably in a bus depot, about Redis (lifeless folks outside work).
  2. When a process using a huge amount of memory forks, only the virtual memory space gets separated for the child; both virtual address spaces point to the same physical memory region until either process tries to modify a page. Hence COW causes virtual memory usage to be way higher than actual memory usage if a heavy process forks child processes often (see the sketch after this list). If we are sure the chances of COW using up all the RAM are low, we can stop the OS being honourable by toggling vm.overcommit_memory to 1. The OS will then pass all malloc requests without counting how much it has committed. And if all these processes eventually page-fault and ask for real memory, the OS wakes up its bloody OOM killer to reap some processes and let the world move on.
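
Here is a minimal sketch of my own illustrating COW accounting after fork: the parent faults in 256 MiB, the child initially shares those physical pages, and only the child's memset forces the kernel to duplicate them. Under overcommit_memory set to 2 the kernel has to assume the child may rewrite everything, so the fork itself can fail with ENOMEM even though no extra RAM is in use yet.

/* A sketch of COW after fork on Linux. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t sz = 256UL << 20;        /* 256 MiB */
    char *buf = malloc(sz);
    if (!buf) {
        perror("malloc");
        return 1;
    }
    memset(buf, 1, sz);             /* parent faults the pages in for real */

    pid_t pid = fork();             /* under overcommit_memory=2 this can fail:
                                       the kernel commits space for a full copy */
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* Child: reads are free (pages shared with the parent); every
           write below triggers a COW copy of the touched page. */
        memset(buf, 2, sz);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    free(buf);
    return 0;
}

This is the Redis pattern from the bus-depot discussion above: a big parent forks for a background save, and commit accounting doubles even though physical usage barely changes.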
All we had to do in our case to get rid of the ENOMEM crashes was to set vm.overcommit_memory to 1. The triage was straightforward once we had the foundations right. Though there is no single reference material for the solution, applying accumulated knowledge gives a sense of satisfaction that the learning is indeed correct.
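
If you want to confirm this diagnosis before flipping the sysctl, one way (a sketch, assuming the usual Linux /proc layout) is to compare Committed_AS against CommitLimit in /proc/meminfo; under overcommit_memory=2, allocations fail once the committed address space would cross the limit, well before free RAM runs out. The same two lines can also be eyeballed with grep Commit /proc/meminfo.

/* A sketch: print the kernel's commit accounting. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof line, f))
        if (!strncmp(line, "CommitLimit:", 12) ||
            !strncmp(line, "Committed_AS:", 13))
            fputs(line, stdout);
    fclose(f);
    return 0;
}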

I will try to take you through a few more recent triages in a similar fashion in upcoming blog posts.
