
More on Memory

 A post after almost two years!

One common question I get asked is, "What reference do you follow when troubleshooting an issue at hand?" I can never answer this directly, because most of the time I don't have even a single reference handy. This is not a self-boasting article; it's an article about how knowledge gathered in random places helps during an issue.

Let's dissect a memory usage issue in Linux I faced recently and see how the triage shaped up. One of our processes kept getting ENOMEM from malloc for some reason, despite the box having plenty of unused RAM.

Let's see how the triage went.

  1. In my Operating Systems course I never really understood what virtual memory is. I convinced myself that virtual memory is physical memory + swap (correct in a way, but not completely).
  2. In a 2013 interview, the Director of the division asked me: when you call malloc, do you get a physical memory address or a virtual memory address? I said a virtual memory address. I had it in the back of my mind that arrays are contiguous in virtual memory, not in physical memory, so malloc must return a virtual address. Then he shot a follow-up: can two programs get the same address from malloc? This is where I fumbled.
    1.  The first learning is that virtual memory lets each program think it's the only program running on the system. Each program assumes the entire RAM + swap for itself. malloc allocates pages for the request in virtual memory, but these pages are not backed by RAM until a page fault happens, which avoids wasting RAM.
    2.  Since each program thinks it's the only program running on the system, more than one program can have the same virtual memory address. This is why a context switch traditionally involves a TLB flush. The TLB is a per-CPU buffer that caches virtual-to-physical address mappings; a TLB miss causes a page-table walk, which is one of the factors that make context switches expensive. This became a topic of discussion again during the Meltdown/Spectre fixes, where the PCID feature of the TLB is leveraged to avoid full TLB flushes when the same program switches between kernel space and userspace.
  3. With this knowledge we are pretty sure malloc never allocates real memory, so why does it return ENOMEM? While debugging earlier issues related to the page cache, I had gone through the sysctl documentation of the vm subsystem many times. An interesting parameter there is overcommit_memory:

When this flag is 0, the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.

When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.

When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.

 Now let's see how this parameter played out:

  1. Our systems had vm.overcommit_memory set to 2, where the OS acts honourable: it doesn't let the combined virtual memory usage of all processes exceed RAM (scaled by other parameters like vm.overcommit_ratio) and swap put together. This doesn't work great if the processes use a lot of COW pages. My discussion on COW happened with a colleague, probably at a bus depot, while we were discussing Redis (lifeless folks outside work).
  2. When a process using a huge amount of memory forks, only the virtual memory space is duplicated for the child; both virtual mappings point to the same physical memory region until either process tries to modify a page. Hence COW causes committed virtual memory to be way higher than actual memory usage if a heavy process forks child processes often. If we are sure the chances of COW pages using up all RAM are low, we can stop the OS being honourable by toggling vm.overcommit_memory to 1. As a result the OS will grant all malloc requests without counting how much it has committed. And if eventually all these processes page-fault and ask for real memory, the OS will wake up its bloody OOM killer to reap some processes and let the world move on.
All we had to do in our case to get rid of the crashes due to ENOMEM was set vm.overcommit_memory to 1. The triage was straightforward once we had the foundations right. Though there is no single reference material for the solution, applying accumulated knowledge gives a sense of satisfaction that the learning is indeed correct.

I will try to take you through a few more recent triages in a similar fashion in my upcoming blog posts.


We were moving a project from AWS to our co-located DC. We have setup KVMs scheduled by Cloudstack for each of the component in the architecture. The KVMs used local storage. The VMs are provisioned with more than required resources because we have the opinion that in our DC scaling during peak load and then downscaling doesn't offer much benefits financially as we are anyways paying for the hardware in advance and its also powered on. Its going to be idle if not used. Now we found something interesting our latency in co-located DC was 2 times more than in AWS. The time for first byte at our load balancer in aws was 60ms average and at our DC was 112ms. We started our debugging mission, Mission Conquer-AWS. All the servers are newer Dell hardwares. So the initially intuition was virtualisation is causing the issue. Conversation with the Hypervisor We started with CPU optimisation, we started using the host-passthrough mode of CPU in libvirt so VMs dont see QEMU emulated CPUs,