
How we have systematically improved the roads our packets travel to help data imports and exports flourish

This blog post is an account of how we have toiled over the years to improve the throughput of our inter-DC tunnels. I joined this company around 2012, when we were scaling aggressively. We quickly expanded to four DCs, a mixture of AWS and colocation. Our primary DC was connected to all these new DCs via IPsec tunnels terminated on an SRX. The SRX model we had was rated for 350 Mbps of IPsec throughput, and around December 2015 we saturated it. Buying a bigger SRX was an option on the table; one rated at 2 Gbps would have cut this story short. The tech team didn't see that happening.

I don't have an answer to the question, "Is it worth spending time solving a problem when an out-of-the-box solution is already available?" This project improved our critical thinking and let us experience theoretical network fundamentals on live traffic, but it also caused us quite a bit of fatigue due to the management overhead. Cutting the philosophy short, let's jump to the story.

In December 2015 it was decided to build tunnels from open-source solutions to supplement the SRX. The first solution proposed was an SSH tunnel: a SOCKS proxy established over SSH from a public box in our primary DC to a public EC2 instance in AWS. Any application that needs to skip the SRX path can be started under proxychains, which overrides all glibc connect() calls to use the SOCKS proxy as per its config. This solution actually broke a single end-to-end TCP connection into three TCP connections: server to SOCKS proxy, SOCKS proxy in the DC to the SOCKS proxy in AWS, and the SOCKS proxy in AWS to the destination EC2 instance. It is a TCP-in-TCP tunnel, which is more susceptible to congestion because multiplicative decrease kicks in three times. As a result we decided to go with a TCP-in-UDP tunnel instead. Looking back at this solution shows how naive we were.
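The setup above can be sketched roughly as follows. This is a hypothetical reconstruction, not our actual config: hostnames, the listen port and the config path are placeholders, and the proxychains binary name and config location vary by distribution.

```shell
# On the public box in the DC: open a dynamic (SOCKS) forward that
# exits through the public EC2 instance in AWS.
ssh -f -N -D 0.0.0.0:1080 ubuntu@ec2-public.example.com

# /etc/proxychains.conf on the application server would point at it:
#   strict_chain
#   [ProxyList]
#   socks5 dc-public-box.example.com 1080

# Start the application under proxychains; it preloads a library that
# rewrites glibc connect() calls to go via the configured SOCKS proxy.
proxychains curl http://internal.aws.example.com/
```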

In February 2016 we started exploring TCP-in-UDP tunnels, beginning with OpenVPN. We established a client-to-site OpenVPN connection between our DC and an AWS EC2 instance. Servers change their route to AWS to go via the OpenVPN box in the DC; that box forwards all packets it receives on the LAN to its tun interface, which forwards them to the AWS instance. Since it is a client-to-site VPN, the AWS instance NATs the traffic before sending it to the destination. The first problem we faced was that some of our boxes used the tcp_tw_recycle kernel feature (since removed from Linux), which is known to misbehave for peers behind NAT. Hence we changed from client-to-site to site-to-site OpenVPN: DC servers send to the OpenVPN box, it sends to the AWS OpenVPN server, and that server forwards to EC2 instances without NAT. The route table in EC2 was updated to send selected DC traffic via OpenVPN. The OpenVPN tunnel could not push more than 100 Mbps. We figured out that CPU support for the AES-NI instruction set would reduce CPU utilization on encryption and decryption, so we spawned an AWS instance with AES-NI support and tweaked the UDP buffer values. We were able to do 200 Mbps, but our data team found 200 Mbps plus the SRX's 350 Mbps (with selective routing) insufficient. OpenVPN is a userspace process, so every packet causes context switches into kernel space, and OpenVPN cannot scale across multiple CPUs. We did, however, maintain end-to-end TCP semantics, as this is a TCP-in-UDP tunnel.
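For reference, the two checks/tweaks mentioned above look roughly like this on Linux; the buffer sizes are illustrative, not the values we used.

```shell
# Check whether the CPU exposes the AES-NI instruction set, which
# OpenSSL (and hence OpenVPN) uses to accelerate AES:
grep -m1 -o aes /proc/cpuinfo

# Raise the kernel's socket buffer limits so OpenVPN's sndbuf/rcvbuf
# settings can take effect (needs root; sizes are illustrative):
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
```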

IPsec looked like magic to us: no routes had to be added on the tunnel box yet packets got routed properly, and no userspace process was running. Since IPsec is the industry standard, we decided to move to it. pfSense was picked as the OS for the IPsec tunnel endpoint in the DC, and AWS VPN Gateway was used on the AWS side. In May 2017 we installed pfSense and proceeded with our testing. We found pfSense capping at 300 Mbps, which was a shock to us. Bonding NICs into a 2 Gb aggregate improved performance to 500 Mbps. Exploring further, we figured out that interrupts were hogging a CPU. We went down several dead ends before reaching this conclusion; the details aren't necessary here. Distributing interrupts is the key. We enabled MSI-X and added a 10G NIC, which distributed interrupts across queues based on source IP, destination IP, source port and destination port (similar to Receive Side Scaling in Linux). IPsec traffic is still received by one queue, but LAN traffic is distributed across queues since those flows are independent TCP connections. At the end of this change, with one NIC, we were able to reach 600 Mbps. The one CPU doing IPsec ran hotter than the other cores receiving LAN traffic, but it was nowhere close to 100%; some other bottlenecks in the network caused the cap at 600 Mbps. In July 2017 we made pfSense our primary tunnel device with static routing (the BGP daemon in pfSense was not stable then). We cleared a few more network bottlenecks that are not trivial to explain here, and by March 2018 we were able to reach 1.4 Gbps on pfSense, with daily traffic upwards of 1 Gbps (in + out).
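As a rough illustration of what "distributing interrupts" looks like, here are the equivalent commands on Linux (pfSense itself is FreeBSD, where the tooling differs); the interface name and IRQ number are placeholders.

```shell
# Per-CPU interrupt counts for each NIC queue; with MSI-X enabled,
# each receive queue gets its own IRQ line:
grep eth0 /proc/interrupts

# List the receive/transmit queues the NIC exposes:
ls /sys/class/net/eth0/queues/

# Pin one queue's IRQ to a specific core if the spread is poor
# (IRQ 57 is illustrative; needs root):
echo 4 > /proc/irq/57/smp_affinity_list
```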

Then a team again started seeing lag in the pfSense setup. We moved inbound traffic to a different ISP with 10 Gbps capacity; our throughput improved to 1.7 Gbps and we were doing a consistent 1.2 Gbps (in + out) per day. This was still not sufficient for the team during peak traffic. In July 2018 we concluded that there was a bottleneck at the LACP bond before the hop to the router. LACP puts all IPsec traffic to one AWS region into a single LACP bucket, since those packets share the same source IP, destination IP, and source and destination ports; so the tunnel's performance varied with whatever bandwidth happened to be free on that one 1 Gbps member link. Similarly, the CPU saturates when the tunnel carries close to 800 Mbps: remember that one CPU receives all the IPsec traffic, and now that CPU had become a bottleneck too. If we had bought a bigger SRX we would have jumped straight to this point instead of going through all the stories above. Direct Connect without IPsec would shard traffic across the LACP links and remove pfSense from the data path. But no Direct Connect was in sight, so the story refused to stop there.
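Why all the tunnel traffic lands on one LACP member can be shown with a toy hash: every ESP packet of a tunnel carries the same outer addresses, so any header hash maps it to the same bucket every time, while ordinary LAN flows differ in ports and spread out. A contrived sketch, with cksum standing in for the switch's real hash function and the addresses made up:

```shell
# Toy model of LACP link selection: hash the packet headers, then
# take them modulo the number of member links (2 here).
hash_bucket() { echo -n "$1" | cksum | awk '{print $1 % 2}'; }

hash_bucket "203.0.113.10 198.51.100.20 ESP"    # every tunnel packet: same bucket
hash_bucket "10.0.1.5:43210 10.1.2.9:3306 TCP"  # LAN flow A
hash_bucket "10.0.1.6:51515 10.1.2.9:3306 TCP"  # LAN flow B
```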

We agreed sharding the traffic was the way to go. IPsec performance-optimisation guides also suggest either offloading IPsec to the NIC (available from kernel 4.16) or sharding across multiple IPsec tunnels. We set up strongSwan on Linux on both sides, AWS and DC. We were not able to use the AWS VPN Gateway, as it had stopped supporting ECMP across multiple tunnels. Between the strongSwan boxes we ran three IPsec tunnels for the same policy (DC traffic to AWS and vice versa), created a VTI interface for each tunnel, and added Linux ECMP routes across all three VTI interfaces for cross-DC traffic. Each TCP connection gets bucketed into one of the three tunnels, and LACP can place the three tunnels on different member links since their IPs differ. We were able to do just under 2 Gbps of IPsec throughput on the strongSwan setup, alongside 1 Gbps on the existing pfSense setup. We are capable of doing close to 3 Gbps now, and we see three CPUs utilized on the strongSwan box instead of just one on pfSense. The tunnel has thus become horizontally scalable with the number of cores. In three years we have scaled our tunnel throughput ten times with open-source tools and commodity hardware.
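The VTI-plus-ECMP arrangement can be sketched with iproute2 as follows. Addresses, keys and the destination prefix are placeholders, and strongSwan must separately be configured to mark each SA so it is bound to the matching VTI; this is a sketch of the idea, not our actual configuration.

```shell
# One VTI interface per IPsec tunnel (addresses and keys illustrative):
ip tunnel add vti1 mode vti local 198.51.100.2 remote 203.0.113.2 key 11
ip tunnel add vti2 mode vti local 198.51.100.3 remote 203.0.113.3 key 12
ip tunnel add vti3 mode vti local 198.51.100.4 remote 203.0.113.4 key 13
ip link set vti1 up; ip link set vti2 up; ip link set vti3 up

# ECMP route for cross-DC traffic: each flow is hashed onto one of
# the three tunnels, so load spreads across tunnels (and CPUs).
ip route add 10.1.0.0/16 \
    nexthop dev vti1 weight 1 \
    nexthop dev vti2 weight 1 \
    nexthop dev vti3 weight 1
```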

Would it have been better to jump to a bigger SRX and then to Direct Connect? When is the right time to ask your tech team to stop experimenting, so they can spend their time on something that adds more value to the company? Should we jump into experimentation even though we haven't hit a dead end or maxed out the existing solution? If not, how do you keep the team's experimental spirit alive? I will leave all these questions for management to ponder.

Comments

  1. It's a really cool and interesting story of squeezing resources to achieve higher performance!

    I recently came across an interesting project, https://www.wireguard.com , and it may have better performance too:

    I tested two c5.large machines (two cores each) in the same VPC / subnet:

    ```
    root@ip-10-0-0-100:/home/ubuntu# iperf3 -c 10.0.0.33
    Connecting to host 10.0.0.33, port 5201
    [ 4] local 10.0.0.100 port 37378 connected to 10.0.0.33 port 5201
    [ ID] Interval Transfer Bandwidth Retr Cwnd
    [ 4] 0.00-1.00 sec 1.12 GBytes 9.63 Gbits/sec 0 612 KBytes
    [ 4] 1.00-2.00 sec 1.12 GBytes 9.60 Gbits/sec 0 638 KBytes
    [ 4] 2.00-3.00 sec 1.12 GBytes 9.61 Gbits/sec 0 708 KBytes
    [ 4] 3.00-4.00 sec 1.12 GBytes 9.61 Gbits/sec 0 743 KBytes
    [ 4] 4.00-5.00 sec 1.12 GBytes 9.61 Gbits/sec 0 743 KBytes
    [ 4] 5.00-6.00 sec 1.12 GBytes 9.61 Gbits/sec 0 1.12 MBytes
    [ 4] 6.00-7.00 sec 1.12 GBytes 9.60 Gbits/sec 0 1.12 MBytes
    [ 4] 7.00-8.00 sec 1.12 GBytes 9.60 Gbits/sec 0 1.12 MBytes
    [ 4] 8.00-9.00 sec 1.12 GBytes 9.60 Gbits/sec 0 1.12 MBytes
    [ 4] 9.00-10.00 sec 1.12 GBytes 9.60 Gbits/sec 0 1.12 MBytes
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval Transfer Bandwidth Retr
    [ 4] 0.00-10.00 sec 11.2 GBytes 9.61 Gbits/sec 0 sender
    [ 4] 0.00-10.00 sec 11.2 GBytes 9.60 Gbits/sec receiver

    iperf Done.
    root@ip-10-0-0-100:/home/ubuntu# iperf3 -c 192.168.0.2
    Connecting to host 192.168.0.2, port 5201
    [ 4] local 192.168.0.1 port 35276 connected to 192.168.0.2 port 5201
    [ ID] Interval Transfer Bandwidth Retr Cwnd
    [ 4] 0.00-1.00 sec 815 MBytes 6.83 Gbits/sec 7 2.22 MBytes
    [ 4] 1.00-2.00 sec 900 MBytes 7.55 Gbits/sec 22 1.50 MBytes
    [ 4] 2.00-3.00 sec 875 MBytes 7.34 Gbits/sec 0 2.11 MBytes
    [ 4] 3.00-4.00 sec 866 MBytes 7.27 Gbits/sec 10 1.65 MBytes
    [ 4] 4.00-5.00 sec 886 MBytes 7.43 Gbits/sec 2 2.01 MBytes
    [ 4] 5.00-6.00 sec 839 MBytes 7.04 Gbits/sec 0 2.33 MBytes
    [ 4] 6.00-7.00 sec 872 MBytes 7.32 Gbits/sec 0 2.97 MBytes
    [ 4] 7.00-8.00 sec 867 MBytes 7.27 Gbits/sec 25 1.53 MBytes
    [ 4] 8.00-9.00 sec 866 MBytes 7.27 Gbits/sec 39 1.30 MBytes
    [ 4] 9.00-10.00 sec 910 MBytes 7.63 Gbits/sec 27 1.24 MBytes
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval Transfer Bandwidth Retr
    [ 4] 0.00-10.00 sec 8.49 GBytes 7.30 Gbits/sec 132 sender
    [ 4] 0.00-10.00 sec 8.49 GBytes 7.29 Gbits/sec receiver

    iperf Done.
    ```

    Both cores were ~95% utilized. I also tried with bigger machines and the result looks similar. The bottleneck may be the overhead of the solution itself.

    1. One of my team members has also suggested WireGuard. Thanks for bringing this up. We will set up a PoC and compare its management overhead with IPsec whenever we get time, and update this post based on that.

