
How we have systematically improved the roads our packets travel to help data imports and exports flourish

This blog post is an account of how we have toiled over the years to improve the throughput of our inter-DC tunnels. I joined this company around 2012, when we were scaling aggressively. We quickly expanded to 4 DCs with a mixture of AWS and colocation, and our primary DC was connected to all of these new DCs via IPSEC tunnels terminated on an SRX. The SRX model we had was rated for 350 Mbps of IPSEC throughput, and around December 2015 we saturated it. Buying a bigger SRX was an option on the table; one rated for 2 Gbps would have cut this story short. The tech team didn't see that happening.

I don't have an answer to the question, "Is it worth spending time solving a problem when an out-of-the-box solution is already available?" This project sharpened our critical thinking and let us experience theoretical network fundamentals on live traffic, but it also caused us quite a bit of fatigue due to the management overhead. Cutting the philosophy short, let's jump into the story.

In December 2015 it was decided to build tunnels from open-source components to supplement the SRX. The initial solution proposed was an SSH tunnel: a SOCKS proxy established over SSH from a public box in our primary DC to a public EC2 instance in AWS. Any application that needs to skip the SRX path can be started under proxychains, which overrides every glibc connect() call to use the SOCKS proxy as per its config. This solution actually broke a single end-to-end TCP connection into three TCP connections: server to the SOCKS proxy in the DC, the SOCKS proxy in the DC to the SOCKS proxy in AWS, and the SOCKS proxy in AWS to the destination EC2 instance. That is a TCP-in-TCP tunnel, which is more susceptible to congestion because multiplicative decrease kicks in three times over. As a result we decided to go with a TCP-in-UDP tunnel instead. Looking back at this solution shows how naive we were.
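For illustration, here is a minimal sketch of how such a proxychains-over-SSH setup is typically wired up; the hostname, port and target address below are hypothetical, not our actual configuration:

```
# On a box in the primary DC: open a dynamic (SOCKS5) forward to a public
# EC2 instance. -N runs no remote command, -D starts a local SOCKS listener.
ssh -N -D 1080 ubuntu@ec2-tunnel.example.com

# Tail of /etc/proxychains.conf: point intercepted connect() calls at it.
#   socks5 127.0.0.1 1080

# Start an application under proxychains so its outbound TCP skips the SRX path.
proxychains curl http://10.20.0.15:8080/health
```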

In February 2016 we started our exploration of TCP-in-UDP tunnels, beginning with OpenVPN. We established client-to-site OpenVPN between our DC and an AWS EC2 instance. Servers could change their route to AWS to go via the OpenVPN box in the DC; the OpenVPN box forwards all packets it receives on the LAN to its tun interface, which forwards them to the AWS instance. Because it is a client-to-site VPN, the AWS instance NATs the traffic and sends it on to the destination. The first problem we faced was that some of our boxes used the tcp_tw_recycle kernel functionality (since removed from the kernel), which is known not to work properly for instances behind NAT. Hence we changed from client-to-site to site-to-site OpenVPN: DC servers send to the OpenVPN box, the OpenVPN box sends to the AWS OpenVPN server, and the AWS OpenVPN server sends to the EC2 instances without NAT, with the route tables in EC2 updated to send selected DC traffic via OpenVPN. The OpenVPN tunnel was not able to do more than 100 Mbps. We figured out that CPU support for the AES-NI instruction set would reduce CPU utilisation on encryption and decryption, so we spawned an AWS instance with AES-NI support and tweaked the UDP buffer values, and were able to do 200 Mbps. But our data team found 200 Mbps plus the SRX's 350 Mbps (with selective routing) insufficient. OpenVPN is a userspace process, so every packet causes context switches to and from kernel space, and OpenVPN cannot scale across multiple CPUs. We did, however, maintain TCP end-to-end semantics, as this is a TCP-in-UDP tunnel.
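The AES-NI check and buffer tuning looked roughly like the sketch below; the buffer sizes shown are illustrative rather than our exact production values:

```
# Does the CPU advertise AES-NI? The 'aes' flag should appear here.
grep -m1 -o aes /proc/cpuinfo

# Confirm OpenSSL is actually using the accelerated path; AES throughput
# should be several times higher than with AES-NI disabled.
openssl speed -evp aes-256-cbc

# Raise the kernel's socket buffer ceilings so the UDP tunnel socket can
# absorb bursts without drops.
sysctl -w net.core.rmem_max=26214400
sysctl -w net.core.wmem_max=26214400

# Matching OpenVPN directives (in the server/client config):
#   sndbuf 4194304
#   rcvbuf 4194304
```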

IPSEC looked like magic to us: no routes were added on the tunnel box, yet packets got routed properly, and no userspace process was running. Since IPSEC is also the industry standard, we decided to move to it. pfSense was picked as the OS for the IPSEC tunnel endpoint in the DC, and the AWS VPN Gateway was used on the AWS side. In May 2017 we installed pfSense and proceeded with our testing, and found pfSense capping out at 300 Mbps. This was a shock to us. Adding a bonded 2 Gbps NIC improved the performance to 500 Mbps. Exploring further, we figured out that interrupts were hogging a CPU; we went down a lot of dead ends before reaching this conclusion, which I will spare you. Distributing interrupts is the key. We enabled MSI-X and added a 10G NIC, which distributed interrupts based on source IP, destination IP, source port and destination port (similar to Receive Side Scaling in Linux). IPSEC traffic is still received by one queue, but LAN traffic is distributed across queues as the flows are independent TCP connections. At the end of this change, with one NIC, we were able to reach 600 Mbps. The one CPU doing IPSEC was loaded higher than the other cores receiving LAN traffic, but it was nowhere close to 100%; there were other bottlenecks in the network that caused the cap at 600 Mbps. In July 2017 we made pfSense our primary tunnel device, with static routing (the BGP daemon was not stable in pfSense then). We worked through a few more network bottlenecks that are not trivial to explain here, and by March 2018 we were able to reach 1.4 Gbps on pfSense, with our daily traffic upwards of 1 Gbps (in + out).
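pfSense is FreeBSD-based, so the knobs there live in loader.conf and the evidence shows up in `vmstat -i`, but the same check-and-spread exercise on a Linux router looks roughly like this sketch (interface name assumed):

```
# How many hardware queues (and MSI-X vectors) does the NIC expose?
ethtool -l eth0

# Which CPU services each queue's interrupts? One hot row here means a
# single core is absorbing most of the packet processing.
grep eth0 /proc/interrupts

# Enable all combined RX/TX queues so flows are hashed across cores by
# (src IP, dst IP, src port, dst port).
ethtool -L eth0 combined 8

# Caveat: all ESP packets of a single IPSEC tunnel share one 5-tuple, so
# they still land in one queue; only the LAN-side TCP flows spread out.
```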

Then a team again started seeing lag in the pfSense setup. We moved inbound traffic to a different ISP with 10 Gbps capacity; throughput improved to 1.7 Gbps and we were doing a consistent 1.2 Gbps (in + out) per day. This was still not sufficient for the team during peak traffic. In July 2018 we concluded that there was a bottleneck due to LACP on the hop before the router. LACP puts all IPSEC traffic towards one AWS region into a single LACP bucket, because the packets all share the same source IP, destination IP, source port and destination port; the tunnel's performance therefore varies with how much bandwidth happens to be free on that one 1 Gbps LACP link. Similarly, the CPU gets heavily utilised once the traffic handled is close to 800 Mbps; remember that one CPU handles all the IPSEC traffic, and that CPU has now become the bottleneck. If we had bought a bigger SRX we would have jumped straight to this point instead of going through all of the above. Using Direct Connect without IPSEC would shard traffic across the LACP links and remove pfSense from the data path, but the story refused to stop there, as no Direct Connect was in sight.
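In our case the aggregation sat on the switch before the router, but the effect is easy to reproduce on a Linux bond (interface names here are hypothetical): with a layer3+4 transmit hash, every ESP packet between the same two tunnel endpoints produces the same hash and therefore always picks the same member link.

```
# Which hash policy does the bond use to pick a member link per packet?
cat /sys/class/net/bond0/bonding/xmit_hash_policy
# e.g. "layer3+4 1" -> hash of (src IP, dst IP, src port, dst port)

# List the member links; an ESP tunnel with fixed endpoint IPs (and no
# ports to vary the hash) maps to exactly one of them, so the whole tunnel
# is capped at that single 1 Gbps member no matter how many links exist.
grep "Slave Interface" /proc/net/bonding/bond0
```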

We agreed that sharding traffic was the way to go. The IPSEC performance optimisation docs also suggest either offloading IPSEC to the NIC (available from kernel 4.16) or sharding across multiple IPSEC tunnels. We set up strongSwan on Linux on both sides, AWS and DC; we were not able to use the AWS VPN Gateway as it had stopped supporting ECMP across multiple tunnels. Between the strongSwan boxes we ran 3 IPSEC tunnels for the same policy (DC traffic to AWS and vice versa), created a VTI interface for each tunnel, and added Linux ECMP routes via all 3 VTI interfaces for cross-DC traffic. Each TCP connection gets bucketed into one of the 3 tunnels, and LACP may place the 3 tunnels on different links as their IPs differ. We were able to do just under 2 Gbps of IPSEC throughput on the strongSwan setup, alongside 1 Gbps on the existing pfSense setup. We are now capable of doing close to 3 Gbps, and we can see 3 CPUs of the strongSwan box being utilised instead of just one CPU on pfSense. The tunnel has thus become horizontally scalable with the number of cores. In three years we have scaled our tunnel throughput 10x with open-source tools and commodity hardware.
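A rough sketch of the routing half of that setup, with made-up addresses and keys (the strongSwan connection definitions themselves, which bind each tunnel to its VTI, are omitted):

```
# One VTI interface per IPSEC tunnel; the key ties the interface to its SA.
ip link add vti1 type vti local 172.16.0.1 remote 203.0.113.11 key 11
ip link add vti2 type vti local 172.16.0.1 remote 203.0.113.12 key 12
ip link add vti3 type vti local 172.16.0.1 remote 203.0.113.13 key 13
ip link set vti1 up && ip link set vti2 up && ip link set vti3 up

# ECMP route towards the AWS subnet: the kernel hashes each flow onto one
# of the three VTIs, so a single TCP connection stays inside one tunnel
# while the aggregate load spreads across all three.
ip route add 10.50.0.0/16 \
    nexthop dev vti1 weight 1 \
    nexthop dev vti2 weight 1 \
    nexthop dev vti3 weight 1

# On recent kernels, hash on ports as well as IPs so many flows between
# the same pair of hosts still spread across tunnels.
sysctl -w net.ipv4.fib_multipath_hash_policy=1
```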

Would it have been better if we had jumped to a bigger SRX and then to Direct Connect? When is the right time to ask your tech team to stop experimenting so they can spend their time on something that adds more value to the company? Should we jump into experimentation even though we haven't hit a dead end or maxed out the existing solution? If not, how do you keep the team's experimental spirit alive? I will leave these questions for the management to ponder.

Comments

  1. It's really a cool and interesting story of squeezing resources to achieve higher performance!

    I recently came across an interesting project, https://www.wireguard.com , and it may offer better performance too.

    I tested two c5.large machines (two cores each) in the same VPC / subnet:

    ```
    root@ip-10-0-0-100:/home/ubuntu# iperf3 -c 10.0.0.33
    Connecting to host 10.0.0.33, port 5201
    [ 4] local 10.0.0.100 port 37378 connected to 10.0.0.33 port 5201
    [ ID] Interval Transfer Bandwidth Retr Cwnd
    [ 4] 0.00-1.00 sec 1.12 GBytes 9.63 Gbits/sec 0 612 KBytes
    [ 4] 1.00-2.00 sec 1.12 GBytes 9.60 Gbits/sec 0 638 KBytes
    [ 4] 2.00-3.00 sec 1.12 GBytes 9.61 Gbits/sec 0 708 KBytes
    [ 4] 3.00-4.00 sec 1.12 GBytes 9.61 Gbits/sec 0 743 KBytes
    [ 4] 4.00-5.00 sec 1.12 GBytes 9.61 Gbits/sec 0 743 KBytes
    [ 4] 5.00-6.00 sec 1.12 GBytes 9.61 Gbits/sec 0 1.12 MBytes
    [ 4] 6.00-7.00 sec 1.12 GBytes 9.60 Gbits/sec 0 1.12 MBytes
    [ 4] 7.00-8.00 sec 1.12 GBytes 9.60 Gbits/sec 0 1.12 MBytes
    [ 4] 8.00-9.00 sec 1.12 GBytes 9.60 Gbits/sec 0 1.12 MBytes
    [ 4] 9.00-10.00 sec 1.12 GBytes 9.60 Gbits/sec 0 1.12 MBytes
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval Transfer Bandwidth Retr
    [ 4] 0.00-10.00 sec 11.2 GBytes 9.61 Gbits/sec 0 sender
    [ 4] 0.00-10.00 sec 11.2 GBytes 9.60 Gbits/sec receiver

    iperf Done.
    root@ip-10-0-0-100:/home/ubuntu# iperf3 -c 192.168.0.2
    Connecting to host 192.168.0.2, port 5201
    [ 4] local 192.168.0.1 port 35276 connected to 192.168.0.2 port 5201
    [ ID] Interval Transfer Bandwidth Retr Cwnd
    [ 4] 0.00-1.00 sec 815 MBytes 6.83 Gbits/sec 7 2.22 MBytes
    [ 4] 1.00-2.00 sec 900 MBytes 7.55 Gbits/sec 22 1.50 MBytes
    [ 4] 2.00-3.00 sec 875 MBytes 7.34 Gbits/sec 0 2.11 MBytes
    [ 4] 3.00-4.00 sec 866 MBytes 7.27 Gbits/sec 10 1.65 MBytes
    [ 4] 4.00-5.00 sec 886 MBytes 7.43 Gbits/sec 2 2.01 MBytes
    [ 4] 5.00-6.00 sec 839 MBytes 7.04 Gbits/sec 0 2.33 MBytes
    [ 4] 6.00-7.00 sec 872 MBytes 7.32 Gbits/sec 0 2.97 MBytes
    [ 4] 7.00-8.00 sec 867 MBytes 7.27 Gbits/sec 25 1.53 MBytes
    [ 4] 8.00-9.00 sec 866 MBytes 7.27 Gbits/sec 39 1.30 MBytes
    [ 4] 9.00-10.00 sec 910 MBytes 7.63 Gbits/sec 27 1.24 MBytes
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval Transfer Bandwidth Retr
    [ 4] 0.00-10.00 sec 8.49 GBytes 7.30 Gbits/sec 132 sender
    [ 4] 0.00-10.00 sec 8.49 GBytes 7.29 Gbits/sec receiver

    iperf Done.
    ```

    The two cores were ~95% utilized. I also tried with bigger machines and the results look similar. The overhead of the solution itself may be the bottleneck.

    1. One of my team members has also suggested WireGuard. Thanks for bringing this up. We will set up a PoC system and compare its management overhead with IPSEC whenever we get time, and update this post based on that.


