
Posts

Showing posts from 2013

Ptrace

Ptrace is a nice facility (some people call it a dirty hack) on Linux for debugging running processes. The ptrace in sys/ptrace.h is what strace and gdb are built on. To trace a child process, the child should call PTRACE_TRACEME. During each system call (or execution of each instruction) the kernel checks whether the process is traced; if it is, it raises a SIGTRAP, and the parent process, sitting in wait(), gets the signal. The parent issues a SIGSTOP to hold the current state of the child, can read the child's registers and memory using PEEKDATA, and can alter values in registers and memory using POKEDATA. Once the required job is done, the parent lets the child run again with a SIGCONT signal. Since one can access the registers, the next instruction to be executed is easily found from the instruction pointer, which comes in handy when setting breakpoints while debugging. The entire code can also be changed using ptrace. PTRACE_ATTACH attaches to a running process. It does some hack to become…
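As a minimal sketch (mine, not the post's code) of the TRACEME/PEEKDATA flow described above, the following assumes an x86-64 Linux host and a tracee that simply execs ls:

/* Sketch: parent traces a child and peeks at the word the child is about
 * to execute. Assumes x86-64 Linux. */
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/user.h>
#include <unistd.h>

int main(void) {
    pid_t child = fork();
    if (child == 0) {
        /* Child: ask to be traced, then exec; the exec stops it with SIGTRAP. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execl("/bin/ls", "ls", NULL);
    } else {
        int status;
        struct user_regs_struct regs;
        wait(&status);                          /* child stopped at exec */
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        long word = ptrace(PTRACE_PEEKDATA, child, (void *)regs.rip, NULL);
        printf("rip = %llx, next word = %lx\n", regs.rip, word);
        ptrace(PTRACE_CONT, child, NULL, NULL); /* let the child run on */
        wait(&status);
    }
    return 0;
}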

Elastic Search

I was reading about logstash ( http://logstash.net/ ), a tool to mine raw logs and make the data useful. I got attracted to the elastic search module it incorporates. Elastic search ( http://www.elasticsearch.org/ ) helps in indexing documents and thereby speeds up text searches. When you pass a document to elastic search, it indexes the document and creates an inverted index. An inverted index is like the index at the back of a book: terms are mapped to the documents they belong to, so text searching becomes easier. I had a quick read through elastic search and tried a little hands-on to build an autocomplete system. Elastic search allows you to create an index, which is similar to a database; an index can have types, which are similar to tables; and each type has a mapping, similar to a schema. Elastic search has a set of tokenizers and analyzers for text processing which it inherits from the Lucene project. Elastic search helps to scale…
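A rough sketch of the kind of index setup an autocomplete system needs, sent over HTTP with libcurl; the index name "products", the port 9200 and the analyzer settings are my assumptions, not details from the post:

/* Sketch only: create a hypothetical "products" index with an edge_ngram
 * analyzer for autocomplete, using libcurl against a local Elasticsearch. */
#include <stdio.h>
#include <curl/curl.h>

int main(void) {
    const char *body =
        "{\"settings\":{\"analysis\":{"
        "\"filter\":{\"autocomplete_filter\":"
        "{\"type\":\"edge_ngram\",\"min_gram\":1,\"max_gram\":20}},"
        "\"analyzer\":{\"autocomplete\":{\"type\":\"custom\","
        "\"tokenizer\":\"standard\","
        "\"filter\":[\"lowercase\",\"autocomplete_filter\"]}}}}}";

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (curl) {
        struct curl_slist *hdrs =
            curl_slist_append(NULL, "Content-Type: application/json");
        curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:9200/products");
        curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, "PUT");  /* create the index */
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);
        CURLcode rc = curl_easy_perform(curl);
        if (rc != CURLE_OK)
            fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));
        curl_slist_free_all(hdrs);
        curl_easy_cleanup(curl);
    }
    curl_global_cleanup();
    return 0;
}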

TCP Recycle

TCP connection tear-down is one of the least documented parts on the internet. In an ideal web server/client architecture, once the server sends FIN, the client sends FIN+ACK, the server sends an ACK and enters the TIME_WAIT state. Look at this diagram. Now let's consider our problem. We have deployed new servers which receive huge traffic with little processing time (100 ms). All these connections enter the TIME_WAIT state to tear down the connection. TIME_WAIT is useful for two reasons: (1) if the last ACK sent by the server is lost half way, the client will retransmit FIN+ACK, and if the socket had already been reused by some other connection there would be confusion in the network; (2) if a router malfunctioned and any of the data from client to server was lost, the client would have retransmitted it after the RTO (retransmission timeout), and there is a high probability that the router might reinject the old packet if its segment lifetime is less than the MSL. Now this packet could also cause a conflict if the same TC…
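The excerpt cuts off mid-way, but the TIME_WAIT build-up itself is easy to observe. A rough diagnostic sketch of my own (not from the post) that counts TIME_WAIT sockets by scanning /proc/net/tcp on a Linux host; the state column is hex, and 0x06 is TCP_TIME_WAIT:

/* Count sockets sitting in TIME_WAIT by parsing /proc/net/tcp. */
#include <stdio.h>

int main(void) {
    FILE *fp = fopen("/proc/net/tcp", "r");
    if (!fp) { perror("/proc/net/tcp"); return 1; }

    char line[512];
    int time_wait = 0, total = 0;
    fgets(line, sizeof line, fp);              /* skip the header row */
    while (fgets(line, sizeof line, fp)) {
        unsigned st;
        /* fields: sl local_address rem_address st ... */
        if (sscanf(line, "%*s %*s %*s %x", &st) == 1) {
            total++;
            if (st == 0x06) time_wait++;
        }
    }
    fclose(fp);
    printf("%d of %d sockets are in TIME_WAIT\n", time_wait, total);
    return 0;
}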

HTTP Range Header

There was a huge attack on our infrastructure a couple of days ago. It didn't follow the regular pattern, so we did a lot of googling (google as a verb!), but eventually we were not able to fully work out the intricacies employed in the attack. During this literature reference, I came across the partial GET feature in the HTTP header. This is the feature used by download managers like IDM, which spawn multiple threads to download a file. The partial GET request specifies the byte ranges it wants to download, e.g.

GET / HTTP/1.1
Host: 127.0.0.1
Range: bytes=0-89

This request asks for the first 90 bytes (bytes 0 through 89) of the file. There can be more than one chunk requested in the Range header. The webserver responds with a 206 code for a partial GET. I have coded a primitive threaded downloader which spawns 10 threads, downloads 10 different chunks of a big file and unites them into a single file. Committed the initial version at github  https://github.com/kalyanceg/downloader/blob/master/curler.java . Pl…
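For illustration, here is a minimal sketch of a single ranged request with libcurl (this is not the Java downloader linked above; the URL and output file name are placeholders):

/* Fetch bytes 0-89 of a file into chunk0.part using the Range header. */
#include <stdio.h>
#include <curl/curl.h>

int main(void) {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (curl) {
        FILE *out = fopen("chunk0.part", "wb");
        curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1/bigfile.iso");
        curl_easy_setopt(curl, CURLOPT_RANGE, "0-89");   /* Range: bytes=0-89 */
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);  /* default write goes to the file */
        CURLcode rc = curl_easy_perform(curl);
        long code = 0;
        curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &code);
        /* A server honouring the Range header answers 206 Partial Content */
        printf("status %ld, result %s\n", code, curl_easy_strerror(rc));
        fclose(out);
        curl_easy_cleanup(curl);
    }
    curl_global_cleanup();
    return 0;
}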

Amateur Project ii

I have decided to port a few of my projects (poorly maintained on App Engine) as android apps. Today I ported the chatbot to android. It is very dumb now. Will continue developing the project iteratively. Please do check out the apk file for the pre-alpha version here  http://goo.gl/JVrbH  and give your suggestions by dropping a  mail .

LXC and Host Crashes

We had set up a bunch of lxc containers on two servers, each with 16-core CPUs and 64 GB RAM (for reliability and load balancing). Both servers are on the same VLAN. The servers need to have at least one of their network interfaces in promiscuous mode so that it forwards all packets on the VLAN to the bridge ( http://blogs.eskratch.com/2012/10/create-your-own-vms-i.html ), which takes care of the routing to containers. If the packets are not addressed to the containers, the bridge drops them. With this setup, we moved all our platform maintenance services to these containers. They are fault tolerant, as we used two host machines where each host machine has a replica of the containers on the other. The probability of both servers crashing at the same time due to some hardware/software failure is low. But to my surprise both servers kept crashing at exactly the same time, with a mean lifetime of 20 days. We had to wake up late nights (early mornings) to fix stuff that went down. The…

Stupidity

A requirement came to me that we need to enable a timeout in php-curl with the timeout value in milliseconds. We can't even wait a full second for curl to terminate. PHP uses native curl on unix systems, and curl supports this feature in versions > 7.19; for our OS the curl version is 7.15. Of course we can install 7.19 using rpm or from source, but php has a wrapper module curl.so, so every php-curl method call goes through curl.so, which in turn calls native curl. So curl.so is not aware of the timeout_ms feature and started throwing php errors. A hack would be to replace curl.so with one that supports timeout_ms from another version of the OS, but nobody is interested in this change because if something crashes in production it would be a huge effort and loss. I downloaded the Source RPMs (RPMs are binary; Source RPMs have the source code and the SPEC file). The SPEC file takes care of the configure, make and install part (which we would have to do manually if we installed from source…
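For reference, this is what the millisecond timeout looks like at the native libcurl level, which is ultimately what the php wrapper has to pass the option down to; a sketch, assuming a libcurl new enough to know the _MS options:

/* Abort the transfer if it takes longer than 500 ms overall. */
#include <stdio.h>
#include <curl/curl.h>

int main(void) {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (curl) {
        curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1/");
        curl_easy_setopt(curl, CURLOPT_TIMEOUT_MS, 500L);        /* whole transfer */
        curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT_MS, 200L); /* connect phase */
        CURLcode rc = curl_easy_perform(curl);
        if (rc == CURLE_OPERATION_TIMEDOUT)
            fprintf(stderr, "timed out in under a second, as intended\n");
        else if (rc != CURLE_OK)
            fprintf(stderr, "curl error: %s\n", curl_easy_strerror(rc));
        curl_easy_cleanup(curl);
    }
    curl_global_cleanup();
    return 0;
}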

Idle Cpu

We have been facing an issue for the past month and I am still unsure of the reason. Will roll out a change next Monday on all our webservers to see how my reasoning works. Situation: we have a dozen webservers behind a load balancer in one of our setups. Each webserver receives some 300 req/sec, i.e. our setup receives around 5000 req/sec. Each web server has 1024 apache processes waiting for connections. At any moment about 600 apache processes will be free, but at the start of every hour the number of free processes dips, sometimes even reaching zero. When the number of idle processes reaches zero, response time, memory usage and contention for resources increase drastically, hitting performance. Possible reasons: it is proven that this situation is reproduced whenever logrotate runs, and logrotate is scheduled to run at the start of every hour. LogRotate: we log each apache access and error detail to a file, which comes in handy when the system is being attacked by some ano…