Sunday, January 8, 2017

Analyzing UDP Performance

I came upon an issue where there was concern about external factors that might impact the receipt and processing of UDP packets. Specifically, the question was, "why am I dropping UDP packets?" There are a multitude of explanations, but since the answer isn't nearly as much fun as the search, here's how I started to investigate the problem. This post is focused strictly on the process, not the answer.

Based upon the query, I knew walking into this the following:
  • irqbalance was in play, so the potential existed for a process to move around.
  • Other processes ran on the host (no big surprise), but admin processes should have all been pinned to CPU0.
  • The process in question (that consumed the UDP traffic) was being pinned to another CPU (not CPU0).
  • The fluctuation in the UDP receive queue implied that something was taking place on the same CPU, temporarily halting the movement of data from the buffer to the application.

The Lab


The target host is a 4-core Intel Xeon machine (2 sockets x 2 cores each) with a 1 Gbit full-duplex NIC, running CentOS 7.2 (build 1511). The test applications are written for (and run in) Python 2.7. The client is a Linux Mint desktop, also with a 1 Gbit full-duplex NIC.

Generating Traffic


To begin with, we need a way to generate UDP traffic at a pretty high rate (we want to induce load). I came across a wiki page at python.org that explained how to pass a message via UDP using a client/server application, and I extended it into something that would do the same at a very high rate. Here's the code:

Server Side Application


#!/usr/bin/python

# udpconsumer.py: Open a port to read inbound UDP packet data

import socket

# Create a UDP socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Bind the socket to the port.
server_address = ('0.0.0.0', 10000)
print 'starting up on %s port %s' % server_address
sock.bind(server_address)

while True:
    data, address = sock.recvfrom(4096)
    print data



We begin by importing the socket module and creating the socket. Next we bind the socket to all interfaces on port 10000 (essentially, unlocking the door to allow traffic in). After a short message to the user, we enter an infinite loop that does nothing more than read the data sent by the client application below and print it to the console. The print is nice for confirming the app is working, but if you want even higher throughput, eliminate the print statement and the server can process traffic at a higher rate.
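
If you'd rather have the server quantify its own throughput than eyeball a scrolling console, here's a minimal variation (my own sketch, not part of the python.org example; the name udpconsumer_rate.py and the one-second window are arbitrary) that drops the per-packet print and reports how many datagrams arrive each second:

#!/usr/bin/python

# udpconsumer_rate.py: count inbound UDP datagrams and report packets/sec
# (hypothetical variant of udpconsumer.py for this post)

import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('0.0.0.0', 10000))
print 'starting up on 0.0.0.0 port 10000'

count = 0
window_start = time.time()

while True:
    data, address = sock.recvfrom(4096)
    count += 1
    now = time.time()
    if now - window_start >= 1.0:
        # one print per second is far cheaper than one print per packet
        print '%d pkts/sec' % count
        count = 0
        window_start = now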

Client Side Application


#!/usr/bin/python

#udptg.py: UDP traffic generator

import socket
import sys

#Create a UDP socket
try:
    sock = socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
except socket.error, msg:
    print 'Failed to open socket. Error code: ' + str(msg[0]) + ': ' + msg[1]
    sys.exit()

server_address = ('172.16.1.195', 10000)
pktcounter = 0

while True:
    pktcounter += 1
    # send data
    pkt_msg = str(pktcounter)
    sent = sock.sendto(pkt_msg, server_address)



In this app, we import our modules and open the socket. Next we specify where we are sending traffic (IP and port). We set a counter, which ends up being both the payload data and the count of UDP packets sent. Finally, we create another infinite while loop to send data (the incrementing counter value) as fast as the host will allow.
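
The loop above sends as fast as the CPU and NIC will allow. If you want a more controlled, repeatable load instead, a small variation like this (again my own sketch; the PKTS_PER_SEC value is an arbitrary illustration) paces the sender toward a target rate. Note that time.sleep() is imprecise at sub-millisecond intervals, so treat the rate as approximate.

#!/usr/bin/python

# udptg_throttled.py: send UDP packets at (roughly) a fixed rate
# (hypothetical variant of udptg.py; rate and server address are assumptions)

import socket
import time

PKTS_PER_SEC = 50000                      # target send rate
server_address = ('172.16.1.195', 10000)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
interval = 1.0 / PKTS_PER_SEC
pktcounter = 0

while True:
    pktcounter += 1
    sock.sendto(str(pktcounter), server_address)
    time.sleep(interval)                  # crude pacing, good enough for a lab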

Running the Applications


On the server, we start the server with:

    # ./udpconsumer.py

It will display a message, then just sit and wait for data.


On the client, when it comes time to send data, we initiate the script with:

    # ./udptg.py

Once this is running, watch the terminal where the server was started; you should see numbers being printed that increment by 1 on each line.

Measurement


OK, so with that, we now need some tools to measure what's going on. There are a few things I want to look at to start with:

  • Size of the UDP receive buffer (how much data is waiting for the application)
  • Impact to CPUs (utilization)
  • Context switches (how often our ability to process inbound data is being interrupted)

Fortunately, these are not difficult to measure. To capture the buffer size, we can issue the following command:

    # while [ 1 ]; do netstat -lun | grep 10000 | awk \
      '{print $2}';sleep .1;done > file.dat

This polls the netstat command for UDP sockets and filters out everything except the receive queue size (Recv-Q), which is measured in bytes. It produces a file with one number per line. Drop that into a spreadsheet to graph, grab average and standard deviation stats, and you have a good picture of what's happening in the UDP receive buffer.
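
If you'd rather skip the spreadsheet, a short helper like this (my own sketch; it assumes file.dat contains one numeric sample per line, as produced by the loop above) computes the average and standard deviation directly:

#!/usr/bin/python

# qstats.py: average and standard deviation of the sampled Recv-Q values
# (assumes one numeric sample per line in file.dat)

import math
import sys

fname = sys.argv[1] if len(sys.argv) > 1 else 'file.dat'
samples = [float(line) for line in open(fname) if line.strip()]

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)

print 'samples: %d' % len(samples)
print 'average: %.1f bytes' % mean
print 'std dev: %.1f bytes' % math.sqrt(variance)

Run it as ./qstats.py file.dat once a test run has finished.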

Obtaining CPU utilization and context switches is just as easy. Each of the following will get that for you:

    # sar -P ALL   #displays CPU utilization on each CPU, and for
                   #all of them collectively.

    # sar -w       # shows context switches per second.


One important note on the sar data: you might consider shortening the collection interval from the default 10 minutes to 1 minute. The values sar displays are not point-in-time, but rather an average across the interval. By shortening the interval, your test runs don't need to be as long to get valid data. I set it to 1-minute intervals and ran the test for just over 7 minutes, throwing out the first and last minutes.

Poking the Bear to See What Happens


Now that we can watch what's happening, we want to play with some tunables to see what impact they have on overall performance. There are two things I looked at: CPU pinning and scheduling. By default, processes can run on any CPU, and can be moved around as the OS sees fit. In addition, the default CPU scheduler is SCHED_OTHER, and the default priority is 0 (basically, every process gets equal treatment). In our test, we want to see what happens if we force the kernel to only allow the application on a single CPU, and if we give that process more CPU time through the scheduler. The commands that allow this are taskset and chrt. For each, we need to be able to do two things: set the values, and get (display) the values.

Getting taskset and chrt Values


First, obtain the pid of the process. Since the server-side application just sits and waits for traffic, we can start it up, and get its pid from the process table. As long as we don't kill the server application, we can reuse the same pid throughout the exercise.

    # ps ax | grep [u]dpconsumer

The first value on the line will be the pid (which in this case, is 17648).
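
As an alternative to grepping the process table, the server could simply announce its own pid at startup. A two-line tweak (my own addition, not in the original script) placed right after the import in udpconsumer.py does it:

import os

# print our pid at startup so we don't have to hunt for it with ps
print 'udpconsumer pid: %d' % os.getpid()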

The following illustrates checking the current scheduler configuration:

    [root@nebula sysadmin]# chrt -p 17648
    pid 17648's current scheduling policy: SCHED_OTHER
    pid 17648's current scheduling priority: 0


Here, we check the CPU affinity:

    [root@nebula sysadmin]# taskset -pc 17648
    pid 17648's current affinity list: 0-3
    [root@nebula sysadmin]#


Now that we know how to inspect the settings, we need to be able to modify them.

Let's change the scheduler to give our application real-time priority, and verify the change:

    [root@nebula sysadmin]# chrt -f -p 99 17648
    [root@nebula sysadmin]# chrt -p 17648
    pid 17648's current scheduling policy: SCHED_FIFO
    pid 17648's current scheduling priority: 99


Our process now takes advantage of a first-in-first-out scheduler, with a priority of 99 (nearly pushing other processes out of the nest). Note that once you do this, everything else slows down, including ssh sessions into the host.

Next, we're going to pin the process to run only on CPU3.

    [root@nebula sysadmin]# taskset -p 08 17648
    pid 17648's current affinity mask: f
    pid 17648's new affinity mask: 8
    [root@nebula sysadmin]# taskset -pc 17648
    pid 17648's current affinity list: 3


There is a little black magic here. When setting affinity, taskset takes a bitmask rather than a CPU number: each bit represents one CPU, so 8 (binary 1000) means CPU 3, while f (binary 1111) means all four CPUs. This allows you to set the affinity to not only a single CPU, but any combination of CPUs; it just requires that you understand how to translate the mask value to one or more CPUs, and vice versa. This is pretty well documented in the taskset man page.
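
For the bit-twiddling averse, here's a tiny helper (my own illustration, not something taskset provides) that shows how a list of CPUs maps to the hex mask taskset expects:

#!/usr/bin/python

# cpumask.py: translate a list of CPU numbers into a taskset-style hex mask

def cpus_to_mask(cpus):
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu                    # one bit per CPU
    return mask

print '%x' % cpus_to_mask([3])              # 8 -> CPU 3 only
print '%x' % cpus_to_mask([0, 1, 2, 3])     # f -> all four CPUs

For what it's worth, taskset's -c option also accepts a plain CPU list when setting affinity (e.g., taskset -pc 3 17648), which sidesteps the mask arithmetic entirely.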

Test Design


OK, so we can see how to inspect the queue, CPU utilization, and context switching, and we know how to change a few tunables that might impact those values. Let's put them all together.

I set up a test that consisted of 4 runs (tying the scheduler/priority together as a single variable; you can split them out, but that would then require 8 runs). For each run, I started by invoking the udptg.py application on the client (we executed udpconsumer.py earlier to get the pid, and it's still running on the server). After waiting a few seconds, I then kicked off the netstat command in the while loop on the server. I let that run for 7 minutes, then killed the while/netstat command, followed by the udptg.py command. After a minute or two of rest, I issued the sar commands and captured the output. The reason we wait a few minutes is to allow periods of low activity to sit in between the test runs; this makes it easier to differentiate the test periods from each other.

Next we issue a taskset or chrt command (but only one at a time; if you found this page, you already understand the perils of measuring the impact of changing two variables at the same time).

Start the next test the same as the first, rinse and repeat through all four combinations.
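
As a side note, the while/netstat loop can be replaced by polling /proc/net/udp directly, which is where netstat gets its numbers anyway. Here's a rough sketch (my own; the port, duration, and output file name are assumptions for this lab) that pulls the rx_queue field for the socket bound to port 10000:

#!/usr/bin/python

# qpoll.py: sample the Recv-Q of the UDP socket on port 10000 from /proc/net/udp

import time

PORT = 10000
DURATION = 420            # seconds (7 minutes)
INTERVAL = 0.1            # seconds between samples

end = time.time() + DURATION
out = open('file.dat', 'w')

while time.time() < end:
    for line in open('/proc/net/udp'):
        fields = line.split()
        if len(fields) < 5 or ':' not in fields[1]:
            continue                                    # header line
        if int(fields[1].split(':')[1], 16) != PORT:    # local port, in hex
            continue
        rx_queue = int(fields[4].split(':')[1], 16)     # tx_queue:rx_queue
        out.write('%d\n' % rx_queue)
        # fields[-1] is the per-socket drop counter; worth watching as well
    time.sleep(INTERVAL)

out.close()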

Here's the result from testing in my lab:

AFFINITY       ALL        CPU 3      CPU 3      ALL
PRIO/SCHED     0/OTHER    0/OTHER    99/FIFO    99/FIFO

-------------------------------------------------------
AVG Q SZ        204448     207177      55571      51660
STD_DEV          24866      17924      90645      88253
CPU Util% Avg     42.0       96.2       49.3       14.7
CSWITCH/s     234035.3   242664.8    16752.4    17020.4

The sar data shows the impact on CPU utilization across all CPUs during each of the 4 tests. Looking specifically at CPU 3 once the process is pinned, we've removed much of the load from CPU 0, at the expense of CPU 3.

In the context-switch numbers, we see that increasing the priority for our application stopped much of the swapping of processes in and out of the CPU. This may be bad for the rest of the server, but it is good for the application.

TODO


There are certainly other variables you can change. If you're dropping packets, you can increase the buffer size, but all that really does is prolong the inevitable. There are other things to test as well; this is really just a first pass. You might also consider pinning all administrative OS activity to a single CPU. This keeps it off the CPU where your application is pinned (you didn't pin your application to CPU 0, did you?). In addition, if you're running other production applications, you might consider a second round in which you pin them to CPUs other than the one running your test, and observe the receive queue statistics.

I have intentionally avoided providing a conclusion. There are still some outstanding items (in addition to those above) that are not fully understood, and as such, it's difficult to infer a course of action. Most notably, there are a number of rows in the output of the netstat command with the value '213312' (many more than one would expect to occur organically). My initial hypothesis is that this is the maximum size of the buffer, but I still need to understand why this value appears so frequently. In addition, there are roughly 44M UDP packet receive errors reported in 'netstat -s', but no drops reported in the output of ifconfig. Finally, I am not comfortable with the excessive standard deviation listed for the last two columns. It's surely related to the appearance of '213312' in the data, but that needs to be confirmed. All that said, we've at least made a start in understanding how to reduce the buildup of data in the UDP buffer.

UPDATE 1


I ran the third configuration (pin the application to CPU 3, PRIO=99, scheduler=FIFO) again, but this time isolated CPUs 2 and 3 from all other processes. This was accomplished by adding

     isolcpus=2,3
 
to the end of the GRUB_CMDLINE_LINUX line in /etc/default/grub, regenerating the grub2 configuration with:

     # grub2-mkconfig -o /boot/grub2/grub.cfg

and finally, rebooting. A quick look at the output of 'sar -P ALL' confirmed that there was no activity on CPUs 2 and 3. Here is the comparison of the two sets, changing only the CPU isolation:

ISOLCPUS       N/A        2,3
AFFINITY       CPU 3      CPU 3
PRIO/SCHED     99/FIFO    99/FIFO

---------------------------------
AVG Q SZ         55571       6754
STD_DEV          90645       8761
CPU Util% Avg     49.3       44.0
CSWITCH/s      16752.4    15507.6


The average receive queue size is just over 10% of the previous value, the standard deviation is less than 10% of what it was, and we trimmed another ~5% off CPU utilization for the core. This is a huge improvement, but by no means deterministic yet. In addition, the '213312' value is still showing up in the output of the netstat command, and our context switches show only minimal improvement.

I next tried to test the hypothesis that 213312 is the UDP receive buffer size. By adding the following two statements before the while loop in udpconsumer.py, we can get and print the buffer size:

udprcvbufsz = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print 'UDP Recv Buffer Size: ' + str(udprcvbufsz) + ' bytes'

This returns the value 212992, which is exactly 320 bytes smaller than our mystery value. I have read that the kernel reserves some of the socket buffer space for its own bookkeeping, and I think that may be what we're seeing here (but I still need to research this).
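
If you do want to experiment with a larger buffer (keeping in mind the earlier caveat that it mostly prolongs the inevitable), the application can request one with setsockopt. Here's a small sketch (my own; the 4 MB figure is arbitrary). As I understand it, the kernel caps the request at the net.core.rmem_max sysctl and doubles it to account for its own overhead, which is what getsockopt then reports:

#!/usr/bin/python

# rcvbufsz.py: request a larger UDP receive buffer and report what we got

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)

udprcvbufsz = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print 'UDP Recv Buffer Size: ' + str(udprcvbufsz) + ' bytes'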

I also captured the pre- and post-netstat statistics (netstat -s), and found the following:

Metric                     Actual  %_of_Tot
-------------------------------------------
TOTAL UDP Pkts Received   2995612    100.00
Pkt Recv Errors             94435      3.15
Recv Buffer Errors              0      0.00
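
To make the pre/post comparison less manual, a rough parser like the following (my own sketch; it assumes the 'Udp:' section of 'netstat -su' is made up of indented '<count> <description>' lines, which is how net-tools formats it) can diff two saved captures:

#!/usr/bin/python

# udpdelta.py: diff the Udp: counters between two saved 'netstat -su' captures

import sys

def udp_counters(path):
    counters = {}
    in_udp = False
    for line in open(path):
        if not line.startswith(' '):            # section header, e.g. 'Udp:'
            in_udp = line.strip() == 'Udp:'
            continue
        if in_udp:
            parts = line.strip().split(None, 1)
            if len(parts) == 2 and parts[0].isdigit():
                counters[parts[1]] = int(parts[0])
    return counters

before = udp_counters(sys.argv[1])      # e.g. before.txt
after = udp_counters(sys.argv[2])       # e.g. after.txt

for name in sorted(after):
    print '%-45s %d' % (name, after[name] - before.get(name, 0))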