Transmission Speed Tests

1. Initial speed test

Sending packets is subject to several limitations:

We assume that the receiver will inevitably drop packets if it is flooded with them: incoming packets are copied into buffers, and if those buffers are full (because the receiver is too slow to process packets at this rate), additional packets will be dropped.

We also assume that AWS imposes a flow limit on packets from (or to) the same port (UDP port? NIC port?). We have to look into this further.

For this initial speed test, I wanted to measure the upper limit on the number of packets that can be sent. For this experiment, I gradually increased the number of packets sent per second. This was achieved by reducing the usleep time between sending packets.

The duration of a run is determined by measuring the absolute time once at the beginning and once at the end of the run. Since the accuracy of measurements often decreases as the measured quantity gets smaller, I wanted to keep the duration of a run in the milliseconds range. To achieve this, the sleep time between sent packets is halved for every run, while the number of sent packets is increased by a factor of 1.3 per run.
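As a sketch (all names and starting values here are my own illustration, not the actual test code), the schedule looks roughly like this:

    #include <chrono>
    #include <cstdint>
    #include <unistd.h>   // usleep

    // Hypothetical sketch of the run schedule described above.
    void runSchedule(unsigned numRuns) {
        uint64_t sleepUs = 1024;    // assumed starting sleep time between packets
        double   numPkts = 1000;    // assumed starting packet count per run

        for (unsigned run = 0; run < numRuns; ++run) {
            auto start = std::chrono::steady_clock::now();

            for (uint64_t pkt = 0; pkt < static_cast<uint64_t>(numPkts); ++pkt) {
                // sendPacket(run, pkt);   // placeholder for the real TX call
                usleep(sleepUs);
            }

            auto end = std::chrono::steady_clock::now();
            auto durationMs =
                std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
            // log(run, numPkts, sleepUs, durationMs);
            (void)durationMs;

            sleepUs /= 2;     // halve the sleep time between packets
            numPkts *= 1.3;   // increase the number of packets by a factor of 1.3
        }
    }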

I transmitted a single uint64_t as payload. The first four bytes contain the run number, the second four bytes the packet number. This should keep the work for the receiver as low as possible, as the receiver only has to check whether any packet numbers are missing.
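A minimal sketch of that payload layout, assuming the run number sits in the upper four bytes as described above (the exact bit layout is not shown here, so this is an illustration):

    #include <cstdint>

    // Hypothetical payload layout: upper 4 bytes = run number, lower 4 bytes = packet number.
    inline uint64_t makePayload(uint32_t runNumber, uint32_t packetNumber) {
        return (static_cast<uint64_t>(runNumber) << 32) | packetNumber;
    }

    inline uint32_t runOf(uint64_t payload)    { return static_cast<uint32_t>(payload >> 32); }
    inline uint32_t packetOf(uint64_t payload) { return static_cast<uint32_t>(payload & 0xffffffffu); }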

1. Results

In this experiment, not a single packet was dropped. Every packet was received by RX.

1. Discussion

Currently, the bottleneck is not the receiver RX.

For future experiments, we have to look into the following things:

2. Changing UDP port

In this experiment, the UDP port is changed for every packet. TX::sentPacket now contains udpHeader.src_port = rte_cpu_to_be_16(static_cast<uint16_t>(content)); and udpHeader.dst_port = rte_cpu_to_be_16(static_cast<uint16_t>(content));.
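For context, a sketch of how such a per-packet port change could look in full. The header structs and the offset macro are the stock DPDK way to reach the UDP header; the surrounding names are assumptions, since TX::sentPacket is not shown in full, and the struct names differ slightly between DPDK versions:

    #include <rte_mbuf.h>
    #include <rte_ether.h>
    #include <rte_ip.h>
    #include <rte_udp.h>
    #include <rte_byteorder.h>

    // Sketch: overwrite the UDP ports of an already-built packet with a per-packet value.
    // Assumes an untagged Ethernet + IPv4 + UDP frame; names and offsets are illustrative.
    static void setUdpPorts(struct rte_mbuf *m, uint64_t content) {
        struct rte_udp_hdr *udpHeader = rte_pktmbuf_mtod_offset(
            m, struct rte_udp_hdr *,
            sizeof(struct rte_ether_hdr) + sizeof(struct rte_ipv4_hdr));

        udpHeader->src_port = rte_cpu_to_be_16(static_cast<uint16_t>(content));
        udpHeader->dst_port = rte_cpu_to_be_16(static_cast<uint16_t>(content));
    }

Note that rewriting the ports also invalidates a non-zero UDP checksum; for UDP over IPv4 a zero checksum is allowed, so the sketch ignores it.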

Infuriatingly, changing the UDP port with every packet leads to quite a few dropped packets, and to fewer packets being sent on their way by TX.

Fewer packets sent by TX can of course be explained by the (slightly) increased work of changing the UDP header every time. It is probable that the bottleneck is not the network after all, but the speed at which the CPU adds packets to the NIC buffer. Maybe the CPU simply cannot create more packets in that time.

Further Investigation

Currently, TX always adds only a single packet. As the ixy paper explains in chapter 5.2 Batching, this is not optimal.

According to the DPDK documentation about rte_mbuf_raw_free:

"Put mbuf back into its original mempool. The caller must ensure that the mbuf is direct and properly reinitialized (refcnt=1, next=NULL, nb_segs=1), as done by rte_pktmbuf_prefree_seg(). This function should be used with care, when optimization is required. For standard needs, prefer rte_pktmbuf_free() or rte_pktmbuf_free_seg()."

So rte_mbuf_raw_free is a function that should be used for optimizations, but it can only safely be used if the mbuf has a reference count of 1 (no one else uses it) and is not chained with other mbufs (hence next=NULL and nb_segs=1).
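A sketch of how the two functions are typically combined, following the quoted documentation (this is the general pattern, not code from this project):

    #include <rte_mbuf.h>

    // Free a single-segment mbuf the "optimized" way: rte_pktmbuf_prefree_seg()
    // performs the reinitialization the documentation demands (refcnt, next, nb_segs)
    // and returns NULL if someone else still holds a reference.
    static inline void freeSegFast(struct rte_mbuf *m) {
        struct rte_mbuf *seg = rte_pktmbuf_prefree_seg(m);
        if (seg != NULL)
            rte_mbuf_raw_free(seg);   // put it straight back into its mempool
        // For anything more complicated (chained or indirect mbufs),
        // rte_pktmbuf_free() remains the safe default.
    }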

According to Stack Overflow, rte_eth_tx_burst ???, so maybe calling rte_mbuf_raw_free is not correct.

So in order to increase the number of packets sent, I could simply add more packets to a single burst from a single thread, as sketched below.
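A sketch of such a batched send (the mempool, port and queue IDs, the batch size, and fillPacket are placeholders, not the project's actual code):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    constexpr uint16_t BURST_SIZE = 64;   // assumed batch size

    // Send one batch of packets instead of calling rte_eth_tx_burst() per packet.
    static uint16_t sendBurst(struct rte_mempool *pool, uint16_t portId, uint16_t queueId) {
        struct rte_mbuf *bufs[BURST_SIZE];
        if (rte_pktmbuf_alloc_bulk(pool, bufs, BURST_SIZE) != 0)
            return 0;                     // pool exhausted, nothing sent

        for (uint16_t i = 0; i < BURST_SIZE; ++i) {
            // fillPacket(bufs[i], ...);  // placeholder: write headers and payload
        }

        // The driver takes ownership of the mbufs it accepts and frees them itself.
        uint16_t sent = rte_eth_tx_burst(portId, queueId, bufs, BURST_SIZE);

        // Packets the NIC did not accept are still ours and must be freed (or retried).
        for (uint16_t i = sent; i < BURST_SIZE; ++i)
            rte_pktmbuf_free(bufs[i]);

        return sent;
    }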

But I can also use multithreading easily, because the DPDK documentation claims:

"Multiple threads can invoke rte_eth_tx_burst() concurrently on the same Tx queue without SW lock."