Sending packets has several limitations:
We assume that the receiver will inevitably drop packets if it is flooded with them: incoming packets are copied into buffers, and if those buffers are full (because the receiver is too slow to process packets at that rate), additional packets are dropped.
We also assume that AWS imposes a flow limit on packets from (or to) the same port (UDP port? NIC port?). We have to look into this further.
For this initial speed test, I wanted to measure the upper limit on the number of packets that can be sent. For this experiment, I gradually increased the number of packets sent per second by reducing the usleep time between packet sends.
The duration of a run is determined by measuring the absolute time once at the beginning and once at the end of the run. Since measurement accuracy often decreases as the measured quantity gets smaller, I wanted to keep the duration of each run in the millisecond range. To achieve this, the sleep time between sent packets is halved for every run, while the number of sent packets is increased by a factor of 1.3 per run.
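As a rough illustration of that schedule, a minimal sketch of the measurement loop (with assumed starting values and a placeholder for the actual send call, not the real TX code):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <unistd.h>

int main() {
    double sleep_us = 1000.0;  // assumed initial sleep between two packets
    double packets  = 1000.0;  // assumed initial number of packets per run

    for (uint32_t run = 0; run < 10; run++) {
        auto start = std::chrono::steady_clock::now();
        for (uint64_t i = 0; i < static_cast<uint64_t>(packets); i++) {
            // sendPacket(run, i);  // placeholder for the real DPDK send
            usleep(static_cast<useconds_t>(sleep_us));
        }
        auto end = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(end - start).count();
        std::printf("run %u: %llu packets in %.1f ms\n",
                    run, static_cast<unsigned long long>(packets), ms);

        sleep_us /= 2.0;  // halve the sleep time for the next run
        packets  *= 1.3;  // increase the packet count by a factor of 1.3
    }
    return 0;
}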
I transmitted a single uint64_t as payload. The first four bytes contain the run number, the second four bytes the number of the packet. This should keep the work for the receiver as low as possible, as the receiver only has to check the run number and packet number of each packet it receives.
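A sketch of how such a payload could be packed and unpacked (assuming the run number sits in the upper four bytes and the packet number in the lower four; the exact byte order on the wire depends on how the real TX code serializes the value):

#include <cstdint>

// Pack run number and packet number into a single uint64_t payload.
uint64_t packPayload(uint32_t run, uint32_t packetNumber) {
    return (static_cast<uint64_t>(run) << 32) | packetNumber;
}

// Recover the two halves on the receiver side.
uint32_t payloadRun(uint64_t payload)          { return static_cast<uint32_t>(payload >> 32); }
uint32_t payloadPacketNumber(uint64_t payload) { return static_cast<uint32_t>(payload & 0xFFFFFFFFu); }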
In this experiment, not a single packet was dropped. Every packet was received by RX.
Currently, the bottleneck is not the receiver (RX).
For future experiments, we have to look into the following things:
We currently use a c5.large, with a net_peak_gbitps of 10, according to cloudspecs. Since we want a server that does not rely on boosted network speed, we have to choose a server where net_peak_gbitps = net_gbitps. r8gn.48xlarge, for example, offers this with net_peak_gbitps = 600, but at a substantial price, of course.
10 Gbit/s uses SI units and is based on base 10, not base 2. So 10 Gbit = 10 × 10^9 bit = 10,000,000,000 bit. With 16,880 sent packets per second, each containing 862 bytes, we only reach 116,273,456 bit per second, which is only about one hundredth of the available bandwidth of a c5.large.
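As a sanity check, the ratio of measured to available bandwidth works out to:

\[
\frac{116{,}273{,}456\ \text{bit/s}}{10 \times 10^{9}\ \text{bit/s}} \approx 0.0116 \approx 1.2\,\%
\]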
In this experiment, the UDP port is changed for every packet. TX::sentPacket now contains:

udpHeader.src_port = rte_cpu_to_be_16(static_cast<uint16_t>(content));
udpHeader.dst_port = rte_cpu_to_be_16(static_cast<uint16_t>(content));
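For context, this is roughly where such an assignment sits in a send path. A sketch assuming an Ethernet + IPv4 + UDP frame already laid out in the mbuf; the header offsets and the checksum note are my assumptions, not the actual TX::sentPacket code:

#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_mbuf.h>
#include <rte_udp.h>

// Set a per-packet UDP source/destination port on an already built frame.
static void setUdpPorts(struct rte_mbuf *m, uint16_t port) {
    struct rte_udp_hdr *udpHeader = rte_pktmbuf_mtod_offset(
        m, struct rte_udp_hdr *,
        sizeof(struct rte_ether_hdr) + sizeof(struct rte_ipv4_hdr));
    udpHeader->src_port = rte_cpu_to_be_16(port);
    udpHeader->dst_port = rte_cpu_to_be_16(port);
    // If a non-zero UDP checksum is used, it has to be recomputed here as well.
}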
Infuriatingly, changing the UDP port with every packet leads to quite a few dropped packets, and fewer packets sent on their way by TX.
Fewer packets sent by TX can of course be explained by the (slightly) increased work of changing the UDP header every time. It is probable that the bottleneck is not the network after all, but the speed at which the CPU adds packets to the NIC buffer. Maybe the CPU simply cannot create more packets in that time.
Maybe I should also test how fast sending is without using usleep at all.
Currently, TX only ever adds a single packet at a time. As the ixy paper explains in chapter 5.2, Batching, this is not optimal.
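A batched version could allocate a whole burst of mbufs at once, fill them, and hand them to the NIC in a single rte_eth_tx_burst() call. This is only a sketch assuming a ready mempool and an initialized port/queue; BURST_SIZE, fillPacket, port_id and queue_id are placeholder names, not the actual TX code:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 64

// Placeholder for building the real frame: here it only appends an
// 8-byte payload carrying the packet number.
static void fillPacket(struct rte_mbuf *m, uint64_t packetNumber) {
    uint64_t *payload = reinterpret_cast<uint64_t *>(
        rte_pktmbuf_append(m, sizeof(uint64_t)));
    if (payload != nullptr)
        *payload = packetNumber;
}

// Allocate, fill and transmit a whole burst with a single driver call.
static uint16_t sendBurst(uint16_t port_id, uint16_t queue_id,
                          struct rte_mempool *pool) {
    struct rte_mbuf *bufs[BURST_SIZE];
    if (rte_pktmbuf_alloc_bulk(pool, bufs, BURST_SIZE) != 0)
        return 0;
    for (uint64_t i = 0; i < BURST_SIZE; i++)
        fillPacket(bufs[i], i);
    uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, bufs, BURST_SIZE);
    // mbufs the driver did not accept still belong to the application.
    for (uint16_t i = nb_tx; i < BURST_SIZE; i++)
        rte_pktmbuf_free(bufs[i]);
    return nb_tx;
}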
According to the DPDK documentation about
rte_mbuf_raw_free:
"Put mbuf back into its original mempool. The caller must ensure that the mbuf is direct and properly reinitialized (refcnt=1, next=NULL, nb_segs=1), as done by rte_pktmbuf_prefree_seg(). This function should be used with care, when optimization is required. For standard needs, prefer rte_pktmbuf_free() or rte_pktmbuf_free_seg()."
So rte_mbuf_raw_free is a function that should be used for optimizations, but it can only safely be used if the mbuf has a reference count of 1 (no one else uses it) and is not chained with other mbufs (hence next=NULL and nb_segs=1).
According to stackoverflow, rte_eth_tx_burst hands the accepted mbufs over to the driver, which frees them itself after transmission, so maybe calling rte_mbuf_raw_free is not correct.
So, to increase the number of packets sent, I could simply add more packets to a single burst while staying single-threaded.
But I might also be able to use multithreading, because the DPDK documentation claims (in the description of the MT_LOCKFREE Tx offload capability):
"Multiple threads can invoke rte_eth_tx_burst() concurrently on the same Tx queue without SW lock."