Sending packets has several limitations:
We assume that the receiver will inevitably drop packets if it is flooded with them: incoming packets are copied into buffers, and if those buffers are full (because the receiver is too slow to process packets at that rate), additional packets are dropped.
We also assume that AWS imposes a flow limit on packets from (or to) the same port (UDP port? NIC port?). We have to look into this further.
For this initial speed test, I wanted to measure the upper limit on the number of packets that can be sent. For this experiment, I gradually increased the number of packets sent per second by reducing the usleep time between packet sends.
The duration of a run is determined by measuring the absolute time once at the beginning and once at the end of the run. Since measurement accuracy often decreases as the measured quantity gets smaller, I wanted to keep the duration of each run in the millisecond range. To achieve this, the sleep time between sent packets is halved for every run, while the number of sent packets is increased by a factor of 1.3 per run.
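As a rough illustration of that schedule, a minimal sketch of the measurement loop (with assumed starting values and a placeholder for the actual send call, not the real TX code):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <unistd.h>

int main() {
    double sleep_us = 1000.0;  // assumed initial sleep between two packets
    double packets  = 1000.0;  // assumed initial number of packets per run

    for (uint32_t run = 0; run < 10; run++) {
        auto start = std::chrono::steady_clock::now();
        for (uint64_t i = 0; i < static_cast<uint64_t>(packets); i++) {
            // sendPacket(run, i);  // placeholder for the real DPDK send
            usleep(static_cast<useconds_t>(sleep_us));
        }
        auto end = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(end - start).count();
        std::printf("run %u: %llu packets in %.1f ms\n",
                    run, static_cast<unsigned long long>(packets), ms);

        sleep_us /= 2.0;  // halve the sleep time for the next run
        packets  *= 1.3;  // increase the packet count by a factor of 1.3
    }
    return 0;
}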
I transmitted a single uint64_t as payload. The first four bytes contain the run number, the second four bytes the number of the packet. This should keep the work for the receiver as low as possible, as the receiver only has to check the run number and packet number of each packet it receives.
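A sketch of how such a payload could be packed and unpacked (assuming the run number sits in the upper four bytes and the packet number in the lower four; the exact byte order on the wire depends on how the real TX code serializes the value):

#include <cstdint>

// Pack run number and packet number into a single uint64_t payload.
uint64_t packPayload(uint32_t run, uint32_t packetNumber) {
    return (static_cast<uint64_t>(run) << 32) | packetNumber;
}

// Recover the two halves on the receiver side.
uint32_t payloadRun(uint64_t payload)          { return static_cast<uint32_t>(payload >> 32); }
uint32_t payloadPacketNumber(uint64_t payload) { return static_cast<uint32_t>(payload & 0xFFFFFFFFu); }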
In this experiment, not a single packet was dropped. Every packet was received by RX.
Currently, the bottleneck is not the receiver (RX).
For future experiments, we have to look into the following things:
We currently use a c5.large, with a net_peak_gbitps of 10, according to cloudspecs. Since we want a server that does not rely on boosted network speed, we have to choose a server where net_peak_gbitps = net_gbitps. r8gn.48xlarge, for example, offers this with net_peak_gbitps = 600, but at a substantial price, of course.
10 Gbit/s uses SI units and is based on base 10, not base 2. So 10 Gbit = 10 × 10^9 bit = 10,000,000,000 bit. With 16,880 sent packets per second, each containing 862 bytes, we only reach 116,273,456 bit per second, which is only about one hundredth of the available bandwidth of a c5.large.
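As a sanity check, the ratio of measured to available bandwidth works out to:

\[
\frac{116{,}273{,}456\ \text{bit/s}}{10 \times 10^{9}\ \text{bit/s}} \approx 0.0116 \approx 1.2\,\%
\]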
In this experiment, the UDP port is changed for every packet. TX::sentPacket now contains:

udpHeader.src_port = rte_cpu_to_be_16(static_cast<uint16_t>(content));
udpHeader.dst_port = rte_cpu_to_be_16(static_cast<uint16_t>(content));
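For context, this is roughly where such an assignment sits in a send path. A sketch assuming an Ethernet + IPv4 + UDP frame already laid out in the mbuf; the header offsets and the checksum note are my assumptions, not the actual TX::sentPacket code:

#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_mbuf.h>
#include <rte_udp.h>

// Set a per-packet UDP source/destination port on an already built frame.
static void setUdpPorts(struct rte_mbuf *m, uint16_t port) {
    struct rte_udp_hdr *udpHeader = rte_pktmbuf_mtod_offset(
        m, struct rte_udp_hdr *,
        sizeof(struct rte_ether_hdr) + sizeof(struct rte_ipv4_hdr));
    udpHeader->src_port = rte_cpu_to_be_16(port);
    udpHeader->dst_port = rte_cpu_to_be_16(port);
    // If a non-zero UDP checksum is used, it has to be recomputed here as well.
}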
Infuriatingly, changing the UDP port with every packet leads to quite a few dropped packets, and fewer packets sent on their way by TX.
Fewer packets sent by TX can of course be explained by the (slightly) increased work of changing the UDP header every time. It is probable that the bottleneck is not the network after all, but the speed at which the CPU adds packets to the NIC buffer. Maybe the CPU simply cannot create more packets in that time.
Maybe I should also test how fast sending is without using usleep at all.
Currently, TX only ever adds a single packet at a time. As the ixy paper explains in chapter 5.2, Batching, this is not optimal.
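A batched version could allocate a whole burst of mbufs at once, fill them, and hand them to the NIC in a single rte_eth_tx_burst() call. This is only a sketch assuming a ready mempool and an initialized port/queue; BURST_SIZE, fillPacket, port_id and queue_id are placeholder names, not the actual TX code:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 64

// Placeholder for building the real frame: here it only appends an
// 8-byte payload carrying the packet number.
static void fillPacket(struct rte_mbuf *m, uint64_t packetNumber) {
    uint64_t *payload = reinterpret_cast<uint64_t *>(
        rte_pktmbuf_append(m, sizeof(uint64_t)));
    if (payload != nullptr)
        *payload = packetNumber;
}

// Allocate, fill and transmit a whole burst with a single driver call.
static uint16_t sendBurst(uint16_t port_id, uint16_t queue_id,
                          struct rte_mempool *pool) {
    struct rte_mbuf *bufs[BURST_SIZE];
    if (rte_pktmbuf_alloc_bulk(pool, bufs, BURST_SIZE) != 0)
        return 0;
    for (uint64_t i = 0; i < BURST_SIZE; i++)
        fillPacket(bufs[i], i);
    uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, bufs, BURST_SIZE);
    // mbufs the driver did not accept still belong to the application.
    for (uint16_t i = nb_tx; i < BURST_SIZE; i++)
        rte_pktmbuf_free(bufs[i]);
    return nb_tx;
}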
According to the DPDK documentation about
rte_mbuf_raw_free:
"Put mbuf back into its original mempool. The caller must ensure that the mbuf is direct and properly reinitialized (refcnt=1, next=NULL, nb_segs=1), as done by rte_pktmbuf_prefree_seg(). This function should be used with care, when optimization is required. For standard needs, prefer rte_pktmbuf_free() or rte_pktmbuf_free_seg()."
So rte_mbuf_raw_free is a function that should be used for optimizations, but it can only safely be used if the mbuf has a reference count of 1 (no one else uses it) and is not chained with other mbufs (hence next=NULL and nb_segs=1).
According to stackoverflow, rte_eth_tx_burst hands the accepted mbufs over to the driver, which frees them itself after transmission, so maybe calling rte_mbuf_raw_free is not correct.
So, to increase the number of packets sent, I could simply add more packets to a single burst while staying single-threaded.
But I might also be able to use multithreading, because the DPDK documentation claims (in the description of the MT_LOCKFREE Tx offload capability):
"Multiple threads can invoke rte_eth_tx_burst() concurrently on the same Tx queue without SW lock."