\chapter{\label{results}Results}
%** Results.tex: What were the results achieved including an evaluation
%
This section describes the achieved results and compares the P4 based implementation with real world software solutions. We distinguish the software implementation of P4 (BMV2) from the hardware implementation (NetFPGA) due to significant differences in deployment and development. We present benchmarks for the existing software solutions as well as for our hardware implementation. As the objective of this thesis was to demonstrate the high speed capabilities of NAT64 in hardware, no benchmarks were performed on the P4 software implementation.
% ok
% ----------------------------------------------------------------------
\section{\label{results:p4}P4 based implementations}
We successfully implemented P4 code to realise NAT64~\cite{schottelius:thesisrepo}. It contains parsers for all related protocols (IPv6, IPv4, UDP, TCP, ICMP, ICMP6, NDP, ARP), supports EAMT as defined by RFC~7757~\cite{rfc7757} and is feature equivalent to the two compared software solutions, tayga~\cite{lutchansky:_tayga_simpl_nat64_linux} and jool~\cite{mexico:_jool_open_sourc_siit_nat64_linux}. Due to limitations in the P4 environment of the NetFPGA~\cite{conclusion:netfpga}, the BMV2 implementation is more feature rich. Table~\ref{tab:benchmark} summarises the achieved bandwidths of the NAT64 solutions.

All planned features could be realised with P4 and a controller. For this thesis the parsing capabilities of P4 were adequate. However, at the time of writing, P4 cannot parse ICMP6 options in general: the upper level protocol does not specify the number of options that follow, so parsing an undefined number of 64 bit blocks would be required, which P4 does not support.
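The address mapping underlying NAT64 can be illustrated outside of P4. The following Python sketch embeds and extracts IPv4 addresses, assuming the well-known NAT64 prefix \texttt{64:ff9b::/96} of RFC~6052; it illustrates the mapping idea only and is not the thesis code, which additionally supports explicit EAMT entries per RFC~7757.

```python
import ipaddress

# Well-known NAT64 prefix (RFC 6052). An EAMT implementation (RFC 7757)
# would consult a table of explicit address mappings instead.
PREFIX = ipaddress.IPv6Network("64:ff9b::/96")

def v4_to_v6(v4: str) -> str:
    """Embed an IPv4 address into the /96 NAT64 prefix."""
    v4int = int(ipaddress.IPv4Address(v4))
    return str(ipaddress.IPv6Address(int(PREFIX.network_address) | v4int))

def v6_to_v4(v6: str) -> str:
    """Extract the embedded IPv4 address (the lower 32 bits)."""
    return str(ipaddress.IPv4Address(int(ipaddress.IPv6Address(v6)) & 0xFFFFFFFF))
```

In the P4 data plane this corresponds to rewriting the address fields of the IP header; the transport payload is untouched, which is why checksum handling (discussed below) becomes the central difficulty.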
The language imposes some limitations on the placement of conditional statements (\texttt{if/switch}).\footnote{In general, if and switch statements in actions lead to errors, but not all constellations are forbidden.} Furthermore, P4/BMV2 does not support multiple LPM keys in a table; it does, however, support multiple keys with ternary matching, which is a superset of LPM matching.

When developing P4 programs, the most frequent cause of incorrect behaviour that we observed were checksum problems. In retrospect this is expected, as the main task of our implementation is to modify headers on which the checksums depend. In all cases in which we observed Ethernet frame checksum errors, the effective length of the packet was incorrect.

The tooling around P4 is somewhat fragile. We encountered small language bugs during the development (\cite{schottelius:github1675}, appendix~\ref{appendix:expressionbug}) or found missing features~\cite{schottelius:github745},~\cite{theojepsen:_get}: it is at the moment impossible to retrieve the matching key from a table or the name of the action called. Thus if different table entries call the same action, it is impossible to distinguish, within the action or, if the packet is forwarded to the controller, within the controller, on which match the action was triggered. This limitation is consistent within P4, as not even the name of the matching table can be retrieved. While this information can be added manually as additional fields in the table entries, we would expect a language to support reading and forwarding this kind of meta information.

While in P4 the P4 code and the related controller are tightly coupled, their data definitions are not. Thus the packet format definition that is used between the P4 switch and the controller has to be duplicated. Our experiences in software development indicate that this duplication is a likely source of errors in bigger software projects. The supporting scripts in the P4 toolchain are usually written in python2.
However python2 ``is legacy''~\cite{various:_shoul_i_python_python}. During development, errors in the unicode string handling of python2 caused changes to IPv6 addresses.\footnote{Compare section~\ref{appendix:p4:python2unicode}.}
% ok
% ----------------------------------------------------------------------
\section{\label{results:bmv2}P4/BMV2}
The software implementation of P4 has the most features, which is mostly due to its capability of creating checksums over the payload. This capability enables the switch to act as a ``proper'' participant in NDP, as this requires the host to calculate checksums over the payload. Table~\ref{tab:p4bmv2features} lists all implemented features.
\begin{table}[htbp]
\begin{center}\begin{minipage}{\textwidth}
\begin{tabular}{| c | c | c |}
\hline
\textbf{Feature} & \textbf{Description} & \textbf{Status} \\ \hline
Switch to controller & Switch forwards unhandled packets to controller & fully implemented\footnote{Source code: \texttt{actions\_egress.p4}}\\ \hline
Controller to Switch & Controller can setup table entries & fully implemented\footnote{Source code: \texttt{controller.py}}\\ \hline
NDP & Switch responds to ICMP6 neighbor & \\
& solicitation request (without controller) & fully implemented\footnote{Source code: \texttt{actions\_icmp6\_ndp\_icmp.p4}} \\ \hline
ARP & Switch can answer ARP request (without controller) & fully implemented\footnote{Source code: \texttt{actions\_arp.p4}}\\ \hline
ICMP6 & Switch responds to ICMP6 echo request (without controller) & fully implemented\footnote{Source code: \texttt{actions\_icmp6\_ndp\_icmp.p4}} \\ \hline
ICMP & Switch responds to ICMP echo request (without controller) & fully implemented\footnote{Source code: \texttt{actions\_icmp6\_ndp\_icmp.p4}} \\ \hline
NAT64: TCP & Switch translates TCP with checksumming & \\
& from/to IPv6 to/from IPv4 & fully implemented\footnote{Source code: \texttt{actions\_nat64\_generic\_icmp.p4}} \\ \hline
NAT64: UDP & Switch translates UDP with checksumming & \\
& from/to IPv6 to/from IPv4 & fully implemented\footnote{Source code: \texttt{actions\_nat64\_generic\_icmp.p4}} \\ \hline
NAT64: & Switch translates echo request/reply & \\
ICMP/ICMP6 & from/to ICMP6 to/from ICMP with checksumming & fully implemented\footnote{Source code: \texttt{actions\_nat64\_generic\_icmp.p4}} \\ \hline
NAT64: Sessions & Switch and controller create 1:n sessions/mappings & fully implemented\footnote{Source code: \texttt{actions\_nat64\_session.p4}, \texttt{controller.py}} \\ \hline
Delta Checksum & Switch can calculate checksum without payload inspection & fully implemented\footnote{Source code: \texttt{actions\_delta\_checksum.p4}}\\ \hline
Payload Checksum & Switch can calculate checksum with payload inspection & fully implemented\footnote{Source code: \texttt{checksum\_bmv2.p4}}\\ \hline
\end{tabular}
\end{minipage}
\caption{P4/BMV2 feature list}
\label{tab:p4bmv2features}
\end{center}
\end{table}
The switch responds to ICMP echo requests and ICMP6 echo requests and answers NDP and ARP requests. Overall, P4/BMV2 is very easy to use; even without a controller, a fully functional network host can be implemented. Our P4/BMV2 implementation supports translating ICMP/ICMP6 echo request and echo reply messages, but does not support all ICMP/ICMP6 translations that are defined in RFC~6145~\cite{rfc6145}.
% ----------------------------------------------------------------------
\section{\label{results:netpfga}P4/NetFPGA}
In the following section we describe the achieved feature set of P4/NetFPGA in detail and analyse the differences to the BMV2 based implementation.
% ok
% ----------------------------------------------------------------------
\subsection{\label{results:netpfga:features}Features}
While the NetFPGA target supports P4, compared to P4/BMV2 we only implemented a reduced feature set on P4/NetFPGA. The first reason for this is the missing support in the NetFPGA P4 compiler for inspecting the payload and computing checksums over the payload.
While this can be (partially) compensated for using delta checksums, the compile time of 2 to 6 hours contributed to a significantly slower development cycle compared to BMV2. Lastly, the focus of this thesis was to implement high speed NAT64 in P4, which only requires a subset of the features that we realised on BMV2. Table~\ref{tab:p4netpfgafeatures} summarises the implemented features and the reasons for their implementation status.
\begin{table}[htbp]
\begin{center}\begin{minipage}{\textwidth}
\begin{tabular}{| c | c | c |}
\hline
\textbf{Feature} & \textbf{Description} & \textbf{Status} \\ \hline
Switch to controller & Switch forwards unhandled packets to controller & portable\footnote{While the NetFPGA P4 implementation does not have the clone3() extern that the BMV2 implementation offers, communication to the controller can easily be realised by using one of the additional ports of the NetFPGA and connecting a physical network card to it.}\\ \hline
Controller to Switch & Controller can setup table entries & portable\footnote{The p4utils suite offers easy access to the switch tables. While the P4-NetFPGA support repository also offers python scripts to modify the switch tables, the code is less sophisticated and more fragile.}\\ \hline
NDP & Switch responds to ICMP6 neighbor & \\
& solicitation request (without controller) & portable\footnote{NetFPGA/P4 does not offer calculating the checksum over the payload. However, delta checksumming can be used to create the required checksum for replying.} \\ \hline
ARP & Switch can answer ARP request (without controller) & portable\footnote{As ARP does not use checksums, integrating the source code \texttt{actions\_arp.p4} into the NetFPGA code base is enough to enable ARP support on the NetFPGA.} \\ \hline
ICMP6 & Switch responds to ICMP6 echo request (without controller) & portable\footnote{Same reasoning as NDP.} \\ \hline
ICMP & Switch responds to ICMP echo request (without controller) & portable\footnote{Same reasoning as NDP.} \\ \hline
NAT64: TCP & Switch translates TCP with checksumming & \\
& from/to IPv6 to/from IPv4 & fully implemented\footnote{Source code: \texttt{actions\_nat64\_generic\_icmp.p4}} \\ \hline
NAT64: UDP & Switch translates UDP with checksumming & \\
& from/to IPv6 to/from IPv4 & fully implemented\footnote{Source code: \texttt{actions\_nat64\_generic\_icmp.p4}} \\ \hline
NAT64: & Switch translates echo request/reply & \\
ICMP/ICMP6 & from/to ICMP6 to/from ICMP with checksumming & portable\footnote{ICMP/ICMP6 translations only require enabling the ICMP/ICMP6 code in the NetFPGA code base.} \\ \hline
NAT64: Sessions & Switch and controller create 1:n sessions/mappings & portable\footnote{Same reasoning as ``Controller to Switch''.} \\ \hline
Delta Checksum & Switch can calculate checksum without payload inspection & fully implemented\footnote{Source code: \texttt{actions\_delta\_checksum.p4}}\\ \hline
Payload Checksum & Switch can calculate checksum with payload inspection & unsupported\footnote{To support creating payload checksums, either an HDL module needs to be created or the generated PX program needs to be modified.~\cite{schottelius:_exter_p4_netpf}} \\ \hline
\end{tabular}
\end{minipage}
\caption{P4/NetFPGA feature list}
\label{tab:p4netpfgafeatures}
\end{center}
\end{table}
% ok
% ----------------------------------------------------------------------
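The ``Delta Checksum'' feature listed in the tables above can be sketched in plain Python. The idea follows incremental checksum updates in the style of RFC~1624: when a 16 bit word of a header changes, the existing checksum is corrected by folding in the difference, so the payload never has to be read. The helper names below are illustrative and do not appear in our P4 source.

```python
def csum16(words):
    """Reference: full 16-bit Internet checksum over a list of 16-bit words."""
    s = sum(words)
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)  # fold carries (one's complement sum)
    return ~s & 0xFFFF

def csum16_update(old_csum, old_word, new_word):
    """Incremental update (RFC 1624, eqn. 3): HC' = ~(~HC + ~m + m')."""
    c = (~old_csum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    c = (c & 0xFFFF) + (c >> 16)  # fold carries twice to stay within 16 bits
    c = (c & 0xFFFF) + (c >> 16)
    return ~c & 0xFFFF
```

A NAT64 translation rewrites several such words (addresses, protocol fields); applying the update once per changed word yields the correct new checksum without payload inspection, which is the property the delta checksum approach exploits.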
\subsection{\label{results:netpfga:stability}Stability}
Two different NetFPGA cards were used during the development of this thesis. The first card produced consistent ioctl errors (compare section~\ref{netpfgaioctlerror}) when writing table entries. The available hardware tests (compare figures~\ref{fig:hwtestnico} and~\ref{fig:hwtesthendrik}) showed failures on both cards; however, the first card reported an additional ``10G\_Loopback'' failure. Due to the inability to set table entries, no benchmarking was performed on the first NetFPGA card.
\begin{figure}[h]
\includegraphics[scale=1.4]{hwtestnico}
\centering
\caption{Hardware test of NetFPGA card 1}
\label{fig:hwtestnico}
\end{figure}
\begin{figure}[h]
\includegraphics[scale=0.2]{hwtesthendrik}
\centering
\caption{Hardware test of NetFPGA card 2~\cite{hendrik:_p4_progr_fpga_semes_thesis_sa}}
\label{fig:hwtesthendrik}
\end{figure}
During development and benchmarking, the second NetFPGA card stopped functioning properly multiple times. In these cases the card would no longer forward packets. Multiple reboots (up to 3) and reflashing the bitstream to the NetFPGA multiple times usually restored the intended behaviour. Due to these ``crashes'', however, it was impossible for us to run a benchmark for more than one hour. Similarly, flashing the bitstream to the NetFPGA would sometimes fail; it was then necessary to reboot the host containing the NetFPGA card up to 3 times to enable successful flashing.\footnote{Typical output of the flashing process would be: ``fpga configuration failed. DONE PIN is not HIGH''}
% ok
% ----------------------------------------------------------------------
\subsubsection{\label{results:netpfga:performance}Performance}
The NetFPGA card performed at near line speed and offers NAT64 translations at 9.28 Gbit/s (see section~\ref{results:benchmark} for details). Single and multiple streams performed almost identically, and the results were consistent through multiple iterations of the benchmarks.
% ok
% ----------------------------------------------------------------------
\subsection{\label{results:netpfga:usability}Usability}
The handling and usability of the NetFPGA card is rather difficult. In this section we describe our findings and experiences with the card and its toolchain.

To use the NetFPGA, the tools Vivado and SDNet provided by Xilinx need to be installed. However, a bug in the installer triggers an infinite loop if a certain shared library\footnote{The required shared library is libncurses5.} is missing on the target operating system. The installation program appears to keep progressing, but never finishes.

While the NetFPGA card supports P4, the toolchain and supporting scripts are in an immature state. The compilation process consists of at least 9 different steps, which are interdependent.\footnote{See source code \texttt{bin/do-all-steps.sh}.} Some of the steps generate shell scripts and python scripts that in turn generate JSON data.\footnote{One compilation step calls the script ``config\_writes.py''. This script failed with a syntax error, as it contained incomplete python code. The scripts config\_writes.py and config\_writes.sh are generated by gen\_config\_writes.py. The output of the script gen\_config\_writes.py depends on the content of config\_writes.txt. That file is generated by the simulation ``xsim''. The file ``SimpleSumeSwitch\_tb.sv'' contains code that is responsible for writing config\_writes.txt and uses a function named axi4\_lite\_master\_write\_request\_control for generating the output. This in turn is dependent on the output of a script named gen\_testdata.py.} Incorrect parsing thus generates syntactically incorrect scripts, or scripts that generate incorrect output. The toolchain provided by the NetFPGA-P4 repository contains more than 80000 lines of code. The supporting scripts for setting table entries require setting the parameters for all possible actions, not only for the selected action.
Supplying only the required parameters results in a crash of the supporting script. The documentation for the NetFPGA-P4 repository is scattered and does not contain a reference on how to use the tools. The mapping of egress ports to their metadata fields is found in a python script that is used for generating test data.

The compile process can take up to 6 hours, and because the different steps are interdependent, errors in an earlier stage were, in our experience, detected hours after they happened. The resulting log files of the compilation process can be up to 5~MB in size. Within this log file, various commands output references to other logfiles; however, the referenced logfiles do not exist before or after the compile process. During the compile process various informational, warning and error messages are printed. Some informational messages constitute critical errors, while on the other hand reported critical warnings and syntax errors often turn out to be non-fatal.\footnote{For instance, ``CRITICAL WARNING: [BD 41-737] Cannot set the parameter TRANSLATION\_MODE on /axi\_interconnect\_0. It is read-only.'' is a non-critical warning.} Contradicting output is also generated.\footnote{While using version 2018.2, the following message was printed: ``WARNING: command 'get\_user\_parameter' will be removed in the 2015.3 release, use 'get\_user\_parameters' instead''.}

Programs or scripts that are called during the compile process do not necessarily exit with a non-zero status when they encounter a critical error. Thus finding the source of an error can be difficult, as the compile process continues after critical errors have occurred. Not only do programs with critical errors exit ``successfully'', but python scripts that encounter critical errors do not abort with an exception; instead they print an error message to stdout and continue. The most frequently encountered critical compile error is ``Run 'impl\_1' has not been launched. Unable to open''.
This error indicates that something in the previous compile steps failed; the cause can be anything from incorrectly generated test data to unsupported LPM tables.

The NetFPGA kernel module provides access to virtual Linux devices (nf0\ldots nf3). However, tcpdump does not see any packets that are emitted by the switch. The only possibility to capture packets emitted by the switch is to connect a physical cable to the port and capture on the other side.

Jumbo frames\footnote{Frames with an MTU greater than 1500 bytes.} are commonly used in 10 Gbit/s networks. According to~\cite{wikipedia:_jumbo}, even many gigabit network interface cards support jumbo frames. However, according to emails on the private NetFPGA mailing list, the NetFPGA only supports 1500 byte frames at the moment, and additional work is required to implement support for bigger frames.

Our P4 source code contains required Xilinx annotations\footnote{For instance, ``@Xilinx\_MaxPacketRegion(1024)''} that define the maximum packet size in bits. We observed two different errors on the output packet if the incoming packet exceeds the specified size:
\begin{itemize}
\item The output packet is longer than the original packet.
\item The output packet is corrupted.
\end{itemize}
While most of the P4 language is supported on the NetFPGA, some key techniques are missing or not supported:
\begin{itemize}
\item Analysing / accessing the payload is not supported.
\item Checksum computation over the payload is not supported.
\item Using LPM tables can lead to compilation errors.
\item Depending on the match type, only certain table sizes are allowed.
\end{itemize}
Renaming variables in the declaration of the parser or deparser leads to compilation errors. Function syntax is not supported; for this reason our implementation uses \texttt{\#define} statements instead of functions.
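The earlier observation that ternary matching is a superset of LPM matching also suggests a workaround for LPM related compiler errors: a prefix of length $n$ over a $w$ bit field is simply a ternary entry whose mask has $n$ leading one bits, with the prefix length used as the entry priority. A small Python model of this equivalence (illustrative only, not our table layout):

```python
def lpm_as_ternary(prefix: int, length: int, width: int = 32):
    """Express an LPM entry as a ternary (value, mask) pair."""
    mask = ((1 << length) - 1) << (width - length) if length else 0
    return prefix & mask, mask

def ternary_lookup(key: int, entries):
    """entries: list of (value, mask, priority, action); highest priority wins.
    With priority = prefix length, this is exactly longest prefix matching."""
    best = None
    for value, mask, prio, action in entries:
        if key & mask == value and (best is None or prio > best[0]):
            best = (prio, action)
    return best[1] if best else None
```

Because the highest priority ternary match with prefix-length priorities coincides with the longest prefix match, any LPM table can be installed as a ternary table when the target's LPM support is unreliable.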
%ok
% ----------------------------------------------------------------------
\section{\label{results:softwarenat64}Software based NAT64}
Both software solutions, Tayga and Jool, worked flawlessly. However, as expected, both have a CPU bound bottleneck: under high load scenarios both solutions fully utilise one core. Neither Tayga as a user space program nor Jool as a kernel module implements multi threading.
%ok
% ----------------------------------------------------------------------
\section{\label{results:benchmark}NAT64 Benchmarks}
In this section we summarise the benchmarking results; in the following subsections we discuss the benchmark design and the individual results. The MTU was set to 1500 bytes, as the NetFPGA does not support jumbo frames, and iperf3 (version 3.0.11) was used to generate load. The CPU usage on the load generator varied with the number of parallel connections: with 50 parallel connections two cores were fully utilised, with 40 parallel connections the cores showed 100\% and 70\% usage, and with 30 parallel connections 70--100\% and 70\% usage. After re-enabling checksum offloading, 30 parallel connections caused only 70\% and 30\% CPU usage.
\subsection{\label{benchmark:tayga:tcp}Tayga/TCP}
Tayga ran at 100\% CPU load throughout. Translating TCP from IPv4 to IPv6, it delivered 3.36 Gbit/s at 1 parallel connection, 3.30 Gbit/s at 20 and 3.11 Gbit/s at 50 parallel connections. From IPv6 to IPv4 it delivered 3.02 Gbit/s at 1, 3.28 Gbit/s at 20 and 2.85 Gbit/s at 50 parallel connections. The UDP load generator hit 100\% CPU usage at 20 parallel connections; the same was confirmed for TCP.
% ----------------------------------------------------------------------
\subsection{\label{results:benchmark:design}Benchmark Design}
\begin{figure}[h]
\includegraphics[scale=0.5]{softwarenat64design}
\centering
\caption{Benchmark design for NAT64 in software implementations}
\label{fig:softwarenat64design}
\end{figure}
We use two hosts for performing benchmarks: a load generator and a NAT64 translator. Both hosts are equipped with a dual port Intel X520 10 Gbit/s network card and are connected directly using DAC, without any equipment in between.
TCP offloading is enabled in the X520 cards. Figure~\ref{fig:softwarenat64design} shows the network setup. When testing the NetFPGA/P4 performance, the X520 cards in the NAT64 translator are disconnected and the NetFPGA ports are connected instead, as shown in figure~\ref{fig:netpfgadesign}. The load generator is equipped with a quad core CPU (Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz) with hyperthreading enabled and 16~GB RAM. The NAT64 translator is also equipped with a quad core CPU (Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz) and 16~GB RAM. The first 10 seconds of each benchmark are excluded to avoid the TCP warm up phase.\footnote{iperf -O 10 parameter, see section~\ref{design:tests}.}
\begin{figure}[h]
\includegraphics[scale=0.5]{netpfgadesign}
\centering
\caption{NAT64 with NetFPGA benchmark}
\label{fig:netpfgadesign}
\end{figure}
% ok
% ----------------------------------------------------------------------
\newpage
\subsection{\label{results:benchmark:v6v4tcp}IPv6 to IPv4 TCP Benchmark Results}
Table~\ref{tab:benchmarkv6} shows the results for TCP translation from IPv6 to IPv4. P4/NetFPGA delivers a consistent 9.28 Gbit/s independently of the number of parallel connections, followed by Jool at about 8.2 Gbit/s and Tayga at about 3 Gbit/s.
\begin{table}[htbp]
\begin{center}\begin{minipage}{\textwidth}
\begin{tabular}{| c | c | c | c | c |}
\hline
Implementation & \multicolumn{4}{|c|}{min/avg/max in Gbit/s} \\ \hline
Tayga & 2.79 / 3.20 / 3.43 & 3.34 / 3.36 / 3.38 & 2.57 / 3.02 / 3.27 & 2.35 / 2.91 / 3.20 \\ \hline
Jool & 8.22 / 8.22 / 8.22 & 8.21 / 8.21 / 8.22 & 8.21 / 8.23 / 8.25 & 8.21 / 8.23 / 8.25\\ \hline
P4 / NetFPGA & 9.28 / 9.28 / 9.29 & 9.28 / 9.28 / 9.29 & 9.28 / 9.28 / 9.29 & 9.28 / 9.28 / 9.29\\ \hline
Parallel connections & 1 & 10 & 20 & 50 \\ \hline
\end{tabular}
\end{minipage}
\caption{IPv6 to IPv4 TCP NAT64 Benchmark}
\label{tab:benchmarkv6}
\end{center}
\end{table}
% ---------------------------------------------------------------------
\subsection{\label{results:benchmark:v4v6tcp}IPv4 to IPv6 TCP Benchmark Results}
Table~\ref{tab:benchmarkv4} shows the results for TCP translation from IPv4 to IPv6.
\begin{table}[htbp]
\begin{center}\begin{minipage}{\textwidth}
\begin{tabular}{| c | c | c | c | c |}
\hline
Implementation & \multicolumn{4}{|c|}{min/avg/max in Gbit/s} \\ \hline
Tayga & 2.90 / 3.15 / 3.34 & 2.87 / 3.01 / 3.22 & 2.68 / 2.85 / 3.09 & 2.60 / 2.78 / 2.88 \\ \hline
Jool & 7.18 / 7.56 / 8.24 & 7.97 / 8.05 / 8.09 & 8.05 / 8.08 / 8.10 & 8.10 / 8.12 / 8.13 \\ \hline
P4 / NetFPGA & 8.51 / 8.53 / 8.55 & 9.28 / 9.28 / 9.29 & 9.29 / 9.29 / 9.29 & 9.28 / 9.28 / 9.29 \\ \hline
Parallel connections & 1 & 10 & 20 & 50 \\ \hline
\end{tabular}
\end{minipage}
\caption{IPv4 to IPv6 TCP NAT64 Benchmark}
\label{tab:benchmarkv4}
\end{center}
\end{table}
% ---------------------------------------------------------------------
\newpage
\subsection{\label{results:benchmark:v6v4udp}IPv6 to IPv4 UDP Benchmark Results}
Table~\ref{tab:benchmarkv6v4udp} shows the results for UDP translation from IPv6 to IPv4. Both software solutions suffer significant loss under load, whereas P4/NetFPGA translates without loss.
\begin{table}[htbp]
\begin{center}\begin{minipage}{\textwidth}
\begin{tabular}{| c | c | c | c | c |}
\hline
Implementation & \multicolumn{4}{|c|}{avg bandwidth in Gbit/s / avg loss / adjusted bandwidth} \\ \hline
Tayga & 8.02 / 70\% / 2.43 & 9.39 / 79\% / 1.97 & 15.43 / 86\% / 2.11 & 19.27 / 91\% / 1.73 \\ \hline
Jool & 6.44 / 0\% / 6.41 & 6.37 / 2\% / 6.25 & 16.13 / 64\% / 5.75 & 20.83 / 71\% / 6.04 \\ \hline
P4 / NetFPGA & 8.28 / 0\% / 8.28 & 9.26 / 0\% / 9.26 & 16.15 / 0\% / 16.15 & 15.8 / 0\% / 15.8 \\ \hline
Parallel connections & 1 & 10 & 20 & 50 \\ \hline
\end{tabular}
\end{minipage}
\caption{IPv6 to IPv4 UDP NAT64 Benchmark}
\label{tab:benchmarkv6v4udp}
\end{center}
\end{table}
% ---------------------------------------------------------------------
\subsection{\label{results:benchmark:v4v6udp}IPv4 to IPv6 UDP Benchmark Results}
Table~\ref{tab:benchmarkv4v6udp} shows the results for UDP translation from IPv4 to IPv6.
\begin{table}[htbp]
\begin{center}\begin{minipage}{\textwidth}
\begin{tabular}{| c | c | c | c | c |}
\hline
Implementation & \multicolumn{4}{|c|}{avg bandwidth in Gbit/s / avg loss / adjusted bandwidth} \\ \hline
Tayga & 6.78 / 84\% / 1.06 & 9.58 / 90\% / 0.96 & 15.67 / 91\% / 1.41 & 20.77 / 95\% / 1.04 \\ \hline
Jool & 4.53 / 0\% / 4.53 & 4.49 / 0\% / 4.49 & 13.26 / 0\% / 13.26 & 22.57 / 0\% / 22.57\\ \hline
P4 / NetFPGA & 7.04 / 0\% / 7.04 & 9.58 / 0\% / 9.58 & 9.78 / 0\% / 9.78 & 14.37 / 0\% / 14.37\\ \hline
Parallel connections & 1 & 10 & 20 & 50 \\ \hline
\end{tabular}
\end{minipage}
\caption{IPv4 to IPv6 UDP NAT64 Benchmark}
\label{tab:benchmarkv4v6udp}
\end{center}
\end{table}
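A note on the UDP tables: the ``adjusted bandwidth'' column relates the sending rate to the loss ratio. Assuming it approximates the delivered goodput as the sent bandwidth reduced by the fraction of lost datagrams (our reading of the reported numbers; the exact value may also be taken directly from receiver-side counters), the relationship can be sketched as:

```python
def adjusted_bandwidth(sent_gbits: float, loss: float) -> float:
    """Estimate delivered goodput from sending rate and loss ratio."""
    return sent_gbits * (1.0 - loss)

# Example from the Tayga row at 10 parallel connections:
# 9.39 Gbit/s sent with 79% loss leaves about 1.97 Gbit/s delivered.
```

This also explains why a high ``avg bandwidth'' for Tayga is not meaningful on its own: at 91\% loss, a 19.27 Gbit/s sending rate still corresponds to well under 2 Gbit/s of translated traffic.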