Ethernet is now running at 10 Gb rates, making it a viable option for high-speed interprocessor and I/O communications. To help ensure a successful system implementation, designers should consider an FPGA-based protocol offload architecture.
10 Gigabit Ethernet (10 GbE) technology, which is still relatively new in the server space, is now emerging in the real-time embedded space and overlapping the emergence of other proprietary and standard system interconnect fabrics such as RapidIO and PCI Express. What makes 10 GbE so exciting is that for the first time, the world’s most widely understood, deployed, and accepted packet-based communications protocol can address the requirements of the high-performance real-time space.
10 GbE has different appeal for different users. The designer simply seeking a really fast subsystem interconnect now gets a leading-edge 10 Gigabit fibre or copper fat pipe. The program manager struggling to find developers familiar with high-speed protocols – while surrounded by those possessing Ethernet experience – suddenly finds that the latter group has the sought-after skills. The system architect who has for years employed Ethernet for the command, control, and status plane while relying on a variety of other technologies for the high-speed data plane can now ease future software migration by basing both on a common technology.
Unfortunately, the act of processing the communication protocol stack at 10 Gb rates heavily taxes modern processors, leaving them very few, if any, cycles to perform signal processing. I/O modules, such as ADCs, could also benefit from using 10 GbE as a communications fabric; however, these modules typically lack processors that could perform the stack processing – for any speed of Ethernet. Employing a discrete processor as an intermediary in the data path to facilitate 10 GbE communication – or even a processor dedicated solely to stack processing – is not a practical solution.
So, when it comes to successfully implementing 10 GbE as a fat pipe in high-performance, real-time embedded systems, there are several key considerations:
One, UDP is often the appropriate choice of protocol.
Two, protocol acceleration/offload is typically required.
Three, when 10 GbE is employed to transfer data to or from an I/O module, an architecture that permits direct data streaming between the I/O module and 10 GbE is required.
Four, although ASICs, network processors, and FPGAs are all candidates for the offload technology, there are some key system-level benefits that make the FPGA particularly appealing in this space.
TCP and UDP protocol
Most people use Ethernet every day, countless times per day. Many of us in the embedded space have used it as part of the systems we design and operate, sometimes for high-speed links but more often for control. Still, many in this space have little knowledge of what Ethernet actually looks like. Following is a very brief and simplified introduction, which assumes that IPv4 is utilized.
Figure 1 is a simplified representation of the composition of an Ethernet frame using TCP/IP (Transmission Control Protocol/Internet Protocol) or UDP/IP (User Datagram Protocol/Internet Protocol). The bottom layer of the diagram shows the frame format that is sent over the Ethernet’s physical medium. Above that, the diagram shows the elements of the protocol that constitute each layer’s payload. The application layer payload is the actual data of interest, and the rest is overhead.
Figure 1
At the data link layer is the Ethernet frame, consisting of a 14-byte header, a 4-byte footer, and between 46 and 1,500 bytes of data payload, where 1,500 bytes is known as the (standard) Maximum Transmission Unit (MTU) size. When communicating at this layer only, with no upper-layer protocols, the entire payload can be stuffed with data. As higher protocol layers are used, chunks of the original (maximum) 1,500 bytes of payload are consumed as upper-layer headers. The first such header is the IP header, which consumes at least 20 bytes. Above that, the TCP header consumes at least another 20 bytes, while the alternative UDP header consumes 8 bytes. Thus, when using TCP, the maximum available data payload size is about 1,460 bytes. UDP use gains only about 12 extra bytes of payload (about 1,472 bytes). It follows that in terms of the ratio of payload to total packet size, using TCP/IP or UDP/IP (and assuming minimal header sizes) gives theoretical efficiencies in excess of 94 percent at the maximum payload size.
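The payload and efficiency figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the constant and function names are illustrative, the header sizes are the minimum values quoted in the text, and the efficiency is taken as payload divided by header plus MTU plus footer.

```python
# Frame-size constants taken from the text (IPv4, minimal headers).
ETH_HEADER = 14   # bytes: Ethernet header
ETH_FOOTER = 4    # bytes: Ethernet footer (frame check sequence)
MTU = 1500        # bytes: standard maximum Ethernet payload
IP_HEADER = 20    # bytes: minimum IPv4 header
TCP_HEADER = 20   # bytes: minimum TCP header
UDP_HEADER = 8    # bytes: UDP header


def max_payload(transport_header: int) -> int:
    """Largest application payload that fits in one standard-MTU frame."""
    return MTU - IP_HEADER - transport_header


def efficiency(transport_header: int) -> float:
    """Application payload divided by total frame size (header + MTU + footer)."""
    frame = ETH_HEADER + MTU + ETH_FOOTER
    return max_payload(transport_header) / frame


for name, header in (("TCP", TCP_HEADER), ("UDP", UDP_HEADER)):
    print(f"{name}: payload {max_payload(header)} bytes, "
          f"efficiency {efficiency(header):.1%}")
```

Under these assumptions TCP comes out at roughly 96 percent and UDP at roughly 97 percent, both above the 94 percent figure quoted above.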
The discussion to this point indicates that from the perspective of added headers, TCP and UDP have about the same protocol efficiency. However, the significant differences in the protocols lie not in the size of their headers but in their behaviors. TCP is a reliable, connection-oriented protocol that ensures a link between the two points that are communicating. The protocol guarantees delivery of packets, in the order they were sent, and it will request resends if packets are not received. Overall, it is a protocol well-suited to control and initialization, and has relatively complex associated state machines and transactions.
Unlike TCP/IP, UDP/IP is a “best effort,” connectionless protocol and by design does not guarantee the receipt of packets nor that they will be received in the order they were sent. So, using UDP/IP, there is generally no way for the receiver to know that a packet is received out of order or is missing altogether. Packets just show up at the receiver (or not), and they are processed in the order they arrive. (Note that when using a point-to-point link, ordering is not a concern.) High-performance, real-time embedded applications typically lend themselves to a best effort communications channel. In these applications, such as an ADC continuously sampling data, there is no opportunity for a second chance. The data is just sent and errors must be coped with at a higher application layer (for example, signal processing, filtering, defining a sequence number within the payload, and so on). For such applications, UDP is a good match.
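As one illustration of coping with loss at the application layer, the sketch below places a sequence number at the front of each UDP payload so the receiver can tell how many datagrams went missing. The 32-bit field, the `wrap` helper, and the `GapDetector` class are hypothetical choices made for this example, and the logic assumes a point-to-point link on which datagrams are not reordered.

```python
import struct

SEQ_FORMAT = "!I"                       # assumed: 32-bit sequence number, network byte order
SEQ_SIZE = struct.calcsize(SEQ_FORMAT)  # 4 bytes


def wrap(seq: int, samples: bytes) -> bytes:
    """Sender side: prefix a block of samples with a sequence number."""
    return struct.pack(SEQ_FORMAT, seq & 0xFFFFFFFF) + samples


class GapDetector:
    """Receiver side: report how many datagrams were skipped before each arrival."""

    def __init__(self) -> None:
        self.expected = None  # next sequence number we expect to see

    def accept(self, datagram: bytes):
        """Return (missing_count, samples) for one received datagram."""
        (seq,) = struct.unpack_from(SEQ_FORMAT, datagram)
        if self.expected is None:
            missing = 0                                   # first datagram: nothing to compare
        else:
            missing = (seq - self.expected) & 0xFFFFFFFF  # modulo handles counter wrap-around
        self.expected = (seq + 1) & 0xFFFFFFFF
        return missing, datagram[SEQ_SIZE:]


detector = GapDetector()
for seq in (0, 1, 4):  # datagrams 2 and 3 never arrive
    missing, samples = detector.accept(wrap(seq, b"\x00\x01"))
    if missing:
        print(f"lost {missing} datagram(s) before sequence {seq}")
```

What the receiver does with the gap (zero-fill, interpolate, flag the block) is left to the signal processing that follows.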
UDP, unlike TCP, also supports multicast, which is useful in applications such as beamforming or direction-finding where one or many data channels need to be aggregated at one or many processing destinations. Multicast support is in addition to the inherent addressing features supported by both UDP and TCP. These addressing features can be useful in applications requiring the transport of multiple sources or channels of data (for instance, multiple ADC channels or down converter channels) to different or multiple destinations over a common data pipe. The protocols’ addressing scheme provides a means of logically identifying and isolating the individual data channels. Users are also free to implement a custom scheme within the data payload.
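On a host with a conventional software stack, the multicast and addressing features described above might be used as in the following Python sketch. The group address, the port numbers, and the channel-to-port mapping are hypothetical values chosen for illustration; the socket calls are the standard IPv4 multicast options.

```python
import socket

# Hypothetical mapping: each ADC channel is isolated on its own UDP port.
CHANNEL_BASE_PORT = 5000


def port_for_channel(channel: int) -> int:
    """Logical channel identification using the UDP destination port."""
    return CHANNEL_BASE_PORT + channel


def membership_request(group: str, interface: str = "0.0.0.0") -> bytes:
    """Build the 8-byte IPv4 ip_mreq structure: group address, then local interface."""
    return socket.inet_aton(group) + socket.inet_aton(interface)


def open_multicast_receiver(group: str, channel: int) -> socket.socket:
    """Open a UDP socket that receives one channel sent to a multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port_for_channel(channel)))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock
```

Several processing nodes could each call `open_multicast_receiver("239.1.2.3", 0)` to receive the same channel, while a different channel number selects a different stream on the same link.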
Protocol offload
Empirical studies have shown that when using the native operating system protocol stacks to run 10 GbE on a network adapter card, modern server class CPUs can achieve unidirectional data transfer rates of 3 to 5 Gbps, at 100 percent utilization. The CPU is completely occupied traversing through the protocol stack, assembling the data into properly formatted packets, calculating checksums, handling interrupts, and moving data to the adapter. In other words, the CPU can do no other useful work besides manipulating packets for communication. Even so, it is unable to realize the bandwidth available on the data pipe.
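To give a feel for the per-byte work involved, here is a plain Python sketch of the 16-bit one's-complement checksum used by IPv4, UDP, and TCP. It is meant only to show that every byte of every packet must be touched; production stacks use heavily optimized or hardware-assisted versions, and the function name is an illustrative choice.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over a block of bytes."""
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]     # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into the low 16 bits
    return ~total & 0xFFFF                        # one's complement of the folded sum


block = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(block)
# A receiver recomputes the checksum over data plus checksum; zero means no error detected.
print(hex(checksum), internet_checksum(block + checksum.to_bytes(2, "big")))
```

At 10 Gb rates a loop like this would have to process more than a gigabyte of data every second, alongside header construction, interrupt handling, and data movement.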
Two high-level approaches for accelerating the protocol processing include:
Modifying the stack
Offloading the stack to a specialized external protocol engine
The degree of acceleration required for 10 GbE and a strong preference to maintain compatibility with the large Ethernet ecosystem make the latter approach desirable. In a typical implementation, the processor would move payload data to or from the protocol processing engine where the majority of the stack processing would occur. Figure 2 illustrates the movement of the boundary between software and hardware stack processing as a result of the stack acceleration process. The gray dashed line represents the original location of the hardware/software border when stack processing was primarily occurring on the processor. The black line illustrates the end result, leading to a decrease in processor utilization and improved throughput.
Figure 2
Direct data streaming
While CPUs require accelerated protocol processing to use 10 GbE effectively, I/O devices have an even more fundamental need: they require access to protocol processing functionality in the first place. Figure 3 illustrates why a CPU typically cannot function as the gateway to 10 GbE for an I/O module. Taking as an example an ADC mezzanine module, many off-the-shelf examples of which can be found in the embedded space, the data flow is as follows:
Data is moved from the ADC module to the CPU RAM (likely through a bridge, not shown);
The CPU must read the data into cache for processing;
The CPU processes the data, wraps it in the Ethernet protocol and then writes it out to RAM;
The data is moved from CPU RAM to the external 10 GbE interface module (again, through a bridge, not shown).
Figure 3
The net result is a quadruple trip on the CPU’s memory bus, where the first two trips involve payload data and the latter two involve the necessarily larger amount of protocol-wrapped data. So, before we even begin to account for any other bus overhead (interrupts and cycles lost due to bus direction turn-arounds, arbitration, and so on), a theoretical 2 GBps CPU memory bus already becomes effectively a sub-500 MBps bus, constricting the ADC-to-Ethernet data path to a theoretical maximum of just under 4 Gbps. If the application were full-duplex, the bus would be shared between the two directions and the maximum would be just 2 Gbps in each direction.
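The quadruple-trip arithmetic can be checked with a short Python sketch. The function name is illustrative, and the default packet sizes are assumptions: a 1,472-byte UDP payload and a 1,518-byte wrapped frame, ignoring all other bus overhead.

```python
def payload_rate_gbps(bus_gbytes_per_s: float,
                      payload_bytes: int = 1472,
                      wrapped_bytes: int = 1518) -> float:
    """Sustained payload rate when every packet crosses the memory bus four times.

    Two crossings carry the bare payload and two carry the protocol-wrapped frame.
    """
    bus_bytes_per_packet = 2 * payload_bytes + 2 * wrapped_bytes
    packets_per_s = bus_gbytes_per_s * 1e9 / bus_bytes_per_packet
    return packets_per_s * payload_bytes * 8 / 1e9


rate = payload_rate_gbps(2.0)  # the 2 GBps memory bus from the text
print(f"{rate:.2f} Gbps, or {rate * 1000 / 8:.0f} MBps of payload")
```

With these assumed sizes, the 2 GBps bus sustains a little under 500 MBps of payload, which is where the "sub-500 MBps" figure comes from. Setting both sizes equal removes the wrapping overhead and gives the ideal quarter of the bus bandwidth.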
A performance solution requires a direct data path from the I/O module to the 10 GbE interface, which is only possible if the interface itself is capable of performing the entire protocol offload. The system CPU’s job is to perform some setup functions and get out of the way, allowing the I/O module to transfer data directly to and from the 10 GbE module (as illustrated in Figure 4). Note that beyond control and setup, the system CPU is also highly desirable during development and debug and to support additional features such as the analysis of data snapshots and test and maintenance modes.
Figure 4
FPGA-based acceleration
Our discussion has focused on functionality and architecture. We indicated earlier that there are several implementation technology choices. However, there are some important benefits that make an FPGA particularly well suited as the engine for 10 GbE solutions in the military and other high-performance spaces. Because of problem complexity in these spaces, systems tend to consist of an integration of several hardware and software components. FPGAs, being both programmable and capable of 10 GbE line-speed processing, enable three critical features that facilitate the successful design and implementation of such systems: 1) protocol customization; 2) project de-risking; and 3) integration of higher-level system functionality.
Protocol customization
The term “customization” may conjure memories of systems past that consisted of proprietary, noninteroperable protocols. However, customization need not be nearly so drastic, and can in fact consist of additions or modifications within the payload or headers that do not in any way make the messages incompatible with standard 10 GbE. Examples include adjustable delays between sending packets, padding for byte alignment, disregarding certain header fields, and the addition of sequence numbers or timestamp information as part of the payload.
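Two of these customizations, padding for byte alignment and an adjustable delay between packets, can be modelled in a few lines of Python. This is a behavioural sketch rather than FPGA code: the function names and the 8-byte default alignment are assumptions, and the delay calculation ignores the Ethernet preamble and the standard interframe gap.

```python
def pad_to_alignment(payload: bytes, alignment: int = 8) -> bytes:
    """Append zero bytes so the payload length is a multiple of `alignment`."""
    remainder = len(payload) % alignment
    if remainder:
        payload += bytes(alignment - remainder)
    return payload


def inter_packet_delay_ns(frame_bytes: int, target_gbps: float,
                          line_gbps: float = 10.0) -> float:
    """Idle time to insert after each frame so the average rate equals target_gbps."""
    if not 0 < target_gbps <= line_gbps:
        raise ValueError("target rate must be positive and no higher than the line rate")
    bits = frame_bytes * 8
    time_on_wire_ns = bits / line_gbps    # 1 Gbps is one bit per nanosecond
    time_per_frame_ns = bits / target_gbps
    return time_per_frame_ns - time_on_wire_ns


print(len(pad_to_alignment(b"\x01\x02\x03")))          # padded length in bytes
print(round(inter_packet_delay_ns(1518, 5.0), 1), "ns")  # pacing a full frame to 5 Gbps
```

Neither change alters the frame format on the wire, so a standard 10 GbE receiver still accepts the packets; only the payload contents and the packet timing differ.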
Generally, the decision to perform any customization should be made only after an overall system-level analysis is performed. The following example of a receive-side-only customization is completely nonsensical for many applications but might make good sense in a system receiving streaming ADC data over 10 GbE. Under the standard UDP/IP protocol, fragmented packets that are received incomplete or with checksum errors are tossed out by the receiver rather than being passed on to upper layers of software. However, if the packet consisted of a series of ADC samples headed into a signal processing block at the receiver, it might be preferable to incorporate a modification to retain the packet, rather than losing it in its entirety. The reason is that the decoding, filtering, or other signal processing algorithm might be more amenable to handling a packet with individual sample errors than to enduring an entire missing sequence of samples (burst error). FPGA use enables such customizations, which only make sense when the payload is viewed in the context of the whole system, to be incorporated and tested.
Project de-risking
The de-risking feature is a result of the FPGA’s re-programmability. Specification oversights, yet-to-be-ratified protocol updates, and late stage “fires” can often be directly addressed or worked around between software and FPGA code; with an ASIC-based solution, this may not be possible. Some examples of small, late stage oversights that can represent significant hurdles when discovered at integration time are byte alignment or a need to complement data. An FPGA could be used to fix such problems as the data streams by. However, if the task were left to a general purpose processor, the number of cycles required for such seemingly simple operations could be a show stopper.
Integration of higher-level system functionality
FPGAs are widely recognized and used for their real-time signal processing ability. Functional blocks such as filters, digital up and down converters, and synchronization circuits can be added as custom pre- or postprocessing algorithms in the FPGA used for the protocol offload, as depicted in Figure 5. In addition to signal processing, packet-processing functions like inspection, classification, and optimized fabric bridging can also be incorporated within the FPGA. Depending on the nature of processing, the importance of colocating the algorithmic and the protocol offload functions may vary from a convenience to an absolute necessity. The net effect is that the use of an FPGA enables the protocol offload module to assume a higher-level, portable signal or packet processing functionality well beyond that of a data mover.
Figure 5
A practical 10 GbE implementation
Ethernet has now reached the level of performance required by many high-performance embedded real-time applications. When appropriately architected and integrated into a system, 10 GbE offers new capabilities and advantages. Figure 6 shows a picture of the AdvancedIO Systems V1020 10 GbE XMC module. The module, which has been integrated on COTS carriers, is currently shipping with UDP offload capability for processor-based applications as well as data streaming from I/O modules. The module supports both PCI-X and PCIe interfaces and, as you may have guessed from the rest of the article, is based on an FPGA.
Rob Kraft is VP of Marketing at AdvancedIO Systems. He has more than 12 years of experience in systems engineering and business roles in the embedded real-time space. Prior to joining AdvancedIO, he worked at Spectrum Signal Processing and AlliedSignal Aerospace. Rob has a MASc in Electrical Engineering from the University of Toronto.
For more information, contact Rob at:
AdvancedIO Systems Inc.
Suite 502 - 595 Howe Street
Vancouver, British Columbia, Canada V6C 2T5
604-331-1600, Ext. 209
rkraft@advancedio.com www.advancedio.com