Simon Kuenzer, Sharan Santhanam, Vlad-Andrei Bădoiu, Hugo Lefeuvre, Alexander Jung, Gaulthier Gain, Cyril Soldani, Costin Lupu, Stefan Teodorescu, Costi Răducanu , Cristian Banu, Laurent Mathy, Răzvan Deaconescu, Costin Raiciu, Felipe Huici: “Unikraft: Fast, Specialized Unikernels the Easy Way”, EuroSys 2021
EuroSys Best Paper Award, 2021
Artifact Evaluation badges: Artifacts Available, Artifacts Functional, Results Reproduced
Unikernels are famous for providing excellent performance in terms of boot times, throughput and memory consumption, to name a few metrics. However, they are in famous for making it hard and extremely time consuming to extract such performance, and for needing significant engineering effort in order to port applications to them. We introduce Unikraft, a novel micro-library OS that (1) fully modularizes OS primitives so that it is easy to customize the unikernel and include only relevant components and (2) exposes a set of composable, performance-oriented APIs in order to make it easy for developers to obtain high performance. Our evaluation using off-the-shelf applications such as nginx, SQLite, and Redis shows that running them on Unikraft results in a 1.7x-2.7x performance improvement compared to Linux guests. In addition, Unikraft images for these apps are around 1MB, require less than 10MB of RAM to run, and boot in around 1ms on top of the VMM time (total boot time 3ms-40ms). Unikraft is a Linux Foundation open source project and can be found atwww.unikraft.org.
Presented at: EuroSys 2021
M.Brunella, G.Belocchi, M.Bonola, S.Pontarelli, G.Siracusano, G.Bianchi, A.Cammarano, A.Palumbo, L.Petrucci, R.Bifulco: "hXDP: Efficient Software Packet Processing on FPGA NIC" USENIX OSDI, 2020
Jay Lepreau Best Paper Award, USENIX OSDI’20
FPGA accelerators on the NIC enable the offloading of expensive packet processing tasks from the CPU. However, FPGAs have limited resources that may need to be shared among diverse applications, and programming them is difficult.
We present a solution to run Linux's eXpress Data Path programs written in eBPF on FPGAs, using only a fraction of the available hardware resources while matching the performance of high-end CPUs. The iterative execution model of eBPF is not a good fit for FPGA accelerators. Nonetheless, we show that many of the instructions of an eBPF program can be compressed, parallelized or completely removed, when targeting a purpose-built FPGA executor, thereby significantly improving performance. We leverage that to design hXDP, which includes (i) an optimizing-compiler that parallelizes and translates eBPF bytecode to an extended eBPF Instruction-set Architecture defined by us; a (ii) soft-processor to execute such instructions on FPGA; and (iii) an FPGA-based infrastructure to provide XDP's maps and helper functions as defined within the Linux kernel.
We implement hXDP on an FPGA NIC and evaluate it running real-world unmodified eBPF programs. Our implementation is clocked at 156.25MHz, uses about 15% of the FPGA resources, and can run dynamically loaded programs. Despite these modest requirements, it achieves the packet processing throughput of a high-end CPU core and provides a 10x lower packet forwarding latency.
Full author details: Marco Spaziani Brunella and Giacomo Belocchi, Axbryd/University of Rome Tor Vergata; Marco Bonola, Axbryd/CNIT; Salvatore Pontarelli, Axbryd; Giuseppe Siracusano, NEC Laboratories Europe; Giuseppe Bianchi, University of Rome Tor Vergata; Aniello Cammarano, Alessandro Palumbo, and Luca Petrucci, CNIT/University of Rome Tor Vergata; Roberto Bifulco, NEC Laboratories Europe
Davide Sanvito, Andrea Marchini, Ilario Filippini, Antonio Capone; “CEDRO: an in-switch elephant flows rescheduling scheme for data-centers”, IEEE Conference on Network Softwarization (NetSoft) 2020
Data-center topologies interconnect an ever largernumber of servers using a high number of alternative pathsto provide high bandwidth and a high degree of resiliency. Thestate-of-the-art routing strategy is based on Equal-cost multipath(ECMP) which employs static hashing mechanism over packetheaderfields to spread the traffic over multiple paths. Routing thetraffic without considering the size of theflows and the utilizationof the paths might cause congestion due to the collision of multiplelargeflows on a same downstream path. We present CEDRO,an in-switch mechanism to detect and reschedule colliding largeflows. By exploiting the latest advances in SDN programmablenetwork devices, we offload to the network the detection ofboth the elephantflows and the path congestion conditions andthe rescheduling mechanism. CEDRO is able to promptly copewith path congestion and failures directly from the dataplane,regardless of the availability of the external controller. Weimplemented CEDRO in an emulated SDN network and testedit against realistic traffic scenarios. Numerical evaluation showsCEDRO is able to improve the average and 95-th percentile ofthe Flow Completion Time compared to ECMP.
Nicolas Weber, Felipe Huici “SOL: Effortless Device Support for AI Frameworks without Source Code Changes”, 3rd High Performance Machine Learning Workshop, 2020
Abstract—Modern high performance computing clusters heav-ily rely on accelerators to overcome the limited compute powerof CPUs. These supercomputers run various applications fromdifferent domains such as simulations, numerical applications orartificial intelligence(AI). As a result, vendors need to be able toefficiently run a wide variety of workloads on their hardware.In the AI domain this is in particular exacerbated by theexistance of a number of popular frameworks (e.g, PyTorch,TensorFlow, etc.) that have no common code base, and can varyin functionality. The code of these frameworks evolves quickly,making it expensive to keep up with all changes and potentiallyforcing developers to go through constant rounds of upstreaming.In this paper we explore how to provide hardware support inAI frameworks without changing the framework’s source code inorder to minimize maintenance overhead. We introduce SOL, anAI acceleration middleware that provides a hardware abstractionlayer that allows us to transparently support heterogenous hard-ware. As a proof of concept, we implemented SOL for PyTorchwith three backends: CPUs, GPUs and vector processors.Index Terms—artificial intelligence, middleware, high perfor-mance computing
P. Luigi Ventre, P. Lungaroni, G. Siracusano, C. Pisa, F. Schmidt, F. Lombardo, S. Salsano, "On the Fly Orchestration of Unikernels: Tuning and Performance Evaluation of Virtual Infrastructure Managers", IEEE Transactions on Cloud Computing Accepted Date November 2018
Network operators are facing significant challenges meeting the demand for more bandwidth, agile infrastructures, innovative services, while keeping costs low. Network Functions Virtualization (NFV) and Cloud Computing are emerging as key trends of 5G network architectures, providing flexibility, fast instantiation times, support of Commercial Off The Shelf hardware and significant cost savings. NFV leverages Cloud Computing principles to move the data-plane network functions from expensive, closed and proprietary hardware to the so-called Virtual Network Functions (VNFs). In this paper we deal with the management of virtual computing resources (Unikernels) for the execution of VNFs. This functionality is performed by the Virtual Infrastructure Manager (VIM) in the NFV MANagement and Orchestration (MANO) reference architecture. We discuss the instantiation process of virtual resources and propose a generic reference model, starting from the analysis of three open source VIMs, namely OpenStack, Nomad and OpenVIM. We improve the aforementioned VIMs introducing the support for special-purpose Unikernels and aiming at reducing the duration of the instantiation process. We evaluate some performance aspects of the VIMs, considering both stock and tuned versions. The VIM extensions and performance evaluation tools are available under a liberal open source licence.
Published in: IEEE Transactions on Cloud Computing
PDF download: On the Fly Orchestration of Unikernels
Full paper download: NLE_Research_Paper_On_the_Fly_Orchestration_of_Unikernels_2018.pdf
S. Pontarelli, R. Bifulco, M. Bonola, G. Siracusano, M. Honda, F. Huici, "FlowBlaze: Stateful Packet Processing in Hardware" in USENIX Symposium on Networked Systems Design and Implementation (NSDI), March 2019
While programmable NICs allow for better scalability to handle growing network workloads, providing an expressive yet simple abstraction to program stateful network functions in hardware remains a research challenge. We address the problem with FlowBlaze, an open abstraction for building stateful packet processing functions in hardware. The abstraction is based on Extended Finite State Machines and introduces the explicit definition of flow state, allowing FlowBlaze to leverage flow-level parallelism. FlowBlaze is expressive, supporting a wide range of complex network functions, and easy to use, hiding low-level hardware implementation issues from the programmer. Our implementation of FlowBlaze on a NetFPGA SmartNIC achieves very low latency (in the order of a few microseconds), consumes relatively little power, can hold per-flow state for hundreds of thousands of flows, and yields speeds of 40 Gb/s, allowing for even higher speeds on newer FPGA models. Both hardware and software implementations of FlowBlaze are publicly available.
Conference: USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019
PDF download: FlowBlaze: Stateful Packet Processing in Hardware
G. Siracusano, R. Gonzalez, R. Bifulco: "On the application of NLP to discover relationships between malicious network entities”, poster at CCS 2019.
Carmelo Cascone, Roberto Bifulco, Salvatore Pontarelli, Antonio Capone: “Relaxing state-access constraints in stateful programmable data planes”, ACM SIGCOMM Computer Communication Review, 2018
Supporting programmable stateful packet forwarding functions in hardware requires a tight balance between functionality and performance. Current state-of-the-art solutions are based on a very conservative model that assumes worst-case workloads. This finally limits the programmability of the system, even if actual deployment conditions may be very different from the worst-case scenario. We use trace-based simulations to highlight the benefits of accounting for specific workload characteristics. Furthermore, we show that relatively simple additions to a switching chip design can take advantage of such characteristics. In particular, we argue that introducing stalls in the switching chip pipeline enables stateful functions to be executed in a larger but bounded time without harming the overall forwarding performance. Our results show that, in some cases, the stateful processing of a packet could use 30x the time budget provided by state of the art solutions.