With the evolution of modern technology, many organizations are now embracing big data solutions to drive decision-making by uncovering patterns, trends, and correlations in massive amounts of raw data. At the same time, the adoption of big data analytic technologies presents multifaceted challenges: organizations must manage voluminous data while mitigating data security risks.
Computer scientists and engineers across DoD laboratories are finding different means to collect, synthesize, process, and compare data in order to make the most of scientific observations. The capability of grid computing to connect large-scale computers and share resources is generating a surplus of unstructured data to analyze. Big data and high-performance computing (HPC) are hot-button subjects among academic, industrial, and government organizations, and scientists and engineers believe that HPC resources can significantly advance scientific research and discovery.
In 2015, the White House published Executive Order 13702, creating the National Strategic Computing Initiative (NSCI). The NSCI was established to promote U.S. leadership in HPC and to maximize the benefits of HPC research and development for economic competitiveness and scientific discovery. More recently, U.S. computing leaders, including the Department of Energy laboratories, partnered with government, universities, and the private sector to launch the COVID-19 High Performance Computing Consortium, which allows researchers worldwide to access the world's most powerful HPC resources in support of COVID-19 research.
The primary objective of HPC systems is to ensure the most efficient execution of large-scale data analytics, which dictates lightweight security measures to reduce the overhead that security requirements impose. Cybersecurity for HPC is a critical mission aspect that presents unique challenges in providing non-repudiation and a high level of data protection and confidentiality for scientific observations. In this special report, we delineate methods for closing HPC security gaps by using Berkeley Packet Filter (BPF) tracing probes. BPF was designed in the 1990s as a virtual machine for efficient packet filtering; this report discusses how BPF is now used for monitoring, debugging, and collecting statistics from the kernel. The report is geared toward developers and users who want to understand HPC and the broader functionality of BPF as part of kernel runtime security, to assist with improving the detection of security threats.
Watch the corresponding video podcast: csiac.org/podcast/securing-the-soft-underbelly-of-a-supercomputer-with-bpf-probes/
Table of Contents
1. Abstract
2. Introduction: The High-Performance Computing (HPC) Ecosystem
3. HPC Cluster Architecture
4. BPF Introduction
4.1. BPF Compiler Collection
5. Performance Analysis of BPF in HPC
5.1. Test Environment
5.2. Benchmarking Software
5.3. BPF Probes
5.4. Simulating Low-Profile Attacks
6. Results
6.1. Performance Results
6.2. Detection Results
7. Conclusions and Future Work
7.1. End Notes
References
Note: The appendices of this report are available for viewing in the attached PDF.
1. Abstract
Today, research organizations are the principal operators of high-performance computing (HPC) systems. Researchers utilize HPC to fast-track scientific discovery while adhering to security control standards that protect data files but can degrade high-volume calculations. One standard solution is to secure a set of login nodes that mediate access to an enclave of lightly monitored compute nodes, referred to as "the soft underbelly of a supercomputer" by one DoD representative (National, 2016). Recent advances in the BPF subsystem, a Linux tracing technology, have provided a new means to monitor compute nodes with minimal performance degradation. Well-crafted BPF traces can detect malicious activity on an HPC cluster without slowing down systems or the researchers that depend on them. In this paper, a series of low-profile attacks are conducted against a compute cluster under heavy computational load, and BPF probes are attached to detect the attacks. The probes successfully log all attacks, and performance loss is less than one percent for all benchmarks save for one inconclusive set.
2. Introduction: The High-Performance Computing (HPC) Ecosystem
In high-performance computing (HPC), many organizations that facilitate research provide a remote shell for writing, compiling, and executing arbitrary code. The code runs on a networked cluster of servers with hundreds of thousands of processor cores and has access to petabytes of storage. Information security practitioners must secure these environments for government research contracts, but the solutions they architect cannot reduce bare-metal cluster performance by more than a defined percentage, possibly as low as 1%. These limitations impact the security of HPC sites in government agencies, academia, and the private sector.
Colloquially known as “supercomputers,” HPC clusters utilize numerous machines to deliver a more powerful computing environment to deal with computational problems that are too massive for conventional computers. Cluster sizes range from dozens to tens of thousands of “nodes” (HPC parlance for servers). Today, the Summit supercomputer at the Department of Energy’s Oak Ridge National Laboratory ranks number one on the Top500 Supercomputing list, touting over 2,400,000 processor cores and peaking at 200 petaflops (i.e., two hundred quadrillion floating-point operations per second) (TOP500, 2019).
In practice, large clusters share their resources among many users, including those not employed by the host institution. For example, the XSEDE Federation is a cyberinfrastructure ecosystem composed of 36 different institutions across the United States, providing HPC resources to the science and engineering community as a single coordinated effort (XSEDE, n.d.). This author administers an HPC cluster at an academic institution, which serves not only the campus community but also collaborators from other organizations across the world.
A current approach to HPC security is to lock down a few login nodes with required security controls and only lightly monitor the army of isolated compute nodes behind them. At a NIST Workshop on HPC Security in 2016, a DoD representative described these compute nodes as the “soft underbelly” of supercomputing (National, 2016). Detecting malicious activity on the compute nodes themselves while maintaining performance requirements was considered an unsolved problem.
Three years ago, Brendan Gregg announced that “superpowers have finally come to Linux” in the form of Berkeley Packet Filter (BPF) tracing tools (Gregg, 2017). Although systems administrators and analysts had used BPF to filter network packets for decades, Linux kernel developers had both improved its performance and opened it up to general usage through a new bpf() syscall. Thus, BPF was no longer a network tracing tool but a system-wide tracing tool. Gregg has since demonstrated the value of BPF tracing to security practitioners at subsequent conferences (Gregg & Maestretti, 2017).
This paper has two primary purposes. The first is to introduce BPF as a general tracing tool for detecting malicious activity on Linux systems. A summary of recent developments in BPF and an explanation of its usage are provided. Example scripts are also included that demonstrate tracing open TTYs, network activity, filesystem activity, and Bash commands.
The second purpose is to evaluate BPF as a security tool for production HPC clusters, both from a performance perspective and a detection perspective. A security monitoring agent that affects performance by even one or two percent has a low chance of adoption on HPC clusters that prioritize fast research results. Should it be adopted, there must be an assurance that the agent will not slow down compute nodes and will detect the attacks it purports to defend against.
To validate BPF, a series of low-profile attacks are conducted against eight compute nodes running a series of benchmarks, both without and with BPF probes attached. Benchmarks without BPF probes are compared to benchmarks with BPF probes to determine the performance loss. The logs of the BPF trace scripts are compared with attack script logs to determine the attack detection rate.
3. HPC Cluster Architecture
System monitoring of a large-scale high-performance computing (HPC) cluster is a difficult task, and it only becomes more challenging as the scale and complexity of the platform increase.
The complexity of an HPC cluster can range from elementary to mind-boggling. At one end of the spectrum, students can interconnect a stack of Raspberry Pis to make a Beowulf cluster for educational purposes (Kiepert, 2013). At the other end is the NASA Advanced Supercomputing Division, whose Pleiades cluster interconnects eleven thousand compute nodes in an 11-dimensional hypercube topology for performance purposes (Chang, Jin & Bauer, 2016).
An HPC node provides a multi-compiler and multi-version environment intended to support scientific software from many different disciplines. For example, the author administers nodes that have eight versions of gcc, two versions of Intel compilers, five versions of CUDA libraries, and three versions of Boost C++ libraries, not including additional variations when compiled with MPI support.
Researchers initially authenticate to a “login” node session. From here, they can write, compile, and debug arbitrary code. Once a researcher is ready to launch their software on the compute nodes, they submit a “job” to the scheduler. The job specifies the resources needed, the time required, and the commands to run. The scheduler maintains a queue of all jobs, dispatching them to the compute nodes as time, resources, and fair share permit.
Compute node operating systems are installed using scalable provisioning technologies such as PXE booting, Kickstart for thick provisioning, and read-only root NFS for thin provisioning, among others. The nodes are also configured to mount central storage to make data available for processing across a large set of nodes, with performance tiers ranging from archival tape storage to high-performance parallel filesystems.
Firewalls and network monitors are typically not deployed for compute nodes. The usual reasoning is that all access to the compute nodes passes through the head node, so concentrating security controls at that gateway is considered sufficient when it is the only entry and exit point for media and traffic. With network speeds reaching tens of gigabits per second per node, or terabytes per second in aggregate, host-based and network-based products can degrade performance and cause job failures. Also, the network traffic itself is highly variable. Software may use traditional Ethernet or high-bandwidth, low-latency fabrics such as InfiniBand and Omni-Path. The characteristics of network traffic differ from software to software and even across the lifetime of a single piece of software as it is developed on HPC systems.
These details highlight the following difficulties for the security practitioner:
- Compute nodes run arbitrary code;
- Compute nodes can access centralized research storage;
- Compute nodes produce highly variable network traffic;
- Compute nodes have fewer security controls for performance reasons; and
- Compute nodes produce terabytes of network traffic per second in aggregate.
4. BPF Introduction
Many security practitioners identify the filter expression of tcpdump as BPF, but this is somewhat inaccurate: tcpdump transparently compiles the expression into BPF bytecode, which can be dumped using the -d option. This bytecode is fed into a register-based virtual machine that runs in the Linux kernel.
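For example, compiling the simple filter ip yields four classic BPF instructions (output from a typical tcpdump; formatting varies slightly by version):

    $ tcpdump -d ip
    (000) ldh      [12]
    (001) jeq      #0x800           jt 2    jf 3
    (002) ret      #262144
    (003) ret      #0

The program loads the Ethernet type field at offset 12, compares it to 0x800 (IPv4), and returns either a capture length or zero (drop).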
Originally implemented in 1992, the two-register virtual machine approach of the "BSD Packet Filter" was twenty to one hundred times faster than competing packet filters, partly because the implementation matched how the underlying RISC CPU operated, and partly because of its improved buffer model (McCanne & Jacobson, 1992).
Support for BPF in Linux was added in the 2.5 development kernel and stayed largely untouched for roughly a decade. In the last eight years, however, BPF has changed dramatically, burgeoning into its own Linux subsystem. Many new terms have evolved over the years; the following section provides a brief review of these developments.
In 2012, Will Drewry struggled to have code accepted into the Linux kernel. He wrote a patch to allow seccomp to filter arbitrary syscalls, but his work was in limbo between a prctl() maintainer who suggested using the perf subsystem for filtering, and a perf maintainer who suggested using prctl() for filtering, with neither gatekeeper budging (Edge, 2011). In a stroke of brilliance, Will found the BPF virtual machine and used it to filter allowed syscalls instead of network traffic (Corbet, 2012).
Two years later, Alexei Starovoitov posted a patch set that greatly improved BPF performance. He increased the number of registers from two to ten, added to its instruction set to better resemble modern processors, and upgraded its registers to 64 bits (Corbet, 2014 May). His work yielded a four-fold increase in speed (Starovoitov, 2014 March), and importantly, he also posted a patch that demonstrated using BPF for tracing filters (Starovoitov, 2014 May).
A month later, Alexei extended BPF further. He moved BPF out of the network subsystem into its own directory, signaling the intention for its general use. He also implemented a new bpf() syscall. This allowed users with CAP_SYS_ADMIN privileges (i.e., root) to load BPF programs into the kernel to respond to specific events that they defined. An in-kernel verifier ensured the safety of the program before loading it (Corbet, 2014 July).
This improved BPF implementation went through many names. It was first known as “internal BPF” (as opposed to “classic BPF”) but was later called extended BPF, or eBPF. Today, system maintainers have chosen to simply call the execution engine BPF, without any reference to what the acronym originally represented (Gregg, 2020 January).
4.1. BPF Compiler Collection
While valuable to kernel developers, the bpf() syscall was impractical to those who didn’t keep a copy of the kernel source code lying around. The BPF Compiler Collection (BCC) was created in April of 2015 to address this issue. It greatly simplified the process of writing tracing tools that could leverage BPF (Fleming, 2017).
Over the course of a few years, this collection grew into a mature suite of tools that were easy for systems administrators to use. There are currently over 100 BCC tools readily available for monitoring system calls, language function calls (including PHP, Perl, Ruby, and Python), network events, filesystem performance, database performance, and more. Four basic examples of these tools are included below. These examples are not intended to detect sophisticated attackers, but rather to demonstrate the potential of the tools.
The opensnoop tool traces open() and openat() syscalls. In one example, the tool detected a user's failed attempts to list the /root directory and view /etc/shadow.
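An invocation looks like the following (illustrative output, not the original capture; a default BCC install path is assumed, and errno 13 is EACCES):

    # /usr/share/bcc/tools/opensnoop
    PID    COMM               FD ERR PATH
    9912   ls                 -1  13 /root
    9913   cat                -1  13 /etc/shadow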
The execsnoop tool traces new processes via exec() syscalls. This example shows a user attempting to run nc, download ncat, and create and run a suspicious Python script.
The ttysnoop tool displays the output of a TTY as if the administrator is sitting at the same terminal. In one example, an administrator snooped /dev/pts/1 and observed a user named "billy" exploring the system.
Last is the tcpstates tool, used here for tracing any TCP state changes involving remote ports 22, 80, or 443. While the trace was running, a user connected over SSH to a neighboring compute node for 10 seconds and then closed the connection. Next, the user attempted to access a website with wget and then sent a keyboard-interrupt after three failed connection attempts.
While easy to use, BCC tools are not necessarily easy to write or maintain. They are Python scripts with embedded BPF programs written in C. Tools may break when the traced code changes, requiring continual maintenance from version to version of the traced software.
In December 2016, an even more intuitive tool came to fruition as a result of Alastair Robertson’s spare-time hobby. Robertson started a project built on BCC and BPF called bpftrace, and it offered an AWK-like syntax that was already familiar to many systems administrators and security practitioners. The project attracted prominent BCC contributors and completed its first set of major features in 2018 (Gregg, 2020 January). Today, bpftrace is a full-fledged tracing utility that can use a stupendous variety of sources and trigger many types of actions.
The main downside of the tool is that it requires a minimum Linux kernel version of 4.1 and recommends version 4.9 to take full advantage of its features. This means that the tool is only available on later versions of Linux distributions such as Red Hat Enterprise Linux 8, Debian 9, and Ubuntu 19.04. Even then, the version of bpftrace on these distributions does not have all the features available in the latest version.
For those exploring bpftrace for the first time, two helpful starting points are running `bpftrace -l` for a list of static and dynamic probes available for use and `bpftrace -lv [tracepoint_name]` for the arguments available to retrieve values from when a probe fires.
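For example, using the openat tracepoints (argument listings may differ slightly across kernel versions):

    # bpftrace -l 'tracepoint:syscalls:*openat*'
    tracepoint:syscalls:sys_enter_openat
    tracepoint:syscalls:sys_exit_openat

    # bpftrace -lv tracepoint:syscalls:sys_enter_openat
    tracepoint:syscalls:sys_enter_openat
        int dfd
        const char * filename
        int flags
        umode_t mode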
The basic syntax of bpftrace and a few instructive examples are provided. A full walkthrough of writing bpftrace scripts is outside the scope of this paper, but readers who wish to familiarize themselves with using the tool can review Brendan Gregg’s bpftrace tutorial.
bpftrace scripts follow a basic syntax familiar to AWK users:
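    // A bpftrace program is a list of probe clauses. Each clause names
    // one or more probes, an optional /filter/, and an action block.
    // BEGIN and END are special probes, as in AWK. (A general sketch;
    // see the bpftrace Reference Guide for the full grammar.)
    probe_type:probe_name
    /filter_expression/
    {
        action_statements;
    }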
The following example traces openat() calls by UID 1000:
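    // A sketch reconstructed from the description below, modeled on
    // the opensnoop.bt tool distributed with bpftrace.
    BEGIN
    {
        printf("%-8s %-16s %-32s %s\n", "PID", "COMM", "FILE", "ERRNO");
    }

    // Save the target filename when UID 1000 enters openat()
    tracepoint:syscalls:sys_enter_openat
    /uid == 1000/
    {
        @filename[tid] = args->filename;
    }

    // On an error return, print the command, file, and errno
    tracepoint:syscalls:sys_exit_openat
    /@filename[tid] && args->ret < 0/
    {
        printf("%-8d %-16s %-32s %d\n", pid, comm,
            str(@filename[tid]), - args->ret);
    }

    tracepoint:syscalls:sys_exit_openat
    {
        delete(@filename[tid]);
    }

    END
    {
        clear(@filename);
    }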
The script prints a header; saves the target filename when UID 1000 enters openat(); and prints the command, file, and errno when openat() returns an error.
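When UID 1000 attempted to open /etc/shadow, the sketch above would log along these lines (illustrative; errno 13 is EACCES):

    PID      COMM             FILE                             ERRNO
    2743     cat              /etc/shadow                      13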
Userspace functions can also be traced. The following example script from Brendan Gregg (2020 January) traces the readline() function in /bin/bash. Once started, it will trace readline() for all current and future invocations of /bin/bash.
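    // In essentials, Gregg's bashreadline.bt tool: a uretprobe fires
    // each time any bash process returns from readline(), and the
    // return value is the command line the user just entered.
    BEGIN
    {
        printf("Tracing bash commands... Hit Ctrl-C to end.\n");
        printf("%-9s %-6s %s\n", "TIME", "PID", "COMMAND");
    }

    uretprobe:/bin/bash:readline
    {
        time("%H:%M:%S  ");
        printf("%-6d %s\n", pid, str(retval));
    }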
The script produced output revealing an attempt by a bash session with PID 28853 to invoke a cryptocurrency miner.
Shared libraries can also be traced. This is especially valuable because it allows an administrator to place probes that are difficult for an attacker to avoid. The following example script places probes in the gethost*() and getaddrinfo() functions of the GNU C library to trace DNS queries. It is modified from Brendan Gregg’s gethostlatency.bt script (Gregg, 2018).
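A simplified form of the modified script might read as follows (a sketch; the libc path is distribution-dependent, and RHEL's /lib64/libc.so.6 is assumed here):

    // Probe the glibc host-resolution entry points; arg0 is the
    // hostname being resolved.
    BEGIN
    {
        printf("%-9s %-6s %-16s %s\n", "TIME", "PID", "COMM", "HOST");
    }

    uprobe:/lib64/libc.so.6:getaddrinfo,
    uprobe:/lib64/libc.so.6:gethostbyname,
    uprobe:/lib64/libc.so.6:gethostbyname2
    {
        time("%H:%M:%S  ");
        printf("%-6d %-16s %s\n", pid, comm, str(arg0));
    }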
The output showed DNS queries from a user invoking curl and wget on questionable websites.
These examples demonstrated the ability of bpftrace to monitor filesystems, processes, user sessions, and network activity. Once installed, the software includes over thirty high-quality scripts that cover dozens of system activities. As Brendan Gregg put it, gaining this depth and breadth of visibility on a Linux system “can feel like having X-ray vision” (Gregg, 2020 January). This level of vision is available to any Linux systems administrator who becomes adept at using the tools.
5. Performance Analysis of BPF in HPC
The remainder of this paper is dedicated to measuring the performance impact of BPF when monitoring compute nodes under heavy load. As crucial as it is to demonstrate the effectiveness of a security solution, HPC administrators likewise need assurance that security tools will not degrade performance beyond a defined threshold.
Brendan Gregg targeted a performance loss of less than 1% when using BPF tools and scripts in production at Netflix (Gregg & Maestretti, 2017). The expectations in this paper’s performance analysis were as follows:
- Performance loss <1%: BPF probes are widely recommended in HPC
- Performance loss 1%-3%: BPF probes are recommended in qualified circumstances
- Performance loss >3%: BPF probes should be revised until performance is acceptable
5.1. Test Environment
Eight compute nodes with identical hardware were reserved for testing. They were connected to an InfiniBand fabric composed of FDR and EDR switches in a CLOS network topology (i.e., a fat-tree topology with multiple roots). The nodes’ hardware characteristics were as follows:
A new operating system image was built that supported BPF tools, benchmarking software, HPC scheduling, centralized storage, and the InfiniBand fabric. A provisioning server presented this image to the compute nodes, which mounted the image as a read-only root to ensure it was identical and unchangeable across all compute nodes. The provisioning server also provided writable partitions that were bind-mounted onto key locations using /etc/rwtab and /etc/statetab.
The operating system included the following software of interest:
The Intel cores in these compute nodes were of the Broadwell generation. These were touted to have up to 16 floating-point operations per clock cycle because of the fused multiply-add (FMA) instruction, but real-world runs have shown lower results because the instruction wasn't as generally applicable as other instructions like AVX2. For this analysis, the cores were estimated to provide 12 floating-point operations per cycle.
Each compute node’s theoretical max “flops,” or floating-point operations per second, is the product of its total processor cores, clock speed (GHz), and floating-point operations per cycle. When estimating 12 operations per cycle, the compute nodes for this analysis had an estimated theoretical max of 806 gigaflops per node.
5.2. Benchmarking Software
A series of High-Performance Linpack (HPL) benchmarks was executed on sets of compute nodes, both with and without Berkeley Packet Filter (BPF) probes attached, to measure to what degree the BPF probes affected performance.
The HPL benchmark is the reference benchmark used to rank the top-performing supercomputers in the world. In essence, HPL uses numerical linear algebra techniques to solve a large, dense system of linear equations. It is up to the administrator to scale the size of the problem and optimize the software in order to attain the best outcomes; HPL's compiler options and input parameters set the problem size, block size, and process geometry. A broad examination of optimization performance is beyond the scope of this paper; however, a reasonable HPL baseline should obtain 75% to 85% of the theoretical max flops of a compute node.
The HPL parameters and characteristics for each grouping of nodes were as follows:
Using the eight compute nodes, the author ran a total of 32 HPL benchmark tests. First, each individual compute node ran the benchmark. Next, the nodes were grouped into pairs to run the benchmark together. Then, they were grouped into fours. Finally, all eight nodes ran the benchmark as a single cluster, twice. These tests were then all repeated with BPF probes attached. For the repeated tests, a script was executed to simulate low-profile attacks that the BPF probes were expected to detect.
5.3. BPF Probes
The BPF execution engine is fast, but it cannot make up for BPF probes that are frequently fired or inherently slow. If a probe is attached to an event that fires millions of times per second, the overhead will add up. In some cases, tracing malloc() or free() will slow the target application tenfold or more (Gregg, 2020 January). In contrast, an ideal BPF probe will fire infrequently and provide high-value data.
Before writing a BPF probe, it is important to determine the question that needs to be answered. These are the questions that the probes of this performance analysis were written to answer:
- Are compute nodes attempting to send beacons to external systems?
- Are compute nodes running cryptocurrency miners?
- Are compute nodes the source of any suspicious lateral movements?
- Are compute node processes attempting to escalate privileges?
- Are compute nodes using an SSH proxy to connect to external systems?
To this end, four bpftrace scripts were written: dnssnoop.bt, pamsnoop.bt, sshtunnel.bt, and tcpconnect_filter.sh. These scripts produced logs in a key-value format for easy parsing. All scripts output the timestamp, script type (dns, pam, sshproxy, tcp) and the PID, UID, and command of the process that caused the probe to fire. Each script also output additional data for its unique type.
The first script, dnssnoop.bt, logged DNS queries by tracing the relevant function calls in the GNU C library. It took a UID as its first argument on the command line to log only the DNS queries of a given user.
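A minimal sketch consistent with this description (the libc path, log key names, and use of nsecs for the timestamp are assumptions):

    // dnssnoop.bt (sketch): log host lookups by the UID given as $1
    uprobe:/lib64/libc.so.6:getaddrinfo,
    uprobe:/lib64/libc.so.6:gethostbyname
    /uid == $1/
    {
        printf("time=%d type=dns pid=%d uid=%d comm=%s query=%s\n",
            nsecs, pid, uid, comm, str(arg0));
    }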
The second script, pamsnoop.bt, detected processes changing from one user to another by tracing Linux PAM, the library responsible for handling authentication tasks. Its first argument on the command line specified the UID to monitor. It logged both the original user and the new user (target) associated with the process. It also logged the return value of the traced function.
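Again as a sketch (which PAM function the actual script traced is not stated; pam_start(), whose second argument is the target user, is one reasonable choice):

    // pamsnoop.bt (sketch): log user-change attempts by UID $1.
    // pam_start(service, user, conv, pamh): arg1 is the target user.
    uprobe:/lib64/libpam.so.0:pam_start
    /uid == $1/
    {
        @target[tid] = arg1;
    }

    uretprobe:/lib64/libpam.so.0:pam_start
    /@target[tid]/
    {
        printf("time=%d type=pam pid=%d uid=%d comm=%s target=%s ret=%d\n",
            nsecs, pid, uid, comm, str(@target[tid]), retval);
        delete(@target[tid]);
    }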
The third script, sshtunnel.bt, detected when SSH was used to forward TCP ports. TCP port forwarding is a built-in SSH feature that allows someone to use an SSH server as a proxy to reach external resources. This feature can be disabled on SSH servers with the AllowTCPForwarding family of SSH options, but it is on by default and is often left that way.
The script detected port forwarding by tracing the kernel's inet_sock_set_state tracepoint and logging whenever an SSH client changed a socket to a LISTEN state. Note that while SSH servers regularly open listening ports, a client opening a listening port is a tell-tale sign of port forwarding.
Specifically, the script detected local, or dynamic, port forwarding. Local port forwarding specifies an SSH server, a remote host and port to connect to, and a local port to open. Any network connections to the local port will be forwarded through the SSH server to the remote host and port. Dynamic port forwarding turns the SSH client into a SOCKS proxy, allowing software to connect to the local port and forward all traffic through an SSH server.
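A minimal sketch of this detection logic (the state constant 10 is TCP_LISTEN; the log format is assumed):

    // sshtunnel.bt (sketch): an ssh *client* moving a socket to
    // LISTEN indicates local or dynamic port forwarding.
    tracepoint:sock:inet_sock_set_state
    /args->newstate == 10 && comm == "ssh"/
    {
        printf("time=%d type=sshproxy pid=%d uid=%d comm=%s lport=%d\n",
            nsecs, pid, uid, comm, args->sport);
    }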
The final script, tcpconnect_filter.sh, was the most complex of the four. This was because the version of bpftrace available on RHEL 8.1 was still missing key functionality for network tracing. It lacked features such as integer casting, strncmp(), and an array operator, making it impossible to retrieve data from some of the most valuable networking data structures.
Dale Hamel wrote the original tcpconnect.bt, which traced the tcp_connect() kernel function to detect all TCP connects (Hamel, 2018). The script was modified for this research, which included wrapping it in a Bash script to enable the whitelisting of a subnet. This allowed the compute cluster’s subnet to be whitelisted so that only TCP connections to external resources would cause probes to fire.
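The wrapper approach might look like the following sketch (the file names, substitution marker, and subnet encoding are assumptions, not the author's script):

    #!/bin/bash
    # tcpconnect_filter.sh (sketch): substitute the cluster subnet into
    # the bpftrace source so probes fire only for external destinations.
    SUBNET="${1:?usage: $0 <subnet-prefix-hex>}"   # e.g., 0x0A640000 for 10.100.0.0
    sed "s/__SUBNET__/${SUBNET}/g" tcpconnect_filter.bt.in > /tmp/tcpconnect_filter.bt
    exec bpftrace /tmp/tcpconnect_filter.bt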
Also note that this script traced only TCP, a connection-oriented protocol. UDP traffic was not traced; because UDP is connectionless, a probe would fire on every sent datagram, generating far more overhead and noise.
These scripts were copied to each compute node and executed at the beginning of benchmarks that measured BPF performance impact. Their logs were redirected to files on a shared storage system.
5.4. Simulating Low-Profile Attacks
For tests with BPF probes attached, a Bash script was launched on participating nodes that simulated low-profile attacks. The script produced a timestamped log for each action. Every 1 to 15 seconds, it performed one of the following actions as an unprivileged user:
- Triggered a DNS query of an external domain
- Escalated to a privileged user
- Opened an SSH tunnel
- Attempted a TCP connection to a random private IP
For DNS queries, the script randomly chose a domain from a list, many of them representing common bitcoin mining sites. It then chose from one of five command-line tools to trigger the query: curl, wget, python, dig, or host.
For privilege escalation, the unprivileged user was temporarily given privileges to use sudo to escalate to the root user on the compute nodes. The script ran a basic command as root after escalation. This simulated an adversary who had obtained control over an account with sudo privileges and subsequently escalated to root via regular administrative techniques. It did not represent privilege escalation via software flaw exploitations.
For SSH tunneling, the script randomly chose between local port forwarding and dynamic port forwarding. Local port forwards connected to an SSH server in the DMZ and opened a tunnel to https://ubuntu.com using a random local port. Dynamic port forwards connected to that same server and opened a random local port for SOCKS proxy use.
For TCP connection attempts, the script used Bash’s built-in /dev/tcp feature. This is not an actual device on the filesystem, but a device emulated by Bash for easy interaction with TCP sockets. Any I/O to /dev/tcp/[host]/[port] triggers a TCP connection attempt to that host and port. The script chose a random host in the 192.168.0.0/16 subnet and a random port, attempting to connect to it over TCP.
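For example, with a hypothetical target address and port:

    # Attempt a TCP connection to 192.168.23.151:4321; Bash emulates
    # the /dev/tcp device and performs the connect() itself.
    timeout 3 bash -c 'echo > /dev/tcp/192.168.23.151/4321'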
The full body of the low-profile attack script can be found in Appendix A in the attached report.
6. Results
The results of the benchmarks were analyzed both from a performance perspective and a detection perspective to show whether BPF tracing scripts can sufficiently detect attacks on compute nodes without degrading performance for researchers.
6.1. Performance Results
HPL results indicated that the BPF probes had less than 1% impact on compute node performance. In many cases, HPL benchmarks with BPF enabled recorded higher gigaflops than the non-BPF benchmarks. The author did not interpret these gains to mean that BPF probes improve performance, as such a claim for a tracing tool cannot be concluded. At best, these discrepancies suggested that BPF had nearly zero performance impact on compute nodes. Perhaps more realistically, the gains may have suggested that other factors besides BPF also influenced compute node performance. Possibilities include thermal fluctuations that impact Intel Turbo Mode, congestion on the InfiniBand fabric, and normal jitter from system processes.
Each chart below was scaled based on the theoretical max gigaflops for a node count. Thus, the vertical axis for the single-node chart has a maximum of 806.4 gigaflops, while the vertical axis for the eight-node chart has a maximum of 6451.2 gigaflops, which is eight times greater. Performance percentages are calculated against these maximums.
For single-node runs, Node 1 suffered a performance loss of 0.41%, the greatest loss of all tests apart from a discrepancy with eight-node runs. Node 8 was the only node with no performance loss, instead recording a gain of 0.02%.
For two-node runs, there were no instances of BPF causing performance loss. Node pairs [1-2] and [7-8] tied for the smallest gain of 0.12%, while node pair [3-4] had the largest gain of 0.65%.
The four-node runs likewise recorded small gains for BPF-enabled runs, with nodes [1-4] gaining 0.09% and nodes [5-8] gaining 0.13%.
The eight-node HPL results were unusual enough to warrant discussion. These benchmarks were run twice, meaning that there were two non-BPF results and two BPF results. Depending on how they were paired, the benchmarks either supported that BPF had low performance impact or painted a picture of unexplainable performance differences from run to run.
The four eight-node benchmarks can be paired in two ways. The chart on the left below in Figure 25 shows the results when the low-performing non-BPF and BPF runs are paired together and the high-performing non-BPF and BPF runs are paired together. The chart on the right shows the results when the non-BPF and BPF runs that ran first are paired together and the non-BPF and BPF runs that ran second are paired together. Depending on how the results are paired, very different outcomes are seen.
When paired by lows and highs, enabling BPF caused a performance loss of 0.21% in the “low” benchmarks and a gain of 0.05% in the “high” benchmarks. However, when paired chronologically, enabling BPF caused a 1.62% loss in the first benchmarks and a 1.46% gain in the second benchmarks.
Put differently, the results of the two non-BPF runs that used identical hardware and software had a delta of 100 gigaflops, and likewise for the two runs with BPF enabled. Such a delta would only make sense if outside factors such as equipment temperatures or network congestion influenced the results.
Because of these discrepancies, the author found the eight-node results to be inconclusive. Perhaps the larger conclusion to be drawn is that BPF traces caused a performance change somewhere between a 1.62% loss and a 1.46% gain for eight-node runs. Taking the averages of the non-BPF and BPF benchmarks, performance loss was only 0.08%. Ultimately, the discrepancies are best resolved with further benchmark testing.
A table of all performance results can be found in Appendix B in the attached report.
6.2. Detection Results
The logs of the four bpftrace scripts were cross-checked against logs from the low-profile attack script to determine whether the BPF probes were adequate in detecting unwanted behavior.
Overall, the author found that while the performance of the bpftrace scripts was exemplary, the fidelity of the scripts was hampered by excess noise. Scripts often produced multiple logs for a single action of the attack script. They also produced logs that were triggered by the benchmark software itself.
The dnssnoop.bt script created logs not only for domain-based host lookups but also for IP-based host lookups, including those handled by the /etc/hosts file. This was especially apparent as HPL began IPC communications with itself and other cluster members. Logged queries included all participating nodes, as well as domains and IPs that pointed to localhost.
When DNS queries involved the dig or host commands, the query was obfuscated. One possible explanation is that the query was routed through systemd-resolved, but this was unconfirmed.
However, all queries using wget, curl, and python produced one accurate log per action.
The pamsnoop.bt script successfully detected sudo attempts, with the caveat that three logs were produced per sudo attempt. For every set of triplets, one log had a non-zero return value, so it should be possible to reduce log output to one line per successful sudo attempt when filtered by the return value.
The sshtunnel.bt script successfully detected all SSH port forwarding connections, but it produced two logs per connection.
Finally, the tcpconnect_filter.sh script successfully detected all TCP connection attempts to resources outside of the compute subnet. The fact that no compute nodes were included in the logs suggested that the subnet whitelist worked properly. One oversight was that the script did not exclude localhost TCP connections.
The table below provides an aggregated count of the logs across all runs. Each row pairs an attack log type with its associated bpftrace script log type.
The dnssnoop.bt script proved the noisiest. It produced logs for /etc/hosts lookups, domain lookups, and IP lookups, as well as DNS queries performed by the attack script. The pamsnoop.bt script produced exactly three times as many logs as sudo attempts. The sshtunnel.bt script produced twice as many logs as SSH proxy attempts, minus one. There was a single time when an ssh_proxy attack log produced only one bpftrace log. The tcpconnect_filter.sh script detected all TCP connections outside the subnet, including connections to the SSH server in the DMZ, but the script also errantly included the many localhost communications by HPL.
7. Conclusions and Future Work
BPF probes, and more specifically the bpftrace tool, are recommended on HPC compute nodes for detecting malicious behavior. This recommendation is based on performance comparisons of single-node, two-node, and four-node HPL runs, both without and with BPF probes attached.
In future performance analyses, researchers could control additional factors that cause variation in HPL results. One example is disabling Turbo Mode on Intel processors. Researchers could also perform HPL runs above four nodes, as the results of the eight-node HPL runs in this study were inconclusive.
Moreover, future work should focus on the improvement of the bpftrace scripts, especially as new features become available in future Linux distributions. Upcoming features include integer casting, the strncmp() function, and the array operator (Gregg, 2020 April). The best case for an auditor is to have a single log (or just a few) to review; an audit-reduction script could further filter the numerous logs generated between audit review cycles.
This will be especially true for the inet_sock_set_state tracepoint. Although this tracepoint was traced for the sshtunnel.bt script, some of the most valuable data in its arguments remained unusable due to the lack of an array operator. Once the operator is available, for example, the tcpconnect_filter.bt script can be entirely rewritten to use this tracepoint instead of a less stable dynamic kernel probe. The sshtunnel.bt script will also be able to log remote port forwarding in addition to local and dynamic port forwarding.
The bpftrace scripts in the analysis were limited to TCP. It would be valuable to write tracing scripts for other protocols such as ICMP, UDP, and InfiniBand if it can be done without producing excessive noise.
Security practitioners familiar with kernel code can use their knowledge to produce new bpftrace scripts catered to detect attack scenarios they see in the wild. As an example, Brendan Gregg used bpftrace to detect attempts to exploit a zero-day Docker vulnerability by tracing the uncommonly used renameat2() syscall (Gregg, 2020 January). Using the signal() function of bpftrace, an administrator today could write tracing scripts that proactively kill processes entering syscalls or functions known to be associated with bad behavior.
The latest bpftrace versions also support cgroups, making it easier to integrate tracing tools with HPC scheduler jobs. Using cgroups, a script can potentially filter processes associated with specific jobs dispatched by users.
With accelerating digital transformation and the growing complexity of computing ecosystems, it has become increasingly difficult for organizations to manage their collective digital footprint (i.e., attack surface). It is important to examine other emerging technologies for real-time traffic monitoring and the latest event-based analytic tools to detect and manage anomalies. One promising patch set is Kernel Runtime Security Instrumentation (KRSI), developed at Google. This Linux Security Module (LSM) framework provides a mechanism for various security checks, with far more configurable auditing than the Linux Auditing System (auditd) (Corbet, 2019).
In this analysis, the performance and detection results of the BPF tracing tools support their use for identifying cyber threat actors within HPC environments. The BPF trace scripts successfully recorded the simulated attack activity without degrading compute node performance beyond 1%, excepting the discrepancies in the eight-node runs. With a few further enhancements, the scripts could eliminate duplicate and innocuous log entries, and new trace scripts could detect broader categories of attacks. These developments will contribute to a high-fidelity detection and response solution built into the Linux kernel that protects both the security and performance of supercomputers.
7.1. End Notes
 Array operator functionality was merged into the master branch of bpftrace on 21 April 2019 and will hopefully be available in RHEL 8.2.
References
Chang, Y. T. S., Jin, H., & Bauer, J. (2016, November). Methodology and Application of HPC I/O Characterization with MPIProf and IOT. In 2016 5th Workshop on Extreme-Scale Programming Tools (ESPT). IEEE.
Corbet, J. (2012, January). Yet another new approach to seccomp. LWN. Retrieved 10 April 2020 from https://lwn.net/Articles/475043/
Corbet, J. (2014, May). BPF: the universal in-kernel virtual machine. LWN. Retrieved 2 December 2019 from https://lwn.net/Articles/599755/
Corbet, J. (2014, July). Extending extended BPF. LWN. Retrieved 2 December 2019 from https://lwn.net/Articles/603983/
Corbet, J. (2019, December). KRSI — the other BPF security module. LWN. Retrieved 2 May 2020 from https://lwn.net/Articles/808048/
Edge, J. (2011, July). Seccomp filters: No clear path. LWN. Retrieved 10 April 2020 from https://lwn.net/Articles/450291/
Fleming, M. (2017, December). A thorough introduction to eBPF. LWN. Retrieved 2 December 2019 from https://lwn.net/Articles/742082/
Gregg, B. (2017, January). BPF: Tracing and More. Presented at linux.conf.au, Hobart, Australia. Retrieved 28 March 2020 from https://www.youtube.com/watch?v=JRFNIKUROPE
Gregg, B. (2018, September). gethostlatency.bt [Computer software]. Retrieved 28 March 2020 from https://github.com/iovisor/bpftrace/blob/master/tools/gethostlatency.bt
Gregg, B. (2020, January). BPF Performance Tools: Linux System and Application Observability. United States: Addison-Wesley.
Gregg, B., et al. (2020, April). bpftrace Reference Guide. GitHub. Retrieved 1 May 2020 from https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md
Gregg, B., & Maestretti, A. (2017, February). Security Monitoring with eBPF. In BSidesSF 2017, San Francisco, CA. Retrieved 28 March 2020 from https://www.youtube.com/watch?v=44nV6Mj11uw
Hamel, D. (2018, November). tcpconnect.bt [Computer software]. Retrieved 28 March 2020 from https://github.com/iovisor/bpftrace/blob/master/tools/tcpconnect.bt
Kiepert, J. (2013, May). Creating a Raspberry Pi-based Beowulf Cluster. Department of Electrical and Computer Engineering, Boise State University, Boise, ID.
McCanne, S., & Jacobson, V. (1992, December). The BSD Packet Filter: A New Architecture for User-level Packet Capture. In 1993 Winter USENIX Conference, San Diego, CA. Retrieved 10 April 2020 from https://www.tcpdump.org/papers/bpf-usenix93.pdf
National Institute for Standards and Technology. (2016). HPC Security Best Practices: Strengths and Weaknesses. In NSCI: High-Performance Computing Security Workshop, Gaithersburg, MD.
Starovoitov, A. (2014, March). net: filter: rework/optimize internal BPF interpreter’s instruction set. In kernel/git/torvalds/linux.git [software]. Retrieved 10 April 2020 from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8
Starovoitov, A. (2014, May). Tracing: accelerate tracing filters with BPF. In net-next [mailing list]. Retrieved 10 April 2020 from https://lwn.net/Articles/598545/
TOP500. (2019, November). TOP500 List – November 2019. Retrieved 21 March 2020 from https://www.top500.org/list/2019/11/
XSEDE. (n.d.). XSEDE Federation. Retrieved March 21, 2020, from https://www.xsede.org/web/xsede-old/xsede-federation