High Performance Computing (HPC) workloads have traditionally been run only on bare-metal, unvirtualized hardware. However, performance of these highly parallel technical workloads has increased dramatically over the last decade with the introduction of increasingly sophisticated hardware support for virtualization, enabling organizations to begin to embrace the numerous benefits that a virtualization platform can offer.
To demonstrate the results of this continuing performance improvement, this paper explores the application of virtualization to HPC and evaluates the performance of HPC workloads in a virtualized, multitenant computing environment.
Virtualizing HPC Workloads
Although HPC workloads are most often run on bare-metal systems, this has started to change over the last several years as organizations have come to understand that many of the benefits that virtualization offers to the enterprise can often also add value in HPC environments. The following are among those benefits:
- Supports a more diverse end-user population with differing software requirements – By using virtual machines (VMs), each user or group can run the operating system (OS) and other software that are most effective for their needs, and these different software environments can be freely mixed on the same hardware. In addition, that mix can be changed dynamically as user requirements change, which enables IT departments to increase overall agility and to help decrease time to solution for researchers, scientists, and engineers.
- Provides data security and compliance by isolating user workloads into separate VMs, including running within different software-defined virtual networks – This ensures that projects (for example, studies involving clinical data) maintain control over their data and that the data is not shared inappropriately with other users. In addition, data security can be achieved while allowing projects to share underlying hardware, increasing overall utilization of physical resources.
- Provides fault isolation, root access, and other capabilities not available in traditional HPC environments through the use of VMs – By running users’ jobs within dedicated VMs, each user can be protected from problems caused by other users, a common issue in bare-metal HPC environments in which jobs from multiple users are frequently run within the same OS instance. In addition, the isolating nature of the VM abstraction means that root access can be granted to those users who require it, because that privilege is granted only within the VM and it does not compromise the security of other users or their data.
- Creates a more dynamic IT environment in which VMs and their encapsulated workloads can be live-migrated across the cluster for load balancing, for maintenance, for fault avoidance, and so on – This is a considerable advance over the traditional approach of statically scheduling jobs onto a bare-metal cluster without any ability to reassess those placement decisions after the fact. Such dynamic workload migration can increase overall cluster efficiency and resilience.
In HPC, performance is of paramount importance, and delivering high performance in the VMware virtual environment has been a key part of the work required to make virtualization viable for these workloads.
HPC workloads can be broadly divided into two categories: parallel distributed workloads and throughput workloads. Parallel distributed applications—often these are MPI applications, referring to the most popular messaging library for building such applications—consist of many simultaneously running processes that communicate with each other, often with extremely high intensity. Because this communication is almost always in the critical performance path, the HPC community has adopted specialized hardware and software to achieve the lowest possible latency and highest bandwidth to support running these applications efficiently.
InfiniBand and RDMA are the two most widely used hardware and software approaches for HPC message passing, and they can be used in a virtual environment as well. Figure 1 shows how InfiniBand latencies under a VMware vSphere® platform using VMware vSphere DirectPath I/O™ have improved over the last several releases of VMware ESXi™, the vSphere hypervisor. Latencies now approach those achievable without virtualization, and Figure 2 shows performance results for a variety of popular open-source and commercial MPI applications. These tests were run on a 16-node EDR InfiniBand cluster running one large VM per node. As can be seen when compared to Figure 3, the degradations can be higher with this workload class than with throughput workloads. Because overheads depend on the specific application, the model being used, and the scale at which the application is run, a proof-of-concept deployment is often recommended to determine whether acceptable performance is achievable. Additional information about MPI performance can be found on the Dell Community site.
Throughput workloads often require a large number of tasks to be run to complete a job, with each task running independently with no communication between the tasks. Rendering the frames of a digital movie is a good example of such a throughput workload: Each frame can be computed independently and in parallel; when all frames have been computed, the overall job has been completed. Throughput workloads currently run with very little degradation on vSphere, typically only a percentage point or two, and in some circumstances they can run slightly faster when virtualized. Figure 3 shows performance comparisons for a popular set of life sciences throughput benchmarks, illustrating that virtual performance can be very similar to unvirtualized performance for this workload class. In these tests, we compare the runtime of each program in the benchmark suite, running each within a VM and on bare metal, using identical hardware and OSs for each test.
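To make the comparison concrete, the relative degradation of a virtualized run against its bare-metal counterpart reduces to a simple percentage. The sketch below is illustrative only; the runtimes shown are hypothetical, not measurements from this study:

```python
def degradation_pct(bare_metal_s: float, virtual_s: float) -> float:
    """Relative slowdown of the virtualized run versus bare metal.

    Positive values mean the virtual run was slower; a negative value
    means the virtualized run was slightly faster, which can occur for
    some throughput workloads.
    """
    return (virtual_s - bare_metal_s) / bare_metal_s * 100.0

# Hypothetical runtimes (seconds) for one benchmark program:
print(degradation_pct(3600.0, 3672.0))  # a 2% slowdown
```

A "percentage point or two of degradation," as reported for throughput workloads, corresponds to values of roughly 1.0 to 2.0 from this calculation.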
HPC infrastructure typically consists of a cluster of nodes, ranging from tens to hundreds of thousands of nodes, to support a large degree of parallelism. The nodes in such a complex system are often split into multiple partitions based on their roles, such as login nodes, management nodes, and compute nodes. Rather than directly accessing the compute resources, users submit and manage jobs via login nodes, which are sometimes duplicated for load balancing and fault tolerance. To efficiently share the resources among multiple users while being able to enforce specific rules for fairness and quality of service, most production HPC systems execute user jobs on a pool of compute nodes in a batch mode. That is, each user-submitted job is first put into a job queue, where it waits until a job scheduler acquires the requested resources from a resource manager. Once obtained, the resources are allocated to the job. The job scheduler and resource manager are management services that run on dedicated management nodes.
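The queue-then-allocate flow described above can be sketched as a toy FIFO batch scheduler. This is an illustrative model only; production resource managers such as TORQUE implement far richer policies (priorities, backfill, fair-share), and all names here are invented:

```python
from collections import deque

class BatchScheduler:
    """Toy FIFO batch queue: jobs wait until enough compute nodes
    are free, then run on the allocated nodes until completed."""

    def __init__(self, total_nodes: int):
        self.free_nodes = total_nodes
        self.queue = deque()      # (job_name, nodes_needed), in submit order
        self.running = []

    def submit(self, job_name: str, nodes_needed: int) -> None:
        self.queue.append((job_name, nodes_needed))
        self._dispatch()

    def _dispatch(self) -> None:
        # Launch queued jobs in FIFO order while resources suffice.
        while self.queue and self.queue[0][1] <= self.free_nodes:
            name, n = self.queue.popleft()
            self.free_nodes -= n
            self.running.append((name, n))

    def complete(self, job_name: str) -> None:
        # Return the job's nodes to the pool and try to dispatch again.
        for job in self.running:
            if job[0] == job_name:
                self.running.remove(job)
                self.free_nodes += job[1]
                break
        self._dispatch()
```

For example, on a 16-node pool, two 8-node jobs run immediately, a third 4-node job queues, and completing the first job frees enough nodes for the third to start.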
With the test bed described in the previous section, a fair performance comparison between bare-metal and virtual clusters is to contrast the completion times for a fixed sequence of jobs. For this test, we first boot Linux on each node and time how long it takes to run the job stream through this physical TORQUE cluster. We then reboot the machines with the ESXi hypervisor and run the same throughput test on a virtual TORQUE cluster built using one VM per node. However, in a virtualized HPC environment, there is an extra configuration parameter—the number of VMs on each host—an exclusive advantage of virtualization that supports multitenancy and resource sharing. As has been confirmed in many enterprise use cases, resource consolidation can improve utilization and thereby increase overall throughput. To verify whether this also applies to HPC throughput computing, we experimented with CPU overcommitment in this work.
To study solely the effect of CPU overcommitment and avoid memory overcommitment, in the virtual environment 28GB of memory on each node is always reserved for the ESXi hypervisor instance, and the remaining 100GB is evenly split among the running VMs. For example, when four VMs are running on each host, each VM is given a reservation of 25GB memory.
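The per-VM memory split follows directly from the numbers above (128GB per host is implied by the 28GB hypervisor reservation plus the 100GB that remains for VMs). A minimal sketch of that arithmetic:

```python
HOST_MEMORY_GB = 128   # implied total: 28GB reserved + 100GB for VMs
ESX_RESERVED_GB = 28   # set aside for the ESXi hypervisor instance

def per_vm_reservation_gb(vms_per_host: int) -> float:
    """Evenly split the non-reserved host memory among the VMs."""
    return (HOST_MEMORY_GB - ESX_RESERVED_GB) / vms_per_host

print(per_vm_reservation_gb(1))  # 100.0
print(per_vm_reservation_gb(4))  # 25.0, matching the four-VM example
```

Because each VM's share is reserved rather than merely configured, total VM memory never exceeds what is physically available, so CPU overcommitment can be studied without memory overcommitment effects.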
Job Execution Time
To perform fair comparisons between test scenarios with differing numbers of active TORQUE clusters, we must ensure that the sequence of jobs that runs on the hardware in each scenario is approximately the same. To achieve this, we adopt a bottom-up approach by first generating a randomized job stream with each of the seven benchmarks repeated 116 times, for a total of 812 jobs. In the 4X overcommitment case in which four virtual TORQUE clusters execute jobs simultaneously, each of the clusters is fed a copy of this job stream, resulting in a total of 3,248 jobs being run over the course of this test. For the 2X overcommitment case with two active clusters, two copies of the 812-job stream are combined in an interleaved manner to produce a new job stream containing 1,624 jobs. This 1,624-job stream is fed to each of the two active clusters, resulting in the same number of jobs run as in the 4X test case (3,248) and with jobs launched in approximately the same order. Finally, two copies of the 1,624-job stream are combined again to create a third job stream with 3,248 jobs, which is fed to the single active cluster for bare-metal and single–virtual-cluster tests. This approach keeps the comparison fair by running approximately the same job sequence in every test case.
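The bottom-up stream construction above can be sketched in a few lines. The benchmark names below are placeholders (the actual seven benchmarks are not named in this section), and the shuffle seed is an assumption made for reproducibility:

```python
import random

# Placeholder names; the paper uses seven life sciences benchmarks.
BENCHMARKS = [f"bench{i}" for i in range(1, 8)]

def base_stream(reps: int = 116, seed: int = 0) -> list:
    """Randomized stream with each benchmark repeated `reps` times
    (7 benchmarks x 116 repetitions = 812 jobs)."""
    jobs = [b for b in BENCHMARKS for _ in range(reps)]
    random.Random(seed).shuffle(jobs)
    return jobs

def interleave_with_copy(stream: list) -> list:
    """Combine two copies of a stream job-by-job, doubling its length
    while preserving the approximate launch order."""
    return [job for pair in zip(stream, stream) for job in pair]

s812 = base_stream()                 # fed to each of 4 clusters (4X case)
s1624 = interleave_with_copy(s812)   # fed to each of 2 clusters (2X case)
s3248 = interleave_with_copy(s1624)  # fed to the single active cluster
print(len(s812), len(s1624), len(s3248))  # 812 1624 3248
```

In every scenario the hardware therefore executes 3,248 jobs in total, drawn from the same underlying sequence, so completion times are directly comparable.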
Besides execution time, another performance metric that we monitored is the total CPU utilization across the whole physical cluster. In an ESXi SSH session, the esxtop tool can gather a variety of metrics at fine granularity to give a detailed view of system state. We used esxtop running on each compute node to sample the CPU utilization of all VMs at 5-second intervals, and the results for one, two, and four virtual clusters are shown in Figure 7. With two and four virtual clusters, the decreasing trend at the end is due to job completion.
Per-Cluster CPU Utilization
In a real production HPC system, an important principle is fairness when multiple users are sharing the computing resources. This is also true in a virtualized environment. In particular, the previously mentioned CPU overcommitment configurations are representative of a future virtualized HPC environment where each user or group is given a virtual cluster during resource allocation. We can delve a little deeper into the two– and four–virtual cluster aggregate results in the previous section and examine the per-cluster CPU utilization in each case as shown in Figure 8. It is clear in both cases that the ESXi scheduler effectively maintains fairness so that each virtual cluster gets the same amount of CPU resources.
A rising trend toward virtualizing HPC environments is being driven by the desire to take advantage of the increased flexibility and agility that virtualization offers. This paper explores the concepts of virtual throughput clusters and CPU overcommitment with VMware vSphere to create multitenant, agile virtual HPC computing environments that can deliver quality-of-service guarantees across HPC users while maintaining good aggregate performance. Results from this work demonstrate that HPC users can expect near-native performance for HPC throughput workloads while enjoying the various benefits of virtualization.