Open Source in HPC [part 5]

Jon Thor Kristinsson

on 28 November 2022

In this blog post, we will cover the many ways open source has influenced the high-performance computing industry and some of the common open-source solutions in HPC.

This blog is part of a series that introduces you to the world of HPC:

What is High-performance computing? – Introduction to the concept of HPC
High-performance computing clusters anywhere – Introduction to HPC cluster hosting
What is supercomputing? A short history of HPC and how it all started with supercomputers
High-performance computing cluster architectures – An overview of HPC cluster architecture
High-performance computing (HPC) technologies – what does the future hold?

What kind of open-source software is there in HPC?

There are a lot of open-source projects in HPC, which makes sense as its foundations were mostly driven by governmental organisations and universities (some of the primary consumers of open source).

These are some of the types of open-source tools typically used for HPC:

Operating systems: Linux
Schedulers: SLURM, OpenPBS, Grid Engine, HTCondor and Kubernetes
Libraries for parallel computation: OpenMP, OpenMPI, MPICH, MVAPICH
Cluster provisioners: MAAS, xCAT, Warewulf
Storage: Ceph, Lustre, BeeGFS and DAOS
Workloads: BLAST, OpenFOAM, ParaView, WRF and FDS-SMV
Containers: LXD, Docker, Singularity (Apptainer) and Charliecloud

Operating systems

Linux in HPC

The Linux operating system, probably one of the most recognised open-source projects, has both driven open-source software in HPC and been driven by HPC use cases. NASA was an early user of Linux, and Linux, in turn, was fundamental to the first Beowulf cluster. Beowulf clusters were essentially clusters built from commodity servers and a high-speed interconnect, instead of more traditional mainframes or supercomputers. The first such cluster was deployed at NASA, and the approach went on to shape HPC as we know it today. It drove Linux adoption in government from then onwards, and that adoption expanded well beyond the public sector into enterprises.

HPC workload reliance on performance has driven a lot of development efforts in Linux, all focused heavily on driving down latency and increasing performance anywhere from networking to storage.

Schedulers

SLURM workload manager

Formerly known as Simple Linux Utility for Resource Management, SLURM is an open source job scheduler. Its development started as a collaborative effort between Lawrence Livermore National Laboratory, SchedMD, HP and Bull. SchedMD are currently the main maintainers and provide a commercially supported offering for SLURM. It’s used on about 60% of the TOP500 clusters and is the most frequently used job scheduler for large clusters. SLURM can currently be installed from the Universe repositories on Ubuntu.

Open OnDemand

Open OnDemand is not a scheduler per se, but it deserves an honourable mention alongside SLURM. It is a user interface for SLURM that eases the deployment of workloads via a simple web interface. It was created by the Ohio Supercomputer Center with a grant from the National Science Foundation.

Grid Engine

A batch scheduler with a complicated history, Grid Engine has been both open source and closed source at different times. It started as a closed-source application released by Gridware, but after Gridware’s acquisition by Sun it became Sun Grid Engine (SGE). It was then open sourced and maintained until Oracle acquired Sun, at which point Oracle stopped releasing the source and renamed it Oracle Grid Engine. Forks of the last open-source version soon appeared. One, called Son of Grid Engine, was maintained by the University of Liverpool but, for the most part, no longer is. Another, called Grid Community Toolkit, is also available but not really under active maintenance. A company called Univa started another, closed-source fork after hiring many of the main engineers of the Sun Grid Engine team. Univa Grid Engine is currently the only actively maintained version of Grid Engine. It’s closed source, and Univa was recently acquired by Altair. The Grid Community Toolkit version of Grid Engine is available on Ubuntu in the Universe repositories.

OpenPBS

Portable Batch System (PBS) was originally developed for NASA, under a contract by MRJ. It was made open source in 1998 and is actively developed. Altair now owns PBS, and releases an open source version called OpenPBS. Another fork exists that used to be maintained as open source but has since gone closed source. It’s called Terascale Open-source Resource and QUEue Manager (TORQUE) and it was forked and maintained by Adaptive Computing. PBS is currently not available as a package on Ubuntu.

HTCondor

HTCondor is a scheduler in its own right, but it differs from the others in that it was written to make use of unused workstation resources rather than dedicated HPC clusters. It can execute workloads on idle systems and kills them once it detects activity on those systems. HTCondor is available on Ubuntu in the universe package repository.
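
To make the idea of running on idle systems more concrete, here is a minimal toy simulation in Python. It is not HTCondor code and uses no HTCondor API. The function name, the tick-based timeline and the assumption that an evicted job restarts from scratch are all hypothetical simplifications chosen for illustration.

```python
def run_opportunistically(owner_active, work_units):
    """Simulate one job on one borrowed workstation.

    owner_active: list of booleans, one per time tick
                  (True means the owner is using the machine).
    work_units:   number of consecutive idle ticks the job needs.
    Returns a tuple (finished, evictions).

    Assumption: an evicted job loses its progress and starts over.
    """
    progress = 0
    evictions = 0
    for active in owner_active:
        if active:
            # The owner is back: a running job is killed.
            if progress > 0:
                evictions += 1
                progress = 0
            continue
        # The machine is idle: the job gets one tick of work done.
        progress += 1
        if progress == work_units:
            return True, evictions
    return False, evictions
```

For example, calling run_opportunistically([False, False, True, False, False, False], 3) models a job that is evicted once when the owner returns and then completes during the following idle period.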

Kubernetes

Kubernetes is a container scheduler that has gained a loyal following for scheduling cloud-native workloads. Interest in expanding the use of Kubernetes in more compute-focused workloads that depend on parallelisation has grown. Some machine learning workloads have even built up a substantial ecosystem around Kubernetes, sometimes driving a need to deploy Kubernetes as a temporary workload on a subset of resources to handle workloads that so heavily depend on it. There are also ongoing efforts to expand the overall scheduling capabilities of Kubernetes to better cater to the needs of computational workloads.

Libraries for parallel computation

OpenMP

OpenMP is an application programming interface (API) and library for parallel programming that supports shared-memory multiprocessing. When programming with OpenMP, all threads share both memory and data. OpenMP is highly portable and gives programmers a simple interface for developing parallel applications that can run on anything from multi-core desktops to the largest supercomputers. OpenMP enables threads to work together within a single node in an HPC cluster, but communicating between nodes requires an additional library and API. That’s where MPI, or Message Passing Interface, comes in, as it allows processes to communicate across nodes. OpenMP is available on Ubuntu through most compilers, such as gcc.
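
The difference between the shared-memory model and the message-passing model can be sketched in a few lines of Python. This is only an analogy: it uses neither OpenMP nor MPI (which target languages such as C, C++ and Fortran), and both function names are hypothetical. The first function lets all threads read and write the same data, as OpenMP threads would. The second gives each worker only what arrives in its inbox, as MPI ranks would see it.

```python
import queue
import threading


def shared_memory_sum(values, num_threads=4):
    """Shared-memory style: all threads see the same input list and
    write their partial result into one shared results list."""
    partial = [0] * num_threads  # visible to every thread

    def worker(tid):
        # Each thread sums its own strided slice of the shared input.
        partial[tid] = sum(values[tid::num_threads])

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def message_passing_sum(values, num_workers=4):
    """Message-passing style: workers hold no shared data. They receive
    a chunk as a message and send their partial sum back as a message."""
    inboxes = [queue.Queue() for _ in range(num_workers)]
    results = queue.Queue()

    def worker(rank):
        chunk = inboxes[rank].get()  # receive a message
        results.put(sum(chunk))      # send a message back

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for rank in range(num_workers):
        # Slicing creates a copy, so each worker gets its own data.
        inboxes[rank].put(values[rank::num_workers])
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(num_workers))
```

Both functions return the same total for the same input. What differs is how the workers get their data, which is the reason the two models suit communication within a node and between nodes respectively.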

OpenMPI

OpenMPI is an open-source implementation of the MPI standard, developed and maintained by a consortium of academic, research and industry partners. It was created through a merger of three well-known MPI implementations, which have not been individually maintained since the merger. The implementations were FT-MPI from the University of Tennessee, LA-MPI from Los Alamos National Laboratory and LAM/MPI from Indiana University. Each of these MPI implementations was excellent in one way or another. The project aimed to bring the best ideas and technologies from each into a new world-class open-source implementation, built on an entirely new code base, that excels overall. OpenMPI is available on Ubuntu in the universe package repository.

MPICH

Formerly known as MPICH2, MPICH is a freely available open-source implementation of MPI. Started by Argonne National Laboratory and Mississippi State University, its name comes from the combination of MPI and CH. CH stands for Chameleon, which was a parallel programming library developed by one of the founders of MPICH. It is one of the most popular implementations of MPI, and is used as the foundation of many MPI libraries available today, including Intel MPI, IBM MPI, Cray MPI, Microsoft MPI and the open-source MVAPICH project. MPICH is available on Ubuntu in the universe package repository.

MVAPICH

Originally based on MPICH, MVAPICH is freely available and open source. The implementation is being led by Ohio State University. Its stated goals are to “deliver the best performance, scalability and fault tolerance for high-end computing systems and servers” that use high-performance interconnects. Its development is very active and there are multiple versions, all made to get the best performance possible from the underlying fabric. Notable developments include its support for DPU offloading, where MVAPICH takes advantage of underlying SmartNICs to offload MPI processing, allowing the processors to focus entirely on the workload.

Cluster provisioners

MAAS

Metal as a Service, or MAAS, is an open-source project developed and maintained by Canonical. MAAS was created from scratch with one purpose: API-centric bare-metal provisioning. MAAS automates all aspects of hardware provisioning, from detecting a racked machine to deploying a running, custom-configured operating system. It makes management of large server clusters, such as those in HPC, easy through abstraction and automation. It was created to be easy to use, has a comprehensive UI (unlike many other tools in this space) and is highly scalable thanks to its disaggregated design. MAAS is split into a region controller, which manages overall state, and a rack controller, which handles PXE booting and power control. Multiple rack controllers can be deployed, allowing for easy scale-out regardless of the environment’s size. It’s notable that MAAS can be deployed in a highly available configuration, giving it a fault tolerance that many other projects in the industry don’t have.

xCAT

Extreme Cloud Administration Toolkit, or xCAT, is an open-source project developed by IBM. Its main focus is on the HPC space, with features primarily catering to the creation and management of diskless clusters and the parallel installation and management of Linux cluster nodes. It’s also suitable for setting up high-performance computing stacks such as batch job schedulers, and it can clone and image Linux and Windows machines. Some of its features primarily cater to IBM and Lenovo servers. It’s used by many large governmental HPC sites for the deployment of diskless HPC clusters.

Warewulf

Warewulf’s stated purpose is to be a “stateless and diskless container operating system provisioning system for large clusters of bare metal and/or virtual systems”. It has been used for HPC cluster provisioning for the last two decades and was recently rewritten in Go for its latest release, Warewulf v4.

Storage

Ceph

Ceph is an open-source software-defined storage solution built on object storage. It was originally created by Sage Weil for a doctoral dissertation and has roots in supercomputing. Its creation was sponsored by the Advanced Simulation and Computing Program (ASC), which includes supercomputing centres such as Los Alamos National Laboratory (LANL), Sandia National Laboratories (SNL), and Lawrence Livermore National Laboratory (LLNL). Its creation started through a summer program at LLNL. After concluding his studies, Sage continued to develop Ceph full time, and created a company called Inktank to further its development. Inktank was eventually purchased by Red Hat. Ceph continues to be a strong open-source project, and is maintained by multiple large companies, including members of the Ceph Foundation like Canonical, Red Hat, Intel and others.

Ceph was meant to replace Lustre when it comes to supercomputing, and through significant development efforts it has added features like CephFS, which give it POSIX compatibility and make it a formidable file-based network storage system. Its foundations truly favour fault tolerance over performance, and its replication-based storage model carries significant performance overheads. Thus, it has not quite reached the level of other solutions in terms of delivering close to the performance of the underlying hardware. But Ceph at scale is a formidable opponent, as it scales quite well and can deliver a large share of the aggregate performance of the cluster.
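
The cost of replication can be illustrated with some simple arithmetic. The following Python sketch is a back-of-the-envelope model, not a Ceph tool. The function names are hypothetical, the replica count of three is an assumed example value, and the model ignores metadata, erasure coding and any other real-world overhead.

```python
def usable_capacity_tb(raw_tb, replicas=3):
    """Capacity left for data when every object is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return raw_tb / replicas


def backend_bytes_written(client_bytes, replicas=3):
    """Bytes that land on disks for a given amount written by a client."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return client_bytes * replicas
```

Under these assumptions, 300 TB of raw disk yields 100 TB of usable space, and every byte a client writes results in three bytes written across the cluster. That is one way to see why a replicated design cannot deliver the full performance of the underlying hardware to a single workload.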

Lustre

Lustre is a parallel distributed file system used for large-scale cluster computing. The word Lustre is a blend of the words Linux and Cluster. It has consistently ranked high on the IO500, a twice-yearly benchmark that compares storage solution performance as it relates to high-performance computing use cases, and has seen significant use throughout the TOP500 list, a twice-yearly benchmark publication focused on overall cluster performance. Lustre was originally created as a research project by Peter J. Braam, who worked at Carnegie Mellon University and went on to found his own company (Cluster File Systems) to work on Lustre. Like Ceph, Lustre was developed under the Advanced Simulation and Computing Program (ASC) and its PathForward project, which received its funding through the US Department of Energy (DoE), Hewlett-Packard and Intel. Sun Microsystems eventually acquired Cluster File Systems, and Sun was itself acquired shortly after by Oracle.

Oracle announced soon after the Sun acquisition that it would cease the development of Lustre. Many of the original developers of Lustre had left Oracle by that point and were interested in further maintaining and building Lustre, but this time under an open community model. A variety of organisations were formed to do just that, including the Open Scalable File System (OpenSFS), European Open File Systems (EOFS) and others. To join this effort by OpenSFS and EOFS, several of the original developers founded a startup called Whamcloud. OpenSFS funded a lot of the work done by Whamcloud. This significantly furthered the development of Lustre, which continued after Whamcloud was eventually acquired by Intel. Through restructuring at Intel, the development department focused on Lustre was eventually spun out to a company called DDN.

BeeGFS

A parallel file system developed for HPC, BeeGFS was originally developed at the Fraunhofer Centre for High Performance Computing by a team around Sven Breuner. He became the CEO of ThinkParQ, a spin-off company created to maintain and commercialise professional offerings around BeeGFS. It’s used by quite a few European institutions whose clusters reside in the TOP500.

DAOS

Distributed Asynchronous Object Storage, or DAOS, is an open-source storage solution aiming to take advantage of the latest generation of storage technologies, such as non-volatile memory (NVM). It uses both distributed Intel Optane persistent memory and NVM Express (NVMe) storage devices to expose storage resources as a distributed storage solution. As a new contender it did relatively well in the IO500 10 node challenge, as announced during ISC HP’22, where it took four of the top 10 places. Intel created DAOS and actively maintains it.

Workloads

Many HPC workloads come from either in-house or open-source development, driven by a strong need for community effort. These workloads often have a strong research background, initiated through university work or national interests, and often serve multiple institutes or countries. When it comes to open source, there are plenty of workloads covering all sorts of scenarios, anything from weather research to physics.

BLAST

Basic Local Alignment Search Tool, or BLAST, is an algorithm in bioinformatics for comparing biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA or RNA sequences. It allows researchers to compare a sequence with a library or database of known sequences, easing identification. It can be used to compare sequences found in animals to those found in the human genome, helping scientists identify connections between them and how they might be expressed.
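
To give a feel for what comparing a sequence against a database means, here is a minimal Python sketch. It is not BLAST itself: BLAST relies on fast heuristics, whereas this sketch uses the exact Smith-Waterman dynamic-programming method for local alignment, which is far slower on real databases. The function names and the scoring values (match, mismatch and gap) are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences (Smith-Waterman).

    Only the score is computed, one row at a time, so memory use stays
    proportional to the length of b.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + step,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def best_match(query, database):
    """Return (name, score) of the highest-scoring database entry.

    database: dict mapping a sequence name to its sequence string.
    """
    scored = ((name, local_alignment_score(query, seq))
              for name, seq in database.items())
    return max(scored, key=lambda item: item[1])
```

Given a query such as "ACGT" and a small dictionary of named sequences, best_match returns the entry containing the closest local match, which is the kind of question researchers put to BLAST at a much larger scale.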

OpenFOAM

Open-source Field Operation And Manipulation, better known as OpenFOAM, is an open-source toolbox used primarily for developing numerical solvers for computational fluid dynamics. OpenFOAM as it’s known today was originally sold commercially as a program called FOAM. However, through the efforts of its owners it was open sourced under a GPL licence and renamed to OpenFOAM. In 2018, a steering committee was formed to set the direction of the OpenFOAM project; many of its members come from the automotive sector. Notably, OpenFOAM is available in the Ubuntu package repositories.

ParaView

ParaView is an open-source data analysis and visualisation platform built on a client-server architecture. It’s often used to view results from programs such as OpenFOAM. For optimal performance, the rendering or processing needs of ParaView can be spun up as a scheduled cluster job, allowing the use of clustered computational resources to assist. ParaView can also be run as a single application; despite its client-server architecture, it does not have to run on a cluster. ParaView was started through collaboration between Kitware Inc. and Los Alamos National Laboratory, with funding from the US Department of Energy. Since then, other national laboratories have joined the development efforts. Notably, ParaView is available in the Ubuntu package repositories.

WRF

Weather Research & Forecasting, or the WRF Model, is an open-source mesoscale numerical weather prediction system. It supports parallel computation and is used by an extensive community for atmospheric research and operational forecasting. It’s used by most of the entities involved in weather forecasting today. It was developed through a collaboration of the National Center for Atmospheric Research (NCAR), the National Oceanic and Atmospheric Administration (NOAA), the U.S. Air Force, the Naval Research Laboratory, the University of Oklahoma, and the Federal Aviation Administration (FAA). It’s a truly multidisciplinary and multi-organisational effort. It has an extensive community of about 56,000 users located in over 160 countries.

Fire Dynamics Simulator and Smokeview

Fire Dynamics Simulator (FDS) and Smokeview (SMV) are open-source applications created through efforts from the National Institute of Standards and Technology (NIST). FDS is a computational fluid dynamics (CFD) model of fire-driven fluid flow. It uses parallel computation to numerically solve a form of the Navier-Stokes equations. This form is appropriate for low-speed, thermally driven flows, such as those involved in the spread and transmission of smoke and heat from fires. Smokeview (SMV) is the visualisation component of FDS and is used for analysing the output from FDS. It allows users to better understand and view the spread of smoke, heat and fire. It’s often used to understand large structures and how they might be affected in such disaster scenarios.

Containers

HPC workloads often come with a complex set of dependencies. A lot of effort has been put into the development of module-based systems such as Lmod, which allow users to load applications or dependencies such as libraries from outside the normal system paths. This is often due to a need to compile applications against a certain set of libraries that depend on specific numerical or vendor versions. To avoid this complex set of dependencies, organisations can put considerable effort into containers. Containers effectively allow users to bundle up their application with all its dependencies into a single executable application container.

LXD

LXD is a next-generation system container and virtual machine manager. It offers a unified user experience around full Linux systems running inside containers or virtual machines. Unlike most other container runtimes, it can also manage virtual machines, and its ability to run full multi-application runtimes is unique. One can effectively run a full HPC environment inside an LXD container, providing abstraction and isolation at no cost to performance.

Docker

Docker, the predominant container runtime for cloud-native applications, has seen some usage in HPC environments. Its adoption has been limited in true multi-user systems, such as large cluster environments, as Docker fundamentally requires privileged access. Another downside often mentioned is the overall size of Docker images, which is attributed to the need to add all application dependencies, including MPI libraries, along with the application. This quite often creates large application containers that might easily duplicate components of other application containers. However, when done right, Docker can be quite effective for dependency management when it comes to the development and enablement of a specific hardware stack and libraries. It allows applications to be packaged so that they depend on a unified stack. This has some strengths. For example, building application images on top of shared base images avoids storing the same dependencies multiple times. You can see this to great effect in the NVIDIA NGC containers.

Singularity

Singularity, or Apptainer (the name of the most recent fork), is an application container effort that tries to address some of the perceived downsides of Docker containers. It avoids dependencies on privileged access, making its containers fit quite well into large multi-user environments. Instead of creating full application containers with all dependencies, its containers can be executed by system-level components such as MPI libraries and implementations, allowing for the creation of leaner containers with more specific purposes and dependencies.

Charliecloud

Charliecloud is a containerisation effort that’s in some ways similar to Singularity and uses Docker to build images that can then be executed unprivileged by the Charliecloud runtime. It’s an initiative by Los Alamos National Laboratory (LANL).

Summary

This blog does not even come close to covering the whole breadth of open-source, HPC-focused applications, but it should provide a good introduction to some of the main components and available open-source tools.

If you are interested in more information, take a look at the previous blog in the series, “High-performance computing (HPC) architecture”, or at this video of how Scania is Mastering multi cloud for HPC systems with Juju. For more insights, dive into some of our other HPC content.

In the next blog, we will look towards the future of HPC, and where we might be going.
