High Performance Computing – It’s all about the bottleneck

The term High Performance Computing, or HPC, evokes a lot of powerful emotions whenever it is mentioned. Even people who do not necessarily have vocational knowledge of hardware and software will have an inkling of what it’s all about: HPC solves problems at a speed and scale that cannot be achieved with standard, traditional compute resources. But that speed and scale introduce a range of problems of their own. In this article, we’d like to talk about the “problem journey” – the ripple effect of what happens when you try to solve a large, complex compute issue.

It all starts with data…

Consider a 1TB text data set. Let’s assume we want to analyse the frequency of words in the data set. This is a fairly simple mathematical problem, and it can be achieved by any computer capable of reading and analysing the input file. In fact, with fast storage devices like NVMe-connected SSDs, parsing 1TB of text won’t take too long.
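
As a concrete illustration, a minimal serial sketch of that word count in Python could look like the snippet below; the file path is a placeholder, and the chunked reading simply avoids holding the whole 1TB in memory at once:

```python
from collections import Counter

def count_words(path, chunk_size=64 * 1024 * 1024):
    """Stream a large text file and tally word frequencies serially."""
    counts = Counter()
    leftover = ""
    with open(path, "r", errors="replace") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            words = chunk.split()
            # Hold back a possibly incomplete word at the chunk boundary.
            if not chunk[-1].isspace() and words:
                leftover = words.pop()
            else:
                leftover = ""
            counts.update(words)
    if leftover:
        counts[leftover] += 1
    return counts

# Example usage (hypothetical file): count_words("corpus.txt").most_common(10)
```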

But what if we had to handle a much larger data set, say 1 PB (or 1 EB for that matter), and solve a much more complex mathematical problem? Furthermore, what if we needed to solve this problem relatively quickly?

From serial to parallel

One way to handle the challenge is to simply load the data serially into the computer’s memory for processing. This is where we encounter the first bottleneck. We need storage that can hold the data set, and which can stream the data into memory at a pace that will not compromise the required performance. Even before it starts, the computation problem becomes one of big data. As it happens, this is a general challenge that affects most areas of the software industry. In the HPC space, it is simply several orders of magnitude bigger and more acute.
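
A quick back-of-envelope calculation makes the scale tangible; the read speed below is an assumed figure for a fast NVMe device, not a measurement:

```python
# Rough estimate: how long it takes just to stream a data set from storage.
data_set_bytes = 1e15            # 1 PB
read_speed_bytes_per_s = 7e9     # assumed ~7 GB/s sequential read (high-end NVMe)

seconds = data_set_bytes / read_speed_bytes_per_s
print(f"~{seconds / 3600:.0f} hours (~{seconds / 86400:.1f} days)")
# Roughly 40 hours before a single word has been counted.
```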

The general limitations of storage speed compared to computer memory and CPU are nothing new. They can be mitigated in various ways, but by and large, especially when taking the cost of hardware into consideration, I/O operations (storage and network) are slower than in-memory operations or computation done by the processor. Effectively, this means that tasks will run only as quickly as the slowest component. Therefore, if one wants an entire system to perform reasonably fast, considering it is composed of multiple components that all work at different individual speeds, the input of data needs to change from serial to parallel.

An alternative approach to the data problem is to break the large data set into smaller chunks, store them on different computers, and then analyse and solve the mathematical problem as a set of discrete mini-problems in parallel. However, this method assumes that data can be arbitrarily broken up without losing meaning, or that execution can be done in parallel without altering the end result of the computation. It also requires the introduction of a new component into the equation: a tool that will orchestrate the division of data, as well as organise the work of multiple computers at the same time.
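
Sticking with the word-frequency example, a minimal sketch of that divide-and-merge pattern might look as follows. The chunking into separate files and the use of local processes (rather than separate computers) are simplifying assumptions purely for illustration:

```python
from collections import Counter
from multiprocessing import Pool

def count_chunk(path):
    """Count word frequencies in one chunk (here, one file per chunk)."""
    counts = Counter()
    with open(path, "r", errors="replace") as f:
        for line in f:
            counts.update(line.split())
    return counts

def count_parallel(chunk_paths, workers=4):
    """Map each chunk to a worker process, then merge the partial counts."""
    with Pool(workers) as pool:
        partial_counts = pool.map(count_chunk, chunk_paths)
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

# Example usage (hypothetical chunk files):
# count_parallel(["chunk_00.txt", "chunk_01.txt", "chunk_02.txt"])
```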

Data locality

If a single computer cannot handle the task at hand, using multiple computers means they will need to be inter-connected, both physically and logically, to perform the required workload. These clusters of compute nodes will often require network connectivity between them, so they can share data input and output, as well as the data dispatch logic.

The network component presents a new challenge to the mathematical problem we want to solve. It introduces the issue of reliability into the equation, as well as further performance considerations. While one can guarantee that data will correctly move from local storage into memory, there is no such guarantee for over-the-network operations.

Data may be lost, retransmitted, or delayed. Some of these problems can be controlled at the network transmission protocol level, but there will still be a level of variance that may affect the parallel execution. If the data analysis relies on precise timing and synchronisation of operations among the different nodes in inter-connected clusters of machines, we will need to introduce yet another safeguard mechanism into the system. We will need to ensure data consistency on top of the other constraints.
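
One common building block for such a safeguard is to verify every chunk after it crosses the network, for example by comparing checksums computed on the sending and receiving side; the sketch below is a generic illustration of the idea, with the actual transfer assumed to happen elsewhere:

```python
import hashlib

def checksum(chunk: bytes) -> str:
    """Content hash of a data chunk, computed on both ends of the transfer."""
    return hashlib.sha256(chunk).hexdigest()

def verify_chunk(received: bytes, expected_digest: str) -> bool:
    """True if the received chunk matches the digest sent alongside it."""
    return checksum(received) == expected_digest

# Sender:   digest = checksum(chunk)            # shipped with the chunk
# Receiver: verify_chunk(data, digest) or ask for a retransmission
```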

This means that the cluster’s data dispatcher, the governor of the work, will have to incorporate additional logic, including the fine balance among the different components in the environment. We need to invest additional resources to create an intelligent and robust scheduler before we can even start any data analysis.

Who guards the guards?

In many scenarios, large data sets will include structural dependencies that complicate the possible data topology for parallel computation. Data won’t just be a simple, if large, well-organised set. For example, certain parts of the data set may have to be processed before other parts, or the output of computation on some subsets may be needed as the input for further processing of other subsets.

Such limitations will dictate the data processing, and will require a clever governor that manages and organises the data. Otherwise, even if there are sufficient storage and computational resources available, they might not be utilised fully or correctly because the data isn’t being dispatched in an optimal manner.

The scheduler will have to take data locality and network transport reliability into account, correctly parse the data and distribute it across multiple computers, orchestrate the sequence and timing of parallel execution, and finally validate the results of the execution. All of these problems stem from the fact that we may want to run a task across multiple computers.

Back to a single node and the laws of thermodynamics

Indeed, one may conclude that parallel execution is too complex, and brings more problems than it solves. But in practice, it is the only possible method that can resolve some large-scale mathematical problems, as no single-node computer can handle the massive amounts of data in the required time frame.

Individual computers designed for high performance operations also have their own challenges, dictated by the limitations of physics. Computation speed is largely determined by the processor clock frequency. Typically, the higher the frequency, the higher the heat generation in the CPU. This excess heat quickly builds up and needs to be dissipated so that CPUs can continue working normally. When one considers the fact that the power density of silicon-based processors is similar to that of nuclear fission reactors, cooling is a critical element, and it will usually be the practical bottleneck of what any one processor can do.
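
The usual first-order model behind this is the dynamic power of CMOS logic, roughly P ≈ C·V²·f (switched capacitance, voltage, clock frequency). Since raising the frequency typically also requires raising the voltage, power, and therefore heat, grows much faster than linearly. A rough illustration, with entirely assumed numbers:

```python
# First-order dynamic power model for CMOS logic: P ~ C * V^2 * f.
def dynamic_power(capacitance, voltage, frequency):
    return capacitance * voltage ** 2 * frequency

base  = dynamic_power(1.0, 1.0, 3.0e9)   # arbitrary baseline at 3 GHz
boost = dynamic_power(1.0, 1.2, 4.5e9)   # +50% clock, assumed +20% voltage

print(f"power (and heat) increase: {boost / base:.2f}x")  # ~2.16x for 1.5x the clock
```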

Higher clock speeds, both for the processor and the memory, also affect the rate of error generation that can occur during execution, and the system’s ability to handle these errors. At some point, in addition to the heat problem, execution may become unstable, compromising the integrity of the entire process.

There are also practical limits on the size of physical memory modules and local storage systems, which mean that large data sets will have to be fragmented to be processed. At that point, we are back to square one: breaking the large set into a number of smaller ones, and taking into account the distribution of data, the data logic and ordering, and the reliability of the transport between the local system and the rest of the environment.

In that case, perhaps the solution is to use multiple computers, but treat them as a single, unified entity?

Mean Time Between Failures (MTBF)

The clustering of computers brings into focus an important concept, especially in the world of HPC. MTBF is an empirically established value that describes the average time a system will run before it encounters a failure. It is used to estimate the reliability of a system, in order to implement relevant safeguards and redundancy where needed.

An alternative way to look at MTBF is through failure rates. For instance, if a hard disk has a 3% failure rate within its first year, one can expect, on average, three hard disk failures from a pool of 100 within one year of being put to use. This isn’t necessarily an issue on its own, but if all 100 disks are used as part of a clustered system, as a single entity, then a failure somewhere in that system is pretty much guaranteed.

The MTBF values are a critical factor for HPC workloads due to their parallelised nature. Quite often, the results are counterintuitive. In many scenarios, HPC systems (or clusters) have much lower MTBF values than the individual computers (or nodes) that comprise them.

Hard disks are a good, simple way to look at and analyse MTBF-associated risks. If a hard disk has a 1% failure rate, storing identical data on two different devices significantly reduces the risk of total data loss. In this case, the risk of loss becomes only 0.01%. However, if the two disks are used to store different, non-identical parts of the data, the loss of either one will cause a critical failure in the system. The failure rate will increase and lead to a lower MTBF, effectively half the value of an individual disk.
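
The arithmetic behind those numbers is straightforward, assuming the two disks fail independently:

```python
# Probability arithmetic behind the mirrored vs. split example,
# assuming the two disks fail independently.
p_fail = 0.01  # 1% annual failure rate per disk

p_loss_mirrored = p_fail * p_fail        # both disks must fail: 0.01%
p_loss_split = 1 - (1 - p_fail) ** 2     # either disk failing is fatal: ~1.99%

print(f"mirrored: {p_loss_mirrored:.2%}, split: {p_loss_split:.2%}")
```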

This can be a significant problem with large data sets that need to be analysed in small chunks on dozens, hundreds and sometimes thousands of parallelised compute instances that work as a unified system. While each of these computers is an individual machine, they all form a single, more complex system that will not produce the desired results if a failure occurs on any one node in the entity.

With supercomputing clusters, the MTBF can be as low as several minutes. If the runtime of the tasks that need to be solved exceeds the MTBF figure, the setup needs to include mechanisms to ensure that the potential loss of individual components will not affect the entire system. However, every new safeguard often introduces additional cost and a potential performance penalty.
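
To see why the system-level figure can drop that low, assume, purely for illustration, a large number of identical components that fail independently; the cluster’s MTBF then scales as the component MTBF divided by the number of components:

```python
# Rough cluster MTBF under assumed independent, identical component failures.
component_mtbf_hours = 50_000   # assumed per-component MTBF (~5.7 years)
components = 500_000            # assumed count of nodes, disks, DIMMs, etc.

cluster_mtbf_hours = component_mtbf_hours / components
print(f"~{cluster_mtbf_hours * 60:.0f} minutes between failures")  # ~6 minutes
```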

Once again, there are critical challenges that need to be addressed before the data set can be safely parsed and analysed. In essence, it becomes a cat-and-mouse chase between performance and reliability. The more time and resources are invested in checks that ensure data safety and integrity, the longer and costlier the execution. A functioning setup will be a tradeoff between raw speed and potential catastrophic loss of the entire data set computation.

Back to data

So far, we haven’t really talked about what we actually want to do with our data. We just know that we have a payload that is too large to load and run on a single machine, and that we need to break it into chunks and run them on several systems in parallel. The bottleneck of data handling triggered an entire chain of considerations that are tangential to the data itself but critical to our ability to process and analyse it.

Assuming we can figure out the infrastructure part, we still have the challenge of how to handle our actual workload. A well-designed HPC setup that has as few bottlenecks as possible will still be only as good as our ability to divide the data, process it, and get the results.

To that end, we need a data scheduler. It needs to have the following capabilities (a rough sketch of such a component follows the list):

  • Be fast enough so the HPC setup is used in an optimal manner – and not introduce performance penalties of its own.
  • Be aware of and understand the infrastructure topology on which it runs so it can run necessary data integrity checks when needed – to account for data loss or corruption, unresponsive components in the system, and other possible failures.
  • Be able to scale up and down as needed.
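
A deliberately naive skeleton of what such a component could look like; every name, the round-robin placement, and the retry policy below are illustrative assumptions rather than a description of any real scheduler:

```python
class DataScheduler:
    """Illustrative skeleton only: split work, dispatch it, retry on failure."""

    def __init__(self, nodes, max_retries=3):
        self.nodes = nodes              # assumed to expose a .run(chunk) method
        self.max_retries = max_retries  # crude reliability safeguard

    def split(self, data, chunk_size):
        """Break the payload into fixed-size chunks (ignores data dependencies)."""
        return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    def dispatch(self, chunks):
        """Round-robin placement with naive retries; no locality awareness yet."""
        results = []
        for i, chunk in enumerate(chunks):
            node = self.nodes[i % len(self.nodes)]
            for attempt in range(self.max_retries):
                try:
                    results.append(node.run(chunk))
                    break
                except ConnectionError:
                    continue        # assume a transient network failure, try again
            else:
                raise RuntimeError(f"chunk {i} failed {self.max_retries} times")
        return results
```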

Usually, implementing all of these requirements in a single tool is very difficult, as the end product ends up being highly complex and fragile, as well as not easily adaptable to changes in the environment. Typically, setups will have several schedulers – one or more on the infrastructure level, and one or more on the application and data level. The former will take care of the parallelised compute environment so that the applications that run on top of it will be able to treat it as transparent, i.e., as though working on a single computer. The latter will focus on data management – low overhead, high integrity, and scale. On paper, this sounds like a pretty simple formula.

There are no shortcuts

Realistically, no two HPC or supercomputing setups are identical, and they all require a slightly different approach, for a variety of reasons. The size and complexity of the data that needs to be used for computation will often dictate the topology of the environment and the combination of individual components used in the system, no matter how much we may want these to be separate and transparent to each other. HPC setups are often built using a specific set of hardware, and sometimes some of the components are ancient while others are new and potentially less tested, which can lead to instability and runtime issues. A lot of effort is required to integrate all these components and make them work together.

At Canonical, we do not presume to have all the answers. But we do understand the colossal challenges that the practitioners of HPC face, and we want to help them. This piece is the beginning of a journey, where we want to solve, or at least simplify, some of the core problems of the HPC world. We want to make HPC software and tooling more accessible, easier to use and maintain. We want to help HPC customers optimise their workloads and reduce their costs, be it the hardware or the electricity bill. We want to make HPC better however we can. And we want to do that in a transparent way, using open source applications and utilities, both from the wider HPC community and our own home-grown software. If you’re interested, please consider signing up for our newsletter, or contact us, and maybe we can undertake this journey together.

Photo by Aleksandr Popov on Unsplash.
