Frontiers of DeepMind AI in the high-performance computing

Matrix multiplication is one of the basic linear algebra operations that is used almost everywhere. For example, fully connected layers in neural networks use matrix multiplication internally. Surprisingly, DeepMind developed a neural network that found a new multiplication algorithm that outperforms current, the best algorithm. Results were published in Nature journal. In this article, we will discuss this research in more detail.

Author: Michael Yurushkin, CEO of BroutonLab, PhD In Computer Science

Children are taught to multiply matrices in high school. Despite the apparent simplicity of the problem statement, scientists in the field of applied numerical methods have been studying fast algorithms for solving it for decades. But what is meant by a fast algorithm?

The main metrics of the implemented algorithm performance is the number of floating point operations per second (FLOPS) required to solve a given problem. Sometimes it is also useful to compare this number with the algorithm's theoretical peak performance (if such an estimate, of course, can be provided).

How are numerical algorithms developed?

Until the 1990s, the main criterion for the algorithms' quality estimation was the time complexity, or to put it simply, the arithmetic operations number. Then arithmetic operations were slow, so it was natural to optimize them. Strassen's algorithm is just one example of such algorithms. But in the 1990s, processors with cache memory began to appear, and it changed the direction of algorithms development.

Example of a processor with a hierarchical cache. Source:

With the advent of cache memory, it was not the number of arithmetic operations that mattered, but the number of memory accesses. At the same time, families of block algorithms have become popular, which make it possible to utilize the cache memory most efficiently.

It is worth noting that when building an algorithm, it is also important to take into account other elements of hardware. For example, it is important to efficiently use processor prefetching (that is why we consider the data spatial locality), as well as vector registers. How to develop a fast matrix multiplication algorithm is described in detail in the fundamental work of Kazushige Goto.

Based on the above, the main points that I want to highlight:

1. The theoretical efficiency of algorithms is determined based on the computation time model relevant for a given computing system.

2. With the evolution of hardware, the computation model, and therefore the algorithms themselves, may become outdated.

3. Computation models, so as not to lose flexibility, are still quite simple and abstract. The resulting implementations of the algorithms are still compared and optimized on real hardware.

It is also worth noting that many numerical method packages have a Linpack package interface, which was developed by Jack Dongarra in the 1970s. Among them, such packages as PLASMA, Intel MKL, OpenBLAS, MAGMA, ScaLAPACK, etc. are popular.

Me with Jack Dongarra, founder of Linpack

What is so special about the DeepMind study?

The researchers from DeepMind have managed to develop a neural network that efficiently looks over various matrix multiplication algorithms on a specific hardware. Notably, it’s not required to have any a priori theoretical runtime models. All is needed is target hardware, on which it’s possible to do many experiments during the neural network training. According to the results given in the article, the discovered algorithm outperformed the current best human-designed algorithm. It seems to me that this is a significant event. It can impact on the development of tools for fundamental research. However, I would like to highlight a few important points.

1. Low generalizing ability. The proposed approach can only be used to generate hypotheses that researchers can test, analyze and draw conclusions. After all, the response of a neural network does not have any generalizing ability. Indeed, a neural network will not answer the following fundamental questions:

  • What is the asymptotic complexity of the algorithm?
  • On which datasets does the algorithm work better, and on which ones worse
  • Why does found algorithm perform faster?

2. High cost of research. Given the fact that during its training, the neural network itself sets up many experiments and draws conclusions based on the measured running time, this requires large computing resources. Moreover, because of the reason Transformers, which are distinguished by their large size, are used at the heart of the neural network, I suspect that not all universities will be able to conduct such research.

3. Results are not generalizable for other hardware architectures. Another point that is also worth paying attention to: this approach helps to create efficient algorithms for specific hardware. Indeed, during training, the neural network conducts many experiments on a given hardware and measures the running time of the algorithm. Obviously, this approach will not be able to answer how the found algorithm will perform on other hardware.

4. And the last point: very often, hardware and algorithms are designed together, and not separately. For example, such an approach makes sense when the hardware is very expensive for mass use, you need to provide requirements for hardware manufacturers, and these requirements are driven by the target algorithm. The areas for which this problem is relevant are robotics, high-performance computing, problems of mathematical physics, etc. In these cases, it is obvious that suggested AI from DeepMind cannot be applied, since the hardware has not even been designed yet, and there is nothing to train the neural network.

However, I will repeat myself. I was very interested to hear about this work. And I am glad that these studies are being carried out.


Michael Yurushkin

Founder of BroutonLab, PhD