Neural Network Evaluation Parallelization

Neural networks work by activating the neurons in each row (layer) based on a linear combination of the prior row's outputs and weights determined by the network's training.
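As a rough illustration of that description, the sketch below evaluates one row neuron by neuron in plain Python. The function names, the list-of-rows weight layout, and the choice of a sigmoid activation are assumptions made for the example, not details taken from the network used here.

```python
import math

def sigmoid(z):
    """Assumed activation function; the real network may use a different one."""
    return 1.0 / (1.0 + math.exp(-z))

def evaluate_layer(weights, inputs, activation=sigmoid):
    """Compute the outputs of one row of neurons.

    weights: one list of weights per neuron in this row, each the same
             length as `inputs` (the outputs of the prior row).
    """
    outputs = []
    for neuron_weights in weights:
        # Linear combination of the prior row's outputs and this neuron's weights.
        z = sum(w * x for w, x in zip(neuron_weights, inputs))
        outputs.append(activation(z))
    return outputs

if __name__ == "__main__":
    # Two neurons fed by a prior row of three values.
    print(evaluate_layer([[0.5, -1.0, 0.25], [1.0, 1.0, 1.0]], [1.0, 2.0, 4.0]))
```

Each neuron's sum is independent of the others, which is what makes the evaluation a good candidate for parallel hardware.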

This is equivalent to the matrix operations:
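One common way to write this for a single row, under assumed notation that the text above does not define (input vector x of length N, weight matrix W, activation function f), is:

```latex
y = f(Wx), \qquad y_i = f\Big(\sum_{j=1}^{N} W_{ij}\, x_j\Big)
```

If the network uses bias terms, a bias vector would be added to Wx inside f.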

Matrix operations are easily parallelized. The table below shows the number of operations needed to multiply an NxN matrix by a vector of length N, the number of cycles needed if the work is optimally parallelized, and the number of cores needed to achieve that optimal parallelization.

N	Operations	Parallelized Cycles	Cores Required
2	6	2	4
4	28	3	16
8	120	4	64
16	496	5	256
32	2,016	6	1,024
64	8,128	7	4,096
128	32,640	8	16,384
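The figures in the table appear to follow from a simple counting argument, sketched below as an inferred reading rather than the original derivation: N² multiplications plus N(N-1) additions in total, one cycle for all multiplications at once, log2(N) cycles to sum each row pairwise, and one core per matrix entry. The function name matvec_costs is hypothetical.

```python
import math

def matvec_costs(n):
    """Return (operations, parallel cycles, cores) for an n x n matrix times a vector."""
    multiplications = n * n          # every matrix entry times one vector entry
    additions = n * (n - 1)          # n - 1 additions to sum each of the n rows
    operations = multiplications + additions
    # One cycle for all multiplications, then a pairwise (tree) sum per row.
    cycles = 1 + math.ceil(math.log2(n))
    cores = n * n                    # one core per multiplication
    return operations, cycles, cores

if __name__ == "__main__":
    print("N\tOperations\tParallelized Cycles\tCores Required")
    for n in (2, 4, 8, 16, 32, 64, 128):
        ops, cycles, cores = matvec_costs(n)
        print(f"{n}\t{ops:,}\t{cycles}\t{cores:,}")
```

Under these assumptions the operation count grows with N² while the cycle count grows only with log2(N), provided enough cores are available.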

Commercial CPUs top out at around 32 cores (for example, the AMD Ryzen Threadripper 3970X), which is enough to fully parallelize a 5x5 matrix (25 cores) but not a 6x6 one (36 cores). For anything larger, we need GPUs, which have far more cores.

GPUs

GPUs have many more cores than any CPU. Although each GPU core cannot handle tasks as complex as a CPU core can, GPUs are great at performing simple additions and multiplications very quickly.

GPUs also have the advantage of specially designed circuits that do an addition and a multiplication in a single clock cycle (CPUs can take several). Some GPUs also have circuits designed specifically for matrix multiplication, able to multiply two 4x4 matrices in a single clock cycle.
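To give a sense of how much work such a circuit replaces, the sketch below multiplies two 4x4 matrices in plain Python and counts the scalar operations involved. It only illustrates the arithmetic; it does not model how the hardware schedules it, and the function name is hypothetical.

```python
def matmul_4x4(a, b):
    """Multiply two 4x4 matrices and count the scalar operations used."""
    n = 4
    multiplies = 0
    adds = 0
    result = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # First product starts the running sum for entry (i, j).
            acc = a[i][0] * b[0][j]
            multiplies += 1
            for k in range(1, n):
                # Each further step is one multiply followed by one add.
                acc += a[i][k] * b[k][j]
                multiplies += 1
                adds += 1
            result[i][j] = acc
    return result, multiplies, adds

if __name__ == "__main__":
    identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
    m = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    product, multiplies, adds = matmul_4x4(identity, m)
    print(product, multiplies, adds)
```

Done one scalar operation at a time, the same product takes 64 multiplications and 48 additions.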

For more information on GPUs, see Cornell and NVIDIA.

On the computer used for training, we are using an RTX A4000. Its full datasheet can be found here.

ONNX

In order to fully utilize GPUs, much work needs to be done to optimize code. Fortunately, there are packages that can do almost all of the work for us.

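ONNX (the Open Neural Network Exchange) is an open format for representing trained neural networks exported from other frameworks. A runtime such as ONNX Runtime then loads the ONNX model and executes it using optimized low-level routines for the available hardware, including GPUs.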
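A minimal sketch of evaluating an exported model might look like the following. It assumes the onnxruntime Python package is installed (GPU execution additionally needs its GPU build) and that the model has a single input; the helper names and the model path passed in by the caller are hypothetical.

```python
def choose_providers(available):
    """Prefer the CUDA (GPU) provider when present, keeping the CPU as a fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]

def run_onnx_model(model_path, input_array):
    """Load an ONNX model and run it once on `input_array` (a NumPy array)."""
    # Imported here so the helper above can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    session = ort.InferenceSession(model_path, providers=providers)
    # Assumes a single-input model; the input name is read from the model itself.
    input_name = session.get_inputs()[0].name
    # Passing None as the first argument requests all model outputs.
    return session.run(None, {input_name: input_array})
```

The input array has to match the shape and data type the model was exported with, which can be inspected through session.get_inputs().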
