
Neural Network Evaluation Parallelization

Neural networks evaluate each layer by taking a linear combination of the previous layer's activations, weighted by values learned during the network's training, and passing the result through an activation function.

For a single layer, this is equivalent to a matrix-vector multiplication followed by an activation function: y = σ(Wx), where x is the vector of the previous layer's activations, W is the learned weight matrix, and σ is the activation function.
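As a minimal sketch of one layer's evaluation (the layer sizes and the ReLU activation here are illustrative choices, not taken from this page):

```python
import numpy as np

# Illustrative sizes: a layer mapping 4 inputs to 3 outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))  # weights learned during training
x = rng.standard_normal(4)       # previous layer's activations

# One layer of evaluation: matrix-vector product, then activation.
y = np.maximum(W @ x, 0.0)       # ReLU activation
print(y)
```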

Matrix operations are easily parallelized. The table below shows the number of operations needed to multiply an NxN matrix by a vector of length N, the number of clock cycles needed if those operations are optimally parallelized, and the number of cores required to achieve that parallelization (a short script reproducing these counts follows the table).

| N   | Operations | Parallelized Cycles | Cores Required |
|-----|------------|---------------------|----------------|
| 2   | 6          | 2                   | 4              |
| 4   | 28         | 3                   | 16             |
| 8   | 120        | 4                   | 64             |
| 16  | 496        | 5                   | 256            |
| 32  | 2,016      | 6                   | 1,024          |
| 64  | 8,128      | 7                   | 4,096          |
| 128 | 32,640     | 8                   | 16,384         |
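These counts follow a simple model: the multiplication takes 2N² − N operations in total (N² multiplications plus N(N − 1) additions), one core per multiplication gives N² cores, and with that many cores all multiplications finish in one cycle while each row's N products are summed in a log₂(N)-cycle reduction tree, for 1 + log₂(N) cycles total. A short sketch reproducing the table:

```python
import math

print(f"{'N':>4} {'Operations':>10} {'Cycles':>6} {'Cores':>6}")
for n in (2, 4, 8, 16, 32, 64, 128):
    operations = 2 * n * n - n      # N^2 multiplies + N(N-1) adds
    cycles = 1 + int(math.log2(n))  # 1 cycle of multiplies + log2(N) reduction adds
    cores = n * n                   # one core per multiplication
    print(f"{n:>4} {operations:>10,} {cycles:>6} {cores:>6,}")
```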

Commercial CPUs top out at around 32 cores (e.g., the AMD Ryzen Threadripper 3970X), so for anything larger than a 5x5 matrix (25 cores for 5x5, but 36 for 6x6), we need to use GPUs, which have far more cores.

GPUs

GPUs have many more cores than any CPU. Although each GPU core cannot handle processes as complex as a CPU core can, GPUs are excellent at performing simple additions and multiplications very quickly.

GPUs also have the advantage of specially designed circuits that perform an addition and a multiplication in a single clock cycle, where a CPU can take several. Some GPUs additionally have circuits designed specifically for matrix multiplication (e.g., NVIDIA's Tensor Cores), able to multiply two 4x4 matrices in a single clock cycle.
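As a minimal sketch of offloading such a product to a GPU (assuming PyTorch with CUDA support, which this page does not name; the 128x128 size is illustrative):

```python
import torch

# Fall back to the CPU if no CUDA-capable GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

# A 128x128 weight matrix and a length-128 input vector on the GPU.
W = torch.randn(128, 128, device=device)
x = torch.randn(128, device=device)

# The matrix-vector product runs across the GPU's many cores at once.
y = W @ x
print(y.shape, y.device)
```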

For more information on GPUs, see Cornell and NVIDIA.

The computer used for training has an NVIDIA RTX A4000. Its full datasheet can be found here.

ONNX

Fully utilizing a GPU normally requires significant code optimization. Fortunately, there are packages that do almost all of that work for us.

ONNX (the Open Neural Network Exchange) is a common format for trained neural networks: models trained in other frameworks can be exported to ONNX, and a runtime then executes them using low-level instructions optimized for the available hardware.

For more on ONNX, see https://onnx.ai/about.html
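As a minimal sketch of that workflow (assuming PyTorch and ONNX Runtime are installed, neither of which this page names; the model, layer sizes, and file name are illustrative):

```python
import torch
import onnxruntime as ort

# A tiny illustrative network: 4 inputs -> 3 outputs with ReLU.
model = torch.nn.Sequential(torch.nn.Linear(4, 3), torch.nn.ReLU())
model.eval()

# Export the trained model to the ONNX format.
example_input = torch.randn(1, 4)
torch.onnx.export(model, example_input, "model.onnx")

# Load and run it with ONNX Runtime, which selects optimized
# low-level kernels for the available hardware.
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: example_input.numpy()})
print(outputs[0])
```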










