Neural Network Evaluation Parallelization
Neural networks work by activating the neurons in each layer based on a linear combination of the previous layer's activations and the weights determined by the network's training.
For a layer with weight matrix W, bias vector b, and activation function f, this is equivalent to the matrix operation y = f(Wx + b), where x is the previous layer's output.
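As a minimal sketch of this (assuming NumPy, with made-up layer sizes and random weights standing in for trained ones), a forward pass is just repeated matrix-vector products:

```python
import numpy as np

def relu(z):
    # Elementwise activation: max(0, z)
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 4 inputs -> 8 hidden -> 3 outputs
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = rng.standard_normal((3, 8)), np.zeros(3)

x = rng.standard_normal(4)   # input vector
h = relu(W1 @ x + b1)        # hidden layer: matrix-vector product, then activation
y = relu(W2 @ h + b2)        # output layer: same operation again
```

Every layer is one matrix-vector multiply, which is why speeding up that single operation speeds up the whole network.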
Matrix operations are easily parallelized. Each of the N^2 multiplications in an NxN matrix times length-N vector product is independent, so with N^2 cores they all finish in one cycle; the N additions in each dot product can then be reduced pairwise in log2(N) further cycles. The table below shows the total number of operations (2N^2 - N), the number of cycles needed if optimally parallelized (1 + log2(N)), and the number of cores required to achieve optimal parallelization (N^2).
| N | Operations | Parallelized Cycles | Cores Required |
|---|---|---|---|
| 2 | 6 | 2 | 4 |
| 4 | 28 | 3 | 16 |
| 8 | 120 | 4 | 64 |
| 16 | 496 | 5 | 256 |
| 32 | 2,016 | 6 | 1,024 |
| 64 | 8,128 | 7 | 4,096 |
| 128 | 32,640 | 8 | 16,384 |
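The columns above follow directly from those counts. A short script (a sketch, not tied to any particular hardware) reproduces the table:

```python
import math

def matvec_parallel_stats(n):
    """Stats for multiplying an NxN matrix by a length-N vector (N a power of 2)."""
    operations = 2 * n * n - n      # N^2 multiplications + N*(N-1) additions
    cycles = 1 + int(math.log2(n))  # 1 cycle of multiplies, log2(N) cycles of pairwise adds
    cores = n * n                   # one core per multiplication
    return operations, cycles, cores

for n in [2, 4, 8, 16, 32, 64, 128]:
    print(n, *matvec_parallel_stats(n))
```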
Commercial CPUs top out at around 32 cores (e.g., the AMD Ryzen Threadripper 3970X), so for anything more complex than 5x5 matrices (a 6x6 product already needs 36 cores), we need to use GPUs, which have far more cores.
GPUs
GPUs have many more cores than any CPU. Although their cores cannot handle processes as complex as a CPU core can, they are great at performing simple additions and multiplications very quickly.
GPUs also have the advantage of specially designed circuits that perform an addition and a multiplication in a single clock cycle (CPUs can take several). Some also have circuits designed specifically for matrix multiplication (NVIDIA's Tensor Cores), able to multiply two 4x4 matrices in a single clock cycle.
For more information on GPUs, see Cornell and NVIDIA.
On the computer used for training, we are using an RTX A4000. Its full datasheet can be found here.
ONNX
In order to fully utilize a GPU, much work needs to be done to optimize code. Fortunately, there are packages that can do almost all of that work for us.
ONNX (the Open Neural Network Exchange) defines a common format for trained neural networks, so a model built in one framework can be exported and then executed by an optimized runtime using efficient low-level CPU or GPU instructions.
For more on ONNX, see https://onnx.ai/about.html