## Hardware Matrix Multiplication

__Matrix multiplication background__

Matrix multiplication is one of those rather mysterious math problems that most of us dreaded in college! But to be clear, matrix multiplication is an important operation in linear algebra as it provides an organized system for performing linear transformations on sets of linear equations.

Matrices themselves are not new, and in fact have been used by mathematicians for many centuries. Matrices proved quite useful over the years; therefore, new matrix operations and applications have been and continue to be devised. It should come as no surprise that the multiplication of two matrices is a primary operation for just about any real-world matrix-oriented algorithm.

There are dozens of applications that use matrix multiplication. For example, matrix multiplication is critical in 3D graphics to perform coordinate transformations, and matrices are commonly used in scientific and engineering problems. Also, matrix multiplication is found in many modern software applications, such as performing the “page rank” calculation for prioritizing the importance of a web page. Matrix multiplication is also important for tensor-based operations, whereas a tensor is a generalized matrix composed of other tensors, vectors and scalar values.

__Emergence of Hardware based matrix multiplication__

One issue with matrix multiplication is the large amount of product summations that must be performed. To perform matrix multiplication, a dot product is generated for each element of the resulting matrix so that M^{3} number of multiply accumulations is performed for an M x M matrix. Matrix multiplication is a significant burden for modern CPUs which often rely on floating-point operations to perform general purpose computation. One reason is that while the floating-point format is very flexible, it is inefficient for multiply-and-accumulate operations especially if the operations will be pipelined.

Meanwhile, tensors have evolved to become a fundamental arithmetic object critical to artificial intelligence (AI) algorithms. For example, the inference or “retrieval” stage of a neural network can be organized and comprised of matrix multiplication operations. Processing matrix multiplication at very high speed enables real-time AI applications, such as voice and image recognition. Modern CPU’s and even CPU’s equipped with vector processing units (VPUs) may not be fast enough to handle the ever-increasing matrix multiplication workload.

Hardware matrix multiplication has advantages over a single CPU or a VPU because multiply-accumulate operations are performed using a 2-D array of processing units. The use of a M x M array of processing elements provides for a “squared” increase in processing performance over a single vector processor of M elements. This fact has led to a surge in the integration of hardware matrix multiplier circuits which process matrix multiplication at a speed fast enough to enable real-time AI applications. In the final analysis, the hardware matrix multiplier is more efficient for processing matrix multiplication than a CPU or VPU.

However, the difficulty in using floating-point format is still an issue. Thus, TPUs are typically designed with a narrow floating-point format like 16-bit floating point, for example. The much narrower floating-point format provides the numerical range but sacrifices precision; most importantly is the 16-bit FPU processing element is more easily pipelined in hardware. The loss of precision when using a narrow floating-point format may or may not be relevant depending on the AI application. However, increasing precision creates a clear performance issue.

Several top-tier processor companies have implemented hardware matrix multipliers to provide significant acceleration in AI processing performance. These hardware matrix multipliers are often dedicated to AI arithmetic and are referred to as a tensor processing unit (TPU). For example, Google recently designed its own TPU processors, and Nvidia is now placing multiple TPUs in its latest AI chips.

The use of TPUs for hardware acceleration is very important given that Moore’s law is slowing. It is estimated the hardware AI chip market will produce substantially different architectures for AI applications and will exceed $50B dollars annually. It will not be a surprise that most top-tier CPU companies will be integrating TPUs into their hardware very soon!

__The killer application for RNS computation: Product summations and matrix multiplication__

During the early development of RNS for general purpose computation (now referred to as modular computation) it became evident that product summation operations is a primary strength and advantage. As the need for accelerated matrix multiplication expands, it is evident that modular computation can serve a useful role.

In the case of matrix multiplication, there are several advantages of a 2-D structure which specifically benefits matrix multiplication in RNS. Since RNS arithmetic requires conversion of data from floating-point to RNS format, there exist a well-known conversion overhead. This RNS conversion overhead has plagued most RNS systems of the past, but recent advances in modular computation hardware has provided true fixed-point conversion in an efficient pipelined structure. Moreover, the amount of data which must be converted is proportional to O(M^{2}) while the amount of arithmetic operations performed in the matrix is O(M^{3}). Therefore, the RNS conversion overhead ratio is significantly reduced.

RNS conversion overhead is further decreased if the matrix operation is iterative, so that intermediate matrix data values remain in RNS format and therefore no conversion is required until the iteration algorithm is complete.

There are many other important reasons why RNS is a good number system for hardware matrix multipliers. For example, since product summations are performed without carry in RNS, each digit of the RNS machine word may be separated and operated in isolation. This partitioning of the RNS digits gives rise to a new term: the RNS digit matrix multiplier. Operating a plurality of RNS digit matrix multipliers completes a full word size accumulator. Thus, the accumulator word size can be increased without affecting the speed of each digit matrix multiplier.

Ideally, digit matrix multipliers use a narrow bit width to encode each RNS digit value so the digit matrix multiplier circuit is much smaller and generally faster than an equivalent matrix of floating-point MACs. The digit encoding can be adjusted down to as small as 6 or 7 bits in practice. Therefore, only 6-bit modular multipliers and accumulators are required in theory. Moreover, new high-speed techniques have emerged to take advantage of cascading small binary accumulators into larger modular accumulators by adding a few more bits to each RNS digit accumulator.

Another nice advantage of modular arithmetic comes from the narrow bit width of an RNS multiplication or summation result. In modular arithmetic, the bit width of product accumulation result is the same as a single operand, so the routing resources of buses within the digit matrix multiplier are significantly reduced. In the contrasting case of binary arithmetic, consider a 16×16 bit binary multiplication produces a 32-bit result and requires say a 40-bit accumulator to ensure accumulation without overflow. In this case, 40-bit result buses must be routed between processing elements of the matrix since normalization of the 40-bit value will typically occur outside the matrix for reasons of circuit economy.

Another advantage is the precision of the arithmetic provided by modular computation. Consider that dot products are stored in an accumulator that is spread between a plurality of RNS digits. In other words, the accumulated result is not stored in a single digit accumulator but is represented by the combination of all RNS digit accumulator values. This is a unique situation versus binary accumulation. Furthermore, all product summation is performed in a double word, or “wide word” format of the RNS register, and only a single normalization occurs after product summation is complete; this means product summation in RNS is highly accurate, in fact it is precise to a single rounding operation, or as precise as the normalized representation can allow.

If all these advantages were not enough, the new RNS based TPU circuits exhibit yet another highly important property! That is, unit multiplier resource consumption is nearly linear versus precision ~O(n), but in binary arithmetic we find unit multiplier resource consumption is O(n^{2}). This means that IC area for multipliers can be as much as 25% or less than the area required for multipliers for floating point arithmetic.

When all the highly attractive properties of RNS matrix computation are combined, the resulting increase in performance and accuracy are astounding! We’ve reported estimated performance of the RNS TPU versus an equivalent fixed-point binary TPU using FPGA technology at 7.5 time to 9.5 times better. MaiTRIX will be publishing final results soon!

Stay tuned!