In a previous article we compared the performance of different implementations of 3x3 matrix-vector multiplications.
The conclusion was that even large arrays of vectors can be multiplied with high performance if we use highly optimized
numpy functions, in particular the np.dot() function.
transforms
Multiplying by a 3x3 matrix allows us to scale and rotate vectors, but often we want more. For example, when converting between coordinate systems, like from world to local space, we might not only need to rotate and scale, but translation might come into play as well.
Now you could of course do the 3x3 multiplication first and then add any translation vector in a second step, but that means we would need to process the array of vectors twice, which could add up quite a bit in processing costs. Fortunately there is a way to perform such a transformation in a single step: the combination of scaling, rotation and translation (technically an affine transformation) can be packed into a single 4x4 matrix, which can then be multiplied with a vector of length 4.
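To make this concrete, here is a minimal sketch of packing a uniform scale, a rotation (around the z-axis, for simplicity) and a translation into one 4x4 matrix; the function name and values are made up for illustration, with the translation placed in the 4th column as described below.

```python
import numpy as np

def affine_matrix(scale, angle, translation):
    """Pack a uniform scale, a rotation around the z-axis and a
    translation into a single 4x4 affine transformation matrix
    (translation in the 4th column)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([
        [scale * c, -scale * s, 0.0,   translation[0]],
        [scale * s,  scale * c, 0.0,   translation[1]],
        [0.0,        0.0,       scale, translation[2]],
        [0.0,        0.0,       0.0,   1.0],
    ])

m = affine_matrix(2.0, np.pi / 2, (10.0, 0.0, 0.0))
# a point at (1, 0, 0), extended with a 1: scaled to (2, 0, 0),
# rotated to (0, 2, 0), then translated to (10, 2, 0)
p = np.array([1.0, 0.0, 0.0, 1.0])
print(m @ p)  # → [10.  2.  0.  1.]
```

A single `m @ p` now applies all three operations at once, instead of a 3x3 multiply followed by a separate vector addition.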
In Blender, for example, each object has a matrix_world attribute that encodes the scale, rotation and position of an object in world space as a 4x4 matrix. This means we could calculate the position of a vertex in global space by multiplying its coordinate vector (which is stored in local space, i.e. relative to the origin of the object) by this matrix (or go the other way around by inverting the matrix first). Vertex coordinates are vectors of length 3 of course, so they need to be extended to 4 elements by appending a 1. This ensures the translation, which is stored in the 4th column of the matrix, is factored in during multiplication. (If we wanted to transform a normal, we would append a 0 instead, because normals are not affected by translation.)
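The point-versus-normal distinction can be sketched in plain numpy. (Blender's matrix_world is actually a mathutils.Matrix, and the matrix values here are made up for illustration: a uniform scale of 2 plus a translation of (5, 0, 0), with no rotation to keep the numbers obvious.)

```python
import numpy as np

# hypothetical object-to-world matrix: uniform scale 2,
# translation (5, 0, 0) in the 4th column
matrix_world = np.array([
    [2.0, 0.0, 0.0, 5.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

co = np.array([1.0, 1.0, 1.0])      # a vertex coordinate in local space
normal = np.array([0.0, 0.0, 1.0])  # a normal in local space

# a point gets a 1 appended: the translation column takes effect
world_co = matrix_world @ np.append(co, 1.0)
print(world_co[:3])  # → [7. 2. 2.]

# a normal gets a 0 appended: the translation column is ignored
world_n = matrix_world @ np.append(normal, 0.0)
print(world_n[:3])   # → [0. 0. 2.]
```

The point is both scaled and translated, while the normal only picks up the scale, exactly because of that fourth element.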
There is a lot more to matrix multiplication, so my advice would be to read a good book on the subject (perhaps Mathematics for 3D Game Programming and Computer Graphics by Eric Lengyel), but I hope it is clear that 4x4 matrix multiplication is very relevant for any 3D application.
So what kind of performance can we expect if we want to do this in Python?
numbers
Let's compare the numbers from the 3x3 matrix multiplication to those of the 4x4 multiplication. Each number is the time in seconds it takes to perform a million matrix x vector multiplications (Mops), so lower is better. Going to the right we often see a slight improvement (a decrease) when working on larger arrays of vectors, although the effect levels off for very large arrays. Some numbers are missing for the slowest implementations on large arrays because those runs simply take too long.
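For readers who want to reproduce the numbers: the exact harness lives in the previous article, but a minimal sketch of the measurement (using timeit and names I made up here) could look like this.

```python
import timeit
import numpy as np

def time_mops(fn, vectors, repeats=3):
    """Time fn(vectors) and scale the best run to seconds per
    million matrix x vector multiplications (lower is better)."""
    n = len(vectors)
    best = min(timeit.repeat(lambda: fn(vectors), number=1, repeat=repeats))
    return best * 1_000_000 / n

rng = np.random.default_rng(42)
m = rng.random((4, 4))
vectors = rng.random((100_000, 4))

def array_np_dot(vs):
    # multiply every row vector by m in one call: (n, 4) @ (4, 4)
    return np.dot(vs, m.T)

print(f"{time_mops(array_np_dot, vectors):.4f} s / Mops")
```

Taking the minimum over a few repeats reduces the influence of other processes on the machine; the absolute numbers will of course differ from the tables below.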
3x3 — seconds / Mops
The first table is from the previous article:
| Method | 100000 | 200000 | 500000 | 1000000 | 2000000 | 5000000 | 10000000 |
|---|---|---|---|---|---|---|---|
| naive | 2.4320 | 2.4365 | 2.1950 | 2.2037 | 2.1824 | 2.1994 | |
| comprehension | 5.0050 | 5.0010 | 5.0108 | 5.0688 | 5.0245 | 4.9647 | |
| np_dot | 0.5940 | 0.5885 | 0.5820 | 0.5853 | 0.5811 | 0.5871 | |
| array_np_dot | 0.0040 | 0.0035 | 0.0022 | 0.0077 | 0.0035 | 0.0034 | 0.0032 |
| array_np_einsum | 0.0710 | 0.0445 | 0.0400 | 0.0401 | 0.0407 | 0.0408 | 0.0403 |
| array_np_dot_in_place | 0.0030 | 0.0035 | 0.0030 | 0.0033 | 0.0049 | 0.0045 | 0.0044 |
4x4 — seconds / Mops
A similar table for the 4x4 multiplications:
| Method | 100000 | 200000 | 500000 | 1000000 | 2000000 | 5000000 | 10000000 |
|---|---|---|---|---|---|---|---|
| naive_4x4 | 4.0480 | 3.7010 | 3.6120 | 3.5078 | | | |
| comprehension_4x4 | 7.3080 | 7.2580 | 7.3198 | 7.3427 | | | |
| np_dot | 0.3200 | 0.3195 | 0.3196 | 0.3195 | | | |
| array_np_dot | 0.0060 | 0.0035 | 0.0026 | 0.0036 | 0.0045 | 0.0039 | 0.0038 |
| array_np_einsum | 0.0760 | 0.0460 | 0.0450 | 0.0452 | 0.0463 | 0.0459 | 0.0457 |
| array_np_dot_in_place | 0.0040 | 0.0035 | 0.0036 | 0.0047 | 0.0066 | 0.0057 | 0.0056 |
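For reference, the array based variants in the tables can be sketched roughly as follows. The names match the table rows, but the actual implementations are in the previous article, so treat this as an approximation of the idea rather than the measured code.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.random((4, 4))
vectors = rng.random((1000, 4))

# one np.dot over the whole array: (n, 4) @ (4, 4) -> (n, 4)
out_dot = np.dot(vectors, m.T)

# einsum spells out the same contraction explicitly:
# out[n, i] = sum_j m[i, j] * vectors[n, j]
out_einsum = np.einsum('ij,nj->ni', m, vectors)

# "in place" variant: write the result into a preallocated buffer,
# avoiding a fresh allocation on every call
out_buf = np.empty_like(vectors)
np.dot(vectors, m.T, out=out_buf)

assert np.allclose(out_dot, out_einsum)
assert np.allclose(out_dot, out_buf)
```

All three compute the same result; they differ only in how the work is dispatched to numpy, which is exactly what the timings above probe.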
Slowdown: 4x4 compared to 3x3 (expected: ~1.9x slower)
If we divide the results we can see the relative slowdown of 4x4 multiplication compared to 3x3 multiplication:
| Method | 100000 | 200000 | 500000 | 1000000 | 2000000 | 5000000 | 10000000 |
|---|---|---|---|---|---|---|---|
| naive | 1.66 | 1.52 | 1.65 | 1.59 | | | |
| comprehension | 1.46 | 1.45 | 1.46 | 1.45 | | | |
| np_dot | 0.54 | 0.54 | 0.55 | 0.55 | | | |
| array_np_dot | 1.50 | 1.00 | 1.18 | 0.47 | 1.29 | 1.17 | 1.18 |
| array_np_einsum | 1.07 | 1.03 | 1.13 | 1.13 | 1.14 | 1.12 | 1.13 |
| array_np_dot_in_place | 1.33 | 1.00 | 1.20 | 1.42 | 1.35 | 1.25 | 1.27 |
observations
The first thing we note is that 4x4 multiplication is indeed almost always slower, which is in itself no surprise.
After all, a 4x4 matrix-vector multiplication takes 4 x ( 4 multiplications + 3 additions ) = 28 floating point operations,
compared to 3 x ( 3 multiplications + 2 additions ) = 15 operations for the 3x3 case. So based on the number of floating
point operations alone we would expect a slowdown by a factor of 28 / 15 ≈ 1.9.
But what we see is that even for the pure python naive and comprehension based implementations the slowdown stays below that expected factor of 1.9. This is likely because the cost of the actual arithmetic is small compared to the cost of the Python function calls and the setup cost of the loops and generators, which is the same for both matrix sizes.
Even more surprising perhaps is that when we use the np.dot() function to individually multiply each vector with the matrix, we see a speed increase instead of a decrease. Because there is no python loop setup here whatsoever and the number of function calls is the same, something different must be at play. My conjecture is that processor caching plays an important role here, and perhaps the underlying numpy implementation even chooses to perform those multiplications in a different way. It is hard to tell without looking at the numpy code (and even then I am probably nowhere near knowledgeable enough to say something relevant), so just be aware that on a different machine this result might be quite different (tests were performed on an AMD Ryzen 7 7700X).
I might redo these timings on even larger arrays and/or with more repetitions to verify this a bit more.
The array based implementations see even less of a slowdown compared to the pure python ones. This is nice, but again not easy to explain: why would almost twice as many floating point operations slow things down by only about 25% or so, even though the python function call overhead is the same?
conclusions
4x4 matrix multiplications are almost as fast as 3x3 multiplications, which is nice if you want to perform transforms that also involve translations, like converting global to local coordinates.
However, if we really want to understand why this is so fast, we might want to take a closer look at the numpy code itself.




