Matrix multiplication is an excellent example of an operation in Blender that is performed frequently and can benefit enormously from numpy. If you want to perform transformations like scaling or rotation on vertex coordinates, normals, or any other kind of vector, you will probably be using matrix x vector multiplication.
The Blender API has a mathutils module that provides convenient classes like Vector and Matrix, but if you are dealing with millions of vertices you are better off using the numpy package that comes bundled with Blender. Properties like vertex coordinates can easily be retrieved as numpy arrays (see this previous article), which can then be manipulated efficiently with all the available numpy functions.
In this article I'll explore some different implementations of matrix x vector multiplication and measure their performance on large arrays of vectors. The code for these experiments is available as tests/test_matrix_multiplication.py in this GitHub repository.
matrix multiplication
Multiplying a vector with a matrix is straightforward: to calculate an element in the result vector, we take the corresponding column in the matrix and calculate the dot product with the input vector, i.e. we multiply each pair of elements and sum the results.
For example, if we want to calculate the second element in the result vector, we take the second column in the matrix and multiply each element in the column with the corresponding element in the input vector and sum those together.
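To make this concrete, here is a small worked example; the matrix and vector values are made up purely for illustration:

```python
import numpy as np

# a throwaway 3x3 matrix and input vector, just for illustration
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=np.float32)
x = np.array([1, 2, 3], dtype=np.float32)

# second element of the result: dot product of the input vector
# with the second column of the matrix
y1 = a[0, 1] * x[0] + a[1, 1] * x[1] + a[2, 1] * x[2]
print(y1)  # 2*1 + 5*2 + 8*3 = 36.0
```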
pure python
An (almost) pure python implementation may look like this:
```python
import numpy as np

def multiply_vector_matrix(a, x):
    y = np.ndarray(3, dtype=np.float32)
    for k in range(3):
        y[k] = 0
        for j in range(3):
            y[k] += a[j, k] * x[j]
    return y
```
We could have used a python list for the input and result vectors and a list of lists for the matrix, but because we want to compare this to numpy functions later, we choose to use ndarray from the beginning, which also allows us to work with the 32 bit floats that Blender uses instead of doubles.
As you can see, the implementation follows the algorithm sketched in the previous section to the letter and contains a loop within a loop. As is to be expected, this implementation is slow: 100 thousand iterations take almost a quarter of a second, 0.2432s to be precise.
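The exact timing harness in the repository may differ; a minimal sketch of how such a measurement could be done with the standard library's timeit module (the matrix and vector here are placeholders):

```python
import timeit

import numpy as np

def multiply_vector_matrix(a, x):
    y = np.ndarray(3, dtype=np.float32)
    for k in range(3):
        y[k] = 0
        for j in range(3):
            y[k] += a[j, k] * x[j]
    return y

# placeholder inputs; any 3x3 matrix and 3-vector will do
a = np.identity(3, dtype=np.float32)
x = np.array([1, 2, 3], dtype=np.float32)

# time 100 thousand calls, mirroring the measurement above
elapsed = timeit.timeit(lambda: multiply_vector_matrix(a, x), number=100_000)
print(f"{elapsed:.4f}s")
```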
pure python with built-ins and list comprehension
As you might know, loops in python are slow, so why not use the built-in sum() function in combination with list comprehension to save us a loop?
```python
def multiply_matrix_vector_comprehension(a, x):
    y = np.ndarray(3, dtype=np.float32)
    for k in range(3):
        y[k] = sum(a[j, k] * x[j] for j in range(3))
    return y
```
The generator inside the sum() saves us a loop, but perhaps a bit surprisingly, this implementation is about twice as slow: 0.5005s for 100 thousand iterations. Apparently creating a generator for such a small number of elements (3) and calling a function is more expensive than just looping and calculating.
numpy
Now let's turn our attention to numpy. Numpy has a dot() function that does exactly what we want. Multiplying a single matrix with a single vector can be implemented like this:
```python
def multiply_matrix_vector_np_dot(a, x):
    return np.dot(a, x)
```
Yes, that is just a single function call, all the looping and calculating is implemented in C/C++ behind the scenes. Performance is therefore almost an order of magnitude faster: 0.0594s for 100 thousand vectors.
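For example, with a uniform scaling matrix (the values here are made up for the sketch):

```python
import numpy as np

# a scaling matrix that doubles every coordinate (illustrative)
a = np.identity(3, dtype=np.float32) * 2.0
v = np.array([1.0, 2.0, 3.0], dtype=np.float32)

y = np.dot(a, v)
print(y)  # [2. 4. 6.]
```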
Promising, but it wouldn't be enough if we wanted to work with tens of millions of vectors. Can we do better?
numpy v. stacks of vectors
In the previous implementation, if we wanted to multiply a matrix with a list of vectors we would have to call the function for each individual vector in turn, resulting in a lot of overhead from the function call itself and any loop surrounding it.
But numpy functions are way more powerful than that: they can figure out how to broadcast our 3x3 matrix to apply it to each individual vector in turn, and return the results as an array of vectors again. This saves a boatload of function calls and we also don't need a loop.
The implementation looks suspiciously like the previous one:
```python
def multiply_matrix_vector_array_np_dot(a, x):
    return np.dot(x, a)
```
This time however x should be an array of vectors, and if you look closely you will see that the arguments passed to np.dot() are reversed. This is necessary to match up the length of the individual vectors (3) with the height of the matrix. (We could have done this in many different ways, for example transposing the individual arrays, but this is the simplest way).
You might call this function something like this:
```python
a = ...  # some 3x3 matrix
x = np.array([[1, 0, 0], [2, 0, 0], [3, 0, 0], [4, 0, 0], ...])  # a long list of vectors
result = multiply_matrix_vector_array_np_dot(a, x)
```
The result array would have the same shape as x and hold all the transformed vectors.
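A quick sanity check, using random data made up for this sketch, that the broadcast call produces the same results as calling dot() once per vector:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.random((3, 3), dtype=np.float32)     # some 3x3 matrix
x = rng.random((1000, 3), dtype=np.float32)  # a batch of 1000 vectors

batched = np.dot(x, a)                             # one call for all vectors
per_vector = np.array([np.dot(v, a) for v in x])   # one call per vector

assert batched.shape == x.shape
assert np.allclose(batched, per_vector)
```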
If we do this for 100k vectors, we see that the speed increase is enormous compared to calling dot() for each individual vector: almost two orders of magnitude at 0.0004s. This easily allows us to scale up to millions of vectors.
is there more to gain?
Numpy also has an np.einsum() function that allows for a more descriptive way to denote what combination of multiplication and summation operations to perform on arrays. This is often used to perform multiplications of large, complex arrays (tensors) in the realm of machine learning, so maybe we can use this here too:
```python
def multiply_matrix_vector_array_np_einsum(a, x):
    return np.einsum("ij,jk->ik", x, a)
```
We are not going to cover this in any detail in this article, but if you look at the index notation you will see that it expects a list of i vectors with j components, and multiplies those with a matrix of j rows and k columns, resulting in a list of i vectors with k components. Because the index j is reused, einsum knows to calculate the dot product of each input vector with each column of the matrix. This may sound confusing if you are unfamiliar with this kind of notation, so this article might give you a head start.
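A quick check, again with random data invented for the example, that the einsum formulation agrees with np.dot():

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((3, 3), dtype=np.float32)    # some 3x3 matrix
x = rng.random((100, 3), dtype=np.float32)  # a batch of vectors

# "ij,jk->ik": i vectors of j components times a j-by-k matrix
result = np.einsum("ij,jk->ik", x, a)
assert np.allclose(result, np.dot(x, a))
```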
Nevertheless, although promising, the result disappoints: 0.0071s for 100k vectors, about an order of magnitude slower than np.dot().
The final option I tried was storing the result vectors in the original array, i.e. calculating everything in place, to see whether saving the costly allocation of a new array made any difference. Many numpy functions take an out argument to select a destination array, and the code might look like this:
```python
def multiply_vector_matrix_array_np_dot_in_place(a, x):
    x = np.dot(x, a, out=x)
    return x
```
Unfortunately, although it might save memory, performance is exactly the same as for a regular call (within the margin of error).
results
To get a more thorough understanding, I have measured the elapsed time for different numbers of vectors, as shown in the table below.
| implementation | color | 100000 | 200000 | 500000 | 1000000 | 2000000 | 5000000 | 10000000 |
|---|---|---|---|---|---|---|---|---|
| naive | red | 0.2432 | 0.4873 | 1.0975 | 2.2037 | 4.3647 | 10.9972 | |
| comprehension | blue | 0.5005 | 1.0002 | 2.5054 | 5.0688 | 10.0490 | 24.8236 | |
| np_dot | yellow | 0.0594 | 0.1177 | 0.2910 | 0.5853 | 1.1621 | 2.9355 | |
| array_np_dot | green | 0.0004 | 0.0007 | 0.0011 | 0.0077 | 0.0069 | 0.0168 | 0.0323 |
| array_np_einsum | orange | 0.0071 | 0.0089 | 0.0200 | 0.0401 | 0.0814 | 0.2042 | 0.4025 |
| array_np_dot_in_place | cyan | 0.0003 | 0.0007 | 0.0015 | 0.0033 | 0.0097 | 0.0227 | 0.0442 |
(Elapsed times will differ between computers; the ten million vector results are missing for the per-vector function calls because I don't have that kind of patience 😁. Color refers to the lines in the graphs below.)
This is a bit easier to interpret when we graph those results:
All implementations scale linearly with the number of vectors, but the versions that make use of numpy's broadcasting capabilities, i.e. need only a single function call, are so much faster that they are nearly indistinguishable from the x-axis.
Those performance differences are a bit easier to see if we transform all data to the number of seconds it takes to perform a million matrix x vector operations (note that the vertical axis is logarithmic).
The flat lines are again indicative of the linear behavior of the operations: the time to perform a single matrix x vector multiplication hardly changes, regardless of the total number of vectors (except for slight random variations at low numbers of vectors).
conclusion
Python might be slow, but we can still perform computationally expensive operations on large arrays with blazing speed if we leverage the power of numpy. Finding the correct function in the documentation might be a challenge, but because we don't need python loops, save on individual function calls, and benefit from highly optimized implementations, we can easily manipulate millions of vectors.
final remarks
Newer versions of numpy also have an np.matvec() function that I did not check, because Blender 5.0 bundles a version of numpy (1.26.4) that doesn't have this function.
Even though einsum() might be slower, in many situations its expressive power is very convenient. Some examples of what can be done with it can be found here.
No AI was hurt in the writing of this article; all text and research was created using old skool wetware.



