Matrix multiplication is an excellent example of an operation in Blender that is performed frequently and can benefit enormously from numpy. If you want to perform transformations like scaling or rotation on vertex coordinates, normals, or any other kind of vector, you will probably be using matrix x vector multiplication.
The Blender API has a mathutils module that provides convenient classes like Vector and Matrix, but if you are dealing with millions of vertices you are better off using the numpy package that comes bundled with Blender. Properties like vertex coordinates can easily be retrieved as numpy arrays (see this previous article), which can then be manipulated efficiently with all the available numpy functions.
In this article I'll explore some different implementations of matrix x vector multiplication and measure their performance on large arrays of vectors. The code for these experiments is available as tests/test_matrix_multiplication.py in this GitHub repository.
matrix multiplication
Multiplying a vector with a matrix is straightforward: to calculate an element in the result vector, we take the corresponding column in the matrix and calculate the dot product with the input vector, i.e. we multiply each pair of elements and sum the results.
For example, if we want to calculate the second element in the result vector, we take the second column in the matrix and multiply each element in the column with the corresponding element in the input vector and sum those together.
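To make this concrete, here is a small worked example; the matrix and vector values are made up purely for illustration:

```python
import numpy as np

# a throwaway 3x3 matrix and input vector, just for illustration
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=np.float32)
x = np.array([1, 2, 3], dtype=np.float32)

# second element of the result: dot product of the input vector
# with the second column of the matrix
y1 = a[0, 1] * x[0] + a[1, 1] * x[1] + a[2, 1] * x[2]
print(y1)  # 2*1 + 5*2 + 8*3 = 36.0
```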
pure python
An (almost) pure python implementation may look like this:
```python
import numpy as np

def multiply_vector_matrix(a, x):
    y = np.ndarray(3, dtype=np.float32)
    for k in range(3):
        y[k] = 0
        for j in range(3):
            y[k] += a[j, k] * x[j]
    return y
```
We could have used a python list for the input and result vectors and a list of lists for the matrix, but because we want to compare this to numpy functions later, we choose to use ndarray from the beginning, which also allows us to work with the 32 bit floats that Blender uses instead of doubles.
As you can see, the implementation follows the algorithm sketched in the previous section to the letter and contains a loop within a loop. As is to be expected, this implementation is slow: 100 thousand iterations take almost a quarter of a second, 0.2432s to be precise.
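The exact timing harness in the repository may differ; a minimal sketch of how such a measurement could be done with the standard library's timeit module (the matrix and vector here are placeholders):

```python
import timeit

import numpy as np

def multiply_vector_matrix(a, x):
    y = np.ndarray(3, dtype=np.float32)
    for k in range(3):
        y[k] = 0
        for j in range(3):
            y[k] += a[j, k] * x[j]
    return y

# placeholder inputs; any 3x3 matrix and 3-vector will do
a = np.identity(3, dtype=np.float32)
x = np.array([1, 2, 3], dtype=np.float32)

# time 100 thousand calls, mirroring the measurement above
elapsed = timeit.timeit(lambda: multiply_vector_matrix(a, x), number=100_000)
print(f"{elapsed:.4f}s")
```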
pure python with built-ins and list comprehension
As you might know, loops in python are slow, so why not use the built-in sum() function in combination with list comprehension to save us a loop?
```python
def multiply_matrix_vector_comprehension(a, x):
    y = np.ndarray(3, dtype=np.float32)
    for k in range(3):
        y[k] = sum(a[j, k] * x[j] for j in range(3))
    return y
```
The generator inside the sum() saves us a loop, but perhaps a bit surprisingly, this implementation is about twice as slow: 0.5005s for 100 thousand iterations. Apparently creating a generator for such a small number of elements (3) and calling a function is more expensive than just looping and calculating.
numpy
Now let's turn our attention to numpy. Numpy has a dot() function that does exactly what we want. Multiplying a single matrix with a single vector can be implemented like this:
```python
def multiply_matrix_vector_np_dot(a, x):
    return np.dot(a, x)
```
Yes, that is just a single function call, all the looping and calculating is implemented in C/C++ behind the scenes. Performance is therefore almost an order of magnitude faster: 0.0594s for 100 thousand vectors.
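For example, with a uniform scaling matrix (the values here are made up for the sketch):

```python
import numpy as np

# a scaling matrix that doubles every coordinate (illustrative)
a = np.identity(3, dtype=np.float32) * 2.0
v = np.array([1.0, 2.0, 3.0], dtype=np.float32)

y = np.dot(a, v)
print(y)  # [2. 4. 6.]
```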
Promising, but it wouldn't be enough if we wanted to work with tens of millions of vectors. Can we do better?
numpy v. stacks of vectors
In the previous implementation, if we wanted to multiply a matrix with a list of vectors we would have to call the function for each individual vector in turn, resulting in a lot of overhead from the function call itself and any loop surrounding it.
But numpy functions are way more powerful than that: they can figure out how to broadcast our 3x3 matrix to apply it to each individual vector in turn, and return the results as an array of vectors again. This saves a boatload of function calls and we also don't need a loop.
The implementation looks suspiciously like the previous one:
```python
def multiply_matrix_vector_array_np_dot(a, x):
    return np.dot(x, a)
```
This time however x should be an array of vectors, and if you look closely you will see that the arguments passed to np.dot() are reversed. This is necessary to match up the length of the individual vectors (3) with the height of the matrix. (We could have done this in many different ways, for example transposing the individual arrays, but this is the simplest way).
You might call this function something like this:
```python
a = ...  # some 3x3 matrix
x = np.array([[1, 0, 0], [2, 0, 0], [3, 0, 0], [4, 0, 0], ...])  # a long list of vectors
result = multiply_matrix_vector_array_np_dot(a, x)
```
The result array would have the same shape as x and hold all the transformed vectors.
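A quick sanity check, using random data made up for this sketch, that the broadcast call produces the same results as calling dot() once per vector:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.random((3, 3), dtype=np.float32)     # some 3x3 matrix
x = rng.random((1000, 3), dtype=np.float32)  # a batch of 1000 vectors

batched = np.dot(x, a)                             # one call for all vectors
per_vector = np.array([np.dot(v, a) for v in x])   # one call per vector

assert batched.shape == x.shape
assert np.allclose(batched, per_vector)
```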
If we do this for 100k vectors, we see that the speed increase is enormous compared to calling dot() for each individual vector: almost two orders of magnitude at 0.0004s. This easily allows us to scale up to millions of vectors.
is there more to gain?
Numpy also has an np.einsum() function that allows for a more descriptive way to denote what combination of multiplication and summation operations to perform on arrays. This is often used to perform multiplications of large, complex arrays (tensors) in the realm of machine learning, so maybe we can use this here too:
```python
def multiply_matrix_vector_array_np_einsum(a, x):
    return np.einsum("ij,jk->ik", x, a)
```
We are not going to cover this in any detail in this article, but if you look at the index notation you will see that it expects a list of i vectors with j components, and multiplies those with a matrix of j rows and k columns, resulting in a list of i vectors with k components. Because the index j is reused, einsum knows to calculate the dot product of each input vector with each column of the matrix. This may sound confusing if you are unfamiliar with this kind of notation, so this article might give you a head start.
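A quick check, again with random data invented for the example, that the einsum formulation agrees with np.dot():

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((3, 3), dtype=np.float32)    # some 3x3 matrix
x = rng.random((100, 3), dtype=np.float32)  # a batch of vectors

# "ij,jk->ik": i vectors of j components times a j-by-k matrix
result = np.einsum("ij,jk->ik", x, a)
assert np.allclose(result, np.dot(x, a))
```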
Nevertheless, although promising, the result disappoints: 0.0071s for 100k vectors, about an order of magnitude slower than np.dot().
The final option I tried was storing the result vectors in the original array, i.e. calculating everything in place, to see whether saving the costly allocation of a new array made any difference. Many numpy functions take an out argument to select a destination array, and the code might look like this:
```python
def multiply_vector_matrix_array_np_dot_in_place(a, x):
    x = np.dot(x, a, out=x)
    return x
```
Unfortunately, although it might save memory, performance is exactly the same as for a regular call (within the margin of error).
results
To get a more thorough understanding, I have measured the elapsed time for different numbers of vectors, as shown in the table below.
| implementation | color | 100000 | 200000 | 500000 | 1000000 | 2000000 | 5000000 | 10000000 |
|---|---|---|---|---|---|---|---|---|
| naive | red | 0.2432 | 0.4873 | 1.0975 | 2.2037 | 4.3647 | 10.9972 | |
| comprehension | blue | 0.5005 | 1.0002 | 2.5054 | 5.0688 | 10.0490 | 24.8236 | |
| np_dot | yellow | 0.0594 | 0.1177 | 0.2910 | 0.5853 | 1.1621 | 2.9355 | |
| array_np_dot | green | 0.0004 | 0.0007 | 0.0011 | 0.0077 | 0.0069 | 0.0168 | 0.0323 |
| array_np_einsum | orange | 0.0071 | 0.0089 | 0.0200 | 0.0401 | 0.0814 | 0.2042 | 0.4025 |
| array_np_dot_in_place | cyan | 0.0003 | 0.0007 | 0.0015 | 0.0033 | 0.0097 | 0.0227 | 0.0442 |
(Elapsed times will differ between computers; the ten million vector results are missing for the per-vector function calls because I don't have that kind of patience 😁. Color refers to the lines in the graphs below.)
This is a bit easier to interpret when we graph those results:
All implementations scale linearly with the number of vectors, but the versions that make use of numpy's broadcasting capabilities, i.e. need only a single function call, are so much faster that they are nearly indistinguishable from the x-axis.
Those performance differences are a bit easier to see if we transform all data to the number of seconds it takes to perform a million matrix x vector operations (note that the vertical axis is logarithmic).
The flat lines are again indicative of the linear behavior of the operations: the time to perform a single matrix x vector multiplication hardly changes, regardless of the total number of vectors (except for slight random variations at low numbers of vectors).
conclusion
Python might be slow, but we can still perform computationally expensive operations on large arrays with blazing speed if we leverage the power of numpy. Finding the correct function in the documentation might be a challenge, but because we don't need python loops, save on individual function calls, and benefit from highly optimized implementations, we can easily manipulate millions of vectors.
final remarks
Newer versions of numpy also have an np.matvec() function that I did not check, because Blender 5.0 bundles a version of numpy (1.26.4) that doesn't have this function.
Even though einsum() might be slower, in many situations its expressive power is very convenient. Some examples of what can be done with it can be found here.
No AI was hurt in the writing of this article; all text and research was created using old skool wetware.



