Welcome to the glorious age of vibe-tuning!

Colin P. McNally

vibe-tuning: (noun) A research software engineering practice wherein a third-party library is re-optimized for a specific application and/or compute architecture by an LLM coding agent.*

While solving a computational problem, you are often using a math library that provides sophisticated, verified implementations of the computations that take up most of the run time for your specific problem. But to be worth publishing, a library needs wide applicability across many problems. Thus, in many cases, someone else has already made the choice between generality and performance for you.

The fun** thing is that it’s now sometimes ridiculously easy to re-introduce specialization into the library and win back performance, once you have pinned down the specific problem you want to solve.

The basic workflow for vibe-tuning goes something like this:

  1. Prompt the coding agent to download the source of the library and build a framework for testing the unmodified version against optimized versions.
  2. Define the restricted problem of interest, and prompt the agent to examine the library’s source code and suggest optimizations and parameters (including implicit ones such as loop structures) that might help given the problem and the target architecture. Consider prompting it to use profiling tools as well.
  3. Pick something reasonable from the agent’s suggestions and prompt it to work through them. Be aware that code transformations aren’t commutative (it may make sense to apply the obvious optimizations before auto-tuning).
  4. Rejoice in the glory of vibe-tuned performance!

As an example, I tried nearest-neighbour searches in 15-dimensional space with the nanoflann library. Vibe-tuning with Claude Code (Sonnet 4.6) gave a 1.43× speedup over the stock code (see the appendix). How well vibe-tuning works depends on how well the library is already optimized for your specific problem class and architecture. I also tried to speed up some specific prime-length FFTs in FFTW, and certain matrix system solves (large circuit-simulation matrices) with the SuiteSparse KLU solver, but didn’t get a meaningful speedup.

What happens next? The next coding-agent training runs pick up this page and start vibe-tuning before you even ask. Then you’ll need to start guarding against your calculations running on unverified versions of libraries without your noticing. So… that’ll be fun.**

Notes

— 2026-03

Appendix: nanoflann patch for d=15 float64 (AVX2/FMA)

Description generated by Claude Sonnet 4.6: The entire optimisation is confined to nanoflann.hpp. No changes to application or benchmark code are required. The patch adds approximately 60 lines to the L2_Simple_Adaptor struct: a new static evalMetricPtr that operates on a contiguous pointer pair, and a modified evalMetric that gathers the 15 database-point coordinates onto the stack before calling it. Both are guarded so that all other scalar types and dimensionalities fall through to the original loop unchanged.

--- src/base/nanoflann.hpp
+++ src/variant-simd-d15/nanoflann.hpp
@@ -46,6 +46,9 @@
 #include <algorithm>
 #include <array>
+#if defined(__AVX2__) && defined(__FMA__)
+#  include <immintrin.h>
+#endif
 #include <atomic>
@@ -618,9 +618,55 @@
     {
     }

+    /* AVX2+FMA kernel: a[0..14] and b[0..14] are contiguous.
+     * Processes 4+4+4 doubles in __m256d registers, 3-element scalar tail. */
+    static DistanceType evalMetricPtr(const T* a, const T* b, size_t size)
+    {
+#if defined(__AVX2__) && defined(__FMA__)
+        if constexpr (std::is_same<T, double>::value)
+        {
+            if (size == 15)
+            {
+                __m256d d0 = _mm256_sub_pd(_mm256_loadu_pd(a),
+                                           _mm256_loadu_pd(b));
+                __m256d d1 = _mm256_sub_pd(_mm256_loadu_pd(a+4),
+                                           _mm256_loadu_pd(b+4));
+                __m256d d2 = _mm256_sub_pd(_mm256_loadu_pd(a+8),
+                                           _mm256_loadu_pd(b+8));
+                __m256d s = _mm256_fmadd_pd(d0, d0,
+                                _mm256_fmadd_pd(d1, d1,
+                                    _mm256_mul_pd(d2, d2)));
+                __m128d lo  = _mm256_castpd256_pd128(s);
+                __m128d hi  = _mm256_extractf128_pd(s, 1);
+                __m128d sum = _mm_hadd_pd(_mm_add_pd(lo, hi),
+                                         _mm_add_pd(lo, hi));
+                double r    = _mm_cvtsd_f64(sum);
+                double t12 = a[12]-b[12]; r += t12*t12;
+                double t13 = a[13]-b[13]; r += t13*t13;
+                double t14 = a[14]-b[14]; r += t14*t14;
+                return static_cast<DistanceType>(r);
+            }
+        }
+#endif
+        DistanceType result = DistanceType();
+        for (size_t i = 0; i < size; ++i)
+        { const T diff = a[i]-b[i]; result += diff*diff; }
+        return result;
+    }
+
     DistanceType evalMetric(
         const T* a, const IndexType b_idx, size_t size) const
     {
+#if defined(__AVX2__) && defined(__FMA__)
+        if constexpr (std::is_same<T, double>::value)
+        {
+            if (size == 15)
+            {
+                double b[15];
+                for (size_t i = 0; i < 15; ++i)
+                    b[i] = data_source.kdtree_get_pt(b_idx, i);
+                return evalMetricPtr(a, b, 15);
+            }
+        }
+#endif
         DistanceType result = DistanceType();
         for (size_t i = 0; i < size; ++i)
         {

* As with many English terms, other meanings may exist; and as with many ideas, this one may have been described before.

** Other definitions of fun are available.