Welcome to the glorious age of vibe-tuning!

Colin P. McNally

vibe-tuning: (noun) A research software engineering practice wherein a third-party library is re-optimized for a specific application and/or compute architecture by an LLM coding agent.*

While solving a computational problem, you are often using a math library that provides sophisticated, verified implementations of the computations that take up most of the run time for your specific problem. But to be worth publishing, a library needs wide applicability across many problems. Thus, in many cases, someone else has already made the choice between generality and performance for you.

The fun** thing is that it’s now sometimes ridiculously easy to re-introduce specialization into the library and win back performance, once you have pinned down the specific problem you want to solve.

The basic workflow for vibe-tuning goes something like this:

  1. Prompt the coding agent to download the source of the library and build a framework for testing the unmodified version against optimized versions.
  2. Define the restricted problem of interest, and prompt the agent to examine the library’s source code and suggest optimizations and parameters (including implicit ones such as loop structures) that might help given the problem and the target architecture. Consider prompting it to use profiling tools as well.
  3. Pick something reasonable from the agent’s suggestions and prompt it to work through them. Be aware that code transformations aren’t commutative (it may make sense to apply the obvious optimizations before auto-tuning).
  4. Rejoice in the glory of vibe-tuned performance!

As an example, I tried nearest-neighbour searches in 15-dimensional space with the nanoflann library. Vibe-tuning with Claude Code (Sonnet 4.6) gave a 1.43× speedup over the stock code (see the appendix). How well vibe-tuning works depends on how well the library is already optimized for your specific problem class and architecture. I also tried to speed up some specific prime-length FFTs in FFTW, and certain matrix system solves (large circuit-simulation matrices) with the SuiteSparse KLU solver, but didn’t get a meaningful speedup.

What happens next? The next coding-agent training runs pick up this page and start vibe-tuning before you even ask. Then you’ll need to start guarding against your calculations running on unverified versions of libraries without your noticing. So… that’ll be fun.**

Notes

— 2026-03

Appendix: nanoflann patch for d=15 float64 (AVX2/FMA)

Description generated by Claude Sonnet 4.6: The entire optimisation is confined to nanoflann.hpp. No changes to application or benchmark code are required. The patch adds approximately 60 lines to the L2_Simple_Adaptor struct: a new static evalMetricPtr that operates on a contiguous pointer pair, and a modified evalMetric that gathers the 15 database-point coordinates onto the stack before calling it. Both are guarded so that all other scalar types and dimensionalities fall through to the original loop unchanged.

--- src/base/nanoflann.hpp
+++ src/variant-simd-d15/nanoflann.hpp
@@ -46,6 +46,9 @@
 #include <algorithm>
 #include <array>
+#if defined(__AVX2__) && defined(__FMA__)
+#  include <immintrin.h>
+#endif
 #include <atomic>
@@ -618,9 +618,55 @@
     {
     }

+    /* AVX2+FMA kernel: a[0..14] and b[0..14] are contiguous.
+     * Processes 4+4+4 doubles in __m256d registers, 3-element scalar tail. */
+    static DistanceType evalMetricPtr(const T* a, const T* b, size_t size)
+    {
+#if defined(__AVX2__) && defined(__FMA__)
+        if constexpr (std::is_same<T, double>::value)
+        {
+            if (size == 15)
+            {
+                __m256d d0 = _mm256_sub_pd(_mm256_loadu_pd(a),
+                                           _mm256_loadu_pd(b));
+                __m256d d1 = _mm256_sub_pd(_mm256_loadu_pd(a+4),
+                                           _mm256_loadu_pd(b+4));
+                __m256d d2 = _mm256_sub_pd(_mm256_loadu_pd(a+8),
+                                           _mm256_loadu_pd(b+8));
+                __m256d s = _mm256_fmadd_pd(d0, d0,
+                                _mm256_fmadd_pd(d1, d1,
+                                    _mm256_mul_pd(d2, d2)));
+                __m128d lo  = _mm256_castpd256_pd128(s);
+                __m128d hi  = _mm256_extractf128_pd(s, 1);
+                __m128d sum = _mm_hadd_pd(_mm_add_pd(lo, hi),
+                                         _mm_add_pd(lo, hi));
+                double r    = _mm_cvtsd_f64(sum);
+                double t12 = a[12]-b[12]; r += t12*t12;
+                double t13 = a[13]-b[13]; r += t13*t13;
+                double t14 = a[14]-b[14]; r += t14*t14;
+                return static_cast<DistanceType>(r);
+            }
+        }
+#endif
+        DistanceType result = DistanceType();
+        for (size_t i = 0; i < size; ++i)
+        { const T diff = a[i]-b[i]; result += diff*diff; }
+        return result;
+    }
+
     DistanceType evalMetric(
         const T* a, const IndexType b_idx, size_t size) const
     {
+#if defined(__AVX2__) && defined(__FMA__)
+        if constexpr (std::is_same<T, double>::value)
+        {
+            if (size == 15)
+            {
+                double b[15];
+                for (size_t i = 0; i < 15; ++i)
+                    b[i] = data_source.kdtree_get_pt(b_idx, i);
+                return evalMetricPtr(a, b, 15);
+            }
+        }
+#endif
         DistanceType result = DistanceType();
         for (size_t i = 0; i < size; ++i)
         {

* As with many English terms, other meanings may exist; and as with many ideas, this one may have been described before.

** Other definitions of fun are available.