Working on a high-performance Web accelerator
I have been working on Varnish Cache for a decade, and from day one former colleagues of mine persuaded themselves that I was dealing with arcane performance topics on a daily basis. Explaining that this was not the case puzzled them for a while.
As it turns out, Varnish Cache is a very small project, and yet it offers plenty of variety. Working on Varnish means that you can work on:
- networking sockets
- networking protocols
- multi-threading
- IPC of various kinds, including shared memory
- terminal-based interfaces (TUI applications)
- a programming language compiler (VCL)
- a programming language interpreter (VTC)
- a homemade test framework
- libraries
- gzip
And that is only what comes off the top of my head. There is so much going on for such a small project that, amazingly, after a decade there are still several parts of the code base I have never worked on, some of which I have never even read.
I eventually worked on a performance bottleneck that almost resulted in a 10x speed increase, significantly reducing the time it would take me to perform certain tasks. It also turned out to be very similar to the first optimization I remember making in a professional setting.
Working on a low-performance Web application
Almost two decades ago, I ported a PHP application to Java because I had joined a company doing everything in Java, and nobody there was competent to work on this small internal PHP application. I was tasked with rewriting it from scratch in Java, which would also serve as training to familiarize myself with the stack I would use next, on a regular Java project.
The result was a slow, slow, slow mess of an application, universally hated by everyone who had to use it, which was pretty much everyone in the company. There were several reasons why it was so slow, and this was my opportunity to discover the art of profiling.
For the first problem I solved, I decided to blame Java for being magic and not warning me about the dangers of my ways.
The exception to the simplicity rule
Back then, Java 5 was brand new, and it came with “amazing” features like for-each loops, generics, and other quality-of-life improvements. The language was verbose and complicated, but it was also mostly unambiguous: it was almost always possible to figure out immediately what was what in a code snippet.
Despite being a complicated language, as far as I could observe, javac had always been a simple piece of the tool chain. It ingested Java code and turned it into Java bytecode in such a way that most of the bytecode I have seen had a 1-1 correspondence to its Java code.
The beauty of javac’s straightforward translation of the Java code, without extensive optimization passes, was that the runtime would instead be in charge of optimizations, taking advantage of the host system capabilities. In other words, javac could remain simple because the complexity was pushed to the edge. My assumption (which I’m not attempting to verify because at this point I no longer care) is that aggressive optimizations at compile time could have prevented better optimizations at run time, but I digress.
Isolating the performance hot spot
For the first performance problem I solved, I wrote a reproducer for the naive code that ended up eating a significant amount of cycles:
public class Concat {
    public static void main(String... args) {
        String cat = "";
        String sep = "";
        for (String arg : args) {
            cat += sep + arg;
            sep = ", ";
        }
        System.out.println(cat);
    }
}
This program loops over its command-line arguments to build a comma-separated representation of them. The loop is also branch-less: the separator is handled as a string that starts out empty, instead of, for example, a boolean flag tested on every iteration.
Let’s compile it to bytecode for Java 5:
$ javac -source 5 -target 5 Concat.java
warning: [options] bootstrap class path not set in conjunction with -source 5
error: Source option 5 is no longer supported. Use 8 or later.
error: Target option 5 is no longer supported. Use 8 or later.
Unfortunately this is no longer possible with OpenJDK 21, so let’s try again with Java 8 as suggested by the error message:
$ javac -source 8 -target 8 Concat.java
warning: [options] bootstrap class path not set in conjunction with -source 8
warning: [options] source value 8 is obsolete and will be removed in a future release
warning: [options] target value 8 is obsolete and will be removed in a future release
warning: [options] To suppress warnings about obsolete options, use -Xlint:-options.
4 warnings
And let’s just make sure that it works as advertised:
$ javac Concat.java
$ java Concat a b c
a, b, c
The curious case of the magic bytecode
Looking at the loop, the problem was not obvious. It looked straightforward, and nothing hinted at ill-advised operations in a tight loop. I eventually decided to look at the bytecode, because that made me look competent, or at least busy and actively working on the problem.
It’s still trivial today with OpenJDK:
$ javap -c Concat.class
Compiled from "Concat.java"
public class Concat {
  public Concat();
    Code:
       0: aload_0
       1: invokespecial #1   // Method Object."<init>":()V
       4: return

  public static void main(String...);
    Code:
       0: ldc           #7   // String
       2: astore_1
       3: ldc           #7   // String
       5: astore_2
       6: aload_0
       7: astore_3
       8: aload_3
       9: arraylength
      10: istore        4
      12: iconst_0
      13: istore        5
      15: iload         5
      17: iload         4
      19: if_icmpge     61
      22: aload_3
      23: iload         5
      25: aaload
      26: astore        6
      28: new           #9   // class StringBuilder
      31: dup
      32: invokespecial #11  // Method StringBuilder."<init>":()V
      35: aload_1
      36: invokevirtual #12  // Method StringBuilder.append:(LString;)LStringBuilder;
      39: aload_2
      40: invokevirtual #12  // Method StringBuilder.append:(LString;)LStringBuilder;
      43: aload         6
      45: invokevirtual #12  // Method StringBuilder.append:(LString;)LStringBuilder;
      48: invokevirtual #16  // Method StringBuilder.toString:()LString;
      51: astore_1
      52: ldc           #20  // String ,
      54: astore_2
      55: iinc          5, 1
      58: goto          15
      61: getstatic     #22  // Field System.out:Ljava/io/PrintStream;
      64: aload_1
      65: invokevirtual #28  // Method java/io/PrintStream.println:(LString;)V
      68: return
}
I was really taken by surprise when I realized that the bytecode did not look at all like a 1-1 translation of the Java code. My loop (much larger back then) was full of objects and instructions that appeared out of nowhere. And I really hate magic. I can’t reason about magic; I need my code to do what it’s told and not do anything that wasn’t explicitly said, otherwise I can’t build a reliable mental model.
To help myself reason about this bytecode, I manually decompiled it:
public class Concat {
    public static void main(String... args) {
        String cat = "";
        String sep = "";
        for (String arg : args) {
            StringBuilder tmp = new StringBuilder();
            tmp.append(cat);
            tmp.append(sep);
            tmp.append(arg);
            cat = tmp.toString();
            sep = ", ";
        }
        System.out.println(cat);
    }
}
And from then on the problem was obvious. The ‘+’ operator for the String class was backed by a StringBuilder. I’m saying “was” here because OpenJDK uses a different backend for string concatenation in Java 21. Because javac makes a straightforward translation of the Java code, we end up with a lot of churn inside the tight loop, creating and discarding one StringBuilder per iteration.
In my performance problem, I remember seeing several string builders being created and deleted in the same loop, as its body was doing a lot more. This was the occasion to discover that Java strings are immutable and what that means in practice. One conclusion I drew was that the runtime was not able to optimize this case at all. Even with mechanisms like generational garbage collectors, these short-lived string builders slowed the application down to a crawl. I have no idea what OpenJDK 21 is capable of these days, and I will probably not find out.
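To make the crawl concrete, here is a minimal micro-benchmark sketch. The class name and sizes are mine, and without a proper harness like JMH the numbers are only indicative, but it is enough to show that the += loop does not scale linearly:

public class ConcatCrawl {
    // Hypothetical micro-benchmark: timing the += pattern at two sizes
    // exposes its quadratic behavior, since every iteration copies the
    // whole accumulated string into a fresh builder.
    static long time(int n) {
        long t0 = System.nanoTime();
        String cat = "";
        for (int i = 0; i < n; i++) {
            cat += i + ","; // one StringBuilder created and discarded here
        }
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("n=50000:  " + time(50_000) + "ms");
        // Doubling n roughly quadruples the time: O(n^2), not O(n).
        System.out.println("n=100000: " + time(100_000) + "ms");
    }
}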
Time for performance wizardry
Seeing what the ‘+’ and ‘+=’ operators were doing to the generated bytecode, the solution was rather simple. If string concatenation is backed by a string builder, we can avoid the needless churn by moving the string builder out of the loop:
public class Concat {
    public static void main(String... args) {
        StringBuilder cat = new StringBuilder();
        String sep = "";
        for (String arg : args) {
            cat.append(sep);
            cat.append(arg);
            sep = ", ";
        }
        System.out.println(cat);
    }
}
And with a similar change in the code base, this hot spot no longer registered in the profiler. The remaining hot spots had similar churn problems that became obvious once I learned to identify them. From then on the pattern was burned into my brain, and I would quickly spot it and act immediately to prevent it from turning into a bottleneck.
Spotting this pattern in Varnish
Fast forward many years (which, as of today, is also a few years ago): I find myself in a delicate position, trying to understand a dramatic performance problem observed by a customer. Varnish is performing exceptionally poorly, and they can prove it reliably and repeatedly. It’s so bad that I end up traveling to work on it in person, on site with them.
Every run of their benchmark resulted in more than 20GB of logs, and it took me forever to collect data. I was using my own personal project varnishplot to automate the creation of latency graphs, and it would take me forever to conclude that Varnish was reporting very low processing times. Being very attentive to the pitfall I had first run into as a Java developer, I had already noticed the likelihood of a similar occurrence in Varnish logs two years prior, when I initiated my work on varnishplot, and on the spot I decided to have a shot at it.
The Varnish Shared Memory Log
To summarize briefly how logs work in Varnish, on one end we have varnishd creating a memory-mapped file per shared memory segment. On the other end we have programs that can map the same files in memory. Addressing them via the file system is how they become shared memory segments. The particular segment I am interested in is called VSL for Varnish Shared Log, also referred to as shared memory log, or shmlog.
The VSL segment is itself split into several parts inside which actual logs are written. Logs have a structured binary representation, and varnishd will cycle through segments as it writes to them. Inside varnishd there can be a large number of threads logging concurrently, so long-running tasks log into a local buffer and ideally flush everything at once at the end of the task.
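As an illustration only (this is not the varnishd implementation, and every name below is made up), the buffering pattern can be sketched in Java terms: each task accumulates its records privately and takes the shared lock once per task rather than once per record:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of per-task log buffering: workers batch records
// locally and publish them under a single lock acquisition per task.
public class BatchedLogger {
    private final ReentrantLock publishLock = new ReentrantLock();
    private final List<String> sharedLog = new ArrayList<>();

    // Each task builds its records in a buffer private to the thread.
    public void runTask(List<String> work) {
        List<String> local = new ArrayList<>();
        for (String item : work) {
            local.add("processed " + item); // no contention here
        }
        publishLock.lock(); // one lock round-trip per task, not per record
        try {
            sharedLog.addAll(local);
        } finally {
            publishLock.unlock();
        }
    }
}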
The circular nature of the live VSL means that consumers like varnishlog or varnishncsa may be overrun by varnishd. Grouping transactions increases the risk of being overtaken. As a result, VSL consumers may make copies of log records to keep them available until a complete transaction or transaction group is ready to be processed. In terms of coordination, varnishd threads use a mutex to publish log records, and consumers synchronize in wait-free fashion with memory barriers.
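The exact synchronization is beyond the scope of this post, but the consumer side of such a scheme can be sketched like this (a made-up ring of fixed-size records, not the VSL layout): copy the record out, then re-check the writer’s position; if the writer may have lapped us, discard the copy:

import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical overrun-detecting ring reader, loosely inspired by the problem
// VSL consumers face; this is not the actual VSL implementation or format.
public class RingReader {
    private final long[] ring;        // one record per slot, for simplicity
    private final AtomicLong written; // total records published by the writer

    public RingReader(long[] ring, AtomicLong written) {
        this.ring = ring;
        this.written = written;
    }

    /** Copies the record at logical position pos, or returns null on overrun. */
    public Long read(long pos) {
        long copy = ring[(int) (pos % ring.length)]; // copy out of the ring
        VarHandle.loadLoadFence(); // order the copy before the re-check below
        // If the writer has advanced to within one lap of pos, our slot may
        // have been rewritten while we were reading it: discard the copy.
        if (written.get() - pos >= ring.length) {
            return null; // overrun: the consumer must resynchronize
        }
        return copy;
    }
}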
Log consumers can also feed from regular files, written by varnishlog -w. This offers the ability to perform queries and filtering on structured binary logs at rest. This is essentially how I was processing logs from benchmark runs to plot latency histograms.
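For instance, reading such a file back with a query might look like this (the file name and the query itself are mine, purely for illustration):
$ varnishlog -r benchmark.bin -q 'Timestamp:Resp[2] > 1.0'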
That should be it for the short digression; back to the problem now.
Pushing the cursor
Log consumption itself is abstracted behind a cursor interface (VSLC, or Varnish Shared Log Cursor). There is a VSM (Varnish Shared Memory) cursor for live logs and a file cursor for VSL files written to disk. The churn pattern described in the Java anecdote does not appear as clearly as in the Java bytecode, but I had started having my doubts two years before my benchmark woes, and after asking the author of the current VSL implementation about it a dozen times, I was fairly certain that the same scenario was happening.
The sequence of events is different, but the pattern is actually quite similar:
- the file cursor opens a file
- it reads data into a local buffer
- it parses log records
- it makes copies of records that must outlive the buffer
- once a transaction group is processed, the copies are freed
The thing that was bothering me was the need to allocate copies for logs at rest, so I made the hypothesis that regular files could be mapped into memory to work with the records directly in place.
In other words, this would take the best of both worlds: working “live” on a memory buffer like the VSM cursor does, with no fear of being overtaken, just like the file cursor. As an added bonus, the code for the mmap cursor ended up being the simplest of the three. A simple design is often a good foundation for good performance.
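To illustrate the idea only (the real VSL format and VSLC API are richer, and this sketch invents its own record framing), here is what an mmap-backed cursor can look like in Java terms, yielding each record as a view into the mapping instead of a copy:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical mmap-backed cursor over a file of length-prefixed records.
// The record format here is invented; the real VSL format is different.
public class MmapCursor implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer map;

    public MmapCursor(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.READ);
        // The whole file becomes directly addressable memory: the OS pages
        // it in on demand, and no read()-into-buffer copy cycle is needed.
        map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    }

    /** Returns the next record as an in-place view, or null at end of file. */
    public ByteBuffer next() {
        if (map.remaining() < Integer.BYTES) {
            return null;
        }
        int len = map.getInt(); // 4-byte length prefix (invented format)
        if (len < 0 || len > map.remaining()) {
            return null; // truncated or corrupt tail
        }
        ByteBuffer record = map.slice(); // a view into the mapping, not a copy
        record.limit(len);
        map.position(map.position() + len); // skip past the record body
        return record;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}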
Almost 10x
With the new mmap cursor I was able to iterate much faster on large dumps of logs from benchmark runs. On my workstation, processing logs became 6x faster for transaction groups and 9x faster for raw log records.
Faster iterations meant that I processed benchmark results more efficiently from this point on, and soon after the answer became obvious: all client requests came from a very small set of file descriptors.
There was a proxy in front of Varnish that would serve several purposes, but it was configured to open two connections to Varnish per client IP address. In this very synthetic benchmark, that proxy was funneling thousands of HTTP requests per second over a handful of TCP connections. Lifting that limit solved the problem and that was the end of a wild goose chase.
Soon after, I submitted my change, and all was well that ended well. I was hoping that reflecting on this would help me understand another ongoing performance problem, but no luck so far. At this rate I might write something up about the current problem in 2028…