Configuring & Optimizing WebSocket Compression

Posted 14 days back at

Good news, browser support for the latest draft of “Compression Extensions” for WebSocket protocol — a much needed and overdue feature — will be landing in early 2014: Chrome M32+ (available in Canary already), and Firefox and Webkit implementations should follow.

Specifically, it enables the client and server to negotiate a compression algorithm and its parameters, and then selectively apply it to the data payloads of each WebSocket message: the server can compress delivered data to the client, and the client can compress data sent to the server.

Negotiating compression support and parameters #

Per-message compression is a WebSocket protocol extension, which means that it must be negotiated as part of the WebSocket handshake. Further, unlike a regular HTTP request (e.g. XMLHttpRequest initiated by the browser), WebSocket also allows us to negotiate compression parameters in both directions (client-to-server and server-to-client). That said, let's start with the simplest possible case:

GET /socket HTTP/1.1
Connection: Upgrade
Upgrade: websocket
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Extensions: permessage-deflate
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Sec-WebSocket-Extensions: permessage-deflate

The client initiates the negotiation by advertising the permessage-deflate extension in the Sec-Websocket-Extensions header. In turn, the server must confirm the advertised extension by echoing it in its response.

If the server omits the extension confirmation then the use of permessage-deflate is declined, and both the client and server proceed without it - i.e. the handshake completes and messages won't be compressed. Conversely, if the extension negotiation is successful, both the client and server can compress transmitted data as necessary:

  • Current standard uses Deflate compression.
  • Compression is only applied to application data: control frames and frame headers are unaffected.
  • Both client and server can selectively compress individual frames: if the frame is compressed, the RSV1 bit in the WebSocket frame header is set.

Selective message compression #

Selective compression is a particularly interesting and a useful feature. Just because we've negotiated compression support, doesn't mean that all messages must be compressed! After all, if the payload is already compressed (e.g. image data or any other compressed payload), then running deflate on each frame would unnecessarily waste CPU cycles on both ends. To avoid this, WebSocket allows both the server and client to selectively compress individual messages.

How do the server and client know when to compress data? This is where your choice of a WebSocket server and API can make a big difference: a naive implementation will simply compress all message payloads, whereas a smart server may offer an additional API to indicate which payloads should be compressed.

Similarly, the browser can selectively compress transmitted payloads to the server. However, this is where we run into our first limitation: the WebSocket browser API does not provide any mechanism to signal whether the payload should be compressed. As a result, the current implementation in Chrome compresses all payloads - if you're already transferring compressed data over WebSocket without deflate extension then this is definitely something you should consider as it may add unnecessary overhead on both sides of the connection.

In theory, in absence of an official API, or a per-message flag to indicate a compressed message, the UA could run a “data randomness” test to see if the data should be compressed. However, this by itself can add non-trivial processing overhead.

Optimizing and scaling Deflate compression #

Compressed payloads can significantly reduce the amount of transmitted data, which leads to bandwidth savings and faster message delivery. That said, there are some costs too! Deflate uses a combination of LZ77 and Huffman coding to compress data: first, LZ77 is used to eliminate duplicate strings; second, Huffman coding is used to encode common bit sequences with shorter representations.

By default, enabling compression will add at least ~300KB of extra memory overhead per WebSocket connection - arguably, not much, but if your server is juggling a large number of WebSocket connections, or if the client is running on a memory-limited device, then this is something that should be taken into account. The exact calculation based on zlib implementation of Deflate is as follows:

        compressor = (1 << (windowBits + 2)) + (1 << (memLevel + 9))
      decompressor = (1 << windowBits) + 1440 * 2 * sizeof(int)
             total = compressor + decompressor

Both peers maintain separate compression and decompression contexts, each of which require a separate LZ77 window buffer (as defined by windowBits), plus additional overhead for the Huffman tree and other compressor and decompressor overhead. The default settings are:

  • compressor: windowBits = 15, memLevel = 8 → ~256KB
  • decompressor: windowBits = 15 → ~44KB

The good news is that permessage-deflate allows us to customize the size of the LZ77 window and thus limit the memory overhead via two extension parameters: {client, server}_no_context_takeover and {client, server}_max_window_bits. Let's take a look under the hood...

Optimizing LZ77 window size #

A full discussion of LZ77 and Huffman coding is outside the scope of this post, but to understand the above extension parameters, let's first take a small detour to understand what we are configuring and the inherent tradeoffs between memory and compression performance.

The windowBits parameter is customizing the size of the “sliding window” used by the LZ77 algorithm. Above video is a great visual demonstration of LZ77 at work: the algorithm maintains a “sliding window” of previously seen data and replaces repeated strings (indicated in red) with back-references (e.g. go back X characters, copy Y characters) - that's LZ77 in a nutshell. As a result, the larger the window, the higher the likelihood that LZ77 will find and eliminate duplicate strings.

How large is the LZ77 sliding window? By default, the window is initialized to 15 bits, which translates to 215 bits (32KB) of space. However, we can customize the size of the sliding window as part of the WebSocket handshake:

GET /socket HTTP/1.1
Upgrade: websocket
Sec-WebSocket-Key: ...
Sec-WebSocket-Extensions: permessage-deflate;
  client_max_window_bits; server_max_window_bits=10
  • The client advertises that it supports custom window size via client_max_window_bits
  • The client requests that the server should use a window of size 210 (1KB)
HTTP/1.1 101 Switching Protocols
Connection: Upgrade
Sec-WebSocket-Accept: ...
Sec-WebSocket-Extensions: permessage-deflate;
  • The server confirms that it will use a 210 (1KB) window size
  • The server opts-out from requesting a custom client window size

Both the client and server must maintain the same sliding windows to exchange data: one buffer for client → server compression context, and one buffer for server → client context. As a result, by default, we will need two 32KB buffers (default window size is 15 bits), plus other compressor overhead. However, we can negotiate a smaller window size: in the example above, we limit the server → client window size to 1KB.

Why not start with the smaller buffer? Simple, the smaller the window the less likely it is that LZ77 will find an appropriate back-reference. That said, the performance will vary based on the type and amount of transmitted data and there is no single rule of thumb for best window size. To get the best performance, test different window sizes on your data! Then, where needed, decrease window size to reduce memory overhead.

Optimizing Deflate memLevel #

The memLevel parameter controls the amount of memory allocated for internal compression state: when set to 1, it uses the least memory, but slows down the compression algorithm and reduces the compression ratio; when set to 9, it uses the most memory and delivers the best performance. The default memLevel is set to 8, which results in ~133KB of required memory overhead for the compressor.

Note that the decompressor does not need to know the memLevel chosen by the compressor. As a result, the peers do not need to negotiate this setting in the handshake - they are both free to customize this value as they wish. The server can tune this value as required to tradeoff speed, compression ratio, and memory - once again, the best setting will vary based on your data stream and operational requirements.

Unfortunately, the client, which in this case is the browser user-agent does not provide any API to customize the memLevel of the compressor: memLevel = 8 is used as a default value in all cases. Similar to the missing per-message compression flag, perhaps this is a feature that can be added to a future revision of the WebSocket spec.

Context takeover #

By default, the compression and decompression contexts are persisted across different WebSocket messages - i.e. the sliding window from previous message is used to encode content of the next message. If messages are similar — as they usually are — this improves the compression ratio. However, the downside is that the context overhead is a fixed cost for the entire lifetime of the connection - i.e. memory must be allocated at the beginning and must be maintained until the connection is closed.

Well, what if we relaxed this constraint and instead allowed the peers to reset the context between the different messages? That's what “no context takeover” option is all about:

GET /socket HTTP/1.1
Upgrade: websocket
Sec-WebSocket-Key: ...
Sec-WebSocket-Extensions: permessage-deflate;
  client_max_window_bits; server_max_window_bits=10;
  client_no_context_takeover; server_no_context_takeover
  • Client advertises that it will disable context takeover
  • Client requests that the server also disables context takeover
HTTP/1.1 101 Switching Protocols
Connection: Upgrade
Sec-WebSocket-Accept: ...
Sec-WebSocket-Extensions: permessage-deflate;
  server_max_window_bits=10; client_max_window_bits=12
  client_no_context_takeover; server_no_context_takeover
  • Server acknowledges the client “no context takeover” recommendation
  • Server indicates that it will disable context takeover

Disabling “context takeover” prevents the peer from using the compressor context from a previous message to encode contents of the next message. In other words, each message is encoded with its own sliding window and Huffman tree.

The upside of disabling context takeover is that both the client and server can reset their contexts between different messages, which significantly reduces the total overhead of the connection when its idle. That said, the downside is that compression performance will likely suffer as well. How much? You guessed it, the answer depends on the actual application data being exchanged.

Note that even without “no_context_takeover” negotiation, the decompressor should be able to decode both types of messages. That said, the explicit negotiation is what allows us to know that it is safe to reset the context on the receiver.

Optimizing compression parameters #

Now that we know what we're tweaking, a simple ruby script can help us iterate over all of the options (memLevel and window size) to pick the optimal settings. For the sake of an example, let's compress the GitHub timeline:

$> curl -o timeline.json
$> ruby compare.rb timeline.json
Original file (timeline.json) size: 30437 bytes
Window size: 8 bits (256 bytes)
   memLevel: 1, compressed size: 19472 bytes (36.03% reduction)
   memLevel: 9, compressed size: 15116 bytes (50.34% reduction)

Window size: 11 bits (2048 bytes)
   memLevel: 1, compressed size: 9797 bytes (67.81% reduction)
   memLevel: 9, compressed size: 8062 bytes (73.51% reduction)

Window size: 15 bits (32768 bytes)
   memLevel: 1, compressed size: 8327 bytes (72.64% reduction)
   memLevel: 9, compressed size: 7027 bytes (76.91% reduction)

The smallest allowed window size (256 bytes) provides ~50% compression, and raising the window to 2KB, takes us to ~73%! From there, it is diminishing returns: 32KB window yields only a few extra percent (see full output). Hmm! If I was streaming this data over a WebSocket, a 2KB window size seems like a reasonable optimization.

Deploying WebSocket compression #

Customizing LZ77 window size and context takeover are advanced optimizations. Most applications will likely get the best performance by simply using the defaults (32KB window size and shared sliding window). That said, it is useful to understand the incurred overhead (upwards of 300KB per connection), and the knobs that can help you tweak these parameters!

Looking for a WebSocket server that supports per-message compression? Rumor has it, Jetty, Autobahn and WebSocket++ already support it, and other servers (and clients) are sure to follow. For a deep-dive on the negotiation workflow, frame layouts, and more, check out the official specification.

P.S. For more WebSocket optimization tips: WebSocket chapter in High Performance Browser Networking.

Optimizing NGINX TLS Time To First Byte (TTTFB)

Posted 14 days back at

Network latency is one of our primary performance bottlenecks on the web. In the worst case, new navigation requires a DNS lookup, TCP handshake, two roundtrips to negotiate the TLS tunnel, and finally a minimum of another roundtrip for the actual HTTP request and response — that's five network roundtrips to get the first few bytes of the HTML document!

Modern browsers try very hard to anticipate and predict user activity to hide some of this latency, but speculative optimization is not a panacea: sometimes the browser doesn't have enough information, at other times it might guess wrong. This is why optimizing Time To First Byte (TTFB), and TLS TTFB in particular due to the extra roundtrips, is critical for delivering a consistent and optimized web experience.

The why and the how of TTFB

According to the HTTP Archive, the size of the HTML document at 75th percentile is ~20KB+, which means that a new TCP connection will incur multiple roundtrips (due to slow-start) to download this file - with IW4, a 20KB file will take 3 extra roundtrips, and upgrading to IW10 will reduce that to 2 extra roundtrips.

To minimize the impact of the extra roundtrips all modern browsers tokenize and parse received HTML incrementally and without waiting for the full file to arrive. Stream processing enables the browser to discover other critical resources, such as references to CSS stylesheets, JavaScript, and other assets as quickly as possible and initiate those requests while waiting for the remainder of the document. As a result, optimizing your TTFB and the content of those first bytes can make a big difference to performance of your application:

  • Don't buffer the entire response on the server. If you have partial content (e.g. page header), then flush it as early as possible to get the browser working on your behalf.
  • Optimize the contents of the first bytes by including references to other critical assets as early as possible.

Measuring "out of the box" NGINX TLS TTFB

With the theory of TTFB out of the way, let's now turn to the practical matter of picking and tuning the server to deliver the best results. One would hope that the default “out of the box” experience for most servers would do a good job… unfortunately, that is not the case. Let's take a closer look nginx:

  • Fresh Ubuntu server in ec2-west (micro instance) with nginx v1.4.4 (stable).
  • The server is configured to serve a single 20KB (compressed) file.
  • The TLS certificate is ~5KB and is using a 2048-bit key.
  • The measurements are done with WebPageTest: 3G profile (300ms delay), Chrome (stable channel), Dulles location (~80ms actual RTT to the EC2 instance on the west coast).

The total client to server roundtrip time is ~380ms. As a result, we would expect a regular HTTP connection to yield a TTFB of ~1140ms: 380ms for DNS, 380ms for TCP handshake, and 380ms for the HTTP request and (instant) response. For HTTPS, we would add another two RTTs to negotiate all the required parameters: 1140ms + 760ms, or ~1900ms (5 RTTs) in total. Well, that's the theory, let's now try the theory in practice!

The HTTP TTFB is right on the mark (~1100ms), but what in the world is going on with HTTPS? The TTFB reported by WebPageTest shows ~2900ms, which is an entire extra second over our expected value! Is it the cost of the RSA handshake and symmetric crypto? Nope. Running openssl benchmarks on the server shows that it takes ~2.5ms for a 2048-bit handshake, and we can stream ~100MB/s through aes-256. It's time to dig deeper.

Fixing the “large” certificate bug in nginx

Looking at the tcpdump of our HTTPS session we see the ClientHello record followed by ServerHello response ~380ms later. So far so good, but then something peculiar happens: the server sends ~4KB of its certificate and pauses to wait for an ACK from the client - huh? The server is using a recent Linux kernel (3.11) and is configured by default with IW10, which allows it to send up to 10KB, what's going on?

After digging through the nginx source code, one stumbles onto this gem. Turns out, any nginx version prior to 1.5.6 has this issue: certificates over 4KB in size incur an extra roundtrip, turning a two roundtrip handshake into a three roundtrip affair - yikes. Worse, in this particular case we trigger another unfortunate edge case in Windows TCP stack: the client ACKs the first few packets from the server, but then waits ~200ms before it triggers a delayed ACK for the last segment. In total, that results in extra 580ms of latency that we did not expect.

Ok, let's try the current mainline nginx release (1.5.7) and see if we fare any better...

Much better! After a simple upgrade the TLS TTFB is down to ~2300ms, which is about 600ms lower than our first attempt: we've just eliminated the extra RTT incurred by nginx and the ~200ms delayed ACK on the client. That said, we are not out of the woods yet — there is still an extra RTT in there.

Optimizing the TLS record size

TLS record size can have a significant impact on the page load time performance of your application. In this case, we run into this issue head first: nginx pumps data to the TLS layer, which in turn creates a 16KB record and then passes it to the TCP stack. So far so good, except that the server congestion window is less than 16KB for our new connection and we overflow the window, incurring an extra roundtrip while the data is buffered on the client. Fixing this requires making a quick patch to the nginx source:

diff nginx-1.5.7/src/event/ngx_event_openssl.c nginx-1.5.7-mtu/src/event/ngx_event_openssl.c
<               (void) BIO_set_write_buffer_size(wbio, NGX_SSL_BUFSIZE);
>               (void) BIO_set_write_buffer_size(wbio, 16384);
diff nginx-1.5.7/src/event/ngx_event_openssl.h nginx-1.5.7-mtu/src/event/ngx_event_openssl.h
< #define NGX_SSL_BUFSIZE  16384
> #define NGX_SSL_BUFSIZE  1400

After applying our two-line change and rebuilding the server our TTFB is down to ~1900ms — that's the 5 RTTs we expected at the start. In fact, it's easy to spot the difference from our previous run: the waterfall now shows the second RTT as content download time (blue section), whereas previously the browser couldn't process the HTML document until the very end. Success! But wait, what if I told you that we could do even better?

Enabling TLS False Start

TLS False Start allows us to eliminate an extra roundtrip of latency within the TLS handshake: the client can send its encrypted application data (i.e. HTTP request) immediately after it has sent its ChangeCipherSpec and Finished records, without waiting for the server to confirm its settings. So, how do we enable TLS False Start?

In short, we need to enable NPN on the server, which in practice means that we need to rebuild nginx against OpenSSL 1.0.1a or higher — nothing more, nothing less. Let's do just that and see what happens...

We started with a ~1800ms overhead for our TLS connection (nearly 5 extra RTTs); eliminated the extra certificate roundtrip after a nginx upgrade; cut another RTT by forcing a smaller record size; dropped an extra RTT from the TLS handshake thanks to TLS False Start. With all said and done, our TTTFB is down to ~1560ms, which is exactly one roundtrip higher than a regular HTTP connection. Now we're talking!

Yes, TLS does add latency and processing overhead. That said, TLS is an unoptimized frontier and we can mitigate many of its costs - it's worth it. Our quick exploration with nginx is a case in point, and most other TLS servers have all the same issues we've outlined above. Let's get this fixed. TLS is not slow, it's unoptimized.

Optimizing Web Font Rendering Performance

Posted 14 days back at

Web font adoption continues to accelerate across the web: according to HTTP Archive, ~37% of top 300K sites are using web fonts as of early 2014, which translates to a 2x+ increase over the past twelve months. Of course, this should not be all that surprising to most of us. Typography has always been an important part of good design, branding, and readability and web fonts offer many additional benefits: the text is selectable, searchable, zoomable, and high-DPI friendly. What's not to like?

Ah, but what about the rendering speed, don't web fonts come with a performance penalty? Fonts are an additional critical resource on the page, so yes, they can impact rendering speed of our pages. That said, just because the page is using web fonts doesn't mean it will (or has to) render slower.

There are four primary levers that determine the performance impact of web fonts on the page:

  1. The total number of fonts and font-weights used on the page.
  2. The total byte size of fonts used on the page.
  3. The transfer latency of the font resource.
  4. The time when the font downloads are initiated.

The first two levers are directly within the control of the designer of the page. The more fonts are used, the more requests will be made and more bytes will be incurred. The general UX best practice is to keep the number of used fonts at a minimum, which also aligns with our performance goals. Step one: use web fonts, but audit your font usage periodically and try to keep it lean.

Measuring web font latencies

The transfer latency of each font file is dependent on its bytesize, which in turn is determined by the number of glyphs, font metadata (e.g hinting for Windows platforms), and used compression method. Techniques such as font subsetting, UA-specific optimization, and more efficient compression (e.g. Google Fonts recently switched to Zopfli for WOFF resources), are all key to optimizing the transfer size. Plus, since we're talking about latency, where the font is served from makes a difference also – i.e. a CDN, and ideally the user's cache!

That said, instead of talking in the abstract, how long does it actually take the visitor to download the web font resource on your site? The best way to answer this question is to instrument your site via the Resource Timing API, which allows us to get the DNS, TCP, and transfer time data for each font - as a bonus, Google Fonts recently enabled Resource Timing support! Here is an example snippet to report font latencies to Google Analytics:

// check if visitor's browser supports Resource Timing
if (typeof window.performance == 'object') {
  if (typeof window.performance.getEntriesByName == 'function') {

  function logData(name, r) {
    var dns = Math.round(r.domainLookupEnd - r.domainLookupStart),
          tcp = Math.round(r.connectEnd - r.connectStart),
        total = Math.round(r.responseEnd - r.startTime);
      ['_trackTiming', name, 'dns', dns],
      ['_trackTiming', name, 'tcp', tcp],
      ['_trackTiming', name, 'total', total]

  var _gaq = _gaq || [];
  var resources = window.performance.getEntriesByType("resource");
  for (var i in resources) {
    if (resources[i].name.indexOf("") != -1) {
      logData("webfont-font", resources[i])
    if (resources[i].name.indexOf("") != -1) {
      logData("webfont-css", resources[i])

The above example captures the key latency metrics both for the UA-optimized CSS file and the font files specified in that file: the CSS lives on and is cached for 24 hours, and font files live on and have a long-lived expiry. With that in place, let's take a look at the total (responseEnd - startTime) timing data in Google Analytics for my site:

For privacy reasons, the Resource Timing API intentionally does not provide a "fetched from cache” indicator, but we can nonetheless use a reasonable timing threshold - say, 20ms - to get an approximation. Why 20ms? Fetching a file from spinning rust, and even flash, is not free. The actual cache-fetch timing will vary based on hardware, but for our purposes we'll go with a relatively aggressive 20ms threshold.

With that in mind and based on above data for visitors coming to my site, the median time to get the CSS file is ~100ms, and ~26% of visitors get it from their local cache. Following that, we need to fetch the required font file(s), which take <20ms at the median – a significant portion of the visitors has them in their browser cache! This is great news, and a confirmation that the Google Fonts strategy of long-lived and shared font resources is working.

Your results will vary based on the fonts used, amount and type of traffic, plus other variables. The point is that we don't have to argue in the abstract about the latency and performance costs of web fonts: we have the tools and APIs to measure the incurred latencies precisely. And what we can measure, we can optimize.

Timing out slow font downloads

Despite our best attempts to optimize delivery of font resources, sometimes the user may simply have a poor connection due to a congested link, poor reception, or a variety of other factors. In this instance, the critical resources – including font downloads – may block rendering of the page, which only makes the matter worse. To deal with this, and specifically for web fonts, different browsers have taken different routes:

  • IE immediately renders text with the fallback font and re-renders it once the font download is complete.
  • Firefox holds font rendering for up to 3 seconds, after which it uses a fallback font, and once the font download has finished it re-renders the text once more with the downloaded font.
  • Chrome and Safari hold font rendering until the font download is complete.

There are many good arguments for and against each strategy and we won't go into that discussion here. That said, I think most will agree that the lack of any timeout in Chrome and Safari is not a great approach, and this is something that the Chrome team has been investigating for a while. What should the timeout value be? To answer this, we've instrumented Chrome to gather font-size and fetch times, which yielded the following results:

Webfont size range Percent 50th 70th 90th 95th 99th
0KB - 10KB 5.47% 136 ms 264 ms 785 ms 1.44 s 5.05 s
10KB - 50KB 77.55% 111 ms 259 ms 892 ms 1.69 s 6.43 s
50KB - 100KB 14.00% 167 ms 882 ms 1.31 s 2.54 s 9.74 s
100KB - 1MB 2.96% 198 ms 534 ms 2.06 s 4.19 s 10+ s
1MB+ 0.02% 370 ms 969 ms 4.22 s 9.21 s 10+ s

First, the good news is that the majority of web fonts are relatively small (<50KB). Second, most font downloads complete within several hundred milliseconds: picking a 10 second timeout would impact ~0.3% of font requests, and a 3 second timeout would raise that to ~1.1%. Based on this data, the conclusion was to make Chrome mirror the Firefox behavior: timeout after 3 seconds and use a fallback font, and re-render text once the font download has completed. This behavior will ship in Chrome M35, and I hope Safari will follow.

Hands-on: initiating font resource requests

We've covered how to measure the fetch latency of each resource, but there is one more variable that is often omitted and forgotten: we also need optimize when the fetch is initiated. This may seem obvious on the surface, except that it can be a tricky challenge for web fonts in particular. Let's take a look at a hands-on example:

@font-face {
  font-family: 'FontB';
  src: local('FontB'), url('') format('woff');
p { font-family: FontA }
<!DOCTYPE html>
  <link href='stylesheet.css' rel='stylesheet'> <!-- see content above -->
    @font-face {
     font-family: 'FontA';
     src: local('FontA'), url('') format('woff');
  <script src='application.js' />
<p>Hello world!</p>

There is a lot going on above: we have an external CSS and JavaScript file, and inline CSS block, and two font declarations. Question: when will the font requests be triggered by the browser? Let's take it step by step:

  1. Document parser discovers external stylesheet.css and a request is dispatched.
  2. Document parser processes the inline CSS block which declares FontA - we're being clever here, we want the font request to go out as early as possible. Except, it doesn't. More on that in a second.
  3. Document parser blocks on external script: we can't proceed until that's fetched and executed.
  4. Once the script is fetched and executed we finish constructing the DOM, style calculation and layout is performed, and we finally dispatch request for fontA. At this point, we can also perform the first paint, but we can't render the text with our intended font since the font request is inflight... doh.

The key observation in the above sequence is that font requests are not initiated until the browser knows that the font is actually required to render some content on the page - e.g. we never request FontB since there is no content that uses it in above example! On one hand, this is great since it minimizes the number of downloads. On the other, it also means that the browser can't initiate the font request until it has both the DOM and the CSSOM and is able to resolve which fonts are required for the current page.

In the above example, our external JavaScript blocks DOM construction until it is fetched and executed, which also delays the font download. To fix this, we have a few options at our disposal: (a) eliminate the JavaScript, (b) add an async attribute (if possible), or (c) move it to the bottom of the page. However, the more general takeaway is that font downloads won't start until the browser can compute the the render tree. To make fonts render faster we need to optimize the critical rendering path of the page.

Tip: in addition to measuring the relative request latencies for each resource, we can also measure and analyze the request start time with Resource Timing! Tracking this timestamp will allow us to determine when the font request is initiated.

Optimizing font fetching in Chrome M33

Chrome M33 landed an important optimization that will significantly improve font rendering performance. The easiest way to explain the optimization is to look at a pre-M33 example timeline that illustrates the problem:

  1. Style calculation completed at ~840ms into the lifecycle of the page.
  2. Layout is triggered at ~1040ms, and font request is dispatched immediately after.

Except, why did we wait for layout if we already resolved the styles two hundred milliseconds earlier? Once we know the styles we can figure out which fonts we'll need and immediately initiate the appropriate requests – that's the new behavior in Chrome M33! On the surface, this optimization may not seem like much, but based on our Chrome instrumentation the gap between style and layout is actually much larger than one would think:

Percentile 50th 60th 70th 80th 90th
Time from Style → Layout 132 ms 182 ms 259 ms 410 ms 820 ms

By dispatching the font requests immediately after first style calculation the font download will be initiated ~130ms earlier at the median and ~800ms earlier at 90th percentile! Cross-referencing these savings with the font fetch latencies we saw earlier shows that in many cases this will allow us to fetch the font before the layout is done, which means that we won't have to block text rendering at all – this is a huge performance win.

Of course, one also should ask the obvious question... Why is the gap between style calculation and layout so large? The first place to start is in Chrome DevTools: capture a timeline trace and check for slow operations (e.g. long-running JavaScript, etc). Then, if you're feeling adventurous, head to chrome://tracing to take a peek under the hood – it may well be that the browser is simply busy processing and laying out the page.

Optimizing web fonts with Font Load Events API

Finally, we come to the most exciting part of this entire story: Font Load Events API. In a nutshell, this API will allow us to manage and define how and when the fonts are loaded – we can schedule font downloads at will, we can specify how and when the font will be rendered, and more. If you're familiar with the Web Font Loader JS library, then think of this API as that and more but implemented natively in the browser:

var font = new FontFace("FontA", "url(", {});
font.ready().then(function() {
  // font loaded.. swap in the text / define own behavior.

font.load(); // initiate immediate fetch / don't block on render tree!

Font Load Events API gives us complete control over which fonts are used, when they are swapped in (i.e. should they block rendering), and when they're downloaded. In the example above we construct a FontFace object directly in JavaScript and trigger an immediate fetch – we can inline this snippet at the top of our page and avoid blocking on CSSOM and DOM entirely! Best of all, you can already play with this API in Canary builds of Chrome, and if all goes well it should find its way into stable release by M35.

Web font performance checklist

Web fonts offer a lot of benefits: improved readability, accessibility (searchable, selectable, zoomable), branding, and when done well, beautiful results. It's not a question of if web fonts should be used, but how to optimize their use. To that end, a quick performance checklist:

  1. Audit your font usage and keep it lean.
  2. Make sure font resources are optimized - see Google Web Fonts tricks.
  3. Instrument your font resources with Resource Timing: measure → optimize.
  4. Optimize the transfer latency and time of initial fetch for each font.
  5. Optimize your critical rendering path, eliminate unnecessary JS, etc.
  6. Spend some time playing with the Font Load Events API.

Just because the page is using a web font, or several, doesn't mean it will (or has to) render slower. A well optimized site can deliver a better and faster experience by using web fonts.

Why is my CDN 'slow' for mobile clients?

Posted 14 days back at

We're using a CDN but when we look at the performance numbers it appears to be far less effective for mobile clients. We're considering disabling it entirely as we're not sure that it's worth it. Someone needs to build a special CDN for mobile, we'd definitely use that to improve latency!

The frequency with which I've been hearing this and similar arguments has been rapidly increasing as more teams are finally focusing on improving performance of their mobile sites. The problem is, while the statement is often based on real data (i.e. the relative performance improvements offered by a CDN are smaller for mobile clients), the conclusion is wrong: the absolute improvements are likely the same for all clients and hence worth every penny. Also, we don't need a "mobile CDN", we need carriers to fix their networks.

The many components of latency

To understand why the relative metrics between desktop and mobile CDN performance are misleading many otherwise well-informed teams, it helps to take a step back and consider the actual topology of the internet, plus what CDNs can and cannot do. In fact, for sake of a concrete example, let's assume the following:

  • Client is located on the west coast; server is located on the east coast.
  • The propagation latency between west and east US coasts is 50 ms.
  • The server response time is 50 ms.
  • Last-mile latency for "tethered" clients: ~18 ms for fiber, ~26 ms for cable, ~44 ms for DSL.
  • Last-mile latency for "wireless" clients: ~50 ms for 4G, ~200 ms for 3G.

The total time to service the request is a combination of all of the latencies in both directions: last-mile → coast to coast (50 ms) → server response (50 ms) → coast to coast (50 ms) → last-mile. For example, if the client happens to be on a fiber connection with a 18 ms last-mile path, then the total time is 18+50+50+50+18, or 186 milliseconds.

The last-mile latencies we're using above are specific to the US and are courtesy of the Measuring Broadband America report (FCC, 2013). Sadly, FCC has not yet published any similar reports for US mobile carriers. As a result, the ~50 and ~200ms numbers are based on specified and cited targets of existing US carriers.

Accelerating content delivery with a CDN

The whole premise of a CDN is to move the bytes as close to the user as possible and to do so CDNs deploy cache servers within various data centers and peering points around the globe. In other words, in the optimal case the CDN servers are located immediately outside the ISP / carrier network: the client makes a request, incurs the cost of last-mile latency to exit the ISP / carrier network and immediately hits a CDN server that returns a response. As a result, CDN minimizes the propagation latency and if a static resource is served, also reduces the server response time by returning a cached asset.

To continue with our earlier example, let's assume that our CDN server is optimally placed (~5 ms path instead of 50 ms coast-to-coast path) and that we are requesting a publicly cacheable asset that can be served by the CDN (~5ms) without hitting the origin. For our fiber client, the new total time is the sum of the last-mile roundtrip plus the CDN response times: 18+5+5+5+18, or 51 ms in total. As a result, adding a CDN took our request time from ~186ms down to ~51ms: a 365% improvement in total latency!

Last-mile Coast-to-Coast Server Response Total (ms) Improvement
Fiber 18 50 50 186
Cable 26 50 50 202
DSL 44 50 50 238
4G 50 50 50 250
3G 200 50 50 550
CDN + Fiber 18 5 5 51 -135 ms (365%)
CDN + Cable 26 5 5 67 -135 ms (301%)
CDN + DSL 44 5 5 103 -135 ms (231%)
CDN + 4G 50 5 5 115 -135 ms (217%)
CDN + 3G 200 5 5 415 -135 ms (133%)

Repeating the same calculation for each connection profile reveals an unfortunate trend: the relative effectiveness of the CDN "declines" as the last-mile latency increases. If you consider the topology of where the CDN servers are typically placed this makes perfect sense: CDN servers are outside the ISP network. That said, note that the absolute latency improvement is still the same regardless of the last-mile profile.

CDNs help minimize propagation and server response times. If your only metric is the relative before and after improvement, then it would appear that a CDN is not doing much for mobile clients - e.g. a mere 33% improvement for 3G clients. In reality, the savings are the same, and the real takeaway is that the last-mile latency of most mobile carriers dominates all other latencies of the resource transfer - yes, a rather sad state of affairs.

Operational and business costs of living at the edge

One obvious strategy to improve the end-to-end latency is to move the cache servers even closer to the client: instead of positioning them outside the ISPs network, could we move them inside? In principle, the answer is yes, and many ISPs already deploy their own cache servers. However, in practice, this is a tricky problem.

First, the number of peering points is relatively small, which allows CDNs to deploy in dozens of well known locations around the globe to provide their service. Also, to do so, they don't have to do any special deals with individual ISPs: typically, the servers are deployed in shared data centers (peering points). By contrast, moving servers into the ISP network would require individual deals with each ISP - not impossible, but obviously a much harder operational and business problem.

For the sake of argument, let's say a CDN does strike a deal with an ISP. Now the CDN needs to deploy many more servers, and ideally as close to their customers as possible (near the radio towers and other aggregation points). Doing so would require a lot of hardware, makes maintenance and upgrades an operations nightmare, and opens a lot of security questions - e.g. would you deploy a TLS termination node within a third-party operated network that you don't have direct access to? In short, it's a cost, security, and a logistics nightmare.

That's not to say this can't be done. Many ISPs have long been trying to move "upmarket" and provide CDN functionality. However, ISPs have a different problem: it's very hard for them to sign clients because most sites are not interested in signing individual deals with every ISP on the planet. In recent news, Verizon acquired EdgeCast... It remains to be seen what they do with it and if this will actually benefit Verizon clients.

Business and operational costs aside, there is nothing particularly special about optimizing CDNs for mobile clients. The root problem is that the last-mile latency of mobile carriers is atrocious - that's what we need to fix. Instead of pushing cache servers closer to the edge, we need transparency into performance of these networks, and we need more competition between carriers to address the underlying last-mile performance problems.

CDNs are not slow for mobile, use them

In short, there is no reason why CDNs are inherently 'slower' for mobile clients: don't confuse relative gains with absolute savings. That said, the actual performance will obviously vary for each CDN provider based on location of their servers and connectivity with various mobile carriers - measure, gather real data, optimize.

Uplink Latency of WiFi and 4G Networks

Posted 14 days back at

The user opens your application on their device and triggers an action requiring that we fetch a remote resource: application invokes the appropriate platform API (e.g. XMLHttpRequest), the runtime serializes the request (e.g. translates it to a well-formed HTTP request) and passes the resulting byte buffer to the OS, which then fragments it into one or more TCP packets and finally passes the buffer to the link layer.

So far, so good, but what happens next? As you can guess, the answer depends on the properties of the current link layer in use on the device. Let's dig a bit deeper...

Transmitting over WiFi

If the user is on WiFi, then the link layer breaks up the data into multiple frames and (optimistically) begins transmitting data one frame at a time: it waits until the radio channel is "silent," transmits the WiFi frame, and then waits for an acknowledgement from the receiver before proceeding with transmission of the next frame. Yes, you've read that right, each frame requires a full roundtrip between the sender and receiver! 802.11n is the first standard to introduce "frame aggregation," which allows multiple frames to be sent in a single transmission.

Of course, not all transmissions will succeed on their first attempt. If two peers transmit at the same time and on the same channel then a collision will happen and both peers will have to retransmit data: both peers sleep for a random interval and then repeat the process. The WiFi access model is simple to understand and implement, but as you can guess, also doesn't provide any guarantees about the latency costs of the transmission. If the network is mostly idle, then transmission times are nice and low, but if the network is congested, then all bets are off.

In fact, don't be surprised to see 100ms+ delays just for the first hop between the WiFi sender and the access point - e.g. see the histogram above, showing 180ms+ first-hop latency tails on my own (home) WiFi network. That said, note that there is no "typical" or "average" uplink WiFi latency: the latency will depend on the conditions and load of the particular WiFi network. In short, expect high variability and long latency tails, with an occasional chance of network collapse if too many peers are competing for access.

If your WiFi access point is also your gateway then you can run a simple ping command to measure your first hop latency.

Uplink scheduling on 4G networks

In order to make better use of the limited capacity of the shared radio channel and optimize energy use on the device, 4G/LTE standards take a much more hands-on approach to scheduling and resource assignment: the radio tower (eNodeB) notifies the device for when it should listen for inbound data, and also tells the device when it is allowed to transmit data. As you can imagine, this can incur a lot of coordination overhead (read, latency), but such is the cost of achieving higher channel and energy efficiency.

  1. The radio network has a dedicated Physical Uplink Control Channel (PUCCH) which is used by the device to notify the radio network that it wants to transmit data: each device has a periodic timeslot (typically on a 5, 10, or 20 ms interval) where it is allowed to send a Scheduling Request (SR) that consists of a single bit indicating that it needs uplink access.

  2. The SR request bit is received by the radio tower (eNodeB) but the SR request on its own is not sufficient to assign uplink resources as it doesn't tell the scheduler the amount of data that the device intends to transfer. So, the eNodeB responds with a small "uplink grant" that is just large enough to communicate the size of the pending buffer.

  3. Once the device receives its first uplink grant, it waits for its turn to transmit (up to ~5 ms), and sends a Buffer Status Report (BSR) indicating the amount of application data pending in its upload buffers. Finally, the eNodeB receives the BSR message, allocates the necessary uplink resources and sends back another uplink grant that will allow the device to drain its buffer.

What happens if additional data is added to the device buffer while the above process is underway? Simple, the device sends another BSR message and waits for new uplink grant! If timed correctly, then the BSR requests and uplink grants can be pipelined with existing data transfers from the device, allowing us to minimize first-hop delays. On the other hand, once the device buffer is drained and then new data arrives, the entire process is repeated once over: SR, uplink grant, BSR, uplink grant, data transmission.

So, what does this all mean in practice? Let's do the math:

  • If the network is configured to use a 10 ms periodic interval for communicating SR messages then we would expect a ~5 ms average delay before the SR request is sent.
  • There are two full roundtrips between the device and the eNodeB to negotiate the uplink resource assignment to transmit pending application data. The latency incurred by these roundtrips will vary for each network, but as a rule of thumb each exchange is ~5 ms.

Add it all up, and we're looking at 20+ ms of delay between application data arriving at the (empty buffer) of the link layer on the device and the same data being available at the link layer of the eNodeB. From there the packet needs to traverse the carrier network, exit onto the public network, and get routed to your server.

Above uplink latency overhead is one reason why low latency applications, such as delivering voice, can be a big challenge over 4G networks. In fact, for voice specifically, there is ongoing work on Voice over LTE (VoLTE) which aims to address this problem. How? Well, one way is to provide a persistent uplink grant: transmit up to X bytes on a Y periodic interval. Believe it or not, today most 4G networks still fall back to old 3G infrastructure to transmit voice!

Optimizing for WiFi and 4G networks

As you can tell, both WiFi and 4G have their challenges. WiFi can deliver low latency first hop if the network is mostly idle: no coordination is required and the device can transmit whenever it senses that the radio channel is idle. On the other hand, WiFi is subject to high variability and long latency tails if the network has many peers competing for access - and most networks do.

By contrast, 4G networks require coordination between the device and the radio tower for each uplink transfer, which translates to higher minimum latency, but the upside is that 4G can reign in the latency tails and provides more predictable performance and reduces congestion.

So, how does all this impact application developers? First off, latency aside, and regardless of wireless technology, consider the energy costs of your network transfers! Periodic transfers incur high energy overhead due to the need to wake up the radio on each transmission. Second, same periodic transfers also incur high uplink coordination overhead - 4G in particular. In short, don't trickle data. Aggregate your network requests and fire them in one batch: you will reduce energy costs and reduce latency by amortizing scheduling overhead.

Minimum Viable Block Chain

Posted 14 days back at

Cryptocurrencies, and Bitcoin in particular, have been getting a lot of attention from just about every angle: regulation, governance, taxation, technology, product innovation, and the list goes on. The very concept of a "peer-to-peer (decentralized) electronic cash system" turns many of our previously held assumptions about money and finance on their head.

That said, putting the digital currency aspects aside, an arguably even more interesting and far-reaching innovation is the underlying block chain technology. Regardless of what you think of Bitcoin, or its altcoin derivatives, as a currency and a store of value, behind the scenes they are all operating on the same basic block chain principles outlined by Satoshi Nakamoto:

We propose a solution to the double-spending problem using a peer-to-peer network. The network timestamps transactions by hashing them into an ongoing chain of hash-based proof-of-work, forming a record that cannot be changed without redoing the proof-of-work. The longest chain not only serves as proof of the sequence of events witnessed, but proof that it came from the largest pool of CPU power... The network itself requires minimal structure.

The block chain is agnostic to any "currency". In fact, it can (and will) be adapted to power many other use cases. As a result, it pays to understand the how and the why behind the "minimum viable block chain":

  • What follows is not an analysis of the Bitcoin block chain. In fact, I intentionally omit mentioning both the currency aspects, and the many additional features that the Bitcoin block chain is using in production today.
  • What follows is an attempt to explain, from the ground up, why the particular pieces (digital signatures, proof-of-work, transaction blocks) are needed, and how they all come together to form the "minimum viable block chain" with all of its remarkable properties.
I have learned long ago that writing helps me refine my own sloppy thinking, hence this document. Primarily written for my own benefit, but hopefully helpful to someone else as well. Feedback is always welcome, leave a comment below!

Securing transactions with triple-entry bookkeeping #

Alice and Bob are stamp collectors. It's nothing serious, and they're mostly in it for the social aspects of meeting others, exchanging stories, and doing an occasional trade. If both parties see something they like, they negotiate right there and then and complete the swap. In other words, it's a simple barter system.

Then, one day Bob shows up with a stamp that Alice feels she absolutely must have in her collection. Except there is a problem, because Bob is not particularly interested in anything that Alice has to offer. Distraught, Alice continues negotiating with Bob and they arrive at a solution: they'll do a one-sided transaction where Bob will give Alice his stamp and Alice will promise to repay him in the future.

Both Bob and Alice have known each other for a while, but to ensure that both live up to their promise (well, mostly Alice), they agree to get their transaction "notarized" by their friend Chuck.

They make three copies (one for each party) of the above transaction receipt indicating that Bob gave Alice a "Red stamp". Both Bob and Alice can use their receipts to keep account of their trade(s), and Chuck stores his copy as evidence of the transaction. Simple setup but also one with a number of great properties:

  1. Chuck can authenticate both Alice and Bob to ensure that a malicious party is not attempting to fake a transaction without their knowledge.
  2. The presence of the receipt in Chuck's books is proof of the transaction. If Alice claims the transaction never took place then Bob can go to Chuck and ask for his receipt to disprove Alice's claim.
  3. The absence of the receipt in Chuck's books is proof that the transaction never took place. Neither Alice nor Bob can fake a transaction. They may be able to fake their copy of the receipt and claim that the other party is lying, but once again, they can go to Chuck and check his books.
  4. Neither Alice nor Bob can tamper with an existing transaction. If either of them does, they can go to Chuck and verify their copies against the one stored in his books.

What we have above is an implementation of "triple-entry bookkeeping", which is simple to implement and offers good protection for both participants. Except, of course you've already spotted the weakness, right? We've placed a lot of trust in an intermediary. If Chuck decides to collude with either party, then the entire system falls apart.

Moral of the story? Be (very) careful about your choice of the intermediary!

Securing transactions with PKI #

Dissatisfied with the dangers of using a "reliable intermediary", Bob decides to do some research and discovers that public key cryptography can eliminate the need for an intermediary! This warrants some explanation…

Public-key cryptography, also known as asymmetric cryptography, refers to a cryptographic algorithm which requires two separate keys, one of which is secret (or private) and one of which is public. Although different, the two parts of this key pair are mathematically linked. The public key is used to encrypt plaintext or to verify a digital signature; whereas the private key is used to decrypt ciphertext or to create a digital signature.

The original intent behind using a third party (Chuck) was to ensure three properties:

  • Authentication: a malicious party can't masquerade as someone else.
  • Non-repudiation: participants can't claim that the transaction did not happen after the fact.
  • Integrity: the transaction receipt can't be modified after the fact.

Turns out, public key cryptography can satisfy all of the above requirements. Briefly, the workflow is as follows:

  1. Both Alice and Bob generate a set public-private keypairs.
  2. Both Alice and Bob publish their public keys to the world.
  3. Alice writes a transaction receipt in plaintext.
  4. Alice encrypts the plaintext of the transaction using her private key.
  5. Alice prepends a plaintext "signed by" note to the ciphertext.
  6. Both Alice and Bob store the resulting output.
Note that step #5 is only required when many parties are involved: if you don't know who signed the message then you don't know whose public key you should be using to decrypt it. This will become relevant very soon...

This may seem like a lot of work for no particular reason, but let's examine the properties of our new receipt:

  1. Bob doesn't know Alice's private key, but that doesn't matter because he can look up her public key (which is shared with the world) and use it to decrypt the ciphertext of the transaction.
  2. Alice is not really "encrypting" the contents of the transaction. Instead, by using her private key to encode the transaction she is "signing it": anyone can decrypt the ciphertext by using her public key, and because she is the only one in possession of the private key this mechanism guarantees that only she could have generated the ciphertext of the transaction.
How does Bob, or anyone else for that matter, get Alice's public key? There are many ways to handle distribution of public keys - e.g. Alice publishes it it on her website. We'll assume that some such mechanism is in place.

As a result, the use of public key infrastructure (PKI) fulfills all of our earlier requirements:

  1. Bob can use Alice's public key to authenticate signed transaction by decrypting the ciphertext.
  2. Only Alice knows her private key, hence Alice can't deny that the transaction took place - she signed it.
  3. Neither Bob nor anyone else can fake or modify a transaction without access to Alice's private key.
Note that for #2, Alice can deny that she is the true owner of the public-private keypair in question - i.e. someone is faking her identity in #1. The keypair to identity association is something your key distribution mechanisms needs to account for.

Both Alice and Bob simply store a copy of the signed transaction and the need for an intermediary is eliminated. The "magic" of public key cryptography is a perfect match for their two-party barter system.

Balance = Σ(receipts) #

With the PKI infrastructure in place, Bob and Alice complete a few additional trades: Alice acquires another stamp from Bob and Bob picks up a stamp from Alice. They each follow the same steps as before to generate signed transactions and append them to their respective ledgers.

The records are secure, but there is a small problem: it's not clear if either party has an outstanding balance. Previously, with just one transaction, it was clear who owed whom (Alice owed Bob) and how much (one red stamp), but with multiple transactions the picture gets really murky. Are all stamps of equal value? If so, then Alice has a negative balance. If not, then it's anyone's guess! To resolve this, Alice and Bob agree on the following:

  • Yellow stamp is worth twice the value of a red stamp.
  • Blue stamp is equal in value to a red stamp.

Finally, to ensure that their new agreement is secure they regenerate their ledgers by updating each transaction with its relative value. Their new ledgers now look as follows:

With that, computing the final balance is now a simple matter of iterating through all of the transactions and applying the appropriate debits and credits to each side. The net result is that Alice owes Bob 2... units of value. What's a "unit of value"? It's an arbitrary medium of exchange that Alice and Bob have agreed on. Further, since "unit of value" doesn't roll off the tongue, Alice and Bob agree to call 1 unit of value as 1 chroma (plural: chroms).

All of the above seems trivial, but the fact that the balance of each party is a function of all of the receipts in the ledger has an important consequence: anyone can compute everyone's balance. There is no need for any trusted intermediaries and the system is trivial to audit. Anyone can traverse the full ledger, verify the trades, and figure out the outstanding balances of each party.

Multi-party transfers & verification #

Next, Bob stumbles across a stamp owned by John that he really likes. He tells John about the secure ledger he is using with Alice and asks him if he would be willing to do a trade where Bob transfers his balance with Alice as a method of payment - i.e. Bob gets the stamp from John, and Alice would owe John the amount she previously owed Bob. John agrees, but now they have dilemma. How exactly does Bob "transfer" his balance to John in a secure and verifiable manner? After some deliberation, they arrive at an ingenious plan:

Bob creates a new transaction by following the same procedure as previously, except that he first computes the SHA-256 checksum (a unique fingerprint) of the encrypted transaction he wants to transfer and then inserts the checksum in the "What" field of the new receipt. In effect, he is linking the new transfer receipt to his previous transaction with Alice, and by doing so, transfers its value to John.

To keep things simple, we'll assume that all transfers "spend" full value of the transaction being transfered. It's not too hard to extend this system to allow fractional transfers, but that's unnecessary complexity at this point.

With the new transaction in place, John makes a copy of the encrypted ledger for his safekeeping (now there are three copies) and runs some checks to verify its integrity:

  1. John fetches Alice's and Bob's public keys and verifies the first three transactions.
  2. John verifies that Bob is transferring a "valid" transaction:
    • The transaction that is being transferred is addressed to Bob.
    • Bob has not previously transfered the same transaction to anyone else.

If all the checks pass, they complete the exchange and we can compute the new balances by traversing the ledger: Bob has a net zero balance, Alice has a debit of 2 chroms, and John has a credit of 2 chroms (courtesy of Alice). Further, John can now take his new ledger to Alice and ask her for payment, and even though Alice wasn't present for their transaction, that's not a problem:

  • Alice can verify the signature of the new transfer transaction using Bob's public key.
  • Alice can verify that the transfer transaction is referencing one of her own valid transactions with Bob.
The above transfer and verification process is a pretty remarkable property of the system! Note that to make it all work, we need two enabling technologies: (a) use of PKI, which enables digital signature verification, and (b) the receipt ledger, which enables us to look at the full transaction history to verify balances and to link previous transactions to enable the transfer.

Satisfied with their ingenuity John and Bob part ways: Bob goes home with a new stamp and John with a new ledger. On the surface, everything looks great, but they've just exposed themselves to a challenging security problem... Can you spot it?

Double-spending and distributed consensus #

Shortly after completing the transaction with John, Bob realizes that they have just introduced a critical flaw into their system and one that he could exploit to his advantage if he acts quickly: both Bob and John have updated their ledgers to include the new transaction, but neither Alice nor anyone else is aware that it has taken place. As a result, there is nothing stopping Bob from approaching other individuals in his network and presenting them with an old copy of the ledger that omits his transaction with John! If he convinces them to do a transaction, just as he did with John, then he can "double-spend" the same transaction as many times as he wants!

Of course, once multiple people show up with their new ledgers and ask Alice for payment, the fraud will be detected, but that is of little consolation - Bob has already run away with his loot!

The double-spend attack was not possible when we only had two participants since in order to complete the transaction you'd verify and update both sets of books simultaneously. As a result, all ledgers were always in sync. However, the moment we added an extra participant we introduced the possibility of incomplete and inconsistent ledgers between all the participants, which is why the double-spend is now possible.

In CS speak, a two-party ledger provides "strong consistency", and growing the ledger beyond two parties requires some form of distributed consensus to mitigate double-spend.

The simplest possible solution to this problem is to require that all parties listed in the ledger must be present at the time when each new transaction is made, such that everyone can update their books simultaneously. An effective strategy for a small-sized group, but also not a scalable one for a large number of participants.

Requirements for a distributed consensus network #

Let's imagine that we want to scale our ledger to all stamp collectors around the globe such that anyone can trade their favorite stamps in a secure manner. Obviously, requiring that every participant must be present to register each transaction would never work due to geography, timezones, and other limitations. Can we build a system where we don't need everyone's presence and approval?

  1. Geography is not really an issue: we can move communication online.
  2. Timezone problems can be solved with software: we don't need each individual to manually update their ledgers. Instead, we can build software that can run on each participant's computer and automatically receive, approve, and add transactions to the ledger on their behalf.

In effect, we could build a peer-to-peer (P2P) network that would be responsible for distributing new transactions and getting everyone's approval! Except, unfortunately that's easier said than done in practice. For example, while a P2P network can resolve our geography and timezone problems, what happens when even just one of the participants goes offline? Do we block all transactions until they're back online?

Note that the "how" of building a P2P network is a massive subject in its own right: protocols and signaling, traversing firewalls and NATs, bootstrapping, optimizing how updates are propagated, security, and so on. That said, the low-level mechanics of building such a network are out of scope of our discussion... we'll leave that as an exercise for the reader.

Turns out, distributed consensus is a well studied problem in computer science, and one that offers some promising solutions. For example, two-phase commit (2PC) and Paxos both enable a mechanism where we only need the majority quorum (50%+) of participants to be present to safely commit a new transaction: as long as the majority has accepted the transaction the remainder of the group is guaranteed to eventually converge on the same transaction history.

That said, neither 2PC nor Paxos are sufficient on their own. For example, how would either 2PC or Paxos know the total number of participants in our P2P stamp-collector network when new participants are joining on a daily basis and others are disappearing without notice? If one of the prior participants is offline, are they offline temporarily, or have they permanently left the network? Similarly, there is another and an even more challenging "Sybil attack" that we must account for: there is nothing stopping a malicious participant from creating many profiles to gain an unfair share of voting power within our P2P network.

If the number of participants in the system was fixed and their identities were authenticated and verified (i.e. a trusted network), then both 2PC and Paxos would work really well. Alas, that is simply not the case in our ever changing stamp collector P2P network. Have we arrived at a dead end? Well, not quite…

One obvious solution to solve our dilemma is to eliminate the "distributed" part from the problem statement. Instead of building a P2P distributed system we could, instead, build a global registry of all stamp collectors that would record their account information, authenticate them and (try to) ensure that nobody is cheating by creating multiple identities, and most importantly, keep one shared copy of the ledger! Concretely, we could build a website where these trades can take place, and the website would then take care of ensuring the integrity and correct ordering of all transactions by recording them in its centralized database.

The above is a practical solution but, let's admit it, an unsatisfying one since it forces us to forfeit the peer-to-peer nature of our ledger system. It places all of the trust in a single centralized system and opens up an entirely new set of questions: what is the uptime, security and redundancy of the system; who maintains the system and what are their incentives; who has administrative access, and so on. Centralization brings its own set of challenges.

Let's rewind and revisit some of the problems we've encountered with our P2P design:

  • Ensuring that every participant is always up to date (strongly consistent system) imposes high coordination costs and affects availability: if a single peer is unreachable the system cannot commit new transactions.
  • In practice we don't know the global status of the P2P network: number of participants, whether individuals are temporarily offline or decided to leave the network, etc.
  • Assuming we can resolve the above constraints, the system is still open to a Sybil attack where a malicious user can create many fake identities and exercise unfair voting power.

Unfortunately, resolving all of the above constraints is impossible unless we relax some of the requirements: the CAP theorem tells us that our distributed system can't have strong consistency, availability, and partition tolerance. As a result, in practice our P2P system must operate under the assumption of weak(er) consistency and deal with its implications:

  • We must accept that some ledgers will be out of sync (at least temporarily).
  • The system must eventually converge on a global ordering (linearizability) of all transactions.
  • The system must resolve ledger conflicts in a predictable manner.
  • The system must enforce global invariants - e.g. no double-spends.
  • The system should be secure against Sybil and similar attacks.

Protecting the network from Sybil attacks #

Achieving consensus in a distributed system, say by counting votes of each participant, opens up many questions about the "voting power" of each peer: who is allowed to participate, do certain peers have more voting power, is everyone equal, and how do we enforce these rules?

To keep things simple, let's say everyone's vote is equal. As a first step, we could require that each participant sign their vote with their private key, just as they would a transaction receipt, and circulate it to their peers - signing a vote ensures that someone else can't submit a vote on their behalf. Further, we could make a rule that only one vote is allowed to be submitted. If multiple votes are signed by the same key then all of them are discarded - make up your mind already! So far so good, and now the hard part…

How do we know if any particular peer is allowed to participate in the first place? If all that's needed is just a unique private key to sign a vote, then a malicious user could simply generate an unlimited number of new keys and flood the network. The root problem is that when forged identities are cheap to generate and use, any voting system is easily subverted.

To solve this problem we need to make the process of submitting the vote "expensive". Either the cost of generating a new identity must be raised, or the very process of submitting a vote must incur sufficiently high costs. To make this concrete, consider some real-world examples:

  • When you show up to vote in your local government election, you are asked to present an ID (e.g. a passport) that is (hopefully) expensive to fake. In theory, nothing stops you from generating multiple fake IDs, but if the costs are high enough (monetary costs of making a fake, risk of being caught, etc), than the cost of running a Sybil attack will outweigh its benefits.
  • Alternatively, imagine that you had to incur some other cost (e.g. pay a fee) to submit a vote. If the cost is high enough, then once again, the barrier to running a large-scale Sybil attack is increased.

Note that neither of the above examples "solves" the Sybil attack completely, but they also don't need to: as long as we raise the cost of the attack to be larger than the value gained by successfully subverting the system, then the system is secure and behaves as intended.

Note that we're using a loose definition of "secure". The system is still open for manipulation, and the exact vote count is affected, but the point is that a malicious participant doesn't affect the final outcome.

Proof-of-work as a participation requirement #

Any user can easily (and cheaply) generate a new "identity" in our P2P network by generating a new private-public keypair. Similarly, any user can sign a vote with their private key and send it into the P2P network - that's also cheap, as the abundance of spam email in our inboxes clearly illustrates. Hence, submitting new votes is cheap and a malicious user can easily flood the network with as many votes as they wish.

However, what if we made one of the steps above expensive such that you had to expend significantly more effort, time, or money? That's the core idea behind requiring a proof-of-work:

  1. The proof-of-work step should be "expensive" for the sender.
  2. The proof-of-work step should be "cheap" to verify by everyone else.

There are many possible implementations of such a method, but for our purposes we can re-use the properties provided by the cryptographic hash functions we encountered earlier:

  1. It is easy to compute the hash value for any given message.
  2. It is expensive to generate a message that has a given hash value.

We can impose a new rule in our system requiring that every signed vote must have a hash value that begins with a particular substring - i.e. require a partial hash collision of, say, two zero prefix. If this seems completely arbitrary, that's because it is - stay with me. Let's walk through the steps to see how this works:

  1. Let's say a valid vote statement is a simple string: "I vote for Bob".
  2. We can use the same SHA-256 algorithm to generate a hash value for our vote.
    • sha256("I vote for Bob") → b28bfa35bcd071a321589fb3a95cac...
  3. The resulting hash value is invalid because it does not start with our required substring of two zeros.
  4. We modify the vote statement by appending an arbitrary string and try again:
    • sha256("I vote for Bob - hash attempt #2") → 7305f4c1b1e7...
  5. The resulting hash value does not satisfy our condition either. We update the value and try again, and again, and… 155 attempts later we finally get:
    • sha256("I vote for Bob - hash attempt #155") → 008d08b8fe...

The critical property of the above workflow is that the output of the cryptographic hash function (SHA-256 in this case) is completely different every time we modify the input: the hash value of the previous attempt does not tell us anything about what the hash value of the next attempt when we increment our counter. As a result, generating a valid vote is not just "hard problem", but also one better described as a lottery where each new attempt gives you a random output. Also, we can adjust the odds of the lottery by changing the length of the required prefix:

  1. Each character of the SHA-256 checksum has 16 possible values: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f.
  2. In order to generate a hash with a valid two zero prefix the sender will need 256 (162) attempts on average.
  3. Bumping the requirement to 5 zeros will require more than 1,000,000 (165) attempts on average… Point being, we can easily increase the cost and make the sender spend more CPU cycles to find a valid hash.
How many SHA256 checksums can we compute on a modern CPU? The cost depends on the size of the message, CPU architecture, and other variables. If you're curious, open up your console and run a benchmark: $> openssl speed sha.

The net result is that generating a valid vote is "expensive" for the sender, but is still trivial to verify for the receiver: the receiver hashes the transaction (one operation) and verifies that the checksum contains the required hash collision prefix... Great, so how is this useful for our P2P system? The above proof-of-work mechanism allows us to adjust the cost of submitting a vote such that the total cost of subverting the system (i.e. spoofing enough valid votes to guarantee a certain outcome) is higher than the value gained by attacking the system.

Note that the "high cost to generate a message" is a useful property in many other contexts. For example, email spam works precisely because it is incredibly cheap to generate a message. If we could raise the cost of sending an email message - say, by requiring a proof-of-work signature - then we could break the spam business model by raising costs to be higher than profits.

Building the minimum viable block chain #

We've covered a lot of ground. Before we discuss how the block chain can help us build a secure distributed ledger, let's quickly recap the setup, the properties, and the unsolved challenges within of our network:

  1. Alice and Bob complete a transaction and record it in their respective ledgers.
    1. Once done, Bob has a PKI-protected IOU from Alice.
  2. Bob completes a transaction with John where he transfers Alice's IOU to John. Both Bob and John update their ledgers, but Alice doesn't know about the transaction… yet.
    1. Happy scenario: John asks Alice to redeem his new IOU; Alice verifies his transaction by fetching Bob's public key; if the transaction is valid she pays John the required amount.
    2. Not so happy scenario: Bob uses his old ledger that omits his transaction with John to create a double-spend transaction with Katy. Next, both Katy and John show up at Alice's doorstep and realize that only one of them will get paid.

The double-spend is possible due to the "weak consistency" of the distributed ledger: neither Alice nor Katy know about John and Bob's transaction, which allows Bob to exploit this inconsistency to his advantage. Solution? If the network is small and all participants are known, we can require that each transaction must be "accepted" by the network before it is deemed valid:

  • Unanimous consensus: whenever a transaction takes place the two parties contact all other participants, tell them about the transaction, and then wait for their "OK" before they commit the transaction. As a result, all of the ledgers are updated simultaneously and double-spend is no longer possible.
  • Quorum consensus: to improve processing speed and availability of the network (i.e. if someone is offline, we can still process the transaction) we can relax the above condition of unanimous consensus to quorum consensus (50% of the network).

Either of the above strategies would solve our immediate problem for a small network of known and verified participants. However, neither strategy scales to a larger, dynamic network where neither the total number of participants is known at any point in time, nor their identity:

  1. We don't know how many people to contact to get their approval.
  2. We don't know whom to contact to get their approval.
  3. We don't know whom we are calling.
Note that we can use any means of communication to satisfy above worfklow: in person, internet, avian carriers, etc!

Lacking identity and global knowledge of all the participants in the network we have to relax our constraints. While we can't guarantee that any particular transaction is valid, that doesn't stop us from making a statement about the probability of a transaction being accepted as valid:

  • Zero-confirmation transaction: we can accept a transaction without contacting any other participants. This places full trust on the integrity of the payer of the transaction - i.e. they won't double-spend.

  • N-confirmation transaction: we can contact some subset of the (known) participants in the network and get them to verify our transaction. The more peers we contact, the higher the probability that we will catch malicious parties attempting to defraud us.

What is a good value for "N"? The answer depends on the amount being transferred and your trust and relationship with the opposite party. If the amount is small, you may be willing to accept a higher level of risk, or, you may adjust your risk tolerance based on what you know about the other party. Alternatively, you will have to do some extra work to contact other participants to validate your transaction. In either case, there is a tradeoff between the speed with which the transaction is processed (zero-confirmation is instant), the extra work, and the risk of that transaction being invalid.

So far, so good. Except, there is an additional complication that we must consider: our system relies on transaction confirmations from other peers, but nothing stops a malicious user from generating as many fake identities as needed (recall that an "identity" in our system is simply a public-private keypair, which is trivial to generate) to satisfy Katy's acceptance criteria.

Whether Bob decides to execute the attack is a matter of simple economics: if the gain is higher than the cost then he should consider running the attack. Conversely, if Katy can make the cost of running the attack higher than the value of the transaction, then she should be safe (unless Bob has a personal vendetta and/or is willing to lose money on the transaction... but that's out of scope). To make it concrete, let's assume the following:

  • Bob is transferring 10 chroms to Katy.
  • The cost of generating a fake identity and transaction response is 0.001 chroms: energy costs to keep the computer running, paying for internet connectivity, etc.

If Katy asks for 1001 confirmations, then it no longer makes (economic) sense for Bob to run the attack. Alternatively, we could add a proof-of-work requirement for each confirmation and raise the cost for each valid response from 0.001 chroms to 1: finding a valid hash will take CPU time, which translates to a higher energy bill. As a result, Katy would only need to ask for 11 confirmations to get the same guarantee.

Note that Katy also incurs some costs while requesting each confirmation: she has to expend effort to send out the requests and then validate the responses. Further, if the cost of generating a confirmation and verifying it is one-to-one, then Katy will incur the same total cost to verify the transaction as its value… which, of course, makes no economic sense.

This is why the asymmetry of proof-of-work is critical. Katy incurs low cost to send out requests and validate the responses, but the party generating the confirmation needs to expend significantly more effort to generate a valid response.

Great, problem solved, right? Sort of... in the process we've created another economic dilemma. Our network now incurs a cost to validate each transaction that is of equal or higher value than the transaction itself. While this acts as an economic deterrent against malicious participants, why would any legitimate participant be willing to incur any costs for someone else? A rational participant simply wouldn't, it doesn't make sense. Doh.

Adding "blocks" & transaction fee incentives #

If participants in the network must incur a cost to validate each other's transactions, then we must provide an economic incentive for them to do so. In fact, at a minimum we need to offset their costs, because otherwise an "idle" participant (anyone who is not submitting their own transactions) would continue accruing costs on behalf of the network — that wouldn't work. Also a couple of other problems that we need address:

  1. If the cost of verifying the transaction is equal to or higher than the value of the transaction itself (to deter malicious participants), than the total transaction value is net-zero, or negative! E.g. Bob transfers 10 chroms to Katy; Katy spends 10 chroms to compensate other peers to validate the transaction; Katy is sad.

  2. How does Katy pay for confirmations? If that's its own transaction then we have a recursive problem.

Let's start with the obvious: the transaction fee can't be as high as the value of the transaction itself. Of course, Katy doesn't have to spend the exact value to confirm the transaction (e.g. she can allocate half the value for confirmations), but then it becomes a question of margins: if the remaining margin (value of transaction - verification fees) is high enough, than there is still incentive for fraud. Instead, ideally we would like to incur the lowest possible transaction fees and still provide a strong deterrent against malicious participants. Solution?

We can incentivize participants in the network to confirm transactions by allowing them to pool and confirm multiple transactions at once - i.e. confirm a "block" of transactions. Doing so would also allow them to aggregate transaction fees, thereby lowering the validation costs for each individual transaction.

A block is simply a collection (one or more) of valid transactions - think of it as the equivalent of a page in a physical ledger. In turn, each block contains a reference to a previous block (previous page) of transactions, and the full ledger is a linked sequence of blocks. Hence, block chain. Consider the example above:

  1. Alice and Bob generate new transactions and announces them to the network.
  2. Chris is listening for new transaction notifications, each of which contains a transaction fee that the sender is willing to pay to get it validated and confirmed by the network:
    1. Chris aggregates unconfirmed transactions until he has a direct financial incentive (sum of transaction fees > his cost) to perform the necessary work to validate the pending transactions.
    2. Once over the threshold, Chris first validates each pending transaction by checking that none of the inputs are double-spent.
    3. Once all transactions are validated Chris adds an extra transaction to the pending list (indicated in green in the diagram above) that transfers the sum of advertised transaction fees to himself.
    4. Chris generates a block that contains the list of pending transactions, a reference to the previous block (such that we can traverse the blocks and see the full ledger), and performs the proof-of-work challenge to generate a block hash value that conforms to accepted rules of the network - e.g. partial hash collision of N leading zeros.
    5. Finally, once Chris finds a valid block, he distributes it to all other participants.
  3. Both Alice and Bob are listening for new block announcements and look for their transaction in the list:
    1. Alice and Bob verify integrity of the block - i.e. verify proof-of-work and contained transactions.
    2. If the block is valid and their transaction is in the list, then the transaction has been confirmed!
We made a big leap here. Previously we've only had one type of record in our network - the signed transaction. Now we have signed transactions and blocks. The former is generated by the individuals engaging in trade, and the latter is generated by parties interested in collecting fees by validating and confirming transactions.

Also, note that the above scheme requires some minimum volume of transactions in the system to sustain the incentives for individuals creating the blocks: the more transactions there are, the lower the fees have to be for any single transaction.

Phew, ok, Alice has announced a new transaction and received a valid block from Chris confirming it. That's one confirmation, what about the rest? Also, Chris is (hopefully) not the only participant who is incentivized to work on generating the blocks. What if someone else generates a different block at the same time, and which of those blocks is "valid"? This is where it gets interesting...

Racing to claim the transaction fees #

The remarkable part about introducing the ability to aggregate fees by verifying a block of transactions is that it creates a role for a new participant in the network who now has a direct financial incentive to secure it. You can now make a profit by validating transactions, and where there is profit to be made, competition follows, which only strengthens the network - a virtuous cycle and a clever piece of social engineering!

That said, the incentive to compete to validate transactions creates another interesting dilemma: how do we coordinate this block generation work in our distributed network? The short answer is, as you may have already guessed, we don't. Let's add some additional rules into our system and examine how they resolve this problem:

  1. Any number of participants is allowed to participate ("race") to create a valid block. There is no coordination. Instead, each interested participant listens for new transactions and decides whether and when they want to try to generate a valid block and claim the transaction fees.
  2. When a valid block is generated, it is immediately broadcast into the network.
    1. Other peers check the validity of the block (check each transaction and validity of the block itself), and if valid, add it to their ledgers and then finally rebroadcast it to other peers in the network.
    2. Once added, the new block becomes the "topmost block" of their ledger. As a result, if that same peer was also working on generating a block, then they need to abort their previous work and start over: they now need to update their reference to the latest block and also remove any transactions from their unconfirmed list that are contained in the latest block.
    3. Once the above steps are complete, they start working on a new block, with the hope that they'll be the first ones to discover the next valid block, which would allow them to claim the transaction fees.
  3. … repeat the above process until the heat death of the universe.

The lack of coordination between all the participants working on generating the blocks means there will be duplicate work in the network, and that's OK! While no single participant is guaranteed to claim any particular block, as long as the expected value (probability of claiming the block times the expected payout, minus the costs) of participating in the network is positive, then the system is self-sustaining.

Note that there is also no consensus amongst the peers on which transactions should be validated next. Each participant aggregates their own list and can use different strategies to optimize their expected payoff. Also, due to the nature of our proof-of-work function (finding a partial hash collision for a SHA-256 checksum of the block), the only way to increase the probability of claiming a block is to expend more CPU cycles.

There is one more caveat that we need to deal with: it's possible that two peers will find a valid block at about the same time and begin propagating through the network - e.g. Kent and Chris in the diagram above. As a result, some fraction of the network may end up accepting Kent's block as topmost block, while the rest will take Chris's block. Now what?

Resolving chain conflicts #

Once again, we're going to take a hands-off approach and let the random nature of the block generation process resolve the conflict, albeit with one additional rule: if multiple chains are detected, the participants should immediately switch to and build on top of the longest chain. Let's see how this works in practice:

  1. Some peers will start building new blocks on top of Kent's block, others on top of Chris's block.
  2. At some point, someone will find a new block and begin propagating it through the network.
    1. When other peers receive the new block, the part of the network that was working with a different topmost block will detect that there is now a longer alternative chain, which means that they need to switch to it - e.g. in the above example, the peers who were working with Chris's block stop their work, drop Chris's block, and switch to the longer (Amy + Kent's) chain.
    2. Any transactions that are part of the discarded block but that are not yet confirmed are placed in the pending list and the process starts over.
It's possible that the race condition can persist for multiple blocks, but eventually one branch will race ahead of the other and the rest of the network will converge on the same longest chain.

Great, we now have a strategy to resolve conflicts between different chains in the network. Specifically, the network promises linearizability of transactions by recording them in a linked list of blocks. But, crucially, it makes no promises about an individual block "guaranteeing" the state of any transaction. Consider the example above:

  • Alice sends out her transaction into the network.
  • Chris generates a valid block that confirms her transaction.

Except, there is a fork in the chain and Chris's block is later "removed" as the network converges on Kent's branch of the chain. As a result, even when Alice receives a block with her transaction, she can't be sure that this block won't be undone in the future!

Blocks are never "final" #

No block is "final", ever. Any block can be "undone" if a longer chain is detected. In practice, forks should be detected fairly quickly, but there is still always the possibility of an alternative chain. Instead, the only claim we can make is that the "deeper" any particular block is in the chain, the less likely it is that it will undone. Consequently, no transaction can ever be treated as "final" either, we can only make statements about the probability of it being undone.

  1. 0-confirmation transaction: exchange is performed without waiting for any block to include the transaction.
  2. 1-confirmation transaction: latest valid block includes the transaction.
  3. N-confirmation transaction: there is a valid block that includes the transactions, and there are N-1 blocks that have since been built on top of that block.

If you are willing to accept the risk, you always have the option to go with a 0-confirmation transaction: no transaction fees, no need to wait for confirmations. However, you also place a lot of trust in the opposite party.

Alternatively, if you want to lower your risk, then you should wait for one or more blocks to be built on top of the block that includes your transaction. The longer you wait, the more blocks will be built on top of the block that contains your transaction, the lower the probability of an alternative chain that may undo your transaction.

By "undo" we mean any scenario where one of the participants can make the network accept an alternative transaction transferring funds to any account other than yours - e.g. you complete the transaction, hand over the widget and get a receipt, but the attacker then injects a transaction that "double-spends" those same funds to an alternative account.

Why does the length of the block chain act as a good proxy for "safety" of a transaction? If an attacker wanted to undo a particular transaction, then they will need to build a chain that begins at a block prior to the one where that transaction is listed, and then build a chain of other blocks that is longer than the one currently used by the network. As a result, the deeper the block, the higher the amount of computational effort that would be required to replace it by creating an alternative chain. The longer the chain the more expensive it is to execute an attack.

How many blocks should you wait for before accepting a transaction? There is no one number, the answer depends on the properties of the network (time to generate each block, propagation latency of the transactions and blocks, size of the network, etc), and the transaction itself: it's value, what you know about the other party, your risk profile, and so on.

Properties of the (minimum viable) block chain#

  1. Individual transactions are secured by PKI.
    • Transactions are authenticated: a malicious party can't masquerade as someone else and sign a transaction on their behalf.
      • Authentication is only with respect to the public-private keypair. There is no requirement for "strong authentication" that links the keypair to any other data about the participants. In fact, a single participant can generate and use multiple keypairs! In this sense, the network allows anonymous transactions.
    • Non-repudiation: participants can't claim that the transaction did not happen after the fact.
    • Integrity: transactions can't be modified after the fact.
  2. Once created, transactions are broadcast into the P2P network.
    • Participants form a network where transactions and blocks are relayed amongst all the participating peers. There no central authority.
  3. One or more transactions are aggregated into a "block".

    • A block validates one or more transactions and claims the transaction fees.
      • This allow the transaction fees to remain small relative to the value of each transaction.
    • A valid block must have a valid proof-of-work solution.
      • Valid proof-of-work output is hard to generate and cheap to verify.
      • Proof-of-work is used to raise the cost of generating a valid block to impose a higher cost on running an attack against the network.
    • Any peer is allowed to work on generating a valid block, and once a valid block is generated, it is broadcast into the network.
      • Any number of peers can compete to generate a valid block. There is no coordination. When a fork is detected, it is resolved by automatically switching to the longest chain.
    • Each block contains a link to the previous valid block, allowing us to traverse the full history of all recorded transactions in the network.
  4. Peers listen for new block announcements and merge them into their ledgers.

    • Inclusion of the transaction in a block acts as a "confirmation" of that transaction, but that fact alone does not "finalize" any transaction. Instead, we rely on the length of the chain as a proxy for "safety" of the transaction. Each participant can choose their own level of risk tolerance, ranging from 0-confirmation transactions to waiting for any arbitrary number of blocks.

The combination of all of the above rules and infrastructure provides a decentralized, peer-to-peer block chain for achieving distributed consensus of ordering of signed transactions. That's a mouthful, I know, but it's also an ingenious solution to a very hard problem. The individual pieces of the block-chain (accounting, cryptography, networking, proof-of-work), are not new, but the emergent properties of the system when all of them are combined are pretty remarkable.

Script-injected "async scripts" considered harmful

Posted 14 days back at

Synchronous scripts are bad because they force the browser to block DOM construction, fetch the script, and execute it before the browser can continue processing the rest of the page. This, of course, should not be news, and is the reason why we have been evangelizing the use of asynchronous scripts. Here is the canonical example:

<!-- BAD: blocking external script -->
<script src="//"></script>

<!-- GOOD: remote script is loaded asynchronously -->
    var script = document.createElement('script');
    script.src = "//";

What's the difference? In the "bad" example we block DOM construction, wait to fetch the script, execute it, and then continue to process the rest of the document. In the second example, we begin executing the inline script, which creates a script element pointing to an external resource, add it to the document, and continue processing the DOM. The difference is subtle but very important: script-injected scripts do not block on the network.

Script-inject all the things? Not so fast.

The inline JavaScript solution has a subtle, but very important (and an often overlooked) performance gotcha: inline scripts block on CSSOM before they are executed. Why? The browser does not know what the inline block is planning to do in the script it is about to execute, and because JavaScript can access and manipulate the CSSOM, it blocks and waits until the CSS is downloaded, parsed, and the CSSOM is constructed and available. A hands-on network waterfall is worth thousands words, consider this example:

The above page loads a CSS file at the top of the page and two script-injected "async scripts" at the bottom. In other words, it follows all of the "performance best practices". Except, the scripts themselves can't be executed until the CSSOM is ready, which delays the execution of the inline blocks and consequently the dispatch of the network requests. As a result, the scripts are executed ~3.5s after the page request is initiated.

Note that I'm intentionally forcing a large two second response delay on CSS and one second delay on JavaScript to highlight the dependency between CSS/CSSOM and JavaScript execution.

Now, let's compare this to our "bad" example, which uses two blocking script tags:

Wait a second, what's going on? Both scripts are fetched earlier and are executed ~2.7 seconds after the page request is initiated. Note that the scripts are still executed only once the CSS is available (~2.7 second mark), but because the scripts are already fetched by the time the CSSOM is ready, we can execute them immediately, saving over a second in our processing times. Have we been doing it all wrong?

Before we answer that, let's consider one more example, this time with the "async" attribute:

<script src="" async></script>
<script src="" async></script>

If the async attribute is present, then the script will be executed asynchronously, as soon as it is available. If the async attribute is not present ... then the script is fetched and executed immediately, before the user agent continues parsing the page.

The async attribute on the script tag provides two critical properties: it tells the browser to not block DOM construction, and it does not block script execution on CSSOM. As a result, the scripts are executed as soon as they are downloaded (at ~1.6 seconds) and without having to wait for the CSSOM. A quick summary of our results:

script execution onload
script-injected ~3.7s ~3.7s
blocking script ~2.7s ~2.7s
async attribute ~1.7s ~2.7s

So, why have we been advocating the use of this pattern for so long?

  1. "async" is not supported by a few older browsers: IE 8/9, Android 2.2/2.3. As a result, these older browsers ignore the attribute and treat it as a blocking script. This was a big problem back in the day, but this also brings us to the next point...

  2. All modern browsers have a "preload scanner" (yes, even IE8/9 and Android 2.3/2.2) which is invoked when the document parser is blocked and whose sole responsibility is to "peek ahead" in the document and find resources that should be fetched as soon as possible to unblock the critical rendering path.

The script-injected pattern offers no benefits over <script async>. The reason it exists is because <script async> was not available and preload scanners did not exist back when it was first introduced. However, that era has now passed, and we need to update our best practices to use async attribute instead of script-injected scripts. In short, script-injected "async scripts" considered harmful.

Also, note that the preload scanner will only discover resources that are specified via src/href attributes on script and link tags. The preload scanner cannot and does not execute inline JavaScript blocks, which means that any script-injected assets cannot be discovered by the preload scanner. As a result, the new best practice:

<!-- BAD: the pre async / pre preload scanner era -->
    var script = document.createElement('script');
    script.src = "//";

<!-- GOOD: simpler, faster, and better all around -->
<script src="//" async></script>
<script src="..."> <script async src="...">
Blocks DOM construction Does not block DOM construction
Execution is blocked on CSSOM Execution is not blocked on CSSOM
Preload scanner discoverable Preload scanner discoverable
Ordered execution of scripts Unordered execution
Use where execution order matters, place these scripts at the bottom. Can be placed anywhere, ideal for scripts that can tolerate out-of-order execution.

What about the defer attribute?

defer was introduced prior to async and, in theory, was supposed to guarantee that the scripts would not block the parser, and would be executed before DOMContentLoaded event in the order they were inserted. Unfortunately, in practice, the execution order implementation is buggy and cannot be relied on. That said, defer is still a useful attribute as it unblocks the document parser in older IE browsers (5.5-9) that do not support the async attribute. As a result, we can combine async and defer to further improve performance.

<!-- Modern browsers will use 'async', older browsers will use 'defer' -->
<script src="//" async defer></script>
The defer attribute may be specified even if the async attribute is specified, to cause legacy Web browsers that only support defer (and not async) to fall back to the defer behavior instead of the synchronous blocking behavior that is the default.

But wait, what about…

To be clear, that's not to say that all inline JavaScript should be avoided. There is time and place where it may, in fact, be the right solution. A few things to consider and to keep in mind:

  1. Async attribute makes no guarantees about execution order: scripts are executed as they arrive, their order and location in the document has no effect. As a result, if you have dependencies, can you eliminate them? Alternatively, can you defer their execution, or make the order of execution a non-issue? The asynchronous function queuing pattern is a good one to investigate.

  2. The asynchronous function queuing pattern requires that I initialize some variables, which implies that I need an inline script block, are we back to square one? No. If you place your inline block above any CSS declarations - yes, you read that right, JavaScript above CSS in document <head> - then the inline block is executed immediately. The problem with inline blocks is that they must block on CSSOM, but if we place them before any CSS declarations, then they are executed immediately.

  3. Wait, should I just move all of my JavaScript above the CSS then? No. You want to keep your <head> lean to allow the browser to discover your CSS and begin parsing the actual page content as soon as possible - i.e. optimize the content you deliver in your first round trip to enable the fastest possible page render.

Jul 23, 2014: Updated with recommendation to use defer+async to improve performance in older browsers.
Sep 29, 2014: Google+ widgets have been updated to provide async+defer snippets.

Optimizing Webfont Selection and Synthesis

Posted 14 days back at

A "webfont" is a set of resources, not just a single download. A single resource that includes all stylistic variants, which we may not need, plus all the characters, which may go unused, would simply be too large — tens of megabytes for a font with good unicode and stylistic coverage!

As a result, the CSS @font-face rule allows us define a font family that is composed of many individual resources: regular font weight plus multiple bold and oblique variants, each of which can be responsible for a particular unicode range. In turn, each of these variants is backed by a separate font resource, which enables us to split and minimize the number of bytes required to get the critical text pixels on the screen.

@font-face {
  font-family: 'Awesome Font';
  font-style: normal;
  font-weight: 400; /* regular */
  src: url('/fonts/awesome-normal.woff2') format('woff2');
  unicode-range: U+0370-03FF; /* greek */

@font-face {
  font-family: 'Awesome Font';
  font-style: bold;
  font-weight: 700; /* bold */
  src: url('/fonts/awesome-bold.woff2') format('woff2');
  unicode-range: U+0370-03FF; /* greek */

body { font-family: 'Awesome Font' }
.title { font-weight: 600 }
.subtitle { font-style: italic }

The above example declares the "Awesome Font" family that is composed of two resources that cover the same set of Greek characters (U+0370-03FF) but offer two different "weights": normal (400), and bold (700). However, what happens if one of our CSS rules specifies a different font weight, or sets the font-style property to italic?

Font selection algorithm

When a weight is specified for which no face exists, a face with a nearby weight is used. In general, bold weights map to faces with heavier weights and light weights map to faces with lighter weights (see the font matching section below for a precise definition).

The @font-face rule is specifically designed to provide the browser with a flexible selection mechanism: if an exact stylistic font match is not available the browser will substitute the closest match, and if none is available, it will synthesize its own font variant. For the curious, the font matching algorithm gets into the gritty details. For example, the selection rules for font-weight are:

  • If the desired weight is less than 400, weights below the desired weight are checked in descending order followed by weights above the desired weight in ascending order until a match is found.
  • If the desired weight is greater than 500, weights above the desired weight are checked in ascending order followed by weights below the desired weight in descending order until a match is found.
  • If the desired weight is 400, 500 is checked first and then the rule for desired weights less than 400 is used.
  • If the desired weight is 500, 400 is checked first and then the rule for desired weights less than 400 is used.

As a result, our .title selector in above example would end up rendering text with the 700 variant in place of the requested 600 weight. However, what about the .subtitle, which needs an oblique font?

Font synthesis

If no suitable match is found, the browser will attempt to synthesize the necessary font variant. The details of how to generate the bold and oblique variants are not specified, hence the results will vary from browser to browser, and will also be highly dependent on the font. As a hands-on example, consider what happens with Open Sans:

Google fonts provides ten different variants, each at ~15KB for the latin character set, or ~150KB in total. Alternatively, the rightmost column provides a preview of the synthesized variants generated from the same (400 weight) variant. As you can tell, there is a noticeable difference, plus a number of additional limitations. The browser:

  • Can't make a bold font lighter.
  • Can't make an oblique font "straight".
  • Has limited ability to synthesize bold(er) fonts.
  • Has limited ability to synthesize oblique fonts and may produce wrong shapes - e.g. inspect the lowercase "f" in the synthesized vs. actual columns.

Typography geeks will provide a much longer list of why the synthesized version is inferior, but it's not just the looks:

Authors should also be aware that synthesized approaches may not be suitable for scripts like Cyrillic, where italic forms are very different in shape. It is always better to use an actual italic font rather than rely on a synthetic version.

The browser is simply applying some geometric transformations to the letter shapes and hoping for the best. Often, the results are pretty good, but we do need to be careful about how and where font synthesis is used. On that note, it is worth highlighting that the CSS specification does provide the font-synthesis property to control font synthesis behavior. However, while all browsers support font synthesis, to date only Firefox has implemented the CSS API:

.selector {
  font-synthesis: none;           /* disable synthesis */
  font-synthesis: weight;        /* bold synthesis only */
  font-synthesis: style;           /* italic synthesis only */
  font-synthesis: weight style; /* bold + italic synthesis (default) */

Font synthesis as performance optimization?

The obvious question to ask is whether we can rely on font synthesis in lieu of an extra resource download? After all, each font resource adds extra bytes and may delay rendering of page content.

The synthesized version of oblique Roboto in example above may not look as good, but with a little fiddling (e.g. letter-spacing: -0.3px) can nonetheless produce reasonable results while avoiding the extra font request.

As a practical example, this site is using Open Sans with two variants: 400 for regular text and 700 for bold. I've tried eliminating the 700 variant and using the synthesized version in the past, and while it didn't look as good (to me), I didn't hear any complaints from the visitors. In the end, I still reverted to requesting both (personal preference), but continued to use the synthesized oblique fonts, which is what you're reading this very moment - surprise!

Your mileage may vary, but it's always a good exercise to periodically sit down and audit what fonts your site is using and ask whether you actually need all of them. In the best case, you can consolidate your styles to use fewer variants, and where you can't, consider if you can rely on the browser to synthesize some font variants on your behalf!

Extensible Web Resource Loading Manifesto

Posted 14 days back at

A modern web-page is assembled from dozens of assets with different content-types. The requests for these assets are initiated in one of three ways: the document parser discovers a tag with a resource URL (e.g. img[src]), via a dynamic request initiated via Javascript (e.g. XHR, DOM-injected element), or via a CSS declaration.

There are no specified rules for how or why certain resource fetches should be prioritized, or deferred. That said, all browser vendors have implemented some form of heuristic and predictive fetching prioritization to accelerate page loading - e.g. HTML, CSS, and Javascript are considered critical; images and other asset types are less so; some browsers limit number of concurrent image downloads; CSS assets are lazyloaded by default, and so on. Introduction of preload parsers have further codified these content-type heuristics into de-facto fetching rules.

These content-type defaults are reasonable for the majority of cases, but they are insufficient for delivering an extensible and perf-friendly platform. In fact, the lack of app-level control over these fetching decisions is the root cause of the vast majority of resource loading performance problems that developers have to fight with today - delayed fetches, unwanted lazyloading, eager fetching of non-critical resources, and so on.

Extensible resource fetching

To deliver on the extensible web promise, each resource fetch must provide the following features:

  1. The browser may initialize fetch defaults for each resource, just as it does today.
  2. The developer must be able to:
    • Modify default fetch settings of all requests initiated via JS, CSS, and HTML. 1
    • Define the preloader policy for any resource declared in CSS and HTML. 2
    • Define the fetch dispatch policy for any resource declared in CSS and HTML. 3

Every element, function (CSS or JS), or other means of initiating a resource fetch must meet these criteria to provide a consistent, extensible, and a perf-friendly platform. If you are defining a new platform feature that initiates a resource fetch, you should be required to explain how the above is possible; each relevant working group needs to tackle the question of how to effectively retrofit above requirements to existing mechanisms.

  1. We need a consistent and feature-equivalent interface in HTML, CSS, and Javascript. E.g...
  2. It should be possible to extend the preload scanner with custom logic. E.g...
    • opt-in a new custom element, or opt-out any existing HTML element.
    • opt-in a CSS resource for preloader processing, and so on.
  3. Opt-out from CSS lazyloading and browser-specific heuristics that may delay fetch dispatch.

A few broken examples...

  • Image fetching
    • There is no way to modify default image fetch settings
    • Image CSS declarations have no opt-out strategy from lazyloading
    • Image fetch may be delayed in some browsers due to page structure heuristics
  • Font fetching
    • There is no way to modify default font fetch settings
    • Font CSS declarations have no opt-out from CSS lazyloading
    • Font CSS declarations have no opt-in for preloader processing
  • <X> type fetching (e.g. JSON payload)
    • Must use Javascript, missing declarative fetch mechanism
    • Javascript execution requirement precludes preloader processing
    • There is no way modify default XHR/Fetch settings

Some font resources are guaranteed to be used, and lazyloading is a performance bottleneck. Some images can be as critical as scripts and stylesheets. Javascript execution precludes early resource dispatch, which causes further prioritization and performance problems, and so on. This is not an exhaustive list, not even close. Rather, it's an illustration of the kinds of problems that many developers and applications are encountering on a daily basis.

At best, inflexible defaults and heuristics limit developer's ability to improve and adapt the default resource loading strategy to fit their needs. At worst, these defaults backfire and hurt performance of certain pages and applications. One size does not fit all.

Note that the upcoming ServiceWorker API does not address any of the above issues. ServiceWorker can influence how requests are dispatched, but only after it receives the request from the browser - that's already too late.

Laying out the work ahead

A detailed discussion on how to implement all of the above is outside the scope of this post and will require a lot of collective thought and discussion from site developers, spec wranglers, and browser engineers. That said, the high-level requirements and the key pieces of the required infrastructure are roughly as follows:

  • Fetch API attempts to unify and explain resource fetching in the web platform.
    • Fetch needs a mechanism to control transport-layer resource allocation.
    • Fetch settings initialization must be exposed in JS, CSS, and HTML.
    • Fetch API needs to be implemented and exposed to developers.
  • Web platform needs a declarative (preloader friendly) mechanism to match the Javascript Fetch API.
    • rel=preload attempts to address this problem and needs to implemented and exposed to developers.
  • Web platform needs to define an API for interfacing with the preload scanner.
    • Preload scanner should be extensible - e.g. can handle custom elements.
    • Preload scanner should be smart enough to process HTML and CSS.
    • Preload scanner logic needs to be spec'd and consistently implemented.
  • Web platform needs to provide control over resource dispatch policies.
    • CSS needs to provide lazyload opt-out mechanisms - e.g. as parameters to url(), or some such.
    • Browser vendors need to provide an opt-out mechanism from custom heuristic-driven dispatch policies.

With the above in place, we would finally have a solid foundation for an extensible resource loading model. From there, layer on Streams API, Service Worker, and other goodies - the future is bright.

How to Stop Voodoo Programming


Voodoo programming is when we write code that we don’t really understand. We know we shouldn’t do it. But how do we stop?

An example story

We are running tests and making changes to our code when we see a new error.

We decide we’ll upgrade all the Ruby gems (dependencies). Same error. We upgrade to the latest version of Ruby. We check Twitter and email while it installs. Same error.

We ask our teammates for help in a chat room. Most folks are unavailable but someone replies:

“Sorry, I haven’t seen that before.”

Finally, we copy the error and paste it into Google. The first result is a Stack Overflow question. The first answer to the question includes a code block but no explanation or links to reference material.

We exclaim:

That solution is so weird! Totally obscure behavior.

We don’t have time to dig deeper, though. So, we copy and paste the code into the appropriate place in our codebase. The error disappears and the tests pass.

Pleased, we check the code into version control, open a GitHub pull request, and go to lunch.

We get back from lunch expecting to see a “Looks good to me.” comment from a teammate congratulating us for solving a difficult problem but instead they commented:

Why does this solution work?

They smelled something funny and caught us. They’re worried about the code continuing to work and how those working on the codebase in the future will know how to maintain it.

Without fully understanding the problem, we can’t hope to solve it. We also wasted our teammates' time in the chat room and code review, and our clients' time by not using tracer bullets.

How do we stop voodoo programming?

Write a failing test

We have a great start: a failing (or erroring) test.

Read the backtrace

The first step is to read the backtrace of the failing test. We can open the file and go to the line number that threw the error. We may immediately see the problem.

Explore the source code of our dependencies

If not, we may be able to use our text editor’s “go to file” or “go to definition” features.

Working with Ruby, JavaScript, CSS, and similar tools, we have access to the source of languages, frameworks, and libraries upon which we’ve written our application code. There’s no good reason not to explore the source.

In Vim’s normal mode, we could use the gf command, the 2gf command (jumps to bundled gems in Ruby files), the :find command (which is assisted by tab completion), the :tag method_name command to jump to method definition, or grep really fast.

:help gf
:help :find
:help 'tags'

Use debugging statements

When we’re in the source code at a line from the backtrace, we often need more context about our program’s data at runtime.

In those cases, we put debugging statements in our code and in the code of those underlying languages, frameworks, and libraries. We most commonly use debugger for JavaScript, byebug for Ruby, and byebug and pry-rails for Rails.

This lets us stop the program and examine or change the data to better understand control flow through the code.

Read the manual

Running our tests with debugging statements at various levels of our code is likely going to provide us with the understanding we need.

As we’re getting closer to solving the problem, we may want to also Read The Manual of a language, framework, or library. The documentation may give us explanations and examples of their appropriate use by client code.

Write good commit messages

Without a good commit message, our teammates reviewing our pull requests may not understand the change. Future maintainers of the codebase may not understand the reason for the code, adding further time and cost to the work to be done.

Our reading of the manual may provide us with URLs to documentation which we can reference in the commit message.

Contribute patches to open source

On occasion, we will find a bug not in our application code, but in one of the languages, frameworks, or libraries upon which we depend.

At a minimum, we can report an issue to the project with our tests and backtraces.

Even better, we may have been able to fix the bug while we were placing debugging statements in the source code. In that case, we may be able to contribute a full patch to the project, including tests and a good commit message with links to further documentation.


We can stop voodoo programming by writing failing tests, reading the backtrace, exploring the source code of our dependencies, using debugging statements, reading the manual, writing good commit messages, and contributing to open source.

What’s next?

If you enjoyed this post, read our Back to Basics series.

Writing Vim Syntax Plugins


One of Vim’s most powerful features is the vast built-in and available language support. With the rise of new languages and the desire to write them in Vim with proper syntax highlighting, it’s useful to understand how to create our own syntax plugins for Vim in order to write, extend, or contribute to open source syntax plugins.

Syntax Plugin Structure

Syntax plugins, like all Vim plugins, follow a basic folder structure. This file structure became mainstream with the release of pathogen and works with all modern Vim plugin managers such as Vundle and vim-plug. I’m going to be using swift.vim as an example of how syntax plugins fit together. Here is the relevant file structure (using tree):

├── ftdetect
│   └── swift.vim
├── ftplugin
│   └── swift.vim
├── syntax
│   └── swift.vim
└── indent
    └── swift.vim


The first configuration file we encounter is in the ftdetect folder (:help ftdetect). In this folder we have one or more files named after the filetype they apply to. In this example we have a swift.vim file. In ftdetect files we decide whether or not we want to set (or override) the filetype of a file in different situations. For swift.vim this file looks like this:

" ftdetect/swift.vim
autocmd BufNewFile,BufRead *.swift setfiletype swift

Here we’re using setfiletype on any files named *.swift whenever we create a new file or finish reading an existing file. setfiletype is a friendlier way to do set filetype=swift. If the filetype has already been set setfiletype will not override it. Depending on what our syntax plugin does that might not always be what we want.


Next we have the ftplugin folder (:help ftplugin). This lets us execute arbitrary Vimscript or set other settings whenever the file is sourced. Files in Vim’s runtimepath (:help rtp) nested in ftplugin folders are sourced automatically when the filename (before .vim) matches the current filetype setting (we can read this setting with :set ft). So after our ftdetect file sets the file type as swift, Vim will source the swift.vim file nested in the ftplugin folder. In swift.vim that file looks like this:

" ftplugin/swift.vim
setlocal commentstring=//\ %s
" @-@ adds the literal @ to iskeyword for @IBAction and similar
setlocal iskeyword+=?,!,@-@,#
setlocal tabstop=2
setlocal softtabstop=2
setlocal shiftwidth=2
setlocal completefunc=syntaxcomplete#Complete

All of these settings are things that apply to how we want Vim to act when editing Swift files. The most interesting here is when we set iskeyword. Here we’re adding a few symbols so that when we execute commands such as diw those symbols will be treated as just another character in the word instead of treating them as delimiters breaking words.


Now we get to the meat of syntax plugins. Files nested under the syntax folder named after the current filetype setting are sourced in order to actually highlight the file. Syntax files contain rules in order to match patterns in our file. Here are a few examples:

" syntax/swift.vim
" Match TODO comments
syntax keyword swiftTodos TODO XXX FIXME NOTE

" Match language specific keywords
syntax keyword swiftKeywords
      \ if
      \ let
      \ nil
      \ var

Both of these examples define new syntax matches for keywords. Keyword matches should be used for the language’s reserved words that should always be highlighted as such. Matches can also get more complicated:

" syntax/swift.vim
" Match all Swift number types
syntax match swiftNumber "\v<\d+>"
syntax match swiftNumber "\v<\d+\.\d+>"
syntax match swiftNumber "\v<\d*\.?\d+([Ee]-?)?\d+>"
syntax match swiftNumber "\v<0x\x+([Pp]-?)?\x+>"
syntax match swiftNumber "\v<0b[01]+>"
syntax match swiftNumber "\v<0o\o+>"

Here we’ve defined syntax matches with the match argument allowing us to match regular expressions (see :help magic for what’s going on here). These specific regular expressions match all of Swift’s number types. Some examples of matched numbers with these:


Syntax matching can get even more complicated when we start matching function declarations and other statements. For example:

" syntax/swift.vim
" Match strings
syntax region swiftString start=/"/ skip=/\\"/ end=/"/ oneline contains=swiftInterpolatedWrapper
syntax region swiftInterpolatedWrapper start="\v\\\(\s*" end="\v\s*\)" contained containedin=swiftString contains=swiftInterpolatedString
syntax match swiftInterpolatedString "\v\w+(\(\))?" contained containedin=swiftInterpolatedWrapper

Here we’re matching Swift strings that are wrapped in double quotes. In the first line we start out by matching the swiftString region which defines start, end, and optional skip patterns. Here the skip pattern allows us to have escaped double quotes in our strings. We also pass some other arguments here. oneline specifies that this pattern can only span a single line. contains specifies what other syntax groups we’ve defined can exist within our region. In this case a string can contain interpolated arguments similar to Ruby. In Swift interpolated strings look like this:

"2 + 2 = \(2 + 2)"

The second line shows how we define the region enclosed within the interpolated string. Our start region specifies the starting \( pattern along with any number of optional spaces while our end region matches spaces and the closing ). We specify that this is contained within our strings so it is not matched elsewhere and that it contains our third syntax group. In our final group we specify what can be contained within the interpolated portion of our string.

When writing syntax files for entire languages, we’ll have to create matches using each of these types of groups to match different things. The hardest part of this is dealing with collisions. We have to be very aware that what we’re changing won’t break another portion of our highlighting.

Once we have our matches we need to tell Vim how to highlight them in a way it understands. Here are the highlighting associations for swift.vim:

" syntax/swift.vim
" Set highlights
highlight default link swiftTodos Todo
highlight default link swiftShebang Comment
highlight default link swiftComment Comment
highlight default link swiftMarker Comment

highlight default link swiftString String
highlight default link swiftInterpolatedWrapper Delimiter
highlight default link swiftNumber Number
highlight default link swiftBoolean Boolean

highlight default link swiftOperator Operator
highlight default link swiftKeywords Keyword
highlight default link swiftAttributes PreProc
highlight default link swiftStructure Structure
highlight default link swiftType Type
highlight default link swiftImports Include
highlight default link swiftPreprocessor PreProc

Here we are linking one of our match group names to a Vim-specific keyword based on what our group matches. We can see all available matches under :help group-name where they are highlighted based off our current color scheme. While it may be tempting to link a group to a type based on its appearance in our color scheme, it’s more important to maintain the match’s semantic meaning so they will work as expected in other user’s color schemes.


Finally we have the indent folder. Again, it contains a file named after the filetype for how indentation should work as we’re writing in Vim. Personally I think handling indentation is the hardest part of writing syntax plugins. The goal in these files is to do one specific thing. Set our indentexpr (:help inde) to a function that calculates and returns the level of indent for the current line. While that may sound simple, taking the context of that line into account can be extremely difficult. Also, just like with syntax files, we have to worry about collisions. Let’s look at some of this file for swift.vim.

" indent/swift.vim
setlocal indentexpr=SwiftIndent()

function! SwiftIndent()
  let line = getline(v:lnum)
  let previousNum = prevnonblank(v:lnum - 1)
  let previous = getline(previousNum)

  if previous =~ "{" && previous !~ "}" && line !~ "}" && line !~ ":$"
    return indent(previousNum) + &tabstop


Here we’re setting the indentexpr to our function. Then inside our function we’re getting some context around the line we’re trying to indent. The first thing we notice is the v:lnum variable (:help lnum-variable). This is a predefined variable (:help vim-variable) that is set automatically inside indent functions. It gives us the line number for the current line to be indented. We can then use it along with some other Vimscript to get the text on that line along with the text on the last non-blank line above it. Using this information, along with some pattern matching, we can build out cases where our lines should be indented more, less, or the same as the surrounding lines.

Wrapping up

Combining these four Vim plugin file types you can contribute to syntax plugins you have found bugs in, create syntax plugins for additions to languages or entire languages. Doing so you can help Vim continue to stay awesome as new programming languages come and go.

What’s next?

Episode #500 - October 3rd, 2014

Posted 17 days back at Ruby5, rails-disco, onboarding your junior devs, being a better Rubyist, staging environments, and the anti-pattern of absolutes all in this episode of the Ruby5.

Listen to this episode on Ruby5

Sponsored by New Relic

New Relic is _the_ all-in-one web performance analytics product. It lets you manage and monitor web application performance, from the browser down to the line of code. With Real User Monitoring, New Relic users can see browser response times by geographical location of the user, or by browser type.
This episode is sponsored by New Relic

Have a Chromecast? Want to easily throw custom content up on it and control it programatically? Then check out this new project from Groupon!

Onboarding Junior Developers

Brittany Martin from NineFold wrote up her thoughts about onboarding new developers and Kate Heddleston's talk on the subject from GoGaRuCo.
Onboarding Junior Developers

10 Skills to Make You a Better Rubyist

The folks at AmberBit published a blog post about some of the key skills that will make you a better Rubyist!
10 Skills to Make You a Better Rubyist

Adding a Staging Environment to Rails

You'd be surprised how many shops don't keep proper staging environments. If you are concerned about yours or don't even have one, check out this blog post from Maxime Garcia.
Adding a Staging Environment to Rails

Absolutes as an AntiPattern

The only absolute is that there are no absolutes. This blog post from Brandon Hilkert talks about using absolutes when we talk about software and why you shouldn't.
Absolutes as an AntiPattern

Baseimage-docker 0.9.15 released

Posted 17 days back at Phusion Corporate Blog

Baseimage-docker is a special Docker image that is configured for correct use within Docker containers. It is Ubuntu, plus modifications for Docker-friendliness. You can use it as a base for your own Docker images. Learn more at the Github repository and the website, which explain in detail what the problems are with the stock Ubuntu base image, and why you should use baseimage-docker.

Changes in this release

  • Fixed the setuid bit on /usr/bin/sudo. This problem was caused by Docker bug #6828.

Using baseimage-docker

Please learn more at the README.

The post Baseimage-docker 0.9.15 released appeared first on Phusion Corporate Blog.

Upcoming Events for October, 2014


Each month we organize and attend events at our offices, and around the world. Here is a summary of what’s happening this month. We often have thoughtbot stickers, and sometimes shirts. We’d love to see you at some of the events, don’t be shy.


Austin Startup Week. October 6th - 10th.
We’ll be participating in events throughout the week.

Keep Ruby Weird. Friday, October 24th.
We’re co-organizing this new Austin Ruby conference. We’d love to see you there.

Railsbridge: Women’s Ruby and Ruby On Rails Workshop. October 25th & 26th.
Desi McAdam will be leading this.

Austin on Rails. Tuesday, October 28th.
We’ll see you there!

We’re also at Cafe Bedouins most Tuesdays.


Boston Ruby Group Project Night. Tuesday, October 7th.
Hosted at our office.

Boston Ruby Group Presentation Night. Tuesday, October 14th.
Folks from thoughtbot will be attending.

NSCoder Night. Thursday, October 16th.
Hosted at our office.

Boston Ruby Women Monthly Meetup. Thursday, October 23rd.
Folks from thoughtbot will be there, and we also organize and sponsor this event.


Creative Mornings. Friday, October 3rd.
Several of us will be attending.

Denver.rb. Tuesday, October 14th.
Our own Anthony Navarre will be speaking about HATEOAS - a technique of REST that helps push logic away from your clients.

New York City

SassConf. October 2nd & 3rd.
Designers from thoughtbot NYC and Philly will be there with brand-new Bourbon shirts and stickers. Say “hello” and grab one.

Brooklyn Beta. October 8th & 9th.
Designers from thoughtbot in NYC, Boston, and Denver will be attending.


Refresh Philly. Thursday, October 9th.
We organize and sponsor this event, so we’ll see you there!

Philly.rb. Tuesday, October 14th.
Our own Sean Doyle will be presenting.

CocoaLove. October 24th - 26th.
Gordon will be attending this new iOS/Mac conference.

Build Guild Philly. Wednesday, October 29th.
We host and organize this monthly event at a local watering hole.


Triangle Ruby Brigade. Tuesday, October 14th.
We’ll be attending this event. We also recently launched a redesign of the Triangle.rb website.

Stacked. Tuesday, October 21st.
Our own Christian Reuter will be speaking about Bourbon.

Triangle Ember. Tuesday, October 21st.
This is the first meetup of the Triangle Ember group, which we organize and host.

We’re running a free UI/UX course, with American Underground. Tuesday, October 28th.

San Francisco

TradeCraft. Friday, October 3rd.
Carolann Bonner will be giving a talk to UX students on how to enhance a user experience through incorporating Gestalt principles.

Taco Tuesdays. Tuesday, October 21st.
A product design talk series, hosted at our office.

VimSF. Thursday, October 23rd.
A Vim user meetup, hosted at our office.


Stockholm Ruby: Build a Web app in 2 hours: free Rails intro course. Wednesday, October 1st.
We help organize and run these quarterly installfests and intro workshops.

Ruby Project Night Monday, October 6th.
We co-organize and sponsor this monthly event. We hope to see you there.

CocoaHeads Stockholm. Monday, October 6th.
See you there!

Stockholm Vim: Scientific Python in Vim and A Comparison of Contemporary Editors.. Monday, October 13th.
We co-organize and host the exciting event. We hope to see you there.

NSCoder Night. Tuesday, October 14th.
We organize and host this event.

Front End Hack Night. Monday, October 20th.
We organize and host this event in our office.

Mob programming Monday, October 27th.
We help organize this, and at least one of us will be here.


We’ll also be attending

Did we miss something? Is there something great we should attend? Let us know.

Design Sprint Case Study: Merck Development Portal


We recently completed a four-day product design sprint with Merck. They are actively building their development team and looking for a central place to share Merck specific development knowledge and tools. There is a potential for aspects like APIs, code snippets, and general procedural knowledge to be shared, thereby saving the company developer time and money. Their hope is to encourage developers to more readily share their knowledge and code across teams and contents.

Day One – Understand, Diverge

Since the team had only a general idea of what to build, we ran a number of activities to better understand what problems this project should solve. We played Who, What, Where, When, 5 Whys and completed a card sorting exercise. We ran these activities to refine the goal for the project and what would make it successful.

Card Sorting

These games nurtured a healthy debate on whether the application’s primary job to be done was to share code and knowledge among developers or if the priority was to show how much money is being saved with code reuse for managers. The assumption was that if managers have buy in, they would force developers to use the application. If we got developers interested, the application would encourage education and reuse of code. We would be able to calculate financial savings later.

We decided that the primary job was to spread knowledge and code, and that quantifying time and money savings would eventually grow from this primary goal. We recorded the conflict so we could return to it later, if necessary.

After starting mind maps, crazy eights and storyboards, the debate resurfaced regarding the primary problem we were trying to solve. Because it was later in the day, I recorded a new challenge statement to address the contrarian thought and suggested that everyone break early to contemplate the issue overnight, individually. My intention for taking a step back from the project was that it would allow the team to return refreshed the following day.


Day Two – Diverge, Converge

The following morning, we dove back in. We briefly discussed our two problem statements, and how this major conflict and assumptions would be a highlight of our user interviews and problem / solution tests. Together we converged on a third and final challenge statement, which was both shorter and simpler than the first two.

Make it easy to share development work and knowledge at Merck to spread knowledge, save developer time and resources and quantify that savings.

We reflected on the critical path that we were following with the first statement and analyzed the similarities and potential differences that this new challenge statement introduced. We examined the storyboards that we completed the prior day and found that much of our previous work would carry over nicely to our new statement.


Despite our progress, there was one thing that I felt was holding us back. It wasn’t clear to me what Merck wanted developers to share, and what form this would take and whether developers would want to share it. We completed another card sorting exercise to organize the team’s thoughts about the types of posts and content they sought. We organized them based on what we felt was most important.

Now with a clear goal and a better understanding, we completed several more rounds of mind maps, crazy 8s and storyboards. With each round we slowly reached a consensus on what the user flow would be for each of the steps in our critical path. At the end of each round, we reflected back to our challenge statement to make sure that we were still on track.

Because we had completed so many group exercises, developing the layout of each page went relatively smoothly and quickly. We wireframed each screen on a whiteboard based on the sketching we had completed in previous storyboards. We then listed the assumptions we had been making throughout the two days and drew conclusions regarding which aspects we should try to validate with some user research before the user interviews and problem / solution tests.


Day 3 – Prototype

I spent the entire day creating a prototype using Sketch and Invision. I was easily able to recreate our whiteboard wireframes in Sketch and place the exported files into a folder to be synced. The primary issue that I had with Invision was the lack of support for any kind of form input. It was also time-consuming to design states for buttons and form inputs for each wireframe.


Day 4 – Test

User interview and testing proved very useful. We both validated and invalidated a majority of our assumptions about developers at Merck and what our application would accomplish. We learned that developers were already sharing a lot of code and knowledge within Merck through their personal network. The longer that the developer had been with Merck, the more they had shared, and the more knowledgeable they were about who they considered experts in particular fields.

Outside of the company, however, most developers were not participating in online discussions, blogs or forums. This was one of our key indicators on whether they might be willing to contribute within the company. The user interviews and problem / solution tests did identify some problems and invalidated some assumptions that we had. Users weren’t clear on the types of posts that they could make, and the barrier for entry was a lot higher than we previously thought.

Design Sprint Wrap-up

In the end, we didn’t feel strongly enough that the solution we created in the prototype would be worth the time to build as it was. By testing out the quick prototype we saved Merck several weeks of development in the wrong direction and in the end potentially saved the product. After some discussion, we agreed to do another shortened sprint to tackle the problems we found with the last prototype. This way we could continue to refine some of the ideas that came out of our diverge sessions on day one and two with the new found information from testing.