An Optimisation Challenge - Going fast with machine learning, image recognition, and web performance

I’m going to take you on a journey of optimization to show you how to optimize a web crawler, image recognition, and machine learning model for speed. For this we need a hypothetical problem.

The challenge Link to heading

There’s a website where tickets to see your favourite artist are released daily at 3 pm. But every time you try, you are too slow. You need to be the first; how fast could you optimize your process if you fully automated it?


You refresh a web page at precisely 3 pm, click through a page to select the date, another page to choose your seat at the concert, and finally fill in a captcha to submit your booking. Each page must be visited, you cannot skip to the end of this process.

Analysis Link to heading

We need to understand where to speed up to optimize our chances of getting a ticket.

  • Network Latency - Let’s say the server is hosted in the US, and we’re in London. Time to make a round trip to the server: 100ms x 4 requests

  • Page load - Time to fully load a single page in Chrome (dom ready): 80ms x 3 pages (the last page loading isn’t time sensitive as we’ve already submitted the captcha)

  • Protocols - 3-way TCP handshake, SSL, and DNS lookup: 350ms.

  • Entering captcha - Manually typing 8 characters: 2,500ms

  • Submitting each page - Clicking buttons on a page: 1000ms per page.

  • Total time ~6,500ms

The initial optimization - the low-hanging fruit. Link to heading

The best way to improve our chances is to focus on the largest area of time, which would take the least effort to reduce. For us, that is the human interaction part of clicking the buttons on a web page.

Submitting the form - From 3000ms to 60ms Link to heading

Optimizing clicking through the web pages is an easy place to start. We can automate moving through the pages with minimal scripting using an extension like Greasemonkey. It can reduce each page interaction from 1000ms to 20ms. There is an additional - but hard-to-measure - benefit of scripting, which is refreshing the page at exactly the right time the appointment is released instead of manually trying to hit the refresh button at exactly 3 pm (well, 3 pm minus half the round trip latency of 50ms)

Protocols DNS/SSL/TCP 3-way Handshake - 350ms to 0ms Link to heading

Although protocols are a tiny proportion of the overall time at around 350ms, reducing their impact is straightforward. In the browser, we can cache the DNS. We can also ensure the connection is open long enough to prevent repeating the 3-way handshake and ssl process. It stays open for about 60 seconds, so making a request 30 seconds before helps ensure the connection remains open, eliminating 350ms of potential latency.

Initial optimization run. Link to heading

We’ve saved 3,290ms - or three seconds - suitable for general optimization, but we can go much faster.


Second attempt - Decoding the image Link to heading


The ticket booking process requires filling in a captcha to secure your ticket. This should be the next focus for optimization as it takes the next most significant amount of time.


Above is the captcha - an 8-character randomly generated string of alphanumeric characters placed on a static background. There is only one way to decode a captcha rapidly and consistently: Teach a machine to do it for you.

captcha process Here’s the process we’ll follow to do that: Take the captcha, split apart the letters from the background, then convert the image into individual letters to send to a neural net to make a prediction.

Training data Link to heading

To train a machine learning model on this captcha, you’ll need around 50,000 labelled images. Whilst you can get many captchas from the ticket website, the process won’t scale because every image needs to be labelled, so renaming the file from download.png to the characters it contains, e.g. 8ah3muxe.png. To manually label it would take nearly three days of continuous typing. But we can generate the data more rapidly than that and less monotonously.

To generate a captcha, you need a pristine copy of the background to combine the background image with the letters on top. A pristine background is never available, but you can create one with around ten real samples, some image averaging, and tweaking in Photoshop.


To be able to generate the image consistently, you need to be able to reverse engineer the logic behind the letters.

The Captcha:

  • The image is 200 pixels wide, with the first character starting 20px from the left.
  • There are eight letters, with a basic alphanumeric character set (0-9, a-z), spaced approximately 20 pixels apart. The font used is Verdana at 54pt, with an averaged grey colour (#AFAFAF in hex).
  • Individual letters are as wide as 28px, meaning they can overlap. They can also be rotated from 0 to 45 degrees clockwise from the baseline.

Using the logic above combined with PIL, we can quickly generate a convincing match:

Original Generated
 8ah3muxe  8dqsfhmn

Now we can generate all the training data required.

Computer Vision Link to heading


To decode a captcha, the computer must understand the individual letters rather than all eight letters simultaneously, which means splitting the image into individual letters.


To do that, you can load up the image in OpenCV, subtract the background using the pristine sample we created earlier and remove all the colour.


Subtracting the background is a good start, but it leaves us with artifacts everywhere.


Using OpenCV’s morph, threshold, and contour functionality, we can clean it up further. (Note: You might notice the “a” and “h” have a line through them; the background subtraction process causes this. But it is fine as the model can adapt to it given enough data.)


The final step is cropping each letter to end up with a single binary image that can be used to train the computer. Each of those 8 individual letters will be sent to the predictive machine learning model.

For the number “8” above, parts of the following letter “a” are included in the output, which is fine as given enough data, the model learns it is irrelevant.

Now we have 100,000s of individual letters - it is time to train the model.

Training process Link to heading

For training, a Convolutional Neural Network (CNN) is perfect for letter recognition (click here for a visualization of how CNNs work as applied to character recognition: Splitting the data into two parts, 75% for training and 25% for validation, helps us quickly and accurately validate our training process using the loss function.

You will also need some real validation data, as the validation images we generated in training do not perfectly match the actual ones. We’ll need 100 manually labelled actual samples from the website and use that as the source of truth for final validation.

The first time you train it, you will likely get low accuracies verifying on the 100 labelled real captchas, but if it’s above 10% you’re on the right path. To increase, you should focus on ironing out flaws in the OpenCV pipeline so the generated validation images are closer to the authentic captchas.

Once done, you should focus on improving the neural network itself - It has to have 36 neuron outputs, each representing the 36 possible characters you will come across. For character decoding, a combination of 9 layers, three of which are convolution layers, results in around 80% accuracy. Switching the gradient descent algorithm to Adam and it hits 90%, which is more than enough for this challenge.

Decoder web service Link to heading

We must then hook that up to a simple websocket python web service, as it allows the browser to download the image, and send it across to the decoder web service. The decoder’s job is to tie together the OpenCV pipeline and neural network to get a prediction to send back to the browser


Now the 2500ms will drop down to around 250ms, a 10x increase, but there is much more we can do.


Final optimization Link to heading

Let us look what’s left to optimize:

Captcha time

Web Crawler - 300ms to 2ms Link to heading

Chrome (page load & submitting the page) was a small proportion of the time taken at the start, but now it is a much more significant part. We’ve reached the local performance maxima of the browser. As the webpage is just a bunch of forms with no JavaScript, we don’t need the browser at all - we just need a web crawler.

Go is an excellent language to write a web crawler in for a few reasons: It has a robust HTTP implementation, it compiles directly to machine code without an interpreter making it incredibly fast, and the simple memory management is ideal for these types of use cases where you can trade a bit of additional memory for increased speed without having to think much about optimization, i.e. it’s probably faster than a comparative implementation in Rust.

A web crawler implementation is over 100x faster than a browser from 80ms to 500ns per page. This is because a browser must transform the text received into a dom representation, apply browser styling, render that on a web page, and then initialize scripts - none of which a crawler has to do.

With a crawler, we also have complete control what is received and sent. So if the vital data is within the first packet downloaded from a website, you don’t have to process the rest. You can also send back only what is required, ideally keeping it well under the standard MTU size, fitting it nicely into a single TCP packet.

Image decoding - 250ms to 12.5ms Link to heading

Turning our attention back to the image decoding - The image decomposition in OpenCV only takes around 10ms for processing, the rest is the inference process.

We can increase model performance up to 100x with often the same level of accuracy by using quantization. Quantization is the process of converting the model to use a lower bit representation for weights and biases. e.g. from 32bit to 8bit. Doing this improves the inference process so it only takes 2.5ms to process eight characters, down from 250ms.

Network latency - 400ms to 4ms Link to heading

Finally, the last hurdle is network latency. Connection to London from the US, where the server is located, is around 6000km (4000 miles) which over four requests adds 400ms.

If we move everything to the same data centre as the server, which for our example, is US EAST-2 (Ohio), we can drop network latency to around 1ms or less.

Greasemonkey to optimised

Final Link to heading

Putting it altogether this is the overall architecture and flow:


And this is how fast it is:


16ms of compute, 4ms of network latency - Quick enough to get our tickets