Bypassing server cache when digests don't match

Published: August 7, 2023.

I'm building an application that keeps track of a file that is hosted somewhere on a public web server (which is out of my control).

The web server serves the file that I'm interested in, which we'll call cool-file.txt, along with an MD5 digest of the file called cool-file.txt.md5. This MD5 digest is like a small 'fingerprint' of the actual file. cool-file.txt could be many megabytes in size, but the fingerprint is only a few bytes.

MD5 digests like this are usually used for verifying the integrity of downloaded files: After downloading the file, you can compute the digest yourself, and check if it matches the digest from the server.

Another reason for using the digest is to save on resources. If you want to know if cool-file.txt has changed, you can simply retrieve the (small) digest, and see if that changed. This way you don't have to redownload the (potentially huge) file every time you want to check if it changed.

The problem

However, I noticed that when retrieving the file and computing the MD5 digest myself, it did not match the digest from the server!

This is problematic because of two reasons:

My application's integrity validation fails because my digest does not match the server's digest.
When cool-file.txt changes, my application wouldn't notice it because it will keep receiving the old digest.

The cause

It turned out that the file was recently updated, but I did not yet receive the new MD5 digest because the old digest was cached at the web server's end. That is to say: the web server receives a request for the digest, and as an optimization, it doesn't read the updated digest from its file system (🛎️), but it returns the old digest that it still memorized!

Fix attempt #1 ❌

Unfortunately, trying to bypass the cache with a "Cache-Control: no-cache" header did not work, because the server did not respect that header.

Ideally, when a server receives a request with that header included, it should disable the cache and give the latest result. But a server can choose to ignore it.

Fix attempt #2 ✅

I tried adding a query component to the URL. This component is ignored by most (if not all) web servers when serving static files. So, instead of

https://example.org/cool-file.txt.md5

I used

https://example.org/cool-file.txt.md5?hello

and it worked! Apparently, the cache in the web server works with some sort of lookup table that allows it to look up the memorized responses for certain URLs (including the query component).

Of course, for this fix to work consistently, the part after the ? should be different every time, because otherwise the web server can just return another outdated digest that it memorized for the https://example.org/cool-file.txt.md5?hello request.

For this reason, my application now just appends ? and a UUID to the URL of the digest (and the file itself as well!).

Discussion

This doesn't feel like a great fix. With every request I do now, this will probably create an entry in the cache in the web server which consumes memory on their end.

But hey, the cache at the web server is out of my control, so I had to make do with it somehow!

What do you think? Should I contact the administrator of the web server to better invalidate their caches?