Updated: 4th January 2012

HTTP Caching Developers Guide.

Why Caches Exist

Web servers store and serve resources. Many of these resources rarely change, for example images, CSS files and JavaScript files. Web clients (typically browsers) will very often request the same resource from a web server again and again. An example of this might be the images in a navigation bar or a logo in a header. Every page on a website would typically have these same images, so every time you click a link and view a different page on the website your browser requests these exact same images again. This is wasteful in a number of ways. Caches help reduce this waste.

Caches store local copies of resources such as images, JavaScript and HTML files. This way when the same resource is requested again and again there is the option available to use the local copy instead of getting it from the web server every time. This lowers page response times greatly, reduces bandwidth consumption and reduces load on the destination web server.

Types of Caches

There are basically two types of caches: browser caches and proxy caches. A browser cache is the simplest kind. It is built into most web browsers and keeps copies of resources on your own machine. It is considered to be a private cache because it is only accessible by one client, your browser. Unless you are working on a shared computer, such as one in an internet cafe, you will be the only user with access to the cache.

Proxy servers are HTTP applications which sit somewhere in between the client (typically a browser) and the destination web server. When browsing the web the HTTP requests sent by your browser will frequently be directed through proxy servers on the way to the destination server. The HTTP responses on the way back from the destination server will go back through the same proxy servers. The mechanics of this are beyond the scope of this article. Suffice to say that most requests and responses go through one or more proxy servers on their journey between your browser and the web server. There are many different types of proxy servers but for this article we only care about the caching kind, proxy caches. Like browser caches, proxy caches store copies of resources. However, unlike browser caches, proxy caches are said to be public. That is, the resources they store are accessible by many different clients. Proxy caches are typically deployed by ISPs and web hosting companies as well as Content Delivery Networks (CDNs). They are very efficient because they build up a large pool of cached resources over time and many users can benefit from them.

How HTTP Caching works

Freshness

Resources change over time. An HTML file for example might be edited on a monthly basis. Caches need to know how long they can keep on serving their copies of resources (known as resource instances) before they are considered stale. The two HTTP headers which web servers use to specify expiration dates are the Expires and Cache-Control headers.

Both of these headers tell a cache when the resource expires. Expires works by specifying an absolute date/time value, for example Expires: Sat, 17 Dec 2011 19:10:22 GMT. In this case the resource instance expires at 19:10:22 GMT on 17 December 2011. The Cache-Control header specifies a relative time in seconds, for example Cache-Control: max-age=600. In this case the resource instance is fresh for the next 10 minutes.

Cache-Control was introduced in HTTP/1.1 and is considered to be more reliable because it uses relative time instead of a fixed time. Fixed times require precise synchronisation between clocks on the client and server. Relative times do not. But both fulfil the same purpose, to specify how long a resource can be considered to be fresh. When a response carries both, an HTTP/1.1 cache uses max-age in preference to Expires.
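To make the freshness rules above a little more concrete, here is a minimal Python sketch of how a cache might work out how long a response stays fresh. The function names freshness_lifetime and is_fresh are my own, and passing the headers around as a plain dict is an assumption for illustration. A real cache also has to consider the Age and Date headers, which this ignores.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
import re


def freshness_lifetime(headers, response_time):
    """Return how long a response may be served without revalidation.

    `headers` is a dict of response header names to values and
    `response_time` is the timezone-aware time the cache received it.
    Returns a timedelta, or None if the server gave no expiry information.
    """
    match = re.search(r"\bmax-age=(\d+)", headers.get("Cache-Control", ""))
    if match:
        # Relative lifetime: no comparison with the server's clock is needed.
        return timedelta(seconds=int(match.group(1)))

    expires = headers.get("Expires")
    if expires is not None:
        try:
            # Absolute expiry: only accurate if both clocks roughly agree.
            return parsedate_to_datetime(expires) - response_time
        except (TypeError, ValueError):
            # An unparseable Expires value is treated as already expired.
            return timedelta(0)
    return None


def is_fresh(headers, response_time, now):
    """True if the stored response can still be served as-is at `now`."""
    lifetime = freshness_lifetime(headers, response_time)
    return lifetime is not None and now - response_time < lifetime
```

Note that max-age is checked first, so it wins when both headers are present.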

Server Revalidation using Conditional Requests

As long as the resource instance hasn't expired the cache can serve the resource to clients freely. However what happens if a client requests a resource from a cache but the resource has expired? In this case the cache must check with the web server to see if the resource has changed before serving it. You wouldn't want to be serving an out of date resource to the client after all. If the resource has not changed the cache is free to serve the resource to the client. If it has changed however the cache must update its local copy of the resource with the current copy from the web server before serving it to the client.

The HTTP mechanism behind this process is the conditional request. A conditional request is an HTTP request which the web server will only fulfil if a specified condition is met. There are a few different types of conditional requests but for the purposes of caching there are essentially two types, If-Modified-Since and If-None-Match.

If-Modified-Since and Last-Modified

A conditional request is structured just like a normal GET request with the addition of a conditional header such as If-Modified-Since. A request with the header If-Modified-Since: Sat, 17 Dec 2011 19:10:22 GMT will return the resource only if the resource has changed since the date specified. In order to use this header the cache must have a date with the last known modification time of the resource in question. HTTP responses contain a header, Last-Modified, which specifies exactly that. Caches will use the value of the Last-Modified header from the cached resource instance as the date in the If-Modified-Since request. If the date in the If-Modified-Since header matches the last modified date of the resource on the web server it means the resource hasn't changed, so the server will return a 304 Not Modified HTTP response without a body. If the resource has been modified since that date the server will return the latest copy of the resource along with an updated Last-Modified header. In either case the cache knows its copy is now fresh and can send it on to the client. Also note that in either case the server should send updated Expires/Cache-Control headers so the cache knows how long to keep serving the resource instance before another conditional request is needed.
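As a rough illustration of the server side of this exchange, the following Python sketch decides between a 304 and a full 200 response. The function name respond_if_modified_since, the (status, headers, body) return shape and the max-age=600 lifetime are all assumptions of mine, not part of any particular web server.

```python
from email.utils import parsedate_to_datetime


def respond_if_modified_since(request_headers, last_modified, body):
    """Return (status, headers, body) for a GET that may be conditional.

    `last_modified` is the HTTP-date string the server holds for the resource.
    """
    response_headers = {
        "Last-Modified": last_modified,
        "Cache-Control": "max-age=600",  # illustrative lifetime only
    }
    since = request_headers.get("If-Modified-Since")
    if since is not None:
        try:
            # Unchanged if the resource is no newer than the cache's date.
            unchanged = (parsedate_to_datetime(last_modified)
                         <= parsedate_to_datetime(since))
        except (TypeError, ValueError):
            unchanged = False  # unparseable date: send the full response
        if unchanged:
            # The cache's copy is still good, so no body is sent.
            return 304, response_headers, b""
    return 200, response_headers, body
```

Both branches return fresh caching headers, so the cache knows how long it can go before asking again.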

If-None-Match and ETags

There are some situations where the Last-Modified date isn't perfect such as:

  • when a modification date changes but the data inside the resource doesn't;
  • where web servers don't know the last modification dates of certain resources;
  • where a resource might change in a very small way which doesn't really warrant caches being forced to download the latest version.

ETags are an alternative to Last-Modified which were introduced in HTTP/1.1 to help resolve these issues. An ETag is just a quoted string attached to the resource. It can be anything you want but generally it will be some kind of version number. Whenever the resource changes in a significant way the version number changes as well. The If-None-Match header uses the ETag value of the cached resource instance. For example, a server processing a request with the header If-None-Match: "version 3.1" will return the resource only if the "version 3.1" ETag no longer matches the resource on the server. If it matches, a 304 Not Modified HTTP response is returned without a body. If the ETags don't match it means the resource has changed, so the server will return the latest copy of the resource along with an updated ETag. In either case the cache knows its copy is now fresh and can send it on to the client. Also note that in either case the server should send updated Expires/Cache-Control headers so the cache knows how long to keep serving the resource instance before another conditional request is needed.
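The ETag check can be sketched in the same way. Again the function name and return shape are my own invention for illustration. This version does a weak comparison (it ignores a leading W/ on either tag) and splits the header on commas, so it would mishandle an ETag that itself contains a comma.

```python
def _opaque(tag):
    """Strip the weak-validator prefix so W/"x" compares equal to "x"."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def respond_if_none_match(request_headers, current_etag, body):
    """Return (status, headers, body) for a GET that may carry If-None-Match."""
    response_headers = {
        "ETag": current_etag,
        "Cache-Control": "max-age=600",  # illustrative lifetime only
    }
    header = request_headers.get("If-None-Match")
    if header is not None:
        # A cache may send several ETags, or "*" to match anything.
        candidates = [_opaque(tag) for tag in header.split(",")]
        if "*" in candidates or _opaque(current_etag) in candidates:
            return 304, response_headers, b""
    return 200, response_headers, body
```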

An important point to note about ETags is that they can cause problems in multi-server hosting setups unless configured properly, typically because each server can end up generating a different ETag for the same file. These types of hosting setups are becoming more and more common, so unless you really need to use ETags it's probably better to disable them completely and rely on Last-Modified. I won't go into the details of how to disable them as it is different for every web server, but a quick web search will provide the answer for your particular server. If you do use them, investigate them carefully to make sure you use them properly in multi-server environments. You may also want to look into weak vs strong validators.

Restricting Caching

What about when you need to restrict caching? You might have a resource which changes on a daily basis and should really not be cached at all. HTTP/1.1 provides the answer with the Cache-Control header. We have seen this header used already as an alternative to Expires. It can also be used to restrict caching.

Cache-Control: no-store - Caches are forbidden to store the resource.
Cache-Control: no-cache - Can be stored but only served to the client after a server revalidation using a conditional request.
Cache-Control: must-revalidate - Caches are permitted to serve fresh copies of a resource instance but must revalidate if the copy is stale.

So the no-store value forbids a cache from storing a resource in any way. A no-cache value forces a revalidation every single time the resource is requested, even if it is still fresh. As we have seen, normally a cache will serve the fresh resource without revalidating. The must-revalidate value seems to just enforce what caches should be doing already, revalidating stale resources. However, HTTP/1.1 does allow caches to serve stale resources in some circumstances, for example when the web server cannot be reached, and must-revalidate stops this behavior.
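The three directives can be summed up in a small decision function. This is a simplified model of my own rather than the logic of any real cache: the names parse_cache_control, may_store and on_request are hypothetical, and the "serve-stale" branch stands in for the case where a cache is allowed to serve a stale copy because it cannot reach the web server.

```python
def parse_cache_control(value):
    """Split a Cache-Control value into a dict of lowercase directives."""
    directives = {}
    for part in value.split(","):
        name, _, argument = part.strip().partition("=")
        if name:
            directives[name.lower()] = argument or None
    return directives


def may_store(directives):
    """no-store forbids keeping any copy at all."""
    return "no-store" not in directives


def on_request(directives, is_fresh, origin_reachable=True):
    """Decide what to do with a stored copy when a client asks for it."""
    needs_check = "no-cache" in directives or not is_fresh
    if not needs_check:
        return "serve"        # fresh copy, no directive in the way
    if origin_reachable:
        return "revalidate"   # send a conditional request first
    if "no-cache" in directives or "must-revalidate" in directives:
        return "error"        # not allowed to serve without checking
    return "serve-stale"      # tolerated only when the server is unreachable
```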

You can also set an Expires header with a past value to have the same effect as Cache-Control: no-cache for older HTTP/1.0 caches.

Cache settings for various resources

So now we understand the mechanisms behind HTTP caching how do we use them effectively? Following are some suggestions about cache settings for various types of resources. These are just guidelines, it really depends on your website.

Static resources which rarely change

Images, CSS and JavaScript files are the prime examples of this kind of resource. For these types of resources it makes sense to set the expiration date of the resources far into the future. A value of 6 months into the future seems reasonable. This way you will get the maximum benefit from proxy and browser caches.

You might be thinking, so what happens when one of these resources does change? Surely the caches are going to keep serving their old copies until they expire, which could take months! The simplest way around this is to rename the file when you change it. You will obviously also need to change any HTML/CSS which references the resource. So for example you might set a logo.png to have an Expires value far into the future. If you then edit the logo.png you could rename it to logo1.png and update all img tags and CSS properties which reference it. This way the caches treat it as a completely different resource and they will all fetch it from the web server.

Expires: <gmt date 6 months from now>
Cache-Control: max-age=15728400
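If you generate these two headers from a script, it helps to derive both from a single lifetime so they can't drift apart. Here is a small Python sketch. The function name expiry_headers is my own, and it only uses the standard library.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expiry_headers(lifetime, now=None):
    """Build matching Expires and Cache-Control headers for one lifetime."""
    now = now or datetime.now(timezone.utc)
    return {
        # usegmt=True produces the "... GMT" form that HTTP expects.
        "Expires": format_datetime(now + lifetime, usegmt=True),
        "Cache-Control": "max-age=%d" % int(lifetime.total_seconds()),
    }
```

For example expiry_headers(timedelta(days=182)) gives a max-age of 15724800, which is roughly six months.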

Static HTML pages

HTML pages which are manually edited are also unlikely to change very often. However modifying the filenames of these resources when a change occurs can be more difficult. People might have created bookmarks to these pages and changing the filename will break the bookmarks. Also search engine spiders might have indexed pages which means you probably don't want to change the filenames. For these types of resources I would still suggest to set an Expires header into the future but only for a few days. That way you still take advantage of some caching benefits but any changes will be updated in the proxy and browser caches reasonably quickly.

Expires: <gmt date 3 days from now>
Cache-Control: max-age=259200

Dynamic pages

These are HTML pages generated by server side scripting languages such as PHP, Perl, Python or .NET, just to mention a few. These pages will often change for each request and therefore should really not be cached at all. The best thing to do is to disable caching using the Cache-Control header and a past Expires header.

Expires: <gmt date in the past>
Cache-Control: no-cache, must-revalidate

Sensitive resources

By this I mean a resource which is for one user's eyes only. Generally these will be dynamically generated from server side scripts based on a user's identity. An example of this might be a page with a list of invoices which requires a login to access. You can imagine a situation where a proxy caches a page of Bob's invoices. Then Mark makes a request which goes through the same proxy and the proxy serves Bob's list of invoices to Mark's browser. Also, you have no idea who is controlling and working on the proxy servers between the client and your web server. What's to stop an employee taking a look at private cached data?

The obvious way to stop this is to use the Cache-Control header and set a past Expires header as well. However, does this really guarantee the resource won't be cached? Well no, it doesn't. There is nothing stopping a proxy cache from ignoring your Cache-Control headers and caching the response. Of course they shouldn't do this, and in most cases they won't cache the response, but there is no ironclad guarantee. The only way to stop this happening is to serve the resource over HTTPS. This will ensure the request and response are encrypted and can't be viewed by any proxy servers.

One thing to note is this won't necessarily stop the browser cache from caching the resource. If you want to do this you should also use Cache-Control and set a past Expires header. I think most browsers will only cache resources over HTTPS during the lifetime of a user session so when the browser closes any cache should be thrown out. However to be safe it's best to set the headers anyway.

Of course any data of a highly sensitive nature should be password protected and encrypted over HTTPS anyway, even without considering this caching issue, so it's a bit of a redundant point. But it just reinforces the need to always serve highly sensitive data over HTTPS.

It's also worth noting that you should consider cookie data. Some cookie data is highly sensitive, such as cookies storing session ids or cookies storing personal data such as a person's name and address. So any server side scripts which set sensitive cookie data should also be secured using HTTPS and the cache headers mentioned above. Caches are not supposed to cache any Set-Cookie headers but we want to be safe. Additionally it's probably a good idea to set the Secure attribute of any sensitive cookies to make sure they are only transmitted over HTTPS requests.

Expires: <gmt date in the past>
Cache-Control: no-store, no-cache, must-revalidate
Set-Cookie: PHPSESSID=763ghs76GH3gh356a; path=/; secure
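For anyone building these headers in Python rather than PHP, here is a hedged sketch using the standard library's http.cookies module. The function name sensitive_headers and the fixed 1970 date are my own choices, and the PHPSESSID name is only reused from the example above.

```python
from http.cookies import SimpleCookie

PAST_DATE = "Thu, 01 Jan 1970 00:00:00 GMT"  # any date in the past will do


def sensitive_headers(session_id):
    """Return a list of (name, value) headers for a private, uncacheable page."""
    cookie = SimpleCookie()
    cookie["PHPSESSID"] = session_id
    cookie["PHPSESSID"]["path"] = "/"
    cookie["PHPSESSID"]["secure"] = True  # only send the cookie over HTTPS
    return [
        ("Expires", PAST_DATE),
        ("Cache-Control", "no-store, no-cache, must-revalidate"),
        ("Set-Cookie", cookie["PHPSESSID"].OutputString()),
    ]
```

Remember that these headers are only a request to well behaved caches. Serving the page over HTTPS is still what keeps it away from the proxies.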

How to configure Caching

So this is all well and good, but as a website developer how do you actually set these various headers? Just like with disabling ETags, the answer depends on your web server and also on your server side scripting language. These are too numerous to go into in detail. However, for the most popular web server, Apache, you can use the mod_expires module which will send both an Expires and a Cache-Control header in an easy to control manner. Additionally, with PHP you can set HTTP headers using the header() function. This is useful for setting Cache-Control: no-cache and a past Expires header on dynamic resources to limit caching.
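The same idea carries over to other languages. As one example, here is a minimal Python WSGI application that sets the dynamic page headers suggested earlier. The function name dynamic_page and the page body are placeholders of mine, so treat it as a sketch rather than a recommended setup.

```python
def dynamic_page(environ, start_response):
    """A WSGI app whose response asks caches to revalidate every time."""
    body = b"<html><body>Generated per request</body></html>"
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # A past Expires covers older HTTP/1.0 caches.
        ("Expires", "Thu, 01 Jan 1970 00:00:00 GMT"),
        ("Cache-Control", "no-cache, must-revalidate"),
    ])
    return [body]
```

To try it locally you could pass dynamic_page to wsgiref.simple_server.make_server from the standard library and inspect the response headers in your browser.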

Client Caching controls and Browser Reload

Web clients such as browsers can also set Cache-Control headers. These can be used to force a revalidation even if a resource instance is still fresh. Browsers use these headers when you do a reload (F5) or a hard reload (Ctrl+F5). Each browser is slightly different in what it sends, but the basic idea is that a hard reload will bypass all caches and get the resource from the web server directly. So sometimes when you modify a resource and don't change its name you will have to do a hard reload to get the latest instance. Be aware that this will not guarantee everyone else will see the latest resource. You will have to change the filename for that.
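You can imitate this from a script. The sketch below builds (but does not send) a request with the kind of headers browsers commonly use: Cache-Control: max-age=0 for a normal reload, and Cache-Control: no-cache plus Pragma: no-cache for a hard reload. The exact headers vary between browsers, so these are assumptions, and reload_request and the example.com URL are made up for illustration.

```python
from urllib.request import Request


def reload_request(url, hard=False):
    """Build a GET request that asks caches to revalidate or be bypassed."""
    if hard:
        # Ask every cache on the path to go back to the web server.
        headers = {"Cache-Control": "no-cache", "Pragma": "no-cache"}
    else:
        # Treat any cached copy as stale, so it must be revalidated.
        headers = {"Cache-Control": "max-age=0"}
    return Request(url, headers=headers)
```

Passing the result to urllib.request.urlopen would send it, for example urlopen(reload_request("http://example.com/logo.png", hard=True)).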

Further information

http://www.http-guide.com/ - HTTP: The Definitive Guide. This is a great book which covers every aspect of HTTP in detail, including caching.

http://stevesouders.com/hpws/rules.php - High Performance Web Sites. Fourteen simple rules to make your website faster.

http://redbot.org/ - RED is a robot that checks HTTP resources to see how they'll behave, pointing out common problems and suggesting improvements.

https://addons.mozilla.org/en-US/firefox/addon/httpfox/ - An HTTP analyzer addon for Firefox.