Squid as a reverse proxy

Using the Squid cache as a reverse proxy can save traffic and bandwidth and speed up your web server considerably. Yet few people take advantage of this function.

The Squid cache has been a popular and well-known open source solution for years. Although the reverse proxy function was implemented a long time ago, it is rarely used, even though a reverse proxy can be a real boon for website operators.

Normally, Squid is used as a web cache in companies or as an accelerating proxy in small LANs with slow Internet access. As is often the case with good ideas, the idea behind the Squid web cache is actually quite simple: On the inside of your LAN, you install a computer with Squid as a proxy for web requests - or for ftp and other protocols supported by Squid - in front of the Internet access point.

The Internet applications in the LAN are then configured to use the Squid computer as a proxy for the relevant protocols. For example, the browsers send HTTP requests to Squid instead of to the LAN's Internet gateway.

Squid accepts these requests and either delivers the response from its own cache, or fetches it from the original URL if it does not yet know the page or the cache entry is out of date. On the one hand, this results in requests from clients on the LAN being answered much faster if the same page has already been requested. On the other hand you save bandwidth and traffic to the outside.
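The hit-or-fetch decision described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not how Squid is implemented: the class name TinyProxyCache, the fixed lifetime ttl_seconds and the fetch_from_origin callback are all invented for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CacheEntry:
    body: bytes
    expires_at: float  # absolute time in seconds after which the entry is stale


class TinyProxyCache:
    """Toy model of a caching proxy (hypothetical, not Squid's real design)."""

    def __init__(
        self,
        fetch_from_origin: Callable[[str], bytes],
        ttl_seconds: float,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch_from_origin  # assumed callback that contacts the origin
        self._ttl = ttl_seconds          # assumed fixed lifetime for every object
        self._clock = clock
        self._entries: Dict[str, CacheEntry] = {}
        self.origin_requests = 0         # counts round trips to the origin

    def get(self, url: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now < entry.expires_at:
            # Fresh entry: answer from the cache, no round trip to the origin.
            return entry.body
        # Unknown or outdated entry: fetch from the origin and remember the result.
        self.origin_requests += 1
        body = self._fetch(url)
        self._entries[url] = CacheEntry(body, now + self._ttl)
        return body
```

Calling get() twice for the same URL within the lifetime contacts the origin only once; the counter origin_requests makes that saving visible.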

Advantages of Squid

The interesting thing about Squid is that you can also use it for client requests from the other side: Squid also acts as a reverse proxy. This sets Squid apart from many other web proxies. Microsoft's ISA Server 2004 can also be used as a reverse proxy, but it is considerably less flexible than Squid.

Setting up the reverse proxy is as straightforward as setting up a conventional proxy: The web server operator installs Squid in front of the actual web server. The DNS entries for the website are then changed so that they no longer point to the web server itself, but to the Squid computer. Client requests thus arrive at Squid instead of at the web server.

However, the advantages of this configuration are not as obvious as they are with a normal web cache. Nevertheless, the effect of Squid is dramatic, and in this usage scenario too there are two important benefits. Which one matters more depends on the nature of the website.

Less load on the web server

Squid, configured as a reverse proxy, accepts all client requests. It handles each request in basically the same way as a normal proxy: it checks whether it already knows an answer for this request. If the answer is not yet outdated, it delivers it directly, and there is no further round trip to the actual web server. Only if the answer is outdated or not available does it forward the request to the actual web server. This results in a dramatically reduced number of requests to the web server. At first glance, this is nothing particularly exciting: After all, you don't save any requests overall, because Squid still has to receive and process them.

But: Squid can answer a request many times faster than the web server. This is especially true for requests that are answered by programs or scripts (PHP, CGI, ASP, ...). With Squid, a lookup in a large hash table is essentially enough. The web server, on the other hand, needs at least one disk access. It equips the loaded file with additional http header information before delivering it.

In practice, the web server will also start its own process (or thread) to process the file before it is delivered: For example, the PHP engine then runs there to dynamically assemble the page, in the worst case from the results of a database query. This consumption of CPU power and memory is of course clearly noticeable.

Squid, on the other hand, needs only a single process, which it uses to search its cache: While the load of a typical web server access is already visible in simple tools like top (under Unix), the load caused by Squid remains practically zero.

Long story short: Using Squid saves considerable computing power on web and database servers. So you can serve significantly more users with the same web server hardware than without Squid. And this in turn saves additional time-consuming maintenance work.

Decoupling

The web server itself will almost always run alongside other machines in a DMZ: At least one database server, and probably backup hardware as well, will have to be located in this DMZ.

This almost inevitably means that the database server and the web server cannot be located far apart: The web server will need to reach the database server at 100 Mbit/s or more, which would be expensive over Internet connections. Ultimately, this means that all the hardware of the web presence must be set up in close proximity.

However, this does not apply to Squid. A fast connection between Squid and the web server is nice to have, but it is not absolutely necessary, since only a fraction of the data actually needs to be transported between the two.

This means that Squid can actually be placed in a different location than the web server, which may again save on line costs. It is much easier to relocate a single Squid server than the whole DMZ. It is therefore much easier to find a place for Squid that is as cheap as possible in terms of bandwidth and traffic. The DMZ, on the other hand, can be placed wherever it is most convenient organizationally. This flexibility is not possible without Squid.

Squid decouples the DMZ from the Internet

In addition to the facts mentioned so far, there is another detail concerning the parameters of Squid: If Squid is located in the immediate vicinity of the web server, it will be configured in such a way that, if possible, no hard disk accesses take place on Squid at all. The program should then either answer queries directly from RAM, or try to reach the web server quickly.

Since Squid is primarily a web cache, the program can of course also store objects on the hard disk. So if Squid is physically separated from the web server, it is recommended to keep objects on disk as long as possible. If the website only needs a few short-lived objects, then in practice it is also possible to place Squid in a data center with high bandwidth. For the connection of the web server itself, an economy line is quite sufficient in this case.

Configuring Squid

So much for the theory - now for the practice. The installation of Squid under Linux (and other Unix derivatives) is done in the usual way: For some distributions there are ready-made binaries available, but normally you should compile Squid for your own system.

The complete documentation, the source code and a lot of additional information can be found on the site Squid-Cache.org.

After unpacking the tarball, build Squid in the usual way; the final make install step also creates the Squid configuration file squid.conf. It is much more extensive than you would expect - but fortunately you only need a small part of the available configuration options for operation as a reverse proxy. Here are the necessary settings in squid.conf.

http_port: With this option you set on which port and IP address Squid should wait for incoming connections. Since Squid is supposed to act as a reverse proxy for an HTTP server, the port number is logically 80. The IP address is simply the public IP address of the system. So basically you need an entry like

http_port aaa.bbb.ccc.ddd:80

You make this basic setting relatively high up in the configuration file. After that, there is nothing for a long time that matters for operation as a reverse proxy. The next relevant section is called "Options which affect the Cache-Size".

cache_mem: With this option you define the main memory to be used. By default, squid.conf contains the entry 8 MB - apparently there is not much to cache at squid-cache.org. Of course, this value is clearly too low. For your cache to work optimally, it is best to equip the computer with enough RAM to accommodate as much of the website as possible. Since the computer also needs some memory for other things, you can't give all the RAM to Squid, though. You should reserve between 128 and 256 MByte for the system and give the rest to Squid. At the website already mentioned, Squid runs on a computer with 2 GByte of RAM and a cache_mem setting of 1800 MB.

maximum_object_size and maximum_object_size_in_memory: These two options specify the maximum size of an object for it to be cached at all, and the maximum size of an object for Squid to keep it in main memory. These values depend on the website, of course. In some cases, it may be better to keep a few large objects in RAM - for example, if they are requested very frequently. In other cases, it is better to keep many small objects in memory. Examine your site to see how large 90 to 95 percent of all objects are - and then use that value as the threshold for the options.
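One way to find the size that covers 90 to 95 percent of all objects is a nearest-rank percentile over the object sizes of your site. The following Python sketch assumes you have already collected the sizes in bytes, for example from a directory listing or the web server log; the function name size_threshold is made up for this example.

```python
import math
from typing import Sequence


def size_threshold(object_sizes: Sequence[int], fraction: float = 0.95) -> int:
    """Return the smallest size that covers at least `fraction` of all objects.

    Uses the nearest-rank percentile method on the sorted sizes.
    """
    if not object_sizes:
        raise ValueError("need at least one object size")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(object_sizes)
    rank = math.ceil(fraction * len(ordered))  # 1-based rank in the sorted list
    return ordered[rank - 1]
```

The result is a value in bytes; round it up to a convenient figure before using it as the threshold in squid.conf.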

cache_dir: Use this option to specify the size and location of the cache directory. Whether objects should be swapped to disk at all, however, depends on the spatial relationship between the web server and Squid, as mentioned earlier. If for some reason you don't want Squid to page data to disk at all, specify a size of 0 for the cache. Otherwise, there is nothing wrong with giving Squid as much space as it needs on the disk.

Cache attributes in the web server

A Squid configured in this way will already take a lot of work off the web server. However, this is not the end of the story. As mentioned at the beginning, Squid only delivers responses from the cache if they are not yet outdated.

Of course, this raises the question of how Squid can know whether the shelf life of an object has expired. And that's relatively easy to answer: It can't know, of course.

So in order for Squid to work efficiently, the web server has to provide the objects with appropriate expiration information. This works differently from web server to web server, but the form of the information is always the same. An object delivered by the web server may contain four tags in its HTTP header that control caching: "Last-Modified", "Expires", "Cache-Control" and "Pragma".

Last-Modified specifies when an object was last modified. Squid can use this information to weigh whether to request an object again from the server. However, it is better not to leave this decision to Squid, but to use the Expires tag, because it lets you determine yourself when the object expires. Only once that time has passed will Squid request the object again from the server. So for images that normally never change on the server, you can specify a very long shelf life, for example.

By the way, assigning the Expires tag is useful even if you don't want to run Squid. After all, many client browsers use this information to decide whether or not to re-request a given object.

The Cache-Control and Pragma tags serve the same purpose: they inform caches whether a page should or should not be cached at all. So the two tags are mainly important for pages whose content should not be cached.
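How a cache might combine these tags can be illustrated with a small Python sketch. It is deliberately simplified and does not reproduce Squid's actual rules: the function name is_fresh is invented, only a few Cache-Control directives are checked, header names are looked up exactly as spelled here, and directives such as max-age as well as Last-Modified heuristics are ignored.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping


def is_fresh(headers: Mapping[str, str], now: datetime) -> bool:
    """Simplified check whether a stored response may be served from the cache.

    `now` must be a timezone-aware datetime.
    """
    cache_control = headers.get("Cache-Control", "").lower()
    directives = {d.strip() for d in cache_control.split(",") if d.strip()}
    # Assumption: any of these directives means "do not serve from a shared cache".
    if directives & {"no-cache", "no-store", "private"}:
        return False
    if "no-cache" in headers.get("Pragma", "").lower():
        return False
    expires = headers.get("Expires")
    if expires is None:
        # No explicit shelf life: this sketch simply treats the object as outdated.
        return False
    try:
        expires_at = parsedate_to_datetime(expires)
    except (TypeError, ValueError):
        return False  # an unparsable date counts as already expired
    if expires_at.tzinfo is None:
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    return now < expires_at
```

A response with a future Expires date and no prohibiting directive counts as fresh; everything else would be requested again from the web server.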

Set caching for Apache

For Apache, you specify the assignment of these tags for directories or files via httpd.conf, where you can also specify global defaults. For example, to make all GIF files that are not tagged with a dedicated expiration date expire only in the distant future, use the two lines:

ExpiresActive On
ExpiresByType image/gif "modification plus 7200 days"

ExpiresByType can be used multiple times, of course. To specify a shelf life within a container, use ExpiresDefault. The directive takes a string as a parameter that specifies the shelf life. For example, to specify a shelf life of about ten years after the object was last modified, use:

ExpiresDefault "modification plus 3600 days"
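To see which date a rule such as "modification plus 3600 days" produces, the arithmetic can be reproduced in Python. This is only a sketch of the date calculation, not Apache code; the function name expires_header is hypothetical, and the modification time is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expires_header(last_modified: datetime, days: int) -> str:
    """Return an HTTP date that lies `days` days after `last_modified`.

    `last_modified` should be timezone-aware; naive values are
    interpreted as local time by astimezone().
    """
    expires_at = last_modified.astimezone(timezone.utc) + timedelta(days=days)
    # usegmt=True yields the "GMT" suffix used in HTTP headers.
    return format_datetime(expires_at, usegmt=True)
```

For an object last modified on 1 January 2020, 3600 days lead to an expiration date in November 2029, a little under ten years later.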
