Archiving a WordPress blog to flat HTML files

For the last few years I have been running WordPress blogs for ‘the Setbuild Project’, which lasts 3 months each year. The blogs are only actively used for that short period, and since it’s a new group of people doing the project each year I end up making a new install of WordPress each time. This is OK, but it has left a lot of mess on the server and has a growing maintenance cost.

I want to keep the old journals available for reference as an archive, but need them to be entirely ‘locked’ and protected from being attacked or vandalised. Wouldn’t it be great if I could simply flatten the WordPress/php/mysql down to a bunch of static html files?

After googling around a bit I found a bunch of people asking how to do this but no really useful answers, so through some experimentation I figured out a strategy of my own. I post here the process I used; hopefully this is a useful reference and maybe someone can come up with a simpler way.

Before we start: this is not entirely straightforward to do, and once done your WordPress blog will lose a lot of its current features. To be clear, the site becomes completely static; you can’t add posts, edit, or do anything to the site through the admin interface. Any dynamic elements of the site will break. Search will no longer work. Existing comments will be archived, but new ones cannot be added. Any widgets or other content loaded dynamically from elsewhere will likely be frozen at the moment you produce the archive.

In exchange for all that, your site no longer relies on PHP and MySQL running on the server, so it’ll put less strain on your webhost to keep it going (and be more portable). But perhaps most important, it will no longer be a target for malicious scripters, so you won’t need to worry about keeping it patched and up to date.

In brief the process is this:

  1. Prepare the site to be flattened, remove any dynamic content.
  2. Get your permalinks sorted.
  3. Download the site.
  4. Tidy up the downloaded files.

1. Prepare the Site

The first step is to adjust the appearance of the site so that fewer bits are left broken once it has been flattened. Exactly what you need to do depends on your specific site; in my case I had to deal with Comments, the sidebar Search field and Login buttons, and some plugins.

  • Turn off comments on all posts.
  • Turn off Caching. If you’re running WP-supercache or another caching plugin on your site (you should be!) turn it off first; it’ll make the rest of these steps easier. (Nb. with WP-supercache, make sure you turn off caching before disabling the plugin: set No Caching in the Supercache preferences, Delete all cached files, then disable the plug-in.)
  • If you have a Search widget in the sidebar of your site, get rid of it (if you haven’t set up custom widgets and still have a search function, you can probably hide it easily by adding display:none; to the appropriate selector in your theme’s CSS file).
  • Remove the ‘Meta – Login’ link, in fact you may as well remove the entire Meta section from the sidebar.
  • Remove/disable RSS feeds, this is probably not necessary but I spent a little time commenting out the various feed links in the theme header.php and footer.php.
  • Turn off any plugins which are not relevant to the soon-to-be-flat version of your site. In my case this was ReCapture, Akismet, Facebook Connect.

2. Fix Permalinks
This is important!
Once your site loses its PHP-ness, any links which contain a query string (a ? in the address bar) will break.
If you’re using the default ‘pretty’ permalinks option you should be fine, but if you never got around to setting this up, do it now.
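
One way to check whether any query-string links are still lurking in a page is to scan its HTML for them. The following is only a minimal sketch, not part of the process I actually used: the function name `find_query_links` is made up for illustration, and it only looks at `href` and `src` attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class QueryLinkFinder(HTMLParser):
    """Collect href/src values whose URL carries a query string."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # urlsplit(...).query is the part after "?", empty if there is none.
            if name in ("href", "src") and value and urlsplit(value).query:
                self.found.append(value)


def find_query_links(html_text):
    """Return every href/src in html_text that contains a query string."""
    finder = QueryLinkFinder()
    finder.feed(html_text)
    return finder.found
```

Feeding it the saved source of a few pages (home page, a single post, an archive page) should give a rough idea of whether the permalink settings still need attention. Theme assets with version suffixes such as `style.css?ver=3` will show up too, so expect some noise.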

3. Downloading the site
Most of the google results for ‘convert wordpress site to html’ ended with a comment along the lines of “use wget”. Since I’m a Mac user and the wget command line tool isn’t installed by default, I had two options: install wget (install Macports, install a bunch of libraries, install wget) or find something else. It turns out there is a pretty nice open-source tool called GetLeft, which is designed for recursively downloading webpages: it follows their links and downloads more webpages. Better yet, it is a GUI tool and a Mac-native version is maintained.
Download GetLeft. Once installed, tell GetLeft to start downloading at the root level of your site, give it a folder to save to, and press Go. A little while later (over an hour when I did it) it will have downloaded your site.

Nb. I ran into a few problems. On the first couple of attempts, for some reason GetLeft only managed to download error pages where there should have been content. With a bit of meddling this seemed to be a permalink problem; resetting these to WordPress defaults fixed it and the site downloaded fine. YMMV.
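
For anyone who does have wget available, the “use wget” route might look roughly like the sketch below. This is not what I ran (I used GetLeft), so treat it as an untested-on-my-site assumption: it presumes wget is installed and on the PATH, the helper names `build_wget_command` and `mirror_site` are invented for illustration, and the URL in the usage note is a placeholder.

```python
import subprocess


def build_wget_command(site_url, output_dir):
    """Build a wget command line for mirroring a site into output_dir."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links so the copy works offline
        "--adjust-extension",  # save pages with an .html extension
        "--page-requisites",   # also fetch the CSS, images and scripts pages need
        "--no-parent",         # never climb above the starting folder
        f"--directory-prefix={output_dir}",
        site_url,
    ]


def mirror_site(site_url, output_dir):
    """Run wget; raises CalledProcessError if wget exits with an error."""
    subprocess.run(build_wget_command(site_url, output_dir), check=True)
```

Calling `mirror_site("http://www.example.com/blog/", "flat-copy")` would start the download. Older wget versions spell `--adjust-extension` as `--html-extension`.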

3 ½. You can stop now!
If you prepped the site well and GetLeft has done its thing, you should now have a folder with a fully working flattened HTML copy of your WordPress blog on your local computer.

the downloaded site, with working but less than ideal folder structure
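
Clicking around the downloaded copy is the obvious way to check it, but a small script can also look for relative links that point at files which were never downloaded. This is a rough sketch under some assumptions: the name `find_broken_local_links` is hypothetical, only `href` and `src` attributes are checked, and external or root-relative links are skipped rather than tested.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlsplit


class LinkCollector(HTMLParser):
    """Collect every href/src value found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_broken_local_links(folder):
    """Return (page, link) pairs for relative links that point at nothing."""
    broken = []
    for page in sorted(Path(folder).rglob("*.html")):
        collector = LinkCollector()
        collector.feed(page.read_text(encoding="utf-8", errors="replace"))
        for link in collector.links:
            parts = urlsplit(link)
            # Skip external links, mailto:, in-page anchors and root-relative paths.
            if parts.scheme or parts.netloc or not parts.path or parts.path.startswith("/"):
                continue
            target = page.parent / unquote(parts.path)
            if not target.exists():
                broken.append((page, link))
    return broken
```

A link to a folder counts as fine here as long as the folder exists, so this will not notice a missing index page inside it.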

4. Tidy up downloaded files
This step might not be necessary: if your site wasn’t too messy, GetLeft might do all the link adjustments needed.

In my case GetLeft did a pretty excellent job of updating links in pages as it downloaded them to suit the new file extensions. But there was one notable problem to fix: absolute links within the same domain name. These are links in the pages which contain the whole www… address of the item they’re pointing to and are on the same domain name as the downloaded site. For most WP sites these are only likely to be a few theme images which are specified via CSS styles. In addition to this, my site used a custom theme which I’d made in a lazy way and which had a couple of ‘hard’ links to other areas of the site outside of WP.

GetLeft’s solution to some of these absolute links is to add folders beneath the root folder of the site you’ve downloaded, all the way back to the server name (see screengrab). This wasn’t ideal for me, so I had to do a couple of careful Find and Replace actions across all the .html files in the downloaded site. For this I used the excellent TextWrangler by Bare Bones Software.

The Find and Replace actions you’re going to need to do will, once again, be individual to your site. Mine looked like the image on the right.

The aim of these is to remove the additional folder structure (actually, in the case of the screenshot, it was to update links that GetLeft didn’t catch).
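
The same kind of Find and Replace across every .html file can also be scripted. The sketch below is a stand-in for the TextWrangler step, not a record of it: the function name `replace_in_html_files` is made up, it does a plain text replacement with no pattern matching, and it assumes the files are UTF-8. Run it on a copy of the download first, since it overwrites files in place.

```python
from pathlib import Path


def replace_in_html_files(folder, find, replace):
    """Replace `find` with `replace` in every .html file under `folder`.

    Returns the list of files that were actually changed.
    """
    changed = []
    for path in sorted(Path(folder).rglob("*.html")):
        original = path.read_text(encoding="utf-8")
        updated = original.replace(find, replace)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

For example, `replace_in_html_files("flat-copy", "http://www.example.com/blog/", "")` would strip a placeholder absolute prefix from links. That only gives correct relative links for pages sitting in the root folder; pages in subfolders need a replacement that suits their depth, which is one reason these actions end up being specific to each site.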

If this all works out successfully you should be left with a lightweight, fully working, static HTML ‘freeze’ of your WordPress blog. You can then upload this back to your webserver and put it online. If you’re replacing the live WordPress site with the archive, it’s a good idea to make a backup copy of the entire WordPress install and a dump of the MySQL database before uploading the static site.
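
The backup copy of the WordPress install can be as simple as a compressed archive of the folder. Here is a minimal sketch, assuming you have a local copy of the install folder to work from; the name `backup_folder` is hypothetical, and the database dump still has to be made separately with whatever tool your host provides.

```python
import shutil
from pathlib import Path


def backup_folder(install_dir, archive_base):
    """Pack install_dir into <archive_base>.tar.gz and return the archive path."""
    install_dir = Path(install_dir).resolve()
    return shutil.make_archive(
        str(archive_base),
        "gztar",
        root_dir=str(install_dir.parent),  # archive paths are relative to this
        base_dir=install_dir.name,         # so the archive holds one top-level folder
    )
```

Keep the archive somewhere outside the folder being archived, otherwise the half-written archive can end up inside itself.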


2 Comments

  1. Posted March 6, 2012 at 10:59 am

    Yeah, def worth checking out wgets features for this; you can also change all links to relative paths etc for running locally/redeploying, pretty much all in one line of code too. Would’ve thought OSX had this baked in but ho hum.

    Nice useful post 🙂

  2. Posted August 25, 2012 at 10:42 pm

    thank you for this detailed description. Useful.
