Static migration

Piers Cawley

Finally catching up with Dominus

If you’re reading this, I finally got my act together and shifted this blog over from Publify (née Typo) to a static site generated by Hugo.

I wish I’d done it earlier.

I also wish I’d not decided that Textile was the right lightweight markup language for me back in 2003 when I started this blog, and that I’d not used platform-specific techniques for code highlighting. Converting all that to GitHub Flavoured Markdown was a bear, massively helped by pandoc and some very hacky text munging in Perl. Some stuff is still looking very ugly though; I chopped and changed my code formatting over the years, so as time and inclination allows, I’ll try and get stuff fixed by hand.

Why move?

The straw that broke this camel’s back was when I updated Publify and Ruby and spent the next few hours wrestling with Apache and Passenger trying to get the site back up. When the error messages told me that things weren’t working because I didn’t have a JavaScript runtime installed on my host, I knew that the time had come to just go with a system of files in a directory, served up by a simple-minded webserver.

No comments

This first pass at migration has eliminated all the comments. They’re still in my database, but I’ve not worked out how to get them imported into the site’s structure yet. Again, as time and inclination allows, I’ll try and port them over. Who knows, I might even add Disqus comments to the site and let someone else deal with the pain of spam management.

Blogging schedule

I’m making no promises about restarting blogging I’m afraid. The shift to static generation has been an itch I’ve needed to scratch and, as I was building another site with Hugo anyway, now seemed like the opportune moment to move things over.


Other Mentions

  • Way back in 2016, I migrated this site from its Publify incarnation to a static site generated from markdown files by Hugo. (Publify is the Rails-based blogging engine that started out as Typo, which I ended up maintaining for a while before handing it off to Frédéric de Villamil, who is still on the current maintenance team. Go him!)

    In that migration, I fucked up and truncated a buttload of posts, and I didn’t realise what I’d done until about a week ago, long after I’d misplaced the database that the site had originally been generated from.

    Oops.

    However, the Internet is still a marvellous place, blessed with useful sites like The Wayback Machine, which lets the interested reader browse historic versions of web pages. Which means, provided a page got noticed by archive.org’s crawlers, I can fetch a page from back before I fucked up and, with a little bit of massaging, turn it into something that Hugo can understand and get the whole article back again.

    I can even recover the comments, which I had deliberately left out of the initial import, thinking “I’ll get around to importing those as well one day!” Thanks to my ADHD, that never quite happened. Unless you count the current activity, of course.

    “Provided” is doing a lot of work there, and some posts definitely got missed, but something is better than nothing.

    I’ve reached the point in my recovery process where I’ve started to hate the ad hoc way I was grabbing stuff from the archive. I want to only grab an archived page if the archived version is from before I buggered things up. So I’ve written some Emacs Lisp. Obviously.

    Here’s what I’d like to write.

    (with-wayback-page-from-before url 20160618
      (web-tidy-buffer)
      (fixup-escaped-typo:code-blocks)
      (convert-to-org-mode)
      (restrict-to-article-and-comments)
      (fixup-comments)
      (org-string-nw-p
       (buffer-substring-no-properties (point-min) (point-max))))
    

    The idea being that I fetch the archived version of the post into a temporary buffer where I can run it through HTML Tidy and a few extra HTML cleanup steps before converting it to Org format with Pandoc, continuing the cleanup in Org mode (I prefer Emacs’s Org mode tooling to its HTML tooling) and returning a nice clean string to insert into the Org capture buffer. The list of cleanup steps is ever expanding: every time I have to tidy something up by hand for the second time, I add something to the pipeline.
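
    A minimal sketch of what one of those steps might look like, assuming pandoc is on your path (this is an illustration, not the version from my dotemacs repo, which does rather more checking):

    (defun convert-to-org-mode ()
      "Replace the current buffer's HTML with Org markup via Pandoc.
    Minimal sketch: assumes the pandoc binary is on `exec-path'."
      ;; Pipe the whole buffer through pandoc, replacing its contents
      ;; with the converted output.
      (call-process-region (point-min) (point-max)
                           "pandoc" t t nil
                           "-f" "html" "-t" "org")
      (org-mode))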

    I plan to return to the cleanup in future articles, but we’re just concerned with with-wayback-page-from-before for now.

    If you want to see the full code (along with way more stuff), you’ll find it in my dotemacs repo on GitHub. I’m exceedingly unlikely to turn this into a full wayback.el package, but you’re more than welcome to use it as a starting point.

    Here’s what that looks like:

    (defmacro with-wayback-page-from-before (url date &rest body)
      "Run BODY in a buffer holding the newest capture of URL from before DATE.
    The capture is fetched into a temporary buffer, which is current
    while BODY executes.  Does nothing if there is no such capture."
      (declare (indent 2) (debug t))
      (let ((capture-url (make-symbol "capture-url")))
        `(when-let* ((,capture-url (wayback-get-capture-before ,url ,date)))
           (with-temp-buffer
             ;; Synchronous fetch, straight into the temp buffer.
             (request ,capture-url
               :sync t
               :success (cl-function
                         (lambda (&key data &allow-other-keys)
                           (insert data))))
             ,@body))))
    

    It’s a macro that uses wayback-get-capture-before to find the URL of the most recent capture of our target URL before the given date, fetches it into a temporary buffer, and executes the body of the macro. A common Emacs pattern.

    (Do M-x describe-function, type with-, and check out the completions to see just how common.)
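
    For instance, grabbing the text of the most recent pre-migration capture of a post might look like this (reusing the cutoff date from the capture template above; the URL is just an example):

    (with-wayback-page-from-before "https://bofh.org.uk/2016/06/19/static-migration/" 20160618
      (buffer-substring-no-properties (point-min) (point-max)))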

    The real meat lies in wayback-get-capture-before, which uses the Wayback CDX Server API to discover the capture URL we’re interested in. There are other, simpler-to-use Wayback Machine APIs, but they only let us find the closest capture to our date of interest; we want the most recent capture that’s strictly before our date, and that requires the CDX API. I’ve been a little lazy and used the request package to do the web request stuff, because I prefer its API to the native url-retrieve in vanilla Emacs.

    (defvar wayback-cdx-endpoint "https://web.archive.org/cdx/search/cdx"
      "The endpoint for the Wayback Machine's CDX server.")
    
    (defvar wayback-cdx-json-parser
      (apply-partially 'json-parse-buffer :array-type 'list)
      "Parser for json data returned from the CDX server.")
    
    (defun wayback-get-capture-before (url date)
      "Find the most recent Wayback capture of URL from before DATE.
    DATE may be a timestamp string, a number, or an Emacs time value.
    Returns nil if there's no such capture."
      (let ((capture-url nil))
        (request wayback-cdx-endpoint
          :params `((url . ,url)
                    (to . ,(if (or (numberp date)
                                   (stringp date))
                               date
                             (format-time-string "%Y%m%d%H%M%S" date)))
                    (collapse . digest)
                    (output . json)
                    (fl . "timestamp,original")
                    (limit . -1))
          :parser wayback-cdx-json-parser
          :sync t
          :success (cl-function
                    (lambda (&key data &allow-other-keys)
                      (setq capture-url
                            (pcase (cadr data)
                              (`() nil)
                              (`(,timestamp ,target-url)
                               (s-lex-format "https://web.archive.org/web/${timestamp}/${target-url}")))))))
        capture-url))
    

    The CDX API’s JSON response format is derived from a CSV-style text file. We’re only really interested in the timestamp and the “original” URL that our target URL resolved to, so we set (fl . "timestamp,original") in the request parameters and limit the results to the most recent one ((limit . -1)) before ((to . ...)) the date we’re interested in. That gives us:

    [["timestamp", "original"],
     ["20250212175727", "https://bofh.org.uk/2016/06/19/static-migration/"]
    

    You can tell it comes from something CSV-like, can’t you?

    The JSON response gets parsed into a Lisp list and we extract the interesting bits using a pcase statement

    (pcase (cadr data)
      (`() nil)
      (`(,timestamp ,target-url)
       (s-lex-format ...)))
    

    which returns nil for an empty result, or grabs the timestamp and target-url from the second entry in the results list and uses s-lex-format to generate a Wayback Machine URL. Easy.
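
    Concretely, given the JSON response above, (cadr data) is the two-element row, and the pcase turns it into a Wayback URL:

    ;; s-lex-format (from s.el, used above) interpolates the pcase bindings.
    (pcase '("20250212175727" "https://bofh.org.uk/2016/06/19/static-migration/")
      (`() nil)
      (`(,timestamp ,target-url)
       (s-lex-format "https://web.archive.org/web/${timestamp}/${target-url}")))
    ;; => "https://web.archive.org/web/20250212175727/https://bofh.org.uk/2016/06/19/static-migration/"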

    This has the makings of a more general package, but that’s very much a back burner project. It does what I need, and does it well enough that I can consider this yak shaved and get on with the job of recovering my truncated blog posts. I’ll continue on my way of not releasing anything that anyone might want me to support.