Talking to the Wayback Machine

Piers Cawley


Way back in 2016, I migrated this site from its Publify incarnation to a static site generated from markdown files by Hugo. In that migration, I fucked up and truncated a buttload of posts and didn’t realise what I’d done until long after (about a week ago now) I had misplaced the database that the site had originally been generated from.

Oops.

Publify is the Rails based blogging engine that started out as Typo, which I ended up maintaining for a while before handing it off to Frédéric de Villamil, who is still on the current maintenance team. Go him!

However, the Internet is still a marvellous place, blessed with useful sites like The Wayback Machine, which lets the interested reader browse historic versions of web pages. Which means, provided a page got noticed by archive.org’s crawlers, I can fetch a page from back before I fucked up and, with a little bit of massaging, turn it into something that Hugo can understand and get the whole article back again.

I can even recover the comments, which I had deliberately left out of the initial import, thinking “I’ll get around to importing those as well one day!” Thanks to my ADHD, that never quite happened. Unless you count the current activity, of course.

“Provided” is doing a lot of work there, and some posts definitely got missed, but something is better than nothing.

I’ve reached the point in my recovery process where I’ve started to hate the ad hoc way I was grabbing stuff from the archive. I only want to grab an archived page if the archived version is from before I buggered things up. So I’ve written some Emacs Lisp. Obviously.

Here’s what I’d like to write.

(with-wayback-page-from-before url 20160618
  (web-tidy-buffer)
  (fixup-escaped-typo:code-blocks)
  (convert-to-org-mode)
  (restrict-to-article-and-comments)
  (fixup-comments)
  (org-string-nw-p
   (buffer-substring-no-properties (point-min) (point-max))))

The idea being that I fetch the archived version of the post into a temporary buffer where I can run it through HTML Tidy and a few extra HTML cleanup steps before converting it to Org format with Pandoc, continuing the cleanup in Org mode (I prefer Emacs’s Org mode tooling to its HTML tooling) and returning a nice clean string to insert into the org capture buffer.

It’s an ever expanding list of cleanup steps: every time I have to tidy something up by hand for the second time, I add something to the pipeline.
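The real cleanup steps live in the repo, but to give a flavour of their shape, here’s a minimal sketch of how a step like convert-to-org-mode might pipe the buffer through Pandoc. The name comes from my wishlist above; the body is illustrative, assuming pandoc is on your PATH:

(defun convert-to-org-mode ()
  "Replace the current buffer's HTML contents with Pandoc's Org rendering."
  ;; Feed the whole buffer to pandoc, deleting the original text and
  ;; inserting the command's output in its place.
  (call-process-region (point-min) (point-max)
                       "pandoc" t t nil
                       "--from" "html" "--to" "org" "--wrap" "none")
  (org-mode))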

I plan to return to the cleanup in future articles, but we’re just concerned with with-wayback-page-from-before for now.

If you want to see the full code (along with way more stuff), you’ll find it in my dotemacs repo on GitHub. I’m exceedingly unlikely to turn this into a full wayback.el package, but you’re more than welcome to use it as a starting point.

Here’s what that looks like:

(defmacro with-wayback-page-from-before (url date &rest body)
  "Evaluate BODY in a temp buffer holding the last capture of URL before DATE."
  (declare (indent 2) (debug t))
  ;; Bind the capture URL to an uninterned symbol so the expansion
  ;; can't collide with anything BODY refers to.
  (let ((capture-url (make-symbol "capture-url")))
    `(when-let* ((,capture-url (wayback-get-capture-before ,url ,date)))
       (with-temp-buffer
         (request ,capture-url
           :sync t
           :success (cl-function
                     (lambda (&key data &allow-other-keys)
                       (insert data))))
         ,@body))))

It’s a macro that uses wayback-get-capture-before to find the URL of the most recent capture of our target url before the given date, fetches it into a temporary buffer and executes the body of the macro. A common Emacs pattern.

Do M-x describe-function, type with-, and check out the completions to see just how common.
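Here’s a hedged example of what a call site looks like; the URL is the one for the post that crops up later in this article, and the body just returns the fetched HTML as a string:

(with-wayback-page-from-before "https://bofh.org.uk/2016/06/19/static-migration/" 20160618
  (buffer-substring-no-properties (point-min) (point-max)))

If the Wayback Machine has no capture from before the cut-off, the when-let* inside the macro means the whole form quietly evaluates to nil.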

The real meat lies in wayback-get-capture-before, which uses the Wayback CDX Server API to discover the capture URL we’re interested in. There are other, simpler-to-use Wayback Machine APIs, but they only let us find the closest capture to our date of interest; we want the most recent capture that’s strictly before our date, and that requires the CDX API. I’ve been a little lazy and used the request package to do the web request stuff, because I prefer its API to the native url-retrieve in vanilla Emacs.

(defvar wayback-cdx-endpoint "https://web.archive.org/cdx/search/cdx"
  "The endpoint for the Wayback Machine's CDX server.")

(defvar wayback-cdx-json-parser
  (apply-partially 'json-parse-buffer :array-type 'list)
  "Parser for json data returned from the CDX server.")

(defun wayback-get-capture-before (url date)
  "Use the CDX API to find the most recent capture of URL before DATE.
DATE may be a yyyymmddhhmmss timestamp (string or number) or an Emacs
time value.  Return nil if there's no such capture."
  (let ((capture-url nil))
    (request wayback-cdx-endpoint
      :params `((url . ,url)
                (to . ,(if (or (numberp date)
                               (stringp date))
                           date
                         (format-time-string "%Y%m%d%H%M%S" date)))
                (collapse . digest)
                (output . json)
                (fl . "timestamp,original")
                (limit . -1))
      :parser wayback-cdx-json-parser
      :sync t
      :success (cl-function
                (lambda (&key data &allow-other-keys)
                  (setq capture-url
                        (pcase (cadr data)
                          (`() nil)
                          (`(,timestamp ,target-url)
                           (s-lex-format "https://web.archive.org/web/${timestamp}/${target-url}")))))))
    capture-url))

The CDX API’s JSON response format is derived from a CSV-style text file. We’re only really interested in the timestamp and the “original” URL that our target URL resolved to, so we set (fl . "timestamp,original") in the request parameters and limit the results to the most recent capture ((limit . -1)) before ((to . ...)) the date we’re interested in. That gives us:

[["timestamp", "original"],
 ["20250212175727", "https://bofh.org.uk/2016/06/19/static-migration/"]

You can tell it comes from something CSV-like, can’t you?
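Because wayback-cdx-json-parser calls json-parse-buffer with :array-type 'list, what the :success callback sees is nested Lisp lists, header row still in first position:

(("timestamp" "original")
 ("20250212175727" "https://bofh.org.uk/2016/06/19/static-migration/"))

Hence the (cadr data): it skips straight to the capture row.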

We pull the interesting bits out of the parsed list with a pcase form

(pcase (cadr data)
  (`() nil)
  (`(,timestamp ,target-url)
   (s-lex-format ...)))

which turns an empty result into nil, or grabs the timestamp and target-url from the second entry in the results list (the first entry is the header row) and uses s-lex-format to generate a Wayback Machine URL. Easy.
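So, with the sample response above, timestamp binds to "20250212175727", target-url binds to the original URL, and wayback-get-capture-before hands back:

https://web.archive.org/web/20250212175727/https://bofh.org.uk/2016/06/19/static-migration/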

This has the makings of a more general package, but that’s very much a back burner project. It does what I need, and does it well enough that I can consider this yak shaved and get on with the job of recovering my truncated blog posts. I’ll continue on my way of not releasing anything that anyone might want me to support.
