Planet Scheme

Monday, June 5, 2023

GNU Guix

From development environments to continuous integration—the ultimate guide to software development with Guix

Guix is a handy tool for developers; guix shell, in particular, gives a standalone development environment for your package, no matter what language(s) it’s written in. To benefit from it, you first have to write a package definition and have it either in Guix proper, in a channel, or directly upstream as a guix.scm file. This last option is appealing: all developers have to do to get set up is clone the project's repository and run guix shell, with no arguments—we looked at the rationale for guix shell in an earlier article.

Development needs go beyond development environments though. How can developers perform continuous integration of their code in Guix build environments? How can they deliver their code straight to adventurous users? This post describes a set of files developers can add to their repository to set up Guix-based development environments, continuous integration, and continuous delivery—all at once.

Getting started

How do we go about “Guixifying” a repository? The first step, as we’ve seen, will be to add a guix.scm at the root of the repository in question. We’ll take Guile as an example in this post: it’s written in Scheme (mostly) and C, and has a number of dependencies—a C compilation tool chain, C libraries, Autoconf and its friends, LaTeX, and so on. The resulting guix.scm looks like the usual package definition, just without the define-public bit:

;; The ‘guix.scm’ file for Guile, for use by ‘guix shell’.

(use-modules (guix)
             (guix build-system gnu)
             ((guix licenses) #:prefix license:)
             (gnu packages autotools)
             (gnu packages base)
             (gnu packages bash)
             (gnu packages bdw-gc)
             (gnu packages compression)
             (gnu packages flex)
             (gnu packages gdb)
             (gnu packages gettext)
             (gnu packages gperf)
             (gnu packages libffi)
             (gnu packages libunistring)
             (gnu packages linux)
             (gnu packages pkg-config)
             (gnu packages readline)
             (gnu packages tex)
             (gnu packages texinfo)
             (gnu packages version-control))

(package
  (name "guile")
  (version "3.0.99-git")                          ;funky version number
  (source #f)                                     ;no source
  (build-system gnu-build-system)
  (native-inputs
   (append (list autoconf
                 automake
                 libtool
                 gnu-gettext
                 flex
                 texinfo
                 texlive-base                 ;for "make pdf"
                 texlive-epsf
                 gperf
                 git
                 gdb
                 strace
                 readline
                 lzip
                 pkg-config)

           ;; When cross-compiling, a native version of Guile itself is
           ;; needed.
           (if (%current-target-system)
               (list this-package)
               '())))
  (inputs
   (list libffi bash-minimal))
  (propagated-inputs
   (list libunistring libgc))

  (native-search-paths
   (list (search-path-specification
          (variable "GUILE_LOAD_PATH")
          (files '("share/guile/site/3.0")))
         (search-path-specification
          (variable "GUILE_LOAD_COMPILED_PATH")
          (files '("lib/guile/3.0/site-ccache")))))
  (synopsis "Scheme implementation intended especially for extensions")
  (description
   "Guile is the GNU Ubiquitous Intelligent Language for Extensions,
and it's actually a full-blown Scheme implementation!")
  (home-page "https://www.gnu.org/software/guile/")
  (license license:lgpl3+))

Quite a bit of boilerplate, but now someone who’d like to hack on Guile just needs to run:

guix shell

That gives them a shell containing all the dependencies of Guile: those listed above, but also implicit dependencies such as the GCC tool chain, GNU Make, sed, grep, and so on. The chef’s recommendation:

guix shell --container --link-profile

That gives a shell in an isolated container, and all the dependencies show up in $HOME/.guix-profile, which plays well with caches such as config.cache and absolute file names recorded in generated Makefiles and the like. The fact that the shell runs in a container brings peace of mind: nothing but the current directory and Guile’s dependencies is visible inside the container; nothing from the system can possibly interfere with your development.

Level 1: Building with Guix

Now that we have a package definition, why not also take advantage of it so we can build Guile with Guix? We had left the source field empty, because guix shell above only cares about the inputs of our package—so it can set up the development environment—not about the package itself.

To build the package with Guix, we’ll need to fill out the source field, along these lines:

(use-modules (guix)
             (guix git-download)  ;for ‘git-predicate’
             …)

(define vcs-file?
  ;; Return true if the given file is under version control.
  (or (git-predicate (current-source-directory))
      (const #t)))                                ;not in a Git checkout

(package
  (name "guile")
  (version "3.0.99-git")                          ;funky version number
  (source (local-file "." "guile-checkout"
                      #:recursive? #t
                      #:select? vcs-file?))
  …)

Here’s what we changed:

  1. We added (guix git-download) to our set of imported modules, so we can use its git-predicate procedure.
  2. We defined vcs-file? as a procedure that returns true when passed a file that is under version control. For good measure, we add a fallback case for when we’re not in a Git checkout: always return true.
  3. We set source to a local-file—a recursive copy of the current directory ("."), limited to files under version control (the #:select? bit).

From there on, our guix.scm file serves a second purpose: it lets us build the software with Guix. The whole point of building with Guix is that it’s a “clean” build—you can be sure nothing from your working tree or system interferes with the build result—and it lets you test a variety of things. First, you can do a plain native build:

guix build -f guix.scm

But you can also build for another system (possibly after setting up offloading or transparent emulation):

guix build -f guix.scm -s aarch64-linux -s riscv64-linux

… or cross-compile:

guix build -f guix.scm --target=x86_64-w64-mingw32

You can also use package transformation options to test package variants:

# What if we built with Clang instead of GCC?
guix build -f guix.scm \
  --with-c-toolchain=guile@3.0.99-git=clang-toolchain

# What about that under-tested configure flag?
guix build -f guix.scm \
  --with-configure-flag=guile@3.0.99-git=--disable-networking

Handy!

Level 2: The repository as a channel

We now have a Git repository containing (among other things) a package definition. Can’t we turn it into a channel? After all, channels are designed to ship package definitions to users, and that’s exactly what we’re doing with our guix.scm.

Turns out we can indeed turn it into a channel, but with one caveat: we must create a separate directory for the .scm file(s) of our channel so that guix pull doesn’t load unrelated .scm files when someone pulls the channel—and in Guile, there are lots of them! So we’ll start like this, keeping a top-level guix.scm symlink for the sake of guix shell:

mkdir -p .guix/modules
mv guix.scm .guix/modules/guile-package.scm
ln -s .guix/modules/guile-package.scm guix.scm

To make it usable as part of a channel, we need to turn our guix.scm file into a module: we do that by changing the use-modules form at the top to a define-module form. We also need to actually export a package variable, with define-public, while still returning the package value at the end of the file so we can still use guix shell and guix build -f guix.scm. The end result looks like this (not repeating things that haven’t changed):

(define-module (guile-package)
  #:use-module (guix)
  #:use-module (guix git-download)   ;for ‘git-predicate’
  …)

(define-public guile
  (package
    (name "guile")
    (version "3.0.99-git")                          ;funky version number
    …))

;; Return the package object defined above at the end of the module.
guile

We need one last thing: a .guix-channel file so Guix knows where to look for package modules in our repository:

;; This file lets us present this repo as a Guix channel.

(channel
  (version 0)
  (directory ".guix/modules"))  ;look for package modules under .guix/modules/

To recap, we now have these files:

.
├── .guix-channel
├── guix.scm → .guix/modules/guile-package.scm
└── .guix
    └── modules
       └── guile-package.scm

And that’s it: we have a channel! (We could do better and support channel authentication so users know they’re pulling genuine code. We’ll spare you the details here but it’s worth considering!) Users can pull from this channel by adding it to ~/.config/guix/channels.scm, along these lines:

(append (list (channel
                (name 'guile)
                (url "https://git.savannah.gnu.org/git/guile.git")
                (branch "main")))
        %default-channels)

After running guix pull, we can see the new package:

$ guix describe
Generation 264  May 26 2023 16:00:35    (current)
  guile 36fd2b4
    repository URL: https://git.savannah.gnu.org/git/guile.git
    branch: main
    commit: 36fd2b4920ae926c79b936c29e739e71a6dff2bc
  guix c5bc698
    repository URL: https://git.savannah.gnu.org/git/guix.git
    commit: c5bc698e8922d78ed85989985cc2ceb034de2f23
$ guix package -A ^guile$
guile   3.0.99-git      out,debug       guile-package.scm:51:4
guile   3.0.9           out,debug       gnu/packages/guile.scm:317:2
guile   2.2.7           out,debug       gnu/packages/guile.scm:258:2
guile   2.2.4           out,debug       gnu/packages/guile.scm:304:2
guile   2.0.14          out,debug       gnu/packages/guile.scm:148:2
guile   1.8.8           out             gnu/packages/guile.scm:77:2
$ guix build guile@3.0.99-git
[…]
/gnu/store/axnzbl89yz7ld78bmx72vpqp802dwsar-guile-3.0.99-git-debug
/gnu/store/r34gsij7f0glg2fbakcmmk0zn4v62s5w-guile-3.0.99-git

That’s how, as a developer, you get your software delivered directly into the hands of users! No intermediaries, yet no loss of transparency and provenance tracking.

With that in place, it also becomes trivial for anyone to create Docker images, Deb/RPM packages, or a plain tarball with guix pack:

# How about a Docker image of our Guile snapshot?
guix pack -f docker -S /bin=bin guile@3.0.99-git

# And a relocatable RPM?
guix pack -f rpm -R -S /bin=bin guile@3.0.99-git
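
The plain tarball (guix pack’s default output format) and Deb packages follow the same pattern; here is a sketch (adjust the -S mapping to taste):

# A relocatable tarball, the default output format:
guix pack -R -S /bin=bin guile@3.0.99-git

# And a Deb package:
guix pack -f deb -S /bin=bin guile@3.0.99-git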

Bonus: Package variants

We now have an actual channel, but it contains only one package. While we’re at it, we can define package variants in our guile-package.scm file, variants that we want to be able to test as Guile developers—similar to what we did above with transformation options. We can add them like so:

;; This is the ‘.guix/modules/guile-package.scm’ file.

(define-module (guile-package)
  …)

(define-public guile
  …)

(define (package-with-configure-flags p flags)
  "Return P with FLAGS as addition 'configure' flags."
  (package/inherit p
    (arguments
     (substitute-keyword-arguments (package-arguments p)
       ((#:configure-flags original-flags #~(list))
        #~(append #$original-flags #$flags))))))

(define-public guile-without-threads
  (package
    (inherit (package-with-configure-flags guile
                                           #~(list "--without-threads")))
    (name "guile-without-threads")))

(define-public guile-without-networking
  (package
    (inherit (package-with-configure-flags guile
                                           #~(list "--disable-networking")))
    (name "guile-without-networking")))


;; Return the package object defined above at the end of the module.
guile

We can build these variants as regular packages once we’ve pulled the channel. Alternatively, from a checkout of Guile, we can run a command like this one from the top level:

guix build -L $PWD/.guix/modules guile-without-threads

Level 3: Setting up continuous integration

This channel becomes even more interesting once we set up continuous integration (CI). There are several ways to do that.

You can use one of the mainstream continuous integration tools, such as GitLab-CI. To do that, you need to make sure you run jobs in a Docker image or virtual machine that has Guix installed. If we were to do that in the case of Guile, we’d have a job that runs a shell command like this one:

guix build -L $PWD/.guix/modules guile@3.0.99-git

Doing this works great and has the advantage of being easy to achieve on your favorite CI platform.

That said, you’ll really get the most out of it by using Cuirass, a CI tool designed for and tightly integrated with Guix. Using it is more work than using a hosted CI tool because you first need to set it up, but that setup phase is greatly simplified if you use its Guix System service. Going back to our example, we give Cuirass a spec file that goes like this:

;; Cuirass spec file to build all the packages of the ‘guile’ channel.
(list (specification
        (name "guile")
        (build '(channels guile))
        (channels
         (append (list (channel
                         (name 'guile)
                         (url "https://git.savannah.gnu.org/git/guile.git")
                         (branch "main")))
                 %default-channels))))

It differs from what you’d do with other CI tools in two important ways:

  • Cuirass knows it’s tracking two channels, guile and guix. Indeed, our own guile package depends on many packages provided by the guix channel—GCC, the GNU libc, libffi, and so on. Changes to packages from the guix channel can potentially influence our guile build and this is something we’d like to see as soon as possible as Guile developers.
  • Build results are not thrown away: they can be distributed as substitutes so that users of our guile channel transparently get pre-built binaries!

From a developer’s viewpoint, the end result is this status page listing evaluations: each evaluation is a combination of commits of the guix and guile channels providing a number of jobs—one job per package defined in guile-package.scm times the number of target architectures.

As for substitutes, they come for free! As an example, since our guile jobset is built on ci.guix.gnu.org, which runs guix publish in addition to Cuirass, one automatically gets substitutes for guile builds from ci.guix.gnu.org; no additional work is needed for that.
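
Once they have pulled the channel, users can quickly check substitute availability for our snapshot with guix weather, here with the substitute server spelled out explicitly:

guix weather --substitute-urls=https://ci.guix.gnu.org guile@3.0.99-git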

Bonus: Build manifest

The Cuirass spec above is convenient: it builds every package in our channel, which includes a few variants. However, this might be insufficiently expressive in some cases: one might want specific cross-compilation jobs, transformations, Docker images, RPM/Deb packages, or even system tests.

To achieve that, you can write a manifest. The one we have for Guile has entries for the package variants we defined above, as well as additional variants and cross builds:

;; This is ‘.guix/manifest.scm’.

(use-modules (guix)
             (guix profiles)
             (guile-package))   ;import our own package module

(define* (package->manifest-entry* package system
                                   #:key target)
  "Return a manifest entry for PACKAGE on SYSTEM, optionally cross-compiled to
TARGET."
  (manifest-entry
    (inherit (package->manifest-entry package))
    (name (string-append (package-name package) "." system
                         (if target
                             (string-append "." target)
                             "")))
    (item (with-parameters ((%current-system system)
                            (%current-target-system target))
            package))))

(define native-builds
  (manifest
   (append (map (lambda (system)
                  (package->manifest-entry* guile system))

                '("x86_64-linux" "i686-linux"
                  "aarch64-linux" "armhf-linux"
                  "powerpc64le-linux"))
           (map (lambda (guile)
                  (package->manifest-entry* guile "x86_64-linux"))
                (cons (package
                        (inherit (package-with-c-toolchain
                                  guile
                                  `(("clang-toolchain"
                                     ,(specification->package
                                       "clang-toolchain")))))
                        (name "guile-clang"))
                      (list guile-without-threads
                            guile-without-networking
                            guile-debug
                            guile-strict-typing))))))

(define cross-builds
  (manifest
   (map (lambda (target)
          (package->manifest-entry* guile "x86_64-linux"
                                    #:target target))
        '("i586-pc-gnu"
          "aarch64-linux-gnu"
          "riscv64-linux-gnu"
          "i686-w64-mingw32"
          "x86_64-linux-gnu"))))

(concatenate-manifests (list native-builds cross-builds))

We won’t go into the details of this manifest; suffice to say that it provides additional flexibility. We now need to tell Cuirass to build this manifest, which is done with a spec slightly different from the previous one:

;; Cuirass spec file to build all the packages of the ‘guile’ channel.
(list (specification
        (name "guile")
        (build '(manifest ".guix/manifest.scm"))
        (channels
         (append (list (channel
                         (name 'guile)
                         (url "https://git.savannah.gnu.org/git/guile.git")
                         (branch "main")))
                 %default-channels))))

We changed the (build …) part of the spec to '(manifest ".guix/manifest.scm") so that it would pick our manifest, and that’s it!

Wrapping up

We picked Guile as the running example in this post; the files described above now live in the Guile repository.

These days, repositories are commonly peppered with dot files for various tools: .envrc, .gitlab-ci.yml, .github/workflows, Dockerfile, .buildpacks, Aptfile, requirements.txt, and whatnot. It may sound like we’re proposing a bunch of additional files, but in fact those files are expressive enough to supersede most or all of those listed above.

With a couple of files, we get support for:

  • development environments (guix shell);
  • pristine test builds, including for package variants and for cross-compilation (guix build);
  • continuous integration (with Cuirass or with some other tool);
  • continuous delivery to users (via the channel and with pre-built binaries);
  • generation of derivative build artifacts such as Docker images or Deb/RPM packages (guix pack).

At the Guix headquarters, we’re quite happy about the result. We’ve been building a unified tool set for reproducible software deployment; this is an illustration of how you as a developer can benefit from it!

Acknowledgments

Thanks to Attila Lendvai, Brian Cully, and Ricardo Wurmus for providing feedback on an earlier draft of this post.

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the Hurd or the Linux kernel, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, AArch64 and POWER9 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Ludovic Courtès at Monday, June 5, 2023

Tuesday, May 30, 2023

Joe Marshall

Raymarching in Lisp

It turns out there’s a nice functional variation of raytracing called raymarching. The algorithms involved are simple and elegant. The learning curve is shallow and you can generate great looking images without hairy trig or linear algebra.

We’ll follow the example of Georges Seurat and simply compute the color independently for each of myriads of pixels. This is efficiently done in parallel in real time on a GPU, but then you have to use shader language and I want to use Lisp. It is insanely inefficient to do this serially on the CPU in Lisp, but still fast enough to render an image in a couple of seconds.

Imagine you have some scene you want to render. There is a volume of 3-dimensional space the scene occupies. Now imagine we know for every point in 3-dimensional space how far away that point is from the nearest surface. This is a scalar value that can be assigned to any point. It is zero for every point that lies on a surface, positive for points above surfaces, and negative for points below. This is the SDF (Signed Distance Field). The SDF is all we need to know to generate a raytraced image of the scene.

We’ll use the SDF to feel our way through the scene. We’ll start at the tip of the ray we’re tracing. We don’t know where the surface is, but if we consult the SDF, we can determine a distance we can safely extend the ray without hitting any surface. From this new point, we can recur, again stepping along no further than the SDF at this new location permits. One of two things will happen: we either step forever or we converge on a surface.

(defun raymarch (sdf origin direction)
  (let iter ((distance 0)
             (count 0))
    (let* ((position (+ origin (* direction distance)))
           (free-path (funcall sdf position)))
      (if (< free-path +min-distance+)
          position  ;; a hit, a very palpable hit
          (unless (or (> count +max-raymarch-iterations+)
                      (> free-path +max-distance+))
            (iter (+ distance free-path) (+ count 1)))))))

To convert an SDF to a Seurat function, we trace an imaginary ray from our eye, through the screen, and into the scene. The ray origin is at your eye, and we’ll say that is about 3 units in front of the window. The ray will travel 3 units to the screen and hit the window at point (i,j), so the ray direction is (normalize (vector i j 3)). We march along the ray to find if we hit a surface. If we did, we compute the amount of light the camera sees using the Lambert shading model.

(defun sdf->seurat (sdf)
  (let ((eye-position (vector 0 0 -4))
        (light-direction (normalize (vector 20 40 -30))))
    (lambda (i j)
      (let* ((ray-direction (normalize (vector i j 3)))
             (hit (raymarch sdf eye-position ray-direction)))
        (if hit
            (* #(0 1 0) (lambert sdf hit light-direction))
            (vector 0 0 0))))))

Lambert shading is proportional to the cosine of the angle between the surface normal and the light falling on it, so we take the dot product of the light direction with the normal to the surface at the point the light hits it. If we know the SDF, we can approximate the normal vector at a point by probing the SDF nearby the point and seeing how it changes.

(defun lambert (sdf hit light-direction)
  (dot (pseudonormal sdf hit) light-direction))

(defun pseudonormal (sdf position)
  (let ((o (funcall sdf position))
        (dsdx (funcall sdf (+ #(0.001 0 0) position)))
        (dsdy (funcall sdf (+ #(0 0.001 0) position)))
        (dsdz (funcall sdf (+ #(0 0 0.001) position))))
      (normalize (vector (- dsdx o) (- dsdy o) (- dsdz o)))))

These are all you need to generate good looking 3-d images from a SDF. Now the SDFs for primitive geometric shapes are pretty simple. Here is the SDF for a sphere.

(defun sdf-sphere (position radius)
  (lambda (vector)
    (- (length (- vector position)) radius)))

and the SDF for the ground plane

(defun sdf-ground (h)
  (lambda (vector)
    (+ (svref vector 1) h)))

Given the SDF for two objects, you can use higher order functions to compose them into a scene. Taking the minimum of two SDFs will give you the union of the shapes. Taking the maximum will give you the intersection of two shapes. Other higher order functions on SDFs can blend two SDFs. This has the effect of morphing the shapes together in the image.
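
Here is what two of those combinators might look like, written in the same style as the snippets above (the names sdf-union and sdf-intersection are mine, not from the code above):

(defun sdf-union (sdf-a sdf-b)
  ;; Union: a point's free distance is the distance to the nearer shape.
  (lambda (vector)
    (min (funcall sdf-a vector) (funcall sdf-b vector))))

(defun sdf-intersection (sdf-a sdf-b)
  ;; Intersection: take the larger of the two distances.
  (lambda (vector)
    (max (funcall sdf-a vector) (funcall sdf-b vector))))

For example, (sdf-union (sdf-sphere #(0 1 0) 1) (sdf-ground 0)) is the SDF of a unit sphere resting on the ground plane, and it can be handed straight to sdf->seurat.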

I like this approach to raytracing because the arithmetic is straightforward and obvious. You only need the simplest of vector arithmetic, and you don’t need linear algebra or matrix math to get started (although you’ll want projection matrices later on when you want to move your camera around). I’m more comfortable with recursive functions than 3x3 matrices.

This approach to raytracing is best done on a graphics card. These algorithms are pretty straightforward to code up in shader language, but shader language is fairly primitive and doesn’t have higher order functions or closures. Code written in shader language has to be converted to not use closures and HOFs.

by Joe Marshall (noreply@blogger.com) at Tuesday, May 30, 2023

Monday, May 22, 2023

Phil Dawes

Gambit-C namespaces

One of the first issues I had when evaluating scheme as a possible replacement for Python as my hobby language was its apparent lack of a module/library/namespace system. How do people possibly build big programs? I wondered.

Now it turns out most (all?) scheme implementations have features to deal with modules and libraries. Gambit's is particularly nebulous in that it doesn't appear to be documented anywhere. Anyway, here's how it appears to work. I'm sure somebody will correct me if I've got something wrong:

Gambit has the 'namespace' primitive, with which you can declare that certain definitions belong in certain namespaces. Here's an example:


> (namespace ("f#" foo) ("b#" bar baz))

This means (AFAICS): "any further use/definition of the term 'foo' will reference the f# namespace and any use of bar/baz will reference the b# namespace".

e.g.


> (namespace ("f#" foo) ("b#" bar baz))

> (define (foo) "I am foo")  ; defines f#foo

> (foo)
"I am foo"

> (f#foo)
"I am foo"

> (define (bar) "I am bar")

> (b#bar)
"I am bar"

This is cool because it allows you to retroactively fit namespaces to scheme code loaded from other files. E.g. If mod1.scm and mod2.scm both defined a procedure 'foo', you could use namespaces to allow both to be used in the same environment thus:


> (namespace ("m1#" foo))
> (load "mod1")    ; contains: (define (foo) "I am mod1's foo")

> (namespace ("m2#" foo))
> (load "mod2")    ; contains: (define (foo) "I am mod2's foo")

> (m1#foo)
"I am mod1's foo"

> (m2#foo)
"I am mod2's foo"

Job done. Now I haven't really used gambit namespaces much, so I'm not in a position to provide a good comparison with other approaches; however, the feature does seem in keeping with the rest of the scheme language. By that I mean rather than a large set of fully blown rich language features you get a small bunch of simple but very extensible primitives with which to build your own language.

A good example of building a big system over these small primitives is Christian Jaeger's chjmodule library where he has used namespaces along with 'load' and 'compile-file' (and of course macros) to build a more industrial strength module system. This includes an 'import' keyword that loads from a search path and a procedure to recursively compile and import modules. Some example code from the README:



$ gsc -i -e '(load "chjmodule")' -

> (import 'srfi-1)
> fold
#
> (fold + 0 '(1 2 3))
6
> (build-recursively/import 'regex-case)
            ; recompiles regex.scm (a dependency) if necessary,
            ; then (re)compiles regex-case.scm if necessary and imports it.
> (regex-case "http://www.xxx.yy" ("http://(.+)" (_ url) url) (else url))
"www.xxx.yy"
> *module-search-path* 
("."
 "mod"
 "gambit"
 "~/gambit"
 "~~"
 "..")

Sweet. I'm guessing it'll also be possible to build the r6rs library syntax in gambit scheme the same way.

Monday, May 22, 2023

Saturday, May 20, 2023

Andy Wingo

approaching cps soup

Good evening, hackers. Today's missive is more of a massive, in the sense that it's another presentation transcript-alike; these things always translate to many vertical pixels.

In my defense, I hardly ever give a presentation twice, so not only do I miss out on the usual per-presentation cost amortization and on the incremental improvements of repetition, the more dire error is that whatever message I might have can only ever reach a subset of those that it might interest; here at least I can be more or less sure that if the presentation would interest someone, that they will find it.

So for the time being I will try to share presentations here, in the spirit of, well, why the hell not.

CPS Soup

A functional intermediate language

10 May 2023 – Spritely

Andy Wingo

Igalia, S.L.

Last week I gave a training talk to Spritely Institute collaborators on the intermediate representation used by Guile's compiler.

CPS Soup

Compiler: Front-end to Middle-end to Back-end

Middle-end spans gap between high-level source code (AST) and low-level machine code

Programs in middle-end expressed in intermediate language

CPS Soup is the language of Guile’s middle-end

An intermediate representation (IR) (or intermediate language, IL) is just another way to express a computer program. Specifically it's the kind of language that is appropriate for the middle-end of a compiler, and by "appropriate" I meant that an IR serves a purpose: there has to be a straightforward transformation to the IR from high-level abstract syntax trees (ASTs) from the front-end, and there has to be a straightforward translation from IR to machine code.

There are also usually a set of necessary source-to-source transformations on IR to "lower" it, meaning to make it closer to the back-end than to the front-end. There are usually a set of optional transformations to the IR to make the program run faster or allocate less memory or be more simple: these are the optimizations.

"CPS soup" is Guile's IR. This talk presents the essentials of CPS soup in the context of more traditional IRs.

How to lower?

High-level:

(+ 1 (if x 42 69))

Low-level:

  cmpi $x, #f
  je L1
  movi $t, 42
  j L2  
L1:
  movi $t, 69
L2:
  addi $t, 1

How to get from here to there?

Before we dive in, consider what we might call the dynamic range of an intermediate representation: we start with what is usually an algebraic formulation of a program and we need to get down to a specific sequence of instructions operating on registers (unlimited in number, at this stage; allocating to a fixed set of registers is a back-end concern), with explicit control flow between them. What kind of a language might be good for this? Let's attempt to answer the question by looking into what the standard solutions are for this problem domain.

1970s

Control-flow graph (CFG)

graph := array<block>
block := tuple<preds, succs, insts>
inst  := goto B
       | if x then BT else BF
       | z = const C
       | z = add x, y
       ...

BB0: if x then BB1 else BB2
BB1: t = const 42; goto BB3
BB2: t = const 69; goto BB3
BB3: t2 = addi t, 1; ret t2

Assignment, not definition

Of course in the early days, there was no intermediate language; compilers translated ASTs directly to machine code. It's been a while since I dove into all this but the milestone I have in my head is that it's the 70s when compiler middle-ends come into their own right, with Fran Allen's work on flow analysis and optimization.

In those days the intermediate representation for a compiler was a graph of basic blocks, but unlike today the paradigm was assignment to locations rather than definition of values. By that I mean that in our example program, we get t assigned to in two places (BB1 and BB2); the actual definition of t is implicit, as a storage location, and our graph consists of assignments to the set of storage locations in the program.

1980s

Static single assignment (SSA) CFG

graph := array<block>
block := tuple<preds, succs, phis, insts>
phi   := z := φ(x, y, ...)
inst  := z := const C
       | z := add x, y
       ...
BB0: if x then BB1 else BB2
BB1: v0 := const 42; goto BB3
BB2: v1 := const 69; goto BB3
BB3: v2 := φ(v0,v1); v3 := addi v2, 1; ret v3

Phi is phony function: v2 is v0 if coming from first predecessor, or v1 from second predecessor

These days we still live in Fran Allen's world, but with a twist: we no longer model programs as graphs of assignments, but rather graphs of definitions. The introduction in the mid-80s of so-called "static single-assignment" (SSA) form graphs mean that instead of having two assignments to t, we would define two different values v0 and v1. Then later instead of reading the value of the storage location associated with t, we define v2 to be either v0 or v1: the former if we reach the use of t in BB3 from BB1, the latter if we are coming from BB2.

If you think on the machine level, in terms of what the resulting machine code will be, this either function isn't a real operation; probably register allocation will put v0, v1, and v2 in the same place, say $rax. The function linking the definition of v2 to the inputs v0 and v1 is purely notational; in a way, you could say that it is phony, or not real. But when the creators of SSA went to submit this notation for publication they knew that they would need something that sounded more rigorous than "phony function", so they instead called it a "phi" (φ) function. Really.

2003: MLton

Refinement: phi variables are basic block args

graph := array<block>
block := tuple<preds, succs, args, insts>

Inputs of phis implicitly computed from preds

BB0(a0): if a0 then BB1() else BB2()
BB1(): v0 := const 42; BB3(v0)
BB2(): v1 := const 69; BB3(v1)
BB3(v2): v3 := addi v2, 1; ret v3

SSA is still where it's at, as a conventional solution to the IR problem. There have been some refinements, though. I learned of one of them from MLton; I don't know if they were first but they had the idea of interpreting phi variables as arguments to basic blocks. In this formulation, you don't have explicit phi instructions; rather the "v2 is either v1 or v0" property is expressed by v2 being a parameter of a block which is "called" with either v0 or v1 as an argument. It's the same semantics, but an interesting notational change.

Refinement: Control tail

Often nice to know how a block ends (e.g. to compute phi input vars)

graph := array<block>
block := tuple<preds, succs, args, insts,
               control>
control := if v then L1 else L2
         | L(v, ...)
         | switch(v, L1, L2, ...)
         | ret v

One other refinement to SSA is to note that basic blocks consist of some number of instructions that can define values or have side effects but which otherwise exhibit fall-through control flow, followed by a single instruction that transfers control to another block. We might as well store that control instruction separately; this would let us easily know how a block ends, and in the case of phi block arguments, easily say what values are the inputs of a phi variable. So let's do that.

Refinement: DRY

Block successors directly computable from control

Predecessors graph is inverse of successors graph

graph := array<block>
block := tuple<args, insts, control>

Can we simplify further?

At this point we notice that we are repeating ourselves; the successors of a block can be computed directly from the block's terminal control instruction. Let's drop those as a distinct part of a block, because when you transform a program it's unpleasant to have to needlessly update something in two places.

While we're doing that, we note that the predecessors array is also redundant, as it can be computed from the graph of block successors. Here we start to wonder: am I simplifying or am I removing something that is fundamental to the algorithmic complexity of the various graph transformations that I need to do? We press on, though, hoping we will get somewhere interesting.

Basic blocks are annoying

Ceremony about managing insts; array or doubly-linked list?

Nonuniformity: “local” vs. “global” transformations

Optimizations transform graph A to graph B; mutability complicates this task

  • Desire to keep A in mind while making B
  • Bugs because of spooky action at a distance

Recall that the context for this meander is Guile's compiler, which is written in Scheme. Scheme doesn't have expandable arrays built-in. You can build them, of course, but it is annoying. Also, in Scheme-land, functions with side-effects are conventionally suffixed with an exclamation mark; after too many of them, both the writer and the reader get fatigued. I know it's a silly argument but it's one of the things that made me grumpy about basic blocks.

If you permit me to continue with this introspection, I find there is an uneasy relationship between instructions and locations in an IR that is structured around basic blocks. Do instructions live in a function-level array and a basic block is an array of instruction indices? How do you get from instruction to basic block? How would you hoist an instruction to another basic block, might you need to reallocate the block itself?

And when you go to transform a graph of blocks... well how do you do that? Is it in-place? That would be efficient; but what if you need to refer to the original program during the transformation? Might you risk reading a stale graph?

It seems to me that there are too many concepts, that in the same way that SSA itself moved away from assignment to a more declarative language, that perhaps there is something else here that might be more appropriate to the task of a middle-end.

Basic blocks, phi vars redundant

Blocks: label with args sufficient; “containing” multiple instructions is superfluous

Unify the two ways of naming values: every var is a phi

graph := array<block>
block := tuple<args, inst>
inst  := L(expr)
       | if v then L1() else L2()
       ...
expr  := const C
       | add x, y
       ...

I took a number of tacks here, but the one I ended up on was to declare that basic blocks themselves are redundant. Instead of containing an array of instructions with fallthrough control-flow, why not just make every instruction a control instruction? (Yes, there are arguments against this, but do come along for the ride, we get to a funny place.)

While you are doing that, you might as well unify the two ways in which values are named in a MLton-style compiler: instead of distinguishing between basic block arguments and values defined within a basic block, we might as well make all names into basic block arguments.

Arrays annoying

Array of blocks implicitly associates a label with each block

Optimizations add and remove blocks; annoying to have dead array entries

Keep labels as small integers, but use a map instead of an array

graph := map<label, block>

In the traditional SSA CFG IR, a graph transformation would often not touch the structure of the graph of blocks. But now having given each instruction its own basic block, we find that transformations of the program necessarily change the graph. Consider an instruction that we elide; before, we would just remove it from its basic block, or replace it with a no-op. Now, we have to find its predecessor(s), and forward them to the instruction's successor. It would be useful to have a more capable data structure to represent this graph. We might as well keep labels as being small integers, but allow for sparse maps and growth by using an integer-specialized map instead of an array.

This is CPS soup

graph := map<label, cont>
cont  := tuple<args, term>
term  := continue to L
           with values from expr
       | if v then L1() else L2()
       ...
expr  := const C
       | add x, y
       ...

SSA is CPS

This is exactly what CPS soup is! We came at it "from below", so to speak; instead of the heady fumes of the lambda calculus, we get here from down-to-earth basic blocks. (If you prefer the other way around, you might enjoy this article from a long time ago.) The remainder of this presentation goes deeper into what it is like to work with CPS soup in practice.

Scope and dominators

BB0(a0): if a0 then BB1() else BB2()
BB1(): v0 := const 42; BB3(v0)
BB2(): v1 := const 69; BB3(v1)
BB3(v2): v3 := addi v2, 1; ret v3

What vars are “in scope” at BB3? a0 and v2.

Not v0; not all paths from BB0 to BB3 define v0.

a0 always defined: its definition dominates all uses.

BB0 dominates BB3: All paths to BB3 go through BB0.

Before moving on, though, we should discuss what it means in an SSA-style IR that variables are defined rather than assigned. If you consider variables as locations to which values can be assigned and which initially hold garbage, you can read them at any point in your program. You might get garbage, though, if the variable wasn't assigned something sensible on the path that led to reading the location's value. It sounds bonkers but it is still the C and C++ semantic model.

If we switch instead to a definition-oriented IR, then a variable never has garbage; the single definition always precedes any uses of the variable. That is to say that all paths from the function entry to the use of a variable must pass through the variable's definition, or, in the jargon, that definitions dominate uses. This is an invariant of an SSA-style IR, that all variable uses be dominated by their associated definition.

You can flip the question around to ask what variables are available for use at a given program point, which might be read equivalently as which variables are in scope; the answer is, all definitions from all program points that dominate the use site. The "CPS" in "CPS soup" stands for continuation-passing style, a dialect of the lambda calculus, which also has a history of use as a compiler intermediate representation. But it turns out that if we use the lambda calculus in its conventional form, we end up needing to maintain a lexical scope nesting at the same time that we maintain the control-flow graph, and the lexical scope tree can fail to reflect the dominator tree. I go into this topic in more detail in an old article, and if it interests you, please do go deep.

CPS soup in Guile

Compilation unit is intmap of label to cont

cont := $kargs names vars term
      | ...
term := $continue k src expr
      | ...
expr := $const C
      | $primcall ’add #f (a b)
      | ...

Conventionally, entry point is lowest-numbered label

Anyway! In Guile, the concrete form that CPS soup takes is that a program is an intmap of label to cont. A cont is the smallest labellable unit of code. You can call them blocks if that makes you feel better. One kind of cont, $kargs, binds incoming values to variables. It has a list of variables, vars, and also has an associated list of human-readable names, names, for debugging purposes.

A $kargs contains a term, which is like a control instruction. One kind of term is $continue, which passes control to a continuation k. Using our earlier language, this is just goto *k*, with values, as in MLton. (The src is a source location for the term.) The values come from the term's expr, of which there are a dozen kinds or so, for example $const which passes a literal constant, or $primcall, which invokes some kind of primitive operation, which above is add. The primcall may have an immediate operand, in this case #f, and some variables that it uses, in this case a and b. The number and type of the produced values is a property of the primcall; some are just for effect, some produce one value, some more.

CPS soup

term := $continue k src expr
      | $branch kf kt src op param args
      | $switch kf kt* src arg
      | $prompt k kh src escape? tag
      | $throw src op param args

Expressions can have effects, produce values

expr := $const val
      | $primcall name param args
      | $values args
      | $call proc args
      | ...

There are other kinds of terms besides $continue: there is $branch, which proceeds either to the false continuation kf or the true continuation kt depending on the result of performing op on the variables args, with immediate operand param. In our running example, we might have made the initial term via:

(build-term
  ($branch BB1 BB2 'false? #f (a0)))

The definition of build-term (and build-cont and build-exp) is in the (language cps) module.

There is also $switch, which takes an unboxed unsigned integer arg and performs an array dispatch to the continuations in the list kt, or kf otherwise.

There is $prompt which continues to its k, having pushed on a new continuation delimiter associated with the var tag; if code aborts to tag before the prompt exits via an unwind primcall, the stack will be unwound and control passed to the handler continuation kh. If escape? is true, the continuation is escape-only and aborting to the prompt doesn't need to capture the suspended continuation.

Finally there is $throw, which doesn't continue at all, because it causes a non-resumable exception to be thrown. And that's it; it's just a handful of kinds of term, determined by the different shapes of control-flow (how many continuations the term has).

When it comes to values, we have about a dozen expression kinds. We saw $const and $primcall, but I want to explicitly mention $values, which simply passes on some number of values. Often a $values expression corresponds to passing an input to a phi variable, though $kargs vars can get their definitions from any expression that produces the right number of values.

Kinds of continuations

Guile functions untyped, can return multiple values

Error if too few values, possibly truncate too many values, possibly cons as rest arg...

Calling convention: contract between val producer & consumer

  • both on call and return side

Continuation of $call unlike that of $const

When a $continue term continues to a $kargs with a $const 42 expression, there are a number of invariants that the compiler can ensure: that the $kargs continuation is always passed the expected number of values, that the vars that it binds can be allocated to specific locations (e.g. registers), and that because all predecessors of the $kargs are known, that those predecessors can place their values directly into the variable's storage locations. Effectively, the compiler determines a custom calling convention between each $kargs and its predecessors.

Consider the $call expression, though; in general you don't know what the callee will do to produce its values. You don't even generally know that it will produce the right number of values. Therefore $call can't (in general) continue to $kargs; instead it continues to $kreceive, which expects the return values in well-known places. $kreceive will check that it is getting the right number of values and then continue to a $kargs, shuffling those values into place. A standard calling convention defines how functions return values to callers.

The conts

cont := $kfun src meta self ktail kentry
      | $kclause arity kbody kalternate
      | $kargs names syms term
      | $kreceive arity kbody
      | $ktail

$kclause, $kreceive very similar

Continue to $ktail: return

$call and return (and $throw, $prompt) exit first-order flow graph

Of course, a $call expression could be a tail-call, in which case it would continue instead to $ktail, indicating an exit from the first-order function-local control-flow graph.

The calling convention also specifies how to pass arguments to callees, and likewise those continuations have a fixed calling convention; in Guile we start functions with $kfun, which has some metadata attached, and then proceed to $kclause which bridges the boundary between the standard calling convention and the specialized graph of $kargs continuations. (Many details of this could be tweaked, for example that the case-lambda dispatch built-in to $kclause could instead dispatch to distinct functions instead of to different places in the same function; historical accidents abound.)

As a detail, if a function is well-known, in that all its callers are known, then we can lighten the calling convention, moving the argument-count check to callees. In that case $kfun continues directly to $kargs. Similarly for return values, optimizations can make $call continue to $kargs, though there is still some value-shuffling to do.

High and low

CPS bridges AST (Tree-IL) and target code

High-level: vars in outer functions in scope

Closure conversion between high and low

Low-level: Explicit closure representations; access free vars through closure

CPS soup is the bridge between parsed Scheme and machine code. It starts out quite high-level, notably allowing for nested scope, in which expressions can directly refer to free variables. Variables are small integers, and for high-level CPS, variable indices have to be unique across all functions in a program. CPS gets lowered via closure conversion, which chooses specific representations for each closure that remains after optimization. After closure conversion, all variable access is local to the function; free variables are accessed via explicit loads from a function's closure.

Optimizations at all levels

Optimizations before and after lowering

Some exprs only present in one level

Some high-level optimizations can merge functions (higher-order to first-order)

Because of the broad remit of CPS, the language itself has two dialects, high and low. The high level dialect has cross-function variable references, first-class abstract functions (whose representation hasn't been chosen), and recursive function binding. The low-level dialect has only specific ways to refer to functions: labels and specific closure representations. It also includes calls to function labels instead of just function values. But these are minor variations; some optimization and transformation passes can work on either dialect.

Practicalities

Intmap, intset: Clojure-style persistent functional data structures

Program: intmap<label,cont>

Optimization: program→program

Identify functions: (program,label)→intset<label>

Edges: intmap<label,intset<label>>

Compute succs: (program,label)→edges

Compute preds: edges→edges

I mentioned that programs were intmaps, and specifically in Guile they are Clojure/Bagwell-style persistent functional data structures. By functional I mean that intmaps (and intsets) are values that can't be mutated in place (though we do have the transient optimization).

I find that immutability has the effect of deploying a sense of calm to the compiler hacker -- I don't need to worry about data structures changing out from under me; instead I just structure all the transformations that you need to do as functions. An optimization is just a function that takes an intmap and produces another intmap. An analysis associating some data with each program label is just a function that computes an intmap, given a program; that analysis will never be invalidated by subsequent transformations, because the program to which it applies will never be mutated.

This pervasive feeling of calm allows me to tackle problems that I wouldn't have otherwise been able to fit into my head. One example is the novel online CSE pass; one day I'll either wrap that up as a paper or just capitulate and blog it instead.
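
To make the shape of such a pass concrete, here is a toy sketch; it assumes intmap-fold, intmap-add, and empty-intmap from Guile’s (language cps intmap) module, and a hypothetical rewrite-cont procedure that does the actual per-cont work:

(use-modules (language cps intmap))

(define (optimize program)
  ;; Fold over the old program, building a fresh label->cont intmap;
  ;; the input intmap is never mutated, so analyses computed over it
  ;; remain valid even after the pass runs.
  (intmap-fold (lambda (label cont out)
                 (intmap-add out label (rewrite-cont cont)))
               program
               empty-intmap))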

Flow analysis

A[k] = meet(A[p] for p in preds[k])
         - kill[k] + gen[k]

Compute available values at labels:

  • A: intmap<label,intset<val>>
  • meet: intmap-intersect<intset-intersect>
  • -, +: intset-subtract, intset-union
  • kill[k]: values invalidated by cont because of side effects
  • gen[k]: values defined at k

But to keep it concrete, let's take the example of flow analysis. For example, you might want to compute "available values" at a given label: these are the values that are candidates for common subexpression elimination. For example if a term is dominated by a car x primcall whose value is bound to v, and there is no path from the definition of v to a subsequent car x primcall, we can replace that second duplicate operation with $values (v) instead.

There is a standard solution for this problem, which is to solve the flow equation above. I wrote about this at length ages ago, but looking back on it, the thing that pleases me is how easy it is to decompose the task of flow analysis into manageable parts, and how the types tell you exactly what you need to do. It's easy to compute an initial analysis A, easy to define your meet function when your maps and sets have built-in intersect and union operators, easy to define what addition and subtraction mean over sets, and so on.

Persistent data structures FTW

  • meet: intmap-intersect<intset-intersect>
  • -, +: intset-subtract, intset-union

Naïve: O(nconts * nvals)

Structure-sharing: O(nconts * log(nvals))

Computing an analysis isn't free, but it is manageable in cost: the structure-sharing means that meet is usually trivial (for fallthrough control flow) and the cost of + and - is proportional to the log of the problem size.

CPS soup: strengths

Relatively uniform, orthogonal

Facilitates functional transformations and analyses, lowering mental load: “I just have to write a function from foo to bar; I can do that”

Encourages global optimizations

Some kinds of bugs prevented by construction (unintended shared mutable state)

We get the SSA optimization literature

Well, we're getting to the end here, and I want to take a step back. Guile has used CPS soup as its middle-end IR for about 8 years now, enough time to appreciate its fine points while also understanding its weaknesses.

On the plus side, it has what to me is a kind of low cognitive overhead, and I say that not just because I came up with it: Guile's development team is small and not particularly well-resourced, and we can't afford complicated things. The simplicity of CPS soup works well for our development process (flawed though that process may be!).

I also like how by having every variable be potentially a phi, that any optimization that we implement will be global (i.e. not local to a basic block) by default.

Perhaps best of all, we get these benefits while also being able to use the existing SSA transformation literature. Because CPS is SSA, the lessons learned in SSA (e.g. loop peeling) apply directly.

CPS soup: weaknesses

Pointer-chasing, indirection through intmaps

Heavier than basic blocks: more control-flow edges

Names bound at continuation only; phi predecessors share a name

Over-linearizes control, relative to sea-of-nodes

Overhead of re-computation of analyses

CPS soup is not without its drawbacks, though. It's not suitable for JIT compilers, because it imposes some significant constant-factor (and sometimes algorithmic) overheads. You are always indirecting through intmaps and intsets, and these data structures involve significant pointer-chasing.

Also, there are some forms of lightweight flow analysis that can be performed naturally on a graph of basic blocks without looking too much at the contents of the blocks; for example in our available variables analysis you could run it over blocks instead of individual instructions. In these cases, basic blocks themselves are an optimization, as they can reduce the size of the problem space, with corresponding reductions in time and memory use for analyses and transformations. Of course you could overlay a basic block graph on top of CPS soup, but it's not a well-worn path.

There is a little detail that not all phi predecessor values have names, since names are bound at successors (continuations). But this is a detail; if these names are important, little $values trampolines can be inserted.

Probably the main drawback as an IR is that the graph of conts in CPS soup over-linearizes the program. There are other intermediate representations that don't encode ordering constraints where there are none; perhaps it would be useful to marry CPS soup with sea-of-nodes, at least during some transformations.

Finally, CPS soup does not encourage a style of programming where an analysis is incrementally kept up to date as a program is transformed in small ways. The result is that we end up performing much redundant computation within each individual optimization pass.

Recap

CPS soup is SSA, distilled

Labels and vars are small integers

Programs map labels to conts

Conts are the smallest labellable unit of code

Conts can have terms that continue to other conts

Compilation simplifies and lowers programs

Wasm vs VM backend: a question for another day :)

But all in all, CPS soup has been good for Guile. It's just SSA by another name, in a simpler form, with a functional flavor. Or, it's just CPS, but first-order only, without lambda.

In the near future, I am interested in seeing what a new GC will do for CPS soup; will bump-pointer allocation palliate some of the costs of pointer-chasing? We'll see. A tricky thing about CPS soup is that I don't think that anyone else has tried it in other languages, so it's hard to objectively understand its characteristics independent of Guile itself.

Finally, it would be nice to engage in the academic conversation by publishing a paper somewhere; I would like to see interesting criticism, and blog posts don't really participate in the citation graph. But in the limited time available to me, faced with the choice between hacking on something and writing a paper, it's always been hacking, so far :)

Speaking of limited time, I probably need to hit publish on this one and move on. Happy hacking to all, and until next time.

by Andy Wingo at Saturday, May 20, 2023

Thursday, May 18, 2023

Göran Weinholt

Fuzzing Scheme with AFL++

The comments on this blog are now back from their GDPR-induced coma. I’m using a custom comment system powered by HTMX and a backend built on Loko Scheme. While writing the backend, one thing led to another and I wanted to see if my HTTP message parser could crash. This is when I discovered that the AFL support in Loko Scheme had suffered bit rot. I have repaired it now and wanted to demonstrate how to fuzz Scheme code with Loko Scheme and AFL++.

AFL++ is a fuzzer based on the original American Fuzzy Lop (AFL) by Michał “lcamtuf” Zalewski. Loko Scheme previously had support for AFL but it inadvertently stopped working back when the .data segment was made read-only. The fuzzer support is now repaired and accessible with a command line flag.

I did find one bug in the HTTP message parser, but it was not very interesting. More interesting were the problems that I found in laesare, my Scheme reader library. AFL++ found a way to crash it and to make it hang. Oops!

Steps to Fuzzing

Here is how you fuzz a Scheme program (R6RS or R7RS):

  1. First make sure you have AFL++ installed: sudo apt install afl++.
  2. Install Akku.scm, the Scheme package manager. It is required by Loko Scheme.
  3. Install Loko Scheme from git, or any future version later than 0.12.0.
  4. Write a program that reads input from standard input and passes it to your function under test.
  5. Compile that program with loko -fcoverage=afl++ --compile coverage.sps.
  6. Create a directory called inputs with sample input files. It is often enough to provide a single file, but it can help speed up the fuzzing process if you have samples of a wide variety.
  7. Run AFL++ with env AFL_CRASH_EXITCODE=70 afl-fuzz -i inputs/ -o outputs -- ./coverage. Watch the pretty status screen and wait for it to find crashes and timeouts.
  8. Analyze the outputs.

That’s essentially it. Only steps 4–8 are really specific to fuzzing, so I will go through them in detail.

Program Under Test

You need a program that reads from standard input and passes the data to the code you want to test with AFL++. This is usually very simple to accomplish.

If your program works with textual data then you read from (current-input-port), but if it is binary data then you make a new binary input port with (standard-input-port) and read from that. You can either read until you see the eof object and pass all the data directly to the code under test, or you can pass the port directly to the code under test. It depends on what your API needs as input.
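
For the binary case, a minimal sketch could look like the following; parse-message is a hypothetical stand-in for whatever procedure you actually want to test:

(import (rnrs))

;; Read all of standard input as a bytevector and hand it to the
;; (hypothetical) procedure under test.
(let ((data (get-bytevector-all (standard-input-port))))
  (unless (eof-object? data)
    (parse-message data)))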

Here is an example program that feeds data to get-token from laesare:

;; SPDX-License-Identifier: MIT
(import
  (rnrs)
  (laesare reader))

(let ((reader (make-reader (current-input-port) "stdin")))
  (reader-mode-set! reader 'r6rs)
  (let lp ()
    (let-values ([(type token)
                  (guard (con
                          ((lexical-violation? con)
                           (values 'condition con)))
                    (get-token reader))])
      (write type)
      (display #\space)
      (write token)
      (newline)
      (unless (eof-object? token)
        (lp)))))

Compile with Instrumentation

The program now needs to be compiled with instrumentation for AFL++. This is done by passing a new flag to loko:

.akku/env loko -fcoverage=afl++ --compile coverage.sps

The new -fcoverage=afl++ flag tells the code generator to insert a special code sequence inside every if expression.

The way that AFL++ works is that the program receives a shared memory segment from afl-fuzz that it mutates during the execution of the program. You can imagine that this program:

(if test
    conseq
    altern)

is transformed into this program:

(if test
    (begin (afl-mutate (compile-time-random)) conseq)
    (begin (afl-mutate (compile-time-random)) altern))

The expression (compile-time-random) should be fixed at compile-time and should be different for the two branches. Therefore the effect of afl-mutate on the shared memory segment will be different depending on which branch is taken at runtime, but it will be identical for different runs of the program if the input is identical.

Now suppose that this transformation is applied to the whole program. The path of all branches taken through the program for a given input generates a unique fingerprint that AFL++ uses to explore all the branches in the program. It uses some clever algorithms to mutate the input in ways that uncover new paths through the program. This is repeated thousands of times per second to eventually (maybe) find inputs that crash or hang the program.
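
To make the idea more concrete, here is a sketch of the classic AFL-style edge-coverage update. This is only an approximation: the real instrumentation is emitted by Loko's code generator and writes to the shared memory segment provided by afl-fuzz, not to a bytevector.

(import (rnrs))

;; A 64 KiB coverage map and the previously visited branch location.
(define afl-map (make-bytevector (expt 2 16) 0))
(define prev-location 0)

(define (afl-mutate current-location)
  ;; CURRENT-LOCATION is the compile-time random constant of this branch.
  ;; Each (previous, current) pair identifies an edge; bump its counter.
  (let ((edge (bitwise-and (bitwise-xor current-location prev-location)
                           #xFFFF)))
    (bytevector-u8-set! afl-map edge
                        (bitwise-and #xFF
                                     (+ 1 (bytevector-u8-ref afl-map edge))))
    (set! prev-location (bitwise-arithmetic-shift-right current-location 1))))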

By the way, when you use the -fcoverage=afl++ flag with Loko you also get an instrumented standard library. This means that AFL++ can see into (rnrs), (scheme base), etc., and can fuzz them along with your code. As a result, AFL++ can be smarter when it searches for bugs that are triggered by how your program interacts with the standard library, which would otherwise be a black box.

Run the Fuzzer

With the binary built you can start the fuzzer:

$ env AFL_CRASH_EXITCODE=70 afl-fuzz -i inputs \
    -o outputs -- ./coverage

It will tell you if something is wrong with the program. Otherwise it starts up a screen that looks like this:

The status screen of AFL++ shows progress, findings and various statistics

Not all crashes and timeouts are necessarily unique; many of them are likely to be triggered by the same bug.

The speed of fuzzing can shift over time, but I commonly see around 2500 executions/sec using a single core on my machine. This can be further sped up by using multiple cores and system tuning that the AFL++ manual can tell you more about.

Analyze the outputs

If the “findings in depth” box reports crashes and timeouts then you can go and look in the output directory. The outputs/default/crashes directory contains files that you can just feed directly into your test program:

$ ./coverage < outputs/default/crashes/id:000000,*
…
 Frame 2 has return address #x26987C.
  Local 3: #<closure dynamic-wind /usr/local/lib/loko/loko/runtime/control.loko.sls:164:0>
  Local 4: #[reader port: #<textual-input-port "*stdin*" fd: 0>
                    filename: "stdin" line: 1 column: 40 saved-line: 1
                    saved-column: 10 fold-case?: #f mode: r6rs tolerant?: #f]
  Local 5: &lexical
 Frame 3 has return address #x24282F.
  Local 0: #f
  Local 1: 0
  Local 2: 0
End of stack trace.
The condition has 6 components:
 1. &assertion &violation &serious
 2. &who: get-token
 3. &who: "/usr/local/lib/loko/laesare/reader.sls:460:0"
 4. &message: "Type error: expected a fixnum"
 5. &program-counter: #x2E68F1
 6. &continuation
     k: #<closure continuation /usr/local/lib/loko/loko/runtime/control.loko.sls:152:21>
End of condition components.
…

This tells us there is a crash in the get-token procedure. Unfortunately this is a huge procedure and Loko does not yet generate DWARF information that lets us get source line information from the instruction pointer. We know that something in get-token expected a fixnum, but Loko is a bit sloppy with the &who condition when it comes to assertions from inlined built-ins.

We can use objdump -d ./coverage and look for the instruction at 0x2E68F1 or the instruction that jumps to that address and try to make sense of the context. But there is another way.

More Tools for Easier Fun

AFL++ comes with tools that let you analyze the crashes and minimize the inputs. When you report a bug found using a fuzzer it is important to first use a minimizer to find a minimal reproducer. The person reading your bug report does not want to have to guess which parts of the input are relevant and which are noise. This applies also to us when we’re the ones using AFL++ for our own code.

Minimize with afl-tmin

Here is how you run the minimizer:

$ env AFL_CRASH_EXITCODE=70 afl-tmin  \
  -i outputs/default/crashes/id:000000,* \
  -o crash -- ./coverage

The afl-tmin program uses the same binary as before but has a different goal: remove as much of the input as possible.

afl-tmin reduced the file size by 33.87%

This is what happened to the file:

$ hexdump outputs/default/crashes/id:000000,*
0000000 0023 0100 2300 0000 0001 5c23 4678 4646
0000010 4646 4646 4646 4646 4646 4646 4646 4646
0000020 4646 4646 4646 4646 4623 4646 1821 3070
0000030 1818 1818 1818 1702 1818 5c00 7038
000003e
$ hexdump crash
0000000 5c23 4678 3030 3030 3030 3030 3030 3030
0000010 3030 0030                              
0000013
$ cat -vet crash
#\xF000000000000000

AFL++ has just told us that laesare crashes if it attempts to read a large character constant! That bug is now fixed in the git repo.

Perhaps even more impressive is that the minimizer also works when the program hangs; it turns out that (string->number "0F800000") hangs Loko Scheme. Oops again!

Analyze with afl-analyze

The analyzer is another fun tool and here is how you run it:

$ afl-analyze -i crash -- ./coverage

afl-analyze categorizes each byte in the file

In this case we already know what the problem is: a character constant that is too large. So it is not telling us anything new. But it has figured out that the middle zeros do not really affect the program flow, which can be useful information when analyzing other test cases.

tl;dr

Fuzzing is a powerful technique that automatically searches for inputs that crash or hang your program. Loko Scheme can now be used with AFL++ to fuzz Scheme programs.

Write a program coverage.sps that passes standard input to the procedure you want to test and then compile it with a recent Loko Scheme:

mkdir inputs
echo '()' > inputs/nil
.akku/env loko -ftarget=linux -fcoverage=afl++ --compile coverage.sps
env AFL_CRASH_EXITCODE=70 afl-fuzz -i inputs -o outputs -- ./coverage

by weinholt at Thursday, May 18, 2023

Friday, May 12, 2023

The Racket Blog

Racket v8.9

Racket version 8.9 is now available from https://download.racket-lang.org/

As of this release:

Thank you

Thank you to the people who contributed to this release:

Alex Harsányi, Alex Knauth, Alexis King, Ben Greenman, Bert De Ketelaere, Bob Burger, Bogdan Popa, Cadence Ember, D. Ben Knoble, Denis Hirn, dr-neptune, Eli Barzilay, Fred Fu, Gustavo Massaccesi, J. Ryan Stinnett, Jack Firth, Jamie Taylor, Jesse Alama, Jin-Ho King, John Clements, Lazerbeak12345, Mark Hedlund, Masaya Tojo, Matthew Flatt, Matthias Felleisen, Mike Sperber, Philip McGrath, Robby Findler, Ryan Culpepper, Sam Phillips, Sam Tobin-Hochstadt, sarna, Shu-Hung You, Sorawee Porncharoenwase, Stephen De Gabrielle, sxzzsf, Tom Price, Yukai Chou, and Zach O’Brien.

Feedback Welcome

by The Unknown Author at Friday, May 12, 2023

Tuesday, May 2, 2023

Scheme Requests for Implementation

SRFI 240: Reconciled Records

SRFI 240 is now in final status.

This SRFI defines a version of the define-record-type definition of R6RS and SRFI 237 that extends the define-record-type syntax of R7RS, reconciling both systems.

This SRFI is meant to be adopted by R7RS-large to integrate essentially the R6RS record system compatibly with the existing R7RS-small record system.

by Marc Nieper-Wißkirchen at Tuesday, May 2, 2023

SRFI 237: R6RS Records (refined)

SRFI 237 is now in final status.

The record mechanism of R6RS is refined. In particular, the triad of record names, record-type descriptors and record constructor descriptors can be effectively ignored and replaced with the single notion of a record descriptor. We also remove the restriction that the syntactic layer can only define one constructor per record type defined.

by Marc Nieper-Wißkirchen at Tuesday, May 2, 2023

Monday, May 1, 2023

Scheme Requests for Implementation

SRFI 153: Ordered Sets

SRFI 153 is now in final status.

Osets are immutable collections that can contain any Scheme objects as long as a total order exists among the objects. Osets enforce the constraint that no two elements can be the same in the sense of the oset's associated equality predicate. The elements in an oset appear in a fixed order determined by the comparator used to create it.

by John Cowan at Monday, May 1, 2023

Wednesday, April 26, 2023

GNU Guix

The Full-Source Bootstrap: Building from source all the way down

We are delighted and somewhat relieved to announce that the third reduction of the Guix bootstrap binaries has now been merged in the main branch of Guix! If you run guix pull today, you get a package graph of more than 22,000 nodes rooted in a 357-byte program—something that had never been achieved, to our knowledge, since the birth of Unix.

We refer to this as the Full-Source Bootstrap. In this post, we explain what this means concretely. This is a major milestone—if not the major milestone—in our quest for building everything from source, all the way down.

How did we get there, and why? In two previous blog posts, we elaborated on why this reduction and bootstrappability in general is so important.

One reason is to properly address supply chain security concerns. The Bitcoin community was one of the first to recognize its importance well enough to put the idea into practice. At the Breaking Bitcoin conference 2020, Carl Dong gave a fun and remarkably gentle introduction. At the end of the talk, Carl states:

The holy grail for bootstrappability will be connecting hex0 to mes.

Two years ago, at FOSDEM 2021, I (Janneke) gave a short talk about how we were planning to continue this quest.

If you think one should always be able to build software from source, then it follows that the “trusting trust” attack is only a symptom of an incomplete or missing bootstrap story.

The Road to Full-Source Bootstrap

Three years ago, the bootstrap binaries were reduced to just GNU Mes and MesCC-Tools (and the driver to build Guix packages: a static build of GNU Guile 2.0.9).

The new Full-Source Bootstrap, merged in Guix master yesterday, removes the binaries for Mes and MesCC-Tools and replaces them by bootstrap-seeds. For x86-linux (which is also used by the x86_64-linux build), this means the program hex0-seed, with ASCII equivalent hex0_x86.hex0. Hex0 is self-hosting and its source looks like this:

    ; Where the ELF Header is going to hit  
    ; Simply jump to _start  
    ; Our main function  
    # :_start ; (0x8048054)  
    58 # POP_EAX ; Get the number of arguments  

You can spot two types of line comment here: hex0 (;) and assembly (#). The only program code in this snippet is 58: two hexadecimal digits that are taken as two nibbles and compiled into the corresponding byte with value #x58 (the x86 opcode for pop %eax, as the comment indicates).

Starting from this 357-byte hex0-seed binary provided by the bootstrap-seeds, the stage0-posix package created by Jeremiah Orians first builds hex0 and then all the way up: hex1, catm, hex2, M0, cc_x86, M1, M2, get_machine (that's all of MesCC-Tools), and finally M2-Planet.

The new GNU Mes v0.24 release can be built with M2-Planet. This time with only a remarkably small change, the bottom of the package graph now looks like this (woohoo!):

                              gcc-mesboot (4.9.4)
                                      ^
                                      |
                                    (...)
                                      ^
                                      |
               binutils-mesboot (2.20.1a), glibc-mesboot (2.2.5),
                          gcc-core-mesboot (2.95.3)
                                      ^
                                      |
                             patch-mesboot (2.5.9)
                                      ^
                                      |
                       bootstrappable-tcc (0.9.26+31 patches)
                                      ^
                                      |
                            gnu-make-mesboot0 (3.80)
                                      ^
                                      |
                              gzip-mesboot (1.2.4)
                                      ^
                                      |
                               tcc-boot (0.9.27)
                                      ^
                                      |
                                 mes-boot (0.24.2)
                                      ^
                                      |
                         stage0-posix (hex0..M2-Planet)
                                      ^
                                      |
                          gash-boot, gash-utils-boot
                                      ^
                                      |
                                      *
                     bootstrap-seeds (357-bytes for x86)
                                     ~~~
                   [bootstrap-guile-2.0.9 driver (~25 MiB)]

full graph

We are excited that the NLnet Foundation has been sponsoring this work!

However, we aren't done yet; far from it.

Lost Paths

The idea of reproducible builds and bootstrappable software is not very new. Much of that was implemented for the GNU tools in the early 1990s. Working to recreate it today shows us that much of that practice was forgotten.

Most bootstrap problems or loops are not so easy to solve and sometimes there are no obvious answers, for example:

While these examples make for a delightful puzzle from a bootstrappability perspective, we would love to see the maintainers of GNU packages consider bootstrappability and start taking more responsibility for the bootstrap story of their packages.

Next Steps

Despite this major achievement, there is still work ahead.

First, while the package graph is rooted in a 357-byte program, the set of binaries from which packages are built includes a 25 MiB statically-linked Guile, guile-bootstrap, that Guix uses as its driver to build the initial packages. 25 MiB is a tenth of what the initial bootstrap binaries used to weigh, but it is a lot compared to those 357 bytes. Can we get rid of this driver, and how?

A development effort with Timothy Sample addresses the dependency on guile-bootstrap of Gash and Gash-Utils, the pure-Scheme POSIX shell implementation central to our second milestone. On the one hand, Mes is gaining a higher level of Guile compatibility: hash table interface, record interface, variables and variable-lookup, and Guile (source) module loading support. On the other hand, Gash and Gash-Utils are getting Mes compatibility for features that Mes is lacking (notably syntax-case macros). If we pull this off, guile-bootstrap will only be used as a dependency of bootar and as the driver for Guix.

Second, the full-source bootstrap that just landed in Guix master is limited to x86_64-linux and i686-linux, but ARM and RISC-V will be joining soon. We are most grateful and excited that the NLnet Foundation has decided to continue sponsoring this work!

Some time ago, Wladimir van der Laan contributed initial RISC-V support for Mes but a major obstacle for the RISC-V bootstrap is that the “vintage” GCC-2.95.3 that was such a helpful stepping stone does not support RISC-V. Worse, the RISC-V port of GCC was introduced only in GCC 7.5.0—a version that requires C++ and cannot be bootstrapped! To this end, we have been improving MesCC, the C compiler that comes with Mes, so it is able to build GCC 4.6.5; meanwhile, Ekaitz Zarraga backported RISC-V support to GCC 4.6.5, and backported RISC-V support from the latest tcc to our bootstrappable-tcc.

Outlook

The full-source bootstrap was once deemed impossible. Yet, here we are, building the foundations of a GNU/Linux distro entirely from source, a long way towards the ideal that the Guix project has been aiming for from the start.

There are still some daunting tasks ahead. For example, what about the Linux kernel? The good news is that the bootstrappable community has grown a lot, from two people six years ago there are now around 100 people in the #bootstrappable IRC channel. Interesting times ahead!

About Bootstrappable Builds and GNU Mes

Software is bootstrappable when it does not depend on a binary seed that cannot be built from source. Software that is not bootstrappable, even if it is free software, is a serious security risk (supply chain security) for a variety of reasons. The Bootstrappable Builds project aims to reduce the number and size of binary seeds to a bare minimum.

GNU Mes is closely related to the Bootstrappable Builds project. Mes is used in the full-source bootstrap path for the Guix System.

Currently, Mes consists of a mutual self-hosting scheme interpreter and C compiler. It also implements a C library. Mes, the scheme interpreter, is written in about 5,000 lines of code of simple C and can be built with M2-Planet. MesCC, the C compiler, is written in scheme. Together, Mes and MesCC can compile bootstrappable TinyCC that is self-hosting. Using this TinyCC and the Mes C library, the entire Guix System for i686-linux and x86_64-linux is bootstrapped.

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the Hurd or the Linux kernel, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, AArch64 and POWER9 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Janneke Nieuwenhuizen, Ludovic Courtès at Wednesday, April 26, 2023

Tuesday, April 25, 2023

Scheme Requests for Implementation

SRFI 226: Control Features

SRFI 226 is now in final status.

Whenever an expression is evaluated during the run of a Scheme program, there is a continuation awaiting the values of the expression. It is a distinguishing property of the Scheme programming language to offer a procedure (named call/cc) that captures the current continuation as a procedure, which, when called, aborts the then-current continuation and reinstates the captured one.

One can visualize a continuation as a list of (continuation) frames where a non-tail call adds a frame to the top of the list and where the return from a non-tail call removes the appropriate frame.

Moreover, each expression is evaluated in a dynamic environment that conceptually holds the values of parameters like the current output port and the dynamic-wind stack at the point of evaluation. As the dynamic environment is captured and reinstated along the continuation when the call/cc machinery is used, we can view it conceptually as part of the continuation.

The libraries defined in this SRFI are all concerned with continuations in a wider sense. More specifically, the topics are as follows:

Continuation Prompts
A continuation prompt is a special continuation frame that is tagged with a so-called prompt tag. Procedures to install continuation prompts and to abort the current continuation and escape back to a previously installed continuation prompt are provided. Moreover, continuation prompts are equipped with handlers that are invoked when a continuation is aborted to them.
Continuations
When continuations are captured, the list of captured continuation frames is always delimited by some continuation prompt. This extends the semantics of Scheme’s call-with-current-continuation. Moreover, a procedure to capture so-called composable continuations is provided. As opposed to continuations captured by call-with-current-continuation, invoking a composable continuation does not abort the then-current continuation, so composable continuations behave like ordinary procedures. Together with continuation prompts, composable continuations allow one to implement the various proposed sets of control operators for delimited continuations. Finally, a primitive (call-in-continuation) is provided that allows calling a procedure in a given continuation instead of just delivering values to it.
Continuation Marks
Continuation marks are a feature that allows one to attach arbitrary information to continuation frames; this information is captured and reinstated along with the rest of the continuation. Conceptually, exception handlers and parameters are implemented in terms of continuation marks, but the syntax and procedures defined in this SRFI allow the user to use them in more general ways. Moreover, they reify the notion of a tail call, allowing one, for example, to test for tail context.
Exceptions
The exception mechanism of R6RS and R7RS is reinterpreted with respect to the concepts introduced in this SRFI. (Here, and in what follows we mean the so-called small language when we speak about R7RS.) Moreover, the with-exception-handler procedure and the guard syntax gain additional tail-context guarantees.
Parameters
The parameter object mechanism of SRFI 39 and R7RS is reinterpreted with respect to the concepts introduced in this SRFI. Procedures to retrieve the current parameterization and to reinstall it later are provided. Moreover, the parameterize syntax gains an additional tail-context guarantee. To support an alternative model of parameters that is linked to the dynamic extent and not to the current parameterization, the notion of a parameter-like object and the temporarily syntax are introduced.
Fluids
Fluids are a syntactic reinterpretation of parameter objects.
Delayed evaluation
The syntax and procedures on delayed evaluation of R7RS are revisited and redefined to handle the following satisfactorily: the parameterization of the delayed expression being forced, the treatment of exceptions raised during forcing of delayed expressions, and iterative lazy algorithms. Moreover, their semantics are detailed with respect to the concepts introduced in this SRFI, and promises can naturally deliver an arbitrary number of values when being forced. Finally, the initial continuation of a delayed expression being forced is defined in a way that makes it interchangeable with the initial continuation of a thread.
Threads
The thread mechanism of SRFI 18 is detailed with respect to the concepts introduced in this SRFI. In particular, mutation of parameter objects in multi-threaded applications is specified. In order to support timeout arguments in a type-safe way, a minimal API on time objects is included as well.

Large parts of this SRFI have been inspired by the control operators provided by Racket.
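
As a small illustration (assuming the Racket-style names and calling convention this SRFI adopts, such as make-continuation-prompt-tag, call-with-continuation-prompt, and abort-current-continuation), a prompt with a handler acts as a typed escape point:

(define tag (make-continuation-prompt-tag 'escape))

(call-with-continuation-prompt
 (lambda ()
   ;; Abort back to the prompt, delivering 41 to its handler.
   (+ 1 (abort-current-continuation tag 41)))
 tag
 (lambda (x) (+ x 1)))
;; ⇒ 42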

by Marc Nieper-Wißkirchen at Tuesday, April 25, 2023

Wednesday, April 19, 2023

GNU Guix

Dissecting Guix, Part 3: G-Expressions

Welcome back to Dissecting Guix! Last time, we discussed monads, the functional programming idiom used by Guix to thread a store connection through a series of store-related operations.

Today, we'll be talking about a concept rather more specific to Guix: g-expressions. Being an implementation of the Scheme language, Guile is built around s-expressions, which can represent, as the saying goes, code as data, thanks to the simple structure of Scheme forms.

As Guix's package recipes are written in Scheme, it naturally needs some way to represent code that is to be run only when the package is built. Additionally, there needs to be some way to reference dependencies and retrieve output paths; otherwise, you wouldn't be able to, for instance, create a phase to install a file in the output directory.

So, how do we implement this "deferred" code? Well, initially Guix used plain old s-expressions for this purpose.

Once Upon a Time

Let's say we want to create a store item that's just a symlink to the bin/irssi file of the irssi package. How would we do that with an s-expression? Well, the s-expression itself, which we call the builder, is fairly simple:

(define sexp-builder
  `(let* ((out (assoc-ref %outputs "out"))
          (irssi (assoc-ref %build-inputs "irssi"))
          (bin/irssi (string-append irssi "/bin/irssi")))
     (symlink bin/irssi out)))

If you aren't familiar with the "quoting" syntax used to create s-expressions, I strongly recommend that you read the excellent Scheme Primer; specifically, section 7, Lists and "cons", and section 11, On the extensibility of Scheme (and Lisps in general).

The %outputs and %build-inputs variables are bound within builder scripts to association lists, which are lists of pairs that act like key/value stores, for instance:

'(("foo" . "bar")
  ("floob" . "blarb")
  ("fvoolag" . "bvarlag"))

To retrieve values from association lists, which are often referred to as alists, we use the assoc-ref procedure:

(assoc-ref '(("boing" . "bouncy")
             ("floing" . "flouncy"))
           "boing")
⇒ "bouncy"

%outputs, as the name might suggest, maps derivation output names to the paths of their respective store items, the default output being out, and %build-inputs maps input labels to their store items.
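
To make that concrete, inside our builder those two alists might look roughly like this (store hashes elided; the exact entries depend on the derivation):

%outputs       ⇒ '(("out" . "/gnu/store/…-irssi-symlink"))
%build-inputs  ⇒ '(("irssi" . "/gnu/store/…-irssi-1.4.3"))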

The builder is the easy part; we now need to turn it into a derivation and tell it what "irssi" actually refers to. For this, we use the build-expression->derivation procedure from (guix derivations):

(use-modules (guix derivations)
             (guix packages)
             (guix store)
             (gnu packages guile)
             (gnu packages irc))

(with-store store
  (let ((guile-3.0-drv (package-derivation store guile-3.0))
        (irssi-drv (package-derivation store irssi)))
    (build-expression->derivation store "irssi-symlink" sexp-builder
      #:guile-for-build guile-3.0-drv
      #:inputs `(("irssi" ,irssi-drv)))))
⇒ #<derivation /gnu/store/…-irssi-symlink.drv => /gnu/store/…-irssi-symlink …>

There are several things to note here:

  • The inputs must all be derivations, so we need to first convert the packages using package-derivation.
  • We need to explicitly set #:guile-for-build; there's no default value.
  • The build-expression->derivation and package-derivation procedures are not monadic, so we need to explicitly pass them the store connection.

The shortcomings of using s-expressions in this way are numerous: we have to convert everything to a derivation before using it, and inputs are not an inherent aspect of the builder. G-expressions were designed to overcome these issues.

Premortem Examination

A g-expression is fundamentally a record of type <gexp>, which is, naturally, defined in (guix gexp). The two most important fields of this record type, out of a total of five, are proc and references; the former is a procedure that returns the equivalent s-expression, the latter a list containing everything from the "outside world" that's used by the g-expression.

When we want to turn the g-expression into something that we can actually run as code, we combine these two fields by first building any g-expression inputs that can become derivations (leaving alone those that cannot), and then passing the built references as the arguments of proc.

Here's an example g-expression that is essentially equivalent to our sexp-builder:

(use-modules (guix gexp))

(define gexp-builder
  #~(symlink #$(file-append irssi "/bin/irssi")
             #$output))

gexp-builder is far more concise than sexp-builder; let's examine the syntax and the <gexp> object we've created. To make a g-expression, we use the #~ syntax, equivalent to the gexp macro, rather than the quasiquote backtick used to create s-expressions.

When we want to embed values from outside as references, we use #$, or ungexp, which is, in appearance if not function, equivalent to unquote (,). ungexp can accept any of four reference types:

  • S-expressions (strings, lists, etc), which will be embedded literally.
  • Other g-expressions, embedded literally.
  • Expressions returning any sort of object that can be lowered into a derivation, such as <package>, embedding that object's out store item; if the expression is specifically a symbol bound to a buildable object, you can optionally follow it with a colon and an alternative output name, so package:lib is permitted, but (get-package):lib isn't.
  • The symbol output, embedding an output path. Like symbols bound to buildable objects, this can be followed by a colon and the output name that should be used rather than the default out.

All these reference types will be represented by <gexp-input> records in the references field, except for the last kind, which will become <gexp-output> records. To give an example of each type of reference (with the return value output formatted for easier reading):

(use-modules (gnu packages glib))

#~(list #$"foobar"                         ;s-expression
        #$#~(string-append "foo" "bar")    ;g-expression
        #$(file-append irssi "/bin/irssi") ;buildable object (expression)
        #$glib:bin                         ;buildable object (symbol)
        #$output:out)                      ;output
⇒ #<gexp (list #<gexp-input "foobar":out>
               #<gexp-input #<gexp (string-append "foo" "bar") …>:out>
               #<gexp-input #<file-append #<package irssi@1.4.3 …> "/bin/irssi">:out>
               #<gexp-input #<package glib@2.70.2 …>:bin>
               #<gexp-output out>) …>

Note the use of file-append in both the previous example and gexp-builder; this procedure produces a <file-append> object that builds its first argument and is embedded as the concatenation of the first argument's output path and the second argument, which should be a string. For instance, (file-append irssi "/bin/irssi") builds irssi and expands to /gnu/store/…-irssi/bin/irssi, rather than the /gnu/store/…-irssi that the package alone would be embedded as.

So, now that we have a g-expression, how do we turn it into a derivation? This process is known as lowering; it entails the use of the aptly-named lower-gexp monadic procedure to combine proc and references and produce a <lowered-gexp> record, which acts as a sort of intermediate representation between g-expressions and derivations. We can piece apart this lowered form to get a sense of what the final derivation's builder script would look like:

(define lowered-gexp-builder
  (with-store store
    (run-with-store store
      (lower-gexp gexp-builder))))

(lowered-gexp-sexp lowered-gexp-builder)
⇒ (symlink
   "/gnu/store/…-irssi-1.4.3/bin/irssi"
   ((@ (guile) getenv) "out"))

And there you have it: an s-expression compiled from a g-expression, ready to be written into a builder script file in the store. So, how exactly do you turn this into said derivation?

Well, it turns out that there isn't an interface for turning lowered g-expressions into derivations; there is only one for turning regular g-expressions into derivations, which first uses lower-gexp and then implements the aforementioned conversion internally rather than outsourcing it to some other procedure. So that's what we'll use.

Unsurprisingly, that procedure is called gexp->derivation, and unlike its s-expression equivalent, it's monadic. (build-expression->derivation and other deprecated procedures have been in Guix since before the monads system existed.)

(with-store store
  (run-with-store store
    (gexp->derivation "irssi-symlink" gexp-builder)))
⇒ #<derivation /gnu/store/…-irssi-symlink.drv => /gnu/store/…-irssi-symlink …>

Finally, we have a g-expression-based equivalent to the derivation we earlier created with build-expression->derivation! Here's the code we used for the s-expression version in full:

(define sexp-builder
  `(let* ((out (assoc-ref %outputs "out"))
          (irssi (assoc-ref %build-inputs "irssi"))
          (bin/irssi (string-append irssi "/bin/irssi")))
     (symlink bin/irssi out)))

(with-store store
  (let ((guile-3.0-drv (package-derivation store guile-3.0))
        (irssi-drv (package-derivation store irssi)))
    (build-expression->derivation store "irssi-symlink" sexp-builder
      #:guile-for-build guile-3.0-drv
      #:inputs `(("irssi" ,irssi-drv)))))

And here's the g-expression equivalent:

(define gexp-builder
  #~(symlink #$(file-append irssi "/bin/irssi")
             #$output))

(with-store store
  (run-with-store store
    (gexp->derivation "irssi-symlink" gexp-builder)))

That's a lot of complexity abstracted away! For more complex packages and services, especially, g-expressions are a lifesaver; you can refer to the output paths of inputs just as easily as you would a string constant. You do, however, have to watch out for situations where ungexp-native, written as #+, would be preferable over regular ungexp, and that's something we'll discuss later.

A brief digression before we continue: if you'd like to look inside a <gexp> record, but you'd rather not build anything, you can use the gexp->approximate-sexp procedure, which replaces all references with dummy values:

(gexp->approximate-sexp gexp-builder)
⇒ (symlink (*approximate*) (*approximate*))

The Lowerable-Object Hardware Shop

We've seen two examples already of records we can turn into derivations, which are generally referred to as lowerable objects or file-like objects:

  • <package>, a Guix package.
  • <file-append>, which wraps another lowerable object and appends a string to the embedded output path when ungexped.

There are many more available to us. Recall from the previous post, The Store Monad, that Guix provides the two monadic procedures text-file and interned-file, which can be used, respectively, to put arbitrary text or files from the filesystem in the store, returning the path to the created item.

This doesn't work so well with g-expressions, though; you'd have to wrap each ungexped use of either of them with (with-store store (run-with-store store …)), which would be quite tedious. Thankfully, (guix gexp) provides the plain-file and local-file procedures, which return equivalent lowerable objects. This code example builds a directory containing symlinks to files greeting the world:

(use-modules (guix monads)
             (ice-9 ftw)
             (ice-9 textual-ports))

(define (build-derivation monadic-drv)
  (with-store store
    (run-with-store store
      (mlet* %store-monad ((drv monadic-drv))
        (mbegin %store-monad
          ;; BUILT-DERIVATIONS is the monadic version of BUILD-DERIVATIONS.
          (built-derivations (list drv))
          (return (derivation-output-path
                   (assoc-ref (derivation-outputs drv) "out"))))))))
                   
(define world-greeting-output
  (build-derivation
   (gexp->derivation "world-greeting"
     #~(begin
         (mkdir #$output)
         (symlink #$(plain-file "hi-world"
                      "Hi, world!")
                  (string-append #$output "/hi"))
         (symlink #$(plain-file "hello-world"
                      "Hello, world!")
                  (string-append #$output "/hello"))
         (symlink #$(plain-file "greetings-world"
                      "Greetings, world!")
                  (string-append #$output "/greetings"))))))

;; We turn the list into multiple values using (APPLY VALUES …).
(apply values
       (map (lambda (file-path)
              (let* ((path (string-append world-greeting-output "/" file-path))
                     (contents (call-with-input-file path get-string-all)))
                (list path contents)))
            ;; SCANDIR from (ICE-9 FTW) returns the list of all files in a
            ;; directory (including ``.'' and ``..'', so we remove them with the
            ;; second argument, SELECT?, which specifies a predicate).
            (scandir world-greeting-output
                     (lambda (path)
                       (not (or (string=? path ".")
                                (string=? path "..")))))))
⇒ ("/gnu/store/…-world-greeting/greetings" "Greetings, world!")
⇒ ("/gnu/store/…-world-greeting/hello" "Hello, world!")
⇒ ("/gnu/store/…-world-greeting/hi" "Hi, world!")

Note that we define a procedure for building the output; we will need to build more derivations in a very similar fashion later, so it helps to have this to reuse instead of copying the code in world-greeting-output.

There are many other useful lowerable objects available as part of the gexp library. These include computed-file, which accepts a gexp that builds the output file; program-file, which creates an executable Scheme script in the store using a g-expression; and mixed-text-file, which allows you to, well, mix text and lowerable objects, creating a file from the concatenation of a sequence of strings and file-likes. The G-Expressions manual page has more details.
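
For instance, here is a minimal sketch of mixed-text-file in action (the file name and contents are arbitrary); it interleaves literal strings with a file-like object, and like any other lowerable object the result can be ungexped inside a larger g-expression:

(define irssi-launcher
  ;; A file whose contents embed the store path of irssi's binary.
  (mixed-text-file "launch-irssi"
                   "#!/bin/sh\nexec "
                   (file-append irssi "/bin/irssi")
                   " \"$@\"\n"))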

So, you may be wondering, at this point: there's so many lowerable objects included with the g-expression library, surely there must be a way to define more? Naturally, there is; this is Scheme, after all! We simply need to acquaint ourselves with the define-gexp-compiler macro.

The most basic usage of define-gexp-compiler essentially creates a procedure that takes as arguments a record to lower, the host system, and the target system, and returns a derivation or store item as a monadic value in %store-monad.

Let's try implementing a lowerable object representing a file that greets the world. First, we'll define the record type:

(use-modules (srfi srfi-9))

(define-record-type <greeting-file>
  (greeting-file greeting)
  greeting?
  (greeting greeting-file-greeting))

Now we use define-gexp-compiler like so; note how we can use lower-object to compile down any sort of lowerable object into the equivalent store item or derivation; essentially, lower-object is just the procedure for applying the right gexp-compiler to an object:

(use-modules (ice-9 i18n))

(define-gexp-compiler (greeting-file-compiler
                       (greeting-file <greeting-file>)
                       system target)
  (lower-object
   (let ((greeting (greeting-file-greeting greeting-file)))
     (plain-file (string-append greeting "-greeting")
       (string-append (string-locale-titlecase greeting) ", world!")))))

Let's try it out now. Here's how we could rewrite our greetings directory example from before using <greeting-file>:

(define world-greeting-2-output
  (build-derivation
   (gexp->derivation "world-greeting-2"
     #~(begin
         (mkdir #$output)
         (symlink #$(greeting-file "hi")
                  (string-append #$output "/hi"))
         (symlink #$(greeting-file "hello")
                  (string-append #$output "/hello"))
         (symlink #$(greeting-file "greetings")
                  (string-append #$output "/greetings"))))))

(apply values
       (map (lambda (file-path)
              (let* ((path (string-append world-greeting-2-output
                                          "/" file-path))
                     (contents (call-with-input-file path get-string-all)))
                (list path contents)))
            (scandir world-greeting-2-output
                     (lambda (path)
                       (not (or (string=? path ".")
                                (string=? path "..")))))))
⇒ ("/gnu/store/…-world-greeting-2/greetings" "Greetings, world!")
⇒ ("/gnu/store/…-world-greeting-2/hello" "Hello, world!")
⇒ ("/gnu/store/…-world-greeting-2/hi" "Hi, world!")

Now, this is probably not worth a whole new gexp-compiler. How about something a bit more complex? Sharp-eyed readers who are trying all this in the REPL may have noticed the following output when they used define-gexp-compiler (formatted for ease of reading):

⇒ #<<gexp-compiler>
    type: #<record-type <greeting-file>>
    lower: #<procedure … (greeting-file system target)>
    expand: #<procedure default-expander (thing obj output)>>

Now, the purpose of type and lower is self-explanatory, but what's this expand procedure here? Well, if you recall file-append, you may realise that the text produced by a gexp-compiler for embedding into a g-expression doesn't necessarily have to be the exact output path of the produced derivation.

There turns out to be another way to write a define-gexp-compiler form that allows you to specify both the lowering procedure, which produces the derivation or store item, and the expanding procedure, which produces the text.

Let's try making another new lowerable object; this one will let us build a Guile package and expand to the path to its module directory. Here's our record:

(define-record-type <module-directory>
  (module-directory package)
  module-directory?
  (package module-directory-package))

Here's how we define both a compiler and expander for our new record:

(use-modules (gnu packages guile)
             (guix utils))

(define lookup-expander (@@ (guix gexp) lookup-expander))

(define-gexp-compiler module-directory-compiler <module-directory>
  compiler => (lambda (obj system target)
                (let ((package (module-directory-package obj)))
                  (lower-object package system #:target target)))
  expander => (lambda (obj drv output)
                (let* ((package (module-directory-package obj))
                       (expander (or (lookup-expander package)
                                     (lookup-expander drv)))
                       (out (expander package drv output))
                       (guile (or (lookup-package-input package "guile")
                                  guile-3.0))
                       (version (version-major+minor
                                 (package-version guile))))
                  (string-append out "/share/guile/site/" version))))

Let's try this out now:

(use-modules (gnu packages guile-xyz))

(define module-directory-output/guile-webutils
  (build-derivation
   (gexp->derivation "module-directory-output"
     #~(symlink #$(module-directory guile-webutils) #$output))))

(readlink module-directory-output/guile-webutils)
⇒ "/gnu/store/…-guile-webutils-0.1-1.d309d65/share/guile/site/3.0"

(scandir module-directory-output/guile-webutils)
⇒ ("." ".." "webutils")

(define module-directory-output/guile2.2-webutils
  (build-derivation
   (gexp->derivation "module-directory-output"
     #~(symlink #$(module-directory guile2.2-webutils) #$output))))

(readlink module-directory-output/guile2.2-webutils)
⇒ "/gnu/store/…-guile-webutils-0.1-1.d309d65/share/guile/site/2.2"

(scandir module-directory-output/guile2.2-webutils)
⇒ ("." ".." "webutils")

Who knows why you'd want to do this, but it certainly works! We've looked at why we need g-expressions, how they work, and how to extend them, and we've now only got two more advanced features to cover: cross-build support, and modules.

Importing External Modules

Let's try using one of the helpful procedures from the (guix build utils) module in a g-expression.

(define simple-directory-output
  (build-derivation
   (gexp->derivation "simple-directory"
     #~(begin
         (use-modules (guix build utils))
         (mkdir-p (string-append #$output "/a/rather/simple/directory"))))))

Looks fine, right? We've even got a use-modules in th--

ERROR:
  1. &store-protocol-error:
      message: "build of `/gnu/store/…-simple-directory.drv' failed"
      status: 100

OUTRAGEOUS. Fortunately, there's an explanation to be found in the Guix build log directory, /var/log/guix/drvs; locate the file using the first two characters of the store hash as the subdirectory, and the rest as the file name, and remember to use zcat or zless, as the logs are gzipped:

Backtrace:
           9 (primitive-load "/gnu/store/…")
In ice-9/eval.scm:
   721:20  8 (primitive-eval (begin (use-modules (guix build #)) (?)))
In ice-9/psyntax.scm:
  1230:36  7 (expand-top-sequence ((begin (use-modules (guix ?)) #)) ?)
  1090:25  6 (parse _ (("placeholder" placeholder)) ((top) #(# # ?)) ?)
  1222:19  5 (parse _ (("placeholder" placeholder)) ((top) #(# # ?)) ?)
   259:10  4 (parse _ (("placeholder" placeholder)) (()) _ c&e (eval) ?)
In ice-9/boot-9.scm:
  3927:20  3 (process-use-modules _)
   222:17  2 (map1 (((guix build utils))))
  3928:31  1 (_ ((guix build utils)))
   3329:6  0 (resolve-interface (guix build utils) #:select _ #:hide ?)

ice-9/boot-9.scm:3329:6: In procedure resolve-interface:
no code for module (guix build utils)

It turns out use-modules can't actually find (guix build utils) at all. There's no typo; it's just that to ensure the build is isolated, Guix builds module-import and module-import-compiled directories, and sets the Guile module path within the build environment to contain said directories, along with those containing the Guile standard library modules.

So, what to do? Turns out one of the fields in <gexp> is modules, which, funnily enough, contains the names of the modules which will be used to build the aforementioned directories. To add to this field, we use the with-imported-modules macro. (gexp->derivation does provide a modules parameter, but with-imported-modules lets you add the required modules directly to the g-expression value, rather than later on.)

(define simple-directory-output
  (build-derivation
   (gexp->derivation "simple-directory"
     (with-imported-modules '((guix build utils))
       #~(begin
           (use-modules (guix build utils))
           (mkdir-p (string-append #$output "/a/rather/simple/directory")))))))
           
simple-directory-output
⇒ "/gnu/store/…-simple-directory"

It works, yay. It's worth noting that while passing just the list of modules to with-imported-modules works in this case, this is only because (guix build utils) has no dependencies on other Guix modules. Were we to try adding, say, (guix build emacs-build-system), we'd need to use the source-module-closure procedure to add its dependencies to the list:

(use-modules (guix modules))

(source-module-closure '((guix build emacs-build-system)))
⇒ ((guix build emacs-build-system)
   (guix build gnu-build-system)
   (guix build utils)
   (guix build gremlin)
   (guix elf)
   (guix build emacs-utils))

Here's another scenario: what if we want to use a module not from Guix or Guile, but from a third-party library? In this example, we'll use guile-json, a library for converting between S-expressions and JavaScript Object Notation.

We can't just with-imported-modules its modules, since it's not part of Guix, so <gexp> provides another field for this purpose: extensions. Each of these extensions is a lowerable object that produces a Guile package directory; so usually a package. Let's try it out using the guile-json-4 package to produce a JSON file from a Scheme value within a g-expression.

(define helpful-guide-output
  (build-derivation
   (gexp->derivation "json-file"
     (with-extensions (list guile-json-4)
       #~(begin
           (use-modules (json))
           (mkdir #$output)
           (call-with-output-file (string-append #$output "/helpful-guide.json")
             (lambda (port)
               (scm->json '((truth . "Guix is the best!")
                            (lies . "Guix isn't the best!"))
                          port))))))))

(call-with-input-file
    (string-append helpful-guide-output "/helpful-guide.json")
  get-string-all)
⇒ "{\"truth\":\"Guix is the best!\",\"lies\":\"Guix isn't the best!\"}"

Amen to that, helpful-guide.json. Before we continue on to cross-compilation, there's one last feature of with-imported-modules you should note. We can add modules to a g-expression by name, but we can also create entirely new ones using lowerable objects, such as in this pattern, which is used in several places in the Guix source code to make an appropriately-configured (guix config) module available:

(with-imported-modules `(((guix config) => ,(make-config.scm))
                         …)
  …)

In case you're wondering, make-config.scm is found in (guix self) and returns a lowerable object that compiles to a version of the (guix config) module, which contains constants usually substituted into the source code at compile time.

Native ungexp

There is another piece of syntax we can use with g-expressions, and it's called ungexp-native. This helps us distinguish between native inputs and regular host-built inputs in cross-compilation situations. We'll cover cross-compilation in detail at a later date, but the gist of it is that it allows you to compile a derivation for one architecture X, the target, using a machine of architecture Y, the host, and Guix has excellent support for it.

If we cross-compile a g-expression G that non-natively ungexps L1, a lowerable object, from architecture Y to architecture X, both G and L1 will be compiled for architecture X. However, if G natively ungexps L1, G will be compiled for X and L1 for Y.

Essentially, we use ungexp-native in situations where there would be no difference between compiling on different architectures (for instance, if L1 were a plain-file), or where using L1 built for X would actually break G (for instance, if L1 corresponds to a compiled executable that needs to be run during the build; the executable would fail to run on Y if it was built for X.)

The ungexp-native macro naturally has a corresponding reader syntax, #+, and there's also ungexp-native-splicing, which is written as #+@. These two pieces of syntax are used in the same way as their regular counterparts.
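
As a hedged sketch of the distinction (the choice of packages is arbitrary; sed comes from (gnu packages base)), here is a builder that runs a tool during the build while also referring to a target-system path:

(use-modules (gnu packages base))   ;for sed

#~(begin
    ;; sed runs on the machine performing the build, so it is referenced
    ;; natively with #+; with #$ under cross-compilation it would be built
    ;; for the target and could not execute here.
    (system* #+(file-append sed "/bin/sed") "--version")
    ;; The symlink target is meant for the target system, hence plain #$.
    (symlink #$(file-append irssi "/bin/irssi") #$output))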

Conclusion

What have we learned in this post? To summarise:

  • G-expressions are essentially abstractions on top of s-expressions used in Guix to stage code, often for execution within a build environment or a Shepherd service script.
  • Much like you can unquote external values within a quasiquote form, you can ungexp external values to access them within a gexp form. The key difference is that you may use not only s-expressions with ungexp, but other g-expressions and lowerable objects too.
  • When a lowerable object is used with ungexp, the g-expression ultimately receives the path to the object's store item (or whatever string the lowerable object's expander produces), rather than the object itself.
  • A lowerable object is any record that has a "g-expression compiler" defined for it using the define-gexp-compiler macro. G-expression compilers always contain a compiler procedure, which converts an appropriate record into a derivation, and sometimes an expander procedure, which produces the string that is to be expanded to within g-expressions when the object is ungexped.
  • G-expressions record the list of modules available in their environment, which you may expand using with-imported-modules to add Guix modules, and with-extensions to add modules from third-party Guile packages.
  • ungexp-native may be used within g-expressions to compile lowerable objects for the host rather than the target system in cross-compilation scenarios.

Mastering g-expressions is essential to understanding Guix's inner workings, so the aim of this blog post is to be as thorough as possible. However, if you still find yourself with questions, please don't hesitate to stop by at the IRC channel #guix:libera.chat and mailing list help-guix@gnu.org; we'll be glad to assist you!

Also note that due to the centrality of g-expressions to Guix, there exists a plethora of alternative resources on this topic; here are some which you may find useful:

  • Arun Isaac's post on using g-expressions with guix deploy.
  • Marius Bakke's "Guix Drops" post which explains g-expressions in a more "top-down" way.
  • This 2020 FOSDEM talk by Christopher Marusich on the uses of g-expressions.
  • And, of course, the one and only original g-expression paper by Ludovic Courtès, the original author of Guix.

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the Hurd or the Linux kernel, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, AArch64 and POWER9 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by ( at Wednesday, April 19, 2023

hyper.dev

Petit Cloud - 2 - Kernel

We jumped into an adventure to build the simplest possible way to deploy a Scheme web application. In the previous article, we identified two procedures: petit-cloud-http-connect and the kernel procedure make-petit-cloud-application. We will dive into the latter.

Any resemblance to competitor technology is wholly and fully coincidental.

Continuation

We want to deploy an HTTP web application built as a library (hello) with a procedure main:

;; hello.scm
(library (hello)
  (export main)
  (import (petit-cloud application))

  (define main
    (lambda (method path headers body)
      (values 200
              (list (cons 'content-type "text/plain"))
              (string->utf8 "Hello schemer!\n")))))

Given an application server running at DOMAIN:PORT, it must be possible to deploy it with the following shell invocation:

#;sh> petit-cloud deploy DOMAIN:PORT hello.scm

Leaving aside logging, monitoring, and error handling, the application server code is:

(import (petit-cloud))


(define petit-cloud-deploy?
  (lambda (method path)
    ...))

(define client-accept
  (lambda (read write close)
    (define-values (method path headers body) (read))
    (if (petit-cloud-deploy? method path)
        (let ((application (make-petit-cloud-application body)))
          (petit-cloud-current-application application)
          (values 201 (list) (bytevector)))
        (let ((application (petit-cloud-current-application)))
          (application method path headers body)))))

(petit-cloud-http-connect DOMAIN PORT client-accept)

The procedure make-petit-cloud-application takes a bytevector as argument; the bytevector is read into a symbolic expression that is the same as the library (hello) shown above.

We want to translate the library (hello) into executable code; in other words, to convert data into code.

Turning data into a program can be done with an interpreter.

In the case of Scheme, the obvious choice is the readily available eval.1

Even if there are ways to make it safer, the procedure eval is unsafe: it can have side effects, and other effects with significant security consequences. Do not expose such an application server on the open Internet without careful review by an expert.

As of 2023, R7RS specifies eval as follows:

(eval expr-or-definition environment-specifier)

The form library is neither an expression nor a definition; in most Scheme implementations it is a meta-syntactic device hardcoded into the macro expander, the same component responsible for interpreting macros.

We could rename (hello) and write it to disk, but there is a more straightforward way: a transformation relying on define, set!, and let. The library (hello) can be rewritten into an expression that can be handed to eval to return the procedure main.

The transformation yields the following expression, which we will call module:

(begin
  (define main #f)

  (let ()
    
    (set! main
          (lambda (method path headers body)
            (values 200
                    (list (cons 'content-type "text/plain"))
                    (string->utf8 "Hello schemer!\n")))))

  main)

Then the following snippet will return main:

(eval module (environment '(petit-cloud application)))

That is enough knowledge to implement a simple make-petit-cloud-application.
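
Under those assumptions, here is a minimal sketch of make-petit-cloud-application; the helpers library->module and define->set! are made up for illustration, and the sketch assumes the library exports a single procedure and that every body form is a plain (define name expression):

(import (scheme base) (scheme read) (scheme eval))

;; Sketch only: helper names are hypothetical.
(define (define->set! form)
  ;; (define main (lambda ...)) -> (set! main (lambda ...))
  `(set! ,(cadr form) ,(caddr form)))

(define (library->module sexp)
  ;; sexp is (library (hello) (export main) (import ...) body ...)
  (let ((exports (cdr (list-ref sexp 2)))   ; => (main)
        (body    (list-tail sexp 4)))
    `(begin
       ,@(map (lambda (name) `(define ,name #f)) exports)
       (let ()
         ,@(map define->set! body))
       ,(car exports))))

(define (make-petit-cloud-application bv)
  ;; Read the library text out of the request body, transform it, and
  ;; evaluate it in the application environment to obtain main.
  (let ((sexp (read (open-input-string (utf8->string bv)))))
    (eval (library->module sexp)
          (environment '(petit-cloud application)))))

In a real server you would also want to validate the expression before evaluating it, given the safety warning above.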

Backtracking

We described how to transform a library definition into an expression that can be fed to eval, returning the procedure that will eventually reply to end users.

That is not a complete implementation; more code is necessary to make this outline fit for production. But if you want to use such an application server and self-host it, the code is within reach.


  1. The procedure eval can be implemented as an interpreter or a compiler.

Wednesday, April 19, 2023

Tuesday, April 18, 2023

Andy Wingo

sticking point

Good evening, gentle readers. A brief note tonight, on a sticky place.

See, I have too many projects right now.

In and of itself this is not so much of a problem as a condition. I know my limits; I keep myself from burning out by shedding load, and there is a kind of priority list of which projects keep adequate service levels.

First come the tiny humans that are in my care who need their butts wiped and bodies translated to and from school or daycare and who -- well you know the old Hegelian trope, that the dialectic crank of history doesn't turn itself, that it takes actions from people to synthesize the thesis and the antithesis, and that even History itself isn't always monotonic; in the same way, bedtime is a reality, there are the material conditions of sleepiness and you're-gonna-be-so-tired-tomorrow but without the World-Historical Actor whose name is Dada ain't nobody getting into pyjamas, not to mention getting in bed and going to sleep.

Lower in the priority queue there is work, and work is precious: a whole day, a day of just thinking about things and solving problems; yes, I wish I could choose all of the problems that I work on but the simple fact of being able to think and solve is a gift, or rather a reward, time stolen from the arbitrary chaotic preliterate interruptors. I love my kids -- and here I have to choose my conjunction wisely -- and also they love me. Which is nice! They express this through wanting to sit on my lap and demand that I read to them when I am thinking about things. We all have our love languages, I suppose.

And the house? There are 5 of us now, and that's, you know, more than 5 times the work it was before because the 3 wee ones are more chaotic than the adults. There is so much laundry. We are now a cook-the-whole-500g-bag-of-pasta family, and we're far from the teenager stage.

Finally there are the weird side projects, of which there are a few. I have some sporty things I do that I can't avoid because I am socially coerced into doing them; all in all a fine configuration. I have some volunteer work, which has parts I like that take time, which I am fine with, and parts I don't that I neglect. There's the garden, also neglected but nice to muck about in sometimes. Sometimes I see friends? Rarely, so rarely, but sometimes.

But, netfriends, if I find a free couple hours, I hack. Not less than a couple hours, because otherwise I can't get into things. More? Yes please! But usually no, because priorities.

I write this note because I realized that in the last month or so, hain't been no hacktime; there's barely 30 minutes between the end of the evening kitchen-clean and the onset of zzzz. That's fine, really; ultimately I need to carve out work time for this sort of thing, somehow or other.

But I was also realizing that I painted myself into a corner with my GC work. See, I need some new benchmarks, and I can't afford to use Guile itself as my benchmark yet, because I can't afford to fix deep bugs right now. I need something small. Specifically I need a little language implementation with efficient GC integration. I was looking at porting the little Scheme implementation from my Wasm JIT work to C -- because the rest of whippet is in C -- but is that a good enough reason? I got mired in the swamps of precise GC roots in C and let me tell you, it is disgusting. You don't have RAII like you do in C++, without GCC extensions. You don't have stack maps, like you would like. I might be able to push through but it's the sort of project you need a day for, and those days have not been forthcoming recently.

Anyway, just a little hack log tonight, not detailing a success, but rather a condition. Perhaps observing the position of this logjam will change its velocity. Here's hoping, and until next time, happy hacking!

by Andy Wingo at Tuesday, April 18, 2023

Thursday, April 13, 2023

Gauche Devlog

:immutable slot option

Immutability is far more valued nowadays than when CLOS was designed. Back then, the system basically allowed programmers to change almost everything, assuming they knew what they were doing.

However, specifying something as immutable is not only about protecting users from shooting themselves in the foot; it also communicates with readers. Here, a reader can be a programmer who reads and uses the code years later (possibly the same programmer who wrote it but has forgotten the details), or a program-processing program such as an optimizer that can take advantage of immutability.

CLOS (and Gauche's object system) has enough flexibility to implement immutable slots, but it is somewhat awkward. It's not as simple as having a custom setter that throws an error, because the slot setter is also called by the instance initializer, which runs at instance construction. You would have to distinguish whether the slot setter is invoked during initialization or afterwards, and such dynamic introspection would be costly.

We came up with an alternative mechanism that effectively realizes immutable slots in practical situations, yet does not require distinguishing whether we are inside initialization or not.

If a slot has a true value for the :immutable slot option, the slot can only be initialized once--that is, the setter sets the value if the slot was previously unbound, but throws an error otherwise. If you give the slot an initial value, either with :init-keyword, :init-value, etc., then that one chance to set the value is used within the initializer. Uninitialized immutable slots don't make much sense, so we expect that immutable slots are almost always initialized this way.
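
For illustration, usage might look like the following sketch; the <point> class is made up, and the exact error message may differ:

(define-class <point> ()
  ((x :init-keyword :x :immutable #t)
   (y :init-keyword :y :immutable #t)))

(define p (make <point> :x 1 :y 2))
(slot-ref p 'x)        ; => 1
(slot-set! p 'x 10)    ; => error: the slot has already been initialized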

It is also possible for the initializer to leave the slot unbound, and for the user to later call slot-set! to set it once; this can be viewed as delayed initialization.

(We first named this option init-once, since the slot can only be set once, but changed our mind because it could be confusing.)

Tag: ObjectSystem

Thursday, April 13, 2023