NOTE: This is a guest post by Felix Winkelmann, the founder and one of the current maintainers of CHICKEN Scheme.
Introduction
This article is about the changes in the next major version of CHICKEN Scheme.
The current version is 5.4.0. The next major version, 6.0.0, is mostly implemented. Currently we are adding final touches before preparing the next release. If you are already familiar with CHICKEN, this article will help you get a more detailed view of what has been done. If you are not into CHICKEN or even Scheme, this article may still give you an insight into what's involved in the development and maintenance of a programming language project. You may also find it interesting to know how we address issues like portability and backwards-compatibility. There are also some juicy details on implementation techniques.
No previous knowledge of CHICKEN Scheme is required, but it will help if you are familiar with common terms used in programming language design and Lisp-y languages.
Versions
CHICKEN uses the usual major.minor.patch versioning scheme, where we bump major versions only for significant changes that break compatibility with older versions. CHICKEN has a relatively large community for a non-mainstream language and has lots of contributed extension libraries called "eggs". Breaking backwards compatibility in non-major versions adds implementation effort for users that just want keep their CHICKEN up to date and any eggs they're using should also keep working.
The process of avoiding breakage can sometimes become challenging. We may, for example, provide two different implementations for one and the same thing, to allow adoption of the new feature while keeping old code working. Typically this happens when the new feature is safer, easier to use, faster or more standards compliant than the old. We also remove features in stages. First we mark it as deprecated, and delete it later. This allows users to upgrade their code gradually.
On major version changes, we create a new mirror of the egg "repository", where most of the extensions are stored. Externally contributed eggs can choose when and how to publish an extension for specific versions. This is described in a previous post on this blog.
Major version changes also usually bump the "binary version". This is a suffix to the shared runtime library name to avoid intermixing tools and programs written for one version with the libraries for another.
We started CHICKEN major version 6 to introduce new features that were dearly required but were impossible to add without changing external interfaces, in particular what modules exist and what they export. Specifically, we needed full support for UNICODE strings and compliance with the R7RS (small) language standard.
Both features were already available in a rudimentary manner as external libraries. They were not fully integrated, a fact that showed in various places and could lead to subtle bugs as core and extension code assumed differences in string representation and the behaviour of standard procedures. The same approach was used for integrating the "full numeric tower" (the full set of numeric types, including rationals, big integers and complex numbers), which was formerly a library and was moved into the core system with the 5.0.0 release.
We'll now describe the most noteworthy changes made during this major version transition.
UNICODE
A major shortcoming of CHICKEN up to now was that it didn't support full UNICODE strings. All string data was assumed to consist of 8-bit characters in the ASCII or Latin-1 range. Internationalisation of source code and applications is unavoidable and libraries and operating systems have moved towards UTF-8 as a standard character encoding. So we saw no choice but to find a reasonably efficient way of using the full range of possible characters in all parts of the system.
There is an open-ended space of design choices regarding how to efficiently implement UNICODE strings. The Damocles Sword of unanticipated performance degradation constantly looms over the head of the language implementer, so finding the ideal solution in terms of memory use, speed and simplicity is not easy. Some implementations go to great lengths by providing complex storage schemes or multiple representations depending on a strings' content. In the end, we think a simple representation serves everybody better by being easier to understand and maintain.
Also it is not entirely sure that (say) a dynamic multi-representation approach pays off sufficiently. Too many factors come into play, like string usage patterns in applications, memory management and implementation overhead. Data exchange at the boundaries of operating system and foreign code also have to be taken into account. You want to avoid unnecessary conversions and copying, especially because CHICKEN is geared towards easy interfacing to external libraries and OS functionality.
We decided to use an UTF-8 representation internally. Many operating systems and applications already support UTF-8, which removes the need for costly conversions. UTF-8 is also backwards compatible to ASCII and keeps storage requirements at a minimum.
Since UTF-8 is a multi-byte encoding, character lookup in a string is of linear complexity. Characters have to be scanned when addressing a specific index in a string. To avoid repeatedly scanning the same section of a string when iterating over it, we use a simple cache slot in the string that maps byte-offsets to code point indices and holds the offset/index pair of the last access.
String representation
A quick recap regarding how strings are currently stored in CHICKEN is in order. If you are interested, there's also an older post on this blog with a more detailed explanation of the data representation used by CHICKEN.
Characters are stored as immediates with the special bit pattern 1110 (type code 111):
This gives us enough bits (even on 32-bit platforms) to hold all code points in the Basic Multilingual Plane (code points in the range 32 - 0x1fffff). CHICKEN 5 already supported characters in that range, but strings were still restricted to having 8-bit elements.
Up to CHICKEN 5 a string was represented by a single byteblock:
In CHICKEN 6 we add an indirection:
As you can see, the string itself is a normal block containing a pointer to a byteblock holding the UTF-8 byte sequence and a trailing zero byte. The "Count" slot is a fixnum with the number of code points in the string (the length). "Offset" and "Index" act as a cache for the byte-offset and code point-index of the last access. They are reset to zero if the string changes its length due to a character's width changing at an offset before the cached point.
An indirection is necessary regardless of the other details of the representation: as UTF-8 is a multi-byte encoding, destructively modifying characters in a string may grow or shrink the total byte-sequence. The string must still keep its identity and be indistinguishable from the original string (in terms of the primitive eq? procedure).
One obvious optimisation we can do here is that character accesses for strings where the length ("Count") and the length of the character data (encoded in the header of the byteblock) minus one is equal: then we can simply index by byte offset.
Alternative representations are possible. I believe the following is used in Chibi Scheme: keep a list of "chunks", i.e. multiple byte-offset/code point-index pairs per string. You can traverse this list to obtain the first chunk of string data containing the index you want to access.
In CHICKEN this would probably be represented thus:
This way we can find the offset for the index right at or before the location addressed. This reduces the amount of scanning to the part inside a chunk pointed to by the offset/index pair.
Such an approach of maintaining the offset/index list is more complex and keeping it up to date is more effort than the simple offset/index cache. Should the performance impact of the simpler representation turn out to be too large, the alternative approach may still be an option to try.
This leads me to the subject of benchmarks. We do have a benchmark suite holding a number of micro-benchmarks and CHICKEN 6 has been compared to previous versions using this suite. Nevertheless, the results of micro-benchmarking have to be taken with a grain of salt.
The behaviour of real-world applications may differ considerably depending on the memory consumption, memory address patterns and the type and amount of data processed. Reasoning about performance is especially difficult due to the complex caching mechanisms of contemporary hardware and operating systems. Therefore we will try to use the simplest approach that has a good chance of providing sufficient performance while keeping the implementation effort and complexity at a minimum.
Dealing with the outside world
There's another point we have to address when considering strings that are received from external sources like OS APIs, from files and from the foreign function interface. What to do when encountering invalid UTF-8 byte sequences, for example when reading file or directory names? Signal an error? Convert to some other representation or mark the location in the byte sequence using some special pattern?
We decided to basically ignore the problem. Strings are accepted whether valid or not. Only when decoding a string we distinguish between valid and invalid sequences. When indexing and scanning a string's byte sequence, we return the invalid byte as a "surrogate pair end" code point that has the value 0xDCxx. This approach allows us to store the byte in the lower 8 bits of the code point. When inserting a character in a string that has such a value, we simply copy the byte. As I understand it, this is the approach used by the "surrogateescape" error handler in PEP 383. However, we do this masking every time we decode an unexpected byte (and do the inverse when encoding).
Say we received a string from an external source containing the following byte sequence:
This is invalid UTF-8. Extracting character by character with the described method would yield the following code point values:
Inserting such a surrogate pair end code point will inject the value of the lower byte. For example, calling (string-set! str #\xdcfe), where str is the above invalid utf-8 encoded string would yield the following byte sequence:
The advantage is that we can simply accept and pass strings containing invalid sequences as if nothing happened. We don't need to do any conversions and checks. We can also access items in the character sequence and store them back without having to worry about the validity of the sequence with respect to the UTF-8 encoding. The process is "lossless".
The disadvantage is that we have to perform an additional check when encoding or decoding UTF-8 sequences. Again we decided to reduce copying and transcoding overhead and rather have complete transparency regardless of the source of the string. Furthermore no validation is performed for incoming strings. The R7RS standard procedure utf8->string which converts bytevectors into strings does validation, though to at least ensure that strings created from binary data are always correctly encoded.
A final issue is UNICODE handling on Windows platforms. There, the OS API entry-points that are UNICODE aware use UTF-16 for strings like filenames or the values of environment variables. On Windows there is no choice but to encode and decode from UTF-8 to UTF-16 and back when interfacing with the OS.
Port encodings
When accessing files, we still want to be able to read and write data in an 8-bit encoding or in binary. We may also want to support additional encodings, even though these are currently not provided by default so we'll need to associate an "encoding" property to "ports". Ports are the Scheme objects that provide access to files and other streams of data, like traffic received from a network. The encoding is selected when opening a port using additional arguments to standard procedures that create ports for textual input and output like open-input-port/open-output-port.
Ports internally hold a table of "port methods", roughly similar to how object-oriented programming systems attach behaviour to data objects of a particular class. The port methods in former versions of CHICKEN included the low-level counterpart of the standard operations peek-char, read-char and read-string for input ports and write-char/write-string (among others) for output ports. The major change here is to replace the string I/O methods with methods that act upon bytevectors.
An internal mechanism delegates the operations for encoding and decoding or scanning for the next character in a stream to an internal hook procedure. Additional encodings can be registered by using a built-in procedure to extend the hook. Currently supported are binary, UTF-8 and Latin-1 encodings. Support for a larger set of encodings can be done through extensions and thus can be developed separately from the core system. Port encodings can be accessed using port-encoding. They can also be changed using the SRFI-17 setter for port-encoding, because encodings need sometimes to be changed while the port is still open.
R7RS does not define whether binary I/O standard procedures are allowed to operate on textual ports and vice versa. In CHICKEN we do not restrict the set of operations depending on the port type, so you can read and write bytevectors to and from a textual port and the other way round. We think this is more practical and intuitive - it makes more sense to read and write binary data as bytevectors and have dedicated operations for this.
R7RS support
The second big change for CHICKEN 6 is proper R7RS (small) compliance. Like with the UTF-8 support this was previously provided by an extension library, but using it needed some minor extra steps to set up. Now, all syntactic definitions of the (scheme base) module are available by default (most notably define-library) without requiring any further action.
Deciding what is available by default in a compilation unit or interpretation environment is a bit of a problem: to make it easy to get going when experimenting or scripting, we defaulted to having all standard R5RS procedures and macros available in the base environment of the interpreter (csi), together with the imports from the (chicken base) module. Compiled code was slightly more restricted but defaulted also to R5RS.
In CHICKEN 5 the scheme module referred to the R5RS standard procedures. To avoid breaking too much existing code this is still the case. So now, scheme is an alias for the R7RS (scheme r5rs) library module that exports all bindings of the former language standard. But to avoid duplicating the set of exported identifiers over several core modules, certain functionality has been moved from (chicken base) to (scheme base) and is thus not available in the default toplevel environment.
To make a long story short, the switch makes it necessary to access certain standard bindings like open-input-string by importing additional modules like (scheme base). This is not necessarily a bad thing, as it incentivises the user to write code in a more standard compliant way. But it all feels a bit clunky and may have been a suboptimal choice. Note that just adding an (import (scheme base)) is usually enough to make existing code run. We will see how that works out.
All required (scheme ...) modules are implemented and can be imported using their full functionality. Two notable changes that influence backwards compatibility are generative record types and hexadecimal escape sequences in string literals.
Formerly record types defined with define-record-type were non-generative: re-defining a record type with the same name, even if the type definition is completely different, would not create a new type. Instances of the former definition would also be of the newly defined type, provided the type name is the same. Now every definition of a record type creates a completely new type, regardless of the name. This is of course more correct and much safer as it doesn't invalidate previously defined instances.
A second (and much more annoying) change is that R7RS requires hex escape sequences in string literals to be terminated by a semicolon. Unfortunately the change is incompatible to the convention used in most programming languages, including existing many Lisp and Scheme implementations.
What in CHICKEN 5 would looked like this:
"the color \x1b[31mRED\x1b[0m"
in CHICKEN 6 (and R7RS) must now be (note the semicolon):
"the color \x1b;[31mRED\x1b;[0m"
The motivation here was probably to avoid having dedicated escape sequences for values that require more than 2 hex digits (e.g. \uNNNN). The change is particularly bad from a compatibility point of view. All string literals that contain the old style of escape sequences must be changed. To keep code working in both CHICKEN 5 and 6 you can use the 4-digit \uNNNN escape sequence which is still valid in all versions.
Foreign Function Interface changes
CHICKEN has a very easy to use foreign function interface, which mostly derives from the fact that the compiler generates C. The alternative approach are binary FFIs that use native code to interface with C code, like libffi, which must reproduce a lot of ABI details to safely interface with C libraries, things like struct alignment and padding, how arguments of various lengths are passed on the stack, etc.
The most notable FFI-related change for CHICKEN 6 is that it allows passing and returning C structs and unions to and from foreign code by value. The contents are copied in and out of bytevectors and wrapped in a block. The components of the struct or union can not be directly manipulated in Scheme but can be passed on to other foreign functions. Additionally, for completeness, you can now also directly pass C99 complex numbers. Note that complex numbers in the FFI are always passed as inexact (i.e., floating-point), as opposed to Scheme complex numbers that may have exact (integer or even rational) real and imaginary components.
Here two major changes have been introduced. First, we now have a proper configuration ("configure") script. Formerly, all build parameters were passed as extra arguments to make(1), like this:
make PREFIX=$HOME/.local ...
This required that for every make(1) invocation, the same set of parameters had to be given to avoid inconsistencies. A configuration script written in portable POSIX sh(1) notation is now provided to perform the configuration step once before building. It also follows the de facto convention used in many GNU programs, where the usual incantation is:
./configure; make; make install
Note that we don't use the dreaded "autotools" (autoconf, automake and libtool), which have arcane syntax, are very brittle and produce more problems that they are trying to solve. They were originally designed to port code to now dead or obscure UNIX derivatives, yet fail to provide a straightforward and easy to use configuration tool for modern systems (Linux/BSD, Mac, Windows, mostly). Our configure script is hand written instead of auto-generated, and only checks for the platform differences that are relevant to those platforms that CHICKEN actually supports.
The old style of passing variables to make(1) should still work, but is deprecated.
The second change in the build system is that we cleaned up the confusion about toolchains on Windows systems. There is a bunch of half-maintained flavors of GNU-based C development environments for Windows systems ("MingGW", "MingGW-w64", "MSYS2", etc.) and it is a constant source of pain to support and test all these variants.
There is now one official "blessed" toolchain that we support. Specifically, we recommend Chris Wellon's excellent w64devkit. It contains compilers, build tools and enough of a POSIX environment for building CHICKEN on Windows. We also require a POSIX sh(1) (contained in w64devkit) and have dropped all support for building on a non-POSIX shell, i.e. cmd.exe. This simplifies the build and package management tools considerably. It also ensures we have less environments to test.
Building for Windows Subsystem for Linux (WSL) and Cygwin is of course still supported, but those use the UNIX build setup and need no or very little specific platform support.
Minor changes
Quite a number of minor changes have taken place that either increase the robustness or standards compliance of the system.
syntax-error is now a macro, as required by R7RS. Previously there was a syntax-error procedure in (chicken syntax) with the same argument signature. So where the error was previously raised at runtime, it is now done at expansion time. This is something to take note of when porting code to CHICKEN 6. The invocation is still the same, so the difference can at least be identified easily and corrected, and (scheme base) needs to be imported, of course.
The csc compiler driver passes arguments now properly to subprocesses via execve(2) and not system(3) which reduces shell quoting headaches.
The chicken-install package ("egg") manager locks the build cache directory to avoid conflicts when running multiple installations in parallel. Also, a custom-config feature has been added to places in the package description (.egg) file that specify compile and link options provided by external tools like pkg-config for more portable configuration of native libraries that packages use. The configuration script is expected to be Scheme code. Other eggs can extend the set of possible customisation facilities by providing library code to access them.
The feathers debugger has been removed from the core system and re-packaged as an egg, allowing separate development or replacement. It always was a bit of a proof-of-concept thing that lacks the robustness and flexibility of a real debugger. Some users found it helpful, so we keep it for the time being.
Future directions
Every major release is a chance of fixing long-standing problems with the codebase and address bad design decisions. CHICKEN is now nearly 25 years old and we had many major overhauls of the system. Sometimes these caused a lot of pain, but still we always try to improve things and hopefully make it more enjoyable and practical for our users. There are places in the code that are messy, too complex, or that require cleanup or rewrite, always sitting there waiting to be addressed. On the other hand CHICKEN has been relatively stable compared to many other language implementations and has a priceless community of users that help us improving it. Our users never stop reminding us of what could be better, where the shortcomings are, where things are hard to use or inefficient.
So the final goal is and will always be to make it more robust and easier to use. Performance improvements are always good but are secondary. Strong standards compliance is a requirement, unless it interferes with practicality. We also try to avoid dependencies in the core system at all costs. This eases porting and avoids friction that is very often introduced by inadequate tools or overdesigned development environments.
Some moody ramblings about Scheme standards
The switch to seamless support for R7RS (small) was due for quite a while. R7RS is by now the generally accepted Scheme standard that implementations try to follow. How the further development of R7RS (large) will turn out remains to be seen, but I'm not very confident that it will result in anything more that just a shapeless agglomeration of features. The current direction seems to be oriented towards creating something that includes everything but the "kitchen-sink" - too ambitious and too much driven by the urge to produce something big and comprehensive.
What always distinguished Scheme from other languages and Lisp dialects was its elegance and minimalism. This resulted in a large number of experimental implementations, the investigation of various directions of programming language semantics and in highly interesting advanced implementation techniques that are still in use today. The expressiveness and the small number of core concepts made it also perfect for teaching computing. It still is great for teaching, even if the tendency to address perceived requirements of the "market" replaces the academic use of Scheme with languages that belong more to the "mainstream". This strikes me as strange, as if learning multiple languages and studying different programming paradigms couldn't help in obtaining a broader view of software development; or as if one is somehow wasting time by exploring the world of programming in a non-mainstream language.
Repeatedly the small size and limited scope of Scheme has driven a number of enthusiasts dedicated to favouring a broader use in the industry to demand "bigness". They dream of a comprehensive standard that will support everything and please everyone and makes it easy to write portable and non-trivial applications in Scheme and have them run on large number of implementations. With this comes the tacit expectation that implementers will just follow such a large standard and implement it faithfully.
But what made Scheme successful (and beautiful) was the small size, the small number of powerful concepts, the minimalism, the joy in experimentation. Trying to create the Comprehensive Mother of all Schemes is in my opinion a waste of time. In fact, it makes Scheme less relevant. Large languages inevitably die - the warts and inconsistencies that crop up when you try to design too much up front will just get more blatant as the language ages and will annoy users, constrain implementers and make people search for alternatives.
You can already write serious programs and use a large number of libraries in Scheme: just install CHICKEN, Guile or (even) Racket and get going. The code will to a certain part not be portable across implementations, that's unavoidable, but to use dynamic languages effectively and successfully, some experience with and a certain understanding of the design of the underlying compiler or interpreter is necessary anyway.
Acknowledgements
I would like to thank my employer, bevuta IT GmbH which sponsored part of the effort of preparing a new major release of CHICKEN.