
Disnix 0.3 release announcement

In the previous blog post, I have explained the differences between Disnix (which does service deployment) and NixOps (which does infrastructure deployment), and shown that both tools can be used together to address both concerns in a deployment process.

Furthermore, I raised a couple of questions and intentionally left one question unmentioned: "Is Disnix still alive, or is it dead?".

The answer is that Disnix's development was progressing at a very low pace for some time after I left academia -- I made minor changes once in a while, but nothing really interesting happened.

However, for the last few months I have been using it on a daily basis and have made many big improvements. Moreover, I have reached a stable point and decided that this is a good moment to announce the next release!

New features


So what is new in this release?

Visualization tool


I have added a new tool to the Disnix toolset named disnix-visualize, which generates Graphviz images visualizing a particular deployment scenario. An example image is shown below:


The above picture shows a deployment scenario of the StaffTracker Java example in which services are divided over two machines in a network and have all kinds of complex dependency relationships as denoted by the arrows.

The tool was already included in the development versions for a while, but has never been part of any release.
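
The sketch below shows how such a picture could be generated, assuming that disnix-visualize accepts a manifest file and writes Graphviz dot code to the standard output, which can then be rendered with the dot tool:


$ manifest=$(disnix-manifest -s services.nix -i infrastructure.nix \
-d distribution.nix)
$ disnix-visualize $manifest > deployment.dot
$ dot -Tpng deployment.dot > deployment.png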

Dysnomia


I also made a major change in Disnix's architecture. As explained in my old blog post about Disnix, activating and deactivating services cannot be done generically and I have developed a plugin system to take care of that.

This plugin system package (formerly known as the Disnix activation scripts) has now become an independent tool named Dysnomia, which can be applied in other contexts and also be used as a standalone tool.

For example, a running MySQL DBMS instance (called a container in Dysnomia terminology) could be specified in a configuration file (such as ~/mysql-production) as follows:


type=mysql-database
mysqlUsername=root
mysqlPassword=verysecret

A database can be encoded as an SQL file (~/test-database/createdb.sql) creating the schema:


create table author
( AUTHOR_ID INTEGER NOT NULL,
  FirstName VARCHAR(255) NOT NULL,
  LastName VARCHAR(255) NOT NULL,
  PRIMARY KEY(AUTHOR_ID)
);

create table books
( ISBN VARCHAR(255) NOT NULL,
  Title VARCHAR(255) NOT NULL,
  AUTHOR_ID VARCHAR(255) NOT NULL,
  PRIMARY KEY(ISBN),
  FOREIGN KEY(AUTHOR_ID) references author(AUTHOR_ID)
    on update cascade on delete cascade
);

We can use the following command-line instruction to let Dysnomia deploy the database to the MySQL DBMS container specified earlier:


$ dysnomia --operation activate --component ~/test-database \
--container ~/mysql-production

When Disnix has to execute deployment operations, two external tools are consulted -- Nix takes care of all deployment operations of the static parts of a system, and Dysnomia takes care of performing the dynamic activation and deactivation steps.

Concurrent closure transfers


In previous versions of Disnix, only one closure (of a collection of services and their intra-dependencies) is transferred to a target machine at a time. If a target machine has more network bandwidth than the coordinator, this is usually fine, but in all other cases it slows the deployment process down.

In the new version, two closures are transferred concurrently by default. The number of concurrent closure transfers can be adjusted as follows:


$ disnix-env -s services.nix -i infrastructure.nix \
-d distribution.nix --max-concurrent-transfers 4

The last command-line argument states that 4 closures should be transferred concurrently.

Concurrent service activation and deactivation


In the old Disnix versions, the activation and deactivation steps of the services on a target machine were executed sequentially, i.e. one service per machine at a time. In all my old test cases these steps were quite cheap/quick, but now that I have encountered much bigger systems, I noticed that there is a lot of deployment time we can save.

In the new implementation, Disnix concurrently activates or deactivates one service per machine by default. The number of services that can be concurrently activated or deactivated per machine can be raised in the infrastructure model:


{
  test1 = {
    hostname = "test1";
    numOfCores = 2;
  };
}

In the above infrastructure model, the numOfCores attribute states that two services can be activated/deactivated concurrently on machine test1. If this attribute is omitted, it defaults to 1.

Multi-connection protocol support


By default, Disnix uses an SSH protocol wrapper to connect to the target machines. There is also an extension available, called DisnixWebService, that uses SOAP + MTOM instead.

In the old version, changing the connection protocol meant that every target machine had to be reached with it. In the new version, you can also specify the target property and client interface per machine in the infrastructure model to support multi-connection protocol deployments:


{
  test1 = {
    hostname = "test1";
    targetProperty = "hostname";
    clientInterface = "disnix-ssh-client";
  };

  test2 = {
    hostname = "test2";
    targetEPR = http://test2:8080/DisnixWebService/services/DisnixWebService;
    targetProperty = "targetEPR";
    clientInterface = "disnix-soap-client";
  };
}

The above infrastructure model states the following:

  • To connect to machine test1, the hostname attribute contains the address and the disnix-ssh-client tool should be invoked to connect to it.
  • To connect to machine test2, the targetEPR attribute contains the address and the disnix-soap-client tool should be invoked to connect to it.

NixOps integration


As described in my previous blog post, Disnix does service deployment and can integrate NixOS' infrastructure deployment features with an extension called DisnixOS.

DisnixOS can now also be used in conjunction with NixOps -- NixOps can be used to instantiate and deploy a network of virtual machines:


$ nixops create ./network.nix ./network-ec2.nix -d ec2
$ nixops deploy -d ec2

and DisnixOS can be used to deploy services to them:


$ export NIXOPS_DEPLOYMENT=ec2
$ disnixos-env -s services.nix -n network.nix -d distribution.nix \
--use-nixops

Omitted features


There are also a couple of features described in some older blog posts, papers, and my PhD thesis, which have not become part of the new Disnix release.

Dynamic Disnix


This is an extended framework built on top of Disnix supporting self-adaptive redeployment. Although I promised to make it part of the new release a long time ago, it did not happen. However, I did update the prototype to work with the current Disnix implementation, but it still needs refinements, documentation and other small things to make it usable.

Brave people who are eager to try it can pull the Dynamic Disnix repository from my GitHub page.

Snapshot/restore features of Dysnomia


In a paper I wrote about Dysnomia I also proposed state snapshotting/restoring facilities. These have not become part of the released versions of Dysnomia and Disnix yet.

The approach I have described is useful in some scenarios, but it also has a couple of very big drawbacks and it significantly alters the behaviour of Disnix. I need to find a way to properly integrate these features in such a way that they do not break the standard approach and can be applied selectively.

Conclusion


In this blog post, I have announced the availability of the next release of Disnix. Perhaps I should give it the codename: "Disnix Forever!" or something :-). Also, the release date (Friday the 13th) seems to be appropriate.

Moreover, the previous release was considered an advanced prototype. Although I am now using Disnix on a daily basis to eat my own dogfood, and the toolset has become much more usable, I would not yet classify this release as very mature.

Disnix can be obtained by installing NixOS, through Nixpkgs or from the Disnix release page.

I have also updated the Disnix homepage a bit, which should provide you with more information.


An evaluation and comparison of Snappy Ubuntu

A few months ago, I noticed that somebody was referring to my "On Nix and GNU Guix" blog post from the Ask Ubuntu forum. The person who started the topic wanted to know how Snappy Ubuntu compares to Nix and GNU Guix.

Unfortunately, he did not read my blog post (or possibly one of the three Nix and NixOS explanation recipes) in detail.

Moreover, I was hoping that somebody else involved with Snappy Ubuntu would do a more in-depth comparison and write a response, but this has still not happened. As a matter of fact, there is still no answer as of today.

Because of these reasons, I have decided to take a look at Snappy Ubuntu Core and do an evaluation myself.

What is Snappy Ubuntu?


Snappy is Ubuntu's new mechanism for delivering applications and system upgrades. It is used as the basis of their upcoming cloud and mobile distributions and is supposed to be offered alongside the Debian package manager that Ubuntu currently uses for installing and upgrading software in their next generation desktop distribution.

Besides the ability to deploy packages, Snappy also has a number of interesting non-functional properties. For example, the website says the following:

The snappy approach is faster, more reliable, and lets us provide stronger security guarantees for apps and users -- that's why we call them "snappy" applications.

Snappy apps and Ubuntu Core itself can be upgraded atomically and rolled back if needed -- a bulletproof approach that is perfect for deployments where predictability and reliability are paramount. It's called "transactional" or "image-based" systems management, and we’re delighted to make it available on every Ubuntu certified cloud.

The text listed above contains a number of interesting quality aspects that have a significant overlap with Nix -- reliability, atomic upgrades and rollbacks, predictability, and being "transactional" are features that Nix also implements.

Package organization


The Snappy Ubuntu Core distribution uses a mostly FHS-compliant filesystem layout. One notable deviation is the folder in which applications are installed.

For application deployment the /apps folder is used, in which the files belonging to a specific application version reside in separate folders. Application folders use the following naming convention:


/apps/<name>/<version>[.<developer>]

Each application is identified by its name, version identifier and optionally a developer identifier, as shown below:


$ ls -l /apps
drwxr-xr-x 2 root ubuntu 4096 Apr 25 20:38 bin
drwxr-xr-x 3 root root 4096 Apr 25 15:56 docker
drwxr-xr-x 3 root root 4096 Apr 25 20:34 go-example-webserver.canonical
drwxr-xr-x 3 root root 4096 Apr 25 20:31 hello-world.canonical
drwxr-xr-x 3 root root 4096 Apr 25 20:38 webcam-demo.canonical
drwxr-xr-x 3 root ubuntu 4096 Apr 23 05:24 webdm.sideload

For example, /apps/webcam-demo.canonical/1.0.1 refers to a package named webcam-demo, version 1.0.1, that is delivered by Canonical.

There are almost no requirements on the contents of an application folder. I have observed that the example packages seem to follow some conventions though. For example:


$ cd /apps/webcam-demo.canonical/1.0.1
$ find . -type f ! -iname ".*"
./bin/x86_64-linux-gnu/golang-static-http
./bin/x86_64-linux-gnu/fswebcam
./bin/golang-static-http
./bin/runner
./bin/arm-linux-gnueabihf/golang-static-http
./bin/arm-linux-gnueabihf/fswebcam
./bin/fswebcam
./bin/webcam-webui
./meta/readme.md
./meta/package.yaml
./lib/x86_64-linux-gnu/libz.so.1
./lib/x86_64-linux-gnu/libc.so.6
./lib/x86_64-linux-gnu/libX11.so.6
./lib/x86_64-linux-gnu/libpng12.so.0
./lib/x86_64-linux-gnu/libvpx.so.1
...
./lib/arm-linux-gnueabihf/libz.so.1
./lib/arm-linux-gnueabihf/libc.so.6
./lib/arm-linux-gnueabihf/libX11.so.6
./lib/arm-linux-gnueabihf/libpng12.so.0
./lib/arm-linux-gnueabihf/libvpx.so.1
...

Binaries are typically stored inside the bin/ sub folder, while libraries are stored inside the lib/ sub folder. Moreover, the above example also ships binaries for two system architectures (x86_64 and ARM) that reside inside the bin/x86_64-linux-gnu, bin/arm-linux-gnueabihf, lib/x86_64-linux-gnu and lib/arm-linux-gnueabihf sub folders.

The only sub folder that has a specific purpose is meta/, which is supposed to contain at least two files -- the readme.md file contains documentation in which the first heading and the first paragraph have a specific meaning, and the package.yaml file contains various meta attributes related to the deployment of the package.

Snappy's package storing convention also makes it possible to store multiple versions of a package next to each other, as shown below:


$ ls -l /apps/webcam-demo.canonical
drwxr-xr-x 7 clickpkg clickpkg 4096 Apr 24 19:38 1.0.0
drwxr-xr-x 7 clickpkg clickpkg 4096 Apr 25 20:38 1.0.1
lrwxrwxrwx 1 root root 5 Apr 25 20:38 current -> 1.0.1

Moreover, every application folder contains a symlink named: current/ that refers to the version that is currently in use. This approach makes it possible to do atomic upgrades and rollbacks by merely flipping the target of the current/ symlink. As a result, the system always refers to an old or new version of the package, but never to an inconsistent mix of the two.
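
The underlying idiom is not specific to Snappy: on a POSIX filesystem, a symlink can be replaced atomically by creating it under a temporary name and renaming it over the old one. A minimal sketch (hypothetical; this is not necessarily how Snappy implements it):

$ cd /apps/webcam-demo.canonical
$ ln -s 1.0.1 current.new
$ mv -T current.new current   # rename() replaces the old symlink atomically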

Apart from storing packages in isolation, they also must be made accessible to end users. For each binary that is declared in the package.yaml file, e.g.:


# for debugging convenience we also make the binary available as a command
binaries:
- name: bin/webcam-webui

a wrapper script is placed inside /apps/bin that is globally accessible by the users of a system through the PATH environment variable.

Each wrapper script contains the app name. For example, the webcam-webui binary (shown earlier) must be started as follows:


$ webcam-demo.webcam-webui
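
Such a wrapper presumably boils down to something like the following sketch (hypothetical; the actual generated script may differ and, for example, also set up the container):

#!/bin/sh
# /apps/bin/webcam-demo.webcam-webui (sketch)
exec /apps/webcam-demo.canonical/current/bin/webcam-webui "$@"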

Besides binaries, package configurations can also declare services from which systemd jobs are composed. The corresponding configuration files are put into the /etc/systemd/system folder and also use a naming convention containing the package name.

Unprivileged users can also install their own packages. The corresponding application files are placed inside $HOME/apps and are organized in the same way as the global /apps folder.

Snappy's package organization has many similarities with Nix's package organization -- Nix also stores files belonging to a package in isolated folders inside a special purpose directory called the Nix store.

However, Nix uses a more powerful way of identifying packages. Whereas Snappy only identifies packages with their names, version numbers and vendor identifiers, Nix package names are prefixed with unique hash codes (such as /nix/store/wan65mpbvx2a04s2s5cv60ci600mw6ng-firefox-with-plugins-27.0.1) that are derived from all build time dependencies involved to build the package, such as compilers, libraries and the build scripts themselves.

The purpose of using hash codes is to make a distinction between any variant of the same package. For example, when a package is compiled with a different version of GCC, linked against a different library dependency, when debugging symbols are enabled or disabled or certain optional features enabled or disabled, or the build procedure has been modified, a package with a different hash is built that is safely stored next to existing variants.

Moreover, Nix also uses symlinking to refer to specific versions of packages, but the corresponding mechanism is more powerful. Nix generates so-called Nix profiles, which synthesize the contents of a collection of installed packages in the Nix store into a symlink tree, so that their files (such as executables) can be referenced from a single location. A second symlink indirection refers to the Nix profile containing the desired versions of the packages.

Nix profiles also allow unprivileged users to manage their own set of private packages that do not conflict with other users' private packages or the system-wide installed packages. Moreover, partly because of Nix's package naming convention, the packages of unprivileged users can be safely stored in the global Nix store as well, so that common dependencies can be shared among users in a secure way.
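
For example, an unprivileged user can install, upgrade and roll back packages in a private profile with ordinary nix-env invocations, while the package contents live in the shared Nix store:

$ nix-env -i firefox          # installs into the invoking user's profile
$ nix-env --list-generations  # shows that profile's generations
$ nix-env --rollback          # switches back to the previous generation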

Dependency management


Software packages are rarely self contained -- they typically have dependencies on other packages, such as shared libraries. If a dependency is missing or incorrect, a package may not work properly or not at all.

I observed that in the Snappy example packages, all dependencies are bundled statically. For example, in the webcam-demo package, the lib/ sub folder contains the following files:


./lib/x86_64-linux-gnu/libz.so.1
./lib/x86_64-linux-gnu/libc.so.6
./lib/x86_64-linux-gnu/libX11.so.6
./lib/x86_64-linux-gnu/libpng12.so.0
./lib/x86_64-linux-gnu/libvpx.so.1
...
./lib/arm-linux-gnueabihf/libz.so.1
./lib/arm-linux-gnueabihf/libc.so.6
./lib/arm-linux-gnueabihf/libX11.so.6
./lib/arm-linux-gnueabihf/libpng12.so.0
./lib/arm-linux-gnueabihf/libvpx.so.1
...

As can be seen in the above output, all the library dependencies, including the libraries' dependencies (even libc) are bundled into the package. When running an executable or starting a systemd job, a container (essentially an isolated/restricted environment) is composed in which the process runs (with some restrictions) where it can find its dependencies in the "common FHS locations", such as /lib.

Besides static bundling, there seems to be a primitive mechanism that provides some form of sharing. According to the packaging format specification, it is also possible to declare dependencies on frameworks in the package.yaml file:


frameworks: docker, foo, bar # list of frameworks required

Frameworks are managed like ordinary packages in /app, but they specify additional required system privileges and require approval from Canonical to allow them to be redistributed.

Although it is not fully clear to me from the documentation how these dependencies are addressed, I suspect that the contents of the frameworks are made available to packages inside the containers in which they run.

Moreover, I noticed that dependencies are only addressed by their names and that they refer to the current versions of the corresponding frameworks. In the documentation, there seems to be no way (yet) to refer to other versions or variants of frameworks.

The Nix-way of managing dependencies is quite different -- Nix packages are constructed from source and the corresponding build procedures are executed in isolated environments in which only the specified build-time dependencies can be found.

Moreover, when constructing Nix packages, runtime dependencies are bound statically to executables, for example by modifying the RPATH of an ELF executable or by wrapping executables in scripts that set environment variables (such as CLASSPATH or PERL5LIB) allowing them to find their dependencies. A subset of the build-time dependencies is identified by Nix as runtime dependencies by scanning for hash occurrences in the build result.
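
This static binding can be observed on any ELF executable deployed by Nix. For example, the following command (a sketch, assuming that the GNU hello package from Nixpkgs is installed) prints an RPATH that points straight into the Nix store:

$ patchelf --print-rpath $(readlink -f $(which hello))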

Because dependencies are statically bound to executables, there is no need to compose containers to allow executables to find them. Furthermore, a package can refer to different versions or variants of its library dependencies without conflicting with other packages' dependencies. Sharing is still supported, because two packages can refer to the same dependency with the same hash prefix in the Nix store.

As a sidenote: with Nix you can also use a containerized approach by composing isolated environments (e.g. a chroot environment or container) in which packages can find their dependencies from common locations. A prominent Nix package that uses this approach is Steam, because it is basically a deployment tool conflicting with Nix's deployment properties. Although such an approach is also possible, it is only used in very exceptional cases.

System organization


Besides applications and frameworks, the base system of the Snappy Ubuntu Core distribution can also be upgraded and downgraded. However, a different mechanism is used to accomplish this.

According to the filesystem layout & updates guide, the Snappy Ubuntu Core distribution follows a specific partition layout:

  • boot partition. This is a very tiny partition used for booting and should be big enough to contain a few kernels.
  • system-a partition. This partition contains a minimal working base system. This partition is mounted read-only.
  • system-b partition. An alternative partition containing a minimal working base system. This partition is mounted read-only as well.
  • writable partition. A writable partition that stores everything else including the applications and frameworks.

Snappy uses an "A/B system partitions mechanism" to allow a base system to be updated as a single unit by applying a new system image. It is also used to roll back to the "other" base system in case of problems with the most recently-installed system by making the bootloader switch root filesystems.

NixOS (the Linux distribution built around the Nix package manager) approaches system-level upgrades in a different way and is much more powerful. In NixOS, a complete system configuration is composed from packages residing in isolation in the Nix store (like ordinary packages) and these are safely stored next to existing versions. As a result, it is possible to roll back to any previous system configuration that has not been garbage collected yet.
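
For example, the following commands (run with sufficient privileges on a NixOS machine, and assuming the previous generation has not been garbage collected yet) list the available system generations and roll the whole system back to the previous one:

$ nix-env --list-generations -p /nix/var/nix/profiles/system
$ nixos-rebuild switch --rollback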

Creating packages


According to the packaging guide, creating Snap files is very simple. It is just creating a directory, putting some files in there, creating a meta/ sub folder with a readme.md and package.yaml file, and running:


$ snappy build .

The above command generates a Snap file, which is basically just a tarball containing the contents of the folder.

In my opinion, creating Snap packages is not that easy -- the above process demonstrates that delivering files from one machine to another is straightforward, but getting a package right is another thing.

Many packages on Linux systems are constructed from source code. To properly do that, you need to have the required development tools and libraries deployed first, a process that is typically easier said than done.

Snappy does not provide facilities to make that process manageable. With Snappy, it is the packager's own responsibility.

In contrast, Nix is a source package manager and provides a DSL that somebody can use to construct isolated environments in which builds are executed, and it automatically deploys all build-time dependencies that are required to build a package.

The build facilities of Nix are quite accessible. For example, you can easily construct your own private set of Nix packages or a shell session containing all development dependencies.
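
For instance, the following command (assuming a reasonably recent Nix installation) drops you into a temporary shell session in which GCC and GNU Make are available, without installing them into any profile:

$ nix-shell -p gcc gnumake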

Moreover, Nix also implements transparent binary deployment -- if a particular Nix package with an identical hash exists elsewhere, we can download it from a remote location instead of building it from source ourselves.

Isolation


Another thing the Snappy Ubuntu Core distribution does with containers (besides using them to let a package find its dependencies) is restricting the things programs are allowed to do, such as the TCP/UDP ports they are allowed to bind to.

In Nix and NixOS, it is not a common practice to restrict the runtime behaviour of programs by default. However, it is still possible to impose restrictions on running programs, by composing a systemd job for a program yourself in a system's NixOS configuration.

Overview


The following table summarizes the conceptual differences between the Snappy Ubuntu Core and Nix/NixOS covered in this blog post:

                                  Snappy Ubuntu Core              Nix/NixOS

Concept                           Binary package manager          Source package manager (with
                                                                  transparent binary deployment)
Dependency addressing             By name                         Exact (using hash codes)
Dependency binding                Container composition           Static binding (e.g. by modifying
                                                                  RPATH or wrapping executables)
Systems composition management    "A/B" partitions                Package compositions
Construction from source          Unmanaged                       Managed
Unprivileged user installations   Supported without sharing       Supported with sharing
Runtime isolation                 Part of package configuration   Supported optionally, by manually
                                                                  composing a systemd job

Discussion


Snappy shares some interesting traits with Nix that provide a number of huge deployment benefits -- by deviating from the FHS and storing packages in isolated folders, it becomes easier to store multiple versions of packages next to each other and to perform atomic upgrades and rollbacks.

However, something that I consider a huge drawback of Snappy is the way dependencies are managed. In the Snappy examples, all library dependencies are bundled statically consuming more disk space (and RAM at runtime) than needed.

Moreover, packaging shared dependencies as frameworks is not very convenient and requires approval from Canonical if they are to be distributed. As a consequence, I do not think this will encourage people to modularize systems, which is generally considered a good practice.

According to the framework guide, the purpose of frameworks is to extend the base system, not to serve as a sharing mechanism. The guide also says:

Frameworks exist primarily to provide mediation of shared resources (eg, device files, sensors, cameras, etc)

So it appears that sharing in general is discouraged. In many common Linux distributions (including Debian and derivatives such as Ubuntu), it is common to raise the degree of reuse to almost a maximum. For example, each library is packaged individually and sometimes libraries are even split into binary, development and documentation sub packages. I am not sure how Snappy is going to cope with such a fine granularity of reuse. Is Snappy going to be improved to support reuse as well, or is it now considered a good thing to package huge monolithic blobs?

Also, Snappy only does binary deployment and does not really help to alleviate the problem of constructing packages from source, which is also quite a challenge in my opinion. I see lots of room for improvement in this area as well.

Another funny observation is the fact that Snappy Ubuntu Core relies on advanced concepts such as containers to make programs work, while there are also simpler solutions available, such as static linking.

Finally, something that Nix/NixOS could learn from the Snappy approach is the out-of-the-box runtime isolation of programs. Currently, doing this with Nix/NixOS is not as convenient as with Snappy.

References


This is not the only comparison I have done between Nix/NixOS and another deployment approach. A few years ago while I was still a PhD student, I also did a comparison between the deployment properties of GoboLinux and NixOS.

Interestingly enough, GoboLinux addresses packages in a similar way as Snappy, supports sharing, does not provide runtime isolation of programs, but does have a very powerful source construction mechanism that Snappy lacks.

Emulating Amiga display modes

A while ago, I wrote a blog post about my IFF file format experiments, in which I developed a collection of libraries and tools capable of parsing and displaying two IFF file format applications. Some time before that, I had developed a Nix function capable of building software for AmigaOS, allowing me to easily backport these experimental software packages to AmigaOS.

The development of these experimental packages was dormant for quite some time, but recently I was able to find some time to make some improvements. This time I have updated the Amiga video emulation library to support a few new features and to more accurately emulate an Amiga display.

Differences between Amiga and PC displays


So why is emulation of Amiga video modes necessary? Emulation of the Amiga graphics hardware is required because pixels are encoded differently than on "modern" PC hardware, resolutions have different semantics, and the Amiga hardware has a few special screen modes to "squeeze" more colors out of the available color registers.

Pixels


Encoding pixels on modern hardware is not so complicated these days. Modern hardware allows us to address each individual pixel's color value using 4 bytes per pixel in which one byte represents the red color intensity, one byte the blue color intensity and one byte the green color intensity. The fourth byte is often unused, or serves as the alpha component setting the transparency of a pixel in multi-layered images.

However, older PC hardware, such as those used during the DOS era, were often not powerful enough to address every pixel's color value individually, mainly due to memory constraints. Many classic DOS games used a 320x200 video resolution with a color palette consisting of 256 color values. Each pixel was encoded as a byte referring to an index value of a palette. This encoding is also known as chunky graphics.

System resources in the Amiga era were even more scarce, especially during its launch in 1985. The original graphics chipset was only capable of storing 32 color values in a palette, although there were a few special screen modes capable of squeezing more colors out of the 32 configurable ones. It also used a different way of encoding pixels, probably to make storage of graphics surfaces as memory efficient as possible.

In Amiga chipsets, pixels are encoded as bitplanes rather than bytes. When using bitplane encoding, an image is stored multiple times in memory. In every image occurrence, a bit represents a pixel. In the first occurrence, a bit is the least significant bit of an index value. In the last occurrence, a bit is the most significant bit of an index value. By adding all the bits together we can determine the index into the palette that determines a pixel's color value. For example, if we encode an image using 16 colors, we need 4 bitplane surfaces; a pixel whose bits are 1, 0, 1 and 1 in bitplanes 0 through 3 refers to palette index 1 + 0·2 + 1·4 + 1·8 = 13.

To be able to display an Amiga image on a PC we have to convert the bitplane format to either chunky graphics or RGB graphics. Likewise, to be able to display a PC image on an Amiga, we have to convert the pixel surface to bitplane format.

Palette


As explained earlier, due to memory constraints, Amiga graphics use a palette of 32 configurable color values (or 256 colors when the newer AGA chipset is used). Each pixel refers to an index of a palette instead of a color value of its own.

The original chipset color values are stored in 16-bit color registers in which every color component consists of 4 bits (4 bits are unused). The newer AGA chipset as well as VGA graphics use 8 bits per color component.


Furthermore, the Amiga video chips have special screen modes to squeeze more colors out of the available color registers. The Extra Half Brite (EHB) screen mode is capable of displaying 64 colors out of a predefined 32, in which the last 32 color values have half the intensity of the first 32 colors. The above screenshot is an example picture included with Deluxe Paint V using the EHB screen mode. A closer look at the floor in the picture may reveal that an EHB palette is used.

The Hold-and-Modify (HAM) screen mode is used to modify a color component of the previous adjacent pixel or to pick a new color from the given palette. This screen mode makes it possible to use all possible color values (4096) in one screen with some loss of image quality.


The above screenshot is an example image included with Deluxe Paint V that uses the HAM screen mode to utilise many more colors than the number of available color registers. A closer look at the image may reveal that it has some quality loss because of the limitations of HAM compression.

To be able to properly display an image on a PC we need to convert 12-bit color values to 32-bit color values. Moreover, we also have to calculate each pixel's color value when HAM screen modes are used and convert the image to true color graphics, since a picture using HAM mode may use more than 256 colors.

Resolutions


Another difference between Amiga displays and "modern" displays are the display resolutions. On PCs, resolutions refer to the amount of pixels per scanline and the amount of scanlines per screen.

On the Amiga, resolutions only refer to the amount of pixels per scanline and this amount is often fixed. For example, most displays support 320 lowres pixels per scanline, although this value can be slightly increased by utilising the overscan region. A high resolution screen has twice the amount of pixels per scanline compared to a low resolution screen. A super hires screen has double the amount of pixels per scanline compared to a high resolution screen. Moreover, a low resolution pixel is twice as wide as a high resolution pixel, and so on.

Vertically, there are only two options. In principle, there are a fixed amount of scanlines on a display. Typically, NTSC displays support 200 scanlines and PAL displays support 256 scanlines. This amount can be slightly increased by utilising the overscan region of a display.

The amount of scanlines can be doubled, using a so-called interlace screen mode. However, interlace screen modes have a big drawback -- they draw the odd scanlines in one frame and the even ones in another. On displays with a short after glow, flickering may be observed.

Because of these differences, we may observe odd results in some cases when we convert an Amiga pixel surface to a chunky/RGB pixel surface. For example, a non-interlaced high resolution image looks twice as wide on a PC display than on an Amiga display, as can be seen in the left picture above, which contains a pixel-by-pixel conversion of the Workbench screenshot.

To give it the same look as an Amiga display, we must correct its aspect ratio by doubling the amount of scanlines on the PC display, which is done in the right screenshot. People who have used the Amiga Workbench will immediately notice that this image is much closer to what can be seen on a real Amiga.

A library for performing conversions of Amiga display surfaces


While writing my previous blog post, I developed a library (libamivideo) capable of automatically converting Amiga bitplane surfaces to chunky/RGB pixel surfaces and vice versa. In the latest release, I have added the following features and improvements:

  • Support for super hires video modes
  • Correct aspect ratio support
  • A better defined API

The library can be obtained from my GitHub page. I have also updated the SDL-based ILBM viewer (SDL_ILBM) as well as the native AmigaOS ILBM viewer (amiilbm) to use the new library implementation.

Developing command-line utilities

Some people around me have noticed that I frequently use the command-line for development and system administration tasks, in particular when I'm working on Linux and other UNIX-like operating systems.

Typically, when you see me working on something behind my computer, you will most likely observe a screen that looks like this:


The above screenshot shows a KDE plasma desktop session in which I have several terminal screens opened running a command-line shell session.

For example, I do most of my file management tasks on the command-line. Moreover, I'm also a happy Midnight Commander user and I use the editor that comes with it (mcedit) quite a lot as well.

As a matter of fact, mcedit is my favorite editor for the majority of my tasks, so you will never see me picking any side in the well-known Emacs vs vi discussion (or maybe I will upset people by saying that I also happen to know Vim and never got used to Emacs at all) :-).

Using the command-line


For Linux users who happen to do development, this way of working makes (sort of) sense. To most outsiders, however, the above screenshot looks quite menacing and they often ask me why this is a good (or appealing) way of working. Some people have even advised me to change my working style, because they consider it confusing and inefficient.

Although I must admit that it takes a bit of practice to learn a relevant set of commands and get used to it, the reasons for me to stick to such a working style are the following:

  • It's a habit. This is obviously not a very strong argument, but I'm quite used to typing commands when executing system administration and development tasks, and they are quite common in the Linux/Unix world -- much of the documentation that you will find online explains how to do things on the command-line.

    Moreover, I have been using a command-line interface since the very first moment I was introduced to a computer. For example, my first computer's operating system shell was a BASIC programming language interpreter, which you even had to use to perform simple tasks, such as loading a program from disk/tape and running it.

  • Speed. This argument may sound counter-intuitive to some people, but I can accomplish many tasks quite quickly on the command-line and probably even faster than using a graphical user interface.

    For example, when I need to do a file management task, such as removing all backup files (files having a ~ suffix), I can simply run the following shell command:

    $ rm *~

    If I would avoid the shell and use a graphical file manager (e.g. Dolphin, GNOME Files, Windows Explorer, Apple Finder), picking these files and moving them to the trashcan would certainly take much more time and effort.

    Moreover, shells (such as bash) have many more useful goodies to speed things up, such as TAB-completion, allowing someone to only partially type a command or filename and let the shell complete it by pressing the TAB-key.

  • Convenient integration. Besides running a single command-line instruction to do a specific task, I also often combine various instructions to accomplish something more difficult. For example, the following chain of commands:

    $ wc -l $(find . -name \*.h -or -name \*.c) | head -n -1 | sort -n -r

    comes in handy when I want to generate an overview of lines of code per source file implemented in the C programming language and sort them in reverse order (that is, the biggest source file first).

    As you may notice, I integrate four command-line instructions: the find instruction seeks all relevant C source files in the current working directory, the wc -l command calculates the lines of code per file, the head command chops off the last line (which displays the totals) and the sort command sorts the output numerically in reverse order.

    Integration is done by pipes (using the | operator) which make the output of one process the input of another and command substitution (using the $(...) construct) that substitutes the invocation by its output.

  • Automation. I often find myself repeating common sets of shell instructions to accomplish certain tasks. I can conveniently turn them into shell scripts to make my life easier, as the sketch after this list illustrates.
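
For example, the line-counting pipeline shown earlier can be captured in a tiny script (the name loc-overview is made up for this sketch):

#!/bin/bash
# loc-overview: lines of code per C source file, biggest file first
wc -l $(find . -name \*.h -or -name \*.c) | head -n -1 | sort -n -r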

Recommendations for developing command-line utilities


Something that I consider an appealing property of the automation and integration aspects is that people can develop "their own" commands in nearly any programming language (using any kind of technology) in a straight forward way and easily integrate them with other commands.

I have developed many command-line utilities -- some of them have been developed for private use by my past and current employers (I described one of them in my bachelor's thesis for example), and others are publicly available as free and open source software, such as the tools part of the IFF file format experiments project, Disnix, NiJS and the reengineered npm2nix.

Although developing simple shell scripts as command-line utilities is a straightforward task, implementing good quality command-line tools is often more complicated and time consuming. I typically take the following properties into consideration while developing them:

Interface conventions


There are various ways to influence the behaviour of processes invoked from the command-line. The most common ways are through command-line options, environment variables and files.

I expose anything that is configurable as command-line options. I mostly follow the conventions of the underlying OS/platform:

  • On Linux and most other UNIX-like systems (e.g. Mac OS X, FreeBSD) I follow GNU's convention for command line parameters.

    For each parameter, I always define a long option (e.g. --help) and for the most common parameters also a short option (e.g. -h). I only implement a purely short option interface if the target platform does not support long options, such as classic UNIX-like operating systems.
  • On DOS/Windows, I follow the DOS convention. That is: command line options are single character only and prefixed by a slash, e.g. /h.

I always implement two command-line options, namely a help option displaying a help page that summarizes how the command can be used and a version option displaying the name and version of the package, and optionally a copyright + license statement.

I typically reserve non-option parameters to file names/paths only.

If certain options are crosscutting among multiple command line tools, I also make them configurable through environment variables in addition to command-line parameters.

For example, Disnix exposes each deployment activity (e.g. building a system, distributing and activating it) as a separate command-line tool. Disnix has the ability to deploy a system to multiple target environments (called a profile). For the following chain of command-line invocations that deploys a system, it makes more sense to specify the profile once through an environment variable:


$ export DISNIX_PROFILE=testenv
$ manifest=$(disnix-manifest -s services.nix -i infrastructure.nix \
-d distribution.nix)
$ disnix-distribute $manifest
$ disnix-activate $manifest
$ disnix-set $manifest

than to pass a -p testenv parameter four times (once to each command-line invocation).

I typically use files to process or produce arbitrary sets of data:

  • If a tool takes one single file as input or produces one single output or a combination of both, I also allow it to read from the standard input or write to the standard output so that it can be used as a component in a pipe. In most cases supporting these special file descriptors is almost as straight forward as opening arbitrary files.

    For example, the ILBM image viewer's primary purpose is just to view an ILBM image file stored on disk. However, because I also allow it to read from the standard input I can also do something like this:


    $ iffjoin picture1.ILBM picture2.ILBM | ilbmviewer

    In the above example, I concatenate two ILBM files into a single IFF container file and invoke the viewer to view both of them without storing the intermediate result on disk first. This can be useful for a variety of purposes.

  • A tool that produces output writes data that can be parsed/processed by another process in a pipeline to the standard output. All other output, e.g. errors, debug messages and notifications, goes to the standard error.

Finally, every process should return an exit status when it finishes. By convention, it should return 0 if everything went OK, and a non-zero exit status that uniquely identifies the error if something went wrong.
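
In a shell, the exit status of the most recently executed command can be inspected through the $? variable, which makes this convention easy to check:

$ true; echo $?
0
$ false; echo $?
1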

Command-line option parsing


As explained in the previous section, I typically follow the command-line option conventions of the platform. However, parsing them is usually not that straight forward. Some things we must take into account are:

  • We must know which parameters are command-line option flags (e.g. starting with -, -- or /) and which are non-option parameters.
  • Some command-line options have a required argument, some take an optional argument and some take none.

Luckily, there are many libraries available for a variety of programming languages/platforms that implement such a parser, and I have used several of them over the years.


I have never implemented a sophisticated command-line parser myself. The only case in which I ended up implementing a custom parser is in the native Windows and AmigaOS ports of the IFF libraries projects -- I could not find any libraries supporting their native command-line option style. Fortunately, the command-line interfaces were quite simple, so it did not take me that much effort.
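
To give an impression of what using such a parser looks like, the sketch below wires up a short and long option pair for a hypothetical shell-based tool with the util-linux getopt command (the tool and option names are made up):

#!/bin/bash
# Sketch: parse -o/--output and -h/--help for a hypothetical tool

PARAMS=$(getopt -n "$0" -o o:h -l output:,help -- "$@") || exit 1
eval set -- "$PARAMS"

while [ "$1" != "--" ]
do
    case "$1" in
        -o|--output) output="$2"; shift ;;
        -h|--help) echo "Usage: $0 [OPTION] FILE..."; exit 0 ;;
    esac
    shift
done
shift # discard the -- separator; what remains are the non-option parameters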

Validation and error reporting


Besides parsing the command-line options (and non-options), we must also check whether all mandatory parameters are set, whether they have the right format and set default values for unspecified parameters if needed.

Furthermore, in case of an error regarding the inputs, we must report it to the caller and exit the process with a non-zero exit status.
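
Continuing the sketch above, validation typically consists of substituting defaults for unspecified parameters and reporting missing mandatory ones on the standard error before exiting with a non-zero status (still hypothetical):

# Substitute a default value for an optional parameter
maxConcurrentTransfers=${maxConcurrentTransfers:-2}

# A missing mandatory parameter is an error
if [ -z "$output" ]
then
    echo "ERROR: no output file specified! Use -o or --output." >&2
    exit 1
fi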

Documentation


Another important concern while developing a command-line utility (that IMO is often overlooked) is providing documentation about its usage.

For every command-line tool, I typically implement a help option displaying a help page on the terminal. This help page contains the following sections:

  • A usage line describing briefly in what ways the command-line tool can be invoked and what their mandatory parameters are.
  • A brief description explaining what the tool does.
  • The command-line options. For each option, I document the following properties:

    • The short and long option identifiers.
    • Whether the option requires no parameter, an optional parameter, or a required parameter.
    • A brief description of the option
    • What the default value is, if appropriate

    An example of an option that I have documented for the disnix-distribute utility is:


    -m, --max-concurrent-transfers=NUM  Maximum amount of concurrent closure
                                        transfers. Defaults to: 2

    that specifies that the amount of concurrent transfers can be specified through the -m short or --max-concurrent-transfers long option, requires a numeric argument and defaults to 2 if the option is unspecified.

  • An environment section describing which environment variables can be configured and their meanings.
  • I only document the exit statuses if any of them has a special meaning. Zero in case of success and non-zero in case of a failure is too obvious.

Besides concisely writing a help page, there are more practical issues with documentation -- the help page is typically not the only source of information that describes how command-line utilities can be used.

For projects that are a bit more mature, I also want to provide a manpage of every command-line utility that more or less contains the same information as the help page. In larger and more complicated projects, such as Disnix, I also provide a Docbook manual that (besides detailed instructions and background information) includes the help pages of the command-line utilities in the appendix.

I used to write these manpages and docbook help pages by hand, but it's quite tedious to write the same stuff multiple times. Moreover, it is even more tedious to keep them all up-to-date and consistent.

Fortunately, we can also generate the latter two artifacts. GNU's help2man utility comes in quite handy to generate a manual page by invoking the --help and --version options of an existing command-line utility. For example, by following GNU's convention for writing help pages, I was able to generate a reasonably good manual page for the disnix-distribute tool, by simply running:


$ help2man --output=disnix-distribute.1 --no-info --name \
'Distributes intra-dependency closures of services to target machines' \
--libtool ./disnix-distribute

If needed, additional information can be added to the generated manual page.

I also discovered a nice tool called doclifter that allows me to generate Docbook help pages from manual pages. I can run the following command to generate a Docbook man page section from the earlier generated manpage:


$ doclifter -x disnix-distribute.1

The above command-line instruction generates a Docbook 5 XML file (disnix-distribute.1.xml) from the earlier manual page and the result looks quite acceptable to me. The only thing I had to do was manually replace some xml:id attributes so that a command's help page can be properly referenced from the other Docbook sections.

Modularization


When implementing a command-line utility, many concerns need to be addressed besides the primary task that it performs. In this blog post I have mentioned the following:

  • Interface conventions
  • Command-line option parsing
  • Validation and error reporting
  • Documentation

I tend to separate the command-line interface implementing the above concerns into a separate module/file (typically called main.c), unless the job that the command-line tool should do is relatively simple (e.g. it can be expressed in 1-10 lines of code), because I consider modularization a good practice and a module should preferably not grow too big or do too many things.

Reuse


When I implement a toolset of command-line utilities, such as Disnix, I often see that these tools have many common parameters, common validation procedures and a common procedure displaying the tool version.

I typically abstract them away into a common module, file or library, so that I don't find myself duplicating them, which would make maintenance more difficult.
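
In a shell-based toolset, such reuse can be as simple as a file with shared functions that every tool sources on startup (a sketch with made-up names):

# common.sh -- shared by every tool in a hypothetical toolset

showVersion()
{
    echo "$(basename "$0") (mytoolset 1.0)"
}

checkProfile()
{
    profile=${profile:-default} # shared default value
}

Each tool then includes it with . $(dirname "$0")/common.sh and calls these functions instead of reimplementing them.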

Discussion


In this blog post, I have explained that I prefer the command-line for many development and system administration tasks. Moreover, I have also explained that I have developed many command-line utilities.

Creating good quality command-line tools is not straight forward. To make myself and other people that I happen to work with aware of it, I have written down some of the aspects that I take into consideration.

Although the command-line has some appealing properties, it is obviously not perfect. Some command-line tools are weird, may significantly lack quality and cannot be integrated in a straight forward way through pipes. Furthermore, many shells (e.g. bash) implement a programming language having weird features and counter-intuitive traits.

For example, one of my favorite pitfalls in bash is its behaviour when executing shell scripts, such as the following script (to readers that are unfamiliar with the commands: false is a command that always fails):


echo "hello"
false | cat
echo "world"
false
echo "!!!"

When executing the script like this:


$ bash script.bash
hello
world
!!!

bash simply executes everything despite the fact that two commands fail. However, when I add the -e command line parameter, the execution is supposed to be stopped if any command returns a non-zero exit status:


$ bash -e script.bash
hello
world

However, there is still one oddity -- the pipe (false | cat) still succeeds, because the exit status of the pipe corresponds to the exit status of the last component of the pipe only! There is another way to check the status of the other components (through $PIPESTATUS), but this feels counter-intuitive to me! Fortunately, bash 3.0 and onwards have an additional setting that makes the behaviour come closer to what I expect:


$ bash -e -o pipefail script.bash
hello

Despite a number of oddities, the overall idea of the command-line is good IMO: it is a general purpose environment that can be interactive, programmed and extended with custom commands implemented in nearly any programming language that can be integrated with each other in a straight forward manner.

Meanwhile, now that I have made myself aware of some important concerns, I have adapted the development versions of all my free and open source projects to properly reflect them.

Deploying state with Disnix


A couple of months ago, I announced a new Disnix release after a long period of only little development activity.

As I have explained earlier, Disnix's main purpose is to automatically deploy service-oriented systems into heterogeneous networks of machines running various kinds of operating systems.

In addition to automating deployment, it has a couple of interesting non-functional properties as well. For example, it supports reliable deployment, because components implementing services are stored alongside existing versions and older versions are never automatically removed. As a result, we can always roll back to the previous configuration in case of a failure.

However, there is one major unaddressed concern when using Disnix to deploy a service-oriented system. Like the Nix package manager -- which serves as the basis of Disnix -- Disnix does not manage state.

The absence of state management has a number of implications. For example, when deploying a database, it gets created on first startup, often with a schema and initial data set. However, the structure and contents of a database typically evolves over time. When updating a deployment configuration that (for example) moves a database from one machine to another, the changes that have been made since its initial deployment are not migrated.

So far, state management in combination with Disnix has always been a problem that must be solved manually or by using an external solution. For a single machine, manual state management is often tedious but still doable. For large networks of machines, however, it may become a problem that is too big to handle.

A few years ago, I rushed out a prototype tool called Dysnomia to address state management problems in conjunction with Disnix and wrote a research paper about it. In the last few months, I have integrated the majority of the concepts of this prototype into the master versions of Dysnomia and Disnix.

Executing state management activities


When deploying a service oriented system with Disnix, a number of deployment activities are executed. For the build and distribution activities, Disnix consults the Nix package manager.

After all services have been transferred, Disnix activates them and deactivates the ones that have become obsolete. Disnix consults Dysnomia to execute these activities through a plugin system that delegates the execution of these steps to an appropriate module for a given service type, such as a process, source code repository or a database.

Deployment activities carried out by Dysnomia require two mandatory parameters. The first parameter is a container specification capturing the properties of a container that hosts one or more mutable components. For example, a MySQL DBMS instance can be specified as follows:

type=mysql-database
mysqlUsername=root
mysqlPassword=verysecret

The above specification states that we have a container of type mysql-database that can be reached using the above listed credentials. The type attribute allows Dysnomia to invoke the module that executes the required deployment steps for MySQL.

The second parameter refers to a logical representation of the initial state of a mutable component. For example, a MySQL database is represented as a script that generates its schema:

create table author
( AUTHOR_ID INTEGER NOT NULL,
  FirstName VARCHAR(255) NOT NULL,
  LastName VARCHAR(255) NOT NULL,
  PRIMARY KEY(AUTHOR_ID)
);

create table books
( ISBN VARCHAR(255) NOT NULL,
  Title VARCHAR(255) NOT NULL,
  AUTHOR_ID VARCHAR(255) NOT NULL,
  PRIMARY KEY(ISBN),
  FOREIGN KEY(AUTHOR_ID) references author(AUTHOR_ID)
    on update cascade on delete cascade
);

A MySQL database can be activated in a MySQL DBMS, by running the following command-line instruction with the two configuration files shown earlier as parameters:

$ dysnomia --operation activate \
--component ~/testdb \
--container ~/mysql-production

The above command first checks whether a MySQL database named testdb exists. If it does not exist, it gets created and the initial schema is imported. If a database with the given name already exists, the command does nothing.

With the latest Dysnomia, it is also possible to run snapshot operations:
$ dysnomia --operation snapshot \
--component ~/testdb \
--container ~/mysql-production

The above command invokes the mysqldump utility to take a snapshot of the testdb in a portable and consistent manner and stores the output in a so-called Dysnomia snapshot store.

When running the following command-line instruction, the contents of the snapshot store are displayed for the MySQL container and the testdb component:

$ dysnomia-snapshots --query-all --container mysql-database --component testdb
mysql-production/testdb/9b0c3562b57dafd00e480c6b3a67d29146179775b67dfff5aa7a138b2699b241
mysql-production/testdb/1df326254d596dd31d9d9db30ea178d05eb220ae51d093a2cbffeaa13f45b21c
mysql-production/testdb/330232eda02b77c3629a4623b498855c168986e0a214ec44f38e7e0447a3f7ef

As may be observed, the dysnomia-snapshots utility outputs three relative paths that correspond to three snapshots. The paths reflect a number of properties, such as the container name and the component name. The last path component is a SHA256 hash code reflecting the snapshot's contents (computed from the actual dump).

Each container type follows its own naming convention to reflect the contents of a snapshot. While MySQL and most of the other Dysnomia modules use output hashes, other naming conventions are used as well. For example, the Subversion module uses the revision id of the repository.

A naming convention that reflects a snapshot's contents has all kinds of benefits. For example, if the MySQL database does not change and we run the snapshot operation again, Dysnomia discovers that a snapshot with the same output hash already exists and does not store it twice, which improves storage efficiency.
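
For example, if we take another snapshot without modifying the database, no new entry should appear in the store -- querying it again simply yields the same paths as before (a small sketch; the hashes shown are the ones from the earlier listing):

$ dysnomia --operation snapshot --component ~/testdb \
--container ~/mysql-production
$ dysnomia-snapshots --query-all --container mysql-database --component testdb
mysql-production/testdb/9b0c3562b57dafd00e480c6b3a67d29146179775b67dfff5aa7a138b2699b241
mysql-production/testdb/1df326254d596dd31d9d9db30ea178d05eb220ae51d093a2cbffeaa13f45b21c
mysql-production/testdb/330232eda02b77c3629a4623b498855c168986e0a214ec44f38e7e0447a3f7ef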

The absolute versions of the snapshot paths can be retrieved with the following command:

$ dysnomia-snapshots --resolve mysql-database/testdb/330232eda02b77c3629a4623b498855c...
/var/state/dysnomia/snapshots/mysql-production/testdb/330232eda02b77c3629a4623b498855...

Besides snapshotting, it is also possible to restore state with Dysnomia:

$ dysnomia --operation restore \
--component ~/testdb \
--container ~/mysql-production

The above command restores the latest snapshot generation. If no snapshots exist in the store, it does nothing.

Finally, it is also possible to clean things up. Similar to the Nix package manager, old components are never deleted automatically, but must be explicitly garbage collected. For example, deactivating the MySQL database can be done as follows:

$ dysnomia --operation deactivate \
--component ~/testdb \
--container ~/mysql-production

The above command does not delete the MySQL database. Instead, it simply marks it as garbage, but otherwise keeps it. Actually deleting the database can be done by invoking the garbage collect operation:

$ dysnomia --operation collect-garbage \
--component ~/testdb \
--container ~/mysql-production

The above command first checks whether the database has been marked as garbage. If this is the case (because it has been deactivated) it is dropped. Otherwise, this command does nothing (because we do not want to delete stuff that is actually in use).

Besides the physical state of components, all generations of snapshots in the store are also kept by default. They can be removed by running the snapshot garbage collector:

$ dysnomia-snapshots --gc --keep 3

The above command states that all but the last 3 snapshot generations should be removed from the snapshot store.

Managing state of service-oriented systems


With the new snapshotting facilities provided by Dysnomia, we have extended Disnix to support state deployment of service-oriented systems.

By default, the new version of Disnix does not manage state and its behaviour remains exactly the same as the previous version, i.e. it only manages the static parts of the system. To allow Disnix to manage the state of services, they must be explicitly annotated as such in the services model:

staff = {
  name = "staff";
  pkg = customPkgs.staff;
  dependsOn = {};
  type = "mysql-database";
  deployState = true;
};

Adding a deployState attribute that is set to true to a service causes Disnix to manage its state as well. For example, when we change the target machine of the database in the distribution model and run the following command:

$ disnix-env -s services.nix -i infrastructure.nix -d distribution.nix

Disnix executes the data migration phase after the configuration has been successfully activated. In this phase, Disnix snapshots the state of the annotated services on the target machines, transfers the snapshots to the new targets (through the coordinator machine), and finally restores their state.
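
Conceptually, this phase boils down to a sequence of Dysnomia operations per annotated service. The sketch below is only an illustration (the ~/staff component configuration is hypothetical and the actual transfer is handled internally by Disnix):

# on the old target machine: capture the service's state
$ dysnomia --operation snapshot --component ~/staff --container ~/mysql-production

# the snapshots are then transferred through the coordinator machine to the
# new target's snapshot store (omitted here; Disnix takes care of this step)

# on the new target machine: restore the captured state
$ dysnomia --operation restore --component ~/staff --container ~/mysql-production

# finally, the obsolete state on the old target is deactivated and can be
# garbage collected
$ dysnomia --operation deactivate --component ~/staff --container ~/mysql-production
$ dysnomia --operation collect-garbage --component ~/staff --container ~/mysql-production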

In addition to data migration, Disnix can also be used as a backup tool. Running the following command:

$ disnix-snapshot

captures the state of all annotated services in the configuration that has been previously deployed and transfers their snapshots to the coordinator machine's snapshot store.
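
Since the coordinator machine's snapshot store is an ordinary Dysnomia snapshot store, its contents can presumably be inspected with the dysnomia-snapshots tool shown earlier (a sketch only; the exact component name used in the store is an assumption):

$ dysnomia-snapshots --query-all --container mysql-database --component staff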

Likewise, the snapshots can be restored as follows:

$ disnix-restore

By default, the above command only restores the state of the services that are in the last configuration, but not in the configuration before. However, it may also be desirable to force the state of all annotated services in the current configuration to be restored. This can be done as follows:

$ disnix-restore --no-upgrade

Finally, the snapshots that are taken on the target machines are not deleted automatically. Disnix can also automatically clean the snapshot stores of a network of machines:

$ disnix-clean-snapshots --keep 3 infrastructure.nix

The above command deletes all but the last three snapshot generations from all machines defined in the infrastructure model.

Discussion


The extended implementations of Dysnomia and Disnix implement the majority of concepts described in my earlier blog post and the corresponding paper. However, there are a number of things that are different:

  • The prototype implementation stores snapshots in the /dysnomia folder (analogous to the Nix store that resides in /nix/store), which is a non-FHS compliant directory. Nix has a number of very good reasons to deviate from the FHS and requires packages to be addressed by their absolute paths across machines so that they can be uniformly accessed by a dynamic linker.

    However, such a level of strictness is not required for addressing snapshots. In the current implementation, snapshots are stored in /var/state/dysnomia, which is FHS-compliant. Furthermore, snapshots are identified by their paths relative to the store folder. The snapshot store's location can be changed by setting the DYSNOMIA_STATEDIR environment variable, allowing someone to have multiple snapshot stores (a small example follows this list).
  • In the prototype, the semantics of the deactivate operation also imply deleting the state of a mutable component in a container. As this is a dangerous and destructive operation, the current implementation separates the actual delete operation into a garbage collect operation that must be invoked explicitly.
  • In both the prototype and the current implementation, a Dysnomia plugin can choose its own naming convention to identify snapshots. In the prototype, the naming must reflect both the contents and the order in which the snapshots have been taken. As a general fallback, I proposed using timestamps.

    However, timestamps are unreliable in a distributed setting because the machines' clocks may not be in sync. In the current implementation, I use output hashes as a general fallback. As hashes cannot reflect the order in their names, Dysnomia provides a generations folder containing symlinks to snapshots whose names reflect the order in which they have been taken.
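
As a small illustration of the configurable store location mentioned in the first item (the directory name below is just an example), an alternative snapshot store can be used as follows:

$ export DYSNOMIA_STATEDIR=/home/sander/alternative-state
$ dysnomia --operation snapshot --component ~/testdb \
--container ~/mysql-production
$ dysnomia-snapshots --query-all --container mysql-database --component testdb

The last command should now list the snapshots that reside in the alternative store rather than in /var/state/dysnomia.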

The paper also describes two concepts that are still unimplemented in the current master version:

  • The incremental snapshot operation is unimplemented. Although this feature may sound attractive, I could only properly do this with Subversion repositories and MySQL databases with binary logging enabled.
  • To upgrade a service-oriented system (that includes moving state) atomically, access to the system in the transition and data migration phases must be blocked/queued. However, when moving large data sets, this time window could be incredibly big.

    As an optimization, I proposed an improved upgrade process in which incremental snapshots are transferred inside the locking time window, while full snapshots are transferred before the locking starts. Although it may sound conceptually nice, it is difficult to properly apply it in practice. I may still integrate it some day, but currently I don't need it. :)

Finally, there are some practical notes when using Dysnomia's state management facilities. Its biggest drawback is that the way state is managed (by consulting tools that store dumps on the filesystem) is typically expensive in terms of time (because it may take a long time writing a dump to disk) and storage. For very large databases, the costs may actually be too high.

As described in the previous blog post and the corresponding paper, there are alternative ways of doing state management:

  • Filesystem-level snapshotting is typically faster since files only need to be copied. However, its biggest drawback is that physical state may be inconsistent (because of unfinished write operations) and non-portable. Moreover, it may be difficult to manage individual chunks of state. NixOps, for example, supports partition-level state management of EBS volumes.
  • Database replication engines can also typically capture and transfer state much more efficiently.

Because Dysnomia's way of managing state has some expensive drawbacks, it has not been enabled by default in Disnix. Moreover, this was also the main reason why I did not integrate the features of the Dysnomia prototype sooner.

The reason why I have proceeded anyway is that I have to manage a big environment of small databases, whose sizes are only several megabytes each. For such an environment, Dysnomia's snapshotting facilities work fine.

Availability


The state management facilities described in this blog post are part of Dysnomia and Disnix version 0.4. I also want to announce their immediate availability! Visit the Disnix homepage for more information!

As with the previous release, Disnix still remains a tool that should be considered an advanced prototype, despite the fact that I am using it on a daily basis to eat my own dogfood. :)

Assigning port numbers to (micro)services in Disnix deployment models

I have been working on many Disnix related aspects for the last few months. For example, in my last blog post I have announced a new Disnix release supporting experimental state management.

Although I am quite happy with the most recent feature addition, another major concern that the basic Disnix toolset does not solve is coping with the dynamism of the services and the environment in which a system has been deployed.

Static modeling of services and the environment has the following consequences:

  • We must write an infrastructure model reflecting all relevant properties of all target machines. Although writing such a configuration file for a new environment is doable, it is quite tedious and error prone to keep it up to date and in sync with their actual configurations -- whenever a machine's property or the network changes, the infrastructure model must be updated accordingly.

    (As a sidenote: when using the DisnixOS extension, a NixOS network model is used instead of an infrastructure model, from which the machines' configurations can be automatically deployed, making the consistency problem obsolete. However, the problem persists if we need to deploy to a network of non-NixOS machines.)
  • We must manually specify the distribution of services to machines. This problem typically becomes complicated if services have specific technical requirements on the host that they need to run on (e.g. operating system, CPU architecture, infrastructure components such as an application server).

    Moreover, a distribution could also be subject to non-functional requirements. For example, a service providing access to privacy-sensitive data should not be deployed to a machine that is publicly accessible from the internet.

    Because requirements may be complicated, it is typically costly to repeat the deployment planning process whenever the network configuration changes, especially if the process is not automated.

To cope with the above listed issues, I have developed a prototype extension called Dynamic Disnix and wrote a paper about it. The extension toolset provides the following:

  • A discovery service that captures the properties of the machines in the network from which an infrastructure model is generated.
  • A framework allowing someone to automate deployment planning processes using a couple of algorithms described in the literature.

Besides the dynamism of the infrastructure and distribution models, I also observed that the services model (capturing the components of which a system consists) may be too static in certain kinds of situations.

Microservices


Lately, I have noticed that a new paradigm named microservice architectures is gaining a lot of popularity. In many ways this new trend reminds me of the service-oriented architecture days -- everybody was talking about it and had success stories, but nobody had a full understanding of it, nor an idea of what it was exactly supposed to mean.

However, if I would restrict myself to some of their practical properties, microservices (like "ordinary" services in a SOA-context) are software components and one important trait (according to Clemens Szyperski's Component Software book) is that a software component:

is a unit of independent deployment

Another important property of microservices is that they interact with each other by sending messages through the HTTP communication protocol. In practice, many people accomplish this by running processes with an embedded HTTP server (as opposed to using application servers or external web servers).

Deploying Microservices with Disnix


Although Disnix was originally developed to deploy a "traditional" service-oriented system case-study (consisting of "real" web services using SOAP as communication protocol), it has been made flexible enough to deploy all kinds of components. Likewise, Disnix can also deploy components that qualify themselves as microservices.

However, when deploying microservices (running embedded HTTP servers) there is one practical issue -- every microservice must listen on its own unique TCP port on a machine. Currently, meeting this requirement is completely the responsibility of the person composing the Disnix deployment models.

In some cases, this problem is more complicated than expected. For example, manually assigning a unique TCP port to every service for the initial deployment is straightforward, but it may also be desired to move a service from one machine to another. It could happen that a previously assigned TCP port conflicts with another service after moving it, breaking the deployment of the system.

The port assignment problem


So far, I take the following aspects into account when assigning ports (a small sketch illustrating these constraints follows the list):

  • Each service must listen on a port that is unique to the machine the service runs on. In some cases, it may also be desirable to assign a port that is unique to the entire network (instead of a single machine) so that it can be uniformly accessed regardless of its location.
  • The assigned ports must be within a certain range so that (for example) they do not collide with system services.
  • Once a port number has been assigned to a service, it must remain reserved until it gets undeployed.

    The alternative would be to reassign all port numbers to all services for each change in the network, but that can be quite costly in case of an upgrade. For example, if we upgrade a network running 100 microservices, all 100 of them may need to be deactivated and activated to make them listen on their newly assigned ports.
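
The following shell sketch (not part of the Disnix toolset; the reservations file and port range are made up for illustration) captures the essence of these constraints: pick the lowest unreserved port within a range and record the reservation so that it remains stable until the service is undeployed:

#!/bin/sh
# Hypothetical helper: assign the lowest free TCP port within a range and
# remember the reservation in a plain text file (one port number per line).
RESERVATIONS=${RESERVATIONS:-reserved-ports.txt}
MIN_PORT=8000
MAX_PORT=9000

touch "$RESERVATIONS"

port=$MIN_PORT
while grep -qx "$port" "$RESERVATIONS" && [ "$port" -le "$MAX_PORT" ]
do
    port=$((port + 1))
done

if [ "$port" -gt "$MAX_PORT" ]
then
    echo "No free ports left in range $MIN_PORT-$MAX_PORT" >&2
    exit 1
fi

echo "$port" >> "$RESERVATIONS" # keep the reservation until undeployment
echo "$port"

The dydisnix-port-assign tool described in the next section automates this kind of bookkeeping for entire networks of machines.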

Dynamically configuring ports in Disnix models


Since it is quite tedious and error prone to maintain port assignments in Disnix models, I have developed a utility to automate the process. To dynamically assign ports to services, they must be annotated with the portAssign property in the services model (which can be changed to any other property through a command-line parameter):


{distribution, system, pkgs}:

let
  portsConfiguration = if builtins.pathExists ./ports.nix
    then import ./ports.nix else {};
  ...
in
rec {
  roomservice = rec {
    name = "roomservice";
    pkg = customPkgs.roomservicewrapper { inherit port; };
    dependsOn = {
      inherit rooms;
    };
    type = "process";
    portAssign = "private";
    port = portsConfiguration.ports.roomservice or 0;
  };

  ...

  stafftracker = rec {
    name = "stafftracker";
    pkg = customPkgs.stafftrackerwrapper { inherit port; };
    dependsOn = {
      inherit roomservice staffservice zipcodeservice;
    };
    type = "process";
    portAssign = "shared";
    port = portsConfiguration.ports.stafftracker or 0;
    baseURL = "/";
  };
}

In the above example, I have annotated the roomservice component with a private port assignment property meaning that we want to assign a TCP port that is unique to the machine and the stafftracker component with a shared port assignment meaning that we want to assign a TCP port that is unique to the network.

By running the following command we can assign port numbers:


$ dydisnix-port-assign -s services.nix -i infrastructure.nix \
-d distribution.nix > ports.nix

The above command generates a port assignment configuration Nix expression (named: ports.nix) that contains port reservations for each service and port assignment configurations for the network and each individual machine:


{
  ports = {
    roomservice = 8001;
    ...
    zipcodeservice = 3003;
  };
  portConfiguration = {
    globalConfig = {
      lastPort = 3003;
      minPort = 3000;
      maxPort = 4000;
      servicesToPorts = {
        stafftracker = 3002;
      };
    };
    targetConfigs = {
      test2 = {
        lastPort = 8001;
        minPort = 8000;
        maxPort = 9000;
        servicesToPorts = {
          roomservice = 8001;
        };
      };
    };
  };
}

The above configuration attribute set contains three properties:

  • The ports attribute contains the actual port numbers that have been assigned to each service. The services defined in the services model (shown earlier) refer to the port values defined here.
  • The portConfiguration attribute contains port configuration settings for the network and each target machine. The globalConfig attribute defines a TCP port range with ports that must be unique to the network. Besides the port range it also stores the last assigned TCP port number and all global port reservations.
  • The targetConfigs attribute contains port configuration settings and reservations for each target machine.

We can also run the dydisnix-port-assign utility again with an existing port assignment configuration as a parameter:


$ dydisnix-port-assign -s services.nix -i infrastructure.nix \
-d distribution.nix -p ports.nix > ports2.nix

The above command-line invocation reassigns TCP ports, taking the previous port reservations into account so that these will be reused where possible (e.g. only new services get a port number assigned). Furthermore, it also clears all port reservations of the services that have been undeployed. The new port assignment configuration is stored in a file called ports2.nix.

Conclusion


In this blog post, I have identified another deployment planning problem that manifests itself when deploying microservices that all have to listen on a unique TCP port. I have developed a utility to automate this process.

Besides assigning port numbers, there are many other kinds of problems that need a solution while deploying microservices. For example, you might also want to restrict their privileges (e.g. by running all of them as separate unprivileged users). It is also possible to take care of that with Dysnomia.

Availability


The dydisnix-port-assign utility is part of the Dynamic Disnix toolset that can be obtained from my GitHub page. Unfortunately, the Dynamic Disnix toolset is still a prototype with no end-user documentation or a release, so you have to be brave to use it.

Moreover, I have created yet another Disnix example package (a Node.js variant of the ridiculous StaffTracker example) to demonstrate how "microservices" can be deployed. This particular variant uses Node.js as implementation platform and exposes the data sets through REST APIs. All components are microservices using Node.js' embedded HTTP server listening on their own unique TCP ports.

I have also modified the TCP proxy example to use port assignment configurations generated by the tool described in this blog post.

Using Nix while doing development

I have noticed that while doing development work, many outsiders experience the way I work as quite odd and consider it to be inefficient.

The first reason is (probably) because I like the command-line for many tasks and I frequently use an unconventional text editor, which some people don't understand. Second, I also often use Nix (and related Nix utilities) during development.

In this blog post, I'd like to elaborate a bit about the second aspect and discuss why it is actually quite useful to do this.

Obtaining and installing the dependencies of a software project


In all software projects I have been involved with so far, the first thing I typically had to do is installing all its required dependencies. Examples of such dependencies are a compiler/interpreter + runtime environment for a specific programming language (such as GCC, the Java Development Kit, Perl, Python, Node.js etc.), a development IDE (such as Eclipse), and many library packages.

I have experienced that this step is often quite time consuming, since many dependencies have to be downloaded and installed. Moreover, I have frequently encountered situations in which a bunch of special settings have to be configured to make everything work right, which can be quite cumbersome and tedious.

Many developers just simply download and install everything on their machine in an ad-hoc manner and manually perform all the steps they have to do.

Automating the deployment of dependencies of a software project


What I often do instead is to immediately use Nix to automate the process of installing all software project dependencies and configuring all their settings.

To fully automate deployment with Nix, all dependencies have to be provided as Nix packages. First, I investigate if everything I need is in the Nix packages collection. Fortunately, the Nix packages collection provides packages for many kinds of programming languages and associated libraries. If anything is missing, I package it myself. Although some things are very obscure to package (such as the Android SDK), most things can be packaged in a straightforward manner.

After packaging all the missing dependencies, I must install them and generate an environment in which all the dependencies can be found. For example, in order to let Java projects find their library dependencies, we must set the CLASSPATH environment variable to refer to directories or JAR files containing our required compiled classes, PERL5LIB to let Perl find its Perl modules, NODE_PATH to let Node.js find its CommonJS modules etc.
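
To give an impression of the kind of settings involved (the Nix store paths below are hypothetical placeholders), such an environment typically exports variables along these lines:

export CLASSPATH=/nix/store/...-somelib/share/java/somelib.jar:$CLASSPATH
export PERL5LIB=/nix/store/...-perl-SomeModule/lib/perl5/site_perl:$PERL5LIB
export NODE_PATH=/nix/store/...-node-somepackage/lib/node_modules:$NODE_PATH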

The Nix packages collection has a nice utility function called myEnvFun that can be used to automatically do this. This function can be used to compose development environments that automatically set nearly all of its required development settings. The following example Nix expression (disnixenv.nix) can be used to deploy a development environment for Disnix:


with import <nixpkgs> {};

myEnvFun {
  name = "disnix";
  buildInputs = [
    pkgconfig dbus dbus_glib libxml2 libxslt getopt
    autoconf automake libtool dysnomia
  ];
}

The development environment can be deployed by installing it with the Nix package manager:


$ nix-env -f disnixenv.nix -i env-disnix

The above command-line invocation deploys all Disnix's dependencies and generates a shell script (named: load-env-disnix) that launches a shell session in an environment in which all build settings are configured. We can enter this development environment by running:


$ load-env-disnix
env-disnix loaded

disnix:[sander@nixos:~]$

As can be seen in the code fragment above, we have entered an environment in which all its development dependencies are present and configured. For example, the following command-line instruction should work, since libxml2 is declared as a development dependency:


$ xmllint --version
xmllint: using libxml version 20901

We can also unpack the Disnix source tarball and run the following build instructions, which should work without any trouble:


$ ./bootstrap
$ ./configure
$ make

Besides deploying a development environment, we can also discard it if we don't need it anymore by running:


$ nix-env -e env-disnix

Automating common deployment tasks


After deploying a development environment, I'm usually doing all kinds of development tasks until the project reaches a more stable state. Once this is the case, I immediately automate most of its deployment operations in a Nix expression (that is often called: release.nix).

This Nix expression is typically an attribute set following Hydra's (the Nix-based continuous integration server) convention in which each attribute represents a deployment job. Jobs that I typically implement are:

  • Source package. This is a job that helps me to easily redistribute the source code of a project to others. For GNU Autotools based projects this typically involves the: make dist command-line instruction. Besides GNU Autotools, I also generate source packages for other kinds of projects. For example, for one of my Node.js projects (such as NiJS) I produce a TGZ file that can be deployed with the NPM package manager.
  • Binary package jobs that actually build the software project for every target system that I want to support, such as i686-linux, x86_64-linux, and x86_64-darwin.
  • A job that automatically compiles the program manual from e.g. Docbook, if this is applicable.
  • A program documentation catalog job. Many programming languages allow developers to write code-level documentation in comments from which a documentation catalog can be generated, e.g. through javadoc, doxygen or JSDuck. I also create a job that takes care of doing that.
  • Unit tests. If my project has a unit test suite, I also create a job that executes it, since most of the development dependencies need to be deployed to run the test suite as well.
  • System integration tests. If I have any system integration tests that can be run on Linux, I try to implement a job using the NixOS test driver to run those. The NixOS test driver also automatically deploys all the required environmental dependencies in a VM, such as a DBMS, web server etc., in such a way that you can run them as an unprivileged user without affecting the host system's configuration.

After writing a release Nix expression, I can use the Nix package manager to perform all the deployment tasks for my software project. For example, I can run the following job that comes with Disnix to build a source tarball:


$ nix-build release.nix -A tarball

Another advantage is that I can use the same expression in combination with Hydra so that I can continuously build and test the project.

Spawning development environments from Nix jobs


Apart from being able to deploy a software project and its dependencies, another nice advantage of having its deployment automated through Nix is that I can use the same Nix jobs to reproduce its build environment. By replacing nix-build with nix-shell all its dependencies are deployed, but instead of building the package itself, a shell session is started in which all its dependencies are configured:


$ nix-shell release.nix -A tarball

The above command-line invocation produces a similar build environment as shown earlier with myEnvFun by just using a Nix expression for an existing job.
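
Inside such a shell session, the usual build commands of the project (the Disnix build steps shown earlier, for instance) should simply work:

$ nix-shell release.nix -A tarball

# inside the spawned shell session:
$ ./bootstrap
$ ./configure
$ make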

Discussion


So far I have explained what I typically do with the Nix package manager during development. So why is this actually useful?

  • Besides deploying systems into production environments, setting up a development environment is a similar problem with similar complexities. For me it is logical that I solve these kinds of problems the same way as I do in production environments.
  • Packaging all software project dependencies in Nix ensures reliable and reproducible deployment in production environments as well, since Nix ensures dependency completeness and removes as many side effects as possible. In Nix, we have strong guarantees that a system deployed through Nix in production behaves exactly the same in development and vice-versa.
  • We can also share development environments, which is useful for a variety of reasons. For example, we can reproduce the same development environment on a second machine, or give it to another developer to prevent him from spending a lot of time setting up his development environment.
  • Supporting continuous integration and testing with Hydra comes (almost) for free.
  • With relatively little effort you can also enable distributed deployment of your software project with Disnix and/or NixOps.

Although I have encountered people telling me that it consumes too much time, you also get a lot in return. Moreover, in most common scenarios I don't think the effort of automating the deployment of a development environment is that big if you have experience with it.

As a concluding remark, I know that things like Agile software development and continuous delivery of new versions is something that (nearly) everybody wants these days. To implement continuous delivery/deployment, the first step is to automate the deployment of a software project from the very beginning. That is exactly what I'm doing by executing the steps described in this blog post.

People may also want to read an old blog post that contains recommendations to improve software deployment processes.

Third annual blog reflection

Today, it's three years ago that I started this blog, so I thought this is a good opportunity to reflect about last year's writings.

Obtaining my PhD degree


I think the most memorable thing that happened for me this year is the fact that I finally obtained my PhD degree. From that moment on, I can finally call myself Dr. Sander, or actually (if I take my other degree into account): Dr. Ir. Sander (or in fact: Dr. Ir. Ing. Sander if I take all of them into account, but I believe the last degree has been superseded by the middle one, but I'm not sure :-) ).

Anyway, after obtaining my PhD degree I don't feel that much different, apart from the fact that I feel relieved that it's done. It took me quite some effort to get my PhD dissertation and all the preparations done for the defense. Besides my thesis, I also had to defend my propositions. Most of them were not supposed to be directly related to my research subject.

Programming in JavaScript


Since I switched jobs, I have also been involved in a lot of JavaScript programming. Every programming language and runtime environment has its weird/obscure problems/challenges, but in my opinion JavaScript is a very special case.

As a former teaching assistant for the concepts of programming languages course, I remain interested in discovering important lessons that allow me to prevent turning code into a mess and to deal with the challenges that a particular programming language gives. So far, I have investigated object-oriented programming through prototypes, and two perspectives on dealing with the asynchronous programming problems that come with most JavaScript environments.

Besides programming challenges, I also have to perform deployment tasks for JavaScript programs. People who happen to know me know that I prefer Nix and Nix-related solutions. I have developed NiJS, an internal DSL for Nix, to make my life a bit easier in doing that.

Continuous integration and testing


Another technical aspect I have been working on is setting up a continuous integration facility by using Hydra: the Nix-based continuous integration server. I wrote a couple of blog posts describing its features, how to set it up and how to secure it.

I also made a couple of improvements to the Android and iOS Nix build functions, so that I can use Hydra to continuously build mobile apps.

Nix/NixOS development


Besides Hydra, I have been involved with various other parts of the Nix project as well. One of the more interesting things I did is developing a Nix function that can be used to compose FHS-compatible chroot environments. This function is particularly useful to run binary-only software in NixOS that cannot be patched, such as Steam.

I also wrote two blog posts to explain the user environment and development environment concepts.

Fun programming


Besides my PhD defense and all other activities, there was a bit of room to do some fun programming as well. I have improved my Amiga video emulation library (part of my IFF file format experiments project) a bit by implementing support for super hires resolutions and redefining its architecture.

Moreover, I have updated all the packages to use the new version of this library.

Research


After obtaining my PhD degree, I'm basically relieved from my research/publication duties. However, there is one research-related thing that caught my attention two months ago.

The journal paper titled: 'Disnix: A toolset for distributed deployment' that got accepted in April last year is finally going to get published in Volume 79 of 'Science of Computer Programming', unbelievable!

Although it sounds like good news that another paper of mine gets published, the thing that disturbs me is that the publication process took an insanely long time! I wrote the first version of this paper for the WASDeTT-3 workshop that was held in October 2010. So that technically means that I started doing the work for the paper several months prior to that.

In February 2011, I adapted/extended the workshop paper and submitted its first journal draft. Now, in January 2014, it finally gets published, which means that it took almost 3 years to get published (if you take the workshop into account as well, then it's actually closer to 3.5 years!).

In some academic assessment standards, journal papers have more value than conference papers. Although this journal paper should increase my value as a researcher, it's actually crazy if you think too much about it. The first reason is that I wrote the first version of this paper before I started this blog. Meanwhile, I have already written 54 blog articles, two tech reports, published two papers at conferences, and I finished my PhD dissertation.

The other reason is that peer reviewing and publishing should help the authors and the research discipline in general. To me this does not look like any help. Meanwhile, in the current development version of Disnix, some aspects of its architecture have evolved considerably compared to what has been described in the paper, so it is of no use to anyone else in the research community anymore.

The only value the paper still provides are the general ideas and the way Disnix manifests itself externally.

Although the paper is not completely valueless, and I'm happy it gets published, it also feels weird that I don't depend on it anymore.

Blog posts


As with my previous annual reflections, I will also publish the top 10 of my most frequently read blog posts:

  1. On Nix and GNU Guix. This is a critical blog post that also ended up first in last year's top 10. I think this blog post will remain at the top position for the time being, since it attracted an insane amount of visitors.
  2. An alternative explanation of the Nix package manager. My alternative explanation of Nix, which I wrote to clarify things. It was also second in last year's top 10.
  3. Setting up a Hydra build cluster for continuous integration and testing (part 1). Apparently, Hydra and some general principles about continuous integration have attracted quite some visitors. However, the follow-up blog posts I wrote about Hydra don't seem to be that interesting to outsiders.
  4. Using Nix while doing development. I wrote this blog post 2 days ago, and it attracted quite some visitors. I have noticed that setting up development environments is an attractive feature for Nix users.
  5. Second computer. This is an old blog post about my good ol' Amiga. It was also in all previous top 10s and I think it will remain like that for the time being. The Amiga rocks!
  6. An evaluation and comparison of GoboLinux. Another blog article that remains popular from the beginning. It's still a pity that GoboLinux has not been updated and sticks to their 014.01 release, which dates from 2008.
  7. Composing FHS-compatible chroot environments with Nix (or deploying Steam in NixOS). This is something I have developed to be able to run Steam in NixOS. It seems to have attracted quite some users, which does not come as a surprise. NixOS users want to play Half-Life!
  8. Software deployment complexity. An old blog post about software deployment complexity in general. Still remains popular.
  9. Deploying iOS applications with the Nix package manager. A blog post that I wrote last year describing how we can use the Nix package manager to build apps for the iPhone/iPad. For a long time the Android variant of this blog post was more popular, but recently this blog article surpassed it. I have no clue why.
  10. Porting software to AmigaOS (unconventional style). People still seem to like one of my craziest experiments.

Conclusion


I already have three more blog posts in draft/planning stages and more ideas that I'd like to explore, so expect more to come. The remaining thing I'd like to say is:

HAPPY NEW YEAR!!!


Building Appcelerator Titanium apps with Nix

Last month, I have been working on quite a lot of things. One of the things I did was improving the Nix function that builds Titanium SDK applications. In fact, it was in Nixpkgs for quite a while already, but I have never written about it on my blog, apart from a brief reference in an earlier blog post about Hydra.

The reason that I have decided to write about this function is because the process of getting Titanium applications deployable with Nix is quite painful (although I have managed to do it) and I want to report about my experiences so that these issues can be hopefully resolved in the future.

Although I have a strong opinion on certain aspects of Titanium, this blog post is not meant to discuss the development aspects of the Titanium framework. Instead, the focus is on getting the builds of Titanium apps automated.

What is Titanium SDK?


Titanium is an application framework developed by Appcelerator, whose purpose is to enable rapid development of mobile apps for multiple platforms. Currently, Titanium supports iOS, Android, Tizen, Blackberry and mobile web applications.

With Titanium, developers use JavaScript as an implementation language. The JavaScript code is packaged along with the produced app bundles, deployed to an emulator or device and interpreted there. For example, on Android Google's V8 JavaScript runtime is used, and on iOS Apple's JavaScriptCore is used.

Besides using JavaScript code, Titanium also provides an API supporting database access and (fairly) cross platform GUI widgets that have a (sort of) native look on each platform.

Titanium is not a write once run anywhere approach when it comes to cross platform support, but claims that 60-90% of the app code can be reused among platforms.

Finally, the Titanium Studio software distribution is proprietary software, but most of its underlying components (including the Titanium SDK) are free and open-source software available under the Apache Software License. As far as I can see, the Nix function that I wrote does not depend on any proprietary components, besides the Java Development Kit.

Packaging the Titanium CLI


The first thing that needs to be done to automate Titanium builds is being able to build stuff from the command-line. Appcelerator provides a command-line utility (CLI) that is specifically designed for this purpose and is provided as a Node.js package that can be installed through the NPM package manager.

Packaging NPM stuff in Nix is actually quite straightforward and probably the easiest part of getting the builds of Titanium apps automated. Simply adding titanium to the list of node packages (pkgs/top-level/node-packages.json) in Nixpkgs and running npm2nix, a utility developed by Shea Levy that automatically generates Nix expressions for any node package and all its dependencies, did the job for me.

Packaging the Titanium SDK


The next step is packaging the Titanium SDK that contains API libraries, templates and build script plugins for each target platform. The CLI supports multiple SDK versions at the same time and requires at least one version of an SDK installed.

I've obtained an SDK version from Appcelerator's continuous builds page. Since the SDK distributions are ZIP files containing binaries, I have to use the patching/wrapping tricks I have described in a few earlier blog posts again.

The Nix expression I wrote for the SDK basically unzips the 3.2.1 distribution, copies the contents into the Nix store and makes the following changes:

  • The SDK distribution contains a collection of Python scripts that execute build and debugging tasks. However, to be able to run them in NixOS, the shebangs must be changed so that the Python interpreter can be found:


    find . -name \*.py | while read i
    do
    sed -i -e "s|#!/usr/bin/env python|#!${python}/bin/python|" $i
    done
  • The SDK contains a subdirectory (mobilesdk/3.2.1.v20140206170116) with a version number and timestamp in it. However, the timestamp is a bit inconvenient, because the Titanium CLI explicitly checks for SDK folders that correspond to a Titanium SDK version number in a Titanium project file (tiapp.xml). Therefore, I strip it out of the directory name to make my life easier:


    $ cd mobilesdk/*
    $ mv 3.2.1.v20140206170116 3.2.1.GA
  • The Android builder script (mobilesdk/*/android/builder.py) packages certain files into an APK bundle (which is technically a ZIP file).

    However, the script throws an exception if it encounters files with timestamps below January 1, 1980, which are not supported by the ZIP file format. This is a problem, because Nix automatically resets timestamps of deployed packages to one second after January 1, 1970 (a.k.a. UNIX-time: 1) to make builds more deterministic. To remedy the issue, I had to modify several pieces of the builder script.

    What I basically did to fix this is searching for invocations to ZipFile.write() that adds a file from the filesystem to a zip archive, such as:


    apk_zip.write(os.path.join(lib_source_dir, 'libtiverify.so'), lib_dest_dir + 'libtiverify.so')

    I refactored such invocations into a code fragment using a file stream:


    info = zipfile.ZipInfo(lib_dest_dir + 'libtiverify.so')
    info.compress_type = zipfile.ZIP_DEFLATED
    info.create_system = 3
    tf = open(os.path.join(lib_source_dir, 'libtiverify.so'), 'rb')
    apk_zip.writestr(info, tf.read())
    tf.close()

    The above code fragment ignores the timestamp of the files to be packaged and uses the current time instead, thus fixing the issue with files that reside in the Nix store.
  • There were two ELF executables (titanium_prep.{linux32,linux64}) in the distribution. To be able to run them under NixOS, I had to patch them so that the dynamic linker can be found:


    $ patchelf --set-interpreter ${stdenv.gcc.libc}/lib/ld-linux-x86-64.so.2 \
    titanium_prep.linux64
  • The Android builder script (mobilesdk/*/android/builder.py) requires the sqlite3 python module and the Java Development Kit. Since dependencies do not reside in standard locations in Nix, I had to wrap the builder script to allow it to find them:


    mv builder.py .builder.py
    cat > builder.py <<EOF
    #!${python}/bin/python

    import os, sys

    os.environ['PYTHONPATH'] = '$(echo ${python.modules.sqlite3}/lib/python*/site-packages)'
    os.environ['JAVA_HOME'] = '${jdk}/lib/openjdk'

    os.execv('$(pwd)/.builder.py', sys.argv)
    EOF

    Although the Nixpkgs collection has a standard function (wrapProgram) to easily wrap executables, I could not use it, because this function turns any executable into a shell script. The Titanium CLI expects that this builder script is a Python script and will fail if there is a shell code around it.
  • The iOS builder script (mobilesdk/osx/*/iphone/builder.py) invokes ditto to do a recursive copy of a directory hierarchy. However, this executable cannot be found in a Nix builder environment, since the PATH environment variable is set to only the dependencies that are specified. The following command fixes it:


    $ sed -i -e "s|ditto|/usr/bin/ditto|g" \
    $out/mobilesdk/osx/*/iphone/builder.py
  • When building IPA files for iOS devices, the Titanium CLI invokes xcodebuild, that in turn invokes the Titanium CLI again. However, it does not seem to propagate all parameters properly, such as the path to the CLI's configuration file. The following modification allows me to set an environment variable called: NIX_TITANIUM_WORKAROUND providing additional parameters to work around it:


    $ sed -i -e "s|--xcode|--xcode '+process.env['NIX_TITANIUM_WORKAROUND']+'|" \
    $out/mobilesdk/osx/*/iphone/cli/commands/_build.js

Building Titanium Apps


Besides getting the Titanium CLI and SDK packaged in Nix, we must also be able to build Titanium apps. Apps can be built for various target platforms and come in several variants.

For some unknown reason, the Titanium CLI (in contrast to the old Python build script) forces people to log in with their Appcelerator account before any build task can be executed. However, I discovered that after logging in, a file is written into the ~/.titanium folder indicating that the system has logged in. I can simulate logins by creating this file myself:


export HOME=$TMPDIR

mkdir -p $HOME/.titanium
cat > $HOME/.titanium/auth_session.json <<EOF
{ "loggedIn": true }
EOF

We also have to tell the Titanium CLI where the Titanium SDK can be found. The following command-line instruction updates the config to provide the path to the SDK that we have just packaged:


$ echo "{}"> $TMPDIR/config.json
$ titanium --config-file $TMPDIR/config.json --no-colors \
config sdk.defaultInstallLocation ${titaniumsdk}

The Titanium SDK also contains a collection of prebuilt modules, such as one to connect to Facebook. To allow them to be found, I run the following command line instruction to adapt the module search path:


$ titanium --config-file $TMPDIR/config.json --no-colors \
config paths.modules ${titaniumsdk}

I have also noticed that if the SDK version specified in a Titanium project file (tiapp.xml) does not match the version of the installed SDK, the Titanium CLI halts with an exception. Of course, the version number in a project file can be adapted, but in my opinion, it's more flexible to just be able to take any version. The following instruction replaces the version inside tiapp.xml with something else:


$ sed -i -e "s|<sdk-version>[0-9a-zA-Z\.]*</sdk-version>|<sdk-version>${tiVersion}</sdk-version>|" tiapp.xml

Building Android apps from Titanium projects


For Android builds, we must tell the Titanium CLI where to find the Android SDK. The following command-line instruction adds its location to the config file:


$ titanium config --config-file $TMPDIR/config.json --no-colors \
android.sdkPath ${androidsdkComposition}/libexec/android-sdk-*

The variable: androidsdkComposition refers to an Android SDK plugin composition provided by the Android SDK Nix expressions I have developed earlier.

After performing the previous operation, the following command-line instruction can be used to build a debug version of an Android app:


$ titanium build --config-file $TMPDIR/config.json --no-colors --force \
--platform android --target emulator --build-only --output $out

If the above command succeeds, an APK bundle called app.apk is placed in the Nix store output folder. This bundle contains all the project's JavaScript code and is signed with a developer key.
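
If desired, the resulting bundle can be inspected with standard tools (an optional check, assuming the build was invoked through nix-build, which leaves a result symlink):

$ jarsigner -verify result/app.apk # check the developer signature
$ unzip -l result/app.apk # list the packaged contents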

The following command produces a release version of the APK (meant for submission to the Play Store) in the Nix store output folder, with a given key store, key alias and key store password:


$ titanium build --config-file $TMPDIR/config.json --no-colors --force \
--platform android --target dist-playstore --keystore ${androidKeyStore} \
--alias ${androidKeyAlias} --password ${androidKeyStorePassword} \
--output-dir $out

Before the JavaScript files are packaged along with the APK file, they are first passed through Google's Closure Compiler, which performs some static checking, removes dead code, and minifies all the source files.

Building iOS apps from Titanium projects


Apart from Android, we can also build iOS apps from Titanium projects.

I have discovered that while building for iOS, the Titanium CLI invokes xcodebuild which in turn invokes the Titanium CLI again. However, it does not propagate the --config-file parameter, causing it to fail. The earlier hack that I made in the SDK expression with the environment variable can be used to circumvent this:


export NIX_TITANIUM_WORKAROUND="--config-file $TMPDIR/config.json"

After applying the workaround, building an app for the iPhone simulator is straightforward:


$ cp -av * $out
$ cd $out

$ titanium build --config-file $TMPDIR/config.json --force --no-colors \
--platform ios --target simulator --build-only \
--device-family universal --output-dir $out

After running the above command, the simulator executable is placed into the output Nix store folder. It turns out that the JavaScript files of the project folder are symlinked into the folder of the executable. However, after the build has completed these symlink references will become invalid, because the temp folder has been deleted. To allow the app to find these JavaScript files, I simply copy them along with the executable into the Nix store.

Finally, the most complicated task is producing IPA bundles to deploy an app to a device for testing or to the App Store for distribution.

Like native iOS apps, they must be signed with a certificate and mobile provisioning profile. I used the same trick described in an earlier blog post on building iOS apps with Nix to generate a temporary keychain in the user's home directory for this:


export HOME=/Users/$(whoami)
export keychainName=$(basename $out)

security create-keychain -p "" $keychainName
security default-keychain -s $keychainName
security unlock-keychain -p "" $keychainName
security import ${iosCertificate} -k $keychainName -P "${iosCertificatePassword}" -A

provisioningId=$(grep UUID -A1 -a ${iosMobileProvisioningProfile} | grep -o "[-A-Z0-9]\{36\}")

if [ ! -f "$HOME/Library/MobileDevice/Provisioning Profiles/$provisioningId.mobileprovision" ]
then
mkdir -p "$HOME/Library/MobileDevice/Provisioning Profiles"
cp ${iosMobileProvisioningProfile} \
"$HOME/Library/MobileDevice/Provisioning Profiles/$provisioningId.mobileprovision"
fi

I also discovered that builds fail because some file (the facebook module) from the SDK cannot be read (Nix makes deployed packages read-only). I circumvented this issue by making a copy of the SDK in my temp folder, fixing the file permissions, and configuring the Titanium CLI to use the copied SDK instance:


cp -av ${titaniumsdk} $TMPDIR/titaniumsdk

find $TMPDIR/titaniumsdk | while read i
do
chmod 755 "$i"
done

titanium --config-file $TMPDIR/config.json --no-colors \
config sdk.defaultInstallLocation $TMPDIR/titaniumsdk

Because I cannot use the temp folder as a home directory, I also have to simulate a login again:


$ mkdir -p $HOME/.titanium
$ cat > $HOME/.titanium/auth_session.json <<EOF
{ "loggedIn": true }
EOF

Finally, I can build an IPA by running:


$ titanium build --config-file $TMPDIR/config.json --force --no-colors \
--platform ios --target dist-adhoc --pp-uuid $provisioningId \
--distribution-name "${iosCertificateName}" \
--keychain $HOME/Library/Keychains/$keychainName \
--device-family universal --output-dir $out

The above command-line invocation minifies the JavaScript code, builds an IPA file with a given certificate, mobile provisioning profile and authentication credentials, and puts the result in the Nix store.

Example: KitchenSink


I have encapsulated all the build commands shown in the previous section into a Nix function called: titaniumenv.buildApp {}. To test the usefulness of this function, I took KitchenSink, an example app provided by Appcelerator, to show Titanium's abilities. The app can be deployed to all target platforms that the SDK supports.

To package KitchenSink, I wrote the following expression:


{ titaniumenv, fetchgit
, target, androidPlatformVersions ? [ "11" ], release ? false
}:

titaniumenv.buildApp {
  name = "KitchenSink-${target}-${if release then "release" else "debug"}";
  src = fetchgit {
    url = https://github.com/appcelerator/KitchenSink.git;
    rev = "d9f39950c0137a1dd67c925ef9e8046a9f0644ff";
    sha256 = "0aj42ac262hw9n9blzhfibg61kkbp3wky69rp2yhd11vwjlcq1qc";
  };
  tiVersion = "3.2.1.GA";

  inherit target androidPlatformVersions release;

  androidKeyStore = ./keystore;
  androidKeyAlias = "myfirstapp";
  androidKeyStorePassword = "mykeystore";
}

The above function fetches the KitchenSink example from GitHub and builds it for a given target, such as iphone or android, and supports building a debug version for an emulator/simulator, or a release version for a device or for the Play store/App store.

By invoking the above function as follows, a debug version of the app for Android is produced:


import ./kitchensink {
  inherit (pkgs) fetchgit titaniumenv;
  target = "android";
  release = false;
}

The following function invocation produces an iOS executable that can be run in the iPhone simulator:


import ./kitchensink {
  inherit (pkgs) fetchgit titaniumenv;
  target = "iphone";
  release = false;
}

As may be observed, building KitchenSink through Nix is a straight forward process for most targets. However, the target producing an IPA version of KitchenSink that we can deploy to a real device is a bit complicated to use, because of some restrictions made by Apple.

Since all apps that are deployed to a real device have to be signed and the mobile provisioning profile should match the app's app id, this is sort of a problem. Luckily, I can do a comparable renaming trick as I have described earlier in a blog post about improving the testability of iOS apps. Simply executing the following commands in the KitchenSink folder was sufficient:


sed -i -e "s|com.appcelerator.kitchensink|${newBundleId}|" tiapp.xml
sed -i -e "s|com.appcelerator.kitchensink|${newBundleId}|" manifest

The above commands change the com.appcelerator.kitchensink app id into any other specified string. If this app id is changed to the corresponding id in a mobile provisioning profile, then you should be able to deploy KitchenSink to a real device.

I have added the above renaming procedure to the KitchenSink expression. The following example invocation of the earlier Nix function shows how we can rename the app's id to com.example.kitchensink and how to use a certificate and mobile provisioning profile for an existing app:


import ./kitchensink {
  inherit (pkgs) stdenv fetchgit titaniumenv;
  target = "iphone";
  release = true;
  rename = true;
  newBundleId = "com.example.kitchensink";
  iosMobileProvisioningProfile = ./profile.mobileprovision;
  iosCertificate = ./certificate.p12;
  iosCertificateName = "Cool Company";
  iosCertificatePassword = "secret";
}


By using the above expressions KitchenSink can be built for both Android and iOS. The left picture above shows what it looks like on iOS, the right picture shows what it looks like on Android.

Discussion


With the Titanium build function described in this blog post, I can automatically build Titanium apps for both iOS and Android using the Nix package manager, although it was quite painful to get it done and tedious to maintain.

What bothers me the most about this process is the fact that Appcelerator has crafted their own custom build tool with lots of complexity (in terms of code size), flaws (e.g. not propagating the CLI's arguments properly from xcodebuild) and weird issues (e.g. an odd way of detecting the presence of the JDK, and invoking the highly complicated legacy Python scripts), while there are already many more mature build solutions available that can do the same job.

A quick inspection of Titanium CLI's git repository shows me that it consists of 8174 lines of code. However, not all of their build stuff is there. Some common stuff, such as the JDK and Android detection stuff, resides in the node-appc project. Moreover, the build steps are performed by plugin scripts that are distributed with the SDK.

A minor annoyance is that the new Node.js based Titanium CLI requires Oracle's Java Development Kit to make Android builds work, while the old Python based build script worked fine with OpenJDK. I have no idea yet how to fix this. Since we cannot provide a Nix expression that automatically downloads Oracle's JDK (due to license restrictions), Nix users are forced to manually download and import it into the Nix store first, before any of the Titanium stuff can be built.

So how did I manage to figure all this mess out?

Besides knowing that I have to patch executables, fix shebangs and wrap certain executables, the strace command on Linux helps me out a lot, since it shows me things like files that cannot be opened. Moreover, Python and Node.js show me error traces with line numbers when something goes wrong, so that I can easily debug what's going on.
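
For example, a hypothetical invocation that surfaces the files a build tool tries but fails to open could look like this (the exact build command is an assumption; any command can be substituted):


$ strace -f -e trace=open,stat titanium build --platform android 2>&1 | grep ENOENT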

However, since I also have to do builds on Mac OS X for iOS devices, I observed that there is no strace to ease my pain on that particular operating system. Fortunately, I discovered that there is a similar tool called dtruss, that provides me similar data regarding system calls.

There is one minor annoyance with dtruss -- it requires super-user privileges to work. Fortunately, thanks to this MacWorld article, I can fix this by setting the setuid bit on the dtrace executable:


$ sudo chmod u+s /usr/sbin/dtrace

Now I can conveniently use dtruss in unprivileged build environments on Mac OS X to investigate what's going on.
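
A hypothetical invocation, comparable to the strace one shown earlier, would be:


$ dtruss -f xcodebuild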

Availability


The Titanium build environment as well as the KitchenSink example are part of Nixpkgs.

The top-level expression for the KitchenSink example, as well as the build operations described earlier, is located in pkgs/development/mobile/titaniumenv/examples/default.nix. To build a debug version of KitchenSink for Android, you can run:


$ nix-build -A kitchensink_android_debug

The release version can be built by running:


$ nix-build -A kitchensink_android_release

The iPhone simulator version can be built by running:


$ nix-build -A kitchensink_ios_development

Building an IPA is slightly more complicated. You have to provide a certificate and mobile provisioning profile, and some renaming trick settings as parameters to make it work (which should, of course, match what's inside the mobile provisioning profile that is actually used):


$ nix-build --arg rename true \
    --argstr newBundleId com.example.kitchensink \
    --arg iosMobileProvisioningProfile ./profile.mobileprovision \
    --arg iosCertificate ./certificate.p12 \
    --argstr iosCertificateName "Cool Company" \
    --argstr iosCertificatePassword secret \
    -A kitchensink_ipa

There are also a couple of emulator jobs to easily spawn an Android emulator or iPhone simulator instance.

Currently, iOS and Android are the only target platforms supported. I did not investigate Blackberry, Tizen or Mobile web applications.

Deploying target-specific services with Disnix

As explained in a previous blog post, Disnix's purpose is to be a distributed service deployment tool -- it deploys systems that are composed of distributable components (called services) that may have dependencies on each other into networks of machines having various characteristics.

The definition of a service in a Disnix context is not very strict. Basically, a service can take almost any form, such as a web service, a web application, a UNIX process, or even an entire NixOS configuration.

Apart from the fact that we can deploy various kinds of services, they have another important characteristic from a deployment perspective. By default, services are target-agnostic, which means that they always have the same form regardless of the machine in the network they are deployed to. In most cases this is considered a good thing.

However, there are also situations in which we want to deploy services that are built and configured specifically for a target machine. In this blog post, I will elaborate on this problem and describe how target-specific services can be deployed with Disnix.

Target-agnostic service deployment


Why are services target-agnostic by default in Disnix?

This property actually stems from the way "ordinary packages" are built with the Nix package manager which is used as a basis for Disnix.

As explained earlier, Nix package builds are influenced by its declared inputs only, such as the source code, build scripts and other kinds of dependencies, e.g. a compiler and libraries. Nix has means to ensure that undeclared dependencies cannot influence a build and that dependencies never collide with each other.

As a result, builds are reliable and reproducible. For example, it does not matter where the build of a package is performed. If the inputs are the same, then the corresponding outcome will be the same as well. (As a sidenote: there are some caveats, but in general there are no observable side effects). Also, it provides better guarantees that, for example, if I have built and tested a program on my machine, it will work on a different machine as well.

Moreover, since it does not matter where a package has been built, we can, for example, also download a package built from identical inputs from a remote location instead of building it ourselves, improving the efficiency of deployment processes.

In Disnix, Nix's concept of building packages has been extended to services in a distributed setting. The major difference between a package and a service is that services take an additional class of dependencies into account. Besides the intra-dependencies that Nix manages, services may also have inter-dependencies on services that may be deployed to remote machines in a network. Disnix can be used to configure services in such a way that a service knows how to reach its inter-dependencies and that the system is activated and deactivated in the right order.

As a consequence, Disnix does not take a machine's properties into account when deploying a service to a target machine in the network, unless those properties are explicitly provided as dependencies of the service.
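
To make this concrete: a service declaration in a Disnix services model typically looks as follows (a sketch borrowing names from the StaffTracker example; the inter-dependency names are purely illustrative, and the attributes are the same ones used for the nginx services later in this post):


StaffTracker = {
  name = "StaffTracker";
  pkg = customPkgs.StaffTracker; # intra-dependencies are resolved by Nix
  dependsOn = {
    # inter-dependencies, possibly deployed to remote machines
    inherit staff zipcodes;
  };
  type = "tomcat-webapplication";
};

Nothing in this declaration refers to a specific target machine -- the mapping of services to machines is specified separately in the distribution model, which is what makes a service target-agnostic by default.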

In many cases, this is a good thing. For example, the following image shows a particular deployment scenario of the ridiculous StaffTracker example (described in some of my research publications and earlier blog posts):


The above image describes a deployment scenario in which we have deployed services (denoted by the ovals) to two machines in a network (denoted by the grey boxes). The arrows denote inter-dependency relationships.

One of the things we could do is change the location of the StaffTracker web application front-end service, by changing the following line in the distribution model:

StaffTracker = [ infrastructure.test2 ];

to:

StaffTracker = [ infrastructure.test1 ];

Redeploying the system yields the following deployment architecture:


Performing the redeployment procedure is actually quite efficient. Since the intra-dependencies and inter-dependencies of the StaffTracker service have not changed, we do not have to rebuild and reconfigure the StaffTracker service. We can simply take the existing build result from the coordinator machine (that has been previously distributed to machine test1) and distribute it to test2.

Also, because the build result is the same, we have better guarantees that if the service worked on machine test1, it should work on machine test2 as well.
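
Performing the redeployment itself boils down to invoking disnix-env with the three Disnix models again (a typical invocation; the file names are assumptions):


$ disnix-env -s services.nix -i infrastructure.nix -d distribution.nix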

(As a sidenote: there is actually a situation in which a service will get rebuilt when moving it from one machine to another while its intra-dependencies and inter-dependencies have not changed.

Disnix also supports heterogeneous service deployment, meaning that the target machines may have different CPU architectures and operating systems. For example, if test2 were a Linux machine and test1 a Mac OS X machine, Disnix attempts to rebuild the service for the new platform.

However, if all machines have the same CPU architecture and operating system, this will not happen).

Deploying target-specific services


Target-agnostic services are generally considered good because they improve reproducibility and efficiency when moving a service from one machine to another. However, in some situations you may need to configure a service specifically for a target machine.

An example of a deployment scenario in which we need to deploy target-specific services, is when we want to deploy a collection of Node.js web applications and an nginx reverse proxy in which each web application should be reached by its own unique DNS domain name (e.g. http://webapp1.local, http://webapp2.local etc.).

We could model the nginx reverse proxy and each web application as (target-agnostic) distributable services, and deploy them in a network with Disnix as follows:


We can declare the web applications to be inter-dependencies of the nginx service and generate its configuration accordingly.

Although this approach works, the downside is that in the above deployment architecture, the test1 machine has to handle all the network traffic, including the requests that should be propagated to the web applications deployed to test2. This makes the system not very scalable, because only one machine is responsible for handling all the network load.

We can also deploy two redundant instances of the nginx service by specifying the following attribute in the distribution model:

nginx = [ infrastructure.test1 infrastructure.test2 ];

The above modification yields the following deployment architecture:


The above deployment architecture is more scalable -- now requests meant for any of the web applications deployed to machine test1 can be handled by the nginx server deployed to test1 and the nginx server deployed to test2 can handle all the requests meant for the web applications deployed to test2.

Unfortunately, there is also an undesired side effect. As all the nginx services have the same form regardless of the machines to which they have been deployed, they have inter-dependencies on all web applications in the entire network, including the ones that are not running on the same machine.

This property makes upgrading the system very inefficient. For example, if we update the webapp3 service (deployed to machine test2), the nginx configurations on all the other machines must be updated as well causing all nginx services on all machines to be upgraded, because they also have an inter-dependency on the upgraded web application.

In a two-machine scenario with four web applications, this inefficiency may still be acceptable, but in a big environment with tens of web applications and tens of machines, we most likely suffer from many (hundreds of) unnecessary redeployment activities, bringing the system down for an unnecessarily long time.

A more efficient deployment architecture would be the following:


We deploy two target-specific nginx services that only have inter-dependencies on the web applications deployed to the same machine. In this scenario, upgrading webapp3 does not affect the configurations of any of the services deployed to the test1 machine.

How to specify these target-specific nginx services?

A dumb way to do it is to define a service for each target in the Disnix services model:


{pkgs, system, distribution}:

let
  customPkgs = ...
in
rec {
  ...

  nginx-wrapper-test1 = rec {
    name = "nginx-wrapper-test1";
    pkg = customPkgs.nginx-wrapper;
    dependsOn = {
      inherit webapp1 webapp2;
    };
    type = "wrapper";
  };

  nginx-wrapper-test2 = rec {
    name = "nginx-wrapper-test2";
    pkg = customPkgs.nginx-wrapper;
    dependsOn = {
      inherit webapp3 webapp4;
    };
    type = "wrapper";
  };
}

And then distributing them to the appropriate target machines in the Disnix distribution model:


{infrastructure}:

{
  ...
  nginx-wrapper-test1 = [ infrastructure.test1 ];
  nginx-wrapper-test2 = [ infrastructure.test2 ];
}

Manually specifying target-specific services is quite tedious and laborious, especially if you have tens of services and tens of machines. We would have to write a configuration for every machine and service combination (machines × services), resulting in hundreds of target-specific service configurations.

Furthermore, there is a bit of repetition. Both the distribution model and the service models reflect mappings from services to target machines.

A better approach would be to generate target-specific services. An example of such an approach is to specify the mappings of these services in the distribution model first:


{infrastructure}:

let
  inherit (builtins) listToAttrs attrNames getAttr;
in
{
  webapp1 = [ infrastructure.test1 ];
  webapp2 = [ infrastructure.test1 ];
  webapp3 = [ infrastructure.test2 ];
  webapp4 = [ infrastructure.test2 ];
} //

# To each target, distribute a reverse proxy

listToAttrs (map (targetName: {
  name = "nginx-wrapper-${targetName}";
  value = [ (getAttr targetName infrastructure) ];
}) (attrNames infrastructure))

In the above distribution model, we statically map all the target-agnostic web application services, and for each target machine in the infrastructure model we generate a mapping of the target-specific nginx service to its target machine.
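
For the two machines in this example, the generated part of the distribution model roughly evaluates to the same mappings we wrote by hand earlier:


nginx-wrapper-test1 = [ infrastructure.test1 ];
nginx-wrapper-test2 = [ infrastructure.test2 ];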

We can generate the target-specific nginx service configurations in the services model as follows:


{system, pkgs, distribution, invDistribution}:

let
  customPkgs = import ../top-level/all-packages.nix {
    inherit pkgs system;
  };
in
{
  webapp1 = ...

  webapp2 = ...

  webapp3 = ...

  webapp4 = ...
} //

# Generate nginx proxy per target host

builtins.listToAttrs (map (targetName:
  let
    serviceName = "nginx-wrapper-${targetName}";
    servicesToTarget = (builtins.getAttr targetName invDistribution).services;
  in
  { name = serviceName;
    value = {
      name = serviceName;
      pkg = customPkgs.nginx-wrapper;
      # The reverse proxy depends on all services distributed to the same
      # machine, except itself (of course)
      dependsOn = builtins.removeAttrs servicesToTarget [ serviceName ];
      type = "wrapper";
    };
  }
) (builtins.attrNames invDistribution))

To generate the nginx services, we iterate over a so-called inverse distribution model mapping targets to services that has been computed from the distribution model (mapping services to one or more machines in the network).

The inverse distribution model is basically just the infrastructure model in which each target attribute set has been augmented with a services attribute containing the properties of the services that have been deployed to it. The services attribute refers to an attribute set in which each key is the name of the service and each value the service configuration properties defined in the services model:


{
  test1 = {
    services = {
      nginx-wrapper-test1 = ...
      webapp1 = ...
      webapp2 = ...
    };
    hostname = "test1";
  };

  test2 = {
    services = {
      nginx-wrapper-test2 = ...
      webapp3 = ...
      webapp4 = ...
    };
    hostname = "test2";
  };
}

For example, if we refer to invDistribution.test1.services we get all the configurations of the services that are deployed to machine test1. If we remove the reference to the nginx reverse proxy, we can pass this entire attribute set as inter-dependencies to configure the reverse proxy on machine test1. (The reason why we remove the reverse proxy as a dependency is because it is meaningless to let it refer to itself. Furthermore, this would also cause infinite recursion).

With this approach we can also easily scale up the environment. By simply adding more machines to the infrastructure model and additional web application service mappings to the distribution model, the service configurations in the services model get adjusted automatically, without requiring us to think about specifying inter-dependencies at all, as the example below illustrates.
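
For example (with hypothetical machine and service names), deploying two extra web applications to a third machine only requires extra mappings in the distribution model -- the corresponding nginx-wrapper-test3 service and its inter-dependencies are generated automatically:


webapp5 = [ infrastructure.test3 ];
webapp6 = [ infrastructure.test3 ];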

Conclusion


To make target-specific service deployment possible, you need to explicitly define service configurations for specific target machines in the Disnix services model and distribute them to the right targets.

Unfortunately, manually specifying target-specific services is quite tedious, inefficient and laborious, in particular in big environments. A better solution would be to generate the configurations of target-specific services.

To make generation more convenient, you may have to refer to the infrastructure model and you need to know which services are deployed to each target.

I have integrated the inverse distribution generation feature into the latest development version of Disnix and it will become part of the next Disnix release.

Moreover, I have developed yet another example package, called the Disnix virtual hosts example, to demonstrate how it can be used.

Setting up a basic software configuration management process in a small organization

I have been responsible for many things in my past and current career. Besides research and development, I have also been responsible for software and systems configuration management in small/medium sized companies, such as my current employer.

I have observed that in organizations like these, configuration management (CM) is typically nobody's (full) responsibility. Preferably, people want to stick themselves to their primary responsibilities and typically carry out change activities in an ad-hoc and unstructured way.

Not properly implementing changes has a number of serious implications. For example, some problems I have encountered are:

  • Delays. There are many factors that will unnecessarily increase the time it will take to implement a change. Many of my big delays were caused by the fact that I always have to search for all the relevant installation artifacts, such as documentation, installation discs, and so on. I have also encountered many times that artifacts were missing requiring me to obtain copies elsewhere.
  • Errors. Any change could potentially lead to errors for many kinds of reasons. For example, implementing a set of changes in the wrong order could break a system. Also, the components of which a system consists may have complex dependencies on other components that have to be met. Quite often, it is not fully clear what the dependencies of a component or system are, especially when documentation is incomplete or lacking.

    Moreover, after having solved an error, you need to remember many arbitrary things, such as workarounds, that tend to become forgotten knowledge over time.
  • Disruptions. When implementing changes, a system may be partially or fully unavailable until all changes have been implemented. Preferably this time window should be as short as possible. Unfortunately, the inconsistency time window most likely becomes quite big when the configuration management process is not optimal or subject to errors.

It does not matter if an organization is small or big, but these problems cost valuable time and money. To alleviate these problems, it is IMO unavoidable to have a structured way of carrying out changes so that a system maintains its integrity.

Big organizations typically have information systems, people and management procedures to support structured configuration management, because failures are typically too costly for them. There are also standards available (such as the IEEE 828-2012 Standard for Configuration Management in Systems and Software Engineering) that they may use as a reference for implementing a configuration management process.

However, in small organizations people typically refrain from thinking about a process at all while they keep suffering from the consequences, because they think they are too small for it. Furthermore, they find it too costly to invest in people or an information system supporting configuration management procedures. Consulting a standard is something that is generally considered a leap too far.

In this blog post, I will describe a very basic software configuration management process I have implemented at my current employer.

The IEEE Standard for Configuration Management


As crazy as this may sound, I have used the IEEE 828-2012 standard as a reference for my implementation. The reason why I consulted this standard, besides the fact that using an existing and reasonably well-established reference is good, is that I was already familiar with it because of my previous background as a PhD researcher in software deployment.

The IEEE standard defines a framework of so-called "lower-level processes" from which a configuration management process can be derived. The most important lower-level processes are:

  • CM identification, which concerns identifying, naming, describing, versioning, storing and retrieving configuration items (CIs) and their baselines.
  • CM change control is about planning, requesting, approval and verification of change activities.
  • CM status accounting is about identifying the status of CIs and change requests.
  • CM auditing concerns identifying, tracing and reporting discrepancies with regards to the CIs and their baselines.
  • CM release management is about planning, defining a format for distribution, delivery, and archival of configuration items.

All the other lower-level processes have references to the lower-level processes listed above. CM planning is basically about defining a plan how to carry out the above activities. CM management is about actually installing the tools involved, executing the process, monitoring its progress, and status and revising/optimizing the plan if any problem occurs.

The remaining two lower-level processes concern outside involvement -- Supplier configuration item control concerns CIs that are provided by external parties. Interface control concerns the implications of configuration changes that concern external parties. I did not take the traits of these lower-level processes into account in the implementation.

Implementing a configuration management process


Implementing a configuration management process (according to the IEEE standard) starts by developing a configuration management plan. The standard mentions many planning aspects, such as identifying the information needs, reporting needs, the reporting frequency, and the information needed to manage CM activities. However, from my perspective, in a small organization many of these aspects are difficult to answer in advance, in particular the information needs.

As a rule of thumb, I think that when you do not exactly know what is needed, consider that the worst thing could happen -- the office burns down and everything gets lost. What does it take to reproduce the entire configuration from scratch?

This is how I have implemented the main lower-level processes:

CM identification


A configuration item is any structural unit that is distinguishable and configurable. In our situation, the most important kind of configuration item is a machine configuration (e.g. a physical machine or a virtual machine hosted in an IaaS environment, such as Amazon EC2), or a container configuration (such as an Apache Tomcat container or PaaS service, such as Elastic Beanstalk).

Machines/containers belong to an environment. Some examples of environments that we currently maintain are: production, containing the configurations of the production machines of our service; test, containing the configurations of the test environment; and internal, containing the configurations of our internal IT infrastructure, such as our internal Hydra build cluster and other peripherals, such as routers and printers.

Machines run a number of applications that may have complex installation procedures. Moreover, identical/similar application configurations may have to be deployed to multiple machines.

For storing the configurations of the CIs, I have set up a Git repository that follows a specific directory structure of three levels:


<environment>/<machine | container>/<application>
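
For example, a small repository following this convention might look like this (the environment, machine and application names are hypothetical):


production/
  README.md
  webserver1/
    README.md
    apache-tomcat/
      README.md
internal/
  README.md
  _common/
  buildserver/
    README.md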

Each directory contains all artifacts (e.g. keys, data files, configuration files, scripts, documents etc.) required to reproduce a CI's configuration from scratch. Moreover, each directory has a README.md markdown file:

  • The top-level README.md describes which environments are available and what their purposes are.
  • The environment-level README.md describes which machines/containers are part of it, a brief description of their purpose, and a picture showing how they are connected. I typically use Dia to draw them, because the tool is simple and free.
  • The machine-level README.md describes the purpose of the machine and the activities that must be carried out to reproduce its configuration from scratch.
  • The application-level README.md captures the steps that must be executed to reproduce an application's configuration.

When storing artifacts and writing README.md files, I try to avoid duplication as much as possible, because it makes it harder to keep the repository consistent and maintainable:

  • README.md files are not supposed to be tool manuals. I mention the steps that must be executed and what their purposes are, but I avoid explaining how a tool works. That is the purpose of the tool's manual.
  • When there are common files used among machines, applications or environments, I do not duplicate them. Instead, I use a _common/ folder that I put one directory level higher. For example, the _common/ folder in the internal/ directory contains shared artifacts that are supposed to be reused among all machines belonging to our internal IT infrastructure.
  • I also capture common configuration steps in a separate README.md and refer to it, instead of duplicating the same steps in multiple README.md files.

Because I use Git, versioning, storage and retrieval of configurations is implicit. I do not have to invent something myself or think too much about it. For example, I do not have to manually assign version numbers to CIs, because Git already computes them for each change. Moreover, because I use textual representations of most of the artifacts I can also easily compare versions of the configurations.
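
For example, reviewing what has changed in a particular machine's configuration is a matter of running an ordinary Git command (the path is hypothetical):


$ git diff HEAD~1 -- internal/buildserver/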

Furthermore, besides capturing and storing all the prerequisites to reproduce a configuration, I also try to automate this process as much as possible. For most of the automation aspects, I use tools from the Nix project, such as the Nix package manager for individual packages, NixOS for system configurations, Disnix for distributed services, and NixOps for networks of machines.

Tools in the Nix project are driven by declarative specifications -- a specification captures the structure of a system, e.g. their components and their dependencies. From this specification the entire deployment process will be derived, such as building the components from source code, distributing them to the right machines in the network, and activating them in the right order.

Using a declarative deployment approach saves me from having to write down the activities to carry out, because they are implicit. Also, there is no need to describe the structure of the system because it is already captured in the deployment specification.
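
As an illustration (a minimal sketch with made-up host details, not our actual configuration), a NixOps network model that declaratively captures a single machine could look like this:


{
  network.description = "Internal IT infrastructure";

  buildserver = { config, pkgs, ... }: {
    deployment.targetHost = "192.168.1.5"; # assumed address
    services.openssh.enable = true;
  };
}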

Unfortunately, not all machines' deployment processes can be fully automated with Nix deployment tools (e.g. non-Linux machines and special-purpose peripherals, such as routers), still requiring me to carry out some configuration activities manually.

CM change control


Implementing changes may cause disruptions costing time and money. That is why the right people must be informed and approval is needed. Big organizations typically have sophisticated management procedures including request and approval forms, but in a small organization it typically suffices to notify people informally before implementing a change.

Besides notifying people, I also take the following things into account while implementing changes:

  • Some configuration changes require validation, including review and testing, before they can actually be implemented in production. I typically keep the master Git branch in a releasable state, meaning that it is ready to be deployed into production. Any changes that require explicit validation go into a different branch first.

    Moreover, when using tools from the Nix project it is relatively easy to reliably test changes first by deploying a system into a test environment, or by spawning virtual NixOS machines in which integration tests can be executed.
  • Sometimes you need to make an analysis of the impact and costs that a change would bring. Up-to-date and consistent documentation of the CIs including their dependencies makes this process more reliable.

    Furthermore, with tools from the Nix project you can also make better estimations by executing a dry-run deployment process -- the dry run shows what activities will be carried out without actually executing them or bringing the system down.
  • After a change has been deployed, we also need to validate whether the new configuration is correct. Typically, this requires testing.

    Tools from the Nix project support congruent deployment, meaning that if the deployment has succeeded, the actual configuration is guaranteed to match the deployment specification for the static parts of a system, giving better guarantees about its validity.
  • Also you have to pick the right moment to implement potentially disrupting changes. For example, it is typically a bad idea to do this while your service is under peak load.

CM status accounting


It is also highly desirable to know what the status of the CIs and the change activities are. The IEEE standard goes quite far in this. For example, the overall state of the system may converge into a certain direction (e.g. in terms of features complete, error ratios etc.), which you continuously want to measure and report about. I think that in a small organization these kinds of status indicators are typically too complex to define and measure, in particular in the early stages of a development process.

However, I think the most important status indicator that you never want to lose track of is the following: does everything (still) work?

There are two facilities that help me out a lot in keeping a system in working state:

  • Automating deployment with tools from the Nix project ensures that the static parts of a deployed system are congruent with the deployment configuration specification and that deployments are atomic -- a deployment is either in the old configuration or in the new configuration, but never in an inconsistent mix of the two. As a result, we have fewer broken configurations as a result of (re)deploying a system.
  • We must also observe a system's runtime behavior and take action if things grow out of hand, for example, when a machine runs out of system resources.

    Using a monitoring service, such as Zabbix or Datadog, helps me a lot in accomplishing this. They can also be used to configure alarms that warn you when things become critical.

CM auditing


Another important aspect is the integrity of the configurations repository. How can we be sure that what is stored inside the repository matches the actual configurations and that the configuration procedures still work?

Fortunately, because we use tools from the Nix project, there is relatively little audit work we need to do. With Nix-related tools the deployment process is fully automated. As a consequence, we need to adapt the deployment specification when we intend to make changes. Moreover, since the deployment specifications of Nix-related tools are congruent, we know that the static parts of a system are guaranteed to match the actual configurations if the (re)deployment process succeeded.

However, for non-NixOS machines and other peripherals, we must still manually check once in a while whether the intended configuration matches the actual one. I made it a habit to go through them once a month and to adjust the documentation if any discrepancies were found.

CM release management


When updating a configuration file, we must also release the new corresponding configuration items. The IEEE standard describes many concerns, such as approval procedures, requirements on the distribution mechanism and so on.

For me, most of these concerns are unimportant, especially in a small organization. The only thing that matters to me is that a release process is fully automated, reliable and reproducible. Fortunately, the deployment tools from the Nix project support these properties quite well.

Discussion


In this blog post, I have described a basic configuration management process that I have implemented in a small organization.

Some people will probably argue that defining a CM process in a small organization looks crazy. Some people think they do not need a process and that it is too much of an investment. Following an IEEE standard is generally considered a leap too far.

In my opinion, however, the barrier to implementing a CM process is actually not that high. From my experience, the biggest investment is setting up a configuration management repository. Although big organizations typically have sophisticated information systems, I have also shown that with a simple filesystem layout and a collection of free and open-source tools (e.g. Git, Dia, Nix), a simple variant of such a repository can be set up with relatively little effort.

I also observed that automating CM tasks helps a lot, in particular using a declarative and congruent deployment approach, such as Nix. With a declarative approach, configuration activities are implicit (they are a consequence of applying a change in the deployment specification) and do not have to be documented. Furthermore, because Nix's deployment models are congruent, the static aspects of a configuration are guaranteed to match the deployment specifications. Moreover, the deployment model serves as the documentation, because it captures the structure of a system.

So how beneficial is setting up a CM process in a small organization? I observed many benefits. For example, a major benefit is that I can carry out many CM tasks much faster. I no longer have to waste much of my time looking for configuration artifacts and documentation. Also, because the steps to carry out are documented or automated, there are fewer things I need to (re)discover or solve while implementing a change.

Another benefit is that I can more reliably estimate the impact of implementing changes, because the CIs and their relationships are known. More knowledge also results in fewer errors.

Although a simple CM approach provides benefits and many aspects can be automated, it always requires discipline from all people involved. For example, when errors are discovered and configurations must be modified in a stressful situation, it is very tempting to bypass updating the documentation.

Moreover, communication is also an important aspect. For example, when notifying people of a potentially disrupting change, clear communication is required. Typically, also non-technical stakeholders must be informed. Eventually, you have to start developing formalized procedures to properly handle decision processes.

Finally, the CM approach described in this blog post is obviously too limited if a company grows. If an organization gets bigger, a more sophisticated and more formalized CM approach will be required.

Deploying prebuilt binary software with the Nix package manager

As described in a number of older blog posts, Nix is primarily a source based package manager -- it constructs packages from source code by executing their build procedures in isolated environments in which only specified dependencies can be found.

As an optimization, it provides transparent binary deployment -- if a package that has been built from the same inputs exists elsewhere, it can be downloaded from that location instead of being built from source, improving the efficiency of deployment processes.

Because Nix is a source based package manager, the documentation mainly describes how to build packages from source code. Moreover, the Nix expressions are written in such a way that they can be included in the Nixpkgs collection, a repository containing build recipes for more than 2500 packages.

Although the manual contains some basic packaging instructions, I noticed that a few practical bits were missing. For example, how to package software privately, outside the Nixpkgs tree, is not clearly described, which makes experimentation a bit less convenient, in particular for newbies.

Despite being a source package manager, Nix can also be used to deploy binary software packages (i.e. software for which no source code and build scripts have been provided). Unfortunately, getting prebuilt binaries to run properly is quite tricky. Furthermore, apart from some references, there are no examples in the manual describing how to do this either.

Since I am receiving too many questions about this lately, I have decided to write a blog post about it covering two examples that should be relatively simple to repeat.

Why prebuilt binaries will typically not work


Prebuilt binaries deployed by Nix typically do not work out of the box. For example, if we want to deploy a simple binary package such as pngout (only containing a set of ELF executables) we may initially think that copying the executable into the Nix store suffices:

with import <nixpkgs> {};

stdenv.mkDerivation {
  name = "pngout-20130221";

  src = fetchurl {
    url = http://static.jonof.id.au/dl/kenutils/pngout-20130221-linux.tar.gz;
    sha256 = "1qdzmgx7si9zr7wjdj8fgf5dqmmqw4zg19ypg0pdz7521ns5xbvi";
  };

  installPhase = ''
    mkdir -p $out/bin
    cp x86_64/pngout $out/bin
  '';
}

However, when we build the above package:

$ nix-build pngout.nix

and attempt to run the executable, we stumble upon the following error:

$ ./result/bin/pngout
bash: ./result/bin/pngout: No such file or directory

The above error is quite strange -- the corresponding file resides in exactly the specified location yet it appears that it cannot be found!

The actual problem is not that the executable is missing, but one of its dependencies. Every ELF executable that uses shared libraries consults the dynamic linker/loader (that typically resides in /lib/ld-linux.so.2 on x86 Linux platforms and /lib/ld-linux-x86-64.so.2 on x86-64 Linux platforms) to provide the shared libraries it needs. This path is hardwired into the ELF executable, as can be observed by running:

$ readelf -l ./result/bin/pngout 

Elf file type is EXEC (Executable file)
Entry point 0x401160
There are 8 program headers, starting at offset 64

Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000400040 0x0000000000400040
0x00000000000001c0 0x00000000000001c0 R E 8
INTERP 0x0000000000000200 0x0000000000400200 0x0000000000400200
0x000000000000001c 0x000000000000001c R 1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x000000000001593c 0x000000000001593c R E 200000
LOAD 0x0000000000015940 0x0000000000615940 0x0000000000615940
0x00000000000005b4 0x00000000014f9018 RW 200000
DYNAMIC 0x0000000000015968 0x0000000000615968 0x0000000000615968
0x00000000000001b0 0x00000000000001b0 RW 8
NOTE 0x000000000000021c 0x000000000040021c 0x000000000040021c
0x0000000000000044 0x0000000000000044 R 4
GNU_EH_FRAME 0x0000000000014e5c 0x0000000000414e5c 0x0000000000414e5c
0x00000000000001fc 0x00000000000001fc R 4
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 8

In NixOS, most parts of the system are stored in a special purpose directory called the Nix store (i.e. /nix/store) including the dynamic linker. As a consequence, the dynamic linker cannot be found because it resides elsewhere.

Another reason why most binaries will not work is because they must know where to find their required shared libraries. In most conventional Linux distributions these reside in global directories (e.g. /lib and /usr/lib). In NixOS, these folders do not exist. Instead, every package is stored in isolation in a separate folder in the Nix store.

Why compilation from source works


In contrast to prebuilt ELF binaries, binaries produced by a source build in a Nix build environment work out of the box typically without problems (i.e. they often do not require any special modifications in the build procedure). So why is that?

The "secret" is that the linker (that gets invoked by the compiler) has been wrapped in the Nix build environment -- if we invoke ld, then we actually end up using a wrapper: ld-wrapper that does a number of additional things besides the tasks the linker normally carries out.

Whenever we supply a library to link to, the wrapper appends an -rpath parameter providing its location. Furthermore, it appends the path to the dynamic linker/loader (-dynamic-linker) so that the resulting executable can load the shared libraries on startup.

For example, when producing an executable, the compiler may invoke the following command that links a library to a piece of object code:

$ ld test.o -lz -o test

in reality, ld has been wrapped and executes something like this:

$ ld test.o -lz \
-rpath /nix/store/31w31mc8i...-zlib-1.2.8/lib \
-dynamic-linker \
/nix/store/hd6km3hscb...-glibc-2.21/lib/ld-linux-x86-64.so.2 \
...
-o test

As may be observed, the wrapper transparently appends the path to zlib as an RPATH parameter and provides the path to the dynamic linker.

The RPATH attribute is basically a colon separated string of paths in which the dynamic linker looks for its shared dependencies. The RPATH is hardwired into an ELF binary.

Consider the following simple C program (test.c) that displays the version of the zlib library that it links against:

#include <stdio.h>
#include <zlib.h>

int main()
{
    printf("zlib version is: %s\n", ZLIB_VERSION);
    return 0;
}

With the following Nix expression we can compile an executable from it and link it against the zlib library:

with import <nixpkgs> {};

stdenv.mkDerivation {
  name = "test";
  buildInputs = [ zlib ];
  buildCommand = ''
    gcc ${./test.c} -lz -o test
    mkdir -p $out/bin
    cp test $out/bin
  '';
}

When we build the above package:

$ nix-build test.nix

and inspect the program headers of the ELF binary, we can observe that the dynamic linker (program interpreter) corresponds to an instance residing in the Nix store:

$ readelf -l ./result/bin/test 

Elf file type is EXEC (Executable file)
Entry point 0x400680
There are 9 program headers, starting at offset 64

Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000400040 0x0000000000400040
0x00000000000001f8 0x00000000000001f8 R E 8
INTERP 0x0000000000000238 0x0000000000400238 0x0000000000400238
0x0000000000000050 0x0000000000000050 R 1
[Requesting program interpreter: /nix/store/hd6km3hs...-glibc-2.21/lib/ld-linux-x86-64.so.2]
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x000000000000096c 0x000000000000096c R E 200000
LOAD 0x0000000000000970 0x0000000000600970 0x0000000000600970
0x0000000000000260 0x0000000000000268 RW 200000
DYNAMIC 0x0000000000000988 0x0000000000600988 0x0000000000600988
0x0000000000000200 0x0000000000000200 RW 8
NOTE 0x0000000000000288 0x0000000000400288 0x0000000000400288
0x0000000000000020 0x0000000000000020 R 4
GNU_EH_FRAME 0x0000000000000840 0x0000000000400840 0x0000000000400840
0x0000000000000034 0x0000000000000034 R 4
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 8
PAX_FLAGS 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 8

Furthermore, if we inspect the dynamic section of the binary, we will see that an RPATH attribute has been hardwired into it providing a collection of library paths (including the path to zlib):

$ readelf -d ./result/bin/test 

Dynamic section at offset 0x988 contains 27 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libz.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000f (RPATH) Library rpath: [
/nix/store/8w39iz6sp...-test/lib64:
/nix/store/8w39iz6sp...-test/lib:
/nix/store/i9nn1fkcy...-gcc-4.9.3/libexec/gcc/x86_64-unknown-linux-gnu/4.9.3:
/nix/store/31w31mc8i...-zlib-1.2.8/lib:
/nix/store/hd6km3hsc...-glibc-2.21/lib:
/nix/store/i9nn1fkcy...-gcc-4.9.3/lib]
0x000000000000001d (RUNPATH) Library runpath: [
/nix/store/8w39iz6sp...-test/lib64:
/nix/store/8w39iz6sp...-test/lib:
/nix/store/i9nn1fkcy...-gcc-4.9.3/libexec/gcc/x86_64-unknown-linux-gnu/4.9.3:
/nix/store/31w31mc8i...-zlib-1.2.8/lib:
/nix/store/hd6km3hsc...-glibc-2.21/lib:
/nix/store/i9nn1fkcy...-gcc-4.9.3/lib]
0x000000000000000c (INIT) 0x400620
0x000000000000000d (FINI) 0x400814
0x0000000000000019 (INIT_ARRAY) 0x600970
0x000000000000001b (INIT_ARRAYSZ) 8 (bytes)
0x000000000000001a (FINI_ARRAY) 0x600978
0x000000000000001c (FINI_ARRAYSZ) 8 (bytes)
0x0000000000000004 (HASH) 0x4002a8
0x0000000000000005 (STRTAB) 0x400380
0x0000000000000006 (SYMTAB) 0x4002d8
0x000000000000000a (STRSZ) 528 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000015 (DEBUG) 0x0
0x0000000000000003 (PLTGOT) 0x600b90
0x0000000000000002 (PLTRELSZ) 72 (bytes)
0x0000000000000014 (PLTREL) RELA
0x0000000000000017 (JMPREL) 0x4005d8
0x0000000000000007 (RELA) 0x4005c0
0x0000000000000008 (RELASZ) 24 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000006ffffffe (VERNEED) 0x4005a0
0x000000006fffffff (VERNEEDNUM) 1
0x000000006ffffff0 (VERSYM) 0x400590
0x0000000000000000 (NULL) 0x0

As a result, the program works as expected:

$ ./result/bin/test 
zlib version is: 1.2.8

Patching existing ELF binaries


To summarize, the reason why ELF binaries produced in a Nix build environment work is because they refer to the correct path of the dynamic linker and have an RPATH value that refers to the paths of the shared libraries that it needs.

Fortunately, we can accomplish the same thing with prebuilt binaries by using the PatchELF tool. With PatchELF we can patch existing ELF binaries to have a different dynamic linker and RPATH.

Running the following instruction in a Nix expression allows us to change the dynamic linker of the pngout executable shown earlier:

$ patchelf --set-interpreter \
${stdenv.glibc}/lib/ld-linux-x86-64.so.2 $out/bin/pngout

By inspecting the dynamic section of a binary, we can find out what shared libraries it requires:

$ readelf -d ./result/bin/pngout

Dynamic section at offset 0x15968 contains 22 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000c (INIT) 0x400ea8
0x000000000000000d (FINI) 0x413a78
0x0000000000000004 (HASH) 0x400260
0x000000006ffffef5 (GNU_HASH) 0x4003b8
0x0000000000000005 (STRTAB) 0x400850
0x0000000000000006 (SYMTAB) 0x4003e8
0x000000000000000a (STRSZ) 379 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000015 (DEBUG) 0x0
0x0000000000000003 (PLTGOT) 0x615b20
0x0000000000000002 (PLTRELSZ) 984 (bytes)
0x0000000000000014 (PLTREL) RELA
0x0000000000000017 (JMPREL) 0x400ad0
0x0000000000000007 (RELA) 0x400a70
0x0000000000000008 (RELASZ) 96 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000006ffffffe (VERNEED) 0x400a30
0x000000006fffffff (VERNEEDNUM) 2
0x000000006ffffff0 (VERSYM) 0x4009cc
0x0000000000000000 (NULL) 0x0

According to the information listed above, two libraries are required (libm.so.6 and libc.so.6) which can be provided by the glibc package. We can change the executable's RPATH in the Nix expression as follows:

$ patchelf --set-rpath ${stdenv.glibc}/lib $out/bin/pngout

We can write a revised Nix expression for pngout (taking patching into account) that looks as follows:

with import <nixpkgs> {};

stdenv.mkDerivation {
  name = "pngout-20130221";

  src = fetchurl {
    url = http://static.jonof.id.au/dl/kenutils/pngout-20130221-linux.tar.gz;
    sha256 = "1qdzmgx7si9zr7wjdj8fgf5dqmmqw4zg19ypg0pdz7521ns5xbvi";
  };

  installPhase = ''
    mkdir -p $out/bin
    cp x86_64/pngout $out/bin
    patchelf --set-interpreter \
      ${stdenv.glibc}/lib/ld-linux-x86-64.so.2 $out/bin/pngout
    patchelf --set-rpath ${stdenv.glibc}/lib $out/bin/pngout
  '';
}

When we build the expression:

$ nix-build pngout.nix

and try to run the executable:

$ ./result/bin/pngout 
PNGOUT [In:{PNG,JPG,GIF,TGA,PCX,BMP}] (Out:PNG) (options...)
by Ken Silverman (http://advsys.net/ken)
Linux port by Jonathon Fowler (http://www.jonof.id.au/pngout)

We will see that the executable works as expected!
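
As an additional check, PatchELF can also print the interpreter and RPATH that have been set on the resulting binary (assuming the ./result symlink produced by nix-build):


$ patchelf --print-interpreter ./result/bin/pngout
$ patchelf --print-rpath ./result/bin/pngout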

A more complex example: Quake 4 demo


The pngout example shown earlier is quite simple, as it is only a tarball with a single executable that must be installed and patched. Now that we are familiar with some basic concepts -- how should we approach a more complex prebuilt package, such as a computer game like the Quake 4 demo?

When we download the Quake 4 demo installer for Linux, we actually get a Loki setup tools based installer that is a self-extracting shell script executing an installer program.

Unfortunately, we cannot use this installer program in NixOS for two reasons. First, the installer executes (prebuilt) executables that will not work. Second, to use the full potential of NixOS, it is better to deploy packages with Nix in isolation in the Nix store.

Fortunately, running the installer with the --help parameter reveals that it is also possible to extract its contents without running the installer:

$ bash ./quake4-linux-1.0-demo.x86.run --noexec --keep

After executing the above command-line instruction, we can find the extracted files in the ./quake4-linux-1.0-demo folder in the current working directory.

The next step is figuring out where the game files reside and which binaries need to be patched. A rough inspection of the extracted folder:

$ cd quake4-linux-1.0-demo
$ ls
bin
Docs
License.txt
openurl.sh
q4base
q4icon.bmp
README
setup.data
setup.sh
version.info

reveals to me that the installer files (./setup.data) and the game files are intermixed with each other. Some files are required to run the game, but others, such as the setup files (e.g. the ones residing in setup.data/), are unnecessary.

Running the following command helps me to figure out which ELF binaries we may have to patch:

$ file $(find . -type f)         
./Docs/QUAKE4_demo_readme.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators
./bin/Linux/x86/libstdc++.so.5: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), dynamically linked, not stripped
./bin/Linux/x86/quake4-demo: POSIX shell script, ASCII text executable
./bin/Linux/x86/quake4.x86: ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.0.30, stripped
./bin/Linux/x86/quake4-demoded: POSIX shell script, ASCII text executable
./bin/Linux/x86/libgcc_s.so.1: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), dynamically linked, not stripped
./bin/Linux/x86/q4ded.x86: ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.0.30, stripped
./README: ASCII text
./version.info: ASCII text
./q4base/game100.pk4: Zip archive data, at least v2.0 to extract
./q4base/mapcycle.scriptcfg: ASCII text, with CRLF line terminators
./q4base/game000.pk4: Zip archive data, at least v1.0 to extract
./License.txt: ISO-8859 text, with very long lines
./openurl.sh: POSIX shell script, ASCII text executable
./q4icon.bmp: PC bitmap, Windows 3.x format, 48 x 48 x 24
...

As we can see in the output, the ./bin/Linux/x86 sub folder contains a number of ELF executables and shared libraries that most likely require patching.

As with the previous example (pngout), we can use readelf to inspect what libraries the ELF executables require. The first executable q4ded.x86 has the following dynamic section:

$ cd ./bin/Linux/x86
$ readelf -d q4ded.x86

Dynamic section at offset 0x366220 contains 25 entries:
Tag Type Name/Value
0x00000001 (NEEDED) Shared library: [libpthread.so.0]
0x00000001 (NEEDED) Shared library: [libdl.so.2]
0x00000001 (NEEDED) Shared library: [libstdc++.so.5]
0x00000001 (NEEDED) Shared library: [libm.so.6]
0x00000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x00000001 (NEEDED) Shared library: [libc.so.6]
...

According to the above information, the executable requires a couple of libraries that seem to be stored in the same package (in the same folder to be precise): libstdc++.so.5 and libgcc_s.so.1.

Furthermore, it also requires a number of libraries that are not in the same folder. These missing libraries must be provided by external packages. I know from experience that the remaining libraries: libpthread.so.0, libdl.so.2, libm.so.6, and libc.so.6, are provided by the glibc package.

The other ELF executable has the following library references:

$ readelf -d ./quake4.x86 

Dynamic section at offset 0x3779ec contains 29 entries:
Tag Type Name/Value
0x00000001 (NEEDED) Shared library: [libSDL-1.2.so.0]
0x00000001 (NEEDED) Shared library: [libpthread.so.0]
0x00000001 (NEEDED) Shared library: [libdl.so.2]
0x00000001 (NEEDED) Shared library: [libstdc++.so.5]
0x00000001 (NEEDED) Shared library: [libm.so.6]
0x00000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x00000001 (NEEDED) Shared library: [libc.so.6]
0x00000001 (NEEDED) Shared library: [libX11.so.6]
0x00000001 (NEEDED) Shared library: [libXext.so.6]
...

This executable has a number of dependencies that are identical to those of the previous executable. Additionally, it requires libSDL-1.2.so.0, which can be provided by SDL, libX11.so.6 by libX11, and libXext.so.6 by libXext.

Besides the executables, the shared libraries bundled with the package may also have dependencies on shared libraries. We need to inspect and fix these as well.

Inspecting the dynamic section of libgcc_s.so.1 reveals the following:

$ readelf -d ./libgcc_s.so.1 

Dynamic section at offset 0x7190 contains 23 entries:
Tag Type Name/Value
0x00000001 (NEEDED) Shared library: [libc.so.6]
...

The above library has a dependency on libc.so.6, which can be provided by glibc.

The remaining library (libstdc++.so.5) has the following dependencies:

$ readelf -d ./libstdc++.so.5 

Dynamic section at offset 0xadd8c contains 25 entries:
Tag Type Name/Value
0x00000001 (NEEDED) Shared library: [libm.so.6]
0x00000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x00000001 (NEEDED) Shared library: [libc.so.6]
...

It seems to depend on libgcc_s.so.1 residing in the same folder. Similar to the previous binaries, libm.so.6 and libc.so.6 can be provided by glibc.

With the gathered information so far, we can write the following Nix expression that we can use as a first attempt to run the game:

with import <nixpkgs> { system = "i686-linux"; };

stdenv.mkDerivation {
  name = "quake4-demo-1.0";
  src = fetchurl {
    url = ftp://ftp.idsoftware.com/idstuff/quake4/demo/quake4-linux-1.0-demo.x86.run;
    sha256 = "0wxw2iw84x92qxjbl2kp5rn52p6k8kr67p4qrimlkl9dna69xrk9";
  };
  buildCommand = ''
    # Extract files from the installer
    cp $src quake4-linux-1.0-demo.x86.run
    bash ./quake4-linux-1.0-demo.x86.run --noexec --keep

    # Move extracted files into the Nix store
    mkdir -p $out/libexec
    mv quake4-linux-1.0-demo $out/libexec
    cd $out/libexec/quake4-linux-1.0-demo

    # Remove obsolete setup files
    rm -rf setup.data

    # Patch ELF binaries
    cd bin/Linux/x86
    patchelf --set-interpreter ${stdenv.cc.libc}/lib/ld-linux.so.2 ./quake4.x86
    patchelf --set-rpath $(pwd):${stdenv.cc.libc}/lib:${SDL}/lib:${xlibs.libX11}/lib:${xlibs.libXext}/lib ./quake4.x86
    chmod +x ./quake4.x86

    patchelf --set-interpreter ${stdenv.cc.libc}/lib/ld-linux.so.2 ./q4ded.x86
    patchelf --set-rpath $(pwd):${stdenv.cc.libc}/lib ./q4ded.x86
    chmod +x ./q4ded.x86

    patchelf --set-rpath ${stdenv.cc.libc}/lib ./libgcc_s.so.1
    patchelf --set-rpath $(pwd):${stdenv.cc.libc}/lib ./libstdc++.so.5
  '';
}

In the above Nix expression, we do the following:

  • We import the Nixpkgs collection so that we can provide the external dependencies that the package needs. Because the executables are 32-bit x86 binaries, we need to refer to packages built for the i686-linux architecture.
  • We download the Quake 4 demo installer from Id software's FTP server.
  • We automate the steps we have done earlier -- we extract the files from the installer, move them into Nix store, prune the obsolete setup files, and finally we patch the ELF executables and libraries with the paths to the dependencies that we have discovered in our investigation.

We should now be able to build the package:

$ nix-build quake4demo.nix

and investigate whether the executables can be started:

./result/libexec/quake4-linux-1.0-demo/bin/Linux/x86/quake4.x86

Unfortunately, it does not seem to work:


...
no 'q4base' directory in executable path /nix/store/0kfgsjryycsk5kfv97phj8ypv66n6caz-quake4-demo-1.0/libexec/quake4-linux-1.0-demo/bin/Linux/x86, skipping
no 'q4base' directory in current durectory /home/sander/quake4, skipping

According to the output, it cannot find the q4base/ folder. Running the same command with strace reveals why:

$ strace -f ./result/libexec/quake4-linux-1.0-demo/bin/Linux/x86/quake4.x86
...
stat64("/nix/store/0kfgsjryycsk5kfv97phj8ypv66n6caz-quake4-demo-1.0/libexec/quake4-linux-1.0-demo/bin/Linux/x86/q4base", 0xffd7b230) = -1 ENOENT (No such file or directory)
write(1, "no 'q4base' directory in executa"..., 155no 'q4base' directory in executable path /nix/store/0kfgsjryycsk5kfv97phj8ypv66n6caz-quake4-demo-1.0/libexec/quake4-linux-1.0-demo/bin/Linux/x86, skipping
) = 155
...

It seems that the program searches relative to the current working directory. The missing q4base/ folder apparently resides in the base directory of the extracted folder.

By changing the current working directory and invoking the executable again, the q4base/ directory can be found:

$ cd result/libexec/quake4-linux-1.0-demo
$ ./bin/Linux/x86/quake4.x86
...
--------------- R_InitOpenGL ----------------
Initializing SDL subsystem
Loading GL driver 'libGL.so.1' through SDL
libGL error: unable to load driver: i965_dri.so
libGL error: driver pointer missing
libGL error: failed to load driver: i965
libGL error: unable to load driver: swrast_dri.so
libGL error: failed to load driver: swrast
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 154 (GLX)
Minor opcode of failed request: 3 (X_GLXCreateContext)
Value in failed request: 0x0
Serial number of failed request: 33
Current serial number in output stream: 34

Despite fixing the previous problem, we have run into another one! Apparently the OpenGL driver cannot be loaded. Running the same command again with the following environment variable set (source):

$ export LIBGL_DEBUG=verbose

shows us what is causing it:

--------------- R_InitOpenGL ----------------
Initializing SDL subsystem
Loading GL driver 'libGL.so.1' through SDL
libGL: OpenDriver: trying /run/opengl-driver-32/lib/dri/tls/i965_dri.so
libGL: OpenDriver: trying /run/opengl-driver-32/lib/dri/i965_dri.so
libGL: dlopen /run/opengl-driver-32/lib/dri/i965_dri.so failed (/nix/store/0kfgsjryycsk5kfv97phj8ypv66n6caz-quake4-demo-1.0/libexec/quake4-linux-1.0-demo/bin/Linux/x86/libgcc_s.so.1: version `GCC_3.4' not found (required by /run/opengl-driver-32/lib/dri/i965_dri.so))
libGL error: unable to load driver: i965_dri.so
libGL error: driver pointer missing
libGL error: failed to load driver: i965
libGL: OpenDriver: trying /run/opengl-driver-32/lib/dri/tls/swrast_dri.so
libGL: OpenDriver: trying /run/opengl-driver-32/lib/dri/swrast_dri.so
libGL: dlopen /run/opengl-driver-32/lib/dri/swrast_dri.so failed (/nix/store/0kfgsjryycsk5kfv97phj8ypv66n6caz-quake4-demo-1.0/libexec/quake4-linux-1.0-demo/bin/Linux/x86/libgcc_s.so.1: version `GCC_3.4' not found (required by /run/opengl-driver-32/lib/dri/swrast_dri.so))
libGL error: unable to load driver: swrast_dri.so
libGL error: failed to load driver: swrast
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 154 (GLX)
Minor opcode of failed request: 3 (X_GLXCreateContext)
Value in failed request: 0x0
Serial number of failed request: 33
Current serial number in output stream: 34

Apparently, the libgcc_s.so.1 library bundled with the game conflicts with Mesa3D. According to this GitHub issue, replacing the conflicting version with the host system's GCC version fixes it.

In our situation, we can accomplish this by appending the path to the host system's GCC library folder to the RPATH of the binaries referring to it and by removing the conflicting library from the package.

Moreover, we can address the annoying issue with the missing q4base/ folder by creating wrapper scripts that change the current working folder and invoke the executable.

The revised expression taking these aspects into account looks as follows:

with import <nixpkgs> { system = "i686-linux"; };

stdenv.mkDerivation {
name = "quake4-demo-1.0";
src = fetchurl {
url = ftp://ftp.idsoftware.com/idstuff/quake4/demo/quake4-linux-1.0-demo.x86.run;
sha256 = "0wxw2iw84x92qxjbl2kp5rn52p6k8kr67p4qrimlkl9dna69xrk9";
};
buildCommand = ''
# Extract files from the installer
cp $src quake4-linux-1.0-demo.x86.run
bash ./quake4-linux-1.0-demo.x86.run --noexec --keep

# Move extracted files into the Nix store
mkdir -p $out/libexec
mv quake4-linux-1.0-demo $out/libexec
cd $out/libexec/quake4-linux-1.0-demo

# Remove obsolete setup files
rm -rf setup.data

# Patch ELF binaries
cd bin/Linux/x86
patchelf --set-interpreter ${stdenv.cc.libc}/lib/ld-linux.so.2 ./quake4.x86
patchelf --set-rpath $(pwd):${stdenv.cc.cc}/lib:${stdenv.cc.libc}/lib:${SDL}/lib:${xlibs.libX11}/lib:${xlibs.libXext}/lib ./quake4.x86
chmod +x ./quake4.x86

patchelf --set-interpreter ${stdenv.cc.libc}/lib/ld-linux.so.2 ./q4ded.x86
patchelf --set-rpath $(pwd):${stdenv.cc.cc}/lib:${stdenv.cc.libc}/lib ./q4ded.x86
chmod +x ./q4ded.x86

patchelf --set-rpath $(pwd):${stdenv.cc.libc}/lib ./libstdc++.so.5

# Remove libgcc_s.so.1 that conflicts with Mesa3D's libGL.so
rm ./libgcc_s.so.1

# Create wrappers for the executables
mkdir -p $out/bin
cat > $out/bin/q4ded <<EOF
#! ${stdenv.shell} -e
cd $out/libexec/quake4-linux-1.0-demo
./bin/Linux/x86/q4ded.x86 "\$@"
EOF
chmod +x $out/bin/q4ded

cat > $out/bin/quake4 <<EOF
#! ${stdenv.shell} -e
cd $out/libexec/quake4-linux-1.0-demo
./bin/Linux/x86/quake4.x86 "\$@"
EOF
chmod +x $out/bin/quake4
'';
}

We can install the revised package in our Nix profile as follows:

$ nix-env -f quake4demo.nix -i quake4-demo

and conveniently run it from the command-line:

$ quake4


Happy playing!

(As a sidenote: besides creating a wrapper script, it is also possible to create a Freedesktop compliant .desktop entry file, so that it can be launched from the KDE/GNOME applications menu, but I leave this an open exercise to the reader!)
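
As a starting point for that exercise, a minimal sketch of such an entry -- with a hypothetical file name and category, and not part of the expression shown above -- could be appended to the buildCommand so that it points at the wrapper script:

# Generate a desktop entry that launches the quake4 wrapper script
mkdir -p $out/share/applications
cat > $out/share/applications/quake4-demo.desktop <<EOF
[Desktop Entry]
Type=Application
Name=Quake 4 Demo
Exec=$out/bin/quake4
Categories=Game;
EOF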

Conclusion


In this blog post, I have explained that prebuilt binaries do not work out of the box in NixOS. The main reason is that they cannot find their dependencies in their "usual locations", because these do not exist in NixOS. As a solution, it is possible to patch binaries with a tool called PatchELF to provide them the correct location to the dynamic linker and the paths to the libraries they need.

Moreover, I have shown two example packaging approaches (a simple and complex one) that should be relatively easy to repeat as an exercise.

Although source deployments typically work out of the box with few or no modifications, getting prebuilt binaries to work is often a journey that requires patching, wrapping, and experimentation. In this blog post I have described a few tricks that can be applied to make prebuilt packages work.

The approach described in this blog post is not the only solution to get prebuilt binaries to work in NixOS. An alternative approach is composing FHS-compatible chroot environments from Nix packages. This solution simulates an environment in which dependencies can be found in their common FHS locations. As a result, we do not require any modifications to a binary.

Although FHS chroot environments are conceptually nice, I would still prefer the patching approach described in this blog post unless there is no other way to make a package work properly -- it has less overhead, does not require any special privileges (e.g. super user rights), allows us to use the distribution mechanisms of Nix to their full extent, and allows a package to be installed by an unprivileged user.

Steam is a notable exception for which FHS-compatible chroot environments are used, because it is a deployment tool itself that conflicts with Nix's deployment properties.

As a final practical note: if you want to repeat the Quake 4 demo packaging process, please check the following:

  • To enable hardware accelerated OpenGL for 32-bit applications in a 64-bit NixOS, add the following property to /etc/nixos/configuration.nix:


    hardware.opengl.driSupport32Bit = true;
  • Id Software's FTP server seems to be quite slow to download from. You can also obtain the demo from a different download site (e.g. Fileplanet) and run the following command to get it imported into the Nix store:


    $ nix-prefetch-url file:///home/sander/quake4-linux-1.0-demo.x86.run

Deploying services to a heterogeneous network of machines with Disnix


Last week I was in Berlin to visit the first official Nix conference: NixCon 2015. Besides just being there, I have also given a talk about deploying (micro)services with Disnix.

In my talk, I have elaborated about various kinds of aspects, such as microservices in general, their implications (such as increasing operational complexity), the concepts of Disnix, and a number of examples including a real-life usage scenario.

I have given two live demos in the talk. The first demo is IMHO quite interesting, because it shows the full potential of Disnix when you have to deal with many heterogeneous traits of service-oriented systems and their environments -- we deploy services to a network of machines running multiple kinds of operating systems, having multiple kinds of CPU architectures and they must be reached by using multiple connection protocols (e.g. SSH and SOAP/HTTP).

Furthermore, I consider it a nice example that should be relatively straight forward to repeat by others. The good parts of the example are that it is small (only two services that communicate through a TCP socket), and it has no specific requirements on the target systems, such as infrastructure components (e.g. a DBMS or application server) that must be preinstalled first.

In this blog post, I will describe what I did to set up the machines and I will explain how to repeat the example deployment scenarios shown in the presentation.

Configuring the target machines


Despite being a simple example, the thing that makes repeating the demo hard is that Disnix expects the target machines to already be present, running the Nix package manager and the Disnix service that is responsible for executing deployment steps remotely.

For the demo, I have manually pre-instantiated these VirtualBox VMs. Moreover, I have installed their configurations manually as well, which took me quite a bit of effort.

Instantiating the VMs


For instantiation of the VirtualBox VMs, most of the standard settings were sufficient -- I simply provided the operating system type and CPU architecture to VirtualBox and used the recommended disk and RAM settings that VirtualBox provided me.

The only modification I have made to the VM configurations is adding an additional network interface. The first network interface is used to connect to the host machine and the internet (with the host machine being the gateway). The second interface is used to allow the host machine to connect to any VM belonging to the same private subnet.

To configure the second network interface, I right click on the corresponding VM, pick the 'Network' option and open the 'Adapter 2' tab. In this tab, I enable the adapter and attach it to the host-only network so that the host machine can reach the VM on the private subnet.


Installing the operating systems


For the Kubuntu and Windows 7 machine, I have just followed their standard installation procedures. For the NixOS machine, I have used the following NixOS configuration file:


{ pkgs, ... }:

{
boot.loader.grub.device = "/dev/sda";
fileSystems = {
"/" = { label = "root"; };
};
networking.firewall.enable = false;
services.openssh.enable = true;
services.tomcat.enable = true;
services.disnix.enable = true;
services.disnix.useWebServiceInterface = true;

environment.systemPackages = [ pkgs.mc ];
}

The above configuration file captures a machine configuration providing OpenSSH, Apache Tomcat (for hosting the web service interface) and the Disnix service with the web service interface enabled.

Configuring SSH


The Kubuntu and Windows 7 machines require the OpenSSH server to be running to allow deployment operations to be executed from a remote location.

I ran the following command-line instruction to enable the OpenSSH server on Kubuntu:


$ sudo apt-get install openssh-server

I ran the following command on Cygwin to configure the OpenSSH server:


$ ssh-host-config

One of the things the above script does is setting up a Windows service that runs the SSH daemon. It can be started by opening the 'Control Panel -> System and Security -> Administrative Tools -> Services', right clicking on 'CYGWIN sshd' and then selecting 'Start'.

Setting up user accounts


We need to set up specialized user accounts to allow the coordinator machine to connect to the target machines. By default, the coordinator machine connects as the same user that carries out the deployment process. I have configured all three VMs to have a user account named: 'sander'.

To prevent the SSH client from asking for a password for each request, we must set up a pair of public-private SSH keys. This can be done by running:


$ ssh-keygen

After generating the keys, we must upload the public key (~/.ssh/id_rsa.pub) to all the target machines in the network and configure them so that the key can be used. Basically, we need to modify their authorized_keys configuration files and set the correct file permissions:


$ mkdir -p ~/.ssh
$ chmod 700 ~/.ssh
$ cat id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
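
Assuming the ssh-copy-id utility that ships with OpenSSH is available on the coordinator machine, the upload and permission steps above can presumably also be combined into a single command (the address shown is just an example):

$ ssh-copy-id sander@192.168.56.101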

Installing Nix, Dysnomia and Disnix


The next step is installing the required deployment tools on the host machines. For the NixOS machine all required tools have been installed as part of the system configuration, so no additional installation steps are required. For the other machines we must manually install Nix, Dysnomia and Disnix.

On the Kubuntu machine, I first did a single user installation of the Nix package manager under my own user account:


$ curl https://nixos.org/nix/install | sh

After installing Nix, I deploy Dysnomia from Nixpkgs. The following command-line instruction configures Dysnomia to use the direct activation mechanism for processes:


$ nix-env -i $(nix-build -E 'with import <nixpkgs> {}; dysnomia.override { jobTemplate = "direct"; }')

Installing Disnix can be done as follows:


$ nix-env -f '<nixpkgs>' -iA disnix

We must run a few additional steps to get the Disnix service running. The following command copies the Disnix DBus configuration file, allowing the service to run on the system bus and granting permissions to the appropriate class of users:


$ sudo cp /nix/var/nix/profiles/default/etc/dbus-1/system.d/disnix.conf \
/etc/dbus-1/system.d

Then I manually edit /etc/dbus-1/system.d/disnix.conf and change the line:

<policy user="root">

into:


<policy user="sander">

to allow the Disnix service to run under my own personal user account (that has a single user Nix installation).

We also need an init.d script that starts the server on startup. The Disnix distribution has a Debian-compatible init.d script included that can be installed as follows:


$ sudo cp /nix/var/nix/profiles/default/share/doc/disnix/disnix-service.initd /etc/init.d/disnix-service
$ sudo ln -s ../init.d/disnix-service /etc/rc2.d/S06disnix-service
$ sudo ln -s ../init.d/disnix-service /etc/rc3.d/S06disnix-service
$ sudo ln -s ../init.d/disnix-service /etc/rc4.d/S06disnix-service
$ sudo ln -s ../init.d/disnix-service /etc/rc5.d/S06disnix-service
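
On Debian-based distributions such as Kubuntu, the runlevel symlinks can presumably also be generated with the update-rc.d utility instead of creating them by hand:

$ sudo update-rc.d disnix-service defaults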

The script has been configured to run the service under my user account, because it contains the following line:


DAEMONUSER=sander

The username should correspond to the user under which the Nix package manager has been installed.

After executing the previous steps, the DBus daemon needs to be restarted so that it can use the Disnix configuration. Since DBus is a critical system service, it is probably more convenient to just reboot the entire machine. After rebooting, the Disnix service should be activated on startup.

Installing the same packages on the Windows/Cygwin machine is much more tricky -- there is no installer provided for the Nix package manager on Cygwin, so we need to compile it from source. I installed the following Cygwin packages to make source installations of all required packages possible:


curl
patch
perl
libbz2-devel
sqlite3
make
gcc-g++
pkg-config
libsqlite3-devel
libcurl-devel
openssl-devel
libcrypt-devel
libdbus1-devel
libdbus-glib-1-devel
libxml2-devel
libxslt-devel
dbus
openssh

Besides the above Cygwin packages, we also need to install a number of Perl packages from CPAN. I opened a Cygwin terminal in administrator mode (right click, run as: Administrator) and ran the following commands:


$ perl -MCPAN -e shell
install DBD::SQLite
install WWW::Curl

Then I installed the Nix package manager by obtaining the source tarball and running:


tar xfv nix-1.10.tar.xz
cd nix-1.10
./configure
make
make install

I installed Dysnomia by obtaining the source tarball and running:


tar xfv dysnomia-0.5pre1234.tar.gz
cd dysnomia-0.5pre1234
./configure --with-job-template=direct
make
make install

And Disnix by running:


tar xfv disnix-0.5pre1234.tar.gz
cd disnix-0.5pre1234
./configure
make
make install

As with the Kubuntu machine, we must provide a service configuration file for DBus allowing the Disnix service to run on the system bus:

$ cp /nix/var/nix/profiles/default/etc/dbus-1/system.d/disnix.conf \
/etc/dbus-1/system.d

Also, I have to manually edit /etc/dbus-1/system.d/disnix.conf and change the line:

<policy user="root">

into:


<policy user="sander">

to allow operations to be executed under my own less privileged user account.

To run the Disnix service, we must define two Windows services. The following command-line instruction creates a Windows service for DBus:


$ cygrunsrv -I dbus -p /usr/bin/dbus-daemon.exe \
-a '--system --nofork'

The following command-line instruction creates a Disnix service running under my own user account:


$ cygrunsrv -I disnix -p /usr/local/bin/disnix-service.exe \
-e 'PATH=/bin:/usr/bin:/usr/local/bin' \
-y dbus -u sander

In order to make the Windows service work, the user account requires the right to log on as a service. To check whether this right has been granted, we can run:


$ editrights -u sander -l

which should list SeServiceLogonRight. If this is not the case, this permission can be granted by running:


$ editrights -u sander -a SeServiceLogonRight

Finally, we must start the Disnix service. This can be done by opening the services configuration screen (Control Panel -> System and Security -> Administrative Tools -> Services), right clicking on: 'disnix' and selecting: 'Start'.
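
Alternatively, the services can presumably also be started from a Cygwin terminal with cygrunsrv itself:

$ cygrunsrv -S dbus
$ cygrunsrv -S disnix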

Deploying the example scenarios


After deploying the virtual machines and their configurations, we can start doing some deployment experiments with the Disnix TCP proxy example. The Disnix deployment models can be found in the deployment/DistributedDeployment sub folder:


$ cd deployment/DistributedDeployment

Before we can do any deployment, we must write an infrastructure model (infrastructure.nix) reflecting the machines' configuration properties that we have deployed previously:


{
test1 = { # x86 Linux machine (Kubuntu) reachable with SSH
hostname = "192.168.56.101";
system = "i686-linux";
targetProperty = "hostname";
clientInterface = "disnix-ssh-client";
};

test2 = { # x86-64 Linux machine (NixOS) reachable with SOAP/HTTP
hostname = "192.168.56.102";
system = "x86_64-linux";
targetEPR = http://192.168.56.102:8080/DisnixWebService/services/DisnixWebService;
targetProperty = "targetEPR";
clientInterface = "disnix-soap-client";
};

test3 = { # x86-64 Windows machine (Windows 7) reachable with SSH
hostname = "192.168.56.103";
system = "x86_64-cygwin";
targetProperty = "hostname";
clientInterface = "disnix-ssh-client";
};
}

and write the distribution model to reflect the initial deployment scenario shown in the presentation:


{infrastructure}:

{
hello_world_server = [ infrastructure.test2 ];
hello_world_client = [ infrastructure.test1 ];
}

Now we can deploy the system by running:


$ disnix-env -s services-without-proxy.nix \
-i infrastructure.nix -d distribution.nix

If we open a terminal on the Kubuntu machine, we should be able to run the client:


$ /nix/var/nix/profiles/disnix/default/bin/hello-world-client

When we type: 'hello' the client should respond by saying: 'Hello world!'. The client can be exited by typing: 'quit'.

We can also deploy a second client instance by changing the distribution model:


{infrastructure}:

{
hello_world_server = [ infrastructure.test2 ];
hello_world_client = [ infrastructure.test1 infrastructure.test3 ];
}

and running the same command-line instruction again:


$ disnix-env -s services-without-proxy.nix \
-i infrastructure.nix -d distribution.nix

After the redeployment has been completed, we should be able to start the client that connects to the same server instance on the second test machine (the NixOS machine).

Another thing we could do is moving the server to the Windows machine:


{infrastructure}:

{
hello_world_server = [ infrastructure.test3 ];
hello_world_client = [ infrastructure.test1 infrastructure.test3 ];
}

However, running the following command:


$ disnix-env -s services-without-proxy.nix \
-i infrastructure.nix -d distribution.nix

probably leads to a build error, because the host machine (that runs Linux) is unable to build packages for Cygwin. Fortunately, this problem can be solved by enabling building on the target machines:


$ disnix-env -s services-without-proxy.nix \
-i infrastructure.nix -d distribution.nix \
--build-on-targets

After deploying the new configuration, you will observe that the clients have been disconnected. You can restart any of the clients to observe that they have been reconfigured to connect to the new server instance that has been deployed to the Windows machine.

Discussion


In this blog post, I have described how to set up and repeat the heterogeneous network deployment scenario that I have shown in my presentation. Despite being a simple example, the thing that makes repeating it difficult is that we need to deploy the machines first, a process that is not automated by Disnix. (As a sidenote: with the DisnixOS extension we can automate the deployment of machines as well, but this does not work with a network of non-NixOS machines, such as Windows installations).

Additionally, the fact that there is no installer (or official support) for the Nix deployment tools on other platforms than Linux and Mac OS X makes it even more difficult. (Fortunately, compiling from source on Cygwin should work and there are also some ongoing efforts to revive FreeBSD support).

To alleviate some of these issues, I have improved the Disnix documentation a bit to explain how to work with single user Nix installations on non-NixOS platforms and included the Debian init.d script in the Disnix distribution as an example. These changes have been integrated into the current development version of Disnix.

I am also considering writing a simple infrastructure model generator for static deployment purposes (a more advanced prototype already exists in the Dynamic Disnix toolset) and include it with the basic Disnix toolset to avoid some repetition while deploying target machines manually.

References


I have published the slides of my talk on SlideShare.


Furthermore, the recordings of the NixCon 2015 talks are also online.

On-demand service activation and self termination

I have written quite a few blog posts on service deployment with Disnix this year. The deployment mechanics that Disnix implements work quite well for my own purposes.

Unfortunately, having a relatively good deployment solution does not necessarily mean that a system functions well in a production environment -- there are also many other concerns that must be dealt with.

Another important concern of service-oriented systems is dealing with resource consumption, such as RAM, CPU and disk space. Obviously, services need them to accomplish something. However, since they are typically long running, they also consume resources even if they are not doing any work.

These problems could become quite severe if services have been poorly developed. For example, they may leak memory and never fully release the RAM they have allocated. As a result, an entire machine may eventually run out of memory. Moreover, "idle" services may degrade the performance of other services running on the same machine.

There are various ways to deal with resource problems:

  • The most obvious solution is buying bigger or additional hardware resources, but this typically increases the costs of maintaining a production environment. Moreover, it does not take the source of some of the problems away.
  • Another solution would be to fix and optimize problematic services, but this could be a time consuming and costly process, in particular when there is a high technical debt.
  • A third solution would be to support on-demand service activation and self termination -- a service gets activated the first time it is consulted and terminates itself after a period of idleness.

In this blog post, I will describe how to implement and deploy a system supporting the last solution.

To accomplish this goal, we need to modify the implementations of the services -- we must retrieve the listening socket from the host system's service manager (which activates a service when a client connects) and make the service terminate itself when the moment is right.

Furthermore, we need to adapt a service's deployment procedure to use these facilities.

Retrieving a socket from the host system's service manager


In many conventional deployment scenarios, the services themselves are responsible for creating the sockets to which clients can connect. However, if we want to activate them on demand, this property gets in the way -- the socket must already exist before the process runs, so that the process can be started when a client connects.

We can use a service manager that supports socket activation to accomplish on-demand activation. There are various solutions supporting this property. The most prominently advertised solution is probably systemd, but there are other solutions that can do this as well, such as launchd, inetd, or xinetd, although the protocols that activated processes must implement differ.
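
To illustrate the idea behind systemd's variant of this protocol: a socket unit describes the socket that systemd should listen on, and a service unit with the same name describes the process that gets started on the first incoming connection. A minimal sketch of such a pair -- with hypothetical unit names, port number and executable path -- could look as follows:

# hello.socket -- systemd listens on TCP port 5000 on behalf of the service
[Socket]
ListenStream=5000

[Install]
WantedBy=sockets.target

# hello.service -- started on the first connection; the listening socket is
# handed over to the process as file descriptor 3 (SD_LISTEN_FDS_START)
[Service]
ExecStart=/usr/local/bin/hello-world-server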

In one of my toy example systems used for testing Disnix (the TCP proxy example) I used to do the following:


static int create_server_socket(int source_port)
{
int sockfd, on = 1;
struct sockaddr_in client_addr;

/* Create socket */
sockfd = socket(AF_INET, SOCK_STREAM, 0);
if(sockfd < 0)
{
fprintf(stderr, "Error creating server socket!\n");
return -1;
}

/* Create address struct */
memset(&client_addr, '\0', sizeof(client_addr));
client_addr.sin_family = AF_INET;
client_addr.sin_addr.s_addr = htonl(INADDR_ANY);
client_addr.sin_port = htons(source_port);

/* Set socket options to reuse the address */
setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &on, 4);

/* Bind the name (ip address) to the socket */
if(bind(sockfd, (struct sockaddr *)&client_addr, sizeof(client_addr)) < 0)
fprintf(stderr, "Error binding on port: %d, %s\n", source_port, strerror(errno));

/* Listen for connections on the socket */
if(listen(sockfd, 5) < 0)
fprintf(stderr, "Error listening on port %d\n", source_port);

/* Return the socket file descriptor */
return sockfd;
}

The function listed above is responsible for creating a socket file descriptor, binding the socket to an IP address and TCP port, and listening for incoming connections.

To support on-demand activation, I need to modify this function to retrieve the server socket from the service manager. Systemd's socket activation protocol works by passing the socket as the third file descriptor to the process that it spawns. By adjusting the previously listed code into the following:


static int create_server_socket(int source_port)
{
int sockfd, on = 1;

#ifdef SYSTEMD_SOCKET_ACTIVATION
int n = sd_listen_fds(0);

if(n > 1)
{
fprintf(stderr, "Too many file descriptors received!\n");
return -1;
}
else if(n == 1)
sockfd = SD_LISTEN_FDS_START + 0;
else
{
#endif
struct sockaddr_in client_addr;

/* Create socket */
sockfd = socket(AF_INET, SOCK_STREAM, 0);
if(sockfd < 0)
{
fprintf(stderr, "Error creating server socket!\n");
return -1;
}

/* Create address struct */
memset(&client_addr, '\0', sizeof(client_addr));
client_addr.sin_family = AF_INET;
client_addr.sin_addr.s_addr = htonl(INADDR_ANY);
client_addr.sin_port = htons(source_port);

/* Set socket options to reuse the address */
setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &on, 4);

/* Bind the name (ip address) to the socket */
if(bind(sockfd, (struct sockaddr *)&client_addr, sizeof(client_addr)) < 0)
fprintf(stderr, "Error binding on port: %d, %s\n", source_port, strerror(errno));

/* Listen for connections on the socket */
if(listen(sockfd, 5) < 0)
fprintf(stderr, "Error listening on port %d\n", source_port);

#ifdef SYSTEMD_SOCKET_ACTIVATION
}
#endif

/* Return the socket file descriptor */
return sockfd;
}

the server will use the socket that has been created by systemd (and passed as the third file descriptor). Moreover, if the server is started as a standalone process, it will revert to its old behaviour and allocate the server socket itself.

I have wrapped the systemd specific functionality inside a conditional preprocessor block so that it only gets included when I explicitly ask for it. The downside of supporting systemd's socket activation protocol is that we require some functionality that is exposed by a shared library that has been bundled with systemd. As systemd is Linux (and glibc) specific, it makes no sense to build a service with this functionality enabled on non-systemd based Linux distributions and non-Linux operating systems.

Besides conditionally including the code, I also made linking against the systemd library conditional in the Makefile:


CC = gcc

ifeq ($(SYSTEMD_SOCKET_ACTIVATION),1)
EXTRA_BUILDFLAGS=-DSYSTEMD_SOCKET_ACTIVATION=1 $(shell pkg-config --cflags --libs libsystemd)
endif

all:
$(CC) $(EXTRA_BUILDFLAGS) hello-world-server.c -o hello-world-server

...

so that the systemd-specific code block and library only get included if I run 'make' with socket activation explicitly enabled:


$ make SYSTEMD_SOCKET_ACTIVATION=1

Implementing self termination


As with on-demand activation, there is no way to do self termination generically and we must modify the service to support this property in some way.

In the TCP proxy example, I have implemented a simple approach using a counter (that is initially set to 0):


volatile unsigned int num_of_connections = 0;

For each client that connects to the server, we fork a child process that handles the connection. Each time we fork, we also increase the connection counter in the parent process:


while(TRUE)
{
/* Create client socket if there is an incoming connection */
if((client_sockfd = wait_for_connection(server_sockfd)) >= 0)
{
/* Fork a new process for each incoming client */
pid_t pid = fork();

if(pid == 0)
{
/* Handle the client's request and terminate
* when it disconnects */
}
else if(pid == -1)
fprintf(stderr, "Cannot fork connection handling process!\n");
#ifdef SELF_TERMINATION
else
num_of_connections++;
#endif
}

close(client_sockfd);
client_sockfd = -1;
}

(As with socket activation, I have wrapped the termination functionality in a conditional preprocessor block -- it makes no sense to include this functionality into a service that cannot be activated on demand).

When a client disconnects, the process handling its connection terminates and sends a SIGCHLD signal to the parent. We can configure a signal handler for this type of signal as follows:


#ifdef SELF_TERMINATION
signal(SIGCHLD, sigreap);
#endif

and use the corresponding signal handler function to decrease the counter and wait for the client process to terminate:


#ifdef SELF_TERMINATION

void sigreap(int sig)
{
pid_t pid;
int status;
num_of_connections--;

/* Event handler when a child terminates */
signal(SIGCHLD, sigreap);

/* Wait until all child processes terminate */
while((pid = waitpid(-1, &status, WNOHANG)) > 0);

Finally, the server can terminate itself when the counter has reached 0 (which means that it is not handling any connections and the server has become idle):


if(num_of_connections == 0)
_exit(0);
}
#endif

Deploying services with on demand activation and self termination enabled


Besides implementing socket activation and self termination, we must also deploy the server with these features enabled. When using Disnix as a deployment system, we can write the following service expression to accomplish this:


{stdenv, pkgconfig, systemd}:
{port, enableSystemdSocketActivation ? false}:

let
makeFlags = "PREFIX=$out port=${toString port}${stdenv.lib.optionalString enableSystemdSocketActivation " SYSTEMD_SOCKET_ACTIVATION=1"}";
in
stdenv.mkDerivation {
name = "hello-world-server";
src = ../../../services/hello-world-server;
buildInputs = if enableSystemdSocketActivation then [ pkgconfig systemd ] else [];
buildPhase = "make ${makeFlags}";
installPhase = ''
make ${makeFlags} install

mkdir -p $out/etc
cat > $out/etc/process_config <<EOF
container_process=$out/bin/process
EOF

${stdenv.lib.optionalString enableSystemdSocketActivation ''
mkdir -p $out/etc
cat > $out/etc/socket <<EOF
[Unit]
Description=Hello world server socket

[Socket]
ListenStream=${toString port}
EOF
''}
'';
}

In the expression shown above, we do the following:

  • We make the socket activation and self termination features configurable by exposing them through a function parameter (that defaults to false, disabling them).
  • If the socket activation parameter has been enabled, we pass the SYSTEMD_SOCKET_ACTIVATION=1 flag to 'make' so that these facilities are enabled in the build system.
  • We must also provide two extra dependencies: pkgconfig and systemd to allow the program to find the required library functions to retrieve the socket from systemd.
  • We also compose a systemd socket unit file that configures systemd on the target system to allocate a server socket that activates the process when a client connects to it.

Modifying Dysnomia modules to support socket activation


As explained in an older blog post, Disnix consults a plugin system called Dysnomia that takes care of executing various kinds of deployment activities, such as activating and deactivating services. The reason that a plugin system is used is that services can be any kind of deployment unit with no generic activation procedure.

For services of the 'process' and 'wrapper' type, Dysnomia integrates with the host system's service manager. To support systemd's socket activation feature, we must modify the corresponding Dysnomia modules to start the socket unit instead of the service unit on activation. For example:


$ systemctl start disnix-53bb1pl...-hello-world-server.socket

starts the socket unit, which in turn starts the service unit with the same name when a client connects to it.

To deactivate the service, we must first stop the socket unit and then the service unit:


$ systemctl stop disnix-53bb1pl...-hello-world-server.socket
$ systemctl stop disnix-53bb1pl...-hello-world-server.service

Discussion


In this blog post, I have described an on-demand service activation and self termination approach using systemd, Disnix, and a number of code modifications. Some benefits of this approach are that we can save system resources such as RAM and CPU, improve the performance of non-idle services running on the same machine, and reduce the impact of poorly implemented services that (for example) leak memory.

There are also some disadvantages. For example, connecting to an inactive service introduces latency, in particular when a service has a slow start-up procedure, making this approach less suitable for systems that must remain responsive.

Moreover, it does not cope with potential disk space issues -- a non-running service still consumes disk space for storing its package dependencies and persistent state, such as databases.

Finally, there are some practical notes on the solutions described in the blog post. The self termination procedure in the example program terminates the server immediately after it has discovered that there are no active connections. In practice, it may be better to implement a timeout to prevent unnecessary latencies.
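
A minimal sketch of such a timeout -- not part of the example code, with a hypothetical IDLE_TIMEOUT value -- could use alarm() to deliver a SIGALRM signal after a period of idleness and only terminate if the server is still idle by then:

#include <signal.h>
#include <unistd.h>

#define IDLE_TIMEOUT 60 /* seconds of idleness before self termination */

extern volatile unsigned int num_of_connections;

static void sigalarm(int sig)
{
    /* Only terminate if no connections have been accepted in the meantime */
    if(num_of_connections == 0)
        _exit(0);
}

/* Call once during start up */
static void init_idle_timer(void)
{
    signal(SIGALRM, sigalarm);
    alarm(IDLE_TIMEOUT);
}

/* Call from sigreap() instead of terminating immediately */
static void schedule_idle_termination(void)
{
    if(num_of_connections == 0)
        alarm(IDLE_TIMEOUT);
}

/* Call right after accepting a new client connection */
static void cancel_idle_termination(void)
{
    alarm(0);
}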

Furthermore, I have only experimented with systemd's socket activation features. However, it is also possible to modify the Dysnomia modules to support different kinds of activation protocols, such as the ones provided by launchd, inetd or xinetd.

The TCP proxy example uses C as an implementation language, but systemd's socket activation protocol is not limited to C programs. For instance, an example program on GitHub demonstrates how a Python program running an embedded HTTP server can be activated with systemd's socket activation mechanism.

References


I have modified the development version of Dysnomia to support the socket activation feature of systemd. Moreover, I have extended the TCP proxy example package with a sub example that implements the on-demand activation and self termination approach described in this blog post.

Both packages can be obtained from my GitHub page.

Fifth yearly blog reflection

Today, it's my blog's fifth anniversary. As usual, this is a nice opportunity to reflect over last year's writings.

Disnix


Something that I cannot leave unmentioned is Disnix, a toolset that I have developed as part of my master's and PhD research. For quite some time, its development was progressing at a very low pace, mainly because I had other obligations -- I had to finish my PhD thesis, and after I left academia, I was working on other kinds of things.

Fortunately, things have changed considerably. Since October last year I have been actively using Disnix to maintain the deployment of a production system that can be decomposed into independently deployable services. As a result, the development of Disnix also became much more progressive, which resulted in a large number of Disnix related blog posts and some major improvements.

In the first blog post, I compared Disnix with another tool from the Nix project: NixOps, described their differences and demonstrated that they can be combined to fully automate all deployment aspects of a service-oriented system. Shortly after publishing this blog post, I announced the next Disnix release: Disnix 0.3, 4 years after its previous release.

A few months later, I announced yet another Disnix release: Disnix 0.4 in which I have integrated the majority of state deployment facilities from the prototype described in the HotSWUp 2012 paper.

The remainder of blog posts provide solutions for additional problems and describe some optimizations. I have formulated a port assignment problem which may manifest itself while deploying microservices and developed a tool that can be used to provide a solution. I also modified Disnix to deploy target-specific services (in addition to target-agnostic services), which in some scenarios, make deployments more efficient.

Another optimization that I have developed is on demand activation and self termination of services. This is particularly useful for poorly developed services, that for example, leak memory.

Finally, I have attended NixCon2015 where I gave a talk about Disnix (including two live demos) and shown how it can be used to deploy (micro)services. An interesting aspect of the presentation is the first live demo in which I deploy a simple example system into a network of heterogeneous machines (machines running multiple operating systems, having multiple CPU architectures, reachable by multiple connection protocols).

The Nix project


In addition to Disnix, I have also written about some general Nix aspects. In February, I have visited FOSDEM. In this year's edition, we had a NixOS stand to promote the project (including its sub projects). From my own personal experience, I know that advertising Nix is quite challenging. For this event, I crafted a sales pitch explanation recipe, that worked quite well for me in most cases.

A blog post that I am particularly proud of is my evaluation and comparison of Snappy Ubuntu with Nix/NixOS, in which I describe the deployment properties of Snappy and compare how they conceptually relate to Nix/NixOS. It attracted a huge number of visitors, breaking my old monthly visitors record from three years ago!

I also wrote a tutorial blog post demonstrating how we can deploy prebuilt binaries with the Nix package manager. In some cases, packaging prebuilt software can be quite challenging, and the purpose of this blog post is to show a number of techniques that can be used to accomplish this.

Methodology


Besides deployment, I have also written two methodology related blog posts. In the first blog post, I have described my experiences with Agile software development and Scrum. Something that has been bothering me for quite a while is people claiming that "implementing" such a methodology considerably improves the development and quality of software.

In my opinion this is ridiculous! These methodologies provide some structure, but the "secret" lies in its undefined parts -- to be agile you should accept that nothing will completely go as planned, you should remain focussed, take small steps (not huge leaps), and most importantly: continuously adapt and improve. But no methodology provides a universally applicable recipe that makes you successful in doing it.

However, despite being critical, I think that implementing a methodology is not bad per se. In another blog post, I have described how I implemented a basic software configuration management process in a small organization.

Development


I have also reflected over my experiences while developing command-line utilities and wrote a blog post with some considerations I take into account.

Side projects and research


In my previous reflections, there was always a section dedicated to research and side projects. Unfortunately, this year there is not much to report about -- I made a number of small changes and additions to my side projects, but I did not make any significant advancements.

Probably the fact that Disnix became a main and side project contributes to that. Moreover, I also have other stuff to do that has nothing to do with software development or research. I hope that I can find more time next year to report about my other side projects, but I guess this is basically just a luxury problem. :-)

Blog posts


As with my previous annual blog reflections, I will also publish the top 10 of my most frequently read blog posts:

  1. On Nix and GNU Guix. As with the previous three blog reflections, this blog post remains on top. However, its popularity finally seems to be challenged by the number two!
  2. An evaluation and comparison of Snappy Ubuntu. This is the only blog post I have written this year that ended up in the overall top 10. It attracted a record number of visitors in one month and now rivals the number one in popularity.
  3. An alternative explanation of the Nix package manager. This was last year's number two and dropped to the third place, because of the Snappy Ubuntu blog post.
  4. Setting up a multi-user Nix installation on non-NixOS systems. This blog post was also in last year's top 10 but it seems to have become even more popular. I think this is probably caused by the fact that it is still hard to set up a multi-user installation.
  5. Managing private Nix packages outside the Nixpkgs tree. I wrote this blog for newcomers and observed that people keep frequently consulting it. As a consequence, it has entered the overall top 10.
  6. Asynchronous programming with JavaScript. This blog post was also in last year's top 10 and became slightly more popular. As a result, it moved to the 6th position.
  7. Yet another blog post about Object Oriented Programming and JavaScript. Another JavaScript related blog post that was in last year's top 10. It became slightly more popular and moved to the 7th place.
  8. Composing FHS-compatible chroot environments with Nix (or deploying Steam in NixOS). This blog post was the third most popular last year, but now seems to be not that interesting anymore.
  9. Setting up a Hydra build cluster for continuous integration and testing (part 1). Remains a popular blog post, but also considerably dropped in popularity compared to last year.
  10. Using Nix while doing development. A very popular blog post last year, but considerably dropped in popularity.

Conclusion


I am still not out of ideas yet, so stay tuned! The remaining thing I want to say is:

HAPPY NEW YEAR!!!!!!!!!!!


Integrating callback and promise based function invocation patterns (Asynchronous programming with JavaScript part 4)

It has been quiet for a while on my blog in the programming language domain. Over two years ago, I started writing a series of blog posts about asynchronous programming with JavaScript.

In the first blog post, I explained some general asynchronous programming issues, code structuring issues and briefly demonstrated how the async library can be used to structure code more properly. Later, I have written a blog post about promises, another abstraction mechanism dealing with asynchronous programming complexities. Finally, I have developed my own abstraction functions by investigating how JavaScript's structured programming language constructs (that are synchronous) translate to the asynchronous programming world.

In these blog posts, I have used two kinds of function invocation styles -- something that I call the Node.js-function invocation style, and the promises invocation style. As the name implies, the former is used by the Node.js standard library, as well as many Node.js-based APIs. The latter is getting more common in the browser world. As a matter of fact, many modern browsers provide a Promise prototype as part of their DOM API, allowing others to construct their own Promise-based APIs with it.

In this blog post, I will compare both function invocation styles and describe some of their differences. Additionally, there are situations in which I have to mix APIs using both styles and I have observed that it is quite annoying to combine them. I will show how to alleviate this pain a bit by developing my own generically applicable adapter functions.

Two example invocations


The most frequently used invocation style in my blog posts is something that I call the Node.js-function invocation style. An example code fragment that uses such an invocation is the following:


fs.readFile("hello.txt", function(err, data) {
if(err) {
console.log("Error while opening file: "+err);
} else {
console.log("File contents is: "+data);
}
});

As you may see in the code fragment above, when we invoke the readFile() function, it returns immediately (to be precise: it returns, but it returns no value). We use a callback function (that is typically the last function parameter) to retrieve the results of the invocation (or the error if something went wrong) at a later point in time.

By convention, the first parameter of the callback is an error parameter that is not null if some error occurs. The remaining parameters are optional and can be used to retrieve the corresponding results.

When using promises (more specifically: promises that conform to the Promises/A and Promises/A+ specifications), we use a different invocation pattern that may look as follows:


Task.findAll().then(function(tasks) {
for(var i = 0; i < tasks.length; i++) {
var task = tasks[i];
console.log(task.title + ": "+ task.description);
}
}, function(err) {
console.log("An error occured: "+err);
});

As with the previous example, the findAll() function invocation shown above also returns immediately. However, it also does something different compared to the Node.js-style function invocation -- it returns an object called a promise whereas the invocation in the previous example never returns anything.

By convention, the resulting promise object provides a method called then() in which (according the Promises/A and A+ standards) the first parameter is a callback that gets invoked when the function invocation succeeds and the second callback gets invoked when the function invocation fails. The parameters of the callback functions represent result objects or error objects.

Comparing the invocation styles


At first sight, you may probably notice that despite having different styles, both function invocations return immediately and need an "artificial facility" to retrieve the corresponding results (or errors) at a later point in time, as opposed to directly returning a result in a function.

The major difference is that in the promises invocation style, you will always get a promise as a result of an invocation. A promise provides a reference to something which corresponding result will be delivered in the future. For example, when running:


var tasks = Task.findAll();

I will obtain a promise that, at some point in the future, provides me an array of tasks. I can use this reference to do other things by passing the promise around (for example) as a function argument to other functions.

For example, I may want to construct a UI displaying the list of tasks. I can already construct pieces of it without waiting for the full list of tasks to be retrieved:


displayTasks(tasks);

The above function could, for example, already start rendering a header, some table cells and buttons without the results being available yet. The display function invokes the then() function when it really needs the data.

By contrast, in the Node.js-callback style, I have no reference to the pending invocation at all. This means that I always have to wait for its completion before I can render anything UI related. Because we are forced to wait for its completion, it will probably make the application quite unresponsive, in particular when we have to retrieve many task records.

So in general, in addition to better structured code, promises support composability whereas Node.js-style callbacks do not. Because of this reason, I consider promises to be more powerful.
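
For instance -- a minimal sketch, assuming an ES6-compatible Promise implementation and a hypothetical Project model next to Task -- two independent queries can be composed with Promise.all() and processed once both have been fulfilled:

// Compose two pending queries; the callback runs when both promises resolve
Promise.all([ Task.findAll(), Project.findAll() ]).then(function(results) {
    var tasks = results[0];
    var projects = results[1];
    console.log("Retrieved "+tasks.length+" tasks and "+projects.length+" projects");
}, function(err) {
    console.log("An error occurred: "+err);
});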

However, there is also something that I consider a disadvantage. In my first blog post, I have shown the following Node.js-function invocation style pyramid code example as a result of nesting callbacks:


var fs = require('fs');
var path = require('path');

fs.mkdir("out", 0755, function(err) {
if(err) throw err;

fs.mkdir(path.join("out, "test"), 0755, function(err) {
if (err) throw err;
var filename = path.join("out", "test", "hello.txt");

fs.writeFile(filename, "Hello world!", function(err) {
if(err) throw err;

fs.readFile(filename, function(err, data) {
if(err) throw err;

if(data == "Hello world!")
process.stderr.write("File is correct!\n");
else
process.stderr.write("File is incorrect!\n");
});
});
});
});

I have also shown in the same blog post, that I can use the async.waterfall() abstraction to flatten its structure:


var fs = require('fs');
var path = require('path');

filename = path.join("out", "test", "hello.txt");

async.waterfall([
function(callback) {
fs.mkdir("out", 0755, callback);
},

function(callback) {
fs.mkdir(path.join("out, "test"), 0755, callback);
},

function(callback) {
fs.writeFile(filename, "Hello world!", callback);
},

function(callback) {
fs.readFile(filename, callback);
},

function(data, callback) {
if(data == "Hello world!")
process.stderr.write("File is correct!\n");
else
process.stderr.write("File is incorrect!\n");
}

], function(err, result) {
if(err) throw err;
});

As you may notice, the code fragment above is much more readable and maintainable.

In my second blog post, I implemented a promises-based variant of the same example:


var fs = require('fs');
var path = require('path');
var Promise = require('rsvp').Promise;

/* Promise object definitions */

var mkdir = function(dirname) {
return new Promise(function(resolve, reject) {
fs.mkdir(dirname, 0755, function(err) {
if(err) reject(err);
else resolve();
});
});
};

var writeHelloTxt = function(filename) {
return new Promise(function(resolve, reject) {
fs.writeFile(filename, "Hello world!", function(err) {
if(err) reject(err);
else resolve();
});
});
};

var readHelloTxt = function(filename) {
return new Promise(function(resolve, reject) {
fs.readFile(filename, function(err, data) {
if(err) reject(err);
else resolve(data);
});
});
};

/* Promise execution chain */

var filename = path.join("out", "test", "hello.txt");

mkdir(path.join("out"))
.then(function() {
return mkdir(path.join("out", "test"));
})
.then(function() {
return writeHelloTxt(filename);
})
.then(function() {
return readHelloTxt(filename);
})
.then(function(data) {
if(data == "Hello world!")
process.stderr.write("File is correct!\n");
else
process.stderr.write("File is incorrect!\n");
}, function(err) {
console.log("An error occured: "+err);
});

As you may notice, because the then() function invocations can be chained, we also have a flat structure, making the code more maintainable. However, the code fragment is also considerably longer than the async library variant and the unstructured variant -- for each asynchronous function invocation, we must construct a promise object, adding quite a bit of overhead to the code.

From my perspective, if you need to do many ad-hoc steps (and not having to compose complex things), callbacks are probably more convenient. For reusable operations, promises are typically a nicer solution.

Mixing function invocations from both styles


It may happen that function invocations from both styles need to be mixed. Typically mixing is imposed by third-party APIs -- for example, when developing a Node.js web application we may want to use express.js (callback based) for implementing a web application interface in combination with sequelize (promises based) for accessing a relational database.

Of course, you could write a function constructing promises that internally only use Node.js-style invocations or the opposite. But if you have to regularly intermix calls, you may end up writing a lot of boilerplate code. For example, if I would use the async.waterfall() abstraction in combination with promise-style function invocations, I may end up writing:


async.waterfall([
function(callback) {
Task.sync().then(function() {
callback();
}, function(err) {
callback(err);
});
},

function(callback) {
Task.create({
title: "Get some coffee",
description: "Get some coffee ASAP"
}).then(function() {
callback();
}, function(err) {
callback(err);
});
},

function(callback) {
Task.create({
title: "Drink coffee",
description: "Because I need caffeine"
}).then(function() {
callback();
}, function(err) {
callback(err);
});
},

function(callback) {
Task.findAll().then(function(tasks) {
callback(null, tasks);
}, function(err) {
callback(err);
});
},

function(tasks, callback) {
for(var i = 0; i < tasks.length; i++) {
var task = tasks[i];
console.log(task.title + ": "+ task.description);
}
}
], function(err) {
if(err) {
console.log("An error occurred: "+err);
process.exit(1);
} else {
process.exit(0);
}
});

For each Promise-based function invocation, I need to invoke the then() function and, in the corresponding callbacks, I must invoke the callback of each function block to propagate the results or the error. This makes the code unnecessarily long, tedious to write and a pain to maintain.

Fortunately, I can create a function that abstracts over this pattern:


function chainCallback(promise, callback) {
promise.then(function() {
var args = Array.prototype.slice.call(arguments, 0);

args.unshift(null);
callback.apply(null, args);
}, function() {
var args = Array.prototype.slice.call(arguments, 0);

if(args.length == 0) {
callback("Promise error");
} else if(args.length == 1) {
callback(args[0]);
} else {
callback(args);
}
});
}

The above code fragment does the following:

  • We define a function that takes a promise and a Node.js-style callback function as parameters and invokes the then() method of the promise.
  • When the promise has been fulfilled, it sets the error parameter of the callback to null (to indicate that there is no error) and propagates all resulting objects as remaining parameters to the callback.
  • When the promise has been rejected, we propagate the resulting error object. Because the Node.js-style callback requires a single defined object, we compose one ourselves if no error object was returned, and we return an array as the error object if multiple error objects were returned.

Using this abstraction function, we can rewrite the earlier pattern as follows:


async.waterfall([
function(callback) {
prom2cb.chainCallback(Task.sync(), callback);
},

function(callback) {
prom2cb.chainCallback(Task.create({
title: "Get some coffee",
description: "Get some coffee ASAP"
}), callback);
},

function(callback) {
prom2cb.chainCallback(Task.create({
title: "Drink coffee",
description: "Because I need caffeine"
}), callback);
},

function(callback) {
prom2cb.chainCallback(Task.findAll(), callback);
},

function(tasks, callback) {
for(var i = 0; i < tasks.length; i++) {
var task = tasks[i];
console.log(task.title + ": "+ task.description);
}
}
], function(err) {
if(err) {
console.log("An error occurred: "+err);
process.exit(1);
} else {
process.exit(0);
}
});

As may be observed, this code fragment is more concise and significantly shorter.

The opposite mixing pattern also leads to issues. For example, we can first retrieve the list of tasks from the database (through a promise-style invocation) and then write it as a JSON file to disk (through a Node.js-style invocation):


Task.findAll().then(function(tasks) {
fs.writeFile("tasks.txt", JSON.stringify(tasks), function(err) {
if(err) {
console.log("error: "+err);
} else {
console.log("everything is OK");
}
});
}, function(err) {
console.log("error: "+err);
});

The biggest annoyance is that we are forced to do the successive step (writing the file) inside the callback function, causing us to write pyramid code that is harder to read and tedious to maintain. This is caused by the fact that we can only "chain" a promise to another promise.

Fortunately, we can create a function abstraction that wraps an adapter around any Node.js-style function taking the same parameters (without the callback) that returns a promise:


function promisify(Promise, fun) {
return function() {
var args = Array.prototype.slice.call(arguments, 0);

return new Promise(function(resolve, reject) {
function callback() {
var args = Array.prototype.slice.call(arguments, 0);
var err = args[0];
args.shift();

if(err) {
reject(err);
} else {
resolve(args);
}
}

args.push(callback);

fun.apply(null, args);
});
};
}

In the above code fragment, we do the following:

  • We define a function that takes two parameters: a Promise prototype that can be used to construct promises and a function representing any Node.js-style function (whose last parameter is a Node.js-style callback).
  • In the function, we construct (and return) a wrapper function that returns a promise.
  • We construct an adapter callback function, that invokes the Promise toolkit's reject() function in case of an error (with the corresponding error object provided by the callback), and resolve() in case of success. In case of success, it simply propagates any result object provided by the Node.js-style callback.
  • Finally, we invoke the Node.js-function with the given function parameters and our adapter callback.

With this function abstraction we can rewrite the earlier example as follows:


Task.findAll().then(function(tasks) {
return prom2cb.promisify(Promise, fs.writeFile)("tasks.txt", JSON.stringify(tasks));
})
.then(function() {
console.log("everything is OK");
}, function(err) {
console.log("error: "+err);
});

As may be observed, we can convert the writeFile() Node.js-style function invocation into an invocation returning a promise, and nicely structure the find and write-file invocations by chaining then() invocations.

Conclusions


In this blog post, I have explored two kinds of asynchronous function invocation patterns: Node.js-style and promise-style. You may wonder which one I like the most.

I actually hate them both, but I consider promises to be the more powerful of the two because of their composability. However, this comes at a price of doing some extra work to construct them. The most ideal solution to me is still a facility that is part of the language, instead of "forgetting" about existing language constructs and replacing them by custom-made abstractions.

I have also explained that we may have to combine both patterns, which is often quite tedious. Fortunately, we can create function abstractions that convert one into another to ease the pain.

Related work


I am not the first one comparing the function invocation patterns described in this blog post. Parts of this blog post are inspired by a blog post titled: "Callbacks are imperative, promises are functional: Node’s biggest missed opportunity". In this blog post, a comparison between the two invocation styles is done from a programming language paradigm perspective, and is IMO quite interesting to read.

I am also not the first to implement conversion functions between these two styles. For example, promises constructed with the bluebird library implement a method called .asCallback() allowing a user to chain a Node.js-style callback to a promise. Similarly, it provides a function: Promise.promisify() to wrap a Node.js-style function into a function returning a promise.
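
For example, the write-file scenario shown earlier could be expressed with bluebird roughly as follows (a sketch, assuming the bluebird package is installed; note that, unlike the promisify() abstraction shown earlier, bluebird resolves with a single value rather than an array of arguments):


var Promise = require('bluebird');
var fs = require('fs');

// Wrap a Node.js-style function into a function returning a promise:
var writeFileAsync = Promise.promisify(fs.writeFile);

writeFileAsync("tasks.txt", JSON.stringify([ "example task" ]))
    .then(function() {
        console.log("everything is OK");
    })
    // Chain a Node.js-style callback to the promise:
    .asCallback(function(err) {
        if(err) {
            console.log("error: " + err);
        }
    });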

However, the downside of bluebird is that these facilities can only be used if bluebird is used as a toolkit in an API. Some APIs use different toolkits or construct promises themselves. As explained earlier, Promises/A and Promises/A+ are just interface specifications and only the purpose of then() is defined, whereas the other facilities are extensions.

My function abstractions only make a few assumptions and should work with many implementations. Basically it only requires a proper .then() method (which should be obvious) and a new Promise(function(resolve, reject) { ... }) constructor.
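
For example, the following sketch wraps fs.readFile using the promisify() abstraction shown earlier, in combination with the standard ES6 Promise constructor (assuming a Node.js version that provides one). Because the abstraction resolves with an array of callback arguments, the file contents end up in the first array element:


var fs = require('fs');
var prom2cb = require('prom2cb');

// Construct a promise-returning wrapper with the built-in Promise constructor:
var readFileAsync = prom2cb.promisify(Promise, fs.readFile);

readFileAsync("tasks.txt").then(function(args) {
    console.log("file contents: " + args[0]);
}, function(err) {
    console.log("error: " + err);
});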

Besides the two function invocation styles covered in this blog post, there are others as well. For example, Zef's blog post titled: "Callback-Free Harmonious Node.js" covers a mechanism called 'Thunks'. In this pattern, an asynchronous function returns a function, which can be invoked to retrieve the corresponding error or result at a later point in time.

References


The two conversion abstractions described in this blog post are part of a package called prom2cb. It can be obtained from my GitHub page and the NPM registry.

Disnix 0.5 release announcement and some reflection

In this blog post, I'd like to announce the next Disnix release. At the same time, I noticed that it has been eight years since I started developing it, so this is also a nice opportunity to do some reflection.

Some background information


The idea was born while I was working on my master's thesis. A few months prior, I got familiar with Nix and NixOS -- I read Eelco Dolstra's PhD thesis, managed to package some software, and wrote a couple of services for NixOS.

Most of my packaging work was done to automate the deployment of WebDSL applications, a case study in domain-specific language engineering that is still an ongoing research project in my former research group. WebDSL's purpose is to be a domain-specific language for developing dynamic web applications with a rich data model.

Many aspects in Nix/NixOS were quite "primitive" compared to today's implementations -- there was no NixOS module system, making it less flexible to create additions. Many packages that I needed were missing and I had to write Nix expressions for them myself, such as Apache Tomcat, MySQL, and Midnight Commander. Also the desktop experience, such as KDE, was quite primitive, as only the base package was supported.

As part of my master's thesis project, I did an internship at the Healthcare Systems Architecture group at Philips Research. They had been developing a platform called SDS2, whose purpose was to provide asset tracking and utilization analysis services for medical equipment.

SDS2 qualifies itself as a service-oriented system (a term that people used to talk about frequently in the past, but not anymore :) ). As such, it can be decomposed into a set of distributable components (a.k.a. services) that interact with each other through "standardized protocols" (e.g. SOAP), sometimes through network links.

There are a variety of reasons why SDS2 has a distributed architecture. For example, data that has been gathered from medical devices may have to be physically stored inside a hospital for privacy reasons. The analysis components may require a lot of computing power and would perform better if they run in a data center with a huge amount of system resources.

Being able to distribute services is good for many reasons (e.g. meeting certain non-functional requirements, such as privacy), but it also has a big drawback -- services are software components, and one of their characteristics is that they are units of deployment. Deploying a single service to one machine without any (or proper) automation is already complicated and time consuming, but deploying a network of machines is many times as complex.

The goal of my thesis assignment was to automate SDS2's deployment in distributed environments using the Nix package manager as a basis. Nix provides a number of unique properties compared to many conventional deployment solutions, such as fully automated deployment from declarative specifications, and reliable and reproducible deployment. However, it was also lacking a number of features to provide the same or similar kinds of quality properties to deployment processes of service-oriented systems in networks of machines.

The result of my master's thesis project was the first prototype of Disnix that I never officially released. After my internship, I started my PhD research and resumed working on Disnix (as well as several other aspects). This resulted in a second prototype and two official releases eventually turning Disnix into what it is today.

Prototype 1


This was the prototype resulting from my master's thesis and was primarily designed for deploying SDS2.

The first component that I developed was a web service (using similar kinds of technologies as SDS2, such as Apache Tomcat and Apache Axis2) exposing a set of deployment operations to remote machines (most of them consulting the Nix package manager).

To cope with permissions and security, I decided to make the web service just an interface around a "core service" that was responsible for actually executing the deployment activities. The web service used the D-Bus protocol to communicate with the core.

On top of the web service layer, I implemented a collection of tools, each executing a specific deployment activity in a network of machines, such as building, distributing and activating services. There were also a number of tools combining deployment activities, such as the "famous" disnix-env command responsible for executing all the activities required to deploy a system.

The first prototype of disnix-env, in contrast to today's implementation, provided two deployment procedure variants: building on targets and building on the coordinator.

The first variant was basically inspired by the manual workflow I used to carry out to get SDS2 deployed -- I manually installed a couple of NixOS machines, used SSH to connect to them remotely, did a checkout of Nixpkgs and all the other Nix expressions that I needed, deployed all packages from source, and finally modified the system configuration (e.g. Apache Tomcat) to run the web services.

Unfortunately, transferring Nix expressions is not an easy process, as they are rarely self-contained and typically rely on other Nix expression files scattered over the file system. While thinking about a solution, I "discovered" that the Nix expression evaluator creates so-called store derivation files (low-level build specifications) for each package build. Store derivations are also stored in the Nix store next to ordinary packages, including their dependencies. I could instead instantiate a Nix expression on the coordinator, transfer the closure of store derivation files to a remote machine, and build them there.
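
In present-day Nix terms, this workflow roughly corresponds to the following command-line instructions (the expression file, attribute and machine names are hypothetical):


$ drvPath=$(nix-instantiate services.nix -A myservice)
$ nix-copy-closure --to test1.example.org $drvPath
$ ssh test1.example.org "nix-store --realise $drvPath"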

After some discussion with my company supervisor Merijn de Jonge, I learned that compiling on target machines was undesired, in particular in production environments. Then I learned more about Nix's purely functional nature, and "discovered" that builds are referentially transparent -- for example, it should not matter where a build has been performed. As long as the dependencies remain the same, the outcome would be the same as well. With this "new knowledge" in mind, I implemented a second deployment procedure variant that would do the package builds on the coordinator machine, and transfer their closures (dependencies) to the target machines.

As with the current implementation, deployment in Disnix was driven by three kinds of specifications: the services model, infrastructure model and distribution model. However, their notational conventions were a bit different -- the services model already knew about inter-dependencies, but propagating the properties of inter-dependencies to build functions was an ad-hoc process. The distribution model was a list of attribute sets also allowing someone to specify the same mappings multiple times (which resulted in undefined outcomes).

Another primitive aspect was the activation step, such as deploying web applications inside Apache Tomcat. It was basically done by a hardcoded script that only knew about Java web applications and Java command-line tools. Database activation was completely unsupported, and had to be done by hand.

I also did a couple of other interesting things. I studied the "two-phase commit protocol" for upgrading distributed systems atomically and mapped its concepts to Nix operations, to support (almost) atomic upgrades. This idea resulted in a research paper that I have presented at HotSWUp 2008.

Finally, I sketched a simple dynamic deployment extension (and wrote a partial implementation for it) that would calculate a distribution model, but time did not permit me to finish it.

Prototype 2


The first Disnix prototype made me quite happy in the early stages of my PhD research -- I gave many cool demos to various kinds of people, including our industry partner Philips Healthcare and NWO/Jacquard, the organization that was funding me. However, I soon realized that the first prototype became too limited.

The first annoyance was my reliance on Java. Most of the tools in the Disnix distribution were implemented in Java and depended on the Java Runtime Environment, which is quite a big dependency for a set of command-line utilities. I reengineered most of the Disnix codebase and rewrote it in C. I only kept the core service (which was implemented in C already) and the web service interface, that I separated into an external package called DisnixWebService.

I also got rid of the reliance on a web service to execute remote deployment operations, because it was quite tedious to deploy it. I made the communication aspect pluggable and implemented an SSH plugin that became the default communication protocol (the web service protocol could still be used as an external plugin).

For the activation and deactivation of services, I developed a plugin system (Disnix activation scripts) and a set of modules supporting various kinds of services replacing the hardcoded script. This plugin system allowed me to activate and deactivate many kinds of components, including databases.

Finally, I unified the two deployment procedure variants of disnix-env into one procedure. Building on the targets became simply an optional step that was carried out before building on the coordinator.

Disnix 0.1


After my major reengineering effort, I was looking into publishing something about it. While working on a paper (whose first version got badly rejected), I realized that services in an SOA context are "platform independent" because of their interfaces, but they still have implementations underneath that could depend on many kinds of technologies. This heterogeneity makes deployment extra complicated.

There was still one piece missing to bring service-oriented systems to their full potential -- Disnix did not support multiple operating systems. The Nix package manager could also be used on several other operating systems besides Linux, but Disnix was bound to Linux only.

I did another major reengineering effort to make the system architecture of the target systems configurable, requiring me to change many things internally. I also developed new notational conventions for the Disnix models. Each service expression became a nested function in which the outer function corresponds to the intra-dependencies and the inner function to the inter-dependencies, looking quite similar to expressions for ordinary Nix packages. Moreover, I removed the ambiguity problem in the distribution model by making it an attribute set.

The resulting Disnix version was first described in my SEAA 2010 paper. Shortly after the paper got accepted, I decided to officially release this version as Disnix 0.1. Many external aspects of this version are still visible in the current version.

Disnix 0.2


After releasing the first version of Disnix, I realized that there were still a few pieces missing for automating deployment processes of service-oriented systems. One of the limitations of Disnix is that it expects machines to be present already, possibly running a number of preinstalled system services, such as MySQL, Apache Tomcat, and the Disnix service exposing remote deployment operations. These machines had to be deployed by other means first.

Together with Eelco Dolstra I had been working on declarative deployment and testing of networked NixOS configurations, resulting in a tool called nixos-deploy-network that deploys networks of NixOS machines and a NixOS test driver capable of spawning networks of NixOS virtual machines in which system integration tests can be run non-interactively. These contributions were documented in a tech report and the ISSRE 2010 paper.

I made Disnix more modular so that extensions could be built on top of it. The most prominent extension was DisnixOS, which integrates NixOS deployment and the NixOS test driver's features with Disnix service deployment so that a service-oriented system's deployment process could be fully automated.

Another extension was Dynamic Disnix, a continuation of the dynamic deployment extension that I never finished during my internship. Dynamic Disnix extends the basic toolset with an infrastructure discovery tool and a distribution generator using deployment planning algorithms from the academic literature to map services to machines. The extended architecture is described in the SEAMS 2011 paper.

The revised Disnix architecture has been documented in both the WASDeTT 2010 and SCP 2014 papers and was released as Disnix 0.2.

Disnix 0.3


After the 0.2 release I got really busy, which was partly caused by the fact that I had to write my PhD thesis and yet another research paper for an unfinished chapter.

The last Disnix-related research contribution was a tool called Dysnomia, which I had based on the Disnix activation scripts package. I augmented the plugins with experimental state deployment operations and changed the package into a new tool that (in theory) could be combined with other tools as well, or used independently.

Unfortunately, I had to quickly rush out a paper for HotSWUp 2012 and the code was in a barely usable state. Moreover, the state management facilities had some huge drawbacks, so I was not that eager to get them integrated into the mainstream version.

Then I had to fully dedicate myself to completing my PhD thesis and for more than six months, I hardly wrote any code.

After finishing the first draft of my PhD thesis and while waiting for feedback from my committee, I left academia and switched jobs. Because I had no practical use cases for Disnix, and other duties in my new job, its development was done mostly in my spare time at a very low pace -- one of the things that I accomplished in that period was creating a 'slim' version of Dysnomia that supported all the activities in the HotSWUp paper without any snapshotting facilities.

Meanwhile, nixos-deploy-network got replaced by a new tool named Charon, which later became NixOps. In addition to deployment, NixOps could also instantiate virtual machines in IaaS environments, such as Amazon EC2. I modified DisnixOS to also integrate with NixOps to use its capabilities.

Three and a half years after the previous release (late 2014), my new employer wanted to deploy their new microservices-based system to a production environment, which made me quite motivated to work on Disnix again. I did some huge refactorings and optimized a few aspects to make it work for larger systems. Some interesting optimizations were concurrent data transfers and concurrent service activations.

I also implemented multi-connection protocol support. For example, you could use SSH to connect to one machine and SOAP to another.

After implementing the optimizations, I realized that I had reached a stable point and decided that it was a good time to announce the next release, after a few years of only little development activity.

Disnix 0.4


Despite being happy with the recent Disnix 0.3 release and using it to deploy many services to production environments, I quickly ran into another problem -- the services that I had to manage store data in their own dedicated databases. Sometimes I had to move services from one machine to another. Disnix (like the other Nix tools) does not manage state, requiring me to manually migrate data, which was quite painful.

I decided to dig up the state deployment facilities from the HotSWUp 2012 paper to cope with this problem. Despite having a number of limitations, the databases that I had to manage were relatively small (tens of megabytes), so the solution was still a good fit.

I integrated the state management facilities described in the paper from the prototype into the "production" version of Dysnomia, and modified Disnix to use them. I left out the incremental snapshot facilities described in the paper, because there was no practical use for them. When the work was done, I announced the next release.

Disnix 0.5


With Disnix 0.4, all my configuration management work was automated. However, I spotted a couple of inefficiencies, such as many unnecessary redeployments while upgrading. I solved this issue by making the target-specific services concept a first class citizen in Disnix. Moreover, I regularly had to deal with RAM issues and added on-demand activation support (by using the operating system's service manager, such as systemd).

There were also some user-unfriendly aspects that I improved -- better and more concise logging, more helpful error messages, and --rollback and --switch-generation options for disnix-env. Moreover, some commands that work on the deployment manifest were extended to take the last deployed manifest into account when no parameters are provided (e.g. disnix-visualize).

Conclusion


This long blog post describes how the current Disnix version (0.5) came about after nearly eight years of development. I'd like to announce its immediate availability! Consult the Disnix homepage for more information.

Managing NPM flat module installations in a Nix build environment

Some time ago, I have reengineered npm2nix and described some of its underlying concepts in a blog post. In the reengineered version, I have ported the implementation from CoffeeScript to JavaScript, refactored/modularized the code, and I have been improving the implementation to more accurately simulate NPM's dependency organization, including many of its odd traits.

I have observed that in the latest Node.js (the 5.x series) NPM's behaviour has changed significantly. To cope with this, I did yet another major reengineering effort. In this blog post, I will describe the path that has led to the latest implementation.

The first attempt


Getting a few commonly used NPM packages deployed with Nix is not particularly challenging, but to make it work completely right turns out to be quite difficult -- the early npm2nix implementations generated Nix expressions that build every package and all of its dependencies in separate derivations (in other words: each package and dependency translates to a separate Nix store path). To allow a package to find its dependencies, the build script creates a node_modules/ sub folder containing symlinks that refer to the Nix store paths of the packages that it requires.

NPM packages have loose dependency specifiers, e.g. wildcards and version ranges, whereas Nix package dependencies are exact, i.e. they bind to packages that are identified by unique hash codes derived from all build-time dependencies. npm2nix makes this translation by "snapshotting" the latest conforming version and turning that into a Nix package.

For example, one of my own software projects (NiJS) has the following package configuration file:


{
"name" : "nijs",
"version" : "0.0.23",
"dependencies" : {
"optparse" : ">= 1.0.3",
"slasp": "0.0.4"
}
...
}

The package configuration states that it requires optparse version 1.0.3 or higher, and slasp version 0.0.4. Running npm install results in the following directory structure of dependencies:


nijs/
...
package.json
node_modules/
optparse/
package.json
...
slasp/
package.json
...

A node_modules/ folder gets created in which each sub directory represents an NPM package that is a dependency of NiJS. In the older versions of npm2nix, it gets translated as follows:


/nix/store/ab12pq...-nijs-0.0.24/
...
package.json
node_modules/
optparse -> /nix/store/4pq1db...-optparse-1.0.5
slasp -> /nix/store/8j12qp...-slasp-0.0.4
/nix/store/4pq1db...-optparse-1.0.5/
...
package.json
/nix/store/8j12qp...-slasp-0.0.4/
...
package.json

Each involved package is stored in its own private folder in the Nix store. The NiJS package has a node_modules/ folder containing symlinks to its dependencies. For many packages, this approach works well enough, as it at least provides a conforming version for each dependency that it requires.

Unfortunately, it is possible to run into oddities as well. For example, a package that does not work properly in such a model is ironhorse.

For example, we could declare mongoose and ironhorse as dependencies of a project:


{
"name": "myproject",
"version": "0.0.1",
"dependencies": {
"mongoose": "3.8.5",
"ironhorse": "0.0.11"
}
}

Ironhorse has an overlapping dependency with the project's dependencies -- it also depends on mongoose, as shown in the following package configuration:


{
"name": "ironhorse",
"version": "0.0.11",
"license" : "MIT",
"dependencies" : {
"underscore": "~1.5.2",
"mongoose": "*",
"temp": "*",
...
},
...
}

Running 'npm install' on project level yields the following directory structure:


myproject/
...
package.json
node_modules/
mongoose/
...
ironhorse/
...
package.json
node_modules/
underscore/
temp/

Note that mongoose only appears once in the hierarchy of node_modules/ folders, despite the fact that it has been declared as a dependency twice.

In contrast, when using an older version of npm2nix, the following directory structure gets generated:


/nix/store/67ab07...-myproject-0.0.1
...
package.json
node_modules/
mongoose -> /nix/store/ec704c...-mongoose-3.8.5
ironhorse -> /nix/store/3ee85e...-ironhorse-0.0.11
/nix/store/3ee85e...-ironhorse-0.0.11
...
package.json
node_modules/
underscore -> /nix/store/10af96...-underscore-1.5.2
mongoose -> /nix/store/a37f75...-mongoose-4.4.5
temp -> /nix/store/fae379...-temp-0.8.3
/nix/store/ec704c...-mongoose-3.8.5
package.json
...
/nix/store/a37f75...-mongoose-4.4.5
package.json
...
/nix/store/10af96...-underscore-1.5.2
/nix/store/fae379...-temp-0.8.3

In the above directory structure, we can observe that two different versions of mongoose have been deployed -- version 3.8.5 (as a dependency for the project) and version 4.4.5 (as a dependency for ironhorse). Having two different versions of mongoose deployed typically leads to problems.

The reason why npm2nix produces a different result is that whenever NPM encounters a dependency specification, it recursively searches the parent directories for a conforming version. If a conforming version has been found that fits within the version range of a package dependency, it will not be included again. This is also the reason why NPM can "handle" cyclic dependencies (despite the fact that they are a bad practice) -- when a dependency is encountered a second time, it will not be deployed again, causing NPM to break the cycle.

npm2nix did not implement this kind of behaviour -- it always binds a dependency to the latest conforming version. But as can be observed in the last example, this is not what NPM always does -- it could also bind to a shared dependency that may be older than the latest version in the NPM registry (as a side note: I wonder how many NPM users actually know about this detail!).
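
The following fragment is a simplified sketch (not NPM's actual implementation) of the lookup described above, in which the semver package is used to check whether an already deployed version conforms to a dependency's version specifier:


var semver = require('semver');

/*
 * parents: an array of objects mapping package names to versions, representing
 * the node_modules/ folders of the includer and its ancestors, innermost first.
 */
function findSharedDependency(parents, name, versionSpec) {
    for(var i = 0; i < parents.length; i++) {
        var providedVersion = parents[i][name];

        if(providedVersion !== undefined && semver.satisfies(providedVersion, versionSpec)) {
            return providedVersion; // Reuse the shared dependency; do not include it again
        }
    }

    return null; // Nothing conforms: include the latest conforming version privately
}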

Second attempt: simulating shared dependency behaviour


One of the main objectives of the reengineered version (as described in my previous blog post) is to more accurately mimic NPM's shared dependency behaviour, as the old behaviour was particularly problematic for packages having cyclic dependencies -- Nix does not allow them, causing the evaluation of the entire Nixpkgs set on the Hydra build server to fail.

The reengineered version worked, but the solution was quite expensive and controversial -- I compose Nix expressions of all packages involved, in which each dependency resolves to the latest corresponding version.

Each time a package includes a dependency, I propagate an attribute set to its build function telling it which dependencies have already been resolved by any of the parents. Resolved dependencies get excluded as a dependency.

To check whether a resolved dependency fits within a version range specifier, I have to consult semver. Because semver is unsupported in the Nix expression language, I use a trick in which I import Nix expressions generated by a build script (that invokes the semver command-line utility) to figure out which dependencies have been resolved already.

Besides consulting semver, I used another hack -- packages that have been resolved by any of the parents must be excluded as a dependency. However, NPM packages in Nix are deployed independently from each other in separate build functions, and these builds will fail because NPM expects the excluded packages to be present. To solve this problem, I create shims for the excluded packages by substituting them with empty packages having the same name and version, and removing them after the package has been built.

Symlinking the dependencies also no longer worked reliably -- the CommonJS module system dereferences the location of the includer first and looks in the parent directories for shared dependencies relative to that location. This means that, in case of a symlink, it incorrectly resolves to a Nix store path that has no meaningful parent directories. The only solution I could think of was copying dependencies instead of symlinking them.

To summarize: the new solution worked more accurately than the original version (and can cope with cyclic dependencies) but it is quite inefficient as well -- making copies of dependencies causes a lot of duplication (that would be a waste of disk space) and building Nix expressions in the instantiation phase makes the process quite slow.

Third attempt: computing the dependency graph ahead of time


Apart from the inefficiencies described earlier, the main reason that I had to do yet another major revision is that Node.js 5.x (which includes npm 3.x) executes so-called "flat module installations". The idea is that when a package includes a dependency, it will be stored in a node_modules/ folder as high in the directory structure as possible without breaking any dependencies.

This new approach has a number of implications. For example, deploying the Disnix virtual hosts test web application with the old npm 2.x used to yield the following directory structure:


webapp/
...
package.json
node_modules/
express/
...
package.json
node_modules/
accepts/
array-flatten/
content-disposition/
...
ejs/
...
package.json

As can be observed in the structure above, the test webapp depends on two packages: express and ejs. Express has dependencies of its own, such as accepts, array-flatten, content-disposition. Because no parent node_modules/ folder provides them, they are included privately for the express package.

Running 'npm install' with the new npm 3.x yields the following directory structure:


webapp/
...
package.json
node_modules/
accepts/
array-flatten/
content-disposition/
express/
...
package.json
ejs/
...
package.json

Since the libraries that express requires do not conflict with the includer's dependencies, they have been moved one level up to the parent package's node_modules/ folder.

Flattening the directory structure makes deploying an NPM project even more imperative -- previously, the dependencies that were included with a package depended on the state of the includer. Now we must also modify the entire directory hierarchy of dependencies by moving packages up in the directory structure. It also makes the resulting dependency graph less predictable. For example, the order in which dependencies are installed matters -- unless all dependencies are discarded and reinstalled from scratch, different installation orders may result in different dependency graphs.

If this flat module approach has all kinds of oddities, why would NPM use such an approach, you may wonder? It turns out that the only reason is better Windows support. On Windows, there is a limit on the length of paths, and flattening the directory structure helps to prevent hitting it. Unfortunately, it comes at the price of making deployments more imperative and less predictable.

To simulate this flattening strategy, I had to revise npm2nix again. Because of its previous drawbacks and the fact that we have to perform even more imperative operations, I have decided to implement a new strategy in which the generator computes the entire dependency graph ahead of time, instead of hacking it into the evaluation phase of the Nix expressions.

Supporting private and shared dependencies works exactly the same as in the old implementation, but is now performed ahead of time. Additionally, I simulate the flat dependency structure as follows (a simplified JavaScript sketch is shown after the list):

  • When a package requires a dependency: I check whether the parent directory has a conflicting dependency. This means: it either has a dependency bundled with the same name and a different version or indirectly binds to another parent that provides a conflicting version.
  • If the dependency conflicts: bundle the dependency in the current package.
  • If the dependency does not conflict: bind the package to the dependency (but do not include it) and consult the parent package one level higher.
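
The following fragment sketches this placement strategy. The data structures are hypothetical simplifications (each package object is assumed to have a parent reference and a dependencies object mapping names to package objects), and version conflicts are checked by exact version comparison rather than semver ranges:


// Deploy depName/depVersion for 'requester' as high up in the hierarchy as
// possible without introducing a conflict.
function placeDependency(requester, depName, depVersion) {
    var host = requester;

    while(host.parent !== null) {
        var provided = host.parent.dependencies[depName];

        if(provided === undefined) {
            host = host.parent; // No conflict yet: try to hoist one level higher
        } else if(provided.version === depVersion) {
            return provided; // A conforming version is reachable already: bind to it
        } else {
            break; // Conflicting version in the parent: bundle the dependency in 'host'
        }
    }

    var dependency = { name: depName, version: depVersion, parent: host, dependencies: {} };
    host.dependencies[depName] = dependency;
    return dependency;
}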

Besides computing the dependency graph ahead of time, I also deploy the entire dependency graph in one Nix build function -- because including dependencies is stateful, it no longer makes sense to build them as individual Nix packages, that are supposed to be pure.

I have made the flattening algorithm optional. By default, the new npm2nix generates Nix expressions for the Node.js 4.x release (using the old npm 2.x):


$ npm2nix

By appending the -5 parameter, it generates Nix expressions for usage with Node.js 5.x (using the new npm 3.x with flat module installations):


$ npm2nix -5

I have tested the new approach on many packages including my public projects. The good news is: they all seem to work!

Unfortunately, despite the fact that I could get many packages working, the approach is not perfect and hard to get 100% right. For example, in a private project I have encountered bundled dependencies (dependencies that are statically included with a package). NPM also moves these up, while npm2nix merely generates an expression composing the dependency graph (reflecting flat module installations as much as possible). To fix this issue, we must also run a post-processing step that moves dependencies that are in the wrong places up. Currently, this step has not been implemented in npm2nix yet.

Another issue is that we want Nix, instead of NPM, to obtain all dependencies. To prevent NPM from consulting external resources, we substitute some version specifiers (such as Git repositories) by a wildcard: *. These version specifiers sometimes confuse NPM, despite the fact that the directory structure matches NPM's dependency structure.
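
For example, a dependency referring to a Git repository (a hypothetical specifier) is substituted as follows, leaving it up to the Nix builder script to provide the actual dependency:


"dependencies": {
"somepackage": "git+https://github.com/someuser/somepackage.git"
}

becomes:


"dependencies": {
"somepackage": "*"
}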

To cope with these imperfections, I have also added an option to npm2nix that prevents it from running npm install -- in many cases, packages still work fine despite NPM being confused. Moreover, the npm install step in the Nix builder environment merely serves as a validation step -- the Nix builder script is responsible for actually providing the dependencies.

Discussion


In this blog post, I have described the path that has led to a second reengineered version of npm2nix. The new version computes dependency graphs ahead of time and can mostly handle npm 3.x's flat module installations. Moreover, compared to the previous version, it no longer relies on very expensive and ugly hacks.

Despite the fact that I can now more or less handle flat installations, I am still not quite happy. Some things that bug me are:

  • The habit of "reusing" modules that have been bundled with any of the includers makes it, IMO, difficult and counter-intuitive to predict which version will actually be used in a certain context. In some cases, packagers might expect that the latest version of a version range will be used, but this is not guaranteed to be the case. This could, for example, reintroduce security and stability issues without end users noticing (or expecting) it.
  • Flat module installations are less deterministic and make it really difficult to predict what a dependency graph looks like -- the dependencies that appear at a certain level in the directory structure depend on the order in which dependencies are installed. Therefore, I do not consider this an improvement over npm 2.x.

Because of these drawbacks, I expect that NPM will reconsider some of its concepts again in the future causing npm2nix to break again.

I would recommend the NPM developers to use the following approach:

  • All involved packages should be stored in a single node_modules/ folder instead of multiple nested hierarchies of node_modules/ folders.
  • When a module requests another module, the module loader should consult the package.json configuration file of the package to which the includer module belongs. It should take the latest conforming version in the central node_modules/ folder. I consider taking the latest version of a version range to be less counter-intuitive than taking any conforming version.
  • To be able to store multiple versions of packages in a single node_modules/ folder, a better directory naming convention should be adopted. Currently, NPM only identifies modules by name in a node_modules/ folder, making it impossible to store two versions next to each other in one directory.

    If they would, for example, use both the name and the version number in the directory names, more things are possible (a hypothetical layout is sketched below this list). Adding more properties to the path names makes sharing even better -- for example, a package with a name and version number could originate from various sources, e.g. the NPM registry or a Git repository -- and reflecting this in the path makes it possible to store more variants next to each other in a reliable way.

    Naming things to improve shareability is not really rocket science -- Nix uses hash codes (that are derived from all build-time dependencies) to uniquely identify packages, and the .NET Global Assembly Cache uses so-called strong names that include various naming attributes, such as cryptographic keys, to ensure that libraries do not conflict. I am convinced that adopting a better naming convention for storing NPM packages would be quite beneficial as well.
  • To cope with cyclic dependencies: I would simply say that it suffices to disallow them. Packages are supposed to be units of reuse, and if two packages mutually depend on each other, then they should be combined into one package.
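
For example, a hypothetical node_modules/ layout using such a name-and-version convention could store the two mongoose versions from the earlier example side by side in a single folder:


node_modules/
ironhorse-0.0.11/
mongoose-3.8.5/
mongoose-4.4.5/
temp-0.8.3/
underscore-1.5.2/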

Availability


The second reengineered npm2nix version can be obtained from my GitHub page. The code resides in the reengineering2 branch.

The NixOS project and deploying systems declaratively


Last weekend I was in Wrocław, Poland to attend wroc_love.rb, a conference tailored towards (but not restricted to) Ruby-related applications. The reason for me to go there is that I was invited to give a talk about NixOS.

As I had never visited Poland nor attended a Ruby-related conference before, I did not really know what to expect, but it turned out to be a nice experience. The city, venue and people were all quite interesting, and I liked it very much.

In my talk I basically had two objectives: providing a brief introduction to NixOS and diving into one of its underlying visions: declarative deployment. From my perspective, the former aspect is not particularly new as I have given talks about the NixOS project many times (for example, I also crafted three explanation recipes).

Something that I have not done before is diving into the latter aspect. In this blog post, I'd like to elaborate on it, discuss why it is appealing, and explore to what extent certain tools achieve it.

On being declarative


I have used the word declarative in many of my articles. What is it supposed to mean?

I have found a nice presentation online that elaborates on four kinds of sentences in linguistics. One of the categories covered in the slides is declarative sentences, which (according to the presentation) can be defined as:

A declarative sentence makes a statement. It is punctuated by a period.

As an example, the presentation shows:

The dog in the neighbor's yard is barking.

Another class of sentences that the presentation describes is imperative sentences, which it defines as:

An imperative sentence is a command or polite request. It ends in a period or exclamation mark.

The following xkcd comic shows an example:


(Besides these two categories of sentences described earlier, the presentation also covers interrogative sentences and exclamatory sentences, but I won't go into detail on that).

On being declarative in programming


In linguistics, the distinction between declarative and imperative sentences is IMO mostly clear -- declarative sentences state facts and imperative sentences are commands or requests.

A similar distinction exists in programming as well. For example, on Wikipedia I found the following definition for declarative programming (the Wikipedia article cites the article: "Practical Advantages of Declarative Programming" written by J.W. Lloyd, which I unfortunately could not find anywhere online):

In computer science, declarative programming is a programming paradigm -- a style of building the structure and elements of computer programs -- that expresses the logic of a computation without describing its control flow.

Imperative programming is sometimes seen as the opposite of declarative programming, but not everybody agrees. I found an interesting discussion blog post written by William Cook that elaborates on their differences.

His understanding of the declarative and imperative definitions are:

Declarative: describing "what" is to be computed rather than "how" to compute the result/behavior

Imperative: a description of a computation that involves implicit effects, usually mutable state and input/output.

Moreover, he says the following:

I agree with those who say that "declarative" is a spectrum. For example, some people say that Haskell is a declarative language, but in my view Haskell programs are very much about *how* to compute a result.

I also agree with William Cook's opinion that declarative is a spectrum -- contrary to linguistics, it is hard to draw a hard line between what and how in programming. Some programming languages that are considered imperative, e.g. C, modify mutable state such as variables:


int a = 5;
a += 3;

But if we would modify the code to work without mutable state, it still remains more a "how" description than a "what" description IMO:


int sum(int a, int b)
{
return a + b;
}

int result = sum(5, 3);

Two prominent languages that are more about what than how are HTML and CSS. Both technologies empower the web. For example, in HTML I can express the structure of a page:


<!DOCTYPE html>

<html>
<head>
<title>Test</title>
<link rel="stylesheet" href="style.css" type="text/css">
</head>
<body>
<div id="outer">
<div id="inner">
<p>HTML and CSS are declarative and so cool!</p>
</div>
</div>
</body>
</html>

In the above code fragment, I define two nested divisions in which a paragraph of text is displayed.

In CSS, I can specify the style of these page elements:


#outer {
margin-left: auto;
margin-right: auto;
width: 20%;
border-style: solid;
}

#inner {
width: 500px;
}

In the above example, we state that the outer div should be centered, have a width of 20% of the page, and a solid border should be drawn around it. The inner div has a width of 500 pixels.

This approach can be considered declarative, because you do not have to specify how to render the page and the style of the elements (e.g. the text, the border). Instead, the browser's layout engine figures this out. Besides taking care of rendering, this approach has a number of additional benefits as well, such as:

  • Because it does not matter (much) how a page is rendered, we can fully utilize a system's resources (e.g. a GPU) to render a page in a faster and more fancy way, and optionally degrade a page's appearance if a system's resources are limited.
  • We can also interpret the page in many ways. For example, we can pass the text in paragraphs to a text to speech engine, for people that are visually impaired.

Despite listing some potential advantages, HTML and CSS are not perfect at all. If you would actually check how the example gets rendered in your browser, then you will observe one of CSS's many odd traits, but I am not going to reveal what it is. :-)

Moreover, despite being more declarative (than code written in an imperative programming language such as C) even HTML and CSS can sometimes be considered a "how" specification. For example, you may want to render a photo gallery on your web page. There is nothing in HTML and CSS that allows you to concisely express that. Instead, you need to decompose it into "lower level" page elements, such as paragraphs, hyperlinks, forms and images.

So IMO, being declarative depends on what your goal is -- in some contexts you can exactly express what you want, but in others you can only express things that are in service of something else.

On being declarative in deployment


In addition to development, you eventually have to deploy a system (typically to a production environment) to make it available to end users. To deploy a system you must carry out a number of activities, such as:

  • Building (if a compiled language is used, such as Java).
  • Packaging (e.g. into a JAR file).
  • Distributing (transferring artifacts to the production machines).
  • Activating (e.g. a Java web application in a Servlet container).
  • In case of an upgrade: deactivating obsolete components.

Deployment is often much more complicated than most people expect. Some things that make it complicated are:

  • Many kinds of steps need to be executed, in particular when the technology used is diverse. Without any automation, it becomes extra complicated and time consuming.
  • Deployment in production must typically be done on a large scale. In development, a web application/web service typically serves only one user (the developer), while in production it may need to serve thousands or millions of users. In order to serve many users, you need to manage a cluster of machines having complex constraints in terms of system resources and connectivity.
  • There are non-functional requirements that must be met. For example, while upgrading you want to minimize a system's downtime as much as possible. You probably also want to roll back to a previous version if an upgrade went wrong. Accomplishing these properties is often much more complicated than expected (sometimes even impossible!).

As with linguistics and programming, I see a similar distinction in deployment as well -- carrying out the activities listed above is simply a means to accomplish deployment.

What I want (if I need to deploy) is that my system on my development machine becomes available in production, while meeting certain quality attributes of the system that is being deployed (e.g. it could serve thousands of users) and quality attributes of the deployment process itself (e.g. that I can easily roll back in case of an error).

Mainstream solutions: convergent deployment


There are a variety of configuration management tools claiming to support declarative deployment. The most well-known category of tools implements convergent deployment; examples include CFEngine, Puppet, Chef and Ansible.

For example, Chef is driven by declarative deployment specifications (implemented in a Ruby DSL) that may look as follows (I took this example from a Chef tutorial):


...

wordpress_latest = Chef::Config[:file_cache_path] + "/wordpress-latest.tar.gz"

remote_file wordpress_latest do
source "http://wordpress.org/latest.tar.gz"
mode "0644"
end

directory node["phpapp"]["path"] do
owner "root"
group "root"
mode "0755"
action :create
recursive true
end

execute "untar-wordpress" do
cwd node['phpapp']['path']
command "tar --strip-components 1 -xzf " + wordpress_latest
creates node['phpapp']['path'] + "/wp-settings.php"
end

The objective of the example shown above is deploying a Wordpress web application. What the specification defines is a tarball that must be fetched from the Wordpress web site, a directory that must be created in which a web application is hosted and a tarball that needs to be extracted into that directory.

The specification can be considered declarative, because you do not have to describe the exact steps that need to be executed. Instead, the specification captures the intended outcome of a set of changes and the deployment system converges to the outcome. For example, for the directory that needs to be created, it first checks if it already exists. If so, it will not be created again. It also checks whether it can be created, before attempting to do it.

Converging, instead of directly executing steps, provides additional safety mechanisms and makes deployment processes more efficient as duplicate work is avoided as much as possible.

There are also a number of drawbacks -- it is not guaranteed (in case of an upgrade) that the system can converge to a new set of outcomes. Moreover, while upgrading a system we may observe downtime (e.g. when a new version of Wordpress is being unpacked). Also, rolling back to a previous configuration cannot be done instantly.

Finally, convergent deployment specifications do not guarantee reproducible deployment. For example, the above code does not capture the configuration process of a web server and a PHP extension module, which are required dependencies to run Wordpress. If we would apply the changes to a machine where these components are missing, the changes may still apply but yield a non-working configuration.

The NixOS approach


NixOS also supports declarative deployment, but in a different way. The following code fragment is an example of a NixOS configuration:


{pkgs, ...}:

{
boot.loader.grub.device = "/dev/sda";

fileSystems = [ { mountPoint = "/"; device = "/dev/sda2"; } ];
swapDevices = [ { device = "/dev/sda1"; } ];

services = {
openssh.enable = true;

xserver = {
enable = true;
desktopManager.kde4.enable = true;
};
};

environment.systemPackages = [ pkgs.mc pkgs.firefox ];
}

In a NixOS configuration you describe what components constitute a system, rather than the outcome of changes:

  • The GRUB bootloader should be installed on the MBR of partition: /dev/sda.
  • The /dev/sda2 partition should be mounted as a root partition, /dev/sda1 should be mounted as a swap partition.
  • We want Mozilla Firefox and Midnight Commander as end user packages.
  • We want to use the KDE 4.x desktop.
  • We want to run OpenSSH as a system service.

The entire machine configuration can be deployed by running a single command-line instruction:


$ nixos-rebuild switch

NixOS executes all required deployment steps to deploy the machine configuration -- it downloads or builds all required packages from source code (including all its dependencies), it generates the required configuration files and finally (if all the previous steps have succeeded) it activates the new configuration including the new system services (and deactivating the system services that have become obsolete).

Besides executing the required deployment activities, NixOS has a number of important quality attributes as well:

  • Reliability. Nix (the underlying package manager) ensures that all dependencies are present. It stores new versions of packages next to old versions, without overwriting them. As a result, you can always switch back to older versions if needed.
  • Reproducibility. Undeclared dependencies do not influence builds -- if a build works on one machine, then it works on others as well.
  • Efficiency. Nix only deploys packages and configuration files that are needed.

NixOS is a Linux distribution, but the NixOS project provides other tools bringing the same (or similar) deployment properties to other areas. Nix works on the package level (and works on other systems besides NixOS, such as conventional Linux distributions and Mac OS X), NixOps deploys networks of NixOS machines, and Disnix deploys (micro)services in networks of machines.

The Nix way of deploying is typically my preferred approach, but these tools also have their limits -- to benefit from the quality properties they provide, everything must be deployed with Nix (and as a consequence: specified in Nix expressions). You cannot take an existing system (deployed by other means) first and change it later, something that you can actually do with convergent deployment tools, such as Chef.

Moreover, Nix (and its sub projects) only manage the static parts of a system such as packages and configuration files (which are made immutable by Nix by making them read-only), but not any state, such as databases.

For managing state, external solutions must be used. For example, I developed a tool called Dysnomia with similar semantics to Nix, but it is not always a good solution, especially for big chunks of state.

How declarative are these deployment solutions?


I have heard some people claiming that the convergent deployment models are not declarative at all, and the Nix deployment models are actually declarative because they do not specify imperative changes.

Again, I think it depends on how you look at it -- basically, the Nix tools solve problems in a technical domain from declarative specifications, e.g. Nix deploys packages, NixOS entire machine configurations, NixOps networks of machines etc., but typically you would do these kinds of things to accomplish something else, so in a sense you could still consider these approaches a "how" rather than a "what".

I have also developed domain-specific deployment tools on top of the tools that are part of the Nix project, allowing me to express concisely what I want in a specific domain:

WebDSL


WebDSL is a domain-specific language for developing web applications with a rich data model, supporting features such as domain modelling, user interfaces and access control. The WebDSL compiler produces Java web applications.

In order to deploy a WebDSL application in a production environment, all kinds of complicated tasks need to be carried out -- we must install a MySQL server and an Apache Tomcat server, deploy the web application to the Tomcat server, tune specific settings, and install a reverse proxy that does caching, etc.

You typically do not want to express such things in a deployment model. I have developed a tool called webdsldeploy allowing someone to only express the deployment properties that matter for WebDSL applications on a high level. Underneath, the tool consults NixOps (formerly known as Charon) to compose system configurations hosting the components required to run the WebDSL application.

Conference compass


Conference Compass sells services to conference organizers. The most visible part of their service is the apps for conference attendees, providing features such as displaying a conference program, a list of speakers and floor maps of the venue.

Each customer basically gets "their own app" -- an app for a specific customer has their preferred colors, artwork, content, etc. We use a single code base to produce specialized apps.

To produce such specialized apps, we do not want to specify things such as how to build an app for Android through Nix, an app for iOS through Nix, and how to produce debug and release versions etc. These are basically just technical details.

Instead, we have developed our own custom tool that is driven by a specification that concisely expresses what customizations we want (e.g. artwork) and produces the artefacts we want accordingly.

We use a similar approach for our backends -- each app connects to its own dedicated backend allowing users to configure the content displayed in the app. The configurator can also be used to dynamically update the content that is displayed in the apps. For big customers, we offer an additional service in which we develop programs that automatically import data from their information systems.

For the deployment of these backend instances, we do not want to express things such as machines, database services, and the deployment of NPM and Python packages.

Instead, we use a domain-specific tool that is driven by a model that concisely expresses which configurators we want and which third party integrations they provide. The tool is responsible for instantiating virtual machines in the cloud and deploying the services to them.

Conclusion


In this blog post, I have elaborated on being declarative in deployment and discussed to what extent certain tools achieve it. As with declarative programming, being declarative in deployment is a spectrum.

References


Some aspects discussed in this blog post are covered in my PhD thesis:
  • I did a more elaborate comparison of infrastructure deployment solutions in Chapter 6, in which I also cover convergent deployment, using CFEngine as an example.
  • I have covered webdsldeploy in Chapter 11, including some background information about WebDSL and its deployment aspects.
  • The overall objective of my PhD thesis is constructing deployment tools for specific domains. Most of the chapters cover the ingredients to do so, but Chapter 3 describes a reference architecture for deployment tools with properties similar (or comparable) to the tools in the Nix project.

For convenience, I have also embedded the slides of my presentation into this web page:

Managing the state of mutable components in NixOS configurations with Dysnomia


In an old blog post (and research paper) from a couple of years ago, I described a prototype version of Dysnomia -- a toolset that can be used to deploy so-called "mutable components". In the middle of last year, I integrated the majority of its concepts into the mainstream version of Dysnomia, because I had found some practical use for it.

So far, I have only used Dysnomia in conjunction with Disnix -- Disnix executes all activities required to deploy a service-oriented system, such as:

  • Building services and their intra-dependencies from source code. By default, Disnix performs the builds on the coordinator machine, but can also optionally delegate them to target machines in the network.
  • Distributing services and their intra-dependency closures to the appropriate target machines in the network.
  • Activating newly deployed services, and deactivating obsolete services.
  • Optionally snapshotting, transferring and restoring the state of services (or a subset of services) that have moved from one target machine to another.
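
All of these activities are typically triggered by a single command-line invocation. For example, a common invocation looks as follows (the model file names are just examples):

$ disnix-env -s services.nix -i infrastructure.nix -d distribution.nix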

For carrying out the building and distribution activities, Disnix invokes the Nix package manager, as it provides a number of powerful features that make the deployment of packages more reliable and reproducible.

However, not all activities required to deploy service-oriented systems are supported by Nix and this is where Dysnomia comes in handy -- one of Dysnomia's objectives is to uniformly activate and deactivate mutable components in containers by modifying the latter's state. The other objective is to uniformly support snapshotting and restoring the state of mutable components deployed in a container.

The definitions of mutable components and containers are deliberately left abstract in a Dysnomia context. Basically, they can represent anything, such as:

  • A MySQL database schema component and a MySQL DBMS container.
  • A Java web application component (WAR file) and an Apache Tomcat container.
  • A UNIX process component and a systemd container.
  • Even NixOS configurations can be considered mutable components.

To support many kinds of component and container flavours, Dysnomia has been designed as a plugin system -- each Dysnomia module has a standardized interface (basically a process taking two standard command line parameters) and implements a set of standard deployment activities (e.g. activate, deactivate, snapshot and restore) for a particular type of container.
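
To illustrate the idea, the following is a minimal sketch (not one of the actual Dysnomia modules) of what a module for a hypothetical key-value store could look like. It assumes that the first parameter is the activity, the second parameter is the path to the mutable component, and that the container properties (such as kvPort) are exposed as environment variables. The kvstore-admin tool is imaginary:

#!/bin/bash -e
# Sketch of a Dysnomia module for an imaginary key-value store.
# Assumption: $1 = activity, $2 = mutable component path, and container
# properties (e.g. kvPort) are provided as environment variables.

activity="$1"
component="$2"
name="$(basename "$component")"

case "$activity" in
    activate)
        # Import the component's initial state on first activation
        kvstore-admin --port "$kvPort" import "$name" "$component/initial-state.dump"
        ;;
    deactivate)
        kvstore-admin --port "$kvPort" drop "$name"
        ;;
    snapshot)
        # Dump the current state (the real modules store dumps in the snapshot store)
        kvstore-admin --port "$kvPort" dump "$name" > "$name.dump"
        ;;
    restore)
        kvstore-admin --port "$kvPort" import "$name" "$name.dump"
        ;;
esac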

Despite the fact that Dysnomia was originally designed for use with Disnix (the package was historically known as Disnix activation scripts), it can also be used as a standalone tool or in combination with other deployment solutions. (As a sidenote: the reason I picked the name Dysnomia is that, like Nix, it is the name of a moon of a trans-Neptunian object.)
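
For example, assuming that we have manually written a container configuration file (~/my-mysql-container) and a mutable component (~/my-database) -- both names are just placeholders -- the state management activities can be invoked directly from the command-line:

$ dysnomia --operation snapshot --component ~/my-database \
  --container ~/my-mysql-container
$ dysnomia --operation restore --component ~/my-database \
  --container ~/my-mysql-container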

Similar to Disnix, when deploying NixOS configurations, all activities to deploy the static parts of a system are carried out by the Nix package manager.

However, in the final step (the activation step) a big generated shell script is executed that is responsible for deploying the dynamic parts of a system, such as updating the GRUB bootloader, reloading systemd units, creating folders that store variable data (e.g. /var), creating user accounts and so on.

In some cases, it may also be desired to deploy mutable components as part of a NixOS system configuration:

  • Some systems are monolithic and cannot be decomposed into services (i.e. distributable units of deployment).
  • Some NixOS modules have scripts to initialize the state of a system service on first startup, such as a database, but do it in their own ad hoc way, i.e. there is no real formalism behind it.
  • You may also want to use Dysnomia's (primitive) snapshotting facilities for backup purposes.

Recently, I did some interesting experiments with Dysnomia on the NixOS level. In this blog post, I will show how Dysnomia can be used in conjunction with NixOS.

Deploying NixOS configurations


As described in earlier blog posts, in NixOS, deployment is driven by a single NixOS configuration file (/etc/nixos/configuration.nix), such as:

{pkgs, ...}:

{
  boot.loader.grub = {
    enable = true;
    device = "/dev/sda";
  };

  fileSystems."/" = {
    device = "/dev/disk/by-label/nixos";
    fsType = "ext4";
  };

  services = {
    openssh.enable = true;

    mysql = {
      enable = true;
      package = pkgs.mysql;
      rootPassword = ../configurations/mysqlpw;
    };
  };
}

The above configuration file states that we want to deploy a system using the GRUB bootloader, having a single root partition, and running OpenSSH and MySQL as system services. The configuration can be deployed with a single command-line instruction:

$ nixos-rebuild switch

When running the above command-line instruction, the Nix package manager deploys all required packages and configuration files. After all packages have been successfully deployed, the activation script gets executed. As a result, we have a system running OpenSSH and MySQL.

By modifying the above configuration and adding another service after MySQL:

...

mysql = {
  enable = true;
  package = pkgs.mysql;
  rootPassword = ../configurations/mysqlpw;
};

tomcat = {
  enable = true;
  commonLibs = [ "${pkgs.mysql_jdbc}/share/java/mysql-connector-java.jar" ];
  catalinaOpts = "-Xms64m -Xmx256m";
};

...

and running the same command-line instruction again:

$ nixos-rebuild switch

The NixOS configuration gets upgraded to also run Apache Tomcat as a system service in addition to MySQL and OpenSSH. When upgrading, Nix only builds or downloads the packages that have not been deployed before, making the upgrade process much more efficient than rebuilding from scratch.

Managing collections of mutable components


Similar to NixOS configurations (that represent entire system configurations), we need to manage the deployment of mutable components belonging to a system configuration as a whole. I have developed a new tool called dysnomia-containers for this purpose.

The following command-line instruction queries all available containers on a system that serve as potential deployment targets:

$ dysnomia-containers --query-containers
mysql-database
process
tomcat-webapplication
wrapper

The above command-line instruction searches all folders in the DYSNOMIA_CONTAINERS_PATH environment variable (which defaults to /etc/dysnomia/containers) for container configuration files and displays their names, such as mysql-database (corresponding to a MySQL DBMS server) and process and wrapper (virtual containers that integrate with the host system's service manager, such as systemd).
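
In other words, each container corresponds to a plain configuration file on disk. For example (an illustration, assuming the default location is used):

$ ls /etc/dysnomia/containers
mysql-database  process  tomcat-webapplication  wrapper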

We can also query the available mutable components that we can deploy to the above listed containers:

$ dysnomia-containers --query-available-components
mysql-database/rooms
mysql-database/staff
mysql-database/zipcodes
tomcat-webapplication/GeolocationService
tomcat-webapplication/RoomService
tomcat-webapplication/StaffService
tomcat-webapplication/StaffTracker
tomcat-webapplication/ZipcodeService

The above command-line instruction displays all the available mutable component configurations that reside in directories provided by the DYSNOMIA_COMPONENTS_PATH environment variable, such as three MySQL databases and five Apache Tomcat web applications.

We can deploy all the available mutable components to the available containers, by running:

$ dysnomia-containers --deploy
Activating component: rooms in container: mysql-database
Activating component: staff in container: mysql-database
Activating component: zipcodes in container: mysql-database
Activating component: GeolocationService in container: tomcat-webapplication
Activating component: RoomService in container: tomcat-webapplication
Activating component: StaffService in container: tomcat-webapplication
Activating component: StaffTracker in container: tomcat-webapplication
Activating component: ZipcodeService in container: tomcat-webapplication

Besides displaying the available mutable components and deploying them, we can also query which ones have been deployed already:

$ dysnomia-containers --query-activated-components
mysql-database/rooms
mysql-database/staff
mysql-database/zipcodes
tomcat-webapplication/GeolocationService
tomcat-webapplication/RoomService
tomcat-webapplication/StaffService
tomcat-webapplication/StaffTracker
tomcat-webapplication/ZipcodeService

The dysnomia-containers tool uses the sets of available and activated components to make an upgrade more efficient -- when deploying a new system configuration, it deactivates the activated components that are no longer available, and activates the available components that have not been activated yet. The components that are in both the old and the new configuration remain untouched.

For example, if we run dysnomia-containers --deploy again, nothing will be deployed or undeployed, as the configuration has remained identical.
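
Conceptually, the comparison that the --deploy operation performs could be sketched in shell as follows (this is merely an illustration of the idea, not the actual implementation):

#!/bin/bash
# Sketch: determine which mutable components to (de)activate on an upgrade

available=$(dysnomia-containers --query-available-components)
activated=$(dysnomia-containers --query-activated-components)

# Deactivate components that are activated, but no longer available
for component in $activated
do
    echo "$available" | grep -qx "$component" || echo "Would deactivate: $component"
done

# Activate components that are available, but not activated yet
for component in $available
do
    echo "$activated" | grep -qx "$component" || echo "Would activate: $component"
done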

We can also take snapshots of all activated mutable components (for example, for backup purposes):

$ dysnomia-containers --snapshot

After running the above command, the Dysnomia snapshot utility may show you the following output:

$ dysnomia-snapshots --query-all
mysql-database/rooms/faede34f3bf658884020a31ca98f16503da9a90bf3313cc96adc5c2358c0b054
mysql-database/staff/e9af7042064c33379ba9fe9272f61986b5a85de63c57732f067695e499a3a18f
mysql-database/zipcodes/637faa3e79ec6c2db71ac4023e86f29890e54233ea6592680fd88481725d44a3

As may be noticed, for each MySQL database (we have three of them) we have taken a snapshot. (For the Apache Tomcat web applications, no snapshots have been taken because state management for these kinds of components is unsupported).

We can also restore the state from the snapshots that we just have taken:

$ dysnomia-containers --restore

The above command restores the state of all three databases.

Finally, as with services deployed by Disnix, deactivating a mutable component does not imply that its state is removed automatically. Instead, it is marked as garbage and must be explicitly removed by running:

$ dysnomia-containers --collect-garbage

NixOS integration


To actually make the previously shown deployment activities work, we need configuration files for all the containers and mutable components, and we must put them into locations that are reachable through the DYSNOMIA_CONTAINERS_PATH and DYSNOMIA_COMPONENTS_PATH environment variables.

Obviously, they can be written by hand (as demonstrated in my previous blog post about Dysnomia), but this is not always very practical to do on a system level. Moreover, there is some repetition involved, as a NixOS configuration and the container configuration files capture common properties.

I have developed a Dysnomia NixOS module to automate Dysnomia's configuration through NixOS. It can be enabled by adding the following property to a NixOS configuration file:

dysnomia.enable = true;

We can specify container properties in a NixOS configuration file as follows:

dysnomia.containers = {
  mysql-database = {
    mysqlUsername = "root";
    mysqlPassword = "secret";
    mysqlPort = 3306;
  };
  tomcat-webapplication = {
    tomcatPort = 8080;
  };
  ...
};

The Dysnomia module generates a container configuration file for each attribute in the dysnomia.containers set, using the attribute name as the file name and translating the corresponding sub attribute set into a text file with key=value pairs.
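
For example, the mysql-database attribute set shown above would roughly be translated into a container configuration file like this (an illustration; the exact set of generated properties may differ):

$ cat /etc/dysnomia/containers/mysql-database
mysqlUsername=root
mysqlPassword=secret
mysqlPort=3306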

Most of the dysnomia.containers properties can be automatically generated by the Dysnomia NixOS module as well, since most of them have already been specified elsewhere in a NixOS configuration. For example, by enabling MySQL in a Dysnomia-enabled NixOS configuration:

services.mysql = {
  enable = true;
  package = pkgs.mysql;
  rootPassword = ../configurations/mysqlpw;
};

The Dysnomia module automatically generates the corresponding container properties as shown previously. The Dysnomia NixOS module integrates with all NixOS features for which Dysnomia provides a plugin.

In addition to containers, we can also specify the available mutable components as part of a NixOS configuration:

dysnomia.components = {
  mysql-database = {
    rooms = pkgs.writeTextFile {
      name = "rooms";
      text = ''
        create table room
        ( Room VARCHAR(10) NOT NULL,
          Zipcode VARCHAR(6) NOT NULL,
          PRIMARY KEY(Room)
        );
      '';
    };
    staff = ...
    zipcodes = ...
  };

  tomcat-webapplication = {
    ...
  };
};

As can be observed in the above example, the dysnomia.components attribute set captures the available mutable components per container. For the mysql-database container, we have defined three databases: rooms, staff and zipcodes. Each attribute refers to a Nix build function that produces an SQL file representing the initial state of the database on first activation (typically a schema).

Besides MySQL databases, we can use the tomcat-webapplication attribute to automatically deploy Java web applications to the Apache Tomcat servlet container. The corresponding value of each mutable component refers to the result of a Nix build function that produces a Java web application archive (WAR file).

The Dysnomia module automatically composes a directory with symlinks referring to the generated mutable component configurations reachable through the DYSNOMIA_COMPONENTS_PATH environment variable.
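
For example, listing one of the container sub directories of that generated directory could yield something like the following (an illustration with simplified output and truncated store paths):

$ ls -l $DYSNOMIA_COMPONENTS_PATH/mysql-database
rooms -> /nix/store/...-rooms
staff -> /nix/store/...-staff
zipcodes -> /nix/store/...-zipcodes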

Distributed infrastructure state management


In addition to deploying mutable components belonging to a single NixOS configuration, I have mapped the NixOS-level Dysnomia deployment concepts to networks of NixOS machines by extending the DisnixOS toolset (the Disnix extension integrating Disnix' service deployment concepts with NixOS' infrastructure deployment).

It may not have been stated explicitly in any of my previous blog posts, but DisnixOS can also be used to deploy a network of NixOS configurations to target machines in a network. For example, we can compose a networked NixOS configuration that includes the machine configuration shown previously:

{
  test1 = import ./configurations/mysql-tomcat.nix;
  test2 = import ./configurations/empty.nix;
}

The above configuration file is an attribute set defining two machine configurations. The first attribute (test1) refers to our previous NixOS configuration running MySQL and Apache Tomcat as system services.

We can deploy the networked configuration with the following command-line instruction:

$ disnixos-deploy-network network.nix

As a sidenote: although DisnixOS can deploy networks of NixOS configurations, NixOps does a better job in accomplishing this. Moreover, DisnixOS only supports deployment of NixOS configurations to bare-metal servers and cannot instantiate any VMs in the cloud.

Furthermore, what DisnixOS does differently compared to NixOps is that it invokes Dysnomia to activate or deactivate NixOS configurations -- the corresponding NixOS plugin executes the big monolithic NixOS activation script for the activation step and runs nixos-rebuild --rollback switch for the deactivation step.
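
Conceptually, the activation and deactivation behaviour of this plugin boils down to something like the following sketch (not the actual implementation; it assumes that the second parameter refers to the NixOS system closure that must be activated):

#!/bin/bash -e
# Sketch of the nixos-configuration plugin's activate/deactivate behaviour

case "$1" in
    activate)
        # Run the (monolithic) activation script of the given NixOS system closure
        "$2"/bin/switch-to-configuration switch
        ;;
    deactivate)
        # Roll back to the previous NixOS generation
        nixos-rebuild --rollback switch
        ;;
esac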

I have extended Dysnomia's nixos-configuration plugin with state management operations. Snapshotting the state of a NixOS configuration simply means running:

$ dysnomia-containers --snapshot

Likewise, restoring the state of a NixOS configuration can be done with:

$ dysnomia-containers --restore

And removing obsolete state with:

$ dysnomia-containers --collect-garbage

When using Disnix to manage state, we may have mutable components deployed as part of a system configuration and mutable components deployed as services in the same environment. To prevent the snapshots of the services from conflicting with the ones belonging to a machine's system configuration, we set the DYSNOMIA_STATEDIR environment variable to /var/state/dysnomia-nixos for system-level state management and to /var/state/dysnomia for service-level state management to keep them apart.
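
For example, a system-level snapshot of a machine's mutable components can be taken by overriding the state directory accordingly:

$ DYSNOMIA_STATEDIR=/var/state/dysnomia-nixos dysnomia-containers --snapshot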

With these additional operations, we can capture the state of all mutable components that are part of the system configurations in a network:

$ disnixos-snapshot-network network.nix

This yields a snapshot of the test1 machine stored in the Dysnomia snapshot store on the coordinator machine:

$ dysnomia-snapshots --query-latest
nixos-configuration/nixos-system-test1-16.03pre-git/4c4751f10648dfbbf8e25c924391e80913c8a6a600f7b481d73cd88ff3d32730

When inspecting the contents of the NixOS system configuration snapshot, we will observe:

$ cd /var/state/dysnomia/snapshots/$(dysnomia-snapshots --query-latest)
$ find -maxdepth 3 -mindepth 3 -type d
./mysql-database/rooms/faede34f3bf658884020a31ca98f16503da9a90bf3313cc96adc5c2358c0b054
./mysql-database/staff/e9af7042064c33379ba9fe9272f61986b5a85de63c57732f067695e499a3a18f
./mysql-database/zipcodes/637faa3e79ec6c2db71ac4023e86f29890e54233ea6592680fd88481725d44a3

The NixOS system configuration snapshot consists of the snapshots of all mutable components belonging to that system configuration.

Similar to restoring the state of individual mutable components, we can restore the state of all mutable components that are part of a system configuration in a network of machines:

$ disnixos-restore-network network.nix

And remove their obsolete state, by running:

$ disnixos-delete-network-state network.nix

TL;DR: Discussion


In this blog post, I have described an extension to Dysnomia that makes it possible to manage the state of mutable components belonging to a system configuration, and a NixOS module making it possible to automatically configure Dysnomia from a NixOS configuration file.

This new extension makes it possible to deploy mutable components belonging to systems that cannot be divided into distributable deployment units (or services in a Disnix-context), such as monolithic system configurations.

To summarize: if it is desired to manage the state of mutable components in a NixOS configuration, you need to provide a number of additional configuration settings. First, we must enable Dysnomia:

dysnomia.enable = true;

Then enable a number of container services, such as MySQL:

services.mysql.enable = true;

(As explained earlier, the Dysnomia module will automatically generate its corresponding container properties).

Finally, we can specify a number of available mutable components that can be deployed automatically, such as a MySQL database:

dysnomia.components = {
  mysql-database = {
    rooms = pkgs.writeTextFile {
      name = "rooms";
      text = ''
        create table room
        ( Room VARCHAR(10) NOT NULL,
          Zipcode VARCHAR(6) NOT NULL,
          PRIMARY KEY(Room)
        );
      '';
    };
  };
};

After deploying a Dysnomia-enabled NixOS system configuration through:

$ nixos-rebuild switch

We can deploy the mutable components belonging to it, by running:

$ dysnomia-containers --deploy

Unfortunately, managing mutable components on a system level also has a huge drawback, in particular in distributed environments. Snapshots of entire system configurations are typically too coarse -- whenever the state of any of the mutable components changes, a new system-level composite snapshot is generated that is composed of the snapshots of all mutable components.

Typically, these snapshots contain redundant data that is not shared among snapshot generations (although there are potential solutions to cope with this, I have not implemented any optimizations yet). As explained in my previous Dysnomia-related blog posts, snapshotting individual components (such as large databases) can already be quite expensive, and these costs may become significantly larger on a system level.

Likewise, restoring state on a system level implies that the state of all mutable components will be restored. This is also typically undesired, as it may be too destructive and time-consuming. Moreover, moving the state from one machine to another when a mutable component gets migrated is also much more expensive.

For more control and more efficient deployment of mutable components, it would typically be better to develop a Disnix services model so that they can be managed individually.

Because of these drawbacks, I am not prominently advertising DisnixOS' distributed state management features. Moreover, I also did not attempt to integrate these features into NixOps, for the same reasons.

References


The dysnomia-containers tool as well as the distributed infrastructure management facilities have been integrated into the development versions of Dysnomia and DisnixOS, and will become part of the next Disnix release.

I have also added a sub example to the Java version of the Disnix staff tracker example to demonstrate how these features can be used.

As a final note, the Dysnomia NixOS module has not yet been integrated into NixOS. Instead, the module must be imported from a Dysnomia Git clone, by adding the following line to a NixOS configuration file:


imports = [ /home/sander/dysnomia/dysnomia-module.nix ];
