Building CUDA images on github runners with nix

Seán Murphy
Feb 7, 2024

In a previous post, I described how I set up github runners that could build standard docker images via github actions, focusing on a nix-based solution for both the github runner and the container build. In this post, I describe how I extended that setup to support building container images that require CUDA.

The content relating to this post is in this git repo.

Enabling the kernel drivers in NixOS

For this to work, it was necessary to enable the nvidia kernel drivers in NixOS. Note that this does not require enabling CUDA support (the NixOS installation itself does not need the CUDA libraries), but it does require enabling unfree software in the configuration via the allowUnfree setting. In my case, I made the following modifications to my configuration.nix:

  # Enable OpenGL
  hardware.opengl = {
    enable = true;
    driSupport = true;
    driSupport32Bit = true;
  };

  # Load nvidia driver for Xorg and Wayland
  services.xserver.videoDrivers = [ "nvidia" ];

  hardware.nvidia = {

    # Modesetting is required.
    modesetting.enable = true;

    # Nvidia power management. Experimental, and can cause sleep/suspend to fail.
    powerManagement.enable = false;

    # Fine-grained power management. Turns off GPU when not in use.
    # Experimental and only works on modern Nvidia GPUs (Turing or newer).
    powerManagement.finegrained = false;

    # Use the NVidia open source kernel module (not to be confused with the
    # independent third-party "nouveau" open source driver).
    # Support is limited to the Turing and later architectures. Full list of
    # supported GPUs is at:
    # https://github.com/NVIDIA/open-gpu-kernel-modules#compatible-gpus
    # Only available from driver 515.43.04+
    # Currently alpha-quality/buggy, so false is currently the recommended setting.
    open = false;

    # Enable the Nvidia settings menu,
    # accessible via `nvidia-settings`.
    nvidiaSettings = true;

    # Optionally, you may need to select the appropriate driver version for your specific GPU.
    package = config.boot.kernelPackages.nvidiaPackages.stable;
  };
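
The allowUnfree setting mentioned above does not appear in that excerpt; a minimal way to set it system-wide in configuration.nix is shown below. This reflects the standard NixOS option rather than the exact line from my configuration.

  # Allow unfree packages such as the nvidia driver
  nixpkgs.config.allowUnfree = true;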

After a system reboot, the necessary nvidia kernel modules were loaded; the kernel module version in this case was 545.29.02.

In the above configuration, I installed the latest stable nvidia driver; it is possible to pin the driver to a specific version in the configuration as necessary.
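
For illustration, pinning could look roughly like the sketch below, which uses the mkDriver helper from nvidiaPackages. The version is the one reported above and the hashes are placeholders that would need to be replaced with real values; I have not tested this exact snippet, and it assumes lib is available as a module argument.

  # Sketch only: pin hardware.nvidia.package to a specific driver release.
  # The hashes are placeholders (lib.fakeHash); nix reports the real hashes
  # to substitute in when the build first fails.
  hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.mkDriver {
    version = "545.29.02";
    sha256_64bit = lib.fakeHash;
    openSha256 = lib.fakeHash;
    settingsSha256 = lib.fakeHash;
    persistencedSha256 = lib.fakeHash;
  };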

Running nvidia-smi confirmed that the GPU was visible and accessible.

More information on using nvidia in general, and CUDA in particular, with NixOS is available here.

Adding labels to the github runner

I wanted to add labels to the github runner indicating that it can build CUDA containers. For an existing runner, labels can be added via the github web interface, but programmatically they can only be set at runner registration time; it is not possible to add labels to an existing runner, restart it, and have the new labels picked up. In my case, this meant removing the old runner, creating a new token, and registering a new runner.

In NixOS, adding labels was as simple as specifying the extraLabels attribute:

  services.github-runners = {
    testrunner = {
      enable = true;
      name = "test-runner";
      tokenFile = <FILE-CONTAINING-TOKEN>;
      url = <REPO-URL>;
      extraLabels = [ "nixos" "cuda-12.3" "nvidia-545" ];
    };
  };

The labels were immediately visible in github and a job sent to this runner could now make use of the GPU and ultimately build CUDA container images.

Modifications to the container build

It took some time to figure out how to build a CUDA-compatible container image. In the previous post, I used the pyproject.nix library; I also used it here, but it was necessary to modify the nixpkgs configuration in the flake so that CUDA support was enabled.

That was done as follows:

      # to add cuda stuff
      pkgs = import nixpkgs {
        inherit system;
        config = {
          allowUnfree = true;
          cudaSupport = true;
          cudaCapabilities = [ "7.5" "8.6" ];
          cudaForwardCompat = false;
        };
      };

It was possible to test this locally using nix build .#ociPackageImage, which built a container image and compiled the necessary CUDA libraries and dependencies; on my system this took approximately an hour the first time I did it.
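
To give a sense of how these pieces fit together, a heavily simplified flake along the following lines exposes such an output. This is an illustrative sketch rather than the actual flake in the repo; the image name is hypothetical and the python environment built with pyproject.nix, which would normally be added to the image contents, is left out.

  {
    # Illustrative sketch only, not the exact flake from the repo.
    inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-23.11";

    outputs = { self, nixpkgs }:
      let
        system = "x86_64-linux";
        # nixpkgs imported with CUDA support enabled, as shown above
        pkgs = import nixpkgs {
          inherit system;
          config = {
            allowUnfree = true;
            cudaSupport = true;
            cudaCapabilities = [ "7.5" "8.6" ];
            cudaForwardCompat = false;
          };
        };
      in {
        # the python environment produced by pyproject.nix would normally be
        # added to `contents`; it is omitted from this sketch
        packages.${system}.ociPackageImage = pkgs.dockerTools.buildLayeredImage {
          name = "oci-package-image";   # hypothetical image name
          contents = [ pkgs.bash ];
          config.Cmd = [ "${pkgs.bash}/bin/bash" ];
        };
      };
  }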

This resulted in a container image which could be added to docker using docker load < result as before. However, this container image would not run. After some investigation, I found two specific issues with it:

  • it did not contain a /tmp folder; this folder is needed and assumed by the nvidia-docker runtime as part of the mechanism that makes the necessary host libraries available inside the container. The folder was created using the fakeRootCommands parameter;
  • it did not contain an LD_LIBRARY_PATH setting indicating where these libraries could be found by the python interpreter.

These two issues were fixed as follows:

      buildPackageImage = pkgs.dockerTools.buildLayeredImage {
        config = {
          # just run bash when the container starts; note that bash is only
          # installed if the package is specified above
          Cmd = [
            "${pkgs.bash}/bin/bash"
          ];
          # this is probably too static but it works
          Env = [ "LD_LIBRARY_PATH=/usr/lib64" ];
        };

        # when building a container image this way, no /tmp is created; this is
        # required to make libs available from the host system to the container
        # note that the leading slash MUST NOT be there; otherwise this does not work
        fakeRootCommands = ''
          #!${pkgs.runtimeShell}
          mkdir -p tmp
          chmod -R 1777 tmp
        '';
      };

With the above modifications, it was possible to build the container image, run it in docker, start a python interpreter and confirm that CUDA capabilities were available.

Comments on running it in nvidia-docker

To test the container, I just enabled nvidia support in docker on the same system I was using for the build process. The issues above, particularly the missing /tmp folder, required troubleshooting the nvidia-docker configuration. When doing this, I found that the nvidia-docker package within nixpkgs currently has no maintainer and is quite old. It is now getting some attention and evolving, in particular with the addition of the Container Device Interface (CDI) to support a more standardized way of making devices available within running containers. Modifying the nvidia-docker configuration to generate debug output required using the mkNvidiaContainerPackage function; this has recently been removed from the current version of nixpkgs, so I don't include that approach here.

Although I did experiment with newer nvidia-docker versions, particularly those using the newer nvidia-container-toolkit, my main issue was the absence of a /tmp folder; once this was resolved, the nvidia-docker version in NixOS 23.11 worked fine.
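
For completeness, enabling nvidia support in docker on NixOS 23.11 is itself a small configuration change; a minimal sketch, assuming the classic nvidia-docker route rather than the newer CDI-based toolkit, looks like this:

  # NixOS 23.11 sketch: enable docker with the nvidia runtime so that
  # containers started through nvidia-docker can access the GPU
  virtualisation.docker.enable = true;
  virtualisation.docker.enableNvidia = true;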

Final thoughts

Building container images using nix and running them within a nix-managed nvidia-docker environment was more complex than I had envisaged. The solution described above can probably work well when the CUDA and nvidia driver versions are pinned, which is well supported with NixOS, but when they differ care must be taken (although nvidia does offer some compatibility guarantees).

The resulting nix container images can definitely be improved; they currently have a large number of layers without any particular structure or ordering. The bigger question of whether it makes sense to use nix-based container images in this context remains open.
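
One lever that may help with the layering, offered here only as a suggestion I have not pursued, is the maxLayers parameter of dockerTools.buildLayeredImage, which caps the number of layers and collapses the remaining store paths into the final layer:

      buildPackageImage = pkgs.dockerTools.buildLayeredImage {
        name = "oci-package-image";   # hypothetical name, as in the sketches above
        maxLayers = 32;               # arbitrary cap, chosen for illustration
        config.Cmd = [ "${pkgs.bash}/bin/bash" ];
      };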

References

https://sebastian-staffa.eu/posts/nvidia-docker-with-nix/
