Embedded Linux Completely from Scratch

Goal: research and build an embedded ARM64 Linux OS from scratch, from canonical Git sources, and network-boot it onto bare-metal compute modules.

Each node of my cluster computer has 2GiB of memory. RAM is at a premium. Each node is nameless and stateless like an AWS Lambda (or Rancher cattle, or Saltstack Minion), so the entire OS must reside in scarce memory along with the user space for jobs like map-reduce operations.

Having explored minimal Debian, Alpine Linux, and even RancherOS [1], the most rewarding conclusion is to put in the time to learn to compile Linux from scratch for ARM64 (also called ARMv8 and AArch64).

Heads up: I’ve failed at this several times before finally succeeding. Wait until the end to see how I ultimately make this work.

Here are the talking points:

  1. Kernel, Initramfs, DTBs
  2. Deep-Learning Settings
  3. U-Boot and iPXE Bootloaders
  4. User Processes and Daemons
  5. Network Bootloading on Real Hardware
  6. Gotchas and More Tips
  7. Conclusion

Creating an Embedded Linux Distro

Here is a checklist of items needed to build and network-boot a Linux distro into RAM:

  • Bootloader stage 1 (U-Boot in SPI flash)
  • Bootloader stage 2 (iPXE for HTTP transport)
  • Linux kernel and drivers
  • Device hardware tree (DTB)
  • Initramfs – RAM filesystem
  • User-space daemons
  • PXE server

The plan is first to compile the Linux kernel and initramfs and run them directly in a virtual environment (QEMU). Next, compile U-Boot and iPXE bootloaders. Then, we’ll see if I can bootload the distro in QEMU. I’ll then add a daemon that makes a light blink to show boot was successful. Penultimately, I’ll set up Dnsmasq and a TFTP server to boot some real hardware. Finally, I’ll try to optimize the kernel image size.


Part 1 – Kernel, Initramfs, DTBs

The first milestone is to compile the kernel, the RAM filesystem, and device trees (DTBs) from Git source.

Cross-Compiled Linux From Scratch (CLFS)

Has anyone done this before? Using the work of CLFS and Cortex A53 hackers, I first attempt to reproduce their written steps in a Dockerfile.

Most of this work is going to take place in a Dockerfile because I’d prefer not to wget large archives each time the compilation fails while learning. Docker can cache previous steps, including previous downloads and build stages. The first six chapters of the CLFS guide can be scripted (with my updates added, like using arm64 instead of 32-bit arm, and using armv8-a instead of… nothing), so here is my initial Dockerfile (remember, please do not use this) following the old CLFS guide up to the interactive Linux kernel configuration:

Failure: Following the steps in CLFS did not pan out. The following is kept only for reference, as walking through the manual steps was useful for understanding.
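
For reference, here is a heavily trimmed sketch of the shape of that Dockerfile (the package versions, URLs, and paths are illustrative assumptions, not the working recipe):

    # Illustrative sketch only – not the working recipe (versions, URLs, and paths are assumptions)
    FROM debian:bullseye-slim

    # Host packages needed to build binutils, GCC, and musl cross tools
    RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential bison flex texinfo gawk bc rsync cpio file \
        wget ca-certificates python3 libssl-dev && rm -rf /var/lib/apt/lists/*

    # CLFS-style environment variables, retargeted at 64-bit Arm
    ENV CLFS=/mnt/clfs \
        CLFS_TARGET=aarch64-linux-musl \
        CLFS_ARCH=arm64 \
        CLFS_ARM_ARCH=armv8-a

    WORKDIR ${CLFS}/sources
    # One archive per layer so a failed later step reuses the cached downloads
    RUN wget -q https://ftp.gnu.org/gnu/binutils/binutils-2.36.tar.xz
    RUN wget -q https://ftp.gnu.org/gnu/gcc/gcc-10.3.0/gcc-10.3.0.tar.xz
    RUN wget -q https://musl.libc.org/releases/musl-1.2.2.tar.gz
    # ... binutils pass 1, GCC pass 1, kernel headers, musl, GCC pass 2, and so on ...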

At first blush, this appeared to be working – it got far enough that the next step was configuring the Linux kernel.

Linux Kernel Configuration

When the Linux kernel configuration menu appeared for the first time after a day of writing the above Dockerfile, I had no idea what I was doing. Do I need ARM Accelerated Cryptographic Algorithms? What is “Accelerated scalar and SIMD Poly1305 hash implementations”? What is the Kernel Hacking sub-menu? Should I debug “Oops, Lockups, and Hangs”? Do I need the new Google Firmware Drivers? What is all this?

Many Linux kernel configuration options

Fortunately, there is a body of work from some Cortex A53 (Pine64) kernel hackers [2][3], though abandoned since 2018. What they left for me to reverse engineer are two kernel configurations for the same hardware I’m targeting: one with nominal settings and one with minimal settings. One line item I noticed is CONFIG_KERNEL_MODE_NEON=y. What is this? According to Arm [4],

Neon technology is a packed SIMD architecture. Neon registers are considered as vectors of elements… Neon can also accelerate signal processing algorithms and functions to speed up applications such as audio and video processing, voice and facial recognition, computer vision, and deep learning.

That sounds wonderful – let’s keep that, right? I admit I have no idea what to include, so I’ll include all I can for now and whittle down features as I learn about them.

Failure: It turns out the CLFS guide is not geared toward AArch64. The kernel compilation breaks in all kinds of entertaining ways. For example, there are problems with __int128_t not being defined (it is available on x64 toolchains), which means I need to rework the toolchain on my own.

Cryptography and Deep-Learning Settings

Let’s take a moment to see what is under the hood in the Cortex A53 CPU since CPU instruction sets came up.

The CLFS guide suggests using CLFS_ARM_ARCH=armv7. I know the 64-bit Arm chips are at least armv8. Digging around the net, I came to an Arm article that piqued my curiosity – it made reference to minor versions of the Armv8 architecture (armv8.1-a, armv8.2-a, etc.) that add vector support and SIMD instructions for deep learning. My Cortex A53 chips are ubiquitous, but they only implement the armv8.0-a instruction set. Adding to that, we can select extensions to the instruction set:

  • +crc – The Cyclic Redundancy Check (CRC) instructions.
  • +simd – The ARMv8-A Advanced SIMD and floating-point instructions.
  • +crypto – The cryptographic instructions.
  • +sb – Speculation Barrier instruction [5].
  • +predres – Execution and Data Prediction Restriction Instructions.
  • +nofp – Disable the floating-point, Advanced SIMD, and cryptographic instructions.

How do we want to tune the compiled code? Looking at the above, if my tuned code is aimed at processing whole-number time-series data (and it is), then I want floating-point off to speed up the whole-number arithmetic. However, if performing ML (and I will), then hardware floats are critical. Decisions, decisions. For now, I will leave hardware floats on and tweak these in the future. Regardless, this is valuable information about tuning instruction sets for specific work.
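
To make that concrete, here is a hedged sketch of what those choices look like as cross-compiler flags for a Cortex A53 (the file names are placeholders):

    # Tuned for the Cortex A53 pipeline with CRC, crypto, and NEON/FP enabled
    aarch64-linux-musl-gcc -march=armv8-a+crc+crypto -mtune=cortex-a53 -O2 -c app.c -o app.o

    # The opposite extreme: integer-only code with FP/SIMD disabled
    aarch64-linux-musl-gcc -march=armv8-a+nofp -mtune=cortex-a53 -O2 -c app_int.c -o app_int.o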

AArch64 Musl Cross-Compile Toolchain

Trying to keep this a manageable project, I forked and updated an existing toolchain project: https://github.com/ericdraken/musl-cross-make. This is the latest stable cross-compile build tool using musl for ARM64, with SHA1 download verification. With this, I’ll continue on to build the kernel again.

These tools compiled beautifully, so let’s update the initial Dockerfile to use the new toolchain. That should work, right?

Failure: After reworking the previous Dockerfile to use the toolchain above, I compiled the latest kernel and the rootfs successfully (no errors), but QEMU hangs, seemingly mocking me. I can see myself tinkering for weeks, so I’ll try another route next.

Enter Buildroot – A Cross-Compilation Embedded-Linux Build Tool

After manually going through all the CLFS steps to compile a Linux distro, it turns out there is a project by the team at the Buildroot Association that does exactly what I’m after: compile a kernel and rootfs from the latest Linux source, with musl and GCC 10+, all on AArch64 for the Cortex A53, and with the device trees for my SoC boards. Just beautiful. Here, I’ll offer my automated and optimized scripts for building complete embedded Linux “distros” for my cluster computer.

Buildroot has a handy Xconfig GUI to visually edit the configuration settings, but most of the time I edit them by hand in the shell for faster diffing between tweaks.

Buildroot GUI for editing textual config files (Source: buildroot.org)

Increase Build Speed

What works great is to use a Docker image with host build tools, and when I run it for a build, I mount a local folder to /downloads, and a wipeable Docker volume to /ccache. This way compilation is greatly sped up with caching, and downloads are already present. This saves downloading hundreds of megabytes per build.
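
As a rough sketch (the image name and paths are mine, purely illustrative), the build container is launched something like this, with the Buildroot .config pointing BR2_DL_DIR at /downloads and BR2_CCACHE_DIR at /ccache:

    # Persistent downloads on the host, ccache in a wipeable named volume
    docker run --rm -it \
      -v "$PWD/downloads:/downloads" \
      -v buildroot-ccache:/ccache \
      -v "$PWD/work:/work" \
      buildroot-builder:latest \
      make -C /work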

Compiling the toolchain takes about 70% of the build time. Let’s cache that too. Buildroot suggests compiling a toolchain SDK, saving it to a tarball, and reusing it as an external toolchain.

Additionally, I like to blow away my entire work folder on each run. Why? Believe me, leftovers from previous builds are additive, not selective. I like to maintain individual config files and a pristine Git folder of Buildroot, and after cleaning, I copy those back into a fresh work folder. Clean.

Tip: Use a persistent download folder and a GCC ccache folder (or Docker volumes) to avoid downloading over 800MB of packages each build and to speed up GCC when recompiling gigabytes of object files. A full rebuild takes about 8 minutes.

External Toolchain

This is the most important section. I’ve experimented with several toolchains like Linaro, ARM’s, glibc, musl, and others. I prefer musl because builds compile smaller, and it feels like I’m working with Alpine Docker images.

Building my own toolchain, I can use the latest kernel headers, GCC, and musl version. Plus I can tune the builds for performance and size closely. Nice.

Tip: Compile the toolchain from scratch, save the whole shebang (not #!) as a tarball, and set Buildroot to use it as an external toolchain. This way, the toolchain doesn’t need to be recompiled after each clean. You can also swap external toolchains in and out to see if there are any time/space savings on kernel builds.
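
A sketch of that workflow, assuming a reasonably recent Buildroot that provides the sdk target (exact symbol names can vary slightly between releases):

    # 1) Build and package the toolchain once
    make sdk    # produces an *_sdk-buildroot.tar.gz under output/images/

    # 2) Later builds point at the unpacked tarball as an external toolchain:
    #    BR2_TOOLCHAIN_EXTERNAL=y
    #    BR2_TOOLCHAIN_EXTERNAL_CUSTOM=y
    #    BR2_TOOLCHAIN_EXTERNAL_PATH="/toolchains/aarch64-buildroot-linux-musl_sdk-buildroot"
    #    BR2_TOOLCHAIN_EXTERNAL_CUSTOM_MUSL=y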

Linux Kernel Settings

We can compile the latest kernel from source. However, configuring the kernel is amazingly complex – there are far too many options and tri-state drivers to choose from for all kinds of hardware. What I’ve learned is to include all possible USB and USB HID modules because soon I will control a Blinkstick from this forthcoming tiny distro. One way to go is to select all options and drivers and bloat the kernel image like a blowfish, then whittle away drivers once all hardware works outright.

Tip: Enable all the USB modules and drivers (XHCI, OHCI, OTG, etc.) and hotplug support in the kernel first, just to ensure USB devices work, then incrementally remove maybe-unneeded drivers and bisect the builds to arrive at the minimal set required.
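
As a starting point, a kernel config fragment along these lines (standard upstream symbols) gets USB devices enumerating before any trimming begins:

    # USB host controllers and HID, kitchen-sink style
    CONFIG_USB_SUPPORT=y
    CONFIG_USB=y
    CONFIG_USB_XHCI_HCD=y
    CONFIG_USB_EHCI_HCD=y
    CONFIG_USB_OHCI_HCD=y
    CONFIG_USB_OTG=y
    CONFIG_USB_HID=y
    CONFIG_HID_GENERIC=y
    CONFIG_USB_ANNOUNCE_NEW_DEVICES=y   # log every device as it enumerates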

Here is a sample of those complex build options again:

Many Linux kernel configuration options

BusyBox

Space is at a premium, but only a subset of shell utilities is required in embedded systems. BusyBox handles most of what an embedded system needs. Some features are missing, like sort -h (human-readable number sort), but the space savings greatly make up for these inconveniences: around 720KiB for one binary that does it all.

BusyBox must also be compiled from source and can be heavily configured. Again, include everything so that userspace apps work, then remove features, build, and bisect until an optimal BusyBox binary is reached.

Device Trees (DTBs)

This is an interesting one: the Linux kernel can compile the DTBs, or not, or the U-Boot bootloader can compile them. The DTBs (FIT) can be baked into the kernel, specified externally with “bootargs”, or referenced from memory when U-Boot loads the DTB independently. I spent a good deal of time experimenting here because, with the wrong settings, the kernel doesn’t display anything on the console. Fun.

Tip: Let U-Boot compile the DTB(s) and include the hardware-specific one in the SPI U-Boot flash image. If a DTB overlay is needed, or say, GPIOs need custom addressing, then a new DTB can be compiled and given to the kernel with dtb=... through the bootloader.

QEMU Kernel Emulation

This was a beast. I put QEMU into a Docker image to protect my host from bloat, then figured out how to pipe USB devices into the container safely, using cgroups instead of the dangerous --privileged flag. Templating this script to be included with each kernel build, here is a sample QEMU script to illustrate some useful flags.
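
Since the full launcher is long, here is a trimmed, illustrative sketch of the flags that matter (paths and memory size are assumptions):

    # Emulate a Cortex A53 machine and boot the kernel plus initramfs directly
    qemu-system-aarch64 \
      -machine virt -cpu cortex-a53 -smp 4 -m 2048 \
      -kernel output/images/Image \
      -initrd output/images/rootfs.cpio.gz \
      -append "console=ttyAMA0 rdinit=/sbin/init" \
      -nographic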

Success. The kernel and RAM filesystem boot up to an interactive shell.

With custom and maintainable build scripts, the result is stripped config files (defconfigs), the device-tree binaries (DTBs), the initramfs and Image files, and even a handy Graphviz graph of the dependencies.

Here is a sample of a Buildroot dependency graph.

Buildroot dependency graph sample

Milestone: We now have the ability to compile the Linux kernel, hardware device tree, and a RAM-based filesystem.

Part 2 – U-Boot and iPXE Bootloaders

The next milestone is to compile mainline U-Boot and iPXE, the latter allowing booting over HTTP.

U-Boot – Bare-Metal Bootloader

Credit where credit is due, I reverse-engineered this project to figure out how to flash SPI memory with U-Boot. With this information, I turned to mainline U-Boot to make some custom build scripts. Now I can keep up with mainline releases. Here are my U-Boot settings for my Pine64 modules with detailed logging enabled:
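
The settings themselves live in a defconfig; the build then boils down to something like this sketch (the defconfig name and the ARM Trusted Firmware path are assumptions for a SOPINE/A64-class board, and the verbose-logging options sit in the defconfig):

    # Cross-compile mainline U-Boot for an A64/SOPINE-class board (illustrative)
    export CROSS_COMPILE=aarch64-linux-musl-
    export BL31=/path/to/arm-trusted-firmware/build/sun50i_a64/release/bl31.bin

    make sopine_baseboard_defconfig
    make -j"$(nproc)"    # produces u-boot-sunxi-with-spl.bin among other artifacts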

The embedded commands to flash the SPI with U-Boot are quite small (named flash-spi.cmd):
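
In spirit, the script does little more than this (offsets and variable names are illustrative):

    # flash-spi.cmd (sketch): write a U-Boot image already loaded into RAM out to SPI NOR
    sf probe                                  # detect the SPI flash chip
    sf erase 0 +${filesize}                   # erase enough sectors for the image
    sf write ${kernel_addr_r} 0 ${filesize}   # copy the image from RAM to offset 0
    echo SPI flash written - power-cycle the board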

Then, the most important addition to the mainline U-Boot Makefile is some appended rules to build the SPI flasher:
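
Roughly, the appended rule just wraps flash-spi.cmd into a U-Boot script image with mkimage (target and file names here are illustrative; recipe lines begin with a tab):

    # Appended to the mainline U-Boot Makefile (sketch)
    .PHONY: spi-flasher
    spi-flasher:
    	# Wrap the flashing commands into a U-Boot script image
    	tools/mkimage -A arm64 -T script -C none \
    		-n "SPI flasher" -d flash-spi.cmd flash-spi.scr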

With this technique of appending Makefile targets, we can continue to use mainline and simply append my SPI-flasher targets on new releases, then build with make all && make spi-flasher.

iPXE – Secondary HTTP Bootloader

iPXE (an open-source implementation of the Preboot eXecution Environment, PXE) is overkill, but as a secondary bootloader that can load the kernel and initramfs over HTTP(S) with a lot of flexibility, it’s wonderful. If we need to drop into the iPXE shell, there are several handy networking and debugging tools available as well. Plus, we can chainload over HTTP, meaning iPXE can request boot scripts that make further HTTP network-boot requests. The responses can be dynamic since the requests hit an HTTP server. Imagination, activate.

Not everything is rainbows: iPXE uses interrupts, but U-Boot (which launches iPXE) runs on a single thread, so hacking is needed. Also, several ARM64 features have not been added to iPXE (they exist in x86-64) and, reading the room on the message boards, they will not be, so more hacking. Honestly, I was a hair away from abandoning iPXE and bootloading over NFS, but I persisted. Maybe this will be useful to others.

QEMU PXE Bootload Emulation

This is slightly easier, and good news: QEMU has a built-in TFTP server (through SLiRP), so with a few modifications to my earlier QEMU launcher script, we can demonstrate bootloading in QEMU. One caveat: since there is no Dnsmasq server here, the bootloader script must be baked into the iPXE binary – but only for QEMU.

iPXE bootloader script for QEMU:
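
The original is baked into the iPXE build at compile time; a minimal, illustrative version (file names assumed; ${next-server} resolves to QEMU’s built-in TFTP server) looks like this:

    #!ipxe
    # Fetch the kernel and initramfs from the DHCP-provided TFTP server and boot
    dhcp
    kernel tftp://${next-server}/Image console=ttyAMA0 rdinit=/sbin/init
    initrd tftp://${next-server}/rootfs.cpio.gz
    boot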

Modified QEMU launcher script:

The relevant lines are:
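
Roughly these (a sketch; the U-Boot binary used as the BIOS and the file paths are assumptions):

    # Run U-Boot (which chainloads iPXE) with QEMU's built-in SLiRP TFTP server
    qemu-system-aarch64 \
      -machine virt -cpu cortex-a53 -m 2048 -nographic \
      -bios u-boot.bin \
      -netdev user,id=net0,tftp=./tftproot,bootfile=ipxe.efi \
      -device virtio-net-device,netdev=net0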

Milestone: We now can compile a primary bootloader (U-Boot) and a secondary bootloader (iPXE) which can perform advanced chainloading over HTTP.

Part 3 – User Processes and Daemons

The next milestone is to add background daemons to run on boot. One could be a SaltStack listener. Another could be a performance monitor (RAM, temperature, etc.). But, how to add these in Buildroot?

Let’s have some fun and install a background process that lights a USB-connected LED green when Linux has booted, and goes dark if Linux crashes or hasn’t booted yet, or when shutdown gracefully. Let’s call this the Blinkstick daemon. This is an opportunity to explore compiling custom modules with Buildroot.

Add Modules to Buildroot

Needed: Python 3, libusb-1.0, a Python controller script, and all the kernel USB drivers we can get. The fun part is that all these prerequisites can be added as modules within the Buildroot configuration menu. Let’s see how to do that.
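
For example, the options end up looking something like this in the Buildroot .config (the first three are stock Buildroot symbols; BR2_PACKAGE_BLINKSTICK is my custom package, sketched next):

    BR2_PACKAGE_PYTHON3=y
    BR2_PACKAGE_LIBUSB=y
    BR2_PACKAGE_PYTHON_PYUSB=y
    BR2_PACKAGE_BLINKSTICK=y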

Then, in the main Buildroot Config.in, we can add these menu items to include the Blinkstick module:
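
A minimal sketch of that package’s Config.in and the hook in the top-level menu (the package name and layout are my own; the Kconfig keywords are standard Buildroot):

    # package/blinkstick/Config.in (sketch)
    config BR2_PACKAGE_BLINKSTICK
    	bool "blinkstick"
    	depends on BR2_PACKAGE_PYTHON3
    	select BR2_PACKAGE_LIBUSB
    	select BR2_PACKAGE_PYTHON_PYUSB
    	help
    	  Python controller for the BlinkStick USB LED.

    # package/Config.in (one added line)
    source "package/blinkstick/Config.in"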

Tip: Both libusb and pyusb are required. They look redundant – each seems to provide libusb-1.0 – but they complement each other: libusb is the C library and pyusb is the Python binding that loads it at runtime.

I’m quite proud of this next part. With Buildroot, you can add Make instructions for additional modules. In the case of Blinkstick, the library was abandoned at Python 2 and clearly written on Windows, judging by the line endings (CRLF). Let’s make it work with Python 3, on Linux, and fix a critical bug where libusb cannot be found when using musl (our toolchain).
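
In Buildroot terms, a trimmed sketch of that fragment could look like this (the version, the sed call site, and the library path are placeholder assumptions; recipe lines begin with a tab):

    # package/blinkstick/blinkstick.mk (sketch)
    BLINKSTICK_VERSION = master
    BLINKSTICK_SITE = $(call github,arvydas,blinkstick-python,$(BLINKSTICK_VERSION))
    BLINKSTICK_LICENSE = BSD-3-Clause
    BLINKSTICK_SETUP_TYPE = setuptools

    define BLINKSTICK_FIX_SOURCES
    	# Convert Windows CRLF line endings to LF
    	find $(@D) -name '*.py' -exec sed -i 's/\r$$//' {} +
    	# Hardcode the libusb-1.0.so path so pyusb can find it under musl
    	sed -i 's|get_backend()|get_backend(find_library=lambda _: "/usr/lib/libusb-1.0.so")|' \
    		$(@D)/blinkstick/blinkstick.py
    endef
    BLINKSTICK_POST_PATCH_HOOKS += BLINKSTICK_FIX_SOURCES

    $(eval $(python-package))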

In this makefile fragment, we convert the line endings to Linux (LF) and hardcode the libusb-1.0.so path because it cannot be found automatically under musl. This is just a proof-of-concept; I’m determined to rewrite this controller in C++.

BusyBox Init

Finally, with BusyBox’s inittab, we can write a single line of code, or place a script in an init.d folder.
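
For instance, a single respawn entry in /etc/inittab (the script path and name are mine):

    # /etc/inittab – keep the LED daemon alive for as long as init runs
    ::respawn:/usr/bin/blinkstick-daemon.sh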

The invoked script is then:
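
A minimal, illustrative version of that script (the BlinkStick CLI name, flags, and colors are assumptions):

    #!/bin/sh
    # blinkstick-daemon.sh (sketch): hold the LED green while the system is up,
    # and turn it off again on a graceful shutdown.
    trap 'blinkstick --set-color black; exit 0' TERM INT

    while true; do
        # Re-assert green periodically in case the device re-enumerates
        blinkstick --set-color green 2>/dev/null
        sleep 5
    done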

A background daemon that keeps a green light alive as long as the kernel is running is now resident.

Milestone: We now can compile custom modules and run scripts and background processes at bootup using inittab.

Part 4 – Network Bootloading on Real Hardware

It’s not wizardry, but yes, you can have two DHCP servers on the same LAN. With Dnsmasq, you can create some neat forwarding rules. Actually, with plain DHCP you can specify an upstream/next server on PXE requests. Let’s PXE boot the kernel on real hardware.

Main DHCP Server

With just the principal (home) router on the LAN running Tomato, the extra Dnsmasq options are as simple as:
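
Conceptually, it is the standard iPXE-detection dance (a sketch; the tag name is arbitrary):

    # iPXE requests carry DHCP option 175 – tag them so they get a different answer
    dhcp-match=set:ipxe,175

    # Plain PXE/U-Boot clients: hand out the iPXE binary from the TFTP server
    dhcp-boot=tag:!ipxe,ipxe.efi,,192.168.1.118

    # iPXE clients: hand out the boot script instead (avoids iPXE reloading itself forever)
    dhcp-boot=tag:ipxe,menu.ipxe,,192.168.1.118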

U-Boot DHCP requests come in, and the response contains the instruction to fetch ipxe.efi (the iPXE binary) from the TFTP server at 192.168.1.118. When iPXE loads and makes its own DHCP request, that request is tagged with “ipxe”, so this time the DHCP server responds with the instruction to fetch menu.ipxe from the same TFTP server. This avoids an infinite loop of iPXE loading itself.

Dnsmasq settings for PXE boot in Tomato

Docker TFTP and HTTP Servers

Docker comes in handy again. One of the reasons I opted away from NFS bootloading is that the NFS server is tied to the kernel, so it is not possible to dockerize an NFS service independently of the host kernel. TFTP is good for minimal file transfers like the secondary bootloader iPXE, and we’ll need an HTTP server, so this can all be trivially rolled into a Docker Compose YAML file. Just for some oomph, I added a Syslog server to catch log messages from U-Boot and iPXE.
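
A sketch of that Compose file (the TFTP and syslog image names are placeholders for whichever images you prefer):

    # docker-compose.yml (illustrative): TFTP for iPXE, HTTP for kernel/initramfs, syslog for logs
    services:
      tftp:
        image: my-tftpd                         # placeholder TFTP server image
        ports:
          - "69:69/udp"
        volumes:
          - ./tftproot:/tftpboot:ro             # ipxe.efi, menu.ipxe
      http:
        image: nginx:alpine
        ports:
          - "8080:80"
        volumes:
          - ./httproot:/usr/share/nginx/html:ro # Image, rootfs.cpio.gz
      syslog:
        image: my-syslog                        # placeholder syslog server image
        ports:
          - "1514:514/udp"                      # unprivileged host port mapped to syslog's 514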

Run this on a non-NAT’d host, or else port-forward ports 69/udp, 8080/tcp, and 1514/udp (for Syslog) to the containers above.

Tip: Use host ports above 1024 to avoid needing the NET_BIND_SERVICE capability, which explains why host port 1514 is mapped to the container’s 514 for Syslog.

Results and Demo

Success. On real hardware wired to a USB serial adapter, we can see the PXE boot process from power-on to shell prompt. We start by connecting the serial cable to the terminal, then running docker compose up, and then powering up a Pine64 compute module. You can see U-Boot, iPXE, and then the kernel load. Lastly, a user script reports on the system, and the demo ends with the Blinkstick daemon starting.

PXE booting on real hardware through U-Boot, iPXE, and Linux

This has been an educational, long row to hoe, but so much has been accomplished and learned in this project.

PXE booting custom Linux kernel on real hardware

Part 5 – Gotchas and More Tips

When I come back to this project periodically to compile future major-release kernels, here are some gotchas I’d like to remember.

Disk Space

Have plenty of disk space. This project takes gigabytes and gigabytes of space to compile the kernel and the other binaries. Plus there is over a gigabyte in downloads and GCC cache. When building in Docker containers, be sure to docker system prune often or else you’ll wonder where 100GiB went in a couple of days.

USB Power Management (PM)

These compute modules should act like warm lambdas, so it makes life easier to disable power management. Why? On boot, I find that USB devices enter low-power mode, so my Blinkstick daemon cannot find the USB devices. By blanket disabling CONFIG_SUSPEND and/or CONFIG_PM in the kernel config, this issue goes away and all USB ports have full power.
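
In kernel-config terms, that is simply (standard upstream symbols, as they appear in a .config when disabled):

    # Keep USB ports at full power – no suspend or runtime power management
    # CONFIG_SUSPEND is not set
    # CONFIG_PM is not set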

Large Packages

Including Git core through Buildroot incurs 23MiB, and Python 3 adds another 16MiB to the initramfs (the filesystem in memory). The Btrfs filesystem adds 1.7MiB to the kernel image. This may seem like nickel-and-diming, but I want this system to be a lean greyhound and lend as much memory to user processes as possible.

Bisecting Configurations

First include most options and drivers, save backups, then remove large swaths of maybe-unneeded options in the kernel, BusyBox, and Buildroot configs. When components fail, perform a diff and add back options. Repeat until the sweet spot is found.

Diff tool on configuration settings

Even better, generate the so-called defconfigs for each build for even easier diffing.


Part 6 – Conclusion

Let’s see how we’re doing on “disk” usage and RAM, and even benchmark the memory.

Kernel Footprint

Most would be pretty happy with a 12MiB compressed Linux kernel footprint. But, how much memory does a minimally-booted Linux kernel with no userspace apps take up? Let’s find out with some useful commands.
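
For example, straight from the BusyBox shell (all plain BusyBox applets):

    free                      # total, used, and free RAM, including the initramfs living in it
    cat /proc/meminfo | head  # MemTotal, MemFree, MemAvailable, and friends
    df -h /                   # how much of the RAM-backed rootfs is consumed
    dmesg | grep -i memory    # the kernel's view of reserved vs. available memory at boot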

So far, here are my respectable stats on a 2GiB compute module:

Glass half full: about 92% of the physical RAM is available on boot with minimal BusyBox and rootfs tuning. We can do better, but this is a win.

Bootup Time

Excluding the bootloading stages and the final DHCP request, booting to the shell prompt takes about 5 seconds. Very respectable.

IOzone and RAMspeed

Let’s end this project with some benchmarks. First, running IOzone with iozone -a shows results similar to the following, with several operations in the gigabytes-per-second range.

IOzone “disk” IO speed testing

More succinctly, here is sample output from the RAMspeed integer and floating-point memory tests.

RAMspeed test on the Pine64 SOPINE module

These speeds are on target for LPDDR3 RAM from 2016, and still quite decent.


Success: We’ve learned how to compile Linux and bootloaders from scratch, tune the kernel, add userspace code, and network-boot a bare-metal ARM64 device over HTTP as a precursor to controlling a large cluster computer.

Notes:

  1. RancherOS promises much, but only for the AMD64 community.
  2. Ayufan – https://github.com/ayufan-pine64/linux-build
  3. Longsleep – https://github.com/longsleep/linux-pine64
  4. https://developer.arm.com/architectures/instruction-sets/simd-isas/neon
  5. Used to mitigate speculative-execution attacks on Arm.