From 9c45ac817d2d1f1c314c3305af517eb883d4f2be Mon Sep 17 00:00:00 2001
From: Nico Schottelius
Date: Tue, 31 Aug 2021 18:01:37 +0200
Subject: [PATCH] ++blog: disk based booting

---
 .../contents.lr | 131 ++++++++++++++++++
 1 file changed, 131 insertions(+)
 create mode 100644 content/u/blog/2021-08-31-datacenterlight-bye-bye-netboot/contents.lr

diff --git a/content/u/blog/2021-08-31-datacenterlight-bye-bye-netboot/contents.lr b/content/u/blog/2021-08-31-datacenterlight-bye-bye-netboot/contents.lr
new file mode 100644
index 0000000..96bc279
--- /dev/null
+++ b/content/u/blog/2021-08-31-datacenterlight-bye-bye-netboot/contents.lr
@@ -0,0 +1,131 @@
title: Bye, bye netboot
---
pub_date: 2021-08-31
---
author: ungleich infrastructure team
---
twitter_handle: ungleich
---
_hidden: no
---
_discoverable: yes
---
abstract:
Data Center Light servers are switching to disk based boot
---
body:

## Introduction

Since the very beginning of the [Data Center Light
project](/u/projects/data-center-light) our servers have been
*somewhat stateless* and have booted their operating system from the
network.

From today on this changes: our servers are switching to boot from a
disk (SSD/NVMe/HDD). While this may at first seem counterintuitive
for a growing data center, let us explain why it makes sense for us.

## Netboot in a nutshell

There are different ways to netboot a server. In either case, the
server loads an executable from the network, typically via TFTP or
HTTP, and then hands over execution to it.

The first option is to load the kernel and later switch to an NFS
based root filesystem. If that filesystem is read-write, you usually
need one location per server; alternatively you mount it read-only
and possibly apply an overlay for runtime configuration.

The second option is to load the kernel and an initramfs into memory
and stay inside the initramfs.
The advantage of this approach is that
no NFS server is needed, but the whole operating system has to fit
into memory.

This second option is what we used in Data Center Light for the last
couple of years.

## Netboot history at Data Center Light

Originally, all our servers started with IPv4 PXE based
netboot. However, as our data center is generally speaking IPv6 only,
the IPv4 DHCP+TFTP combination is an extra maintenance burden and a
hindrance for network debugging: in a single stack IPv6 only network,
things are much easier to debug. There is no need to look at two
routing tables and no need to work around DHCP settings that might
interfere with what one wants to achieve via IPv6.

As the IPv4 addresses became more of a technical debt in our
infrastructure, we started flashing our network cards with
[iPXE](https://ipxe.org/), which allows even older network cards to
boot in IPv6 only networks.

Also, in an IPv6 only netboot environment it is easier to run
active-active routers, as hosts are not assigned DHCP leases: they
assign addresses themselves, which scales much better.

## Migrating away from netbooting

So why are we migrating away from netbooting, even after we migrated
to IPv6 only networking? There are multiple aspects:

On power failure, netbooted hosts lose their state. The operating
system that is loaded is the same for every server and needs some
configuration post-boot. We have solved this using
[cdist](https://www.cdi.st/), however the authentication-trigger
mechanism is non-trivial if you want to keep your netboot images and
build steps public.

The second reason is state synchronisation: as we run multiple boot
servers, we need to maintain the same state on multiple
machines. That is solvable via CI/CD pipelines, however the level of
automation on the build servers is rather low, because the number of
OS changes is low.
The third and main point is our ongoing migration towards
[kubernetes](https://kubernetes.io/). Originally our servers would
boot up and get configured to provide ceph storage or to act as a
virtualisation host. The amount of binaries to keep in our in-memory
image was tiny, in the best case around 150MB. With the migration
towards kubernetes, every node downloads its containers, which can be
comparably huge (gigabytes of data). The additional pivot_root
workarounds that are required for initramfs usage are just a minor
extra point that made us question our current setup.

## Automating disk based boot

We have servers from a variety of brands and each of them comes with a
variety of disk controllers: from simple pass-through SATA controllers
to full fledged hardware RAID with onboard cache and a battery
protecting that cache - everything is in the mix.

So it is not easily possible to prepare a stack of disks somewhere
and then add them later, as the disk controller might add its own
(RAID0) metadata to them.

To work around this problem, we insert the disk that will become the
boot disk into the netbooted server, install the operating system
from the running environment, and at the next maintenance window
ensure that the server actually boots from it.

If you are curious how this works, you can check out the scripts that
we use for
[Devuan/Debian](https://code.ungleich.ch/ungleich-public/ungleich-tools/-/blob/master/debian-devuan-install-on-disk.sh)
and
[Alpine Linux](https://code.ungleich.ch/ungleich-public/ungleich-tools/-/blob/master/alpine-install-on-disk.sh).

## The road continues

While a data center needs to be stable, it also needs to adapt to
newer technologies and different workflows. Disk based boot is our
current solution on the path towards kubernetes, but who knows - in
the future things might look different again.
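As a footnote to the "Automating disk based boot" section above, the
install-to-disk flow can be sketched roughly as follows. This is a
heavily simplified sketch, not our actual tooling: the disk device,
mount point and Debian suite are hypothetical examples, and the real,
much more careful logic lives in the linked scripts. By default it
only prints the commands instead of executing them:

```shell
#!/bin/sh
# Simplified sketch of installing an OS from the running (netbooted)
# system onto a future boot disk. Prints the commands instead of
# running them; device name and paths are hypothetical.
install_to_disk() {
    disk="$1"
    root="/mnt/target"
    run="${RUN-echo}"   # set RUN="" in the environment to really execute

    # one bootable root partition spanning the whole disk
    $run parted -s "$disk" -- mklabel msdos mkpart primary ext4 1MiB 100%
    $run mkfs.ext4 "${disk}1"
    $run mkdir -p "$root"
    $run mount "${disk}1" "$root"

    # bootstrap a minimal Debian/Devuan onto the mounted disk
    $run debootstrap stable "$root"

    # bind mounts so the boot loader installation sees the live system
    for fs in dev proc sys; do
        $run mount --bind "/$fs" "$root/$fs"
    done
    $run chroot "$root" grub-install "$disk"
    $run chroot "$root" update-grub
}

install_to_disk /dev/sdb   # dry run: prints the steps for a hypothetical disk
```

Running it as-is only prints the plan; actually executing such steps
is exactly the kind of change we reserve for a maintenance window.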
If you want to join the discussion, we have a
[Hacking and Learning
(#hacking-and-learning:ungleich.ch)](/u/projects/open-chat/) channel
on Matrix for an open exchange.

Oh, and in case [you were wondering what we did
today](https://twitter.com/ungleich/status/1432627966316584968): we
switched to disk based booting ;-).