From 251677cf5765c208716c0ef4baeb77560b958f15 Mon Sep 17 00:00:00 2001
From: Nico Schottelius
Date: Sat, 1 May 2021 11:34:11 +0200
Subject: [PATCH] ++ three new blog articles

---
 .../contents.lr | 161 +++++++++++++
 .../contents.lr | 219 ++++++++++++++++++
 .../contents.lr | 161 +++++++++++++
 3 files changed, 541 insertions(+)
 create mode 100644 content/u/blog/datacenterlight-active-active-routing/contents.lr
 create mode 100644 content/u/blog/datacenterlight-ipv6-only-netboot/contents.lr
 create mode 100644 content/u/blog/datacenterlight-spring-network-cleanup/contents.lr

diff --git a/content/u/blog/datacenterlight-active-active-routing/contents.lr b/content/u/blog/datacenterlight-active-active-routing/contents.lr
new file mode 100644
index 0000000..b203cb4
--- /dev/null
+++ b/content/u/blog/datacenterlight-active-active-routing/contents.lr
@@ -0,0 +1,161 @@
+title: Active-Active Routing Paths in Data Center Light
+---
+pub_date: 2019-11-08
+---
+author: Nico Schottelius
+---
+twitter_handle: NicoSchottelius
+---
+_hidden: no
+---
+_discoverable: no
+---
+abstract:
+
+---
+body:
+
+From our last two blog articles (a, b) you probably already know that
+it is spring network cleanup in [Data Center Light](https://datacenterlight.ch).
+
+In the [first blog article]() we described where we started and in
+the [second blog article]() you could see how we switched our
+infrastructure to IPv6 only netboot.
+
+In this article we dive a bit deeper into the details of our
+network architecture and the problems we face with active-active
+routers.
+
+## Network architecture
+
+Let's have a look at a simplified (!) diagram of the network:
+
+... IMAGE
+
+Doesn't look that simple, does it? Let's break it down into small
+pieces.
+
+## Upstream routers
+
+We have a set of **upstream routers** which are stateless. They don't
+have any stateful firewall rules, so both of them can work actively
+without state synchronisation. These are fast routers and besides
+forwarding, they also do **BGP peering** with our data center
+upstreams.
+
+Overall, the upstream routers are very simple machines, mostly running
+bird and forwarding packets all day. They also provide a DNS service
+(resolving and authoritative), because they are always up and can
+announce service IPs via BGP or via OSPF to our network.
+
+## Internal routers
+
+The internal routers on the other hand provide **stateful routing**,
+**IP address assignments** and **netboot services**. They are a bit
+more complicated compared to the upstream routers, but they carry only
+a small routing table.
+
+## Communication between the routers
+
+All routers employ OSPF and BGP for route exchange. Thus the two
+upstream routers learn about the internal networks (IPv6 only, as
+usual) from the internal routers.
+
+## Sessions
+
+Sessions in networking are almost always evil. You need to store
+them (at high speed), you need to maintain them (updating, deleting)
+and if you run multiple routers, you even need to synchronise them.
+
+In our case the internal routers do have session handling, as they
+provide a stateful firewall. As we are using a multi-router setup,
+things can go really wrong if the wrong routes are being used.
+
+Let's have a look at this in a bit more detail.
+
+## The good path
+
+IMAGE2: good
+
+If a server sends out a packet via router1 and router1 eventually
+receives the answer, everything is fine. The returning packet matches
+the state entry that was created by the outgoing packet and the
+internal router forwards the packet.
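+In nftables terms, the forwarding policy on the internal routers
+conceptually looks like this (a minimal sketch, not our production
+ruleset; the interface name is made up):
+
+```
+table inet filter {
+    chain forward {
+        type filter hook forward priority 0; policy drop;
+
+        # replies that belong to a known session may pass
+        ct state established,related accept
+
+        # new sessions may only be opened from the inside
+        iifname "bond0.5" ct state new accept
+    }
+}
+```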
+## The bad path
+
+IMAGE3: bad
+
+However, if the answer to a packet that left via router1 comes back
+via router2, router2 has no matching state entry and drops the
+returning packet.
+
+## Routing paths
+
+If we want to go active-active routing, the server can choose between
+either internal router for sending out the packet. The internal
+routers again have two upstream routers. So with the return path
+included, the following paths exist for a packet:
+
+Outgoing paths:
+
+* servers->router1->upstream router1->internet
+* servers->router1->upstream router2->internet
+* servers->router2->upstream router1->internet
+* servers->router2->upstream router2->internet
+
+And the returning paths are:
+
+* internet->upstream router1->router1->servers
+* internet->upstream router1->router2->servers
+* internet->upstream router2->router1->servers
+* internet->upstream router2->router2->servers
+
+So on average, 50% of the returning packets will hit the right router.
+However, as neither the servers nor the upstream routers use load
+balancing like ECMP, once an incorrect return path has been chosen,
+the packet loss for that session is 100%.
+
+## Session synchronisation
+
+In the first article we talked a bit about keepalived and that
+it helps to operate routers in an active-passive mode. This did not
+turn out to be the most reliable method. Can we do better with
+active-active routers and session synchronisation?
+
+Linux supports this using
+[conntrackd](http://conntrack-tools.netfilter.org/). However,
+conntrackd supports active-active routers on a **flow based** level,
+but not on a **packet based** level. The difference is that the
+following will not work in active-active routers with conntrackd:
+
+```
+#1 Packet (in the original direction) updates state in Router R1 ->
+   submit state to R2
+#2 Packet (in the reply direction) arrive to Router R2 before state
+   coming from R1 has been digested.
+
+With strict stateful filtering, Packet #2 will be dropped and it will
+trigger a retransmission.
+```
+(quote from Pablo Neira Ayuso, see below for more details)
+
+Some of you will mumble something like **latency** in their head right
+now. If the return packet is guaranteed to arrive after state
+synchronisation, then everything is fine. However, if the reply is
+faster than the state synchronisation, packets will get dropped.
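+To make this concrete: state replication between the two internal
+routers would be configured roughly like this in conntrackd (a
+minimal sketch, not our production configuration; the addresses and
+the dedicated sync interface eth2 are made up):
+
+```
+Sync {
+    # FTFW is conntrackd's reliable, acknowledged replication mode
+    Mode FTFW {
+    }
+    # replicate state changes to the peer router over a dedicated link
+    Multicast {
+        IPv4_address 225.0.0.50
+        IPv4_interface 192.168.100.1
+        Interface eth2
+        Group 3780
+    }
+}
+General {
+    Syslog on
+    UNIX {
+        Path /var/run/conntrackd.ctl
+    }
+}
+```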
+In reality, this works for packets coming from and going to the
+Internet. However, in our setup the upstream routers route between
+different data center locations, which are in the sub-microsecond
+latency range - i.e. LAN speed - because they are interconnected with
+dark fiber links.
+
+
+## Take away
+
+Before moving on to the next blog article, we would like to express
+our thanks to Pablo Neira Ayuso, who gave very important input on
+session based firewalls and session synchronisation.
+
+So active-active routing does not seem to have a straightforward
+solution. Read in the [next blog article](/) how we solved the
+challenge in the end.
diff --git a/content/u/blog/datacenterlight-ipv6-only-netboot/contents.lr b/content/u/blog/datacenterlight-ipv6-only-netboot/contents.lr
new file mode 100644
index 0000000..0d42d34
--- /dev/null
+++ b/content/u/blog/datacenterlight-ipv6-only-netboot/contents.lr
@@ -0,0 +1,219 @@
+title: IPv6 only netboot in Data Center Light
+---
+pub_date: 2021-05-01
+---
+author: Nico Schottelius
+---
+twitter_handle: NicoSchottelius
+---
+_hidden: no
+---
+_discoverable: no
+---
+abstract:
+How we switched from IPv4 netboot to IPv6 netboot
+---
+body:
+
+In our [previous blog
+article](/u/blog/datacenterlight-spring-network-cleanup)
+we wrote about our motivation for the
+big spring network cleanup. In this blog article we show how we
+started reducing the complexity by removing our dependency on IPv4.
+
+## IPv6 first
+
+If you have found our blog before, you are probably aware: everything
+at ungleich is IPv6 first. Many of our networks are IPv6 only, all DNS
+entries for remote access have IPv6 (AAAA) entries and there are only
+rare exceptions where we utilise IPv4.
+
+## IPv4 only netboot
+
+One of the big exceptions to this paradigm used to be how we boot our
+servers. Because our second big paradigm is sustainability, we use a
+lot of 2nd (or 3rd) generation hardware. We actually share this
+passion with our friends from
+[e-durable](https://recycled.cloud/), because sustainability is
+something that we need to practise today and not tomorrow.
+But back to the netbooting topic: so far, we mainly relied on onboard
+network cards for netbooting.
+
+## Onboard network cards
+
+We used these network cards for multiple reasons:
+
+* they exist in virtually any server
+* they usually have a ROM containing a PXE capable firmware
+* they allow us to split real traffic (on the fiber cards) from
+  internal traffic
+
+However, using the onboard devices also comes with a couple of
+disadvantages:
+
+* Their ROM is often outdated
+* It requires additional cabling
+
+## Cables
+
+Let's have a look at the cabling situation first. Virtually all of
+our servers are connected to the network using 2x 10 Gbit/s fiber cards.
+
+On one side this provides a fast connection, but on the other side
+it provides us with something even better: distance.
+
+Our data centers employ a non-standard design due to the re-use of
+existing factory halls. This means distances between servers and
+switches can be up to 100m. With fiber, we can easily achieve these
+distances.
+
+Additionally, having fewer cables gives us a simpler infrastructure
+that is easier to analyse.
+
+## Reducing complexity 1
+
+So can we somehow get rid of the copper cables and switch to fiber
+only? It turns out that the fiber cards we use (mainly Intel X520's)
+have their own ROM. So we started disabling the onboard network cards
+and tried booting from the fiber cards. This worked until we wanted to
+move the lab setup to production...
+
+## Bonding (LACP) and VLAN tagging
+
+Our servers use bonding (802.3ad) for redundant connections to the
+switches and VLAN tagging on top of the bonded devices to isolate
+client traffic, as sketched below.
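+On the Linux side, this kind of setup can be created with iproute2
+roughly as follows (a sketch with assumed interface names eth0/eth1
+and VLAN id 10, not our actual configuration management output):
+
+```
+# create an LACP (802.3ad) bond and enslave both fiber ports;
+# slaves must be down before they can be added to the bond
+ip link add bond0 type bond mode 802.3ad
+ip link set eth0 down
+ip link set eth1 down
+ip link set eth0 master bond0
+ip link set eth1 master bond0
+
+# VLAN tagging happens on top of the bonded device
+ip link add link bond0 name bond0.10 type vlan id 10
+
+ip link set bond0 up
+ip link set bond0.10 up
+```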
+On the switch side we realised this using configurations like:
+
+```
+interface Port-Channel33
+   switchport mode trunk
+   mlag 33
+
+...
+interface Ethernet33
+   channel-group 33 mode active
+```
+
+But that does not work if the network ROM at boot does not create an
+LACP enabled link on top of which it would do the VLAN tagging.
+
+The ROM in our network cards **would** have allowed VLAN tagging
+alone though.
+
+To fix this problem, we reconfigured our switches as follows:
+
+```
+interface Port-Channel33
+   switchport trunk native vlan 10
+   switchport mode trunk
+   port-channel lacp fallback static
+   port-channel lacp fallback timeout 20
+   mlag 33
+```
+
+This basically does two things:
+
+* If there are no LACP frames, fall back to a static (non-LACP)
+  configuration
+* Accept untagged traffic and map it to VLAN 10 (one of our boot
+  networks)
+
+Great, our servers can now netboot from fiber! But we are not done
+yet...
+
+## IPv6 only netbooting
+
+So how do we convince these network cards to do IPv6 netboot? Can we
+actually do that at all? Our first approach was to put a custom build of
+[ipxe](https://ipxe.org/) on a USB stick. We generated that
+ipxe image using the **rebuild-ipxe.sh** script
+from the
+[ungleich-tools](https://code.ungleich.ch/ungleich-public/ungleich-tools)
+repository. It turns out that using a USB stick works pretty well for
+most situations.
+
+## ROMs are not ROMs
+
+As you can imagine, the ROM of the X520 cards does not contain IPv6
+netboot support. So are we back at square one? No, we are not. Because
+the X520's have something that the onboard devices did not
+consistently have: **a rewritable memory area**.
+
+Let's take two steps back here first: a ROM is a **read only memory**
+chip. Emphasis on **read only**. However, modern network cards and a
+lot of devices that support on-device firmware do actually have a
+memory (flash) area that can be written to. And that is what aids us
+in our situation.
+
+## ipxe + flbtool + x520 = fun
+
+Trying to write ipxe into the X520 cards initially failed, because the
+network card did not recognise the format of the ipxe rom file.
+
+Luckily the folks in the ipxe community already spotted that problem
+AND fixed it: the format used in these cards is called FLB. And there
+is [flbtool](https://github.com/devicenull/flbtool/), which allows you
+to wrap the ipxe rom file into the FLB format. For those who want to
+try it yourself (at your own risk!), it basically involves:
+
+* Get the current ROM from the card (try bootutil64e)
+* Extract the contents from the ROM using flbtool
+* This will output some sections/parts
+* Locate the part that you want to overwrite with iPXE (an existing
+  PXE section is very suitable)
+* Replace the .bin file with your iPXE rom
+* Adjust the .json file to match the length of the new binary
+* Build a new .flb file using flbtool
+* Flash it onto the card
+
+While this is a bit of work, it is worth it for us, because...
+
+## IPv6 only netboot over fiber
+
+With the modified ROM, basically loading iPXE at start, we can now
+boot our servers in IPv6 only networks. On our infrastructure side, we
+added two **tiny** things:
+
+We use ISC dhcp with the following configuration file:
+
+```
+option dhcp6.bootfile-url code 59 = string;
+
+option dhcp6.bootfile-url "http://[2a0a:e5c0:0:6::46]/ipxescript";
+
+subnet6 2a0a:e5c0:0:6::/64 {}
+```
+
+(that is the complete configuration!)
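+The bootfile-url above points to an iPXE script served over plain
+HTTP. As a hypothetical minimal sketch (the kernel and initramfs
+paths are made up), such a script could look like this:
+
+```
+#!ipxe
+# fetch kernel and initramfs over IPv6 HTTP, then boot
+kernel http://[2a0a:e5c0:0:6::46]/vmlinuz console=ttyS0
+initrd http://[2a0a:e5c0:0:6::46]/initramfs
+boot
+```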
+And we use radvd to announce the "other configuration" flag,
+indicating that clients can query the DHCPv6 server for additional
+information:
+
+```
+interface bond0.10
+{
+    AdvSendAdvert on;
+    MinRtrAdvInterval 3;
+    MaxRtrAdvInterval 5;
+    AdvDefaultLifetime 600;
+
+    # IPv6 netbooting
+    AdvOtherConfigFlag on;
+
+    prefix 2a0a:e5c0:0:6::/64 { };
+
+    RDNSS 2a0a:e5c0:0:a::a 2a0a:e5c0:0:a::b { AdvRDNSSLifetime 6000; };
+    DNSSL place5.ungleich.ch { AdvDNSSLLifetime 6000; };
+};
+```
+
+## Take away
+
+Being able to reduce cables was one big advantage in the beginning.
+
+Switching to IPv6 only netboot does not seem like a big simplification
+at first, besides being able to remove IPv4 in server
+networks.
+
+However, as you will see in
+[the next blog post](/u/blog/datacenterlight-active-active-routing/),
+switching to IPv6 only netbooting is actually a key element in
+reducing complexity in our network.
diff --git a/content/u/blog/datacenterlight-spring-network-cleanup/contents.lr b/content/u/blog/datacenterlight-spring-network-cleanup/contents.lr
new file mode 100644
index 0000000..073b4b1
--- /dev/null
+++ b/content/u/blog/datacenterlight-spring-network-cleanup/contents.lr
@@ -0,0 +1,161 @@
+title: Data Center Light: Spring network cleanup
+---
+pub_date: 2021-05-01
+---
+author: Nico Schottelius
+---
+twitter_handle: NicoSchottelius
+---
+_hidden: no
+---
+_discoverable: no
+---
+abstract:
+An introduction to our spring network cleanup: where we started and
+what the typical setup in our data center used to be
+---
+body:
+
+## Introduction
+
+Spring is the time for cleanup: cleaning up your apartment, removing
+dust from the cabinet, letting the light shine through the windows,
+or, like in our case, improving the networking situation.
+
+In this article we give an introduction to where we started and what
+the typical setup in our data center used to be.
+
+## Best practice
+
+When we started [Data Center Light](https://datacenterlight.ch) in
+2017, we oriented ourselves towards "best practice" networking. We
+started with IPv6 only networks and used an RFC1918 network (10/8)
+for internal IPv4 routing.
+
+And we started with two routers for every network to provide
+redundancy.
+
+## Router redundancy
+
+So what do you do when you have two routers? In the Linux world the
+software [keepalived](https://keepalived.org/)
+is very popular for providing redundant routing
+using the [VRRP protocol](https://en.wikipedia.org/wiki/Virtual_Router_Redundancy_Protocol).
+
+## Active-Passive
+
+While VRRP is designed to allow multiple (not only two) routers to
+co-exist in a network, its design is basically active-passive: you
+have one active router and n passive routers, in our case one
+additional router.
+
+## Keepalived: a closer look
+
+A typical keepalived configuration in our network looked like this:
+
+```
+vrrp_instance router_v4 {
+    interface INTERFACE
+    virtual_router_id 2
+    priority PRIORITY
+    advert_int 1
+    virtual_ipaddress {
+        10.0.0.1/22 dev eth1.5 # Internal
+    }
+    notify_backup "/usr/local/bin/vrrp_notify_backup.sh"
+    notify_fault "/usr/local/bin/vrrp_notify_fault.sh"
+    notify_master "/usr/local/bin/vrrp_notify_master.sh"
+}
+
+vrrp_instance router_v6 {
+    interface INTERFACE
+    virtual_router_id 1
+    priority PRIORITY
+    advert_int 1
+    virtual_ipaddress {
+        2a0a:e5c0:1:8::48/128 dev eth1.8 # Transfer for routing from outside
+        2a0a:e5c0:0:44::7/64 dev bond0.18 # zhaw
+        2a0a:e5c0:2:15::7/64 dev bond0.20 #
+    }
+}
+```
+
+This is a template that we distribute via [cdist](https://cdi.st). The
+strings INTERFACE and PRIORITY are replaced via cdist. The interface
+field defines which interface to use for VRRP communication and the
+priority field determines which of the routers is the active one.
+
+So far, so good. However, let's have a look at a tiny detail of this
+configuration file:
+
+```
+    notify_backup "/usr/local/bin/vrrp_notify_backup.sh"
+    notify_fault "/usr/local/bin/vrrp_notify_fault.sh"
+    notify_master "/usr/local/bin/vrrp_notify_master.sh"
+```
+
+These three lines basically say: "start something if you are the
+master" and "stop something in case you are not". And why did we do
+this? Because of stateful services.
+
+## Stateful services
+
+A typical shell script that we would call contains lines like this:
+
+```
+/etc/init.d/radvd stop
+/etc/init.d/dhcpd stop
+```
+(or start in the case of the master version)
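+Spelled out, the backup and master notify scripts might look roughly
+like this (a sketch of the idea, not our literal production scripts):
+
+```
+#!/bin/sh
+# vrrp_notify_backup.sh: we lost mastership, stop stateful services
+/etc/init.d/radvd stop
+/etc/init.d/dhcpd stop
+```
+
+```
+#!/bin/sh
+# vrrp_notify_master.sh: we are the active router now, start serving
+/etc/init.d/radvd start
+/etc/init.d/dhcpd start
+```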
+In earlier days, this even contained openvpn, which was running on our
+first generation routers. But more about OpenVPN later.
+
+The reason why we stopped and started dhcpd and radvd is to make the
+clients of the network use the active router. We used radvd to provide
+IPv6 addresses as the primary access method to servers. And we used
+dhcp mainly to allow servers to netboot. The active router would
+carry state (firewall!) and thus the flow of packets always needs to
+go through the active router.
+
+Restarting radvd on a different machine keeps the IPv6 addresses the
+same, as clients assign them themselves using EUI-64. In the case of
+dhcp (IPv4) we could have used hardcoded IPv4 addresses, using a
+mapping of MAC address to IPv4 address, but we opted against this. The
+main reason is that dhcp clients re-request their previous lease, and
+even if an IPv4 address changes, it is not really of importance.
+
+During a failover this would lead to a few seconds of interruption and
+to re-establishing sessions. Given that routers are usually rather
+stable and restarting them is not a daily task, we initially accepted
+this.
+
+## Keepalived/VRRP changes
+
+One of the more tricky things is changes to keepalived. Because
+keepalived uses the *number of addresses and routes* to verify
+that a received VRRP packet matches its configuration, adding or
+deleting IP addresses and routes causes a problem:
+
+While one router is being updated, the number of IP addresses or
+routes differs between the two. This causes both routers to ignore
+the other's VRRP messages, and both routers think they should be the
+master process.
+
+This leads to the problem that both routers receive client and outside
+traffic. The firewall (nftables) then fails to recognise returning
+packets if they were sent out by router1 but received back by router2,
+and, because nftables is configured *stateful*, it drops the returning
+packet.
+
+However, not only changes to the configuration can trigger this
+problem, but also any communication problem between the two
+routers. Since 2017 we have experienced multiple times that keepalived
+was unable to receive or send messages from or to the other router and
+thus both routers again became the master process.
+
+## Take away
+
+While in theory keepalived should improve reliability, in practice the
+number of problems we had due to double master situations made us
+question whether the keepalived concept is the fitting one for us.
+
+You can read how we evolved from this setup in
+[the next blog article](/u/blog/datacenterlight-ipv6-only-netboot/).