++ three new blog articles
parent 5d05a28e7d
commit 251677cf57
3 changed files with 541 additions and 0 deletions
content/u/blog/datacenterlight-active-active-routing/contents.lr (new file, 161 lines)
@@ -0,0 +1,161 @@
title: Active-Active Routing Paths in Data Center Light
---
pub_date: 2019-11-08
---
author: Nico Schottelius
---
twitter_handle: NicoSchottelius
---
_hidden: no
---
_discoverable: no
---
abstract:
---
body:

From our last two blog articles (a, b) you probably already know that
it is spring network cleanup time at [Data Center Light](https://datacenterlight.ch).

In the [first blog article]() we described where we started and in
the [second blog article]() you could see how we switched our
infrastructure to IPv6 only netboot.

In this article we dive a bit deeper into the details of our
network architecture and the problems we face with active-active
routers.

## Network architecture

Let's have a look at a simplified (!) diagram of the network:

... IMAGE

Doesn't look that simple, does it? Let's break it down into small
pieces.

## Upstream routers

We have a set of **upstream routers** which operate statelessly. They
don't have any stateful firewall rules, so both of them can be active
at the same time without state synchronisation. These are fast
routers: besides forwarding, they also do **BGP peering** with our
data center upstreams.

Overall, the upstream routers are very simple machines, mostly running
bird and forwarding packets all day. They also provide a DNS service
(resolving and authoritative), because they are always up and can
announce service IPs via BGP or via OSPF to our network.

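Both route exchange mechanisms are plain bird configuration. A minimal
bird2-style sketch of the idea (AS numbers, addresses and interface
names are documentation placeholders, not our real setup):

```
# BGP session towards one upstream (placeholder ASNs/addresses)
protocol bgp upstream1 {
        local as 64496;
        neighbor 2001:db8::1 as 64497;
        ipv6 {
                import all;
                export all;
        };
}

# OSPFv3 towards the internal routers
protocol ospf v3 internal {
        ipv6 {
                import all;
                export none;
        };
        area 0 {
                interface "eth*" { };
        };
}
```
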
## Internal routers

The internal routers on the other hand provide **stateful routing**,
**IP address assignments** and **netboot services**. They are a bit
more complicated than the upstream routers, but they carry only
a small routing table.

## Communication between the routers

All routers employ OSPF and BGP for route exchange. Thus the two
upstream routers learn about the internal networks (IPv6 only, as
usual) from the internal routers.

## Sessions

Sessions in networking are almost always an evil. You need to store
them (at high speed), you need to maintain them (updating, deleting)
and if you run multiple routers, you even need to synchronise them.

In our case the internal routers do have session handling, as they
provide a stateful firewall. As we are using a multi-router setup,
things can go really wrong if the wrong routes are being used.

Let's have a look at this in a bit more detail.

## The good path

IMAGE2: good

If a server sends out a packet via router1 and router1 eventually
receives the answer, everything is fine. The returning packet matches
the state entry that was created by the outgoing packet and the
internal router forwards the packet.

## The bad path

IMAGE3: bad

However, if the answer to a packet that left via router1 comes back
via router2, there is no matching state entry on router2: the
stateful firewall classifies the packet as invalid and drops it.

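To make the dropping behaviour concrete, here is a minimal nftables
sketch of such a stateful forward chain (illustrative only, not our
production ruleset; the interface name is a placeholder):

```
table inet filter {
        chain forward {
                type filter hook forward priority 0; policy drop;

                # returning packets are accepted only if a state entry exists
                ct state established,related accept

                # new sessions may only be initiated from the inside
                iifname "bond0.10" ct state new accept
        }
}
```
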
## Routing paths

If we want to go active-active routing, the server can choose between
either internal router for sending out the packet. The internal
routers again have two upstream routers. So with the return path
included, the following paths exist for a packet:

Outgoing paths:

* servers->router1->upstream router1->internet
* servers->router1->upstream router2->internet
* servers->router2->upstream router1->internet
* servers->router2->upstream router2->internet

And the returning paths are:

* internet->upstream router1->router1->servers
* internet->upstream router1->router2->servers
* internet->upstream router2->router1->servers
* internet->upstream router2->router2->servers

So on average, 50% of the returning packets will hit the right router.
However, neither the servers nor the upstream routers use load
balancing like ECMP, so a flow sticks to the path it has chosen: once
an incorrect return path has been selected, the packet loss for that
flow is 100%.

## Session synchronisation

In the first article we talked a bit about keepalived and how
it helps to operate routers in an active-passive mode. This did not
turn out to be the most reliable method. Can we do better with
active-active routers and session synchronisation?

Linux supports this using
[conntrackd](http://conntrack-tools.netfilter.org/). However,
conntrackd supports active-active routers on a **flow based** level,
but not on a **packet based** level. The difference is that the
following will not work in active-active routers with conntrackd:

```
#1 Packet (in the original direction) updates state in Router R1 ->
   submit state to R2
#2 Packet (in the reply direction) arrives at Router R2 before the
   state coming from R1 has been digested.

With strict stateful filtering, Packet #2 will be dropped and it will
trigger a retransmission.
```

(quote from Pablo Neira Ayuso, see below for more details)

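For reference, state synchronisation between two routers is configured
in conntrackd roughly like this. This is a sketch modelled on the
example configuration shipped with conntrack-tools; mode, addresses
and interface names are placeholders, check the shipped examples for
the exact syntax:

```
Sync {
        # FTFW: reliable, ack-based state replication
        Mode FTFW {
        }
        # dedicated sync link between the two routers (placeholders)
        Multicast {
                IPv4_address 225.0.0.50
                Group 3780
                IPv4_interface 192.168.100.100
                Interface eth2
        }
}
```
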
Some of you will mumble something like **latency** in their head right
now. If the return packet is guaranteed to arrive after the state has
been synchronised, then everything is fine. However, if the reply is
faster than the state synchronisation, packets will get dropped.

In reality, this works for packets coming from and going to the
Internet. However, in our setup the upstream routers route between
different data center locations, which are in the sub-microsecond
latency area (i.e. LAN speed), because they are interconnected with
dark fiber links.

## Take away

Before moving on to the next blog article, we would like to express
our thanks to Pablo Neira Ayuso, who gave very important input on
session based firewalls and session synchronisation.

So active-active routing does not seem to have a straightforward
solution. Read in the [next blog article](/) how we solved the
challenge in the end.
content/u/blog/datacenterlight-ipv6-only-netboot/contents.lr (new file, 219 lines)
@@ -0,0 +1,219 @@
title: IPv6 only netboot in Data Center Light
---
pub_date: 2021-05-01
---
author: Nico Schottelius
---
twitter_handle: NicoSchottelius
---
_hidden: no
---
_discoverable: no
---
abstract:
How we switched from IPv4 netboot to IPv6 netboot
---
body:

In our [previous blog
article](/u/blog/datacenterlight-spring-network-cleanup)
we wrote about our motivation for the
big spring network cleanup. In this blog article we show how we
started reducing the complexity by removing our dependency on IPv4.

## IPv6 first

If you have found our blog, you are probably aware: everything at
ungleich is IPv6 first. Many of our networks are IPv6 only, all DNS
entries for remote access have IPv6 (AAAA) entries and there are only
rare exceptions when we utilise IPv4.

## IPv4 only netboot

One of the big exceptions to this paradigm used to be how we boot our
servers. Because our second big paradigm is sustainability, we use a
lot of 2nd (or 3rd) generation hardware. We actually share this
passion with our friends from
[e-durable](https://recycled.cloud/), because sustainability is
something that we need to employ today, not tomorrow.
But back to the netbooting topic: for netbooting we have so far
mainly relied on onboard network cards.

## Onboard network cards

We used these network cards for multiple reasons:

* they exist in virtually any server
* they usually have a ROM containing a PXE capable firmware
* they allow us to split real traffic (on the fiber cards) from
  internal traffic

However, using the onboard devices also comes with a couple of
disadvantages:

* their ROM is often outdated
* it requires additional cabling

## Cables

Let's have a look at the cabling situation first. Virtually all of
our servers are connected to the network using 2x 10 Gbit/s fiber cards.

On one side this provides a fast connection, but on the other side
it provides us with something even better: distance.

Our data centers employ a non-standard design due to the re-use of
existing factory halls. This means distances between servers and
switches can be up to 100m. With fiber, we can easily achieve these
distances.

Additionally, having fewer cables gives us a simpler infrastructure
that is easier to analyse.

## Reducing complexity 1

So can we somehow get rid of the copper cables and switch to fiber
only? It turns out that the fiber cards we use (mainly Intel X520s)
have their own ROM. So we started disabling the onboard network cards
and tried booting from the fiber cards. This worked until we wanted to
move the lab setup to production...

## Bonding (LACP) and VLAN tagging

Our servers use bonding (802.3ad) for redundant connections to the
switches and VLAN tagging on top of the bonded devices to isolate
client traffic. On the switch side we realised this using
configurations like:

```
interface Port-Channel33
   switchport mode trunk
   mlag 33

...
interface Ethernet33
   channel-group 33 mode active
```

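On the server side, the matching setup looks roughly like this once
Linux is running (a sketch using iproute2; interface names are
placeholders for our fiber ports):

```
# create an 802.3ad (LACP) bond over the two fiber ports
ip link add bond0 type bond mode 802.3ad
ip link set eth0 down; ip link set eth0 master bond0
ip link set eth1 down; ip link set eth1 master bond0

# VLAN tagging on top of the bond, e.g. VLAN 10 for the boot network
ip link add link bond0 name bond0.10 type vlan id 10

ip link set eth0 up; ip link set eth1 up
ip link set bond0 up
ip link set bond0.10 up
```
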
But that does not work if the network card's ROM does not bring up an
LACP enabled link at boot time, on top of which it would then do the
VLAN tagging. The ROM in our network cards **would** have allowed
plain VLAN tagging, though.

To fix this problem, we reconfigured our switches as follows:

```
interface Port-Channel33
   switchport trunk native vlan 10
   switchport mode trunk
   port-channel lacp fallback static
   port-channel lacp fallback timeout 20
   mlag 33
```

This basically does two things:

* if there are no LACP frames, fall back to a static (non-LACP)
  configuration
* accept untagged traffic and map it to VLAN 10 (one of our boot
  networks)

Great, our servers can now netboot from fiber! But we are not done
yet...

## IPv6 only netbooting

So how do we convince these network cards to do IPv6 netboot? Can we
actually do that at all? Our first approach was to put a custom build of
[iPXE](https://ipxe.org/) on a USB stick. We generated that
iPXE image using the **rebuild-ipxe.sh** script
from the
[ungleich-tools](https://code.ungleich.ch/ungleich-public/ungleich-tools)
repository. It turns out that a USB stick works pretty well in most
situations.

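Such an iPXE build typically embeds a small script that configures the
network and then chainloads its real boot script over HTTP. A minimal
sketch (the URL matches the dhcpd configuration shown later in this
article; the embedded script itself is illustrative):

```
#!ipxe
dhcp
chain http://[2a0a:e5c0:0:6::46]/ipxescript
```
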
## ROMs are not ROMs

As you can imagine, the ROM of the X520 cards does not contain IPv6
netboot support. So are we back at square one? No, we are not, because
the X520s have something that the onboard devices did not
consistently have: **a rewritable memory area**.

Let's take two steps back here first: a ROM is a **read only memory**
chip. Emphasis on **read only**. However, modern network cards and a
lot of devices that support on-device firmware actually have a
memory (flash) area that can be written to. And that is what aids us
in our situation.

## ipxe + flbtool + x520 = fun

Trying to write iPXE into the X520 cards initially failed, because the
network card did not recognise the format of the iPXE rom file.

Luckily the folks in the iPXE community already spotted that problem
AND fixed it: the format used in these cards is called FLB. And there
is [flbtool](https://github.com/devicenull/flbtool/), which allows you
to wrap the iPXE rom file into the FLB format. For those who want to
try it themselves (at your own risk!), it basically involves the
following steps (a shell sketch follows the list):

* Get the current ROM from the card (try bootutil64e)
* Extract the contents from the rom using flbtool
* This will output some sections/parts
* Locate one part that you want to overwrite with iPXE (a previous PXE
  section is very suitable)
* Replace the .bin file with your iPXE rom
* Adjust the .json file to match the length of the new binary
* Build a new .flb file using flbtool
* Flash it onto the card

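In shell form the session has roughly this shape. The flbtool and
bootutil64e invocations below are purely illustrative placeholders
(check the respective READMEs for the real flags); only the overall
flow matches the list above:

```
# 1) dump the current ROM from the card (flags illustrative)
bootutil64e -NIC=1 -SAVEIMAGE -FILE=x520.flb

# 2) unpack the FLB container into parts + metadata (placeholder syntax)
flbtool extract x520.flb parts/

# 3) replace the PXE part with the iPXE build and fix up the length
cp ipxe.rom parts/part3.bin
$EDITOR parts/part3.json        # adjust the length field

# 4) repack and flash back onto the card (flags illustrative)
flbtool build parts/ x520-ipxe.flb
bootutil64e -NIC=1 -FILE=x520-ipxe.flb
```
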
While this is a bit of work, it is worth it for us, because...

## IPv6 only netboot over fiber

With the modified ROM, which basically loads iPXE at start, we can now
boot our servers in IPv6 only networks. On the infrastructure side, we
added two **tiny** things:

We use ISC dhcpd with the following configuration file:

```
option dhcp6.bootfile-url code 59 = string;

option dhcp6.bootfile-url "http://[2a0a:e5c0:0:6::46]/ipxescript";

subnet6 2a0a:e5c0:0:6::/64 {}
```

(that is the complete configuration!)

And we use radvd to set the "other configuration" flag, indicating
that clients can query the DHCPv6 server for additional information:

```
interface bond0.10
{
    AdvSendAdvert on;
    MinRtrAdvInterval 3;
    MaxRtrAdvInterval 5;
    AdvDefaultLifetime 600;

    # IPv6 netbooting
    AdvOtherConfigFlag on;

    prefix 2a0a:e5c0:0:6::/64 { };

    RDNSS 2a0a:e5c0:0:a::a 2a0a:e5c0:0:a::b { AdvRDNSSLifetime 6000; };
    DNSSL place5.ungleich.ch { AdvDNSSLLifetime 6000; };
};
```

## Take away

Being able to reduce cabling was one big advantage in the beginning.

Switching to IPv6 only netboot does not seem like a big simplification
at first, besides being able to remove IPv4 from the server
networks.

However, as you will see in
[the next blog post](/u/blog/datacenterlight-active-active-routing/),
switching to IPv6 only netbooting is actually a key element in
reducing complexity in our network.
content/u/blog/datacenterlight-spring-network-cleanup/contents.lr (new file, 161 lines)
@@ -0,0 +1,161 @@
title: Data Center Light: Spring network cleanup
---
pub_date: 2021-05-01
---
author: Nico Schottelius
---
twitter_handle: NicoSchottelius
---
_hidden: no
---
_discoverable: no
---
abstract:
Where we started: the original router setup in Data Center Light
---
body:

## Introduction

Spring is the time for cleanup. Cleaning up your apartment, removing
dust from the cabinet, letting the light shine through the windows,
or, like in our case: improving the networking situation.

In this article we give an introduction of where we started and what
the typical setup used to be in our data center.

## Best practice

When we started [Data Center Light](https://datacenterlight.ch) in
2017, we oriented ourselves towards "best practice" networking. We
started with IPv6 only networks and used an RFC1918 network (10/8)
for internal IPv4 routing.

And we started with two routers for every network to provide
redundancy.

## Router redundancy

So what do you do when you have two routers? In the Linux world the
software [keepalived](https://keepalived.org/)
is very popular for providing redundant routing
using the [VRRP protocol](https://en.wikipedia.org/wiki/Virtual_Router_Redundancy_Protocol).

## Active-Passive

While VRRP is designed to allow multiple (not only two) routers to
co-exist in a network, its design is basically active-passive: you
have one active router and n passive routers, in our case one
additional router.

## Keepalived: a closer look

A typical keepalived configuration in our network looked like this:

```
vrrp_instance router_v4 {
    interface INTERFACE
    virtual_router_id 2
    priority PRIORITY
    advert_int 1
    virtual_ipaddress {
        10.0.0.1/22 dev eth1.5 # Internal
    }

    notify_backup "/usr/local/bin/vrrp_notify_backup.sh"
    notify_fault "/usr/local/bin/vrrp_notify_fault.sh"
    notify_master "/usr/local/bin/vrrp_notify_master.sh"
}

vrrp_instance router_v6 {
    interface INTERFACE
    virtual_router_id 1
    priority PRIORITY
    advert_int 1
    virtual_ipaddress {
        2a0a:e5c0:1:8::48/128 dev eth1.8 # Transfer for routing from outside
        2a0a:e5c0:0:44::7/64 dev bond0.18 # zhaw
        2a0a:e5c0:2:15::7/64 dev bond0.20 #
    }
}
```

This is a template that we distribute via [cdist](https://cdi.st). The
strings INTERFACE and PRIORITY are replaced by cdist. The interface
field defines which interface to use for VRRP communication and the
priority field determines which of the routers is the active one.

So far, so good. However, let's have a look at a tiny detail of this
configuration file:

```
notify_backup "/usr/local/bin/vrrp_notify_backup.sh"
notify_fault "/usr/local/bin/vrrp_notify_fault.sh"
notify_master "/usr/local/bin/vrrp_notify_master.sh"
```

These three lines basically say: "start something if you are the
master" and "stop something in case you are not". And why did we do
this? Because of stateful services.

## Stateful services

A typical shell script that we would call contains lines like this:

```
/etc/init.d/radvd stop
/etc/init.d/dhcpd stop
```

(or start, in the case of the master version)

In earlier days, this even contained openvpn, which was running on our
first generation routers. But more about OpenVPN later.

The reason why we stopped and started dhcpd and radvd was to make
clients of the network use the active router. We used radvd to provide
IPv6 addresses as the primary access method to servers. And we used
dhcpd mainly to allow servers to netboot. The active router would
carry state (firewall!) and thus the flow of packets always needs to
go through the active router.

Restarting radvd on a different machine keeps the IPv6 addresses the
same, as clients assign them themselves using EUI-64. In the case of
dhcp (IPv4) we could have used hardcoded IPv4 addresses via a mapping
of MAC address to IPv4 address, but we opted against this. The main
reason is that dhcp clients re-request their previous lease, and even
if an IPv4 address changes, it is not really of importance.

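As a short illustration of why the IPv6 addresses stay stable, this is
how EUI-64 derives an interface identifier (example MAC address and
documentation prefix, not from our network):

```
MAC address:        00:11:22:33:44:55
Insert ff:fe:       00:11:22:ff:fe:33:44:55
Flip the U/L bit:   02:11:22:ff:fe:33:44:55

With prefix 2001:db8::/64 the resulting address is:
                    2001:db8::211:22ff:fe33:4455
```

The address only depends on the prefix and the server's MAC address,
so it does not matter which router sends the router advertisement.
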
During a failover this leads to a few seconds of interruption and to
re-established sessions. Given that routers are usually rather stable
and restarting them is not a daily task, we initially accepted this.

## Keepalived/VRRP changes

One of the more tricky things is changes to keepalived. Because
keepalived uses the *number of addresses and routes* to verify
that a received VRRP packet matches its configuration, adding or
deleting IP addresses and routes causes a problem:

While one router is being updated, the number of IP addresses or
routes differs between the two. This causes both routers to ignore the
other's VRRP messages, and both routers think they should be the
master process.

This leads to the problem that both routers receive client and outside
traffic. If a packet leaves through router1, but the answer comes back
through router2, the firewall (nftables) on router2 does not recognise
the returning packet and, because nftables is configured *stateful*,
drops it.

However, not only configuration changes can trigger this problem, but
also any communication problem between the two routers. Since 2017 we
have experienced multiple times that keepalived was unable to receive
or send messages from or to the other router, and thus both routers
again became the master.

## Take away

While in theory keepalived should improve reliability, in practice
the number of problems we had due to double-master situations made us
question whether the keepalived concept is the fitting one for us.

You can read how we evolved from this setup in
[the next blog article](/u/blog/datacenterlight-ipv6-only-netboot/).