[[!meta title="KVM Virtual Machines managed with cdist and sexy @ local.ch"]]

## Introduction

This article describes the KVM setup of [local.ch](http://www.local.ch), which is
managed by [[sexy|software/sexy]] and configured by [[cdist|software/cdist]].

If you haven't so far, you may want to have a look at the
[[Sexy and cdist @ local.ch|sexy-and-cdist-at-local.ch]]
article before continuing to read this one.

## KVM Host configuration

The KVM hosts are Dell R815 servers with CentOS 6.x installed. Why Dell? Because they
offered a good price/value ratio. Why CentOS? Historical
reasons. The hosts received a minimal set of BIOS tuning to improve VM performance:

* Enable the usual virtualisation flags (don't forget to enable the IOMMU!)
* Change the power profile to **Maximum Performance**

Furthermore, as the CentOS kernel is pretty old (2.6.32-279) and
conservatively configured, it needs the following
command line option to enable the IOMMU:

    amd_iommu=on

Not enabling this option degrades performance.
In our case, enabling it reduced the latency of the
application running in the VM by a factor of 10.
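
On CentOS 6 (GRUB legacy) the option is appended to the kernel line in
**/boot/grub/grub.conf**. A minimal sketch; the kernel version and root
device are illustrative, not the actual configuration of our hosts:

    # /boot/grub/grub.conf: append amd_iommu=on to the kernel line
    kernel /vmlinuz-2.6.32-279.el6.x86_64 ro root=/dev/mapper/vg0-root amd_iommu=on

After a reboot, **dmesg | grep -i iommu** should confirm that the IOMMU
has been enabled.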

One big design consideration of the KVM setup at local.ch is to make the
KVM hosts as independent as possible and sensibly fault tolerant. For that reason,
VMs are stored on local storage and hosts are always redundantly connected
to two switches using [LACP](https://en.wikipedia.org/wiki/Link_aggregation).

## KVM Host Network Configuration

[[!img kvm-setup-local.ch-overview.png alt="Overview of KVM setup at local.ch"]]

As can be seen in the picture above, every KVM host is connected to two
**10G Arista switches (7050T-52-R)** using LACP. Besides being capable
of running 10G, the Arista switches are actually pretty neat for the Unix geek,
because they are Linux based with an
[FPGA](https://en.wikipedia.org/wiki/Field-programmable_gate_array)
attached. Furthermore, you can easily
gain access to a shell by typing **enable** followed by **bash**.

The Arista switches are connected to each other with 2x 10G links, over which LACP+MLAG
is configured. This gives us the ability to connect every KVM host with LACP to two
**different** switches: the switches use MLAG to synchronise their LACP states.
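
For the curious, the switch side of such a setup roughly looks like the
following Arista EOS sketch; the interface numbers, domain id and peer
addresses are made up for illustration and are not our actual configuration:

    vlan 4094
       trunk group mlagpeer
    interface Port-Channel10
       switchport mode trunk
       switchport trunk group mlagpeer
    interface Vlan4094
       ip address 10.255.255.1/30
    mlag configuration
       domain-id kvmpair
       local-interface Vlan4094
       peer-address 10.255.255.2
       peer-link Port-Channel10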

On the KVM host, the network is configured as follows:

The two ports of the dual-port 10G card (Intel Corporation 82599EB) are bonded
together into bond0.

    [root@kvm-hw-inx01 network-scripts]# cat /proc/net/bonding/bond0
    Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

    Bonding Mode: IEEE 802.3ad Dynamic link aggregation
    Transmit Hash Policy: layer2 (0)
    MII Status: up
    MII Polling Interval (ms): 0
    Up Delay (ms): 0
    Down Delay (ms): 0

    802.3ad info
    LACP rate: slow
    Aggregator selection policy (ad_select): stable
    Active Aggregator Info:
            Aggregator ID: 3
            Number of ports: 2
            Actor Key: 33
            Partner Key: 30
            Partner Mac Address: 02:1c:73:1b:f5:b2

    Slave Interface: eth4
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 68:05:ca:0b:5b:6a
    Aggregator ID: 3
    Slave queue ID: 0

    Slave Interface: eth5
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 68:05:ca:0b:5b:6b
    Aggregator ID: 3
    Slave queue ID: 0

The following configuration is used to create the bond0 device:

    [root@kvm-hw-inx01 network-scripts]# cat ifcfg-bond0
    DEVICE=bond0
    BOOTPROTO=none
    BONDING_OPTS="mode=802.3ad"
    ONBOOT=yes
    MTU=9000

    [root@kvm-hw-inx01 sysconfig]# cat network-scripts/ifcfg-eth4
    DEVICE="eth4"
    NM_CONTROLLED="yes"
    USERCTL=no
    ONBOOT=yes
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none

    [root@kvm-hw-inx01 sysconfig]# cat network-scripts/ifcfg-eth5
    DEVICE="eth5"
    NM_CONTROLLED="yes"
    USERCTL=no
    ONBOOT=yes
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none

The MTU of the 10G cards has been set to 9000, as the Arista switches support
[Jumbo Frames](https://en.wikipedia.org/wiki/Jumbo_frame).
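
Whether jumbo frames actually pass end-to-end can be verified with a
non-fragmenting ping: 8972 bytes of payload plus 20 bytes of IP and 8 bytes
of ICMP header add up to exactly 9000 (the target address is a placeholder):

    # Fails instead of fragmenting if the path does not support an MTU of 9000
    ping -M do -s 8972 -c 3 10.180.0.1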

Every VM is attached to two different networks:

* PZ: presentation zone (for general traffic) (10.18x.0.0/22 network)
* FZ: filer zone (for NFS and database traffic) (10.18x.64.0/22 network)

Both networks are separated using the VLAN tags 2 (pz) and 3 (fz), which
results in **bond0.2** and **bond0.3**:

    [root@kvm-hw-inx01 network-scripts]# ip l | grep bond
    6: eth4: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP qlen 1000
    7: eth5: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP qlen 1000
    8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP
    139: bond0.2@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP
    140: bond0.3@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP

To keep things simple, each of the two VLAN-tagged (bonded) interfaces is added
to a bridge, to which the VMs are attached later on. The configuration looks like this:

    [root@kvm-hw-inx01 network-scripts]# cat ifcfg-bond0.2
    DEVICE="bond0.2"
    ONBOOT=yes
    VLAN=yes
    BRIDGE=brpz

    [root@kvm-hw-inx01 network-scripts]# cat ifcfg-brpz
    DEVICE=brpz
    TYPE=Bridge
    ONBOOT=yes
    DELAY=0
    NM_CONTROLLED=no
    MTU=9000
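
For reference, the same topology can be built at runtime with iproute2 and
brctl. This is a non-persistent sketch of what the ifcfg files above achieve,
not how the hosts are actually configured:

    # Create the VLAN interface on top of the bond and attach it to a bridge
    ip link add link bond0 name bond0.2 type vlan id 2
    brctl addbr brpz
    brctl addif brpz bond0.2
    ip link set dev bond0.2 up
    ip link set dev brpz mtu 9000 up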

This is how a bridge looks in production (with about 70 lines stripped):

    [root@kvm-hw-inx01 network-scripts]# brctl show
    bridge name     bridge id               STP enabled     interfaces
    brfz            8000.024db29ca91f       no              bond0.3
                                                            tap13
                                                            tap73
                                                            [...]
    brpz            8000.02f6742800b2       no              bond0.2
                                                            tap0
                                                            tap1
                                                            [...]

Summarised, the network configuration of a KVM host looks like this:

    arista1        arista2
       |              |
       [eth4  +  eth5] -> bond0
              /  \
        bond0.2  bond0.3
           /        \
        brpz        brfz
           \        /
          tap1    tap2
             \    /
               VM

## VM configuration

The VM configuration can be found below **/opt/local.ch/sys/kvm**
on every KVM host. Every VM is stored below
**/opt/local.ch/sys/kvm/vm/<vm name>** and contains the following
files:

    [root@kvm-hw-inx03 jira-vm-inx01.intra.local.ch]# ls
    monitor  pid  start  start-on-boot  system-disk  vnc

* monitor: socket to the KVM monitor (see the example below)
* pid: the pid of the VM
* start: the script to start the VM (see below for an example)
* start-on-boot: if this file exists, the VM will be started on boot
* system-disk: the qcow2 image of the system disk
* vnc: socket to the screen of the VM
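
The monitor and vnc files are Unix sockets. The QEMU monitor can, for
example, be reached with socat (assuming it is installed), using the
example VM from above:

    # Attach to the QEMU monitor of a VM; 'info status' shows whether
    # the VM is running. Exit socat with ctrl-c.
    socat - UNIX-CONNECT:/opt/local.ch/sys/kvm/vm/jira-vm-inx01.intra.local.ch/monitor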

With the exception of monitor, pid and vnc, all files are generated by cdist.
The start script of a VM looks like this:

    [root@kvm-hw-inx03 jira-vm-inx01.intra.local.ch]# cat start
    #!/bin/sh
    # Generated shell script - do not modify
    #

    /usr/libexec/qemu-kvm \
        -name jira-vm-inx01.intra.local.ch \
        -enable-kvm \
        -m 8192 \
        -drive file=/opt/local.ch/sys/kvm/vm/jira-vm-inx01.intra.local.ch/system-disk,if=virtio \
        -vnc unix:/opt/local.ch/sys/kvm/vm/jira-vm-inx01.intra.local.ch/vnc \
        -cpu host \
        -boot order=nc \
        -pidfile "/opt/local.ch/sys/kvm/vm/jira-vm-inx01.intra.local.ch/pid" \
        -monitor "unix:/opt/local.ch/sys/kvm/vm/jira-vm-inx01.intra.local.ch/monitor,server,nowait" \
        -net nic,macaddr=00:16:3e:02:00:ab,model=virtio,vlan=200 \
        -net tap,script=/opt/local.ch/sys/kvm/bin/ifup-pz,downscript=/opt/local.ch/sys/kvm/bin/ifdown,vlan=200 \
        -net nic,macaddr=00:16:3e:02:00:ac,model=virtio,vlan=300 \
        -net tap,script=/opt/local.ch/sys/kvm/bin/ifup-fz,downscript=/opt/local.ch/sys/kvm/bin/ifdown,vlan=300 \
        -smp 4
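
The ifup-pz, ifup-fz and ifdown scripts referenced by the **-net tap**
parameters are not shown in this article. A minimal sketch of what a qemu
ifup script for the pz network typically does, assuming the bridge names
from above (this is not the actual local.ch script):

    #!/bin/sh
    # qemu calls this script with the tap device name as $1:
    # bring the interface up with jumbo MTU and attach it to the pz bridge
    ip link set dev "$1" mtu 9000 up
    brctl addif brpz "$1"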

Most parameter values depend on the output of sexy,
which uses the cdist type **__localch_kvm_vm**,
which in turn assembles this start script.
The above script may be useful for one or more of my readers,
as it includes a lot of the tuning we have done to KVM.

## Automatic startup of VMs

The virtual machines are brought up by an init script located at
**/etc/init.d/kvm-vms**. As every VM contains its own startup script
and a marker file indicates whether it should be started at boot, the
init script is pretty simple:

    basedir=/opt/local.ch/sys/kvm/vm

    broken_lock_file_for_centos=/var/lock/subsys/kvm-vms

    case "$1" in
    start)
        cd "$basedir"

        # Specific VM given on the command line?
        if [ "$2" ]; then
            vm_list=$2
        else
            vm_list=$(ls)
        fi

        for vm in $vm_list; do
            vm_base_dir="$basedir/$vm"
            start_script="$vm_base_dir/start"

            # Skip machines which should not be started at boot
            if [ ! -f "$vm_base_dir/start-on-boot" ]; then
                continue
            fi

            echo "Starting VM $vm ..."
            logger -t kvm-vms "Starting VM $vm ..."
            screen -d -m -S "$vm" "$start_script"
        done

        touch "$broken_lock_file_for_centos"
        ;;

As you can see, every VM is started in its own
[screen](http://www.gnu.org/software/screen/) - so if screen decides to
hang up, only one VM is affected.
Furthermore, screen can only serve a limited number of windows per session,
which is another reason to use one session per VM.
The process listing for a running virtual machine looks like this:

    root     64611  0.0  0.0  118840     852 ?      Ss  Mar11    0:00 SCREEN -d -m -S binarypool-vm-inx02.intra.local.ch /opt/local.ch/sys/kvm/vm/binarypool-vm-inx02.intra.local.ch/start
    root     64613  0.0  0.0  106092    1180 pts/22 Ss+ Mar11    0:00 /bin/sh /opt/local.ch/sys/kvm/vm/binarypool-vm-inx02.intra.local.ch/start
    root     64614  2.9  2.2 9106828 5819748 pts/22 Sl+ Mar11 5221:41 /usr/libexec/qemu-kvm -name binarypool-vm-inx02.intra.local.ch -enable-kvm -m 8192 -drive file=/opt/local.ch/sys/kvm/vm/binarypool-vm-inx02.intra.local.ch/system-disk,if=virtio -vnc unix:/opt/local.ch/sys/kvm/vm/binarypool-vm-inx02.intra.local.ch/vnc -cpu host -boot order=nc -pidfile /opt/local.ch/sys/kvm/vm/binarypool-vm-inx02.intra.local.ch/pid -monitor unix:/opt/local.ch/sys/kvm/vm/binarypool-vm-inx02.intra.local.ch/monitor,server,nowait -net nic,macaddr=00:16:3e:02:00:7f,model=virtio,vlan=200 -net tap,script=/opt/local.ch/sys/kvm/bin/ifup-pz,downscript=/opt/local.ch/sys/kvm/bin/ifdown,vlan=200 -net nic,macaddr=00:16:3e:02:00:80,model=virtio,vlan=300 -net tap,script=/opt/local.ch/sys/kvm/bin/ifup-fz,downscript=/opt/local.ch/sys/kvm/bin/ifdown,vlan=300 -smp 4
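
As each screen session is named after the VM, the console of a VM can be
reached by reattaching to its session (detach again with ctrl-a d, without
stopping the VM):

    screen -ls
    screen -r binarypool-vm-inx02.intra.local.ch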
## Common Tasks

The following sections show you how to do regular maintenance
tasks on the KVM infrastructure.
### Create a VM

VMs can easily be created using the script **vm/create-vm** from the sysadmin-logs repository
(internal to local.ch), which looks like this:

    sexy host add --type vm $fqdn
    sexy host vm-host-set --vm-host $vmhost $fqdn
    sexy host disk-add --size $disksize $fqdn
    sexy host memory-set --memory $memory $fqdn
    sexy host cores-set --cores $cores $fqdn

    mac_pz=$(sexy mac generate)
    mac_fz=$(sexy mac generate)
    sexy host nic-add $fqdn -m $mac_pz -n pz
    sexy host nic-add $fqdn -m $mac_fz -n fz

    sexy net-ipv4 host-add "$net_pz" -m "$mac_pz" -f "$fqdn"
    sexy net-ipv4 host-add "$net_fz" -m "$mac_fz" -f "$fz_fqdn"

    echo "Updating git / github ..."
    cd ~/.sexy
    git add db
    git commit -m "Added host $fqdn"
    git pull
    git push

    # Apply changes: first network, so dhcp & dns are ok, then create VM
    cat << eof
    Todo for apply:
    sexy net-ipv4 apply --all
    sexy host apply --all

    Start VM on $vmhost: ssh $vmhost /opt/local.ch/sys/kvm/vm/$fqdn/start
    eof
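
The variables used by the script are set before the snippet shown above.
Purely illustrative example values (the fz fqdn scheme is hypothetical):

    fqdn=example-vm-inx01.intra.local.ch
    fz_fqdn=example-vm-inx01.fz.intra.local.ch
    vmhost=kvm-hw-inx01.intra.local.ch
    disksize=20G
    memory=8192
    cores=4
    net_pz=10.180.0.0/22
    net_fz=10.180.64.0/22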
### Delete a VM

Run the script **remove-host**, which essentially does the following:

* Remove various monitoring / backup configurations
* Detect if it is a VM; if so:
    * Stop it
    * Remove it from the host
    * Add its mac address to the list of free mac addresses
* Delete host from the networks
* Delete host from the sexy database
### Move VM to another server

To move a VM to another host, the following steps are necessary:

* sexy host vm-host-set ... # to new host
* stop vm
* scp/rsync directory from old host to new host (see the sketch below)
* sexy host apply --all # record db change
* start vm on new host
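
Step three might look like this; the hostnames and the VM name are
placeholders:

    # Run on the new KVM host; the VM must be stopped first so the
    # disk image is consistent
    rsync -av old-host:/opt/local.ch/sys/kvm/vm/example-vm-inx01.intra.local.ch/ \
        /opt/local.ch/sys/kvm/vm/example-vm-inx01.intra.local.ch/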
[[!tag cdist localch net sexy unix]]