LPI 304: Virtualization and High Availability

In this document you will find information covering the different objectives of the LPIC 304 exam. Before using this document, check on the LPI site whether the objectives are still the same. This document is provided as a study aid and is in no way a guarantee of passing the exam. Try to gain some practical experience and really understand how the concepts work; that should help.

Topic 330: Virtualization

330.1 Virtualization Concepts and Theory (weight: 10)

Candidates should know and understand the general concepts, theory and terminology of Virtualization. This includes Xen and KVM terminology. Key Knowledge Areas:

  • Terminology
  • Pros and Cons of Virtualization
  • Variations of Virtual Machine Monitors


The following is a partial list of the used files, terms and utilities:

  • Hypervisor
  • HVM (Hardware Virtual Machine)
  • PV (Paravirtualization)
  • domains
  • emulation and simulation
  • CPU flags

Terminology

Hypervisor
In computing, a hypervisor, also called virtual machine manager (VMM), is one of many hardware virtualization techniques allowing multiple operating systems, termed guests, to run concurrently on a host computer. It is so named because it is conceptually one level higher than a supervisory program. The hypervisor presents to the guest operating systems a virtual operating platform and manages the execution of the guest operating systems. Multiple instances of a variety of operating systems may share the virtualized hardware resources. Hypervisors are very commonly installed on server hardware, with the function of running guest operating systems that themselves act as servers.

HVM (Hardware Virtual Machine)
In computing, hardware-assisted virtualization is a platform virtualization approach that enables efficient full virtualization using help from hardware capabilities, primarily from the host processors. Full virtualization is used to simulate a complete hardware environment, or virtual machine, in which an unmodified guest operating system (using the same instruction set as the host machine) executes in complete isolation. Hardware-assisted virtualization was added to x86 processors (Intel VT-x or AMD-V) in 2006.

Hardware-assisted virtualization is also known as accelerated virtualization; Xen calls it hardware virtual machine (HVM), Virtual Iron calls it native virtualization.

PV (Paravirtualization)
In computing, paravirtualization is a virtualization technique that presents a software interface to virtual machines that is similar but not identical to that of the underlying hardware.

The intent of the modified interface is to reduce the portion of the guest's execution time spent performing operations which are substantially more difficult to run in a virtual environment compared to a non-virtualized environment. The paravirtualization provides specially defined 'hooks' to allow the guest(s) and host to request and acknowledge these tasks, which would otherwise be executed in the virtual domain (where execution performance is worse). A successful paravirtualized platform may allow the virtual machine monitor (VMM) to be simpler (by relocating execution of critical tasks from the virtual domain to the host domain), and/or reduce the overall performance degradation of machine-execution inside the virtual-guest.

Paravirtualization requires the guest operating system to be explicitly ported for the para-API: a conventional OS distribution which is not paravirtualization-aware cannot be run on top of a paravirtualizing VMM. However, even in cases where the operating system cannot be modified, components may still be available that enable many of the significant performance advantages of paravirtualization; for example, the XenWindowsGplPv project provides a kit of paravirtualization-aware device drivers, licensed under the terms of the GPL, that are intended to be installed into a Microsoft Windows virtual-guest running on the Xen hypervisor.

domains
Domain or virtual machine. A virtual machine (VM) is a software implementation of a machine (i.e. a computer) that executes programs like a physical machine. Virtual machines are separated into two major categories, based on their use and degree of correspondence to any real machine. A system virtual machine provides a complete system platform which supports the execution of a complete operating system (OS). In contrast, a process virtual machine is designed to run a single program, which means that it supports a single process. An essential characteristic of a virtual machine is that the software running inside is limited to the resources and abstractions provided by the virtual machine—it cannot break out of its virtual environment.

A virtual machine was originally defined by Popek and Goldberg as “an efficient, isolated duplicate of a real machine”. Current use includes virtual machines which have no direct correspondence to any real hardware.

emulation and simulation
In integrated circuit design, hardware emulation is the process of imitating the behavior of one or more pieces of hardware (typically a system under design) with another piece of hardware, typically a special purpose emulation system. The emulation model is usually based on RTL (e.g. Verilog) source code, which is compiled into the format used by the emulation system. The goal is normally debugging and functional verification of the system being designed. Often an emulator is fast enough to be plugged into a working target system in place of a yet-to-be-built chip, so the whole system can be debugged with live data. This is a specific case of in-circuit emulation.

Sometimes hardware emulation can be confused with hardware devices such as expansion cards with hardware processors that assist functions of software emulation, such as older daughterboards with x86 chips to allow x86 OSes to run on motherboards of different processor families.

A computer simulation, a computer model, or a computational model is a computer program, or network of computers, that attempts to simulate an abstract model of a particular system. Computer simulations have become a useful part of mathematical modeling of many natural systems in physics (computational physics), astrophysics, chemistry and biology, human systems in economics, psychology, social science, and engineering. Simulation of a system is represented as the running of the system's model. It can be used to explore and gain new insights into new technology, and to estimate the performance of systems too complex for analytical solutions.

CPU flags
CPU flags indicate the features supported by a CPU. Relevant flags for virtualization are:

  • HVM Hardware support for virtual machines (Xen abbreviation for AMD SVM / Intel VMX)
  • SVM Secure Virtual Machine. (AMD’s virtualization extensions to the 64-bit x86 architecture, equivalent to Intel’s VMX, both also known as HVM in the Xen hypervisor.)
  • VMX Intel’s equivalent to AMD’s SVM

More CPU flags can be found in /proc/cpuinfo.
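
As a quick check (assuming util-linux's lscpu is available; the exact output depends on the CPU), the virtualization capability can be read with:

lscpu | grep -i virtualization
grep -E -c 'vmx|svm' /proc/cpuinfo

A non-zero count from the second command means the CPU advertises VMX (Intel VT-x) or SVM (AMD-V).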

Pros and Cons of Virtualization

System virtual machine advantages:

  • multiple OS environments can co-exist on the same computer, in strong isolation from each other
  • the virtual machine can provide an instruction set architecture (ISA) that is somewhat different from that of the real machine
  • application provisioning, maintenance, high availability and disaster recovery


The main disadvantages of VMs are:

  • a virtual machine is less efficient than a real machine when it accesses the hardware indirectly
  • when multiple VMs are concurrently running on the same physical host, each VM may exhibit varying and unstable performance (in speed of execution, not in results), which depends heavily on the workload imposed on the system by the other VMs, unless proper techniques are used for temporal isolation among virtual machines.


Variations of Virtual Machine Monitors

The virtual machine monitor (VMM) is the software that creates a virtual machine environment in a computer. In a regular, non-virtual environment, the operating system is the master control program, which manages the execution of all applications and acts as an interface between the applications and the hardware. The OS has the highest privilege level in the machine, known as “ring 0”.

In a virtual machine environment, the virtual machine monitor (VMM) becomes the master control program with the highest privilege level, and the VMM manages one or more operating systems, now referred to as “guest operating systems.” Each guest OS manages its own applications as it normally does in a non-virtual environment, except that it has been isolated in the computer by the VMM. Each guest OS with its applications is known as a “virtual machine” and is sometimes called a “guest OS stack.”

Prior to the introduction of hardware support for virtualization, the VMM could only use software techniques for virtualizing x86 processors and providing virtual hardware. This software approach used binary translation (BT) for instruction set virtualization and shadow page tables for memory management unit virtualization. Today, both Intel and AMD provide hardware support for CPU virtualization with Intel VT-x and AMD-V, respectively. More recently they added support for memory management unit (MMU) virtualization with Intel EPT and AMD RVI. In the rest of this section, hardware support for CPU virtualization is referred to as hardware virtualization (HV), hardware support for MMU virtualization as hwMMU, and software memory management unit virtualization as swMMU.

For some guests and hardware configurations the VMM may choose to virtualize the CPU and MMU using:

  • no hardware support (BT + swMMU),
  • HV and hwMMU (VT-x + EPT),
  • HV only (VT-x + swMMU).

The method of virtualization that the VMware VMM chooses for a particular guest on a certain platform is known as the monitor execution mode or simply monitor mode. On modern x86 CPUs the VMM has an option of choosing from several possible monitor modes. However, not all modes provide similar performance. A lot depends on the available CPU features and the guest OS behavior. VMware ESX identifies the hardware platform and chooses a default monitor mode for a particular guest on that platform. This decision is made by the VMM based on the available CPU features on a platform and the guest behavior on that platform.
Source: Virtual Machine Monitor Execution Modes in VMware vSphere 4.0

330.2 Xen (weight: 10)

Candidates should be able to install, configure, maintain and troubleshoot Xen installations. The following is a partial list of the used files, terms and utilities:

  • Xen w/Intel VT
  • Xen w/AMD-V
  • Dom0 DomU GuestOS HostOS
  • xm
  • /etc/xen
  • xmdomain.cfg
  • xentop

Installing and configuring Xen

Xen w/Intel VT / Xen w/AMD-V
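
Before creating HVM guests, it is worth checking that Xen has detected the hardware virtualization extensions (Intel VT-x or AMD-V). One way to do this from dom0, assuming the xm toolstack is installed, is:

xm info | grep xen_caps

If the output lists hvm capabilities (for example hvm-3.0-x86_64), fully virtualized (HVM) guests can be created; otherwise only PV guests are possible.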

Manually creating a PV Guest VM

In this section we will focus on Paravirtualized (or PV) guests. PV guests are guests that are made Xen-aware and therefore can be optimized for Xen.

As a simple example we'll create a PV guest in an LVM logical volume (LV) by doing a network installation of Ubuntu (other distros such as Debian, Fedora, and CentOS can be installed in a similar way).

sudo pvs

choose your VG
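
The volume groups available on the host can be listed with, for example:

sudo vgs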

create LV

sudo lvcreate -L 4G -n ubuntu /dev/<VG>

Set up the initial guest configuration: /etc/xen/ubuntu.cfg

name = "ubuntu"

memory = 256

disk = ['phy:/dev/<VG>/ubuntu,xvda,w']
vif = [' ']

kernel = "/var/lib/xen/images/ubuntu-netboot/vmlinuz"
ramdisk = "/var/lib/xen/images/ubuntu-netboot/initrd.gz"
extra = "debian-installer/exit/always_halt=true -- console=hvc0"

Start the VM and connect to console (-c):

sudo xm create /etc/xen/ubuntu.cfg -c

Do the install.

Once installed, we can use pygrub as the bootloader.

sudo ln -s /usr/lib/xen-4.1/bin/pygrub /usr/bin/pygrub

Once the install is done, the VM will shutdown. Next change the guest config, /etc/xen/ubuntu.cfg:

name = "ubuntu"
memory = 256
disk = ['phy:/dev/<VG>/ubuntu,xvda,w']
vif = [' ']

bootloader = "pygrub"


#kernel = "/var/lib/xen/images/ubuntu-netboot/amd64/vmlinuz"
#ramdisk = "/var/lib/xen/images/ubuntu-netboot/amd64/initrd.gz"
#extra = "debian-installer/exit/always_halt=true -- console=hvc0"

Start the VM and connect to console (-c):

sudo xm create /etc/xen/ubuntu.cfg -c


Manually installing an HVM Guest VM

sudo pvs

choose your VG

Create a LV

sudo lvcreate -L 4G -n ubuntu-hvm /dev/<VG>

Create a guest config file /etc/xen/ubuntu-hvm.cfg

builder = "hvm"
name = "ubuntu-hvm"
memory = "512"
vcpus = 1
vif = ['']
disk = ['phy:/dev/<VG>/ubuntu-hvm,hda,w','file:/root/ubuntu-12.04-desktop-amd64.iso,hdc:cdrom,r']
vnc = 1
boot="dc"

Create the guest and connect to it over VNC:

xm create /etc/xen/ubuntu-hvm.cfg
vncviewer localhost:0

After the install you can optionally remove the CDROM from the config and/or change the boot order.
For example /etc/xen/ubuntu-hvm.cfg:

builder = "hvm"
name = "ubuntu-hvm"
memory = "512"
vcpus = 1
vif = ['']
#disk = ['phy:/dev/<VG>/ubuntu-hvm,hda,w','file:/root/ubuntu-12.04-server-amd64.iso,hdc:cdrom,r']
disk = ['phy:/dev/<VG>/ubuntu-hvm,hda,w']
vnc = 1
boot="c"
#boot="dc"

Dom0 DomU GuestOS HostOS

Privileged domain (“dom0”) - the only virtual machine which by default has direct access to hardware. From the dom0 the hypervisor can be managed and unprivileged domains (“domU”) can be launched.
The dom0 domain is typically a modified version of Linux, NetBSD or Solaris. User domains may either be unmodified open-source or proprietary operating systems, such as Microsoft Windows (if the host processor supports x86 virtualization, e.g., Intel VT-x and AMD-V), or modified, para-virtualized operating systems with special drivers that support enhanced Xen features.
Host OS = Dom0
Guest OS = DomU

xm

Listing Guest System Status
The status of the host and guest systems may be viewed at any time using the list option of the xm tool. For example:

xm list

The above command will display output containing a line for the host system and a line for each guest similar to the following:

Name                                      ID   Mem VCPUs      State   Time(s)
Domain-0                                   0   389     1     r-----   1414.9
XenFed                                         305     1               349.9
myFedoraXen                                    300     1                 0.0
myXenGuest                                 6   300     1     -b----     10.6

The state column uses a single character to specify the current state of the corresponding guest. These are as follows:

    r - running - The domain is currently running and healthy 

    b - blocked - The domain is blocked, and not running or runnable. This can be caused because the domain 
                  is waiting on IO (a traditional wait state) or has gone to sleep because there was nothing 
                  else for it to do. 

    p - paused - The domain has been paused, typically as a result of the administrator running the xm pause
                 command. When in a paused state the domain will still consume allocated resources like memory, 
                 but will not be eligible for scheduling by the Xen hypervisor. 

    s - shutdown - The guest has requested to be shutdown, rebooted or suspended, and the domain is in 
                   the process of being destroyed in response. 

    c - crashed - The domain has crashed. Usually this state can only occur if the domain has been 
                  configured not to restart on crash. 

    d - dying - The domain is in process of dying, but hasn't completely shutdown or crashed. 


Starting a Xen Guest System
A guest operating system can be started using the xm tool combined with the start option followed by the name of the guest operating system to be launched. For example:

su -
xm start myGuestOS


Connecting to a Running Xen Guest System
Once the guest operating system has started, a connection to the guest may be established using either the vncviewer tool or the virt-manager console. To use virt-manager, select Applications→System Tools→Virtual Machine Manager, select the desired system and click Open.

To connect using vncviewer enter the following command in Terminal window:

vncviewer

When prompted for a server enter localhost:5900. A VNC window will subsequently appear containing the running guest system.

Shutting Down a Guest System
The shutdown option of the xm tool is used to shutdown a guest operating system:

xm shutdown guestName

where guestName is the name of the guest system to be shut down.

Note that the shutdown option allows the guest operating system to perform an orderly shutdown when it receives the shutdown instruction. To instantly stop a guest operating system the destroy option may be used (with all the attendant risks of filesystem damage and data loss):

xm destroy myGuestOS


Pausing and Resuming a Guest System
A guest system can be paused and resumed using the xm tool's pause and resume options. For example, to pause a specific system named myXenGuest:

xm pause myXenGuest

Similarly, to resume the paused system:

xm resume myXenGuest

Note that a paused session will be lost if the host system is rebooted. Also, be aware that a paused system continues to reside in memory. To save a session such that it no longer takes up memory and can be restored to its exact state after a reboot, it is necessary to either suspend and resume or save and restore the guest.

Suspending and Resuming a Guest OS
A running guest operating system can be suspended and resumed using the xm utility. When suspended, the current status of the guest operating system is written to disk and removed from system memory. A suspended system may subsequently be restored at any time (including after a host system reboot):

To suspend a guest OS named myGuestOS:

xm suspend myGuestOS

To restore a suspended guest OS:

xm resume myGuestOS


Saving and Restoring Xen Guest Systems
Saving and restoring of a Xen guest operating system is similar to suspending with the exception that the file used to contain the suspended operating system memory image can be specified by the user:

To save a guest:

xm save myGuestOS path_to_save_file

To restore a saved guest operating system session:

xm restore path_to_save_file


Rebooting a Guest System
To reboot a guest operating system:

xm reboot myGuestOS


Configuring the Memory Assigned to a Xen Guest OS
To configure the memory assigned to a guest OS, use the mem-set option of the xm command. For example, the following command reduces the memory allocated to a guest system named myGuestOS to 256 MB:

xm mem-set myGuestOS 256

Note that acceptable memory settings must fall within the maximum memory configured for the domain. This maximum may be increased using the mem-max option of xm.
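
For example, to raise the maximum memory of the same guest to 512 MB and then grow its current allocation to that value (the values are illustrative):

xm mem-max myGuestOS 512
xm mem-set myGuestOS 512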

Migrating a Domain to a Different Host
The migrate option allows a Xen managed domain to be migrated to a different physical server.

In order to use migrate, xend must already be running on the other host machine, and must be running the same version of Xen as the local host system. In addition, the remote host system must have the migration TCP port open and accepting connections from the source host. Finally, there must be sufficient resources for the domain to run (memory, disk space, etc).

xm migrate domainName host

Optional flags available with this command are:

-l, --live           Use live migration.
-p=portnum, --port=portnum
                     Use specified port for migration.
-r=MBIT, --resource=MBIT
                     Set level of resource usage for migration.
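
For example, a live migration of a running domain to another host (the host name is illustrative; xend on the destination must be configured to accept relocation requests):

xm migrate --live myGuestOS xenhost2.example.com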

/etc/xen

Directory where the Xen configuration is stored.
This is also the location where the configuration files of the individual VMs (domains) are stored.

xmdomain.cfg

The operating parameters that you must modify reside within the xmdomain.cfg file, which is located in the /etc/xen directory. Here are the parameters you can set in the xmdomain.cfg configuration file:

  • kernel - the fully qualified path to the kernel image
  • ramdisk - the fully qualified path to the initrd for the initial ramdisk
  • memory - the amount of RAM (in MB) to allocate to the domain when it starts
  • name - the unique name for the domain
  • root - the root device for the domain
  • nic - the number of network interface cards for the domain (default is 1)
  • disk - an array of block device stanzas; each stanza specifies the backend-dev (the device in the backend domain that is exported to the guest), the frontend-dev (how the device appears in the guest domain) and the mode (device access mode)
  • vif - an array of virtual interface stanzas (each stanza represents a set of name=value options)
  • builder - the builder that constructs the domain (default is linux)
  • cpu - the CPU on which to start the domain; 0 indicates the first CPU, 1 the second, and so on (default is -1)
  • cpus - the CPUs on which the domain's VCPUs are allowed to execute
  • extra - additional information to append to the end of the kernel parameter line
  • nfs_server - the IP address of the NFS server to use for the root device
  • nfs_root - the root directory on the NFS server, as a fully qualified path
  • vcpus - the number of virtual CPUs to allocate to the domain (default is 1)
  • on_shutdown - the action taken when a graceful shutdown (or xm shutdown) is triggered from inside the DomU
  • on_reboot - the action taken when a graceful reboot (or an xm reboot) is triggered from inside the DomU
  • on_crash - the action taken when the DomU crashes


Example:

kernel = "/boot/vmlinuz-2.6-xenU"
memory = 128
name = "MyLinux"
root = "/dev/hda1 ro"
disk = [ "file:/var/xen/mylinux.img,hda1,w" ]

xentop

The xentop utility is included with Xen. It displays real-time information about a Xen system and its running domains, using a semi-graphical interface to present the details in a friendlier format.

 OPTIONS

-h, --help
Show help and exit

-V, --version
Show version information and exit

-d, --delay=SECONDS
Seconds between updates (default 3)

-n, --networks
Show network information

-x, --vbds
Show VBD (virtual block device) data

-r, --repeat-header
Repeat table header before each domain

-v, --vcpus
Show VCPU data

-b, --batch
Redirect output data to stdout (batch mode)

-i, --iterations=ITERATIONS
Maximum number of updates that xentop should produce before ending
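
A typical invocation (purely as an example) that refreshes every second and also shows network and VCPU data:

xentop -d 1 -n -v
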
INTERACTIVE COMMANDS

All interactive commands are case-insensitive.

D
Set delay between updates

N
Toggle display of network information

Q, Esc
Quit

R
Toggle table header before each domain

S
Cycle sort order

V
Toggle display of VCPU information

Arrows
Scroll domain display 

330.3 KVM (weight: 7)

Candidates should be able to install, configure, maintain and troubleshoot KVM installations. The following is a partial list of the used files, terms and utilities:

  • /proc/cpuinfo
  • kernel modules: kvm kvm-intel kvm-amd
  • /etc/kvm/
  • kvm-qemu
  • kvm_stat
  • kvm networking
  • kvm monitor
  • kvm storage
  • qemu

/proc/cpuinfo

root@richard:~# cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 Quad CPU           @ 2.40GHz
stepping	: 7
microcode	: 0x66
cpu MHz		: 1596.000
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow
bogomips	: 4799.95
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

How can I tell if I have Intel VT or AMD-V?
With a recent enough Linux kernel, run the command:

egrep '^flags.*(vmx|svm)' /proc/cpuinfo

kernel modules: kvm kvm-intel kvm-amd

KVM (for Kernel-based Virtual Machine) is a full virtualization solution for Linux on x86 hardware containing virtualization extensions (Intel VT or AMD-V). It consists of a loadable kernel module, kvm.ko, that provides the core virtualization infrastructure and a processor specific module, kvm-intel.ko or kvm-amd.ko. KVM also requires a modified QEMU although work is underway to get the required changes upstream.

Using KVM, one can run multiple virtual machines running unmodified Linux or Windows images. Each virtual machine has private virtualized hardware: a network card, disk, graphics adapter, etc.

modprobe kvm
modprobe kvm_intel

or

modprobe kvm
modprobe kvm_amd
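
To verify that the modules are loaded and that the /dev/kvm device node has been created:

lsmod | grep kvm
ls -l /dev/kvm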

Using KVM directly

While the rest of this documentation focuses on using KVM through libvirt, it is also possible to work with KVM directly. This is not the recommended way due to it being cumbersome but can be very useful at times.

KVM is very similar to Qemu and it is possible to run machines from the command line.

The basic syntax is :

kvm -m 512 -hda disk.img -cdrom ubuntu.iso -boot d -smp 2
  • -m = memory (in MB)
  • -hda = first hard drive
    • You can use a number of image file types including .img, .cow
    • You can also boot a hard drive. Be careful with this option as you do not want to boot the host root partition
      • Syntax -hda /dev/sda
      • This will call your grub menu from your MBR when you boot kvm.
  • -cdrom can be an iso image or a CD/DVD drive.
  • -boot [a|c|d|n] boot on floppy (a), hard disk (c), CD-ROM (d), or network (n)
  • -smp = number of CPUs
  • -alt-grab change Ctrl-Alt mouse grab combination for Ctrl-Alt-Shift (very practical if you often use some control key combinations like Ctrl-Alt-Del or Windows-E)

/etc/kvm/

Location for storing vm configuration and scripts.

kvm-qemu

Modified version of the qemu software.

kvm_stat

The kvm_stat command is a python script which retrieves runtime statistics from the kvm kernel module. The kvm_stat command can be used to diagnose guest behavior visible to kvm, in particular performance-related issues with guests. Currently, the reported statistics are for the entire system; the behavior of all running guests is reported.

The kvm_stat command requires that the kvm kernel module is loaded and debugfs is mounted. If either of these features are not enabled, the command will output the required steps to enable debugfs or the kvm module. For example:

# kvm_stat
Please mount debugfs ('mount -t debugfs debugfs /sys/kernel/debug')
and ensure the kvm modules are loaded

Mount debugfs if required:

# mount -t debugfs debugfs /sys/kernel/debug

kvm_stat output
The kvm_stat command outputs statistics for all guests and the host. The output is updated until the command is terminated (using Ctrl+c or the q key).

# kvm_stat

kvm statistics

efer_reload                 94       0
exits                  4003074   31272
fpu_reload             1313881   10796
halt_exits               14050     259
halt_wakeup               4496     203
host_state_reload	1638354   24893
hypercalls                   0       0
insn_emulation         1093850    1909
insn_emulation_fail          0       0
invlpg                   75569       0
io_exits               1596984   24509
irq_exits                21013     363
irq_injections           48039    1222
irq_window               24656     870
largepages                   0       0
mmio_exits               11873       0
mmu_cache_miss           42565       8
mmu_flooded              14752       0
mmu_pde_zapped           58730       0
mmu_pte_updated              6       0
mmu_pte_write           138795       0
mmu_recycled                 0       0
mmu_shadow_zapped        40358       0
mmu_unsync                 793       0
nmi_injections               0       0
nmi_window                   0       0
pf_fixed                697731    3150
pf_guest                279349       0
remote_tlb_flush             5       0
request_irq                  0       0
signal_exits                 1       0
tlb_flush               200190       0

Explanation of variables:

efer_reload

    The number of Extended Feature Enable Register (EFER) reloads.
exits

    The count of all VMEXIT calls.
fpu_reload

    The number of times a VMENTRY reloaded the FPU state. The fpu_reload is incremented when a guest is 
    using the Floating Point Unit (FPU).
halt_exits

    Number of guest exits due to halt calls. This type of exit is usually seen when a guest is idle.
halt_wakeup

    Number of wakeups from a halt.
host_state_reload

    Count of full reloads of the host state (currently tallies MSR setup and guest MSR reads).
hypercalls

    Number of guest hypervisor service calls.
insn_emulation

    Number of guest instructions emulated by the host.
insn_emulation_fail

    Number of failed insn_emulation attempts.
io_exits

    Number of guest exits from I/O port accesses.
irq_exits

    Number of guest exits due to external interrupts.
irq_injections

    Number of interrupts sent to guests.
irq_window

    Number of guest exits from an outstanding interrupt window.
largepages

    Number of large pages currently in use.
mmio_exits

    Number of guest exits due to memory mapped I/O (MMIO) accesses.
mmu_cache_miss

    Number of KVM MMU shadow pages created.
mmu_flooded

    Detection count of excessive write operations to an MMU page. This counts detected
    floods of write operations, not individual write operations.
mmu_pde_zapped

    Number of page directory entry (PDE) destruction operations.
mmu_pte_updated

    Number of page table entry (PTE) update operations.
mmu_pte_write

    Number of guest page table entry (PTE) write operations.
mmu_recycled

    Number of shadow pages that can be reclaimed.
mmu_shadow_zapped

    Number of invalidated shadow pages.
mmu_unsync

    Number of non-synchronized pages which are not yet unlinked.
nmi_injections

    Number of Non-maskable Interrupt (NMI) injections to the guest.
nmi_window

    Number of guest exits from (outstanding) Non-maskable Interrupt (NMI) windows.
pf_fixed

    Number of fixed (non-paging) page table entry (PTE) maps.
pf_guest

    Number of page faults injected into guests.
remote_tlb_flush

    Number of remote (sibling CPU) Translation Lookaside Buffer (TLB) flush requests.
request_irq

    Number of guest interrupt window request exits.
signal_exits

    Number of guest exits due to pending signals from the host.
tlb_flush

    Number of tlb_flush operations performed by the hypervisor.


kvm networking

There are two parts to networking within QEMU:

  • the virtual network device that is provided to the guest (e.g. a PCI network card).
  • the network backend that interacts with the emulated NIC (e.g. puts packets onto the host's network).

There are a range of options for each part.

Creating a network backend
There are a number of network backends to choose from depending on your environment. Create a network backend like this:

-netdev TYPE,id=NAME,...

The id option gives the name by which the virtual network device and the network backend are associated with each other. If you want multiple virtual network devices inside the guest they each need their own network backend. The name is used to distinguish backends from each other and must be used even when only one backend is specified.

Network backend types
In most cases, if you don't have any specific networking requirements other than being able to access a web page from your guest, user networking (slirp) is a good choice. However, if you are looking to run any kind of network service or have your guest participate in a network in any meaningful way, tap is usually the best choice.

User Networking (SLIRP)
This is the default networking backend and generally is the easiest to use. It does not require root / Administrator privileges. It has the following limitations:

  • there is a lot of overhead so the performance is poor
  • ICMP traffic does not work (so you cannot use ping within a guest)
  • the guest is not directly accessible from the host or the external network

User Networking is implemented using “slirp”, which provides a full TCP/IP stack within QEMU and uses that stack to implement a virtual NAT'd network.

You can configure User Networking using the -netdev user command line option.

Adding the following to the qemu command line will change the network configuration to use 192.168.76.0/24 instead of the default (10.0.2.0/24) and will start guest DHCP allocation from 9 (instead of 15):

-netdev user,id=mynet0,net=192.168.76.0/24,dhcpstart=192.168.76.9

You can isolate the guest from the host (and broader network) using the restrict option. For example -netdev user,id=mynet0,restrict=y or -netdev type=user,id=mynet0,restrict=yes will restrict networking to just the guest and any virtual devices. This can be used to prevent software running inside the guest from phoning home while still providing a network inside the guest. You can selectively override this using hostfwd and guestfwd options.
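
As an illustration (the disk image, device model and port numbers are examples), hostfwd can forward a host TCP port to the guest's SSH port:

kvm -m 512 -hda disk.img \
  -netdev user,id=mynet0,hostfwd=tcp::2222-:22 \
  -device e1000,netdev=mynet0

With this running, ssh -p 2222 localhost on the host connects to the guest's SSH server.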

TODO:

-netdev user,id=mynet0,dns=xxx

-netdev user,id=mynet0,tftp=xxx,bootfile=yyy

-netdev user,id=mynet0,smb=xxx,smbserver=yyy

-netdev user,id=mynet0,hostfwd=hostip:hostport-guestip:guestport

-netdev user,id=mynet0,guestfwd=

-netdev user,id=mynet0,host=xxx,hostname=yyy


Tap
The tap networking backend makes use of a tap networking device in the host. It offers very good performance and can be configured to create virtually any type of network topology. Unfortunately, it requires configuration of that network topology in the host which tends to be different depending on the operating system you are using. Generally speaking, it also requires that you have root privileges.

-netdev tap,id=mynet0
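
A more complete example, assuming a tap0 interface has already been created on the host and added to a bridge (the interface, image and device names are examples):

kvm -m 512 -hda disk.img \
  -netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no \
  -device virtio-net-pci,netdev=mynet0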


VDE
The VDE networking backend uses the Virtual Distributed Ethernet infrastructure to network guests. Unless you specifically know that you want to use VDE, it is probably not the right backend to use.

Socket
The socket networking backend, together with QEMU VLANs, allows you to create a network of guests that can see each other. It is primarily useful in extending the network created by the user-mode (slirp) backend to multiple virtual machines. In general, if you want to have multiple guests communicate, tap is a better choice unless you do not have root access to the host environment.

-netdev socket,id=mynet0 


Creating a virtual network device
The virtual network device that you choose depends on your needs and the guest environment (i.e. the hardware that you are emulating). For example, if you are emulating a particular embedded board, then you should use the virtual network device that matches that embedded board's configuration.

On machines that have a PCI bus, there is a wider range of options. The e1000 is the default network adapter in qemu. The rtl8139 is the default network adapter in qemu-kvm. In both projects, the virtio-net (para-virtualised) network adapter has the best performance, but requires special guest driver support.

Use the -device option to add a particular virtual network device to your virtual machine:

-device TYPE,netdev=NAME

The netdev is the name of a previously defined -netdev. The virtual network device will be associated with this network backend.

Note that there are other device options to select alternative devices, or to change some aspect of the device. For example, you want something like: -device DEVNAME,netdev=NET-ID,mac=MACADDR,DEV-OPTS, where DEVNAME is the device (e.g. i82559c for an Intel i82559C Ethernet device), NET-ID is the network identifier to attach the device to (see the discussion of -netdev above), MACADDR is the MAC address for the device, and DEV-OPTS are any additional device options that you may wish to pass (e.g. bus=PCI-BUS,addr=DEVFN to control the PCI device address), if supported by the device.
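
Putting the backend and the device together, a guest using the user-mode backend and a virtio NIC with a fixed MAC address could be started with (all values are illustrative):

kvm -m 512 -hda disk.img \
  -netdev user,id=net0 \
  -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:56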

Use -device ? to get a list of the devices (including network devices) you can add using the -device option for a particular guest. Remember that ? is a shell metacharacter, so you may need to use -device \? on the command-line.

Monitoring Networking
You can monitor the network configuration using info network and info usernet commands.

You can capture network traffic from within qemu using the -net dump command line option. See Stefan Hajnoczi's blog post on this feature.

The legacy -net option
QEMU previously used the -net nic option instead of -device DEVNAME and -net TYPE instead of -netdev TYPE. This is considered obsolete since QEMU 0.12, although it continues to work.

The legacy syntax to create virtual network devices is:

-net nic,model=MODEL

You can use -net nic,model=? to get a list of valid network devices that you can pass to the -net nic option. Note that these model names are different from the -device ? names and are therefore only useful if you are using the -net nic,model=MODEL syntax. [If you'd like to know all of the virtual network devices that are currently provided in QEMU, a search for “NetClientInfo” in the source code may be useful.]

QEMU “VLANs”
The obsolete -net syntax automatically created an emulated hub (called a QEMU “VLAN”, for virtual LAN) that forwards traffic from any device connected to it to every other device on the “VLAN”. It is not an 802.1q VLAN, just an isolated network segment. When creating multiple network devices using the -net syntax, you generally want to specify different vlan ids. The exception is when dealing with the socket backend. For example:

-net user,vlan=0 -net nic,vlan=0 -net user,vlan=1 -net nic,vlan=1  

kvm monitor

When QEMU is running, it provides a monitor console for interacting with QEMU. Through various commands, the monitor allows you to inspect the running guest OS, change removable media and USB devices, take screenshots and audio grabs, and control various aspects of the virtual machine.

The monitor is accessed from within QEMU by holding down the Control and Alt keys and pressing Shift-2. Once in the monitor, Shift-1 switches back to the guest OS. Typing help or ? in the monitor brings up a list of all commands. Alternatively, the monitor can be redirected to a device using the -monitor <dev> command line option. Using -monitor stdio will send the monitor to the standard output, which is most useful when using qemu on the command line.

Help and information

help

  • help [command] or ? [command]

With no arguments, the help command lists all commands available. For more detail about another command, type help command, e.g.

(qemu) help info

On a small screen / VM window, the list of commands will scroll off the screen too quickly to let you read them. To scroll back and forth so that you can read the whole list, hold down the control key and press Page Up and Page Down.

info

  • info option

Show information on some aspect of the guest OS. Available options are:

  • block – block devices such as hard drives, floppy drives, cdrom
  • blockstats – read and write statistics on block devices
  • capture – active capturing (audio grabs)
  • history – console command history
  • irq – statistics on interrupts (if compiled into QEMU)
  • jit – statistics on QEMU's Just In Time compiler
  • kqemu – whether the kqemu kernel module is being utilised
  • mem – list the active virtual memory mappings
  • mice – mouse on the guest that is receiving events
  • network – network devices and VLANs
  • pci – PCI devices being emulated
  • pcmcia – PCMCIA card devices
  • pic – state of i8259 (PIC)
  • profile – info on the internal profiler, if compiled into QEMU
  • registers – the CPU registers
  • snapshots – list the VM snapshots
  • tlb – list the TLB (Translation Lookaside Buffer), i.e. mappings between physical memory and virtual memory
  • usb – USB devices on the virtual USB hub
  • usbhost – USB devices on the host OS
  • uuid – Unique id of the VM
  • version – QEMU version number
  • vnc – VNC information


Devices

change

  • change device setting

The change command allows you to change removable media (like CD-ROMs), change the display options for a VNC, and change the password used on a VNC.

When you need to change the disc in a CD or DVD drive, or switch between different .iso files, find the name of the CD or DVD drive using info and use change to make the change.

(qemu) info block
ide0-hd0: type=hd removable=0 file=/path/to/winxp.img
ide0-hd1: type=hd removable=0 file=/path/to/pagefile.raw
ide1-hd1: type=hd removable=0 file=/path/to/testing_data.img
ide1-cd0: type=cdrom removable=1 locked=0 file=/dev/sr0 ro=1 drv=host_device
floppy0: type=floppy removable=1 locked=0 [not inserted]
sd0: type=floppy removable=1 locked=0 [not inserted]
(qemu) change ide1-cd0 /path/to/my.iso
(qemu) change ide1-cd0 /dev/sr0 host_device

eject

  • eject [-f] device

Use the eject command to release the device or file connected to the removable media device specified. The -f parameter can be used to force it if it initially refuses!

usb_add
Add a host file as a USB flash device (you need to create the host file in advance: dd if=/dev/zero of=/tmp/disk.usb bs=1024k count=32)
usb_add disk:/tmp/disk.usb

usb_del
use info usb to get the usb device list

(qemu)info usb
Device 0.1, Speed 480 Mb/s, Product XXXXXX
Device 0.2, Speed 12 Mb/s, Product XXXXX

(qemu)usb_del 0.2

This deletes the device

sendkey keys
You can emulate keyboard events through sendkey command. The syntax is: sendkey keys. To get a list of keys, type sendkey [tab]. Example: sendkey ctrl-alt-f1

Screen and audio grabs

screendump

  • screendump filename

Capture a screendump and save into a PPM image file.

Virtual machine

commit

  • commit device or commit all

When running QEMU with the -snapshot option, commit changes to the device, or all devices.

quit

  • quit or q

Quit QEMU immediately.

savevm

  • savevm name

Save the virtual machine under the tag 'name'. Not all disk image formats support this: raw does not, but qcow2 does.

loadvm

  • loadvm name

Load the virtual machine tagged 'name'. This can also be done on the command line: -loadvm name

With the info snapshots command, you can request a list of available machines.
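
For example, assuming the guest disk is a qcow2 image (the snapshot name is illustrative):

(qemu) savevm clean-install
(qemu) info snapshots
(qemu) loadvm clean-install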

stop
Suspend execution of VM

cont
Reverse a previous stop command - resume execution of VM.

system_reset
This has an effect similar to the physical reset button on a PC. Warning: Filesystems may be left in an unclean state.

system_powerdown
This has an effect similar to the physical power button on a modern PC. The VM will get an ACPI shutdown request and usually shutdown cleanly.

log

  • log option

Enable logging of the specified items to the QEMU log file (the same items that can be selected with the -d command line option).

logfile

  • logfile filename

Write logs to the specified file instead of the default path, /tmp/qemu.log.

gdbserver
Starts a remote debugger session for the GNU debugger (gdb). To connect to it from the host machine, run the following commands:

shell$ gdb qemuKernelFile
(gdb) target remote localhost:1234

x
x /format address
Displays memory at the specified virtual address using the specified format.
Refer to the xp section for details on format and address.

xp
xp /format address
Displays memory at the specified physical address using the specified format.
format: used to specify the output format of the displayed memory. The format is broken down as /[count][data_format][size]

  • count: number of item to display (base 10)
  • data_format: 'x' for hex, 'd' for decimal, 'u' for unsigned decimal, 'o' for octal, 'c' for char and 'i' for (disassembled) processor instructions
  • size: 'b' for 8 bits, 'h' for 16 bits, 'w' for 32 bits or 'g' for 64 bits. On x86 'h' and 'w' can select instruction disassembly code formats.


address:

  • Direct address, for example: 0x20000
  • Register, for example: $eip

Example - Display 3 instructions on an x86 processor starting at the current instruction:

(qemu) xp /3i $eip

Example - Display the last 20 words on the stack for an x86 processor:

(qemu) xp /20wx $esp

print
Print (or p) evaluates and prints the expression given to it. The result will be printed in hexadecimal, but decimal can also be used in the expression. If the result overflows, it will wrap around. To use the value of a CPU register, use $<register name>. The name of the register should be lower case. You can see registers with the info registers command.
Example of qemu simulating an i386.

(qemu) print 16
0x10
(qemu) print 16 + 0x10
0x20
(qemu) print $eax
0xc02e4000
(qemu) print $eax + 2
0xc02e4000
(qemu) print ($eax + 2) * 2
0x805c8004
(qemu) print 0x80000000 * 2
0

kvm storage

Devices and media:

  • Floppy, CD-ROM, USB stick, SD card, hard disk

Host storage:

  • Flat files (img, iso)
    • Also over NFS
  • CD-ROM host device (/dev/cdrom)
  • Block devices (/dev/sda3, LVM volumes, iSCSI LUNs)
  • Distributed storage (Sheepdog, Ceph)


Supported image formats:

  • QCOW2, QED – QEMU
  • VMDK – VMware
  • VHD – Microsoft
  • VDI – VirtualBox

Features that various image formats provide:

  • Sparse images
  • Backing files (delta images)
  • Encryption
  • Compression
  • Snapshots


qemu -drive
   if=ide|virtio|scsi,
   file=path/to/img,
   cache=writethrough|writeback|none|unsafe
  • Storage interface is set with if=
  • Path to image file or device is set with file=
  • Caching mode is set with cache=
qemu -drive file=install-disc-1.iso,media=cdrom ...


QEMU supports a wide variety of storage formats and back-ends. The easiest to use are the raw and qcow2 formats, but for the best performance it is best to use a raw partition. You can create either a logical volume or a partition and assign it to the guest:

 qemu -drive file=/dev/mapper/ImagesVolumeGroup-Guest1,cache=none,if=virtio

QEMU also supports a wide variety of caching modes. If you're using raw volumes or partitions, it is best to avoid the cache completely, which reduces data copies and bus traffic:

 qemu -drive file=/dev/mapper/ImagesVolumeGroup-Guest1,cache=none,if=virtio

As with networking, QEMU supports several storage interfaces. The default, IDE, is highly supported by guests but may be slow, especially with disk arrays. If your guest supports it, use the virtio interface:

 qemu -drive file=/dev/mapper/ImagesVolumeGroup-Guest1,cache=none,if=virtio

Don't use the Linux btrfs filesystem on the host for the image files; it will result in low I/O performance, and the KVM guest may even freeze under heavy guest I/O.

Virtual FAT filesystem (VVFAT)
Qemu can emulate a virtual drive with a FAT filesystem. It is an easy way to share files between the guest and host.
It works by prepending fat: to a directory name. By default it's read-only, if you need to make it writable append rw: to the aforementioned prefix.

Example:

qemu -drive file=fat:rw:some/directory ...

WARNING: keep in mind that QEMU makes the virtual FAT table once, when adding the device, and then doesn't update it in response to changes to the specified directory made by the host system. If you modify the directory while the VM is running, QEMU might get confused.

Cache policies

QEMU can cache access to the disk image files, and it provides several methods to do so. This can be specified using the cache modifier.

  • unsafe - Like writeback, but without performing an fsync.
  • writethrough - Data is written to disk and cache simultaneously. (default)
  • writeback - Data is written to disk when discarded from the cache.
  • none - Disable caching.

Example:

qemu -drive file=disk.img,cache=writeback ...


Creating an image
To set up your own guest OS image, you first need to create a blank disc image. QEMU has the qemu-img command for creating and manipulating disc images, and supports a variety of formats. If you don't tell it what format to use, it will use raw files. The “native” format for QEMU is qcow2, and this format offers some flexibility. Here we'll create a 3GB qcow2 image to install Windows XP on:

qemu-img create -f qcow2 winxp.img 3G

The easiest way to install a guest OS is to create an ISO image of a boot CD/DVD and tell QEMU to boot off it. Many free operating systems can be downloaded from the Internet as bootable ISO images, and you can use them directly without having to burn them to disc.
Here we'll boot off an ISO image of a properly licensed Windows XP boot disc. We'll also give it 256MB of RAM, but we won't use the kqemu kernel module just yet because it causes problems during Windows XP installation.

qemu -m 256 -hda winxp.img -cdrom winxpsp2.iso -boot d


Copy on write
The “cow” part of qcow2 is an acronym for copy on write, a neat little trick that allows you to set up an image once and use it many times without changing it. This is ideal for developing and testing software, which generally requires a known stable environment to start off with. You can create your known stable environment in one image, and then create several disposable copy-on-write images to work in.

To start a new disposable environment based on a known good image, invoke the qemu-img command with the option -b and tell it what image to base its copy on. When you run QEMU using the disposable environment, all writes to the virtual disc will go to this disposable image, not the base copy.

qemu-img create -f qcow2 -b winxp.img test01.img 
qemu -m 256 -hda test01.img -kernel-kqemu &

The option -b is not supported on qemu-img, at least not in version 0.12.5. There you use the option backing_file, as shown here:

qemu-img create -f qcow2 -o backing_file=winxp.img test01.img 


qemu

QEMU is a generic and open source machine emulator and virtualizer
Emulation:

  • For cross-compilation, development environments
  • Android Emulator, shipping in an Android SDK near you

Virtualization:

  • KVM and Xen use QEMU device emulation

330.4 Other Virtualization Solutions (weight: 3)

Candidates should have some basic knowledge and experience with alternatives to Xen and KVM. The following is a partial list of the used files, terms and utilities:

  • OpenVZ
  • VirtualBox

OpenVZ

OpenVZ is not true virtualization but really containerization like FreeBSD Jails. Technologies like VMWare and Xen are more flexible in that they virtualize the entire machine and can run multiple operating systems, at the expense of greater overhead required to handle hardware virtualization. OpenVZ uses a single patched Linux kernel and therefore can run only Linux. However because it doesn't have the overhead of a true hypervisor, it is very fast and efficient. The disadvantage with this approach is the single kernel. All guests must function with the same kernel version that the host uses.

The advantages, however, are that memory allocation is soft in that memory not used in one virtual environment can be used by others or for disk caching. OpenVZ uses a common file system, so each virtual environment is just a directory of files that is isolated using chroot; newer versions of OpenVZ also allow the container to have its own file system. Thus a virtual machine can be cloned by just copying the files in one directory to another, creating a config file for the virtual machine and starting it.
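
For illustration, a typical container lifecycle using the standard OpenVZ tools looks roughly like this (the container ID, IP address and OS template name are examples):

vzctl create 101 --ostemplate centos-6-x86_64
vzctl set 101 --ipadd 192.168.1.101 --hostname ct101 --save
vzctl start 101
vzctl enter 101
vzctl stop 101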

Kernel

The OpenVZ kernel is a Linux kernel, modified to add support for OpenVZ containers. The modified kernel provides virtualization, isolation, resource management, and checkpointing.

Virtualization and isolation
Each container is a separate entity, and behaves largely as a physical server would. Each has its own:

  • Files : System libraries, applications, virtualized /proc and /sys, virtualized locks, etc.
  • Users and groups : Each container has its own root user, as well as other users and groups.
  • Process tree : A container only sees its own processes (starting from init). PIDs are virtualized, so that the init PID is 1 as it should be.
  • Network : Virtual network device, which allows a container to have its own IP addresses, as well as a set of netfilter (iptables), and routing rules.

  • Devices : If needed, any container can be granted access to real devices like network interfaces, serial ports, disk partitions, etc.
  • IPC objects : Shared memory, semaphores, messages.

Resource management

OpenVZ resource management consists of three components: two-level disk quota, fair CPU scheduler, and user beancounters. These resources can be changed during container run time, eliminating the need to reboot.
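
For example, CPU weight, CPU limit and disk quota can be changed on a running container with vzctl (the container ID and values are illustrative):

vzctl set 101 --cpuunits 1000 --cpulimit 25 --save
vzctl set 101 --diskspace 10G:11G --save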

Two-level disk quota

Each container can have its own disk quotas, measured in terms of disk blocks and inodes (roughly number of files). Within the container, it is possible to use standard tools to set UNIX per-user and per-group disk quotas.

CPU scheduler
The CPU scheduler in OpenVZ is a two-level implementation of fair-share scheduling strategy.
On the first level, the scheduler decides which container it is to give the CPU time slice to, based on per-container cpuunits values. On the second level the standard Linux scheduler decides which process to run in that container, using standard Linux process priorities.
It is possible to set different values for the CPUs in each container. Real CPU time will be distributed proportionally to these values.
Strict limits, such as 10% of total CPU time, are also possible.

I/O scheduler
Similar to the CPU scheduler described above, the I/O scheduler in OpenVZ is also two-level, utilizing Jens Axboe's CFQ I/O scheduler on its second level.
Each container is assigned an I/O priority, and the scheduler distributes the available I/O bandwidth according to the priorities assigned. Thus no single container can saturate an I/O channel.

User Beancounters
User Beancounters is a set of per-container counters, limits, and guarantees. There is a set of about 20 parameters which is meant to control all the aspects of container operation. This is meant to prevent a single container from monopolizing system resources.
These resources primarily consist of memory and various in-kernel objects such as IPC shared memory segments, and network buffers. Each resource can be seen from /proc/user_beancounters and has five values associated with it: current usage, maximum usage (for the lifetime of a container), barrier, limit, and fail counter. The meaning of barrier and limit is parameter-dependent; in short, those can be thought of as a soft limit and a hard limit. If any resource hits the limit, the fail counter for it is increased. This allows the owner to detect problems by monitoring /proc/user_beancounters in the container.

  • lockedpages - The memory not allowed to be swapped out (locked with the mlock() system call), in pages.
  • shmpages - The total size of shared memory (including IPC, shared anonymous mappings and tmpfs objects) allocated by the processes of a particular VPS, in pages.
  • privvmpages - The size of private (or potentially private) memory allocated by an application. The memory that is always shared among different applications is not included in this resource parameter.
  • numfile - The number of files opened by all VPS processes.
  • numflock - The number of file locks created by all VPS processes.
  • numpty - The number of pseudo-terminals, such as an ssh session, the screen or xterm applications, etc.
  • numsiginfo - The number of siginfo structures (essentially, this parameter limits the size of the signal delivery queue).
  • dcachesize - The total size of dentry and inode structures locked in the memory.
  • physpages - The total size of RAM used by the VPS processes. This is an accounting-only parameter currently. It shows the usage of RAM by the VPS. For the memory pages used by several different VPSs (mappings of shared libraries, for example), only the corresponding fraction of a page is charged to each VPS. The sum of the physpages usage for all VPSs corresponds to the total number of pages used in the system by all the accounted users.
  • numiptent - The number of IP packet filtering entries


Checkpointing and live migration
A live migration and checkpointing feature was released for OpenVZ in the middle of April 2006. This makes it possible to move a container from one physical server to another without shutting down the container. The process is known as checkpointing: a container is frozen and its whole state is saved to a file on disk. This file can then be transferred to another machine and a container can be unfrozen (restored) there; the delay is roughly a few seconds. Because state is usually preserved completely, this pause may appear to be an ordinary computational delay.

OpenVZ distinct features

Scalability
As OpenVZ employs a single kernel model, it is as scalable as the Linux kernel; that is, it supports up to 4096 CPUs and up to 64 GiB of RAM on 32-bit with PAE. Please note that 64-bit kernels are strongly recommended for production. A single container can scale up to the whole physical system, i.e. use all the CPUs and all the RAM.

Performance
The virtualization overhead observed in OpenVZ is minimal, so more computing power is available for each container.

Density
By decreasing the overhead required for each container, it is possible to serve more containers from a given physical server, so long as the computational demands do not exceed the physical availability.

Mass-management
An administrator (i.e. root) of an OpenVZ physical server (also known as a hardware node or host system) can see all the running processes and files of all the containers on the system, which makes mass management convenient. Some fixes (such as a kernel update) affect all containers automatically, while other changes can simply be “pushed” to all the containers by a simple shell script.
Compare this with managing a VMware- or Xen-based virtualized environment: in order to apply a security update to 10 virtual servers, one either needs a more elaborate pull system (on all the virtual servers) for such updates, or an administrator is required to log in to each virtual server and apply the update. This makes OpenVZ more convenient in those cases where a pull system has not been or cannot be implemented.
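
As a hedged sketch of such a "push" (the package command is only an example), the host administrator can loop over all running containers with vzlist and run a command inside each one with vzctl exec:

# for CT in $(vzlist -H -o ctid); do vzctl exec $CT yum -y update openssl; done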

Limitations

OpenVZ restricts container access to /dev to a small subset of devices. The impact is usually not the lack of access to physical hardware, but the inability to add or configure kernel-level features that depend on those devices from inside a container.

/dev/loopN is often restricted in deployments, as it relies on a limited pool of kernel threads. Its absence restricts the ability to mount disk images. Some workarounds exist using FUSE.

OpenVZ is limited to providing only some VPN technologies based on PPP (such as PPTP/L2TP) and TUN/TAP. IPsec is not supported inside containers, including L2TP secured with IPsec.

Full virtualization solutions are free of these limitations.

VirtualBox

Oracle VM VirtualBox (formerly Sun VirtualBox, Sun xVM VirtualBox and innotek VirtualBox) is an x86 virtualization software package, created by software company Innotek GmbH, purchased in 2008 by Sun Microsystems, and now developed by Oracle Corporation as part of its family of virtualization products. Oracle VM VirtualBox is installed on an existing host operating system as an application; this host application allows additional guest operating systems, each known as a Guest OS, to be loaded and run, each with its own virtual environment.

Supported host operating systems include Linux, Mac OS X, Windows XP, Windows Vista, Windows 7, Windows 8, Solaris, and OpenSolaris; there is also a port to FreeBSD. Supported guest operating systems include versions and derivations of Windows, Linux, BSD, OS/2, Solaris and others. Since release 3.2.0, VirtualBox also allows limited virtualization of Mac OS X guests on Apple hardware, though OSX86 can also be installed using VirtualBox.

Since version 4.1, Windows guests on supported hardware can take advantage of the recently implemented WDDM driver included in the guest additions; this allows Windows Aero to be enabled along with Direct3D support.

Emulated environment

Multiple guest OSs can be loaded under the host operating system (host OS). Each guest can be started, paused and stopped independently within its own virtual machine (VM). The user can independently configure each VM and run it under a choice of software-based virtualization or hardware assisted virtualization if the underlying host hardware supports this. The host OS and guest OSs and applications can communicate with each other through a number of mechanisms including a common clipboard and a virtualized network facility provided. Guest VMs can also directly communicate with each other if configured to do so.
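
As a hedged illustration of this lifecycle (the VM name "lpi-test" and its settings are arbitrary examples), a guest can be created, started headless, and paused, resumed or powered off from the command line with VBoxManage:

# VBoxManage createvm --name "lpi-test" --register
# VBoxManage modifyvm "lpi-test" --memory 512 --ostype Linux26
# VBoxManage startvm "lpi-test" --type headless
# VBoxManage controlvm "lpi-test" pause
# VBoxManage controlvm "lpi-test" resume
# VBoxManage controlvm "lpi-test" poweroff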

Software-based virtualization

In the absence of hardware-assisted virtualization, VirtualBox adopts a standard software-based virtualization approach. This mode supports 32-bit guest OSs which run in rings 0 and 3 of the Intel ring architecture.

  • The guest OS code, running in ring 0, is reconfigured to execute in ring 1 on the host hardware. Because this code contains many privileged instructions which cannot run natively in ring 1, VirtualBox employs a Code Scanning and Analysis Manager (CSAM) to scan the ring 0 code recursively before its first execution to identify problematic instructions and then calls the Patch Manager (PATM) to perform in-situ patching. This replaces the instruction with a jump to a VM-safe equivalent compiled code fragment in hypervisor memory.
  • The guest user-mode code, running in ring 3, is generally run directly on the host hardware at ring 3.

In both cases, VirtualBox uses CSAM and PATM to inspect and patch the offending instructions whenever a fault occurs. VirtualBox also contains a dynamic recompiler, based on QEMU, to recompile any real mode or protected mode code entirely (e.g. BIOS code, a DOS guest, or any operating system startup).
Using these techniques, VirtualBox can achieve a performance that is comparable to that of VMware.

Hardware-assisted virtualization

VirtualBox supports both Intel's VT-x and AMD's AMD-V hardware virtualization. Making use of these facilities, VirtualBox can run each guest VM in its own separate address space; the guest OS ring 0 code runs on the host at ring 0 in VMX non-root mode rather than in ring 1.

Some guests, including 64-bit guests, SMP guests and certain proprietary OSs, are only supported by VirtualBox on hosts with hardware-assisted virtualization.
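
As a brief, hedged sketch (the VM name is a placeholder), the host CPU flags mentioned in the exam objectives can be checked, and hardware virtualization explicitly enabled for a VM:

# egrep -c '(vmx|svm)' /proc/cpuinfo
# VBoxManage modifyvm "lpi-test" --hwvirtex on

A result greater than zero from the first command indicates Intel VT-x (vmx) or AMD-V (svm) support on the host.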

Device virtualization

Hard disks are emulated in one of three disk image formats: a VirtualBox-specific container format, called “Virtual Disk Image” (VDI), whose images are stored as files (with a .vdi suffix) on the host operating system; VMware Virtual Machine Disk Format (VMDK); and Microsoft Virtual PC VHD format. A VirtualBox virtual machine can, therefore, use disks that were created in VMware or Microsoft Virtual PC, as well as its own native format. VirtualBox can also connect to iSCSI targets and to raw partitions on the host, using either as virtual hard disks. VirtualBox emulates IDE (PIIX4 and ICH6 controllers), SCSI, SATA (ICH8M controller) and SAS controllers to which hard drives can be attached.

Both ISO images and host-connected physical devices can be mounted as CD/DVD drives. For example, the DVD image of a Linux distribution can be downloaded and used directly by VirtualBox.

By default VirtualBox provides graphics support through a custom virtual graphics card that is VESA compatible. The Guest Additions for Windows, Linux, Solaris, OpenSolaris, or OS/2 guests include a special video driver that increases video performance and adds features such as automatically adjusting the guest resolution when resizing the VM window, or desktop composition via virtualized WDDM drivers.

For an Ethernet network adapter, VirtualBox virtualizes these Network Interface Cards: AMD PCnet PCI II (Am79C970A), AMD PCnet-Fast III (Am79C973), Intel Pro/1000 MT Desktop (82540EM), Intel Pro/1000 MT Server (82545EM), and Intel Pro/1000 T Server (82543GC).[25] The emulated network cards allow most guest OSs to run without the need to find and install drivers for networking hardware as they are shipped as part of the guest OS. A special paravirtualized network adapter is also available, which improves network performance by eliminating the need to match a specific hardware interface, but requires special driver support in the guest. (Many distributions of Linux are shipped with this driver included.) By default, VirtualBox uses NAT through which Internet software for end users such as Firefox or ssh can operate. Bridged networking via a host network adapter or virtual networks between guests can also be configured. Up to 36 network adapters can be attached simultaneously, but only four are configurable through the graphical interface.
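
As a hedged example (the VM name and host interface are placeholders), the attachment mode and adapter type can be chosen per NIC with VBoxManage; the paravirtualized adapter mentioned above corresponds to the virtio type:

# VBoxManage modifyvm "lpi-test" --nic1 nat
# VBoxManage modifyvm "lpi-test" --nic2 bridged --bridgeadapter2 eth0 --nictype2 virtio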

For a sound card, VirtualBox virtualizes Intel HD Audio, Intel ICH AC'97 device and SoundBlaster 16 cards.

A USB 1.1 controller is emulated so that any USB devices attached to the host can be seen in the guest. The closed source extension pack adds a USB 2.0 controller and, if VirtualBox acts as an RDP server, it can also use USB devices on the remote RDP client as if they were connected to the host, although only if the client supports this VirtualBox-specific extension (Oracle provides clients for Solaris, Linux and Sun Ray thin clients that can do this, and have promised support for other platforms in future versions).

Virtual Disk Image

VirtualBox uses its own format for storage containers – Virtual Disk Image (VDI). VirtualBox also supports other well-known storage formats[30] such as VMDK (used in particular by VMware) as well as the VHD format used by Microsoft.

VirtualBox's command-line utility VBoxManage includes options for cloning disks and importing and exporting file systems; however, it does not include a tool for increasing the size of the filesystem within a VDI container. This can be achieved in many ways with third-party tools (e.g. CloneVDI provides a GUI for cloning and increasing the size [31]) or in the guest OS itself.
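
As a minimal, hedged sketch (the file names and size are examples), a VDI can be cloned and its virtual capacity enlarged with VBoxManage; note that --resize (size in MB) only grows the container, so the filesystem inside still has to be resized from the guest or with third-party tools:

# VBoxManage clonehd disk.vdi disk-copy.vdi
# VBoxManage modifyhd disk.vdi --resize 20480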

VirtualBox has supported Open Virtualization Format (OVF) since version 2.2.0 (April 2009).

Topic 331: Load Balancing

331.1 Linux Virtual Server (weight: 5)

Candidates should know how to install, configure, maintain and troubleshoot LVS. This includes the configuration and use of keepalived. Key Knowledge Areas:

  • IPVS
  • VRRP
  • keepalived configuration


The following is a partial list of the used files, terms and utilities:

  • ipvsadm
  • syncd
  • LVS-NAT/Tun/DR/LocalNode
  • connection scheduling algorithms
  • genhash

IPVS

IPVS (IP Virtual Server) implements transport-layer load balancing inside the Linux kernel, so-called Layer-4 switching. IPVS running on a host acts as a load balancer in front of a cluster of real servers: it can direct requests for TCP/UDP-based services to the real servers and make the services of the real servers appear as a virtual service on a single IP address.
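
As a quick, hedged check that IPVS is available on the director, the ip_vs module can be loaded and the (initially empty) virtual server table listed:

# modprobe ip_vs
# cat /proc/net/ip_vs
# ipvsadm -L -n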

VRRP

It's a daemon that implements VRRPv2 (Virtual Router Redundancy Protocol, RFC 2338) for Linux. The daemon has to run on each of the boxes that together make up the high availability system.

Basically, its function is to create a set of nodes sharing the same IP, so that if one dies, another box of the same set can take its place transparently for the end user or host; in other words, a redundant system. Usually (though not necessarily) it is used on routers.

How it works

Each group has a master box that serves the services associated with an IP. That IP is shared throughout the whole set. When the master node fails, a backup node takes its place as the new master. To choose which one should be the new master, static priorities are assigned to each node. Furthermore, as a box can be part of other redundancy groups, each set is identified by a unique ID, called the VRID (Virtual Router ID).

- Virtual IP and virtual MAC

The shared IP is a virtual IP, but it has to be completely valid; any host that uses that IP must not suffer from stale caching in its ARP table (MAC/IP pair). To prevent this, a virtual MAC associated with the virtual IP is created. Otherwise, we would have to wait for the ARP table entry in each host that uses the redundant system to time out before the new master could take over, defeating the whole point of the setup.

Therefore, even if the host that serves that IP changes, as the MAC is consistent throughout the whole set, the ARP table entry of each box that uses the high availability system remains valid. The service is attended transparently by each node of the set so that the end user hardly notices the change.

To make things extra simple, the virtual MAC is made up from the standard prefix 00:00:5E:00:01 and the VRID. Let's say we have assigned VRID 1 to our set. Then our virtual MAC will be 00:00:5E:00:01 + 01 = 00:00:5E:00:01:01.

- Node intercommunication: synchronization

The master node takes an active part in synchronizing the whole system. Every fixed period of time (by default, 1 second) it announces that it is up and running, sending out a packet to the 224.0.0.18 multicast address. When a few of these announcement intervals pass (3 by default) without any announcement from the master, the highest-priority working backup node comes into play, taking its turn as master node.

If this actually happens, and the master node later comes back to life, then because it has the higher priority it preempts the temporary master; the first king returns to its ruling position and sets the old backup node back to its idle wait status.

- Priorities: master node and backup nodes

Each node has a static priority so that, in case of competition, it can be decided which one shall be master. The alive node with the highest priority will be the new ruler.

How long does it take?

I did a couple of tests, and it seems that the response is quicker when the master fails than when the master that had failed comes back. On average, in the first case, a backup starts working after about 10 seconds. In the second case, when the master that had failed comes back to its original status, it does not become functional until about 30 seconds to 1 minute later.

Where to see it

Look in /var/log/syslog on each node. Here is an extract from a backup:

Aug 10 17:43:21 sandbox vrrpd[3926]: Starting (adver_int: 1000000, vrid: 100, use virtual mac: no)
Aug 10 17:43:21 sandbox vrrpd[3926]: VRRP ID 100 on eth0 (prio: 100) : we are now a backup router.
Aug 10 17:43:24 sandbox vrrpd[3926]: VRRP ID 100 on eth0 (prio: 100): we are now the master router.
Aug 10 17:43:27 sandbox kernel: eth0: no IPv6 routers present
Aug 10 17:45:28 sandbox vrrpd[3931]: Starting (adver_int: 1000000, vrid: 1, use virtual mac: no)
Aug 10 17:45:28 sandbox vrrpd[3931]: VRRP ID 1 on eth0 (prio: 100) : we are now a backup router.
Aug 10 17:47:02 sandbox vrrpd[3931]: VRRP ID 1 on eth0 (prio: 100): 172.16.0.3 is down, we are now 
                                     the master router.
Aug 10 17:47:27 sandbox vrrpd[3931]: VRRP ID 1 on eth0 (prio: 100) : 172.16.0.3 is up, we are now 
                                     a backup router.


Use

The package, in Debian and Lunar alike, is (surprise!) vrrpd. Once installed, running it is quite simple.

At the master node, box A [output truncated] :

# ifconfig eth0
eth0  Link encap:Ethernet  HWaddr 00:E0:4C:31:69:5C
	inet addr:172.168.0.3  Bcast:172.168.255.255  Mask:255.255.0.0

# vrrpd -i eth0 -v 1 -D -p 100 172.168.0.222

# ifconfig eth0
eth0  Link encap:Ethernet  HWaddr 00:00:5E:00:01:01
	inet addr:172.168.0.3  Bcast:172.168.255.255  Mask:255.255.0.0 

Options are (in order)

  • -i eth0 : interface that is going to be modified and whose IP is inside the same network as the virtual IP
  • -v 1 : VRID of the set
  • -D : daemonize.
  • -p 100 : this node's priority
  • 172.168.0.222 : virtual IP served


In a backup node (box B):

# ifconfig eth0
eth0    Link encap:Ethernet  HWaddr 00:07:95:A6:BE:81
        inet addr:172.16.0.50  Bcast:172.16.255.255  Mask:255.255.0.0

# vrrpd -i eth0 -v 1 -D -p 150 -n 172.168.0.222 

# ifconfig eth0
eth0    Link encap:Ethernet  HWaddr 00:07:95:A6:BE:81
        inet addr:172.16.0.50  Bcast:172.16.255.255  Mask:255.255.0.0

Differences with the master:

  • -p 150 : Has lower priority
  • -n : Don't change the MAC, as seen before and after running the vrrpd command

IMPORTANT:

  • The virtual IP 172.16.0.222 is a completely valid IP address inside the network 172.16.0.0/16. The announcements of the master node are not sent to 255.255.255.255, only to the range the virtual IP is in, 172.16.0.0/16 in our example.
  • If we add the -n option (do not change the MAC to the virtual one immediately) to the master too, there will be no problem. As the master has a higher priority, as soon as it wakes up it will preempt any other backup. So, to keep things simple and stupid (TM), we can add -n to the configuration of any node, including the master, but THE MASTER HAS TO HAVE THE HIGHEST PRIORITY.


If now, carrying on with the example, we disconnect the master, the backup will come into play. And, as said above, when the master comes back, it will be master and rule again :)

Inconveniences

Open connections are not carried over between the nodes upon failure and change of master. You will have to expect hung connections until retries happen. At least we have ensured a short response window ;)


keepalived configuration

The keepalived configuration file uses the following synopsis:

Global definitions synopsis
global_defs {
   notification_email {
      email
      email
   }
   notification_email_from email
   smtp_server host
   smtp_connect_timeout num
   lvs_id string
}
  • global_defs - identify the global def configuration block
  • notification_email - email accounts that will receive the notification mail (type: List)
  • notification_email_from - email to use when processing “MAIL FROM:” SMTP command (type: List)
  • smtp_server - remote SMTP server to use for sending mail notifications (type: alphanum)
  • smtp_connect_timeout - specify a timeout for SMTP stream processing (type: numerical)
  • lvs_id - specify the name of the LVS director (type: alphanum)

Email type: a string using the charset specified in the SMTP RFC, e.g. “user@domain.com”.

Virtual server definitions synopsis
virtual_server (@IP PORT)|(fwmark num) {
   delay_loop num
   lb_algo rr|wrr|lc|wlc|sh|dh|lblc
   lb_kind NAT|DR|TUN
   (nat_mask @IP)
   persistence_timeout num
   persistence_granularity @IP
   virtualhost string
   protocol TCP|UDP

   sorry_server @IP PORT

   real_server @IP PORT {
      weight num
      TCP_CHECK {
          connect_port num
          connect_timeout num
      }
   }
   real_server @IP PORT {
      weight num
      MISC_CHECK {
          misc_path /path_to_script/script.sh
          (or misc_path “/path_to_script/script.sh <arg_list>”)
      }
   }
   real_server @IP PORT {
      weight num
      HTTP_GET|SSL_GET {
          url {      # You can add multiple url block
             path alphanum
             digest alphanum
          }
          connect_port num
          connect_timeout num
          nb_get_retry num
          delay_before_retry num
      }
   }
}
  • virtual_server - identify a virtual server definition block
  • fwmark - specify that the virtual server is a FWMARK
  • delay_loop - specify in seconds the interval between checks
  • lb_algo - select a specific scheduler (rr, wrr, lc, wlc, …)
  • lb_kind - select a specific forwarding method (NAT, DR or TUN)
  • persistence_timeout - specify a timeout value for persistent connections
  • persistence_granularity - specify a granularity mask for persistent connections
  • virtualhost - specify a HTTP virtualhost to use for HTTP_GET or SSL_GET
  • protocol - specify the protocol kind (TCP or UDP)
  • sorry_server - server to be added to the pool if all real servers are down
  • real_server - specify a real server member
  • weight - specify the real server weight for load balancing decisions
  • TCP_CHECK - check real server availability using a TCP connect
  • MISC_CHECK - check real server availability using a user-defined script
  • misc_path - identify the script to run, with full path
  • HTTP_GET - check real server availability using an HTTP GET request
  • SSL_GET - check real server availability using an SSL GET request
  • url - identify a url definition block
  • path - specify the url path
  • digest - specify the digest for a specific url path
  • connect_port - connect to the remote server on the specified TCP port
  • connect_timeout - connect to the remote server using the specified timeout
  • nb_get_retry - maximum number of retries
  • delay_before_retry - delay between two successive retries

NB: The “nat_mask” keyword is obsolete unless you are using LVS with a Linux 2.2 series kernel. This flag gives you the ability to define the reverse NAT granularity.
NB: Currently, the healthcheck framework only implements the TCP protocol for service monitoring.
NB: Type “path” refers to the full path of the script being called. Note that for scripts requiring arguments the path and arguments must be enclosed in double quotes (“).

VRRP Instance definitions synopsis
vrrp_sync_group string {
   group {
      string
      string
   }
   notify_master /path_to_script/script_master.sh
      (or notify_master “/path_to_script/script_master.sh <arg_list>”)
   notify_backup /path_to_script/script_backup.sh
      (or notify_backup “/path_to_script/script_backup.sh <arg_list>”)
   notify_fault /path_to_script/script_fault.sh
      (or notify_fault “/path_to_script/script_fault.sh <arg_list>”)
}
vrrp_instance string {
   state MASTER|BACKUP
   interface string
   mcast_src_ip @IP
   lvs_sync_daemon_interface string
   virtual_router_id num
   priority num
   advert_int num
   smtp_alert
   authentication {
      auth_type PASS|AH
      auth_pass string
   }
   virtual_ipaddress {
      # Block limited to 20 IP addresses
      @IP
      @IP
      @IP
   }
   virtual_ipaddress_excluded { # Unlimited IP addresses number
      @IP
      @IP
      @IP
   }
   notify_master /path_to_script/script_master.sh
      (or notify_master “/path_to_script/script_master.sh <arg_list>”)
   notify_backup /path_to_script/script_backup.sh
      (or notify_backup “/path_to_script/script_backup.sh <arg_list>”)
   notify_fault /path_to_script/script_fault.sh
      (or notify_fault “/path_to_script/script_fault.sh <arg_list>”)
}
  • vrrp_instance - identify a VRRP instance definition block
  • state - specify the instance state in standard use
  • interface - specify the network interface for the instance to run on
  • mcast_src_ip - specify the source IP address value for VRRP adverts IP header
  • lvs_sync_daemon_interface - specify the network interface for the LVS sync_daemon to run on
  • virtual_router_id - specify to which VRRP router id the instance belongs
  • priority - specify the instance priority in the VRRP router
  • advert_int - specify the advertisement interval in seconds (set to 1)
  • smtp_alert - activate the SMTP notification for MASTER state transition
  • authentication - identify a VRRP authentication definition block
  • auth_type - specify which kind of authentication to use (PASS or AH)
  • auth_pass - specify the password string to use
  • virtual_ipaddress - identify a VRRP VIP definition block
  • virtual_ipaddress_excluded - identify a VRRP VIP excluded definition block (not protocol VIPs)
  • notify_master - specify a shell script to be executed during transition to MASTER state
  • notify_backup - specify a shell script to be executed during transition to BACKUP state
  • notify_fault - specify a shell script to be executed during transition to FAULT state
  • vrrp_sync_group - identify the VRRP synchronization instances group

Path type: a system path to a script, e.g. “/usr/local/bin/transit.sh <arg_list>”.
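
As a small, hedged sketch of such a notify script (the path and log file below are arbitrary examples), the script only needs to be executable; here it simply records the state transition:

#!/bin/sh
# hypothetical /usr/local/bin/transit.sh: log VRRP state transitions
echo "$(date) $(hostname): VRRP transition, called with: $*" >> /var/log/keepalived-transitions.log

It could then be referenced as, for example, notify_master "/usr/local/bin/transit.sh MASTER" inside a vrrp_instance or vrrp_sync_group block.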

Example config

Client (on the internet somewhere) –> load balancer –> realserver

Load balancer IPs:

        IP of load balancer's external interface(eth0): 192.168.1.9
        external VIP of our realserver: 192.168.1.11
        IP of load balancer's interface(eth1): 10.20.40.2
        internal VIP our realserver will use as a default gateway: 10.20.40.1 

Realserver:

        IP: 10.20.40.10
        be sure to set the default gateway to 10.20.40.1 

Our first step is to configure keepalived. The typical location for this file is /etc/keepalived/keepalived.conf

Note that keepalived, as of this writing, does not report errors in the configuration file! This means that if something is not right in the config file it may be difficult to notice. Try starting keepalived with the -d option, which will dump the parsed configuration to syslog.

! This is a comment
! Configuration File for keepalived

global_defs {
   ! this is who emails will go to on alerts
   notification_email {
        admins@example.com
    fakepager@example.com
    ! add a few more email addresses here if you would like
   }
   notification_email_from admins@example.com

   ! I use the local machine to relay mail
   smtp_server 127.0.0.1
   smtp_connect_timeout 30

   ! each load balancer should have a different ID
   ! this will be used in SMTP alerts, so you should make
   ! each router easily identifiable
   lvs_id LVS_EXAMPLE_01
}

! vrrp_sync_groups make sure that several router instances
! stay together on a failure - a good example of this is
! that the external interface on one router fails and the backup server
! takes over, you want the internal interface on the failed server
! to failover as well, otherwise nothing will work.
! you can have as many vrrp_sync_group blocks as you want.
vrrp_sync_group VG1 {
   group {
      VI_1
      VI_GATEWAY
   }
}

! each interface needs at least one vrrp_instance
! each vrrp_instance is a group of VIPs that are logically grouped
! together
! you can have as many vrrp_instaces as you want

vrrp_instance VI_1 {
        state MASTER
        interface eth0
     
        lvs_sync_daemon_interface eth0

    ! each virtual router id must be unique per instance name!
        virtual_router_id 51

    ! MASTER and BACKUP state are determined by the priority
    ! even if you specify MASTER as the state, the state will
    ! be voted on by priority (so if your state is MASTER but your
    ! priority is lower than the router with BACKUP, you will lose
    ! the MASTER state)
    ! I make it a habit to set priorities at least 50 points apart
    ! note that a lower number is lesser priority - lower gets less vote
        priority 150

    ! how often should we vote, in seconds?
        advert_int 1

    ! send an alert when this instance changes state from MASTER to BACKUP
        smtp_alert

    ! this authentication is for syncing between failover servers
    ! keepalived supports PASS, which is simple password
    ! authentication
    ! or AH, which is the IPSec authentication header.
    ! I don't use AH
    ! yet as many people have reported problems with it
        authentication {
                auth_type PASS
                auth_pass example
        }

    ! these are the IP addresses that keepalived will setup on this
    ! machine. Later in the config we will specify which real
        ! servers  are behind these IPs
    ! without this block, keepalived will not set up and take down
    ! any IP addresses
     
        virtual_ipaddress {
                192.168.1.11
        ! and more if you want them
        }
}

! now I setup the instance that the real servers will use as a default
! gateway
! most of the config is the same as above, but on a different interface

vrrp_instance VI_GATEWAY {
        state MASTER
        interface eth1
        lvs_sync_daemon_interface eth1
        virtual_router_id 52
        priority 150
        advert_int 1
        smtp_alert
        authentication {
                auth_type PASS
                auth_pass example
        }
        virtual_ipaddress {
                10.20.40.1
        }
}

! now we set up more information about our virtual server
! we are just setting up one for now, listening on port 22 for ssh
! requests.

! notice we do not setup a virtual_server block for the 10.20.40.1
! address in the VI_GATEWAY instance. That's because we are doing NAT
! on that IP, and nothing else.

virtual_server 192.168.1.11 22 {
    delay_loop 6

    ! use round-robin as a load balancing algorithm
    lb_algo rr

    ! we are doing NAT
    lb_kind NAT
    nat_mask 255.255.255.0

    protocol TCP

    ! there can be as many real_server blocks as you need

    real_server 10.20.40.10 22 {

    ! if we used weighted round-robin or a similar lb algo,
    ! we include the weight of this server

        weight 1

    ! here is a health checker for this server.
    ! we could use a custom script here (see the keepalived docs)
    ! but we will just make sure we can do a vanilla tcp connect()
    ! on port 22
    ! if it fails, we will pull this realserver out of the pool
    ! and send email about the removal
        TCP_CHECK {
                connect_timeout 3
        connect_port 22
        }
    }
}

! that's all

When you start keepalived with the -d flag, you should see this in /var/log/messages (or equivalent):

Sep 12 14:13:11 example-01 Keepalived: ------< Global definitions >------
Sep 12 14:13:11 example-01 Keepalived:  LVS ID = LVS_EXAMPLE_01
Sep 12 14:13:11 example-01 Keepalived:  Smtp server = 127.0.0.1
Sep 12 14:13:11 example-01 Keepalived:  Smtp server connection timeout = 100
Sep 12 14:13:11 example-01 Keepalived:  Email notification from = admins@example.com, fakepager@example.com
Sep 12 14:13:11 example-01 Keepalived:  Email notification = admins@example.com
Sep 12 14:13:11 example-01 Keepalived: ------< SSL definitions >------
Sep 12 14:13:11 example-01 Keepalived:  Using autogen SSL context
Sep 12 14:13:11 example-01 Keepalived: ------< VRRP Topology >------
Sep 12 14:13:11 example-01 Keepalived:  VRRP Instance = VI_1
Sep 12 14:13:11 example-01 Keepalived:    Want State = MASTER
Sep 12 14:13:11 example-01 Keepalived:    Runing on device = eth0
Sep 12 14:13:11 example-01 Keepalived:    Virtual Router ID = 51
Sep 12 14:13:11 example-01 Keepalived:    Priority = 150
Sep 12 14:13:11 example-01 Keepalived:    Advert interval = 1sec
Sep 12 14:13:11 example-01 Keepalived:    Preempt Active
Sep 12 14:13:11 example-01 Keepalived:    Authentication type = SIMPLE_PASSWORD
Sep 12 14:13:11 example-01 Keepalived:    Password = example
Sep 12 14:13:11 example-01 Keepalived:    VIP count = 1
Sep 12 14:13:11 example-01 Keepalived:      VIP1 = 192.168.1.11/32
Sep 12 14:13:11 example-01 Keepalived:  VRRP Instance = VI_GATEWAY
Sep 12 14:13:11 example-01 Keepalived:    Want State = MASTER
Sep 12 14:13:11 example-01 Keepalived:    Runing on device = eth1
Sep 12 14:13:11 example-01 Keepalived:    Virtual Router ID = 52
Sep 12 14:13:11 example-01 Keepalived:    Priority = 150
Sep 12 14:13:11 example-01 Keepalived:    Advert interval = 1sec
Sep 12 14:13:11 example-01 Keepalived:    Preempt Active
Sep 12 14:13:11 example-01 Keepalived:    Authentication type = SIMPLE_PASSWORD
Sep 12 14:13:11 example-01 Keepalived:    Password = example
Sep 12 14:13:11 example-01 Keepalived:    VIP count = 1
Sep 12 14:13:11 example-01 Keepalived:      VIP1 = 10.20.40.1/32
Sep 12 14:13:11 example-01 Keepalived: ------< VRRP Sync groups >------
Sep 12 14:13:11 example-01 Keepalived:  VRRP Sync Group = VG1, MASTER
Sep 12 14:13:11 example-01 Keepalived:    monitor = VI_1
Sep 12 14:13:11 example-01 Keepalived:    monitor = VI_GATEWAY
Sep 12 14:13:11 example-01 Keepalived: ------< LVS Topology >------
Sep 12 14:13:11 example-01 Keepalived:  System is compiled with LVS v1.0.4
Sep 12 14:13:11 example-01 Keepalived:  VIP = 192.168.1.11, VPORT = 22
Sep 12 14:13:11 example-01 Keepalived:    delay_loop = 10, lb_algo = rr
Sep 12 14:13:11 example-01 Keepalived:    protocol = TCP
Sep 12 14:13:11 example-01 Keepalived:    lb_kind = NAT
Sep 12 14:13:11 example-01 Keepalived:    RIP = 10.20.40.11, RPORT = 22, WEIGHT = 1
Sep 12 14:13:11 example-01 Keepalived: ------< Health checkers >------
Sep 12 14:13:11 example-01 Keepalived:  10.20.40.11:22
Sep 12 14:13:11 example-01 Keepalived:    Keepalive method = TCP_CHECK
Sep 12 14:13:11 example-01 Keepalived:    Connection timeout = 10
Sep 12 14:13:11 example-01 Keepalived:    Connection port = 22

Let's see what ipvsadm has to say about this, after keepalived starts up:

example-01:~ # ipvsadm
IP Virtual Server version 1.0.4 (size=65536)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.1.11:ssh rr
  -> 10.20.40.10:ssh            Masq    1      0          0
example-01:~ #

And finally, we should see the new IP addresses in our IP address list:

example-01:~ # ip addr list
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:e0:81:21:bb:1c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.9/24 brd 192.168.1.254 scope global eth0
    inet 192.168.1.11/32 scope global eth0
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:e0:81:21:bb:1d brd ff:ff:ff:ff:ff:ff
    inet 10.20.40.2/24 brd 10.20.40.255 scope global eth1
    inet 10.20.40.1/32 scope global eth1
example-01:~ #

ipvsadm, ip addr list, and starting keepalived with the -d option are good ways to verify your config is working.

Failover

With our basic config from above, we can easily move to a failover situation. All you have to do is set up keepalived on another box, copy over the keepalived.conf, change the lvs_id, lower the priorities by 50 points, change the states to BACKUP, and run keepalived. You'll see in the logs on the backup server that the server accepts its BACKUP state, and if you unplug the network cable(s) from the MASTER server, the BACKUP server takes over the MASTER state.

For the example, use the config file from the simple example above on the MASTER machine. On the BACKUP machine, use this config file:

! This is a comment
! Configuration File for keepalived


global_defs {
   ! this is who emails will go to on alerts
   notification_email {
        admins@example.com
    fakepager@example.com
    ! add a few more email addresses here if you would like
   }
   notification_email_from admins@example.com

   ! I use the local machine to relay mail
   smtp_server 127.0.0.1
   smtp_connect_timeout 30

   ! each load balancer should have a different ID
   ! this will be used in SMTP alerts, so you should make
   ! each router easily identifiable

   ! this is router 2
   lvs_id LVS_EXAMPLE_02
}

! vrrp_sync_groups make sure that several router instances
! stay together on a failure - a good example of this is
! that the external interface on one router fails and the backup server
! takes over, you want the internal interface on the failed server
! to failover as well, otherwise nothing will work.
! you can have as many vrrp_sync_group blocks as you want.
vrrp_sync_group VG1 {
   group {
      VI_1
      VI_GATEWAY
   }
}

! each interface needs at least one vrrp_instance
! each vrrp_instance is a group of VIPs that are logically grouped
! together
! you can have as many vrrp_instaces as you want

vrrp_instance VI_1 {
        ! we are the failover
        state BACKUP
        interface eth0
     
        lvs_sync_daemon_interface eth0

    ! each virtual router id must be unique per instance name!
        ! instance names are the same on MASTER and BACKUP, so the
        ! virtual router_id is the same as VI_1 on the MASTER
        virtual_router_id 51

    ! MASTER and BACKUP state are determined by the priority
    ! even if you specify MASTER as the state, the state will
    ! be voted on by priority (so if your state is MASTER but your
    ! priority is lower than the router with BACKUP, you will lose
    ! the MASTER state)
    ! I make it a habit to set priorities at least 50 points apart
    ! note that a lower number is lesser priority -
    ! lower gets less vote
        priority 100

    ! how often should we vote, in seconds?
        advert_int 1

    ! send an alert when this instance changes state from
    ! MASTER to BACKUP
        smtp_alert

    ! this authentication is for syncing between failover servers
    ! keepalived supports PASS, which is simple
    ! password authentication
    ! or AH, which is the ipsec authentication header.
    ! I don't use AH
    ! yet as many people have reported problems with it
        authentication {
                auth_type PASS
                auth_pass example
        }


        
        virtual_ipaddress {
                192.168.1.11
        ! and more if you want them
        }
}

! now I setup the instance that the real servers will use as a default
! gateway
! most of the config is the same as above, but on a different interface

vrrp_instance VI_GATEWAY {
        state BACKUP
        interface eth1
        lvs_sync_daemon_interface eth1
        virtual_router_id 52
        priority 100
        advert_int 1
        smtp_alert
        authentication {
                auth_type PASS
                auth_pass example
        }
        virtual_ipaddress {
                10.20.40.1
        }
}


! now we set up more information about our virtual server
! we are just setting up one for now, listening on port 22 for ssh
! requests.

! notice we do not setup a virtual_server block for the 10.20.40.1
! address in the VI_GATEWAY instance. That's because we are doing NAT
! on that IP, and nothing else.

virtual_server 192.168.1.11 22 {
    delay_loop 6

    ! use round-robin as a load balancing algorithm
    lb_algo rr

    ! we are doing NAT
    lb_kind NAT
    nat_mask 255.255.255.0


    protocol TCP

    ! there can be as many real_server blocks as you need

    real_server 10.20.40.10 22 {

    ! if we used weighted round-robin or a similar lb algo,
    ! we include the weight of this server

        weight 1

    ! here is a health checker for this server.
    ! we could use a custom script here (see the keepalived docs)
    ! but we will just make sure we can do a vanilla tcp connect()
    ! on port 22
    ! if it fails, we will pull this realserver out of the pool
    ! and send email about the removal
        TCP_CHECK {
                connect_timeout 3
        connect_port 22
        }
    }
}

! that's all


Notice how little is different between the MASTER and BACKUP config file - just the lvs_id directive, the priorities, and the state directive. That's it, that's all. Make sure these are different but everything else is the same.

Once you startup keepalived on the MASTER and the BACKUP, you should be able to kill keepalived on the MASTER server and watch the BACKUP take over in the logs on the BACKUP server.

If you do an ip addr list on the backup server, you won't see the VIPs until the backup server takes over the MASTER state.
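
As a brief, hedged recap of that test (assuming the log goes to /var/log/messages and the external interface is eth0), on the MASTER:

# killall keepalived

and on the BACKUP:

# tail -f /var/log/messages
# ip addr list dev eth0

The BACKUP log should show the VI_1 and VI_GATEWAY instances entering the MASTER state, and the VIPs 192.168.1.11 and 10.20.40.1 should now appear in its address list.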

ipvsadm

Man Page

NAME
       ipvsadm - Linux Virtual Server administration

SYNOPSIS
       ipvsadm -A|E -t|u|f service-address [-s scheduler]
               [-p [timeout]] [-M netmask]
       ipvsadm -D -t|u|f service-address
       ipvsadm -C
       ipvsadm -R
       ipvsadm -S [-n]
       ipvsadm -a|e -t|u|f service-address -r server-address
               [-g|i|m] [-w weight] [-x upper] [-y lower]
       ipvsadm -d -t|u|f service-address -r server-address
       ipvsadm -L|l [options]
       ipvsadm -Z [-t|u|f service-address]
       ipvsadm --set tcp tcpfin udp
       ipvsadm --start-daemon state [--mcast-interface interface]
               [--syncid syncid]
       ipvsadm --stop-daemon state
       ipvsadm -h

DESCRIPTION
       Ipvsadm(8)  is  used  to set up, maintain or inspect the virtual server
       table in the Linux kernel. The Linux Virtual  Server  can  be  used  to
       build  scalable  network  services  based  on  a cluster of two or more
       nodes. The active node of the cluster redirects service requests  to  a
       collection  of  server  hosts  that will actually perform the services.
       Supported features include two protocols (TCP and UDP),  three  packet-
       forwarding methods (NAT, tunneling, and direct routing), and eight load
       balancing algorithms (round robin, weighted round robin,  least-connec-
       tion,   weighted   least-connection,  locality-based  least-connection,
       locality-based least-connection with replication,  destination-hashing,
       and source-hashing).

       The command has two basic formats for execution:

       ipvsadm COMMAND [protocol] service-address
               [scheduling-method] [persistence options]

       ipvsadm command [protocol] service-address
               server-address [packet-forwarding-method]
               [weight options]

       The  first  format  manipulates a virtual service and the algorithm for
       assigning service requests to real servers.  Optionally,  a  persistent
       timeout  and  network  mask for the granularity of a persistent service
       may be specified. The second format manipulates a real server  that  is
       associated  with  an  existing  virtual service. When specifying a real
       server, the packet-forwarding method and the weight of the real server,
       relative  to  other real servers for the virtual service, may be speci-
       fied, otherwise defaults will be used.

   COMMANDS
       ipvsadm(8) recognises the commands described below. Upper-case commands
       maintain  virtual  services.  Lower-case commands maintain real servers
       that are associated with a virtual service.

       -A, --add-service
              Add a virtual service. A service address is uniquely defined  by
              a triplet: IP address, port number, and protocol. Alternatively,
              a virtual service may be defined by a firewall-mark.

       -E, --edit-service
              Edit a virtual service.

       -D, --delete-service
              Delete  a  virtual  service,  along  with  any  associated  real
              servers.

       -C, --clear
              Clear the virtual server table.

       -R, --restore
              Restore  Linux  Virtual  Server rules from stdin. Each line read
              from stdin will be treated as the command line options to a sep-
              arate  invocation  of ipvsadm. Lines read from stdin can option-
              ally begin with "ipvsadm".  This option is useful to avoid  exe-
               cuting  a large number of ipvsadm commands when constructing an
              extensive routing table.

       -S, --save
              Dump the Linux Virtual Server rules to stdout in a  format  that
              can be read by -R|--restore.

       -a, --add-server
              Add a real server to a virtual service.

       -e, --edit-server
              Edit a real server in a virtual service.

       -d, --delete-server
              Remove a real server from a virtual service.

       -L, -l, --list
              List  the virtual server table if no argument is specified. If a
              service-address is selected, list this service only. If  the  -c
              option is selected, then display the connection table. The exact
              output is affected by the other arguments given.

       -Z, --zero
              Zero the packet, byte and rate counters in a service or all ser-
              vices.

       --set tcp tcpfin udp
              Change  the  timeout values used for IPVS connections. This com-
              mand always takes  3  parameters,   representing   the   timeout
              values (in seconds) for TCP sessions, TCP sessions after receiv-
              ing a  FIN packet, and  UDP  packets, respectively.   A  timeout
              value 0 means that the current timeout value of the  correspond-
              ing  entry  is preserved.

       --start-daemon state
              Start the connection synchronization daemon.  The  state  is  to
              indicate  that  the  daemon  is started as master or backup. The
              connection synchronization  daemon  is  implemented  inside  the
              Linux kernel. The master daemon running at the primary load bal-
              ancer multicasts changes of connections  periodically,  and  the
              backup daemon running at the backup load balancers receives mul-
              ticast message and creates corresponding connections.  Then,  in
              case  the  primary  load  balancer fails, a backup load balancer
              will takeover, and it has state of almost  all  connections,  so
              that  almost  all established connections can continue to access
              the service.

       --stop-daemon
              Stop the connection synchronization daemon.

       -h, --help
              Display a description of the command syntax.

   PARAMETERS
       The commands above accept or require zero  or  more  of  the  following
       parameters.

       -t, --tcp-service service-address
              Use TCP service. The service-address is of the form host[:port].
              Host may be one of a plain IP address or a hostname. Port may be
              either a plain port number or the service name of port. The Port
              may be omitted, in which case zero will be used. A Port  of zero
              is  only  valid if the service is persistent as the -p|--persis-
              tent option, in which case it is a wild-card port, that is  con-
              nections will be accepted to any port.

       -u, --udp-service service-address
              Use UDP service. See the -t|--tcp-service for the description of
              the service-address.

       -f, --fwmark-service integer
              Use a firewall-mark, an integer  value  greater  than  zero,  to
              denote  a virtual service instead of an address, port and proto-
              col (UDP or TCP). The marking of packets with a firewall-mark is
              configured  using the -m|--mark option to iptables(8). It can be
               used to build a virtual service associated with the same  real
               servers, covering multiple IP address, port and protocol
               triplets.

              Using  firewall-mark  virtual  services  provides  a  convenient
              method  of  grouping  together different IP addresses, ports and
              protocols into a single virtual service. This is useful for both
              simplifying  configuration if a large number of virtual services
              are required and grouping persistence across what  would  other-
              wise be multiple virtual services.

       -s, --scheduler scheduling-method
              scheduling-method   Algorithm for allocating TCP connections and
              UDP datagrams to real servers.  Scheduling algorithms are imple-
              mented as kernel modules. Ten are shipped with the Linux Virtual
              Server:

               rr - Round Robin: distributes jobs equally amongst the available
              real servers.

              wrr - Weighted Round Robin: assigns jobs to real servers propor-
               tionally to the real servers’ weight.  Servers  with  higher
              weights  receive  new  jobs first and get more jobs than servers
              with lower weights. Servers with equal weights get an equal dis-
              tribution of new jobs.

              lc  -  Least-Connection:  assigns more jobs to real servers with
              fewer active jobs.

              wlc - Weighted Least-Connection: assigns more  jobs  to  servers
              with  fewer  jobs  and  relative  to  the  real  servers’ weight
              (Ci/Wi). This is the default.

              lblc - Locality-Based Least-Connection:  assigns  jobs  destined
              for  the same IP address to the same server if the server is not
               overloaded and available; otherwise it assigns jobs to servers
               with fewer jobs, and keeps it for future assignment.

              lblcr   -   Locality-Based  Least-Connection  with  Replication:
              assigns jobs destined for the same IP address to the  least-con-
               nection node in the server set for the IP address. If all the
               nodes in the server set are overloaded, it picks a node with
               fewer jobs in the cluster and adds it to the server set for the
              target. If the server set has not been modified for  the  speci-
              fied  time, the most loaded node is removed from the server set,
              in order to avoid high degree of replication.

              dh - Destination Hashing: assigns jobs to servers through  look-
              ing  up a statically assigned hash table by their destination IP
              addresses.

              sh - Source Hashing: assigns jobs to servers through looking  up
              a statically assigned hash table by their source IP addresses.

              sed  -  Shortest  Expected Delay: assigns an incoming job to the
              server with the shortest expected delay. The expected delay that
              the  job  will  experience  is (Ci + 1) / Ui if  sent to the ith
              server, in which Ci is the number of jobs on the the ith  server
              and Ui is the fixed service rate (weight) of the ith server.

              nq  -  Never Queue: assigns an incoming job to an idle server if
              there is, instead of waiting for a fast one; if all the  servers
              are busy, it adopts the Shortest Expected Delay policy to assign
              the job.

       -p, --persistent [timeout]
              Specify that a virtual service is persistent. If this option  is
              specified, multiple requests from a client are redirected to the
              same real server selected for the  first  request.   Optionally,
              the  timeout  of  persistent  sessions may be specified given in
              seconds, otherwise the default of 300 seconds will be used. This
              option  may be used in conjunction with protocols such as SSL or
              FTP where it is important that clients consistently connect with
              the same real server.

              Note:  If  a  virtual  service is to handle FTP connections then
              persistence must be set for the virtual service if Direct  Rout-
              ing  or  Tunnelling is used as the forwarding mechanism. If Mas-
              querading is used in conjunction with an FTP service  than  per-
              sistence  is not necessary, but the ip_vs_ftp kernel module must
              be used.  This module may be manually inserted into  the  kernel
              using insmod(8).

       -M, --netmask netmask
              Specify  the granularity with which clients are grouped for per-
              sistent virtual services.  The source address of the request  is
              masked with this netmask to direct all clients from a network to
              the same real server. The default is 255.255.255.255,  that  is,
              the  persistence  granularity  is per client host. Less specific
              netmasks may be used to  resolve  problems  with  non-persistent
              cache clusters on the client side.

       -r, --real-server server-address
              Real  server  that  an  associated  request  for  service may be
              assigned to.  The server-address is the host address of  a  real
               server, and may include a port. Host can be either a plain IP address
              or a hostname.  Port can be either a plain port  number  or  the
              service  name  of port.  In the case of the masquerading method,
              the host address is usually an RFC 1918 private IP address,  and
              the  port  can be different from that of the associated service.
              With the tunneling and direct  routing  methods,  port  must  be
              equal  to  that of the service address. For normal services, the
              port specified  in the service address will be used if  port  is
              not  specified.  For  fwmark  services,  port may be omitted, in
              which case  the destination port on the real server will be  the
              destination port of the request sent to the virtual service.

       [packet-forwarding-method]

              -g,  --gatewaying   Use gatewaying (direct routing). This is the
              default.

              -i, --ipip  Use ipip encapsulation (tunneling).

              -m, --masquerading  Use masquerading  (network  access  transla-
              tion, or NAT).

               Note:  Regardless of the packet-forwarding mechanism specified,
               real servers for addresses for which there are interfaces on
               the local node will use the local forwarding method, and
               packets for these servers will be passed to the upper layer on
               the local node. This cannot be specified by ipvsadm; rather, it
               is set by the kernel as real servers are added or modified.

       -w, --weight weight
              Weight is an integer specifying the capacity  of a server  rela-
              tive to the others in the pool. The valid values of weight are 0
              through to 65535. The default is 1. Quiescent servers are speci-
              fied  with  a weight of zero. A quiescent server will receive no
              new jobs but still serve the existing jobs, for  all  scheduling
              algorithms  distributed with the Linux Virtual Server. Setting a
              quiescent server may be useful if the server  is  overloaded  or
              needs to be taken out of service for maintenance.

       -x, --u-threshold uthreshold
              uthreshold is an integer specifying the upper connection thresh-
              old of a server. The valid values of uthreshold are 0 through to
              65535.  The  default  is  0,  which  means  the upper connection
              threshold is not set. If uthreshold is set with other values, no
              new  connections  will  be sent to the server when the number of
              its connections exceeds its upper connection threshold.

       -y, --l-threshold lthreshold
              lthreshold is an integer specifying the lower connection thresh-
              old of a server. The valid values of lthreshold are 0 through to
              65535. The default  is  0,  which  means  the  lower  connection
              threshold  is  not  set. If lthreshold is set with other values,
              the server will receive new connections when the number  of  its
              connections  drops  below  its  lower  connection  threshold. If
              lthreshold is not set but uthreshold is  set,  the  server  will
              receive new connections when the number of its connections drops
               below three fourths of its upper connection threshold.

       --mcast-interface interface
              Specify the multicast interface  that  the  sync  master  daemon
              sends  outgoing  multicasts  through,  or the sync backup daemon
              listens to for multicasts.

       --syncid syncid
              Specify the syncid that the sync master daemon fills in the Syn-
              cID  header while sending multicast messages, or the sync backup
              daemon uses to filter out multicast messages  not  matched  with
              the  SyncID  value.  The valid values of syncid are 0 through to
              255. The default is 0, which means no filtering at all.

       -c, --connection
              Connection output. The list command with this option  will  list
              current IPVS connections.

       --timeout
              Timeout  output.  The list command with this option will display
              the  timeout values (in seconds) for TCP sessions, TCP  sessions
              after receiving a FIN packet, and UDP packets.

       --daemon
              Daemon  information  output.  The  list command with this option
              will display the daemon status and its multicast interface.

       --stats
              Output of statistics information. The  list  command  with  this
              option  will  display the statistics information of services and
              their servers.

       --rate Output of rate information. The list command  with  this  option
              will  display  the rate information (such as connections/second,
              bytes/second and packets/second) of services and their  servers.

       --thresholds
              Output  of  thresholds  information.  The list command with this
              option will display the upper/lower connection threshold  infor-
              mation of each server in service listing.

       --persistent-conn
              Output  of  persistent  connection information. The list command
              with this option will display the persistent connection  counter
              information  of  each  server in service listing. The persistent
              connection is used to forward the actual  connections  from  the
              same client/network to the same server.

       --sort Sort  the list of virtual services and real servers. The virtual
              service entries are sorted  in  ascending  order  by  <protocol,
              address,  port>. The real server entries are sorted in ascending
              order by <address, port>.

        -n, --numeric
               Numeric output.  IP addresses and port numbers will be printed
               in numeric format rather than as host names and services
               respectively, which is the default.

        --exact
               Expand numbers.  Display the exact value of the packet and byte
               counters, instead of only the rounded number in K’s (multiples
               of 1000), M’s (multiples of 1000K) or G’s (multiples of 1000M).
               This option is only relevant for the -L command.

EXAMPLE 1 - Simple Virtual Service
       The following commands configure a Linux Director to distribute  incom-
       ing  requests addressed to port 80 on 207.175.44.110 equally to port 80
       on five real servers. The forwarding method used  in  this  example  is
       NAT,  with  each  of  the  real  servers being masqueraded by the Linux
       Director.

       ipvsadm -A -t 207.175.44.110:80 -s rr
       ipvsadm -a -t 207.175.44.110:80 -r 192.168.10.1:80 -m
       ipvsadm -a -t 207.175.44.110:80 -r 192.168.10.2:80 -m
       ipvsadm -a -t 207.175.44.110:80 -r 192.168.10.3:80 -m
       ipvsadm -a -t 207.175.44.110:80 -r 192.168.10.4:80 -m
       ipvsadm -a -t 207.175.44.110:80 -r 192.168.10.5:80 -m

       Alternatively, this could be achieved in a single ipvsadm command.

       echo "
       -A -t 207.175.44.110:80 -s rr
       -a -t 207.175.44.110:80 -r 192.168.10.1:80 -m
       -a -t 207.175.44.110:80 -r 192.168.10.2:80 -m
       -a -t 207.175.44.110:80 -r 192.168.10.3:80 -m
       -a -t 207.175.44.110:80 -r 192.168.10.4:80 -m
       -a -t 207.175.44.110:80 -r 192.168.10.5:80 -m
       " | ipvsadm -R

       As masquerading is used as the forwarding mechanism  in  this  example,
       the  default  route of the real servers must be set to the linux direc-
       tor, which will need to be configured to forward and  masquerade  pack-
       ets. This can be achieved using the following commands:

       echo "1" > /proc/sys/net/ipv4/ip_forward

EXAMPLE 2 - Firewall-Mark Virtual Service
       The  following commands configure a Linux Director to distribute incom-
       ing requests addressed to any port on 207.175.44.110 or  207.175.44.111
       equally to the corresponding port on five real servers. As per the pre-
       vious example, the forwarding method used in this example is NAT,  with
       each of the real servers being masqueraded by the Linux Director.

       ipvsadm -A -f 1  -s rr
       ipvsadm -a -f 1 -r 192.168.10.1:0 -m
       ipvsadm -a -f 1 -r 192.168.10.2:0 -m
       ipvsadm -a -f 1 -r 192.168.10.3:0 -m
       ipvsadm -a -f 1 -r 192.168.10.4:0 -m
       ipvsadm -a -f 1 -r 192.168.10.5:0 -m

       As  masquerading  is  used as the forwarding mechanism in this example,
       the default route of the real servers must be set to the  linux  direc-
       tor,  which  will need to be configured to forward and masquerade pack-
       ets. The real server should also be configured to mark incoming packets
       addressed  to any port on 207.175.44.110 and  207.175.44.111 with fire-
       wall-mark 1. If FTP traffic is to be handled by this  virtual  service,
       then  the ip_vs_ftp kernel module needs to be inserted into the kernel.
       These operations can be achieved using the following commands:

       echo "1" > /proc/sys/net/ipv4/ip_forward
       modprobe ip_tables
       iptables  -A PREROUTING -t mangle -d 207.175.44.110/31 -j MARK --set-mark 1
       modprobe ip_vs_ftp

NOTES
       The Linux Virtual Server implements three  defense  strategies  against
       some  types of denial of service (DoS) attacks. The Linux Director cre-
       ates an entry for each connection in order to keep its state, and  each
       entry occupies 128 bytes effective memory. LVS’s vulnerability to a DoS
       attack lies in the potential to increase the number of entries as much as
       possible until the linux director runs out of memory. The three defense
       strategies against the attack are: Randomly drop some  entries  in  the
       table.  Drop  1/rate packets before forwarding them. And use secure tcp
       state transition table and short  timeouts.  The  strategies  are  con-
       trolled  by  sysctl  variables  and  corresponding entries in the /proc
       filesystem:

       /proc/sys/net/ipv4/vs/drop_entry      /proc/sys/net/ipv4/vs/drop_packet
       /proc/sys/net/ipv4/vs/secure_tcp

       Valid values for each variable are 0 through to 3. The default value is
       0, which disables the respective defense strategy. 1 and  2  are  auto-
       matic modes - when there is not enough available memory, the respective
       strategy will be enabled and the variable is automatically  set  to  2,
       otherwise  the  strategy  is  disabled  and the variable is set to 1. A
       value of 3 denotes that the respective strategy is always enabled.  The
       available  memory  threshold and secure TCP timeouts can be tuned using
       the sysctl variables and corresponding entries in the /proc filesystem:

       /proc/sys/net/ipv4/vs/amemthresh /proc/sys/net/ipv4/vs/timeout_*
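
       For example, the defenses can be put into automatic mode and the
       memory threshold raised by writing to the /proc entries directly
       (a sketch; the threshold is expressed in memory pages):

       # enable the defense strategies automatically under memory pressure
       echo 1 > /proc/sys/net/ipv4/vs/drop_entry
       echo 1 > /proc/sys/net/ipv4/vs/drop_packet
       echo 1 > /proc/sys/net/ipv4/vs/secure_tcp

       # raise the available-memory threshold at which the defenses kick in
       echo 2048 > /proc/sys/net/ipv4/vs/amemthresh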

FILES
       /proc/net/ip_vs
       /proc/net/ip_vs_app
       /proc/net/ip_vs_conn
       /proc/net/ip_vs_stats
       /proc/sys/net/ipv4/vs/am_droprate
       /proc/sys/net/ipv4/vs/amemthresh
       /proc/sys/net/ipv4/vs/drop_entry
       /proc/sys/net/ipv4/vs/drop_packet
       /proc/sys/net/ipv4/vs/secure_tcp
       /proc/sys/net/ipv4/vs/timeout_close
       /proc/sys/net/ipv4/vs/timeout_closewait
       /proc/sys/net/ipv4/vs/timeout_established
       /proc/sys/net/ipv4/vs/timeout_finwait
       /proc/sys/net/ipv4/vs/timeout_icmp
       /proc/sys/net/ipv4/vs/timeout_lastack
       /proc/sys/net/ipv4/vs/timeout_listen
       /proc/sys/net/ipv4/vs/timeout_synack
       /proc/sys/net/ipv4/vs/timeout_synrecv
       /proc/sys/net/ipv4/vs/timeout_synsent
       /proc/sys/net/ipv4/vs/timeout_timewait
       /proc/sys/net/ipv4/vs/timeout_udp

SEE ALSO
       The LVS web site (http://www.linuxvirtualserver.org/) for more documen-
       tation about LVS.

       ipvsadm-save(8), ipvsadm-restore(8), iptables(8),
       insmod(8), modprobe(8)

AUTHORS
       ipvsadm - Wensong Zhang <wensong@linuxvirtualserver.org>
              Peter Kese <peter.kese@ijs.si>
       man page - Mike Wangsmo <wanger@redhat.com>
               Wensong Zhang <wensong@linuxvirtualserver.org>
               Horms <horms@verge.net.au>

Output

The output of the IPVS rules from Example 1 above is as follows:

[root@penguin sbin]# ipvsadm -ln
IP Virtual Server version 1.0.8 (size=65536)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  207.175.44.110:80 rr
  -> 192.168.10.5:80              Masq    1      0          0
  -> 192.168.10.4:80              Masq    1      0          0
  -> 192.168.10.3:80              Masq    1      0          0
  -> 192.168.10.2:80              Masq    1      0          0
  -> 192.168.10.1:80              Masq    1      0          0

Source

syncd

Server State Sync Daemon, syncd (saving the director's connection state on failover)

Here are simple instructions for using connection synchronization.

On the primary load balancer, run

primary_director:# ipvsadm --start-daemon=master --mcast-interface=eth0

On the backup load balancers, run

backup_director:# ipvsadm --start-daemon=backup --mcast-interface=eth0

To stop the daemon, run

director:# ipvsadm --stop-daemon

Note that the connection synchronization feature is still experimental, and there is some performance penalty when connection synchronization is enabled, because a highly loaded load balancer may need to multicast a lot of connection information. If the daemon is not started, the performance will not be affected.
Syncd boxes must have the same time.
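
To verify that synchronization is running, the daemon status and the synchronized connection table can be inspected on either director with the list options described in the man page above, for example:

director:# ipvsadm -L --daemon
director:# ipvsadm -L -c -n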

Expiration of Connection in Backup Director

In Primary Director

[root@RADIXS root]# ipvsadm -lc
IPVS connection entries
pro expire state       source             virtual
      destination
TCP 14:42  ESTABLISHED 192.168.123.133:32861
vipserver:telnet   application:telnet

In Backup Director

[root@main ong]# ipvsadm -lc
IPVS connection entries
pro expire state       source             virtual
      destination
TCP 02:17  ESTABLISHED 192.168.123.133:32861
vipserver:telnet   application:telnet


Source

LVS-NAT/Tun/DR/LocalNode

LVS-NAT

Network address translation

Due to the shortage of IP addresses in IPv4 and for some security reasons, more and more networks use internal IP addresses (such as 10.0.0.0/255.0.0.0, 172.16.0.0/255.240.0.0 and 192.168.0.0/255.255.0.0) which cannot be used on the Internet. The need for network address translation arises when hosts in internal networks want to access the Internet and be accessed from the Internet.

Network address translation is a feature by which IP addresses are mapped from one group to another. When the address mapping is N-to-N, it is called static network address translation; when the mapping is M-to-N (M>N), it is called dynamic network address translation. Network address port translation is an extension to basic NAT, in that many network addresses and their TCP/UDP ports are translated to a single network address and its TCP/UDP ports. This is N-to-1 mapping, in which way Linux IP Masquerading was implemented. More description about network address translation is in rfc1631 and draft-rfced-info-srisuresh-05.txt.

Virtual server via NAT on Linux is done by network address port translation. The code is built on the Linux IP Masquerading code, and some of Steven Clarke's port forwarding code is reused.


When a user accesses the service provided by the server cluster, the request packet destined for the virtual IP address (the external IP address for the load balancer) arrives at the load balancer. The load balancer examines the packet's destination address and port number. If they are matched for a virtual server service according to the virtual server rule table, a real server is chosen from the cluster by a scheduling algorithm, and the connection is added into the hash table which records the established connections. Then, the destination address and the port of the packet are rewritten to those of the chosen server, and the packet is forwarded to the server. When an incoming packet belongs to this connection and the chosen server can be found in the hash table, the packet will be rewritten and forwarded to the chosen server. When the reply packets come back, the load balancer rewrites the source address and port of the packets to those of the virtual service. After the connection terminates or times out, the connection record will be removed from the hash table.

Confused? Let me give an example to make it clear. In the example, computers are configured as follows:


Note that real servers can run any OS that supports TCP/IP; the default route of the real servers must be the virtual server (172.16.0.1 in this example). The ipfwadm utility is used to make the virtual server accept packets from the real servers. In the example above, the commands are as follows:

        echo 1 > /proc/sys/net/ipv4/ip_forward
        ipfwadm -F -a m -S 172.16.0.0/24 -D 0.0.0.0/0

The following table illustrates the rules specified in the Linux box with virtual server support.

Protocol  Virtual IP Address  Port  Real IP Address  Port   Weight
TCP       202.103.106.5       80    172.16.0.2       80     1
                                    172.16.0.3       8000   2
TCP       202.103.106.5       21    172.16.0.3       21     1


All traffic destined for IP address 202.103.106.5 Port 80 is load-balanced over real IP address 172.16.0.2 Port 80 and 172.16.0.3 Port 8000. Traffic destined for IP address 202.103.106.5 Port 21 is port-forwarded to real IP address 172.16.0.3 Port 21.
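
With the current ipvsadm tool, the rules in the table above could be expressed roughly as follows (a sketch, not from the original page; the wlc scheduler is an assumption):

        ipvsadm -A -t 202.103.106.5:80 -s wlc
        ipvsadm -a -t 202.103.106.5:80 -r 172.16.0.2:80 -m -w 1
        ipvsadm -a -t 202.103.106.5:80 -r 172.16.0.3:8000 -m -w 2
        ipvsadm -A -t 202.103.106.5:21 -s wlc
        ipvsadm -a -t 202.103.106.5:21 -r 172.16.0.3:21 -m -w 1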

Packet rewriting works as follows.

The incoming packet for the web service would have source and destination addresses as follows:
SOURCE 202.100.1.2:3456 DEST 202.103.106.5:80

The load balancer will choose a real server, e.g. 172.16.0.3:8000. The packet would be rewritten and forwarded to the server as:
SOURCE 202.100.1.2:3456 DEST 172.16.0.3:8000

Replies get back to the load balancer as:
SOURCE 172.16.0.3:8000 DEST 202.100.1.2:3456

The reply packets would be rewritten to carry the virtual server address and returned to the client as:
SOURCE 202.103.106.5:80 DEST 202.100.1.2:3456

Source

Tun

This page contains information about the working principle of Virtual Server via IP Tunneling and how to use IP Tunneling to greatly increase the scalability of server clusters.

IP tunneling

IP tunneling (IP encapsulation) is a technique to encapsulate IP datagrams within IP datagrams, which allows datagrams destined for one IP address to be wrapped and redirected to another IP address. IP encapsulation is now commonly used in Extranets, Mobile-IP, IP-Multicast, and tunneled hosts or networks. Please see the NET-3-HOWTO document for details.

How to use IP tunneling on virtual server

First, let's look at virtual server via IP tunneling. The main difference between virtual server via IP tunneling and virtual server via NAT is that in the former the load balancer sends requests to the real servers through an IP tunnel, while in the latter it sends requests to the real servers via network address translation.


When a user accesses a virtual service provided by the server cluster, a packet destined for the virtual IP address (the IP address for the virtual server) arrives. The load balancer examines the packet's destination address and port. If they are matched for the virtual service, a real server is chosen from the cluster according to a connection scheduling algorithm, and the connection is added into the hash table which records connections. Then, the load balancer encapsulates the packet within an IP datagram and forwards it to the chosen server. When an incoming packet belongs to this connection and the chosen server can be found in the hash table, the packet will again be encapsulated and forwarded to that server. When the server receives the encapsulated packet, it decapsulates the packet, processes the request, and finally returns the result directly to the user according to its own routing table. After a connection terminates or times out, the connection record will be removed from the hash table. The workflow is illustrated in the following figure.


Note that real servers can have any real IP address in any network; they can be geographically distributed, but they must support the IP encapsulation protocol. Their tunnel devices must all be configured up so that the systems can decapsulate received encapsulated packets properly, and the <Virtual IP Address> must be configured on non-ARP devices or on an alias of non-ARP devices, or the system must be configured to redirect packets for <Virtual IP Address> to a local socket. See the ARP problem page for more information.

Finally, when an encapsulated packet arrives, the real server decapsulates it and finds that the packet is destined for <Virtual IP Address>. Since that address is configured locally ("Oh, it is for me, so I do it"), it processes the request and returns the result directly to the user.

How to use it

Let's give an example to see how to use it. The following table illustrates the rules specified in the Linux box with virtual server via IP tunneling. Note that the services running on the real servers must run on the same port as the virtual service, so it is not necessary to specify the service port on the real servers.

Protocol  Virtual IP Address  Port  Real IP Address  Weight
TCP       202.103.106.5       80    202.103.107.2    1
                                    202.103.106.3    2


All traffic destined for IP address 202.103.106.5 Port 80 is load-balanced over real IP address 202.103.107.2 Port 80 and 202.103.106.3 Port 80.

We can use the following commands to specify the rules in the table above in the system.

    ipvsadm -A -t 202.103.106.5:80 -s wlc
    ipvsadm -a -t 202.103.106.5:80 -r 202.103.107.2 -i -w 1
    ipvsadm -a -t 202.103.106.5:80 -r 202.103.106.3 -i -w 2

My example for testing virtual server via tunneling

Here is my configuration example for testing virtual server via tunneling. I hope it can give you some clues. The load balancer has the address 172.26.20.111, and the real server 172.26.20.112; 172.26.20.110 is the virtual IP address. In all the following examples, “telnet 172.26.20.110” will actually reach the real server.

The load balancer (LinuxDirector), kernel 2.2.14

    ifconfig eth0 172.26.20.111 netmask 255.255.255.0 broadcast 172.26.20.255 up
    ifconfig eth0:0 172.26.20.110 netmask 255.255.255.255 broadcast 172.26.20.110 up
    echo 1 > /proc/sys/net/ipv4/ip_forward
    ipvsadm -A -t 172.26.20.110:23 -s wlc
    ipvsadm -a -t 172.26.20.110:23 -r 172.26.20.112 -i

The real server 1, kernel 2.0.36 (IP forwarding enabled)

    ifconfig eth0 172.26.20.112 netmask 255.255.255.0 broadcast 172.26.20.255 up
    route add -net 172.26.20.0 netmask 255.255.255.0 dev eth0
    ifconfig tunl0 172.26.20.110 netmask 255.255.255.255 broadcast 172.26.20.110 up
    route add -host 172.26.20.110 dev tunl0


More configuration examples

Here are more configuration examples of virtual server via IP tunneling. In order to save space, only the important commands are shown and the less important ones are omitted.

1. Real server running kernel 2.2.14 or later with hidden device

The load balancer (LinuxDirector), kernel 2.2.14

    echo 1 > /proc/sys/net/ipv4/ip_forward
    ipvsadm -A -t 172.26.20.110:23 -s wlc
    ipvsadm -a -t 172.26.20.110:23 -r 172.26.20.112 -i

The real server 1, kernel 2.2.14

    echo 1 > /proc/sys/net/ipv4/ip_forward
    # insert it if it is compiled as module
    modprobe ipip
    ifconfig tunl0 0.0.0.0 up
    echo 1 > /proc/sys/net/ipv4/conf/all/hidden
    echo 1 > /proc/sys/net/ipv4/conf/tunl0/hidden
    ifconfig tunl0 172.26.20.110 netmask 255.255.255.255 broadcast 172.26.20.110 up

Since kernel 2.2 has just one tunnel device (tunl0), you can only have one VIP in this configuration. For multiple VIPs, you can bring the tunl0 device up, configure the VIPs on aliases of tunnel/dummy/loopback devices, and hide that device. An example is as follows:

    echo 1 > /proc/sys/net/ipv4/ip_forward
    # insert it if it is compiled as module
    modprobe ipip
    ifconfig tunl0 0.0.0.0 up
    ifconfig dummy0 0.0.0.0 up
    echo 1 > /proc/sys/net/ipv4/conf/all/hidden
    echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden
    ifconfig dummy0:0 172.26.20.110 up
    route add -host 172.26.20.110 dev dummy0:0
    ifconfig dummy0:1 <Another-VIP> up
    ...

2. Real servers running kernel 2.2.x with redirect approach

The load balancer's configuration is the same as the example above. Real servers running kernel 2.2.x can be configured as follows:

    echo 1 > /proc/sys/net/ipv4/ip_forward
    # insert it if it is compiled as module
    modprobe ipip
    ifconfig tunl0 0.0.0.0 up
    ipchains -A input -j REDIRECT 23 -d 172.26.20.110 23 -p tcp
    ...

Source

DR

This page contains information about the working principle of the Direct Routing request dispatching technique and how to use it to construct server clusters.

Direct Routing request dispatching technique

This request dispatching approach is similar to the one implemented in IBM's NetDispatcher. The virtual IP address is shared by the real servers and the load balancer. The load balancer has an interface configured with the virtual IP address too, which is used to accept request packets, and it directly routes the packets to the chosen servers. All the real servers have their non-ARP alias interface configured with the virtual IP address or redirect packets destined for the virtual IP address to a local socket, so that the real servers can process the packets locally. The load balancer and the real servers must have one of their interfaces physically linked by a HUB/Switch. The architecture of virtual server via direct routing is illustrated as follows:


When a user accesses a virtual service provided by the server cluster, the packet destined for the virtual IP address (the IP address for the virtual server) arrives. The load balancer (LinuxDirector) examines the packet's destination address and port. If they are matched for a virtual service, a real server is chosen from the cluster by a scheduling algorithm, and the connection is added into the hash table which records connections. Then, the load balancer directly forwards it to the chosen server. When an incoming packet belongs to this connection and the chosen server can be found in the hash table, the packet will again be directly routed to the server. When the server receives the forwarded packet, the server finds that the packet is for the address on its alias interface or for a local socket, so it processes the request and finally returns the result directly to the user. After a connection terminates or times out, the connection record will be removed from the hash table.

The direct routing workflow is illustrated in the following figure:


The load balancer simply changes the MAC address of the data frame to that of the chosen server and retransmits it on the LAN. This is the reason that the load balancer and each server must be directly connected to one another by a single uninterrupted segment of a LAN. If you run into ARP problems with the cluster, see the ARP problem page for more information.

My example for testing virtual server via direct routing

Here is my configuration example for testing virtual server via direct routing. I hope it can give you some clues. The load balancer has the address 172.26.20.111, and the real server 172.26.20.112; 172.26.20.110 is the virtual IP address. In all the following examples, “telnet 172.26.20.110” will actually reach the real server.

For kernel 2.2.x

The load balancer (LinuxDirector), kernel 2.2.14

  ifconfig eth0 172.26.20.111 netmask 255.255.255.0 broadcast 172.26.20.255 up
  ifconfig eth0:0 172.26.20.110 netmask 255.255.255.255 broadcast 172.26.20.110 up
  echo 1 > /proc/sys/net/ipv4/ip_forward
  ipvsadm -A -t 172.26.20.110:23 -s wlc
  ipvsadm -a -t 172.26.20.110:23 -r 172.26.20.112 -g

The real server 1, kernel 2.0.36 (IP forwarding enabled)

  ifconfig eth0 172.26.20.112 netmask 255.255.255.0 broadcast 172.26.20.255 up
  route add -net 172.26.20.0 netmask 255.255.255.0 dev eth0
  ifconfig lo:0 172.26.20.110 netmask 255.255.255.255 broadcast 172.26.20.110 up
  route add -host 172.26.20.110 dev lo:0


More configuration examples
1. Real server running kernel 2.2.14 or later with hidden device

The load balancer (LinuxDirector), kernel 2.2.14

    echo 1 > /proc/sys/net/ipv4/ip_forward
    ipvsadm -A -t 172.26.20.110:23 -s wlc
    ipvsadm -a -t 172.26.20.110:23 -r 172.26.20.112 -g

The real server 1, kernel 2.2.14

    echo 1 > /proc/sys/net/ipv4/ip_forward
    echo 1 > /proc/sys/net/ipv4/conf/all/hidden
    echo 1 > /proc/sys/net/ipv4/conf/lo/hidden
    ifconfig lo:0 172.26.20.110 netmask 255.255.255.255 broadcast 172.26.20.110 up

You can configure the VIP on alias of other devices like dummy and hide it. Since it is the alias interface, you can configure as many VIPs as you want. An example using dummy device is as follows:

    echo 1 > /proc/sys/net/ipv4/ip_forward
    ifconfig dummy0 0.0.0.0 up
    echo 1 > /proc/sys/net/ipv4/conf/all/hidden
    echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden
    ifconfig dummy0:0 172.26.20.110 up
    ifconfig dummy0:1 <Another-VIP> up
    ...

2. Real servers running kernel 2.2.x with redirect approach

The load balancer's configuration is the same as the example above. Real servers running kernel 2.2.x can be configured as follows:

    echo 1 > /proc/sys/net/ipv4/ip_forward
    ipchains -A input -j REDIRECT 23 -d 172.26.20.110 23 -p tcp
    ...

With this ipchains redirect command, TCP packets destined for the address 172.26.20.110 port 23 will be redirected to a local socket. Note that the service daemon must listen on all addresses (0.0.0.0) or on the VIP address (172.26.20.110 here).
3. Real servers having different network routes

In virtual server via direct routing, the real servers can follow different network routes back to the clients (different Internet links), which is good for performance. The load balancer and real servers use a private LAN to communicate. Here is a configuration example.

The load balancer (LinuxDirector), kernel 2.2.14

    ifconfig eth0 <an IP address> ...
    ...
    ifconfig eth0:0 <VIP> netmask 255.255.255.255 broadcast <VIP> up
    ifconfig eth1 192.168.0.1 netmask 255.255.255.0 broadcast 192.168.0.255 up
    ipvsadm -A -t <VIP>:23
    ipvsadm -a -t <VIP>:23 -r 192.168.0.2 -g
    ...

The real server 1, kernel 2.0.36

    ifconfig eth0 <a separate IP address> ...
    # Follow the different network route
    ...
    ifconfig eth1 192.168.0.2 netmask 255.255.255.0 broadcast 192.168.0.255 up
    route add -net 192.168.0.0 netmask 255.255.255.0 dev eth1
    ifconfig lo:0 <VIP> netmask 255.255.255.255 broadcast <VIP> up
    route add -host <VIP> dev lo:0


Source

LocalNode

Using the director as a “sorry server” (e.g. when all realservers are overloaded and you want to display a “please come back later” message).

With localnode, the director machine can be a realserver too. This is convenient when only a small number of machines are available as servers.

To use localnode, with ipvsadm you add a realserver with IP 127.0.0.1 (or any local IP on your director). You then set up the service to listen on the VIP on the director, so that when the service replies to the client, the src_addr of the reply packets is the VIP. The client is not connecting to a service on 127.0.0.1 (or a local IP on the director), despite ipvsadm installing a service with RIP=127.0.0.1.

Some services, e.g. telnet, listen on all IPs on the machine and you won't have to do anything special for them; they will already be listening on the VIP. Other services, e.g. http and sshd, have to be specifically configured to listen on each IP.

Note
Configuring the service to listen on an IP which is not the VIP is the most common mistake made by people reporting problems with setting up LocalNode.

LocalNode operates independently of the NAT, TUN or DR modules (i.e. you can have LocalNode running on a director that is forwarding packets to realservers by any of the forwarding methods).
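
As a minimal sketch (the VIP is reused from the earlier NAT example, which is an assumption), adding the director itself as a realserver looks like any other real-server entry:

    # VIP 207.175.44.110 is configured on the director; adding 127.0.0.1
    # makes the director itself answer part of the requests (LocalNode)
    ipvsadm -a -t 207.175.44.110:80 -r 127.0.0.1:80 -m
    # the web server on the director must listen on the VIP (or on 0.0.0.0)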

Source

connection scheduling algorithms

One of the advantages of using LVS is its ability to perform flexible, IP-level load balancing on the real server pool. This flexibility is due to the variety of scheduling algorithms an administrator can choose from when configuring LVS. LVS load balancing is superior to less flexible methods, such as Round-Robin DNS where the hierarchical nature of DNS and the caching by client machines can lead to load imbalances. Additionally, the low-level filtering employed by the LVS router has advantages over application-level request forwarding because balancing loads at the network packet level causes minimal computational overhead and allows for greater scalability.

Using scheduling, the active router can take into account the real servers' activity and, optionally, an administrator-assigned weight factor when routing service requests. Using assigned weights gives arbitrary priorities to individual machines. Using this form of scheduling, it is possible to create a group of real servers using a variety of hardware and software combinations and the active router can evenly load each real server.

The scheduling mechanism for LVS is provided by a collection of kernel patches called IP Virtual Server or IPVS modules. These modules enable layer 4 (L4) transport layer switching, which is designed to work well with multiple servers on a single IP address.

To track and route packets to the real servers efficiently, IPVS builds an IPVS table in the kernel. This table is used by the active LVS router to redirect requests from a virtual server address to real servers in the pool, and to route the traffic returning from those servers. The IPVS table is constantly updated by a utility called ipvsadm, which adds and removes cluster members depending on their availability.
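
For instance, a monitoring tool or an administrator typically updates the IPVS table by removing a failed member and adding it back once it recovers; a sketch reusing the addresses from the earlier NAT example:

    # take a failed real server out of the virtual service
    ipvsadm -d -t 207.175.44.110:80 -r 192.168.10.3:80
    # re-add it with weight 1 once it passes its health check again
    ipvsadm -a -t 207.175.44.110:80 -r 192.168.10.3:80 -m -w 1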

Scheduling Algorithms

The structure that the IPVS table takes depends on the scheduling algorithm that the administrator chooses for any given virtual server. To allow for maximum flexibility in the types of services you can cluster and how these services are scheduled, Red Hat Enterprise Linux provides the scheduling algorithms listed below. For instructions on how to assign scheduling algorithms, refer to Section 4.6.1, “The VIRTUAL SERVER Subsection”.
Round-Robin Scheduling
Distributes each request sequentially around the pool of real servers. Using this algorithm, all the real servers are treated as equals without regard to capacity or load. This scheduling model resembles round-robin DNS but is more granular due to the fact that it is network-connection based and not host-based. LVS round-robin scheduling also does not suffer the imbalances caused by cached DNS queries.

Weighted Round-Robin Scheduling
Distributes each request sequentially around the pool of real servers but gives more jobs to servers with greater capacity. Capacity is indicated by a user-assigned weight factor, which is then adjusted upward or downward by dynamic load information. Refer to Section 1.3.2, “Server Weight and Scheduling” for more on weighting real servers.
Weighted round-robin scheduling is a preferred choice if there are significant differences in the capacity of real servers in the pool. However, if the request load varies dramatically, the more heavily weighted server may answer more than its share of requests.

Least-Connection
Distributes more requests to real servers with fewer active connections. Because it keeps track of live connections to the real servers through the IPVS table, least-connection is a type of dynamic scheduling algorithm, making it a better choice if there is a high degree of variation in the request load. It is best suited for a real server pool where each member node has roughly the same capacity. If a group of servers have different capabilities, weighted least-connection scheduling is a better choice.

Weighted Least-Connections (default)
Distributes more requests to servers with fewer active connections relative to their capacities. Capacity is indicated by a user-assigned weight, which is then adjusted upward or downward by dynamic load information. The addition of weighting makes this algorithm ideal when the real server pool contains hardware of varying capacity. Refer to Section 1.3.2, “Server Weight and Scheduling” for more on weighting real servers.

Locality-Based Least-Connection Scheduling
Distributes more requests to servers with fewer active connections relative to their destination IPs. This algorithm is designed for use in a proxy-cache server cluster. It routes the packets for an IP address to the server assigned to that address unless that server is above its capacity and there is another server working at half its load, in which case it assigns the IP address to the least loaded real server.

Locality-Based Least-Connection Scheduling with Replication Scheduling
Distributes more requests to servers with fewer active connections relative to their destination IPs. This algorithm is also designed for use in a proxy-cache server cluster. It differs from Locality-Based Least-Connection Scheduling by mapping the target IP address to a subset of real server nodes. Requests are then routed to the server in this subset with the lowest number of connections. If all the nodes for the destination IP are above capacity, it replicates a new server for that destination IP address by adding the real server with the least connections from the overall pool of real servers to the subset of real servers for that destination IP. The most loaded node is then dropped from the real server subset to prevent over-replication.

Destination Hash Scheduling
Distributes requests to the pool of real servers by looking up the destination IP in a static hash table. This algorithm is designed for use in a proxy-cache server cluster.

Source Hash Scheduling
Distributes requests to the pool of real servers by looking up the source IP in a static hash table. This algorithm is designed for LVS routers with multiple firewalls.
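
When configuring IPVS directly with ipvsadm rather than through the Red Hat tools, these algorithms are selected with the -s option of the virtual service. The scheduler names below are the standard IPVS ones; the address is just a placeholder and each line is an alternative (a virtual service has exactly one scheduler):

    ipvsadm -A -t 207.175.44.110:80 -s rr      # round-robin
    ipvsadm -A -t 207.175.44.110:80 -s wrr     # weighted round-robin
    ipvsadm -A -t 207.175.44.110:80 -s lc      # least-connection
    ipvsadm -A -t 207.175.44.110:80 -s wlc     # weighted least-connection (default)
    ipvsadm -A -t 207.175.44.110:80 -s lblc    # locality-based least-connection
    ipvsadm -A -t 207.175.44.110:80 -s lblcr   # locality-based least-connection with replication
    ipvsadm -A -t 207.175.44.110:80 -s dh      # destination hash
    ipvsadm -A -t 207.175.44.110:80 -s sh      # source hash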

Source

genhash

genhash(1) - Linux man page

Name
genhash - md5 hash generation tool for remote web pages
Synopsis
genhash [options] [-s server-address] [-p port] [-u url]
Description
genhash is a tool used for generating md5sum hashes of remote web pages. genhash can use HTTP or HTTPS 
to connect to the web page. The output by this utility includes the HTTP header, page data, and the 
md5sum of the data. This md5sum can then be used within the keepalived(8) program, for monitoring HTTP 
and HTTPS services.

Options

--use-ssl, -S
    Use SSL to connect to the server. 
--server <host>, -s
    Specify the ip address to connect to. 
--port <port>, -p
    Specify the port to connect to. 
--url <url>, -u
    Specify the path to the file you want to generate the hash of. 
--use-virtualhost <host>, -V
    Specify the virtual host to send along with the HTTP headers. 
--verbose, -v
    Be verbose with the output. 
--help, -h
    Display the program help screen and exit. 
--release, -r
    Display the release number (version) and exit.
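
A typical invocation (the address and path are placeholders) fetches a page and prints the digest, which can then be pasted into a keepalived HTTP_GET or SSL_GET checker:

    genhash -s 192.168.1.10 -p 80 -u /index.html
    # the same page over HTTPS
    genhash -S -s 192.168.1.10 -p 443 -u /index.html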


Source

331.2 HAProxy (weight: 3)

Exam candidates should be able to install, configure, maintain and troubleshoot HAProxy. Key Knowledge Areas:

  • HAProxy


The following is a partial list of the used files, terms and utilities:

  • ACLs
  • load balancing algorithms

HAProxy

HAProxy is a free, very fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications. It is particularly suited for web sites crawling under very high loads while needing persistence or Layer7 processing. Supporting tens of thousands of connections is clearly realistic with today's hardware. Its mode of operation makes its integration into existing architectures very easy and riskless, while still offering the possibility not to expose fragile web servers to the Net.

Configuring HAProxy

Configuration file format
HAProxy's configuration process involves 3 major sources of parameters :

  • the arguments from the command-line, which always take precedence
  • the “global” section, which sets process-wide parameters
  • the proxies sections which can take form of “defaults”, “listen”, “frontend” and “backend”.


The configuration file syntax consists in lines beginning with a keyword referenced in this manual, optionally followed by one or several parameters delimited by spaces. If spaces have to be entered in strings, then they must be preceded by a backslash ('\') to be escaped. Backslashes also have to be escaped by doubling them.

Time format
Some parameters involve values representing time, such as timeouts. These values are generally expressed in milliseconds (unless explicitly stated otherwise) but may be expressed in any other unit by suffixing the unit to the numeric value. It is important to consider this because it will not be repeated for every keyword. Supported units are :

  • us : microseconds. 1 microsecond = 1/1000000 second
  • ms : milliseconds. 1 millisecond = 1/1000 second. This is the default.
  • s : seconds. 1s = 1000ms
  • m : minutes. 1m = 60s = 60000ms
  • h : hours. 1h = 60m = 3600s = 3600000ms
  • d : days. 1d = 24h = 1440m = 86400s = 86400000ms

Examples

    # Simple configuration for an HTTP proxy listening on port 80 on all
    # interfaces and forwarding requests to a single backend "servers" with a
    # single server "server1" listening on 127.0.0.1:8000
    global
        daemon
        maxconn 256

    defaults
        mode http
        timeout connect 5000ms
        timeout client 50000ms
        timeout server 50000ms

    frontend http-in
        bind *:80
        default_backend servers

    backend servers
        server server1 127.0.0.1:8000 maxconn 32


    # The same configuration defined with a single listen block. Shorter but
    # less expressive, especially in HTTP mode.
    global
        daemon
        maxconn 256

    defaults
        mode http
        timeout connect 5000ms
        timeout client 50000ms
        timeout server 50000ms

    listen http-in
        bind *:80
        server server1 127.0.0.1:8000 maxconn 32

Assuming haproxy is in $PATH, test these configurations in a shell with:

    $ sudo haproxy -f configuration.conf -c


Global parameters
Parameters in the “global” section are process-wide and often OS-specific. They are generally set once and for all and do not need to be changed once correct. Some of them have command-line equivalents.

The following keywords are supported in the “global” section :

 * Process management and security
   - ca-base
   - chroot
   - crt-base
   - daemon
   - gid
   - group
   - log
   - log-send-hostname
   - nbproc
   - pidfile
   - uid
   - ulimit-n
   - user
   - stats
   - node
   - description
   - unix-bind

 * Performance tuning
   - maxconn
   - maxconnrate
   - maxpipes
   - maxsslconn
   - noepoll
   - nokqueue
   - nopoll
   - nosepoll
   - nosplice
   - spread-checks
   - tune.bufsize
   - tune.chksize
   - tune.http.maxhdr
   - tune.maxaccept
   - tune.maxpollevents
   - tune.maxrewrite
   - tune.pipesize
   - tune.rcvbuf.client
   - tune.rcvbuf.server
   - tune.sndbuf.client
   - tune.sndbuf.server

 * Debugging
   - debug
   - quiet
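
A short sketch of a “global” section using a few of the keywords above (the paths, user and group are assumptions for a typical installation):

    global
        daemon
        user    haproxy
        group   haproxy
        chroot  /var/lib/haproxy
        pidfile /var/run/haproxy.pid
        log     127.0.0.1 local0
        maxconn 4096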

Source

ACLs

The use of Access Control Lists (ACL) provides a flexible solution to perform content switching and generally to take decisions based on content extracted from the request, the response or any environmental status. The principle is simple:

  • define test criteria with sets of values
  • perform actions only if a set of tests is valid


The actions generally consist in blocking the request, or selecting a backend.

In order to define a test, the acl keyword is used. The syntax is:

acl <aclname> <criterion> [flags] [operator] <value> ...

This creates a new ACL <aclname> or completes an existing one with new tests. Those tests apply to the portion of request/response specified in <criterion> and may be adjusted with optional flags [flags]. Some criteria also support an operator which may be specified before the set of values. The values are of the type supported by the criterion, and are separated by spaces.

ACL names must be formed from upper and lower case letters, digits, dash, underscore, dot and colon. ACL names are case-sensitive, which means that “my_acl” and “My_Acl” are two different ACLs.

There is no enforced limit to the number of ACLs. The unused ones do not affect performance, they just consume a small amount of memory.

The following ACL flags are currently supported:

-i : ignore case during matching of all subsequent patterns.
-f : load patterns from a file.
-- : force end of flags. Useful when a string looks like one of the flags.

The ”-f” flag is special as it loads all of the lines it finds in the file specified in argument and loads all of them before continuing. It is even possible to pass multiple ”-f” arguments if the patterns are to be loaded from multiple files. Empty lines as well as lines beginning with a sharp ('#') will be ignored. All leading spaces and tabs will be stripped. If it is absolutely needed to insert a valid pattern beginning with a sharp, just prefix it with a space so that it is not taken for a comment. Depending on the data type and match method, haproxy may load the lines into a binary tree, allowing very fast lookups. This is true for IPv4 and exact string matching. In this case, duplicates will automatically be removed. Also, note that the ”-i” flag applies to subsequent entries and not to entries loaded from files preceding it. For instance:

acl valid-ua hdr(user-agent) -f exact-ua.lst -i -f generic-ua.lst  test

In this example, each line of “exact-ua.lst” will be exactly matched against the “user-agent” header of the request. Then each line of “generic-ua” will be case-insensitively matched. Then the word “test” will be insensitively matched too.

Note that right now it is difficult for the ACL parsers to report errors, so if a file is unreadable or unparsable, the most you'll get is a parse error in the ACL. Thus, file-based ACLs should only be produced by reliable processes.

Supported types of values are:

  • integers or integer ranges
  • strings
  • regular expressions
  • IP addresses and networks
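
As an illustration of these value types, the following ACLs (the names and addresses are made up) match an integer port, a string header, a regular expression on the path, and IP networks respectively:

acl is_https   dst_port      443
acl host_img   hdr(host) -i  img.example.com
acl is_cgi     path_reg      ^/cgi-bin/.*\.pl$
acl from_lan   src           10.0.0.0/8 192.168.0.0/16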


Source

Some actions are only performed upon a valid condition. A condition is a combination of ACLs with operators. 3 operators are supported:

    AND (implicit)
    OR (explicit with the "or" keyword or the "||" operator)
    Negation with the exclamation mark ("!") 

A condition is formed as a disjunctive form:

    [!]acl1 [!]acl2 ... [!]acln { or [!]acl1 [!]acl2 ... [!]acln } ... 

Such conditions are generally used after an “if” or “unless” statement, indicating when the condition will trigger the action.

For instance, to block HTTP requests to the “*” URL with methods other than “OPTIONS”, as well as POST requests without content-length, and GET or HEAD requests with a content-length greater than 0, and finally every request which is not either GET/HEAD/POST/OPTIONS!

acl missing_cl hdr_cnt(Content-length) eq 0
block if HTTP_URL_STAR !METH_OPTIONS || METH_POST missing_cl
block if METH_GET HTTP_CONTENT
block unless METH_GET or METH_POST or METH_OPTIONS

To select a different backend for requests to static contents on the “www” site and to every request on the “img”, “video”, “download” and “ftp” hosts:

acl url_static  path_beg         /static /images /img /css
acl url_static  path_end         .gif .png .jpg .css .js
acl host_www    hdr_beg(host) -i www
acl host_static hdr_beg(host) -i img. video. download. ftp.

# now use backend "static" for all static-only hosts, and for static urls
# of host "www". Use backend "www" for the rest.
use_backend static if host_static or host_www url_static
use_backend www    if host_www

It is also possible to form rules using “anonymous ACLs”. Those are unnamed ACL expressions that are built on the fly without needing to be declared. They must be enclosed between braces, with a space before and after each brace (because the braces must be seen as independent words). Example:

The following rule:

acl missing_cl hdr_cnt(Content-length) eq 0
block if METH_POST missing_cl

Can also be written that way:

block if METH_POST { hdr_cnt(Content-length) eq 0 }

It is generally not recommended to use this construct because it's a lot easier to leave errors in the configuration when written that way. However, for very simple rules matching only one source IP address for instance, it can make more sense to use them than to declare ACLs with random names. Another example of good use is the following:

With named ACLs:

acl site_dead nbsrv(dynamic) lt 2
acl site_dead nbsrv(static)  lt 2
monitor fail  if site_dead

With anonymous ACLs:

monitor fail if { nbsrv(dynamic) lt 2 } || { nbsrv(static) lt 2 }

Source

load balancing algorithms

balance <algorithm> [ <arguments> ]
balance url_param <param> [check_post [<max_wait>]]


Sections

                 Defaults   Frontend   Listen   Backend
May be used in:  yes        no         yes      yes


Arguments
<algorithm>

The algorithm used to select a server when doing load balancing. This only applies when no persistence information is available, or when a connection is redispatched to another server. <algorithm> may be one of the following:

roundrobin
Each server is used in turns, according to their weights. This is the smoothest and fairest algorithm when the server's processing time remains equally distributed. This algorithm is dynamic, which means that server weights may be adjusted on the fly for slow starts for instance. It is limited by design to 4128 active servers per backend. Note that in some large farms, when a server comes back up after having been down for a very short time, it may sometimes take a few hundred requests for it to be re-integrated into the farm and start receiving traffic. This is normal, though very rare. It is indicated here in case you would have the chance to observe it, so that you don't worry.

static-rr
Each server is used in turns, according to their weights. This algorithm is similar to roundrobin except that it is static, which means that changing a server's weight on the fly will have no effect. On the other hand, it has no design limitation on the number of servers, and when a server goes up, it is always immediately reintroduced into the farm, once the full map is recomputed. It also uses slightly less CPU to run (around -1%).

leastconn
The server with the lowest number of connections receives the connection. Round-robin is performed within groups of servers of the same load to ensure that all servers will be used. Use of this algorithm is recommended where very long sessions are expected, such as LDAP, SQL, TSE, etc… but is not very well suited for protocols using short sessions such as HTTP. This algorithm is dynamic, which means that server weights may be adjusted on the fly for slow starts for instance.

source
The source IP address is hashed and divided by the total weight of the running servers to designate which server will receive the request. This ensures that the same client IP address will always reach the same server as long as no server goes down or up. If the hash result changes due to the number of running servers changing, many clients will be directed to a different server. This algorithm is generally used in TCP mode where no cookie may be inserted. It may also be used on the Internet to provide a best-effort stickiness to clients which refuse session cookies. This algorithm is static by default, which means that changing a server's weight on the fly will have no effect, but this can be changed using hash-type.

uri
The left part of the URI (before the question mark) is hashed and divided by the total weight of the running servers. The result designates which server will receive the request. This ensures that the same URI will always be directed to the same server as long as no server goes up or down. This is used with proxy caches and anti-virus proxies in order to maximize the cache hit rate. Note that this algorithm may only be used in an HTTP backend. This algorithm is static by default, which means that changing a server's weight on the fly will have no effect, but this can be changed using hash-type.

This algorithm supports two optional parameters “len” and “depth”, both followed by a positive integer number. These options may be helpful when it is needed to balance servers based on the beginning of the URI only. The “len” parameter indicates that the algorithm should only consider that many characters at the beginning of the URI to compute the hash. Note that having “len” set to 1 rarely makes sense since most URIs start with a leading ”/”.

The “depth” parameter indicates the maximum directory depth to be used to compute the hash. One level is counted for each slash in the request. If both parameters are specified, the evaluation stops when either is reached.

url_param
The URL parameter specified in argument will be looked up in the query string of each HTTP GET request.

If the modifier “check_post” is used, then an HTTP POST request entity will be searched for the parameter argument, when the question mark indicating a query string ('?') is not present in the URL. Optionally, specify a number of octets to wait for before attempting to search the message body. If the entity can not be searched, then round robin is used for each request. For instance, if your clients always send the LB parameter in the first 128 bytes, then specify that. The default is 48. The entity data will not be scanned until the required number of octets have arrived at the gateway, this is the minimum of: (default/max_wait, Content-Length or first chunk length). If Content-Length is missing or zero, it does not need to wait for more data than the client promised to send. When Content-Length is present and larger than <max_wait>, then waiting is limited to <max_wait> and it is assumed that this will be enough data to search for the presence of the parameter. In the unlikely event that Transfer-Encoding: chunked is used, only the first chunk is scanned. Parameter values separated by a chunk boundary, may be randomly balanced if at all.

If the parameter is found followed by an equal sign ('=') and a value, then the value is hashed and divided by the total weight of the running servers. The result designates which server will receive the request.

This is used to track user identifiers in requests and ensure that a same user ID will always be sent to the same server as long as no server goes up or down. If no value is found or if the parameter is not found, then a round robin algorithm is applied. Note that this algorithm may only be used in an HTTP backend. This algorithm is static by default, which means that changing a server's weight on the fly will have no effect, but this can be changed using hash-type.

hdr(name)
The HTTP header <name> will be looked up in each HTTP request. Just as with the equivalent ACL 'hdr()' function, the header name in parenthesis is not case sensitive. If the header is absent or if it does not contain any value, the roundrobin algorithm is applied instead.

An optional 'use_domain_only' parameter is available, for reducing the hash algorithm to the main domain part with some specific headers such as 'Host'. For instance, in the Host value “haproxy.1wt.eu”, only “1wt” will be considered.

This algorithm is static by default, which means that changing a server's weight on the fly will have no effect, but this can be changed using hash-type.

rdp-cookie
rdp-cookie(name)
The RDP cookie <name> (or “mstshash” if omitted) will be looked up and hashed for each incoming TCP request. Just as with the equivalent ACL 'req_rdp_cookie()' function, the name is not case-sensitive. This mechanism is useful as a degraded persistence mode, as it makes it possible to always send the same user (or the same session ID) to the same server. If the cookie is not found, the normal roundrobin algorithm is used instead.

Note that for this to work, the frontend must ensure that an RDP cookie is already present in the request buffer. For this you must use 'tcp-request content accept' rule combined with a 'req_rdp_cookie_cnt' ACL.

This algorithm is static by default, which means that changing a server's weight on the fly will have no effect, but this can be changed using hash-type.

<arguments>
An optional list of arguments which may be needed by some algorithms. Right now, only “url_param” and “uri” support an optional argument.

balance uri [len <len>] [depth <depth>]
balance url_param <param> [check_post [<max_wait>]]


Description
The load balancing algorithm of a backend is set to roundrobin when no other algorithm, mode nor option have been set. The algorithm may only be set once for each backend.

Note: the following caveats and limitations on using the “check_post” extension with “url_param” must be considered:

  • All POST requests are eligible for consideration, because there is no way to determine if the parameters will be found in the body or entity which may contain binary data. Therefore another method may be required to restrict consideration of POST requests that have no URL parameters in the body. (see acl reqideny http_end)
  • Using a <max_wait> value larger than the request buffer size does not make sense and is useless. The buffer size is set at build time, and defaults to 16 kB.
  • Content-Encoding is not supported, the parameter search will probably fail; and load balancing will fall back to Round Robin.
  • Expect: 100-continue is not supported, load balancing will fall back to Round Robin.
  • Transfer-Encoding (RFC2616 3.6.1) is only supported in the first chunk. If the entire parameter value is not present in the first chunk, the selection of server is undefined (actually, defined by how little actually appeared in the first chunk).

This feature does not support generation of a 100, 411 or 501 response.

  • In some cases, requesting “check_post” MAY attempt to scan the entire contents of a message body. Scanning normally terminates when linear white space or control characters are found, indicating the end of what might be a URL parameter list. This is probably not a concern with SGML type message bodies.


Examples

balance roundrobin
balance url_param userid
balance url_param session_id check_post 64
balance hdr(User-Agent)
balance hdr(host)
balance hdr(Host) use_domain_only

Source

331.3 LinuxPMI (weight: 1)

Candidates should understand the concepts of LinuxPMI. Basic experience in the installation of LinuxPMI is also expected. Key Knowledge Areas:

  • kernel patching
  • SSI vs MSI


The following is a partial list of the used files, terms and utilities:

  • linuxPMI

linuxPMI

LinuxPMI (Linux Process Migration Infrastructure) is a Linux kernel extension for multi-system-image (in contrast to a single-system image) clustering. The project is a continuation of the abandoned openMosix clustering project.
How LinuxPMI works is perhaps best described by an example. You may have a network of ten computers (nodes) and there is one user working on each node. Nine users are working on simple tasks that do not necessarily use their machine to the full potential. One user is working on a program that spawns a lot of jobs that would overload his computer. Since we have nine nodes on the network that have a lot of free resources, they can basically take over some of the jobs from the one computer that would otherwise be overloaded. In other words, LinuxPMI will migrate jobs from busy computers to computers that are able to perform the same task faster. Even if all ten users were using their machines for heavy tasks, it could be that not all machines are fully occupied at the same time, and LinuxPMI will use these to reduce the load on other machines.

This is in many ways similar in principle to how a multi-user operating system manages workload on a multi-CPU system; however, LinuxPMI can have machines (nodes) with several CPUs. You can also add or remove nodes on a running cluster thus expanding or reducing the total computing power of the system.

In short it means that ten computers are able to work as one large computer; however, there is no master machine, so each machine can be used as an individual workstation.

kernel patching

Always use pure vanilla kernel-sources from http://www.kernel.org/ to compile an openMosix kernel! Please be kind enough to download the kernel using a mirror near you and always try to download patches to the latest kernel sources you already have instead of downloading the whole thing. This is going to be much appreciated by the Linux community and will greatly increase your geeky Karma ;-) Be sure to use the right openMosix patch depending on the kernel-version. At the moment I write this, the latest 2.4 kernel is 2.4.20 so you should download the openMosix-2.4.20-x.gz patch, where the “x” stands for the patch revision (ie: the greater the revision number, the more recent it is). Do not use the kernel that comes with any Linux-distribution: it won't work. These kernel sources get heavily patched by the distribution-makers, so applying the openMosix patch to such a kernel is going to fail for sure! Been there, done that: trust me ;-)

Download the current version of the openMosix patch and move it into your kernel-source directory (e.g. /usr/src/linux-2.4.20). If your kernel-source directory is something other than ”/usr/src/linux-[version_number]”, you should at least create a symbolic link named ”/usr/src/linux-[version_number]” pointing to it. Supposing you're the root user and you've downloaded the gzipped patch file into your home directory, apply the patch using (guess what?) the patch utility:

mv /root/openMosix-2.4.20-2.gz /usr/src/linux-2.4.20
cd /usr/src/linux-2.4.20
zcat openMosix-2.4.20-2.gz | patch -Np1

In the rare case you don't have “zcat” on your system, do:

mv /root/openMosix-2.4.20-2.gz /usr/src/linux-2.4.20
cd /usr/src/linux-2.4.20
gunzip openMosix-2.4.20-2.gz
cat openMosix-2.4.20-2 | patch -Np1

In the even weirder case that you don't have “cat” on your system (!), do:

mv /root/openMosix-2.4.20-2.gz /usr/src/linux-2.4.20
cd /usr/src/linux-2.4.20
gunzip openMosix-2.4.20-2.gz
patch -Np1 < openMosix-2.4.20-2

The “patch” command should now display a list of patched files from the kernel-sources. If you feel adventurous enough, enable the openMosix related options in the kernel-configuration file, e.g.

...
CONFIG_MOSIX=y
# CONFIG_MOSIX_TOPOLOGY is not set
CONFIG_MOSIX_UDB=y
# CONFIG_MOSIX_DEBUG is not set
# CONFIG_MOSIX_CHEAT_MIGSELF is not set
CONFIG_MOSIX_WEEEEEEEEE=y
CONFIG_MOSIX_DIAG=y
CONFIG_MOSIX_SECUREPORTS=y
CONFIG_MOSIX_DISCLOSURE=3
CONFIG_QKERNEL_EXT=y
CONFIG_MOSIX_DFSA=y
CONFIG_MOSIX_FS=y
CONFIG_MOSIX_PIPE_EXCEPTIONS=y
CONFIG_QOS_JID=y
...

However, it is going to be much easier if you configure the above options using one of the Linux kernel configuration tools:

make config | menuconfig | xconfig

The above means you have to choose one of “config”, “menuconfig”, and “xconfig”. It's a matter of taste. By the way, “config” is going to work on any system; “menuconfig” needs the curses libraries installed while “xconfig” needs an installed X-window environment plus the TCL/TK libraries and interpreters.

Now compile it with:

make dep bzImage modules modules_install

After compilation, install the new kernel with the openMosix options within your boot-loader; e.g. insert an entry for the new kernel in /etc/lilo.conf and run lilo after that.

Reboot and your openMosix-cluster-node is up!

Syntax of the /etc/openmosix.map file

Before starting openMosix, there has to be an /etc/openmosix.map configuration file, which must be the same on each node.

The standard is now /etc/openmosix.map; /etc/mosix.map and /etc/hpc.map are old standards, but the CVS version of the tools is backwards compatible and looks for /etc/openmosix.map, /etc/mosix.map and /etc/hpc.map (in that order).

The openmosix.map file contains three space separated fields:

openMosix-Node_ID               IP-Address(or hostname)          Range-size

An example openmosix.map file could look like this:

1       node1   1
2       node2   1
3       node3   1
4       node4   1

or

1       192.168.1.1     1
2       192.168.1.2     1
3       192.168.1.3     1
4       192.168.1.4     1

or, with the help of the range-size field, both of the above examples are equivalent to:

1       192.168.1.1     4

openMosix “counts-up” the last byte of the ip-address of the node according to its openMosix-Node_ID. Of course, if you use a range-size greater than 1 you have to use ip-addresses instead of hostnames.

If a node has more than one network interface, it can be configured with the ALIAS option in the range-size field (which is equivalent to setting the range-size to 0), e.g.

1       192.168.1.1     1
2       192.168.1.2     1
3       192.168.1.3     1
4       192.168.1.4     1
4       192.168.10.10   ALIAS

Here the node with the openMosix-Node_ID 4 has two network-interfaces (192.168.1.4 + 192.168.10.10) which are both visible to openMosix.

Always be sure to run the same openMosix version AND configuration on each of your Cluster's nodes!

Start openMosix with the “setpe” utility on each node:

setpe -w -f /etc/openmosix.map

Execute this command on every node in your openMosix cluster.

Alternatively, you can grab the “openmosix” script which can be found in the scripts directory of the userspace-tools, copy it to the /etc/init.d directory, chmod 0755 it, then use the following commands as root:

/etc/init.d/openmosix stop
/etc/init.d/openmosix start
/etc/init.d/openmosix restart
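
The copy and chmod steps mentioned above might look like the following rough sketch, assuming the userspace-tools were unpacked in /usr/src/openmosix-tools (adjust the path to wherever you extracted them):

cp /usr/src/openmosix-tools/scripts/openmosix /etc/init.d/openmosix
chmod 0755 /etc/init.d/openmosix
/etc/init.d/openmosix start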

Installation is finished now: the cluster is up and running :)

Source

SSI vs MSI

Multiple-system image (LinuxPMI)

  • Transparent process migration
    • Systems can be relatively heterogeneous (except CPU type)
    • Automatic management and load balancing
  • Almost no resource sharing and IPC
    • Except CPU, physical memory, pipes and trivial cases


Single-system image (MOSIX, Amoeba)

  • Transparent process migration
    • Nodes are almost fully homogeneous
  • Full resource sharing and IPC
    • Single filesystem hierarchy, global resource naming and access by design. Sometimes with hardware support (RDMA)


In distributed computing, a single system image (SSI) cluster is a cluster of machines that appears to be one single system. The concept is often considered synonymous with that of a distributed operating system, but a single image may be presented for more limited purposes, just job scheduling for instance, which may be achieved by means of an additional layer of software over conventional operating system images running on each node. The interest in SSI clusters is based on the perception that they may be simpler to use and administer than more specialized clusters.

Different SSI systems may provide a more or less complete illusion of a single system.

Features of SSI clustering systems

Different SSI systems may, depending on their intended usage, provide some subset of these features.
Process migration
Many SSI systems provide process migration.[6] Processes may start on one node and be moved to another node, possibly for resource balancing or administrative reasons.[note 1] As processes are moved from one node to another, other associated resources (for example IPC resources) may be moved with them.

Process checkpointing
Some SSI systems allow checkpointing of running processes, allowing their current state to be saved and reloaded at a later date. Checkpointing can be seen as related to migration, as migrating a process from one node to another can be implemented by first checkpointing the process, then restarting it on another node. Alternatively checkpointing can be considered as migration to disk.

Single process space
Some SSI systems provide the illusion that all processes are running on the same machine - the process management tools (e.g. “ps”, “kill” on Unix like systems) operate on all processes in the cluster.

Single root
Most SSI systems provide a single view of the file system. This may be achieved by a simple NFS server, shared disk devices or even file replication.
The advantage of a single root view is that processes may be run on any available node and access needed files with no special precautions. If the cluster implements process migration a single root view enables direct accesses to the files from the node where the process is currently running.

Some SSI systems provide a way of “breaking the illusion”, having some node-specific files even in a single root. HP TruCluster provides a “context dependent symbolic link” (CDSL) which points to different files depending on the node that accesses it. HP VMScluster provides a search list logical name with node specific files occluding cluster shared files where necessary. This capability may be necessary to deal with heterogeneous clusters, where not all nodes have the same configuration. In more complex configurations such as multiple nodes of multiple architectures over multiple sites, several local disks may combine to form the logical single root.

Single I/O space
Some SSI systems allow all nodes to access the I/O devices (e.g. tapes, disks, serial lines and so on) of other nodes. There may be some restrictions on the kinds of accesses allowed (For example OpenSSI can't mount disk devices from one node on another node).

Single IPC space
Some SSI systems allow processes on different nodes to communicate using inter-process communications mechanisms as if they were running on the same machine. On some SSI systems this can even include shared memory (can be emulated with Software Distributed shared memory).

In most cases inter-node IPC will be slower than IPC on the same machine, possibly drastically slower for shared memory. Some SSI clusters include special hardware to reduce this slowdown.

Cluster IP address
Some SSI systems provide a “cluster address”, a single address visible from outside the cluster that can be used to contact the cluster as if it were one machine. This can be used for load balancing inbound calls to the cluster, directing them to lightly loaded nodes, or for redundancy, moving the cluster address from one machine to another as nodes join or leave the cluster.

Source

Topic 332: Cluster Management

332.1 Pacemaker (weight: 5)

Candidates should have experience in the installation, configuration, maintenance and troubleshooting of the Pacemaker cluster management set of technologies. This includes the use of heartbeat version 2. Key Knowledge Areas:

  • Essential cluster configuration
  • resource agents


The following is a partial list of the used files, terms and utilities:

  • crmd
  • PEngine
  • CIB ptest
  • cibadmin
  • crmadmin
  • crm_* resource agents (heartbeat v2, LSB, OCF)
  • authkeys
  • /usr/lib/heartbeat/ResourceManager
  • /etc/ha.d/

Pacemaker

Here is a list of the possible components that might make up a Pacemaker installation:

  • Pacemaker - Resource manager
  • Corosync - Messaging layer
  • Heartbeat - Also a messaging layer
  • Resource Agents - Scripts that know how to control various services

Pacemaker is the thing that starts and stops services (like your database or mail server) and contains logic for ensuring both that they’re running, and that they’re only running in one location (to avoid data corruption).

But it can’t do that without the ability to talk to instances of itself on the other node(s), which is where Heartbeat and/or Corosync come in.

Think of Heartbeat and Corosync as dbus but between nodes. Somewhere that any node can throw messages on and know that they’ll be received by all its peers. This bus also ensures that everyone agrees who is (and is not) connected to the bus and tells Pacemaker when that list changes.

For two nodes Pacemaker could just as easily use sockets, but beyond that the complexity grows quite rapidly and is very hard to get right - so it really makes sense to use existing components that have proven to be reliable.

You only need one of them though :-)

Source

Essential cluster configuration

Console configuration

Enter configuration mode

Connect to the crm manager

[root@web1 ~]# crm

Create a new shadow copy of the current configuration named test-conf

crm(live)# cib new test-conf

Enter in the configuration menu

crm(live)# cib use test-conf
crm(test-conf)# configure
crm(test-conf)configure#

See current configuration
see the crm configuration
This will show you the base CRM directives configured in your currently running Heartbeat cluster

crm(test-conf)configure# show


see the cib.xml
This will show you the cib.xml contents currently in use on your running Heartbeat cluster

crm(test-conf)configure# show xml

Commit changes
Verify that your changes will not break the current setup

crm(test-conf)configure# verify

Exit configuration mode

crm(test-conf)configure# end

Commit the changes you have made in test-conf

crm(live)# cib commit test-conf

Exit from the CRM CLI interface

crm(live)# quit
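
Once the shadow CIB has been committed, a quick way to verify that the cluster picked up your changes is a one-shot status dump with crm_mon (shipped with Pacemaker); the exact output depends on your configuration:

crm_mon -1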

In all examples

  • you should first create a shadow cib
  • you will have to commit changes made in the shadow cib
  • our shared IP is: 85.9.12.3
  • our default gateway IP is: 85.9.12.100
  • machine 1 is with hostname jaba.failover.net
  • machine 2 is with hostname joda.failover.net
  • stonith is disabled
  • you have configured both nodes jaba and joda in the cib.xml (if not, please see the XML examples)


Failover IP
Additional assumptions for this example:

  • we monitor the IP every 10 seconds

Here we create a resource which will use IPaddr (an OCF script, provided by Heartbeat). Tell this resource that it has one parameter (ip) with one value (85.9.12.3). Tell this resource that it has one operation (monitor) with one parameter (interval) whose value is 10s.

crm(test-conf)configure# primitive failover-ip ocf:heartbeat:IPaddr params ip=85.9.12.3 op monitor interval=10s


Failover IP + One service
Here we assume that:

  • our service that we migrate is Apache
  • we monitor the IP every 10 seconds
  • we monitor the Service(apache) every 15 seconds


Here we create a resource which will use IPaddr (an OCF script, provided by Heartbeat). Tell this resource that it has one parameter (ip) with one value (85.9.12.3). Tell this resource that it has one operation (monitor) with one parameter (interval) whose value is 10s.

crm(test-conf)configure# primitive failover-ip ocf:heartbeat:IPaddr params ip=85.9.12.3 op monitor \
 interval=10s

Here we create another resource which will use apache (an LSB script, default location /etc/init.d/apache). Tell this resource that it has one operation (monitor) with one parameter (interval) whose value is 15s.

crm(test-conf)configure# primitive failover-apache lsb::apache op monitor \
 interval=15s


Failover IP Service in a Group
Here we assume that:

  • our service that we migrate is Apache
  • we monitor the IP every 10 seconds
  • we monitor the Service(apache) every 15 seconds
  • we have both the IP and the Service in a group called my_web_cluster
crm(test-conf)configure# primitive failover-ip ocf:heartbeat:IPaddr params ip=85.9.12.3 op monitor \
 interval=10s
crm(test-conf)configure# primitive failover-apache lsb::apache op monitor interval=15s
crm(test-conf)configure# group my_web_cluster failover-ip failover-apache


Failover IP Service in a Group running on a connected node
We still assume the last example, but on top of that we want the group to run on a node that has a working network connection to our default gateway. Therefore, we configure pingd and create a location constraint that looks at the pingd attribute representing that network connectivity.

Set up pingd
You do not have to make any change to /etc/ha.d/ha.cf for this to work:

crm(pingd)configure# primitive pingd ocf:pacemaker:pingd \ 
                      params host_list=85.9.12.100 multiplier=100 \
                      op monitor interval=15s timeout=5s
crm(pingd)configure# clone pingdclone pingd meta globally-unique=false


pingd location constraint
This tells the cluster to only run the group on a node with a working network connection to the default gateway.

crm(pingd)configure# location my_web_cluster_on_connected_node my_web_cluster \
 rule -inf: not_defined pingd or pingd lte 0

Source

XML Config

All examples share these crm nodes config

    <crm_config>
       <cluster_property_set id="cib-bootstrap-options">
         <nvpair id="option-1" name="symmetric-cluster" value="false"/>
         <nvpair id="option-2" name="no-quorum-policy" value="stop"/>
         <nvpair id="option-3" name="stonith-enabled" value="false"/>
       </cluster_property_set>
     </crm_config>
     <nodes>
       <node id="8323c40f-76eb-4187-8c44-51547dc5cd73" uname="jaba.failover.net" type="normal"/>
       <node id="8405b0df-9044-477f-94fa-412e0e071b94" uname="joda.failover.net" type="normal"/>
     </nodes>

Failover IP

<cib>
  <configuration>
    <crm_config/>
    <nodes/>
    <resources>
      <primitive id="failover-ip" class="ocf" provider="heartbeat" type="IPaddr">
        <operations>
          <op id="failover-ip-monitor" name="monitor" interval="10s"/>
        </operations>
        <instance_attributes id="failover-ip-attribs">
          <nvpair id="failover-ip-addr" name="ip" value="85.9.12.3"/>
        </instance_attributes>
      </primitive>
    </resources>
  </configuration>
  <status/>
</cib>

Failover IP + One service

<cib>
  <configuration>
    <crm_config/>
    <nodes/>
    <resources>
      <primitive class="ocf" id="failover-ip" provider="heartbeat" type="IPaddr">
        <instance_attributes id="failover-ip-instance_attributes">
          <nvpair id="failover-ip-instance_attributes-ip" name="ip" value="85.9.12.3"/>
        </instance_attributes>
        <operations id="failover-ip-ops">
          <op id="failover-ip-monitor-10s" interval="10s" name="monitor"/>
        </operations>
      </primitive>
      <primitive class="lsb" id="failover-apache" type="apache">
        <operations id="failover-apache-ops">
          <op id="failover-apache-monitor-15s" interval="15s" name="monitor"/>
        </operations>
      </primitive>
    </resources>
  </configuration>
  <status/>
</cib>

Failover IP Service in a Group

<cib>
  <configuration>
    <crm_config/>
    <nodes/>
    <resources>
      <group id="my_web_cluster">
        <primitive class="ocf" id="failover-ip" provider="heartbeat" type="IPaddr">
          <instance_attributes id="failover-ip-instance_attributes">
            <nvpair id="failover-ip-instance_attributes-ip" name="ip" value="85.9.12.3"/>
          </instance_attributes>
          <operations id="failover-ip-ops">
            <op id="failover-ip-monitor-10s" interval="10s" name="monitor"/>
          </operations>
        </primitive>
        <primitive class="lsb" id="failover-apache" type="apache">
          <operations id="failover-apache-ops">
            <op id="failover-apache-monitor-15s" interval="15s" name="monitor"/>
          </operations>
        </primitive>
      </group>
    </resources>
  </configuration>
  <status/>
</cib>

Source

resource agents

A resource agent is a standardized interface for a cluster resource. It translates a standard set of operations into steps specific to the resource or application, and interprets their results as success or failure.

Pacemaker supports three types of Resource Agents,

  • LSB Resource Agents,
  • OCF Resource Agents,
  • legacy Heartbeat Resource Agents

Supported Operations
Operations which a resource agent may perform on a resource instance include:

  • start: enable or start the given resource
  • stop: disable or stop the given resource
  • monitor: check whether the given resource is running (and/or doing useful work), return status as running or not running
  • validate-all: validate the resource's configuration
  • meta-data: return information about the resource agent itself (used by GUIs and other management utilities, and documentation tools)
  • some more, see OCF Resource Agents and the Pacemaker documentation for details.

Implementation
Most resource agents are coded as shell scripts. This, however, is by no means a necessity – the defined interface is language agnostic.

They are synchronous in nature. That is, you start them, and they complete some time later, and you are expected to wait for them to complete. Certain operations (notably start, stop and monitor) may take considerable time to complete. Considerable time means seconds to many minutes in some cases.

Source

crmd

Short for Cluster Resource Management Daemon. Largely a message broker for the PEngine and LRM, it also elects a leader to co-ordinate the activities (including starting/stopping resources) of the cluster.

PEngine

Short for Policy Engine. Computes the next state of the cluster based on the current state and the configuration. Produces a transition graph containing a list of actions and dependencies.

CIB ptest

Short for Cluster Information Base. Contains definitions of all cluster options, nodes, resources, their relationships to one another and current status. Synchronizes updates to all cluster nodes. ptest is a companion command-line utility that can be used to show or simulate the actions the Policy Engine would compute for a given CIB.

cibadmin

Name
Pacemaker - Part of the Pacemaker cluster resource manager
Synopsis
cibadmin command [options] [data]
Description
cibadmin - Provides direct access to the cluster configuration.

Allows the configuration, or sections of it, to be queried, modified, replaced and deleted.
Where necessary, XML data will be obtained using the -X, -x, or -p options

Options

-?, --help
    This text 
-$, --version
    Version information 
-V, --verbose
    Increase debug output

Commands:

-u, --upgrade
    Upgrade the configuration to the latest syntax 
-Q, --query
    Query the contents of the CIB 
-E, --erase
    Erase the contents of the whole CIB 
-B, --bump
    Increase the CIB's epoch value by 1 
-C, --create
    Create an object in the CIB. Will fail if the object already exists. 
-M, --modify
    Find the object somewhere in the CIB's XML tree and update it. Fails if the object does not 
    exist unless -c is specified 
-P, --patch
    Supply an update in the form of an xml diff (See also: crm_diff) 
-R, --replace
    Recursively replace an object in the CIB 
-D, --delete
    Delete the first object matching the supplied criteria, Eg. <op id="rsc1_op1" name="monitor"/> 
The tagname and all attributes must match in order for the element to be deleted
-d, --delete-all
    When used with --xpath, remove all matching objects in the configuration instead of just the 
    first one 
-5, --md5-sum
    Calculate a CIB digest 
-S, --sync
    (Advanced) Force a refresh of the CIB to all nodes

Additional options:

-f, --force
-t, --timeout=value
    Time (in seconds) to wait before declaring the operation failed 
-s, --sync-call
    Wait for call to complete before returning 
-l, --local
    Command takes effect locally. Should only be used for queries 
-c, --allow-create
    (Advanced) Allow the target of a -M operation to be created if it does not exist 
-n, --no-children
    (Advanced) When querying an object, do not include its children in the result

Data:

-X, --xml-text=value
    Retrieve XML from the supplied string 
-x, --xml-file=value
    Retrieve XML from the named file 
-p, --xml-pipe
    Retrieve XML from stdin 
-A, --xpath=value
    A valid XPath to use instead of -o 
-o, --scope=value
    Limit the scope of the operation to a specific section of the CIB. Valid values are: nodes, 
    resources, constraints, crm_config, rsc_defaults, op_defaults, status 
-N, --node=value
    (Advanced) Send command to the specified host

Examples

Query the configuration from the local node:

# cibadmin --query --local

Query just the cluster options configuration:

# cibadmin --query --scope crm_config

Query all 'target-role' settings:

# cibadmin --query --xpath "//nvpair[@name='target-role']"

Remove all 'is-managed' settings:

# cibadmin --delete-all --xpath "//nvpair[@name='is-managed']"

Remove the resource named 'old':

# cibadmin --delete --xml-text '<primitive id="old"/>'

Remove all resources from the configuration:

# cibadmin --replace --scope resources --xml-text '<resources/>'

Replace the complete configuration with the contents of $HOME/pacemaker.xml:

# cibadmin --replace --xml-file $HOME/pacemaker.xml

Replace the constraints section of the configuration with the contents of $HOME/constraints.xml:

# cibadmin --replace --scope constraints --xml-file $HOME/constraints.xml

Increase the configuration version to prevent old configurations from being loaded accidentally:

# cibadmin --modify --xml-text '<cib admin_epoch="admin_epoch++"/>'

Edit the configuration with your favorite $EDITOR:

# cibadmin --query > $HOME/local.xml
# $EDITOR $HOME/local.xml
# cibadmin --replace --xml-file $HOME/local.xml

Source

crmadmin

Name
Pacemaker - Part of the Pacemaker cluster resource manager
Synopsis
crmadmin command [options]
Description
crmadmin - Development tool for performing some crmd-specific commands.

Likely to be replaced by crm_node in the future

Options

-?, --help
    This text 
-$, --version
    Version information 
-q, --quiet
    Display only the essential query information 
-V, --verbose
    Increase debug output

Commands:

-i, --debug_inc=value
    Increase the crmd's debug level on the specified host 
-d, --debug_dec=value
    Decrease the crmd's debug level on the specified host 
-S, --status=value
    Display the status of the specified node. 
Result is the node's internal FSM state which can be useful for debugging
-D, --dc_lookup
    Display the uname of the node co-ordinating the cluster. 
    This is an internal detail and is rarely useful to administrators except when deciding on which node
    to examine the logs.
-N, --nodes
    Display the uname of all member nodes 
-E, --election
    (Advanced) Start an election for the cluster co-ordinator 
-K, --kill=value
    (Advanced) Shut down the crmd (not the rest of the cluster stack) on the specified node

Additional Options:

-t, --timeout=value
    Time (in milliseconds) to wait before declaring the operation failed 
-B, --bash-export
    Create Bash export entries of the form 'export uname=uuid'

Notes:

The -i,-d,-K and -E commands are rarely used and may be removed in future versions.
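
A few illustrative invocations based on the options above (the node name node1 is a placeholder):

List the uname of all member nodes:

# crmadmin --nodes

Show which node is currently co-ordinating the cluster (the DC):

# crmadmin --dc_lookup

Display the internal FSM state of a specific node:

# crmadmin --status node1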

Source

crm_* resource agents (heartbeat v2, LSB, OCF)

Heartbeat Resource Agents

The legacy Heartbeat resource agents (scripts) are basically LSB init scripts - with slightly odd status operations.

The following is true for all resource agents (even init scripts) for haresources mode, and for class=heartbeat primitives in Pacemaker. But when using Pacemaker, you are usually better off using the corresponding OCF Resource Agents, or, if none is available, real LSB Resource Agents instead.

The only operations on the resource scripts which the cluster performs are:

  • start
  • stop
  • status


These operations are as follows:

start operation
Activate the given resource.

According to the LSB, it is never an error to start an already active resource. Exit with 0 on success, nonzero on failure. The cluster will only start a resource if it wants it to be running on the current machine, and status shows it's not already running. The cluster will never start the same resource at the same time in different nodes in the cluster.

stop operation
Deactivate the given resource.

Performed when we want to make sure a resource is not running. Although there are occasions when we check to see if a resource is running before stopping it, during shutdown, we will stop all resources whether or not we think they're running.

According to the LSB, stopping a resource which is already stopped is always permissible. The cluster will DEFINITELY stop resources it doesn't know are running. Stop failures can result in the machine being rebooted to clear up the error. Note that some init scripts are not LSB-compliant and complain when trying to stop resources which are not running. You'll have to fix those to properly work as cluster resource agents.

status operation
Determine running status of the given resource.

The status operation has to really report status correctly, AND, it has to print either OK or running when the resource is active, and it CANNOT print either of those when it's inactive. For the status operation, we ignore the return code.

This sounds quite odd, but it's a historical hangover for compatibility with earlier versions of Linux distributions where the init scripts didn't reliably give proper status exit codes, but they did print OK or running reliably.

Heartbeat calls the status operation in many places. We do it before starting any resource, and also (IIRC) when releasing resources.

After repeated stop failures, we will do a status on the resource. If the status reports that the resource is still running, then we will reboot the machine to make sure things are really stopped. Note that this behaviour is only with haresources based clusters. CRM/Pacemaker clusters use stonith.

Concurrency
Start, stop and status operations are NEVER overlapped on a given resource on a given machine. You don't have to worry about concurrency of an operation on a resource.

Parameters
Unlike LSB Resource Agents, a Heartbeat Resource Agent can be passed a list of positional parameters. The parameters go before the operation name, like this:

IPaddr 10.10.10.1 start

The haresources line which corresponds to this set of parameters is:

IPaddr::10.10.10.1

and invoked with the start operation.
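
For context, a complete line in /etc/ha.d/haresources starts with the name of the node that normally owns the resources, followed by the resource list; the node name and the second resource here are illustrative:

node1 IPaddr::10.10.10.1 apache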

Location
haresources mode looks for resource scripts in /etc/ha.d/resource.d and /etc/init.d, in that order.

Source

LSB Resource Agents

Background
LSB Resource Agents are those found in /etc/init.d. Generally they are provided by the OS/distribution and in order to be used with Heartbeat or Pacemaker resource management, must conform to the LSB Spec.

The LSB Spec (as it relates to init scripts) can be found at: http://refspecs.linux-foundation.org/LSB_3.2.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html

Many distributions claim LSB compliance but ship with broken init scripts. The most common problems are:

  • Not implementing the status operation at all
  • Not observing the correct exit status codes for start/stop/status actions
  • Starting a started resource returns an error (this violates the LSB spec)
  • Stopping a stopped resource returns an error (this violates the LSB spec)

This may also result in “ERROR: reboot narrowly avoided” messages by heartbeat, or, if the status operation is poorly implemented as well, in actual reboots, to “recover” from “stop failure”.

NOTE: Parameters and options can not be passed to LSB Resource Agents.

Init Script (LSB) Compatibility Checks
Assuming some_service is configured correctly and currently not active, the following sequence will help you determine if it is LSB compatible:
1. Start (stopped)

        /etc/init.d/some_service start ; echo "result: $?" 
  • Did the service start?
  • Did the command print result: 0 (in addition to the regular output)?

2. Status (running)

        /etc/init.d/some_service status ; echo "result: $?" 
  • Did the script accept the command?
  • Did the script indicate the service was running?
  • Did the command print result: 0 (in addition to the regular output)?

3. Start (running)

        /etc/init.d/some_service start ; echo "result: $?" 
  • Is the service still running?
  • Did the command print result: 0 (in addition to the regular output)?

4. Stop (running)

        /etc/init.d/some_service stop ; echo "result: $?" 
  • Was the service stopped?
  • Did the command print result: 0 (in addition to the regular output)?

5. Status (stopped)

        /etc/init.d/some_service status ; echo "result: $?" 
  • Did the script accept the command?
  • Did the script indicate the service was not running?
  • Did the command print result: 3 (in addition to the regular output)?

6. Stop (stopped)

        /etc/init.d/some_service stop ; echo "result: $?" 
  • Is the service still stopped?
  • Did the command print result: 0 (in addition to the regular output)?

7. Status (failed)

  • This step is not readily testable and relies on manual inspection of the script.
  • The script can optionally use one of the other codes (other than 3) listed in the LSB spec to indicate that it is active but failed.
  • In such a case, this tells the cluster that, before moving the resource to another node, it should stop it on the existing one first.
  • Making use of these extra exit codes is encouraged.


If the answer to any of the above questions is no, then the init script is not LSB compliant.

If you are using Pacemaker resource management, then your options at this point are to:

  • fix the init script, or
  • write an OCF Resource Agent based on the existing init script

If you are still using the haresources mode of Heartbeat, then the script may still work as long as it follows the rules for Heartbeat Resource Agents.
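
If you have to run the above checks against several init scripts, a small wrapper along these lines (some_service is a placeholder) simply replays the sequence and prints the exit codes so you can compare them against the expected values (0, 0, 0, 0, 3, 0):

#!/bin/sh
# Replay the LSB compatibility sequence for one init script and show each exit code.
SCRIPT=/etc/init.d/some_service
for action in start status start stop status stop; do
    $SCRIPT $action >/dev/null 2>&1
    echo "$action -> exit code $?"
done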

Source

OCF Resource Agents

Background
The OCF specification is basically an extension of the definitions for LSB Resource Agents.

OCF Resource Agents are those found in /usr/lib/ocf/resource.d/provider

The OCF Spec (as it relates to Resource Agents) can be found at http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/resource-agent-api.txt?rev=HEAD

Comprehensive documentation can be found in the OCF Resource Agent Developer's Guide.
Writing your own OCF Resource Agent mini Howto

Anything found in the /usr/lib/ocf/resource.d/heartbeat is provided as part of the resource-agents (resp. cluster-agents) package, which you should install together with Heartbeat and Pacemaker. When creating your own agents, you are encouraged to create a new directory under /usr/lib/ocf/resource.d/ and use provider={your subdirectory name}. So, for example, if you want to name your provider dubrouski, and create a resource named serge, you would make a directory called /usr/lib/ocf/resource.d/dubrouski and name your resource script /usr/lib/ocf/resource.d/dubrouski/serge.

For convenience, many of the return codes, defaults and other OCF utility functions are available to be included by custom OCF agents from /usr/lib/heartbeat/ocf-shellfuncs

Beware: the Linux-HA implementation has been somewhat extended beyond the OCF specs, but none of those changes are incompatible with the OCF specification.

When writing/testing your OCF Resource Agent, you may find the ocf-tester script to be very useful. It comes in the resource-agents package (resp. cluster-agents, on Debian based distros).
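
For illustration, an ocf-tester run against the hypothetical agent from the example above might look like this (the resource name, the ip parameter and the path are assumptions; adjust them to your agent):

ocf-tester -n serge -o ip=192.168.1.100 /usr/lib/ocf/resource.d/dubrouski/serge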

Actions
Normal OCF Resource Agents are required to have these actions:

  • start - start the resource. Exit 0 when the resource is correctly running (i.e. providing the service) and anything else except 7 if it failed
  • stop - stop the resource. Exit 0 when the resource is correctly stopped and anything else except 7 if it failed.
  • monitor - monitor the health of a resource. Exit 0 if the resource is running, 7 if it is stopped and anything else if it is failed. Note that the monitor script should test the state of the resource on the localhost.
  • meta-data - provide information about this resource as an XML snippet. Exit with 0


Note: OCF specs have strict definitions of what exit codes actions must return. We follow these specifications, and exiting with the wrong exit code will cause the cluster to behave in ways you will likely find puzzling and annoying. In particular, the cluster needs to distinguish a completely stopped resource from one which is in some erroneous and indeterminate state.

OCF Resource Agents should support the following action:

  • validate-all - validate the set of configuration parameters given in the environment, exit with 0 if parameters are valid, 2 if not valid, 6 if resource is not configured, 5 if the software the RA is supposed to run cannot be found.


Additional requirements (not part of the OCF specs) are placed on agents that will be used for cloned and multi-state resources.

  • promote - promote the local instance of a resource to the master/primary state. Should exit 0
  • demote - demote the local instance of a resource to the slave/secondary state. Should exit 0
  • notify - used by the cluster to send the agent pre- and post-notification events telling the resource what is about to take place or what has just taken place. Must exit 0


Optional actions, for usage details see the Pacemaker documentation

  • reload - reload the configuration (non-unique parameters only) of the resource instance without disrupting the service
  • migrate_from / migrate_to - perform live migration of a resource
  • recover - a variant of the start action, this should try to recover a resource locally (currently not used by Pacemaker).


Parameters
In addition to having more actions, your OCF resource agent is permitted to take parameters to tell it which instance of the resource it is being asked to control, and any simple configuration parameters it might need to tell it what to do or exactly how it should be done.

These are passed in to the script as environment variables, with the special prefix OCF_RESKEY_. So, if you need to be given a parameter which the user thinks of as ip it will be passed to the script as OCF_RESKEY_ip.
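
To tie the actions, exit codes and OCF_RESKEY_ parameters together, here is a minimal, heavily simplified sketch of an OCF resource agent. It manages nothing real (it just keeps a state file around), its single name parameter is purely illustrative, and a real agent would also emit a full meta-data XML description and typically source /usr/lib/heartbeat/ocf-shellfuncs:

#!/bin/sh
# Minimal illustrative OCF resource agent sketch.
# Parameters arrive as environment variables prefixed with OCF_RESKEY_.
statefile="/var/run/dummy-${OCF_RESKEY_name:-default}.state"

case "$1" in
  start)
    touch "$statefile" && exit 0        # 0 = resource is (now) running
    exit 1                              # any other code (except 7) = start failed
    ;;
  stop)
    rm -f "$statefile"
    exit 0                              # 0 = resource is (now) stopped
    ;;
  monitor)
    [ -f "$statefile" ] && exit 0       # 0 = running
    exit 7                              # 7 = cleanly stopped (OCF_NOT_RUNNING)
    ;;
  validate-all)
    [ -n "$OCF_RESKEY_name" ] && exit 0 # 0 = parameters are valid
    exit 2                              # 2 = invalid parameters
    ;;
  meta-data)
    echo '<?xml version="1.0"?><resource-agent name="dummy"/>'   # real agents emit full XML
    exit 0
    ;;
  *)
    exit 3                              # unimplemented action
    ;;
esac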

Debugging your OCF Resource Agent
The most common problems when implementing OCF Resource Agents are:

  • Not implementing the monitor operation at all
  • Not observing the correct exit status codes for start/stop/monitor actions
  • Starting a started resource returns an error (this violates the OCF spec)
  • Stopping a stopped resource returns an error (this violates the OCF spec)
  • returning 0 for a start/stop action when the resource is not yet completely started/stopped.
  • returning early from start/stop
  • Invalid XML output for the meta-data command


Source

authkeys

Configuring authkeys

The authkeys configuration file contains information for Heartbeat to use when authenticating cluster members. It must not be readable or writable by anyone other than root.

Two lines are required in the authkeys file:

A line which says which key to use in signing outgoing packets. One or more lines defining how incoming packets might be being signed.

 auth 1 
 1 sha1 PutYourSuperSecretKeyHere

In this sample file, the auth 1 directive says to use key number 1 for signing outgoing packets. The 1 sha1… line describes how to sign the packets. The fields in this line are as follows:

  • 1 - the key number associated with this line.
  • sha1 - the key signature method.
  • PutYourSuperSecretKeyHere - shared secret key[1] to use in signing packets. This key must be the same on all nodes except as noted below.


Normally, the key number would be 1, and the first line would say auth 1.

NOTE

We do not recommend that you use the crc method unless all your communication is across serial lines and crossover cables.

List of supported signature methods

We currently support these signature methods:

  • sha1 - SHA1 hash method (requires a key)
  • md5 - MD5 hash method (requires a key)
  • crc - CRC hash method - insecure - does not require a key

To get an absolutely up-to-date list of authentication methods supported, run this command

 ls /usr/lib*/heartbeat/plugins/HBauth/*.so
Changing Keys in the Cluster

To change keys without restarting heartbeat, the following procedure must be followed:

  1. Choose a new authentication method. I'll refer to the chosen authentication method as authmethod.
  2. Append a new number authmethod line to the authkeys file. The number on this line is fairly arbitrary, but it must be unique in the file and between 1 and 15 inclusive.
  3. Copy this authkeys file to each node in the cluster.
  4. On each node, issue a /etc/init.d/heartbeat reload command.
  5. Change the first line to say auth number to match the new number added in step 2 above.
  6. Copy this authkeys file to each node in the cluster.
  7. On each node, issue a /etc/init.d/heartbeat reload command.
  8. Wait for 500 heartbeat intervals.
  9. Remove the original authnumber authmethod line from the file (not the one added to the file in step 2 above).
  10. Copy this authkeys file to each node in the cluster.
  11. On each node, issue a /etc/init.d/heartbeat reload command.

This is a little odd, but it works…

Generating authkeys Automatically

Since the key in /etc/ha.d/authkeys file never has to be typed by a human being, it is not necessary for it to be in any way mnemonic or memorable. As a result a long, randomly generated key is a good choice.

The following line of shell script will generate such a key:

 cat <<-!AUTH >/etc/ha.d/authkeys
       # Automatically generated authkeys file
       auth 1
       1 sha1 `dd if=/dev/urandom count=4 2>/dev/null | md5sum | cut -c1-32`
 !AUTH

Or for SHA1:

 dd if=/dev/urandom count=4 2>/dev/null | openssl dgst -sha1

Source

/usr/lib/heartbeat/ResourceManager

A helper script used by Heartbeat in haresources mode to start and stop resource groups. The resource agent scripts themselves live in /etc/ha.d/resource.d/ and /etc/init.d/ (see the Heartbeat Resource Agents section above).

/etc/ha.d/

Location where cluster configuration is stored.

332.2 Advanced Pacemaker (weight: 3)

Candidates should have experience in advanced features of the Pacemaker cluster management set of technologies. This includes the use of OpenAIS and corosync. Key Knowledge Areas:

  • fencing
  • quorum
  • data integrity
  • integration with file systems


The following is a partial list of the used files, terms and utilities:

  • STONITHd
  • OCFS2
  • ldirectord
  • softdog
  • OpenAIS and corosync

fencing

Fencing is the process of locking resources away from a node whose status is uncertain.

There are a variety of fencing techniques available.

One can either fence nodes - using Node Fencing, or fence resources using Resource Fencing. Some types of resources are Self Fencing Resources, and some aren't damaged by simultaneous use, and don't require fencing at all.

Source

Node Fencing

Node fencing is the idea of fencing an entire node out of a cluster at once, independently of what kind of resources it might be running. In Pacemaker/CRM, Node Fencing is implemented by STONITH.

Source

Resource Fencing

Resource Fencing is fencing at resource granularity. It ensures exclusive access to a given resource. Common techniques for this include changing the zoning of the node from a SAN fiber channel switch (locking the node out of access to its disks) and things like SCSI reserve. Resource fencing is implemented differently depending on the type of resource and how access to it is granted, and hence can be denied.

Compared to Node Fencing, where we prevent a failed node from accessing shared resources entirely, it has finer granularity, alas not all resources support this functionality, or it might have some limitations that keep it from being useful in a particular situation.

Source

quorum

One way to solve the mutual fencing dilemma (where two subclusters that cannot communicate with each other each try to fence the other) is to somehow select only one of these two subclusters to carry on and fence the subclusters it can't communicate with. Of course, you have to solve it without communicating with the other subclusters - since that's the problem - you can't communicate with them. The idea of quorum represents the process of selecting a unique (or distinguished for the mathematically inclined) subcluster.

The most classic solution to selecting a single subcluster is a majority vote. If you choose a subcluster with more than half of the members in it, then (barring bugs) you know there can't be any other subclusters like this one. So, this looks like a simple and elegant solution to the problem. For many cases, that's true. But, what if your cluster only has two nodes in it? Now, if you have a single node fail, then you can't do anything - no one has quorum. If this is the case, then two machines have no advantage over a single machine - it's not much of an HA cluster. Since 2-node HA clusters are by far the most common size of HA cluster, it's kind of an important case to handle well. So, how are we going to get out of this problem?

Quorum Variants and Improvements
What you need in this case, is some kind of a 3rd party arbitrator to help select who can fence off the other nodes and allow you to bring up resources - safely. To solve this problem there is a variety of other methods available to act as this arbitrator - either software or hardware. Although there are several methods available to use as arbitrator, we'll only talk about one each of hardware and software methods: SCSI reserve and Quorum Daemon.

SCSI reserve: In hardware, we fall back on our friend SCSI reserve. In this usage, both nodes try and reserve a disk partition available to both of them, and the SCSI reserve mechanism ensures that only one of the two of them can succeed. Although I won't go into all the gory details here, SCSI reserve creates its own set of problems including it won't work reliably over geographic distances. A disk which one uses in this way with SCSI reserve to determine quorum is sometimes called a quorum disk. Some HA implementations (notably Microsoft's) require a quorum disk.

Quorum Daemon: In Linux-HA[7], we have implemented a quorum daemon - whose sole purpose in life is to arbitrate quorum disputes between cluster members. One could argue that for the purposes of quorum this is basically SCSI reserve implemented in software - and such an analogy is a reasonable one. However, since it is designed for only this purpose, it has a number of significant advantages over SCSI reserve - one of which is that it can conveniently and reliably operate over geographic distances, making it ideal for disaster recovery (DR) type situations. I'll cover the quorum daemon and why it's a good thing in more detail in a later posting. Both HP and Sun have similar implementations, although I have security concerns about them, particularly over long distances. Other than the security concerns (which might or might not concern you), both HP's and Sun's implementations are also good ideas.

Source

STONITHd

STONITH is a technique for NodeFencing, where the errant node which might have run amok with cluster resources is simply shot in the head. Normally, when an HA system declares a node as dead, it is merely speculating that it is dead. STONITH takes that speculation and makes it reality. “Make it so, Number One”.

Reluctantly setting whimsy and humor aside…

There are a few properties a STONITH plugin must have for it to be usable:

  1. It must never report false positives for reset. If a STONITH plugin reports that the node is down, it had better be down.
  2. It must support the RESET command (on and off are optional)
  3. When given a RESET or OFF command it must not return control to its caller until the node is no longer running. Waiting until it comes up again for RESET is optional.
  4. All commands should work in all circumstances:
    1. RESET when node is ON or OFF should succeed and bring the node up (or at least attempt to bring it up - it may not boot for other reasons).
    2. OFF when node is OFF should succeed.
    3. ON when node is ON should succeed.

If you don't follow these rules, Bad Things Will Happen - if not sooner, then later.

Source

Name

stonithd - Options available for all stonith resources
Synopsis

[stonith-timeout=time] [priority=integer] [pcmk_arg_map=string] [pcmk_host_map=string] 
[pcmk_host_list=string] [pcmk_host_check=string] [pcmk_list_cmd=string] [pcmk_status_cmd=string] 
[pcmk_monitor_cmd=string] [pcmk_reboot_action=string]

Description

This is a fake resource that details the instance attributes handled by stonithd.
Supported Parameters

stonith-timeout = time [60s]

    How long to wait for the STONITH action to complete.

    Overrides the stonith-timeout cluster property 
priority = integer [0]
    The priority of the stonith resource. The lower the number, the higher the priority. 
pcmk_arg_map = string []
    A mapping of host attributes to device arguments.

    Eg. uname:domain would tell the cluster to pass the machines name as the domain argument to the device. 
    Useful for devices that have non-standard interfaces 
pcmk_host_map = string []
    A mapping of host names to port numbers for devices that do not support names.

    Eg. node1:1,node2:3 would tell the cluster to use port 1 for node1 and port 3 for node2 
pcmk_host_list = string []
    A list of machines controlled by this device (Optional unless pcmk_host_check=static-list). 
pcmk_host_check = string [dynamic-list]
    How to determine which machines are controlled by the device.

    Allowed values: dynamic-list (query the device), static-list (check the pcmk_host_list attribute), none 
    (assume every device can fence every machine) 
pcmk_list_cmd = string [list]
    Which device operation to use for listing machines controlled by the device. 
pcmk_status_cmd = string [status]
    Which device operation to use for testing the state of a machine controlled by the device. 
pcmk_monitor_cmd = string [monitor]
    Which device operation to use for monitoring the health of the device. 
pcmk_reboot_action = string [reboot]
    Which device operation to use for rebooting a target. 
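
As an illustration of how some of these parameters end up in a cluster configuration, a crm shell sketch might look like the following. The external/ipmi plugin name, its device parameters and all addresses/credentials are assumptions; check which STONITH plugins are actually installed on your system:

primitive st-node1 stonith:external/ipmi \
  params hostname="node1" ipaddr="192.168.1.201" userid="admin" passwd="secret" \
         pcmk_host_list="node1" pcmk_host_check="static-list" \
  op monitor interval="60s"
property stonith-enabled="true"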

Source

OCFS2

Setting up OCFS2 (Oracle Cluster File System version 2) in Pacemaker requires configuring the Pacemaker DLM, the O2CB lock manager for OCFS2, and an OCFS2 filesystem itself.

Prerequisites
  • OCFS2 with Pacemaker integration is supported on Debian (squeeze-backports and up) and Ubuntu (10.04 LTS and up). You'll need the dlm-pcmk, ocfs2-tools, and ocfs2-tools-pacemaker packages.
  • Fencing is imperative. Get a proper fencing/STONITH configuration set up and test it thoroughly.
  • Running OCFS2/Pacemaker integration requires that you start your corosync processes with the following insanely-named environment variable:

COROSYNC_DEFAULT_CONFIG_IFACE="openaisserviceenableexperimental:corosync_parser"

    And we're actually not kidding on that one!
    You'll have to export it from /etc/default/corosync (which the corosync init script sources).
Pacemaker configuration

The Pacemaker configuration, shown here in crm shell syntax, normally puts all the required resources into one cloned group. Have a look at this configuration snippet:

primitive p_dlm_controld ocf:pacemaker:controld \
  op start interval="0" timeout="90" \
  op stop interval="0" timeout="100" \
  op monitor interval="10"
primitive p_o2cb ocf:pacemaker:o2cb \
  op start interval="0" timeout="90" \
  op stop interval="0" timeout="100" \
  op monitor interval="10"
primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
  params device="<your device path>" \
    directory="<your mount point>" \
    fstype="ocfs2" \
  meta target-role=Stopped \
  op monitor interval="10"
group g_ocfs2 p_dlm_controld p_o2cb p_fs_ocfs2
clone cl_ocfs2 g_ocfs2 \
  meta interleave="true"

  • ocf:pacemaker:controld — Pacemaker’s interface to the DLM;
  • ocf:pacemaker:o2cb — Pacemaker’s interface to OCFS2 cluster management;
  • ocf:heartbeat:Filesystem — the generic filesystem management resource agent, which supports cluster file systems when configured as a Pacemaker clone.

Pacemaker manages OCFS2 filesystems using the conventional ocf:heartbeat:Filesystem resource agent, albeit in clone mode, as shown in the configuration snippet above.

Why keep the filesystem stopped?

Because you probably either don't have a configured OCFS2 filesystem on your device yet, or you ran mkfs.ocfs2 when the Pacemaker stack wasn't running. In either of those two cases, mount.ocfs2 will refuse to mount the filesystem.

Thus, fire up your DLM and the o2cb process like the above configuration does, and then:

  • If you haven't got a filesystem yet, run mkfs.ocfs2 on your device, or
  • If you do already have one, run tunefs.ocfs2 --update-cluster-stack <device>.

Then when that's done, run crm resource start p_fs_ocfs2 and your filesystem should happily mount on all nodes.
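
Put together, a hedged sketch of that sequence (the device path and the number of node slots are illustrative):

# create a fresh OCFS2 filesystem with 2 node slots, only if none exists yet
mkfs.ocfs2 -N 2 /dev/sdb1
# or, for an existing filesystem created outside the running Pacemaker stack
tunefs.ocfs2 --update-cluster-stack /dev/sdb1
# then let Pacemaker mount it on all nodes
crm resource start p_fs_ocfs2
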
Source

ldirectord

This should be just enough information to get you up and running with Pacemaker managing ldirectord and a virtual IP address. It does not cover configuring the actual service you want load-balanced.

Configure ldirectord

Create /etc/ha.d/ldirectord.cf on all nodes. The configuration for a hypothetical TCP service running on the virtual IP 192.168.1.100 port 8888, with real servers 192.168.1.10 and 192.168.1.20, might look something like this:

checktimeout=3
checkinterval=5
autoreload=yes
logfile="/var/log/ldirectord.log"
quiescent=no
virtual=192.168.1.100:8888
	real=192.168.1.10:8888 gate
	real=192.168.1.20:8888 gate
	scheduler=wrr
	protocol=tcp
	checktype=connect
	checkport=8888

For more detail on the above, refer to the ldirectord man page.

Configure Pacemaker

Using the crm shell:

 primitive ip ocf:heartbeat:IPaddr2 \
   op monitor interval="60" timeout="20" \
   params ip="192.168.1.100" lvs_support="true"
 primitive ip-lo ocf:heartbeat:IPaddr2 \
   op monitor interval="60" timeout="20" \
   params ip="192.168.1.100" nic="lo" cidr_netmask="32"
 primitive lvs ocf:heartbeat:ldirectord \
   op monitor interval="20" timeout="10"
 group ip-lvs ip lvs
 clone c-ip-lo ip-lo meta interleave="true"
 colocation lo-never-lvs -inf: c-ip-lo ip-lvs

This gives you the virtual IP address and ldirectord running together in a group (ip-lvs) on one node, and the same virtual IP address assigned to the loopback address on all other nodes. This is necessary to make the routing work correctly.
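
Once everything is up, you can inspect the resulting LVS table on the node currently holding the ip-lvs group; ipvsadm comes with the ipvsadm package and the exact output depends on your real servers:

ipvsadm -L -n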

End Result
============
Last updated: Tue Nov 30 18:25:35 2010
Stack: openais
Current DC: node-0 - partition with quorum
Version: 1.1.4-fe6a4a99ffe5275ddbdc547e43f2eabd7cc56095
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ node-0 node-1 ]

Full list of resources:

 Resource Group: ip-lvs
     ip (ocf::heartbeat:IPaddr2):       Started node-1
     lvs        (ocf::heartbeat:ldirectord):    Started node-1
 Clone Set: c-ip-lo [ip-lo]
     Started: [ node-0 ]
     Stopped: [ ip-lo:1 ]

Source

softdog

It is strongly suggested that you set up your Linux system to use a watchdog. Use the watchdog driver which best fits your hardware, e.g. hpwdt for HP servers. A list of available watchdog drivers can be found in /usr/src/linux/drivers/watchdog/.

If no watchdog matches your hardware, then use softdog.

You can do this by adding the following line to your boot sequence:

 modprobe softdog

Source

OpenAIS and corosync

The Corosync Cluster Engine is a group communication system with additional features for implementing high availability within applications.

The project provides four C programming interface features:

  • closed process group communication model with virtual synchrony guarantees for creating replicated state machines.
  • simple availability manager that restarts the application process when it has failed.
  • configuration and statistics in-memory database that provide the ability to set, retrieve, and receive change notifications of information.
  • quorum system that notifies applications when quorum is achieved or lost.

The software is designed to operate on UDP/IP and InfiniBand networks natively.
Source

332.3 Red Hat Cluster Suite (weight: 3)

Candidates should have experience in the installation, configuration, maintenance and troubleshooting of the Red Hat Cluster Suite cluster management set of technologies. Key Knowledge Areas:

  • Essential cluster configuration
  • resource agents


The following is a partial list of the used files, terms and utilities:

  • ccs
  • OpenAIS
  • rgmanager
  • /etc/ais/
  • /etc/corosync/

Essential cluster configuration

Quorum disk configuration

Now we have to populate quorum disk space with the right information. To perform this type:

# mkqdisk -c /dev/vg_qdisk/lv_qdisk -l <your_cluster_name>

Note that it is not required to use your cluster name as the quorum disk label, but it is recommended.

You also need to create a heuristic script to help qdisk act as a tie-breaker. Create /usr/share/cluster/check_eth_link.sh:

#!/bin/sh
# Network link status checker

ethtool $1 | grep -q "Link detected.*yes"
exit $?

Now activate the quorum disk:

# service qdiskd start
# chkconfig qdiskd on
Configuring cluster using luci

In order to use luci web interface you need to activate service ricci on all nodes and luci on one node only:

(on all nodes)
# chkconfig ricci on
# service ricci start
(choose only a node)
# chkconfig luci on
# luci_admin init
# service luci restart

Please note that luci_admin init must be executed only the first time and before starting the luci service, otherwise luci will be unusable. Now connect to luci: https://node_with_luci.mydomain.com:8084. Here you can create a cluster, add nodes, create services, failover domains etc.

See Recommended cluster configuration to learn the right settings for the cluster.

Configuring cluster editing the XML

You can also manually configure a cluster editing its main config file /etc/cluster/cluster.conf. To create the config skeleton use:

# ccs_tool create

The just-created config file is not yet usable; you should configure cluster settings, add nodes, create services, failover domains etc. When the config file is complete, copy it to all nodes and start the cluster in this way:

(on all nodes)
# chkconfig cman on
# chkconfig rgmanager on
# service cman start
# service rgmanager start
Recommended cluster configuration

Here is attached a /etc/cluster/cluster.conf file of a fully configured cluster. For commenting purposes, the file is split into several consecutive parts:

<?xml version="1.0"?>
<cluster alias="jcaps_prd" config_version="26" name="jcaps_prd">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="h-lancelot.yourdomain.com" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="h-artu.yourdomain.com" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="h-morgana.yourdomain.com" nodeid="3" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="4"/>
        <fencedevices/>

This is the first part of the XML cluster config file.

  • The first line gives the cluster name and the config_version. Each time you modify the XML you must increment config_version by 1 before updating the config on all nodes.
  • The fence daemon line is the default one.
  • The clusternode stanza contains the nodes of the cluster. Note that the name property contains the FQDN of the node. This name determines the network interface used for cluster communication. In this example we don’t use the main hostname but the hostname associated with the interface we chose to use as the cluster communication channel.
  • Note also that the line <fence/> is required. Note that here we do not use any fence device. Due to the nature of HA-LVM the access to the data should be exclusive by one node at a time.
  • The cman expected_votes is 4 because each of the three nodes contributes 1 vote and the quorum disk contributes 1 vote.
 <rm log_facility="local4" log_level="5">
                <failoverdomains>
                        <failoverdomain name="jcaps_prd" nofailback="0" ordered="0" restricted="1">
                                <failoverdomainnode name="h-lancelot.yourdomain.com" priority="1"/>
                                <failoverdomainnode name="h-artu.yourdomain.com" priority="1"/>
                                <failoverdomainnode name="h-morgana.yourdomain.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources/>

This section begins resource manager configuration (<rm …>).

  • Resource manager section can be configured for logging. Rm logs to syslog, here we configured the log_facility and the logging level. The facility we specified allows us to log to a separate file (see logging configuration)
  • We also configured a failover domain containing all cluster nodes. We want a service to be able to switch to any cluster node, but you can also configure different behaviours here.
        <service autostart="1" domain="jcaps_prd" exclusive="0" name="subversion" recovery="relocate">
                <ip address="10.200.56.60" monitor_link="1"/>
                <lvm name="vg_subversion_apps" vg_name="vg_subversion_apps"/>
                <lvm name="vg_subversion_data" vg_name="vg_subversion_data"/>
                <fs device="/dev/vg_subversion_apps/lv_apps" force_fsck="1" force_unmount="1" fsid="61039" 
                            fstype="ext3" mountpoint="/apps/subversion" name="svn_apps" self_fence="0">
                    <fs device="/dev/vg_subversion_data/lv_repositories" force_fsck="1" force_unmount="1" 
                            fsid="3193" fstype="ext3" mountpoint="/apps/subversion/repositories" 
                            name="svn_repositories" self_fence="0"/>
                </fs>
                <script file="/my_cluster_scripts/subversion/subversion.sh" name="subversion"/>
        </service>

This section contains the services in the cluster (like HP ServiceGuard packages)

  • We chose the failover domain (in this case our failover domain contains all nodes, so the service can run on all of them).
  • We add an IP address resource (always use monitor_link!).
  • We also use an HA-LVM resource (<lvm …>). Every VG specified here is tagged with the node name when it is activated. This means it can only be activated on the node where the service is running (and only on that node!). Note: if you do not specify any LV, all the LVs inside the VG will be activated!
  • Next there are also <fs …> tags for mounting filesystem resources. It is recommended to use force_unmount and force_fsck.
  • You can also specify a custom script for starting applications/services and so on. Please note that the script must be LSB compliant, i.e. it must handle start|stop|status (a minimal sketch is shown after this section). Note also that the default cluster behaviour is to run the script with the status parameter every 30 seconds. If the status check does not return 0, the service will be marked as failed (and probably restarted/relocated).
</rm>

This section closes the resource manager configuration (closes XML tag).
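
As referenced above, the custom <script> resource must behave like an LSB init script. A minimal sketch of what /my_cluster_scripts/subversion/subversion.sh could look like (the svnserve binary path is an assumption, not taken from a real setup):

#!/bin/bash
# Minimal LSB-style wrapper used by rgmanager as a <script> resource.
# rgmanager calls it with start, stop and (by default every 30 seconds) status.
case "$1" in
  start)
    /apps/subversion/bin/svnserve -d -r /apps/subversion/repositories
    ;;
  stop)
    killall svnserve
    ;;
  status)
    # exit 0 only while the daemon is running, otherwise the service is marked failed
    pidof svnserve > /dev/null
    ;;
  *)
    echo "Usage: $0 {start|stop|status}"
    exit 1
    ;;
esac
exit $?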

        <totem consensus="4800" join="60" token="20000" token_retransmits_before_loss_const="20"/>

This is a crucial part of cluster configuration. Here you specify the failure detection time of cluster.

  • Red Hat recommends the CMAN membership (token) timeout value to be at least two times the qdiskd timeout value. Here the value is 20 seconds.
        <quorumd interval="2" label="jcaps_prd_qdisk" min_score="2" tko="5" votes="1">
                <heuristic interval="2" program="/usr/share/cluster/check_eth_link.sh bond0" score="3"/>
        </quorumd>

Here we configure the quorum disk to be used by the cluster.

  • We chose a quorum timeout value of 10 seconds (quorumd interval * quorumd tko), which is half of the token timeout (20 seconds).
  • We also add a heuristic script to determine network health. This helps qdisk take a decision when a split-brain occurs.
</cluster>

This concludes the configuration file, closing the XML tags that are still open.

Source

resource agents

Use of the script resource agent on an LSB-compliant init script does not require additional review.

Each resource agent specifies the amount of time between periodic status checks. Each resource utilizes these timeout values unless explicitly overridden in the cluster.conf file using the special <action> tag:

<action name="status" depth="*" interval="10" />

This tag is a special child of the resource itself in the cluster.conf file. For example, if you had a file system resource for which you wanted to override the status check interval you could specify the file system resource in the cluster.conf file as follows:

  <fs name="test" device="/dev/sdb3">
    <action name="status" depth="*" interval="10" />
    <nfsexport...>
    </nfsexport>
  </fs>

Some agents provide multiple “depths” of checking. For example, a normal file system status check (depth 0) checks whether the file system is mounted in the correct place. A more intensive check is depth 10, which checks whether you can read a file from the file system. A status check of depth 20 checks whether you can write to the file system. In the example given here, the depth is set to *, which indicates that these values should be used for all depths. The result is that the test file system is checked at the highest-defined depth provided by the resource-agent (in this case, 20) every 10 seconds.

Source

ccs

The Cluster Configuration System (CCS) manages the cluster configuration and provides configuration information to other cluster components in a Red Hat cluster. CCS runs in each cluster node and makes sure that the cluster configuration file in each cluster node is up to date. For example, if a cluster system administrator updates the configuration file in Node A, CCS propagates the update from Node A to the other nodes in the cluster. Source

  • ccs_tool: ccs_tool is part of the Cluster Configuration System (CCS). It is used to make online updates of CCS configuration files. Additionally, it can be used to upgrade cluster configuration files from CCS archives created with GFS 6.0 (and earlier) to the XML configuration format used with this release of Red Hat Cluster Suite.
  • ccs_test: A diagnostic and testing command that is used to retrieve information from configuration files through ccsd.
  • ccsd: The CCS daemon that runs on all cluster nodes and provides configuration file data to cluster software.
  • cluster.conf: The cluster configuration file. The full path is /etc/cluster/cluster.conf.

Source

OpenAIS

OpenAIS provides cluster communications using the Totem protocol.
OpenAIS is the heart of the cluster. All other components operate through it, and no cluster component can work without it. Further, it is shared between both Pacemaker and RHCS clusters.

In Red Hat clusters, openais is configured via the central cluster.conf file. In Pacemaker clusters, it is configured directly in openais.conf. As we will be building an RHCS cluster, we will only use cluster.conf. That said, (almost?) all openais.conf options are available in cluster.conf. This is important to note, as you will see references to both configuration files when searching the Internet.

Source

rgmanager

rgmanager checks the status of individual resources, not whole services. (This is a change from clumanager on Red Hat Enterprise Linux 3, which periodically checked the status of the whole service.) Every 10 seconds, rgmanager scans the resource tree, looking for resources that have passed their “status check” interval.

/etc/ais/

Location where OpenAIS configuration is stored.

/etc/corosync/

Location where corosync configuration is stored.

Command Line Administration Tools

In addition to Conga and the system-config-cluster Cluster Administration GUI, command line tools are available for administering the cluster infrastructure and the high-availability service management components. The command line tools are used by the Cluster Administration GUI and init scripts supplied by Red Hat.

Command line tools, the components they are used with, and their purpose:

  • ccs_tool — Cluster Configuration System Tool. Used with: Cluster Infrastructure. ccs_tool is a program for making online updates to the cluster configuration file. It provides the capability to create and modify cluster infrastructure components (for example, creating a cluster, adding and removing a node). For more information about this tool, refer to the ccs_tool(8) man page.
  • cman_tool — Cluster Management Tool. Used with: Cluster Infrastructure. cman_tool is a program that manages the CMAN cluster manager. It provides the capability to join a cluster, leave a cluster, kill a node, or change the expected quorum votes of a node in a cluster. For more information about this tool, refer to the cman_tool(8) man page.
  • fence_tool — Fence Tool. Used with: Cluster Infrastructure. fence_tool is a program used to join or leave the default fence domain. Specifically, it starts the fence daemon (fenced) to join the domain and kills fenced to leave the domain. For more information about this tool, refer to the fence_tool(8) man page.
  • clustat — Cluster Status Utility. Used with: High-availability Service Management Components. The clustat command displays the status of the cluster. It shows membership information, quorum view, and the state of all configured user services. For more information about this tool, refer to the clustat(8) man page.
  • clusvcadm — Cluster User Service Administration Utility. Used with: High-availability Service Management Components. The clusvcadm command allows you to enable, disable, relocate, and restart high-availability services in a cluster. For more information about this tool, refer to the clusvcadm(8) man page.

Source

[root@example-01 ~]# clustat
Cluster Status for mycluster @ Wed Nov 17 05:40:15 2010
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 node-03.example.com                         3 Online, rgmanager
 node-02.example.com                         2 Online, rgmanager
 node-01.example.com                         1 Online, Local, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----           
 service:example_apache         node-01.example.com            started       
 service:example_apache2        (none)                         disabled
Service statuses:

  • Started: The service resources are configured and available on the cluster system that owns the service.
  • Recovering: The service is pending start on another node.
  • Disabled: The service has been disabled, and does not have an assigned owner. A disabled service is never restarted automatically by the cluster.
  • Stopped: In the stopped state, the service will be evaluated for starting after the next service or node transition. This is a temporary state. You may disable or enable the service from this state.
  • Failed: The service is presumed dead. A service is placed into this state whenever a resource's stop operation fails. After a service is placed into this state, you must verify that there are no resources allocated (mounted file systems, for example) prior to issuing a disable request. The only operation that can take place when a service has entered this state is disable.
  • Uninitialized: This state can appear in certain cases during startup and running clustat -f.
Service operations:

  • Enable: Start the service, optionally on a preferred target and optionally according to failover domain rules. In the absence of either a preferred target or failover domain rules, the local host where clusvcadm is run will start the service. If the original start fails, the service behaves as though a relocate operation was requested (refer to Relocate in this list). If the operation succeeds, the service is placed in the started state. Command syntax: clusvcadm -e <service_name> or clusvcadm -e <service_name> -m <member> (using the -m option specifies the preferred target member on which to start the service).
  • Disable: Stop the service and place it into the disabled state. This is the only permissible operation when a service is in the failed state. Command syntax: clusvcadm -d <service_name>
  • Relocate: Move the service to another node. Optionally, you may specify a preferred node to receive the service, but the inability of the service to run on that host (for example, if the service fails to start or the host is offline) does not prevent relocation, and another node is chosen. rgmanager attempts to start the service on every permissible node in the cluster. If no permissible target node in the cluster successfully starts the service, the relocation fails and the service is attempted to be restarted on the original owner. If the original owner cannot restart the service, the service is placed in the stopped state. Command syntax: clusvcadm -r <service_name> or clusvcadm -r <service_name> -m <member> (see the example after this list).
  • Stop: Stop the service and place it into the stopped state. Command syntax: clusvcadm -s <service_name>
  • Freeze: Freeze a service on the node where it is currently running. This prevents status checks of the service as well as failover in the event the node fails or rgmanager is stopped. This can be used to suspend a service to allow maintenance of underlying resources. Refer to the section called “Considerations for Using the Freeze and Unfreeze Operations” for important information about using the freeze and unfreeze operations. Command syntax: clusvcadm -Z <service_name>
  • Unfreeze: Unfreeze takes a service out of the freeze state. This re-enables status checks. Refer to the section called “Considerations for Using the Freeze and Unfreeze Operations” for important information about using the freeze and unfreeze operations. Command syntax: clusvcadm -U <service_name>
  • Migrate: Migrate a virtual machine to another node. You must specify a target node. Depending on the failure, a failure to migrate may result with the virtual machine in the failed state or in the started state on the original owner. Command syntax: clusvcadm -M <service_name> -m <member>
  • Restart: Restart a service on the node where it is currently running. Command syntax: clusvcadm -R <service_name>
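
For example, using the service and node names from the clustat output above, relocating a service and checking the result could look like this (a hedged example):

# clusvcadm -r example_apache -m node-02.example.com
# clustat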

332.4 Advanced Red Hat Cluster Suite (weight: 3)

Candidates should have experience in advanced features of the Red Hat Cluster Suite cluster management set of technologies. This includes the use and integration with LVS and GFS. Key Knowledge Areas:

  • fencing
  • quorum
  • data integrity
  • integration with file systems
  • integration with LVS


The following is a partial list of the used files, terms and utilities:

  • qdiskd
  • /etc/lvs.cf
  • Piranha
  • GFS
  • Conga

fencing

Red Hat Cluster Suite provides a variety of fencing methods:

  • Power fencing — A fencing method that uses a power controller to power off an inoperable node. Two types of power fencing are available: external and integrated. External power fencing powers off a node via a power controller (for example, an APC or WTI power controller) that is external to the node. Integrated power fencing powers off a node via a power controller (for example, IBM BladeCenter, PAP, DRAC/MC, HP iLO, IPMI, or IBM RSA II) that is integrated with the node. A configuration sketch follows this list.
  • SCSI3 Persistent Reservation Fencing — A fencing method that uses SCSI3 persistent reservations to disallow access to shared storage. When fencing a node with this fencing method, the node's access to storage is revoked by removing its registrations from the shared storage.
  • Fibre Channel switch fencing — A fencing method that disables the Fibre Channel port that connects storage to an inoperable node.
  • GNBD fencing — A fencing method that disables an inoperable node's access to a GNBD server.

Source
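
For illustration, here is a hedged sketch of how integrated power fencing (IPMI) would be wired into cluster.conf; the device name, address and credentials are invented, while the node name is taken from the earlier example:

<clusternode name="h-lancelot.yourdomain.com" nodeid="1" votes="1">
        <fence>
                <method name="1">
                        <device name="ipmi-lancelot"/>
                </method>
        </fence>
</clusternode>
...
<fencedevices>
        <fencedevice agent="fence_ipmilan" name="ipmi-lancelot"
                     ipaddr="10.200.56.201" login="admin" passwd="secret"/>
</fencedevices>

Each node would get its own <fencedevice> entry pointing at its management controller.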

quorum

qdisk Configuration
A successful DM-Multipath configuration should produce a set of identifiable inodes in the /dev/mapper directory. The /dev/mapper/qdisk inode will need to be initialized and enabled as a service. This is one of the first pieces of information you need for the /etc/cluster/cluster.conf file.

$ mkqdisk -l HA585 -c /dev/mapper/qdisk

By convention, the label is the same name as the cluster; in this case, HA585. The section of the cluster.conf file looks like the following.

<?xml version="1.0"?>
<cluster config_version="1" name="HA585">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <quorumd interval="7" device="/dev/mapper/qdisk" tko="9" votes="3" log_level="5"/>

Source
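
To verify that the quorum disk created with mkqdisk above was labelled correctly, you can list the quorum-disk labels it finds (a hedged example; output not shown):

# mkqdisk -L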

qdiskd

Quorum Disk is a disk-based quorum daemon, qdiskd, that provides supplemental heuristics to determine node fitness. With heuristics you can determine factors that are important to the operation of the node in the event of a network partition. For example, in a four-node cluster with a 3:1 split, ordinarily, the three nodes automatically “win” because of the three-to-one majority. Under those circumstances, the one node is fenced. With qdiskd however, you can set up heuristics that allow the one node to win based on access to a critical resource (for example, a critical network path). If your cluster requires additional methods of determining node health, then you should configure qdiskd to meet those needs.

  • Maximum nodes: 16
  • Minimum quorum disk size: 10 MB
  • Default timeout: 10 sec

/etc/lvs.cf

 lvs.cf - configuration file for lvs
Description

This file contains the configuration information for piranha and is normally located in 
/etc/sysconfig/ha/lvs.cf. 
lvs.cf is read and updated from the piranha web configuration tool, which uses lvs(8) to actually 
work with the file.

Global Options
Global settings affect all aspects of the cluster, including virtual servers and real servers.

service = [lvs|fos]
    Indicates which set of defined services are to be used. Virtual Servers and Failover Services 
    are mutually exclusive; although they may both be defined in the same config file, they cannot 
    both be running simultaneously. 
    This option specifies which section is to be used. 
primary = a.b.c.d
    This is the IP (or hostname) of the primary LVS machine. 
primary_private = a.b.c.d
    Indicates the IP address of an alternative network device for private heartbeating. It is not 
    necessary to fill out this field for piranha to work as it simply provides an alternative method 
    of checking an IP service is running. 
backup = a.b.c.d
    This is the IP (or hostname) of the backup (or failover) LVS machine. 
backup_private = a.b.c.d
    This is akin to primary_private but refers to the alternative IP device on the backup node. 
backup_active = [0|1]
    This dictates if the backup server option is active or inactive. This option must be set if the 
    backup server is to function in a failover manner. 
heartbeat = [0|1]
    Use heartbeat between the two LVS nodes. 
keepalive = n
    Number of seconds between heartbeats. 
deadtime = n
    Length of time before a node is declared dead and IP takeover occurs. 
reservation_conflict_action = [nothing|preempt]
    This option dictates what action should be taken when a scsi reservation conflict occurs during 
    failover and the disk is found to be unexpectedly locked. You should think carefully about this 
    option as your setup may or may not have a scsi controller setup to reset the scsi bus on power 
    on or warm reboot. 
debug = NONE
    Ignore this option. Eventually it will become a means to dictate how much and what type of 
    information about the state of the cluster is written to file. 
rsh_command =
    The command family used to sync file systems and config files. Allowable options are either 
    rsh (default) or ssh. The appropriate .rhosts (or .ssh/authorized_keys) entries must be on all 
    nodes so that connections can be made non-interactively.

    Sync'ing of specified config files and directories will occur when lvs receives a SIGUSR1. 
    lvs.cf(5) is automatically synced between the LVS nodes anytime it is written to. 
network = [direct|nat|tunnel]
    The lvs virtual server can reroute all of its incoming traffic via one of three methods; 
    NAT (Network Address Translation), Direct Routing, or Tunneling (IP Encapsulation). 
nat_router = a.b.c.d dev:n
    If NAT routing is selected, this specifies the IP address and device of the routing interface. 
nat_nmask = a.b.c.d
    Optional. The subnet mask to apply to nat_router.

Per-virtual Server Section
A per-virtual server section starts with

virtual server-name {
}

where the string is a unique server identifier. This doesn't have to match up to a FQDN.

The following items are required for each virtual server entry in the config file.

address = a.b.c.d
    This is the address to be used for the virtual server. 
sorry_server = a.b.c.d
    This is the address to be used for the 'sorry' server. If specified, requests for this 
    virtual server will be redirected to this IP address in the event that no real servers are 
    available to handle the request. 
vip_nmask = a.b.c.d
    Optional. This is the subnet mask to apply to the address of the virtual server. 
active = [0|1]
    This flag is used to indicate whether or not this particular virtual server is active. 
    If it is marked inactive, then all real servers being routed to by it will by default 
    become inactive as well.

    The following items are all optional entries (note the default values for many). 
load_monitor = [uptime|rup|ruptime|none]
    This specifies the method that the LVS can acquire CPU load information from the real 
    servers. This load information is used to adjust the weighting factor for each server entry 
    in the LVS routing table. Each method requires slightly different configurations to be present
    on the real servers and on the LVS nodes. 
    The default method is uptime. Specifying "none" causes the service monitor to skip load tests 
    (required for most non-linux systems). 
timeout =
    This is the amount of time allowed before a presumed dead real server is removed from the 
    LVS routing table. 
    Default is 10 seconds. 
reentry =
    This is the amount of time that a previously dead real server must be alive before the LVS will 
    re-enter it into the routing table. The purpose of this delay is to prevent troubled machines from 
    causing a "ping-pong" effect. The default is 180 seconds. 
port = xx
    This is the port that the virtual server is instructed to listen to and redirect network requests 
    from. The default is port 80 (http). 
send = xxx
    If present, the specified text ("xxx") will be sent to the port of the virtual server as part 
    of the test for whether the service is operational. The text is limited to 255 characters maximum. 
    Characters must be printable/quotable, and may contain "\n, \r, \\, or \'". Note that if both 
    "send" and "expect" are specified, 
    the send will always execute before the read for the expect is attempted. 
send_program = path %h
    For more advanced service verification, you can use this directive to specify the path to 
    a service-checking script. This functionality is especially helpful for services that require 
    dynamically changing data, such as HTTPS or SSL. To use this functionality, you must write a 
    script that returns a textual response (that will be matched against 'expect' directive), set 
    it to be executable, and specify the path to it. To ensure that each 
    server in the real server pool is checked, use the special token %h after the path to the script. 
    This token is replaced with each real server's IP address as the script is called by the nanny 
    daemon. If 'send_program' is used, then the 'send' is ignored. 
expect = xxx
    If present, the specified text ("xxx") will be expected as a response from the port on the 
    virtual server as part of the test for whether the service is operational. The text is limited 
    to 255 characters maximum. Characters must be printable/quotable, and may contain "\n, \r, \\, or \'". 
    Note that if both "send" and 
    "expect" are specified, the send will always execute before the read for the expect is attempted. 
    If you wrote your own service-checking script, enter the response you told it to send if it was 
    successful. 
use_regex = [0|1]
    If enabled, the expect string will be interpreted as a regular expression. 
persistent =
    The number of seconds that a connection between this virtual server and a real server will persist. 
    If a request is received from a client within this number of seconds, it will be assigned to 
    the same real server that processed a prior request. If this parameter is missing or set to zero, 
    connections with this virtual server are not persistent. 
pmask =
    The network mask to apply to persistence if enabled. Default is 255.255.255.255. 
scheduler = [rr|lc|wlc|wrr]
    This is the key part of the LVS router. These methods of scheduling how incoming requests are 
    routed are built as loadable kernel modules: Round Robin (rr), least-connections (lc), Weighted
    Least Connections (wlc, the default) and Weighted Round Robin (wrr).

Real Server Sections
A per-real server section starts with

server servername {
}

where the string is a unique server identifier. This doesn't have to match up to any real FQDN.

The following items are required for each real server entry in the config file.

address = a.b.c.d
    This is the actual IP address being used by the real server. In the cases of NAT type routing, 
    it is generally one of the reserved, private IPs. 
active = [0|1]
    This flag is used to indicate whether or not this particular real server is active.

    The following item is optional. 
weight =
    This option enforces a skew effect by enabling more loading on a particular server. The weights of 
    all real servers influence the scheduling algorithm and a higher weight will load a particular server 
    down with more redirects. The default value is 1. 
An example real server entry might look like:

          server 1 {
                    address = 192.168.10.2
                    active = 1
                    weight = 1
          }

Per-failover Service Section
    A per-failover-service section starts with
        failover service-name {
        }

    where the service-name is a unique identifier.

    The following items are required for each failover service entry in the config file.

    address = a.b.c.d dev:x
        This is the address and device interface to be used for the virtual service. 
    vip_nmask = a.b.c.d
        Optional. The netmask to apply to the service address. 
    active = [0|1]
        This flag is used to indicate whether or not this particular virtual server is active. 
        If it is marked inactive, then all real servers being routed to by it will by default become 
        inactive as well.

        The following items are all optional entries (note the default values for many). 
    timeout =
        This is the amount of time allowed before a service is presumed dead and will cause a failover. 
    reentry =
        This is the amount of time that a previously dead partner system must be alive before it will 
        be a candidate for possible failover. The purpose of this delay is to prevent troubled machines 
        from causing a "ping-pong" effect. 
        The default is 180 seconds. 
    port = xx
        This is the port that the failover service is instructed to test. The default is port 80 (http). 
    send = xxx
        If present, the specified text ("xxx") will be sent to the port of the virtual server as part 
        of the test for whether the service is operational. The text is limited to 255 characters maximum. 
        Characters must be printable/quotable, and may contain "\n, \r, \\, or \'". Note that if both 
        "send" and "expect" are specified, the send will always execute before the read for the expect is 
        attempted. 
    expect = xxx
        If present, the specified text ("xxx") will be expected as a response from the port on the virtual 
        server as part of the test for whether the service is operational. The text is limited to 255 
        characters maximum. Characters must be printable/quotable, and may contain "\n, \r, \\, or \'". 
        Expect can also be a single '*' character to indicate any response characters are allowed. 
        Note that if both "send" and "expect" are specified, the send will always execute before the read 
        for the expect is attempted. 
    start_cmd = xxx
        Mandatory; specifies the startup command/script to execute to start the failover service. 
        Options can be specified, but must be separated by a single space. 
    stop_cmd = xxx
        Mandatory; specifies the shutdown command/script to execute to stop the failover service. 
        Options can be specified, but must be separated by a single space.

/etc/lvs.cf is where the cluster definition is held. A typical cluster configuration will look as follows. Please pay attention to the comment fields which describe the meaning of various items.

    # This file is generated by the piranha GUI.  Do not hand edit.  All
    # modifications created by any means other than the use of piranha will
    # not be supported.
    #
    # This file has 3 sections. Section 1 is always required, then EITHER
    # section 2 or section 3 is to be used.
    #       1. LVS node/router definitions needed by the LVS system.
    #       2. Virtual server definitions, including lists of real servers.
    #       3. Failover service definitions (for any services running on the
    #          LVS primary or backup node instead of on virtual servers).
    #          NOTICE: Failover services are an upcoming feature of piranha and
    #          are not provided in this release.
    
    
    
    
    # SECTION 1 - GLOBAL SETTINGS
    #
    # The LVS is a single point of failure (which is bad).  To protect against
    # this machine breaking things, we should have a redundant/backup LVS node.
    #       service:        Either "lvs" for Virtual Servers  or "fos" for
    #                       Failover Services (defaults to "lvs" if missing)
    #       primary:        The IP of the main LVS node/router
    #       backup:         The IP of the backup LVS node/router
    #       backup_active:  Set this to 1 if using a backup LVS node/router
    #       heartbeat:      Use heartbeat between LVS nodes
    #       keepalive:      Time between heartbeats between LVS machines.
    #       deadtime:       Time w/ out response before node failure is assumed.
    
    
    service = lvs
    primary = 207.175.44.150
    backup = 207.175.44.196
    backup_active = 1
    heartbeat = 1
    heartbeat_port = 1050
    keepalive = 6
    deadtime = 18
    
    # All nodes must have either appropriate .rhost files set up for all nodes in
    # the cluster, or use some equivalent mechanism. Default is rsh, but you
    # may set an alternate command (which must be equivalent to rsh) here (ssh
    # is the most common).
    
    rsh_command = rsh
    
    # lvs server configuration environments: NAT, Direct Routing, and Tunneling.
    # NAT (Network Address Translation) is the simplest to set up and works well
    # in most situations.
    #
    # network = direct
    # network = tunnel
    
    network = nat
    nat_router = 192.168.10.100 eth1:1
    
    
    
    # SECTION 2 - VIRTUAL SERVERS
    #
    # Information we need to keep track of for each virtual server is:
    # scheduler:    pcc, rr, wlc, wrr (default is wlc)
    # persistent:   time (in seconds) to allow a persistent service connection to
    #               remain active.  If missing or set to 0, persistence is turned
    #               off.
    # pmask:        If persistence is enabled, this is the netmask to apply.
    #               Default is 255.255.255.255
    # address:      IP address of the virtual server (required)
    # active:       Simple switch if node is on or off
    # port:         port number to be handled by this virtual server (default
    #               is 80)
    # load_monitor: Tool to check load average on real server machines.
    #               Possible tools include rup, ruptime, uptime.
    # timeout:      Time (in seconds) between service activity queries
    # reentry:      Time (in seconds) a service must be alive before it is allowed
    #               back into the virtual server's routing table after leaving the
    #               table via failure.
    # send:         [optional] test string to send to port
    # expect:       [optional] test string to receive from port
    # protocol:     tcp or udp (defaults to tcp)
    #
    # This is the needed information for each real server for each Virtual Server:
    # address:      IP address of the real server.
    # active:       Simple switch if node is on or off
    # weight:       relative measure of server capacity
    
    virtual server1 {
            address = 207.175.44.252 eth0:1
            active = 1
            load_monitor = uptime
            timeout = 5
            reentry = 10
            port = http
            send = "GET / HTTP/1.0\r\n\r\n"
            expect = "HTTP"
            scheduler = wlc
            persistent = 60
            pmask = 255.255.255.255
            protocol = tcp
    
            server Real1 {
                    address = 192.168.10.2
                    active = 1
                    weight = 1
            }
    
            server Real2 {
                    address = 192.168.10.3
                    active = 1
                    weight = 1
            }
    }
    
    
    virtual server2 {
            address = 207.175.44.253 eth0:1
            active = 0
            load_monitor = uptime
            timeout = 5
            reentry = 10
            port = 21
            send = "\n"
    
            server Real1 {
                    address = 192.168.10.2
                    active = 1
            }
    
            server Real2 {
                    address = 192.168.10.3
                    active = 1
            }
    } 
    
    
    
    
    # SECTION 3 - FAILOVER SERVICES
    #
    # LVS node Service failover. This section applies only to services running
    # on the primary and backup LVS nodes (instead of being part of a virtual
    # server setup). You cannot currently use these services and virtual
    # servers in the same setup, and you must have at least a 2 node cluster
    # (a primary and backup) in order to use these failover services. All
    # nodes must be identically configured Linux systems.
    #
    # Failover services provide the most basic form of fault recovery. If any
    # of the services on the active node fail, all of the services will be
    # shutdown and restarted on a backup node. Services defined here will
    # automatically be started & stopped by LVS, so a backup node is
    # considered a "warm" standby. This is due to a technical restriction that
    # a service can only be operational on one node at a time, otherwise it may
    # fail to bind to a virtual IP address that does not yet exist on that
    # system or cause a networking conflict with the active service. The
    # commands provided for "start_cmd" and "stop_cmd" must work the same for
    # all nodes. Multiple services can be defined.
    #
    # Information here is similar in meaning and format to the virtual server
    # section. Failover Services and Virtual Servers cannot both be used on
    # a running system, so the "service = xxx" setting in the first section
    # of this file indicates which to use when starting the cluster.
    
    
    failover web1 {
         active = 1
         address = 207.175.44.242 eth0:1
         port = 1010
         send = "GET / HTTP/1.0\r\n\r\n"
         expect = "HTTP"
         timeout = 10
         start_cmd = "/etc/rc.d/init.d/httpd start"
         stop_cmd = "/etc/rc.d/init.d/httpd stop"
    }
    
    
    failover ftp {
         active = 0
         address = 207.175.44.252 eth0:1
         port = 21
         send = "\n"
         timeout = 10
         start_cmd = "/etc/rc.d/init.d/inet start"
         stop_cmd = "/etc/rc.d/init.d/inet stop"
    }

Piranha

The piranha kit revolves around a single configuration file, /etc/lvs.cf. All components of piranha use this file as the definition of the cluster. Piranha provides a daemon called 'lvs' that runs on the primary and backup nodes. This process controls Piranha and supports communication among its components.

To help determine if a node in the cluster is still alive, another daemon, 'pulse', runs on the primary and backup nodes. This process is normally started from the rc scripts as '/etc/rc.d/init.d/pulse start'.

Another daemon that runs on all nodes in the cluster is 'nanny'. Through this process, the primary LVS node determines whether a host service is alive and should continue to receive job assignments.
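
In practice the Piranha daemons are started through their init scripts. A hedged sketch (the piranha-gui service name and its usual port 3636 are as commonly documented for RHEL, but may vary between releases):

# chkconfig pulse on
# service piranha-gui start      # web configuration tool that writes /etc/sysconfig/ha/lvs.cf
# service pulse start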

GFS

Red Hat GFS is a cluster file system that allows a cluster of nodes to simultaneously access a block device that is shared among the nodes. GFS is a native file system that interfaces directly with the VFS layer of the Linux kernel file-system interface. GFS employs distributed metadata and multiple journals for optimal operation in a cluster. To maintain file system integrity, GFS uses a lock manager to coordinate I/O. When one node changes data on a GFS file system, that change is immediately visible to the other cluster nodes using that file system.

Using Red Hat GFS, you can achieve maximum application uptime through the following benefits:

  • Simplifying your data infrastructure
    • Install and patch applications once for the entire cluster.
    • Eliminates the need for redundant copies of application data (duplication).
    • Enables concurrent read/write access to data by many clients.
    • Simplifies backup and disaster recovery (only one file system to back up or recover).
  • Maximize the use of storage resources; minimize storage administration costs.
    • Manage storage as a whole instead of by partition.
    • Decrease overall storage needs by eliminating the need for data replications.
    • Scale the cluster seamlessly by adding servers or storage on the fly.
    • No more partitioning storage through complicated techniques.
    • Add servers to the cluster on the fly by mounting them to the common file system.

Nodes that run Red Hat GFS are configured and managed with Red Hat Cluster Suite configuration and management tools. Volume management is managed through CLVM (Cluster Logical Volume Manager). Red Hat GFS provides data sharing among GFS nodes in a Red Hat cluster. GFS provides a single, consistent view of the file-system name space across the GFS nodes in a Red Hat cluster. GFS allows applications to install and run without much knowledge of the underlying storage infrastructure. Also, GFS provides features that are typically required in enterprise environments, such as quotas, multiple journals, and multipath support.

GFS provides a versatile method of networking storage according to the performance, scalability, and economic needs of your storage environment. This chapter provides some very basic, abbreviated information as background to help you understand GFS.

You can deploy GFS in a variety of configurations to suit your needs for performance, scalability, and economy. For superior performance and scalability, you can deploy GFS in a cluster that is connected directly to a SAN. For more economical needs, you can deploy GFS in a cluster that is connected to a LAN with servers that use GNBD (Global Network Block Device) or to iSCSI (Internet Small Computer System Interface) devices.

Source

Conga

Conga is an agent/server architecture for remote administration of systems. The agent component is called “ricci”, and the server is called “luci”. One luci server can communicate with many ricci agents installed on systems. The luci server is accessed via a browser using https.

Conga has been initially developed to provide a convenient method for creating and managing clusters built with Red Hat Cluster Suite. It also offers an interface for managing sophisticated storage configurations like those often built to support clusters.
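
Before luci can manage a node, the ricci agent has to be installed and running on it. A hedged sketch:

(on every node to be managed)
# yum install ricci
# chkconfig ricci on
# service ricci start

(on the management station)
# yum install luci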

At the computer running luci, initialize the luci server using the luci_admin init command. For example:

# luci_admin init
Initializing the Luci server


Creating the 'admin' user

Enter password:  <Type password and press ENTER.>
Confirm password: <Re-type password and press ENTER.>

Please wait...
The admin password has been successfully set.
Generating SSL certificates...
Luci server has been successfully initialized


Restart the Luci server for changes to take effect
eg. service luci restart

Start luci using service luci restart. For example:

# service luci restart
Shutting down luci:                                        [  OK  ]
Starting luci: generating https SSL certificates...  done
                                                           [  OK  ]

Please, point your web browser to https://nano-01:8084 to access luci

Source

Topic 333: Cluster Storage

333.1 DRBD (weight: 3)

Candidates are expected to have the experience and knowledge to install, configure, maintain and troubleshoot DRBD devices. This includes integration with Pacemaker and heartbeat. The following is a partial list of the used files, terms and utilities:

  • w/Pacemaker
  • w/heartbeat

DRBD w/Pacemaker

Basic configuration

The most common way to configure DRBD is to replicate a volume between two fixed nodes, using IP addresses statically assigned on each.

Setting up DRBD

Please refer to the DRBD docs on how to install it and set it up.
From now on, we will assume that you've set up DRBD and that it is working (test it with the DRBD init script outside Pacemaker's control, as sketched below). If not, debug this first.
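
A hedged way to perform this check outside of Pacemaker, assuming a resource named drbd0 as used below:

(on both nodes)
# service drbd start
# cat /proc/drbd          # connection state should be Connected, disks UpToDate/UpToDate
# drbdadm role drbd0      # typically Secondary/Secondary before the cluster takes over
# service drbd stop       # hand control back before letting Pacemaker manage the resource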

Configuring the resource in the CIB

In the crm shell, you first have to create the primitive resource and then embed that into the master resource.

crm commands

configure

primitive drbd0 ocf:heartbeat:drbd \ 
 params drbd_resource=drbd0 \
 op monitor role=Master interval=59s timeout=30s \
 op monitor role=Slave interval=60s timeout=30s

ms ms-drbd0 drbd0 \
 meta clone-max=2 notify=true globally-unique=false target-role=stopped

commit

quit

The primitive DRBD resource, similar to what you would have used to configure drbddisk, is now embedded in a complex master object. This specifies the abilities and limitations of DRBD: there can be only two instances (clone-max), one per node (clone-node-max), and only one master ever (master-max). The notify attribute specifies that DRBD needs to be told about what happens to its peer; globally-unique set to false lets Pacemaker know that the instances cannot be told apart on a single node.

Note that we're creating the resource in stopped state first, so that we can finish configuring its constraints and dependencies before activating it.

Specifying the nodes where the DRBD RA can be run

If you have a two node cluster, you could skip this step, because obviously, it can only run on those two. If you want to run drbd0 on two out of more nodes only, you will have to tell the cluster about this constraint:

crm configure location ms-drbd0-placement ms-drbd0 rule -inf: \#uname ne xen-1 and \#uname ne xen-2

This will tell the Policy Engine that, first, drbd0 can not run anywhere else except on xen-1 or xen-2. Second, it tells the PE that yes, it can run on those two.

Note: This assumes a symmetric cluster. If your cluster is asymmetric, you will have to invert the rules (Don't worry - if you do not specifically configure asymmetric, your cluster is symmetric by default).

Preferring a node to run the master role

With the configuration so far, the cluster would pick a node to promote DRBD on. If you want to prefer a node to run the master role (xen-1 in this example), you can express that like this:

crm configure location ms-drbd0-master-on-xen-1 ms-drbd0 rule role=master 100: \#uname eq xen-1

You can now activate the DRBD resource:

crm resource start ms-drbd0

It should be started and promoted on one of the two nodes - or, if you specified a constraint as shown above, on the node you preferred.

Referencing the master or slave resource in constraints

DRBD is rarely useful by itself; you will probably want to run a service on top of it. Or, very likely, you want to mount the filesystem on the master side.

Let us assume that you've created an ext3 filesystem on /dev/drbd0, which you now want managed by the cluster as well. The filesystem resource object is straightforward and, if you have any experience with configuring Pacemaker at all, will look rather familiar:

crm configure primitive fs0 ocf:heartbeat:Filesystem params fstype=ext3 directory=/mnt/share1 \
 device=/dev/drbd0 meta target-role=stopped

Make sure that the various settings match your setup. Again, this object has been created as stopped first.

Now the interesting bits. Obviously, the filesystem should only be mounted on the same node where drbd0 is in primary state, and only after drbd0 has been promoted, which is expressed in these two constraints:

crm commands

configure

order ms-drbd0-before-fs0 mandatory: ms-drbd0:promote fs0:start

colocation fs0-on-ms-drbd0 inf: fs0 ms-drbd0:Master

commit

quit

Et voila! You now can activate the filesystem resource and it'll be mounted at the proper time in the proper place.

crm resource start fs0

Just as this was done with a single filesystem resource, this can be done with a group: In a lot of cases, you will not just want a filesystem, but also an IP-address and some sort of daemon to run on top of the DRBD master. Put those resources in a group, use the constraints above and replace fs0 with the name of your group. The following example includes an apache webserver.

crm commands

configure

primitive drbd0 ocf:heartbeat:drbd \
 params drbd_resource=drbd0 \
 op monitor role=Master interval=59s timeout=30s \
 op monitor role=Slave interval=60s timeout=30s

ms ms-drbd0 drbd0 \
 meta clone-max=2 notify=true globally-unique=false target-role=stopped 

primitive fs0 ocf:heartbeat:Filesystem \ 
 params fstype=ext3 directory=/usr/local/apache/htdocs device=/dev/drbd0

primitive webserver ocf:heartbeat:apache \
 params configfile=/usr/local/apache/conf/httpd.conf httpd=/usr/local/apache/bin/httpd port=80 \ 
 op monitor interval=30s timeout=30s

primitive virtual-ip ocf:heartbeat:IPaddr2 \
 params ip=10.0.0.1 broadcast=10.0.0.255 nic=eth0 cidr_netmask=24 \
 op monitor interval=21s timeout=5s

group apache-group fs0 webserver virtual-ip

order ms-drbd0-before-apache-group mandatory: ms-drbd0:promote apache-group:start

colocation apache-group-on-ms-drbd0 inf: apache-group ms-drbd0:Master

location ms-drbd0-master-on-xen-1 ms-drbd0 rule role=master 100: #uname eq xen-1

commit

end

resource start ms-drbd0

quit

This will load the drbd module on both nodes and promote the instance on xen-1. After successful promotion, it will first mount /dev/drbd0 to /usr/local/apache/htdocs, then start the apache webserver and in the end configure the service IP-address 10.0.0.1/24 on network card eth0.

Moving the master role to a different node

If you want to move the DRBD master role to the other node, you should not attempt to just move the master role by itself. On top of DRBD, you will probably have a Filesystem resource or a resource group with your application/Filesystem/IP address or whatever (remember, DRBD isn't usually useful by itself). If you want to move the master role, you can accomplish that by moving the resource that is co-located with the DRBD master (and properly ordered). This can be done with the crm shell or crm_resource. Given the group example from above, you would use

crm resource migrate apache-group [hostname] 

This will stop all resources in the group, demote the current master, promote the other DRBD instance and start the group after successful promotion.
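
For instance, with the node names from the example above (and remembering that migrate leaves a location constraint behind, which unmigrate removes again):

crm resource migrate apache-group xen-2
crm resource unmigrate apache-group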

Keeping the master role on a network connected node

It is most likely desirable to keep the master role on a node with a working network connection. I assume you are familiar with pingd; a hedged sketch of a possible pingd configuration is given at the end of this section. Once you have configured pingd, all you need to do is add an rsc_location constraint for the master role, which looks at the pingd attribute of the node.

crm configure location ms-drbd-0_master_on_connected_node ms-drbd0 \
 rule role=master -inf: not_defined pingd or pingd lte 0

This will force the master role off of any node with a pingd attribute value of less than or equal to 0, or without a pingd attribute at all.

Note: This will prevent the master role and all its colocated resources from running at all if all your nodes lose network connection to the ping nodes.

If you don't want that, you can also configure a different score value than -INFINITY, but that requires cluster-individual score-maths depending on your number of resources, stickiness values and constraint scores.
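
For completeness, here is a hedged sketch of how pingd itself could be set up as a cloned resource (the ping target 10.0.0.254, the multiplier and the intervals are assumptions):

crm configure primitive pingd ocf:pacemaker:pingd \
 params host_list="10.0.0.254" multiplier=100 \
 op monitor interval=15s timeout=20s

crm configure clone pingd-clone pingd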

Source

DRBD w/heartbeat

Heartbeat R1-style configuration

In R1-style clusters, Heartbeat keeps its complete configuration in three simple configuration files:

  • /etc/ha.d/ha.cf, as described in the section called “The ha.cf file”.
  • /etc/ha.d/authkeys, as described in the section called “The authkeys file” (a minimal sketch of the ha.cf and authkeys files follows this list).
  • /etc/ha.d/haresources — the resource configuration file, described below.
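
A minimal sketch of the first two files for the alice/bob cluster used in this chapter (the broadcast interface and timing values are assumptions):

# /etc/ha.d/ha.cf
node alice bob
bcast eth0
keepalive 2
deadtime 30
auto_failback off

# /etc/ha.d/authkeys  (must be readable by root only, mode 0600)
auth 1
1 sha1 SomeSharedSecret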

The haresources file
The following is an example of a Heartbeat R1-compatible resource configuration involving a MySQL database backed by DRBD:

bob drbddisk::mysql Filesystem::/dev/drbd0::/var/lib/mysql::ext3 \
    10.9.42.1 mysql

This resource configuration contains one resource group whose home node (the node where its resources are expected to run under normal circumstances) is named bob. Consequently, this resource group would be considered the local resource group on host bob, whereas it would be the foreign resource group on its peer host.

The resource group includes a DRBD resource named mysql, which will be promoted to the primary role by the cluster manager (specifically, the drbddisk resource agent) on whichever node is currently the active node. Of course, a corresponding resource must exist and be configured in /etc/drbd.conf for this to work.

That DRBD resource translates to the block device named /dev/drbd0, which contains an ext3 filesystem that is to be mounted at /var/lib/mysql (the default location for MySQL data files).

The resource group also contains a service IP address, 10.9.42.1. Heartbeat will make sure that this IP address is configured and available on whichever node is currently active.

Finally, Heartbeat will use the LSB resource agent named mysql in order to start the MySQL daemon, which will then find its data files at /var/lib/mysql and be able to listen on the service IP address, 10.9.42.1.

It is important to understand that the resources listed in the haresources file are always evaluated from left to right when resources are being started, and from right to left when they are being stopped.

Source

Heartbeat CRM configuration

In CRM clusters, Heartbeat keeps part of configuration in the following configuration files:

  • /etc/ha.d/ha.cf, as described in the section called “The ha.cf file”. You must include the following line in this configuration file to enable CRM mode:
    crm yes
  • /etc/ha.d/authkeys. The contents of this file are the same as for R1 style clusters. See the section called “The authkeys file” for details.

The remainder of the cluster configuration is maintained in the Cluster Information Base (CIB), covered in detail in the following section. Contrary to the two relevant configuration files, the CIB need not be manually distributed among cluster nodes; the Heartbeat services take care of that automatically.

The Cluster Information Base
The Cluster Information Base (CIB) is kept in one XML file, /var/lib/heartbeat/crm/cib.xml. It is, however, not recommended to edit the contents of this file directly, except in the case of creating a new cluster configuration from scratch. Instead, Heartbeat comes with both command-line applications and a GUI to modify the CIB.

The CIB actually contains both the cluster configuration (which is persistent and is kept in the cib.xml file), and information about the current cluster status (which is volatile). Status information, too, may be queried using either the Heartbeat command-line tools or the Heartbeat GUI.

After creating a new Heartbeat CRM cluster — that is, creating the ha.cf and authkeys files, distributing them among cluster nodes, starting Heartbeat services, and waiting for nodes to establish intra-cluster communications — a new, empty CIB is created automatically. Its contents will be similar to this:

<cib>
   <configuration>
     <crm_config>
       <cluster_property_set id="cib-bootstrap-options">
         <attributes/>
       </cluster_property_set>
     </crm_config>
     <nodes>
       <node uname="alice" type="normal"
             id="f11899c3-ed6e-4e63-abae-b9af90c62283"/>
       <node uname="bob" type="normal"
             id="663bae4d-44a0-407f-ac14-389150407159"/>
     </nodes>
     <resources/>
     <constraints/>
   </configuration>
 </cib>

The exact format and contents of this file are documented at length on the Linux-HA web site, but for practical purposes it is important to understand that this cluster has two nodes named alice and bob, and that neither any resources nor any resource constraints have been configured at this point.

Adding a DRBD-backed service to the cluster configuration

This section explains how to enable a DRBD-backed service in a Heartbeat CRM cluster. The examples used in this section mimic, in functionality, those described in the section called “Heartbeat resources”, dealing with R1-style Heartbeat clusters.

The complexity of the configuration steps described in this section may seem overwhelming to some, particularly those having previously dealt only with R1-style Heartbeat configurations. While the configuration of Heartbeat CRM clusters is indeed complex (and sometimes not very user-friendly), the CRM's advantages may outweigh those of R1-style clusters. Which approach to follow is entirely up to the administrator's discretion.
Using the drbddisk resource agent in a Heartbeat CRM configuration

Even though you are using Heartbeat in CRM mode, you may still utilize R1-compatible resource agents such as drbddisk. This resource agent provides no secondary node monitoring, and ensures only resource promotion and demotion.

In order to enable a DRBD-backed configuration for a MySQL database in a Heartbeat CRM cluster with drbddisk, you would use a configuration like this:

<group ordered="true" collocated="true" id="rg_mysql">
  <primitive class="heartbeat" type="drbddisk"
             provider="heartbeat" id="drbddisk_mysql">
    <meta_attributes>
      <attributes>
        <nvpair name="target_role" value="started"/>
      </attributes>
    </meta_attributes>
    <instance_attributes>
      <attributes>
        <nvpair name="1" value="mysql"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive class="ocf" type="Filesystem"
             provider="heartbeat" id="fs_mysql">
    <instance_attributes>
      <attributes>
        <nvpair name="device" value="/dev/drbd0"/>
        <nvpair name="directory" value="/var/lib/mysql"/>
        <nvpair name="type" value="ext3"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive class="ocf" type="IPaddr2"
             provider="heartbeat" id="ip_mysql">
    <instance_attributes>
      <attributes>
        <nvpair name="ip" value="192.168.42.1"/>
        <nvpair name="cidr_netmask" value="24"/>
        <nvpair name="nic" value="eth0"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive class="lsb" type="mysqld"
             provider="heartbeat" id="mysqld"/>
</group>

Assuming you created this configuration in a temporary file named /tmp/hb_mysql.xml, you would add this resource group to the cluster configuration using the following command (on any cluster node):

cibadmin -o resources -C -x /tmp/hb_mysql.xml

After this, Heartbeat will automatically propagate the newly-configured resource group to all cluster nodes.

Using the drbd OCF resource agent in a Heartbeat CRM configuration

The drbd resource agent is a “pure-bred” OCF RA which provides Master/Slave capability, allowing Heartbeat to start and monitor the DRBD resource on multiple nodes and promoting and demoting as needed. You must, however, understand that the drbd RA disconnects and detaches all DRBD resources it manages on Heartbeat shutdown, and also upon enabling standby mode for a node.

In order to enable a DRBD-backed configuration for a MySQL database in a Heartbeat CRM cluster with the drbd OCF resource agent, you must create both the necessary resources, and Heartbeat constraints to ensure your service only starts on a previously promoted DRBD resource. It is recommended that you start with the constraints, such as shown in this example:

<constraints>
  <rsc_order id="mysql_after_drbd" from="rg_mysql" action="start"
             to="ms_drbd_mysql" to_action="promote" type="after"/>
  <rsc_colocation id="mysql_on_drbd" to="ms_drbd_mysql"
                  to_role="master" from="rg_mysql" score="INFINITY"/>
</constraints>

Assuming you put these settings in a file named /tmp/constraints.xml, here is how you would enable them:

cibadmin -U -x /tmp/constraints.xml

Subsequently, you would create your relevant resources:

<resources>
  <master_slave id="ms_drbd_mysql">
    <meta_attributes id="ms_drbd_mysql-meta_attributes">
      <attributes>
        <nvpair name="notify" value="yes"/>
        <nvpair name="globally_unique" value="false"/>
      </attributes>
    </meta_attributes>
    <primitive id="drbd_mysql" class="ocf" provider="heartbeat"
        type="drbd">
      <instance_attributes id="ms_drbd_mysql-instance_attributes">
        <attributes>
          <nvpair name="drbd_resource" value="mysql"/>
        </attributes>
      </instance_attributes>
      <operations id="ms_drbd_mysql-operations">
        <op id="ms_drbd_mysql-monitor-master"
	    name="monitor" interval="29s"
            timeout="10s" role="Master"/>
        <op id="ms_drbd_mysql-monitor-slave"
            name="monitor" interval="30s"
            timeout="10s" role="Slave"/>
      </operations>
    </primitive>
  </master_slave>
  <group id="rg_mysql">
    <primitive class="ocf" type="Filesystem"
               provider="heartbeat" id="fs_mysql">
      <instance_attributes id="fs_mysql-instance_attributes">
        <attributes>
          <nvpair name="device" value="/dev/drbd0"/>
          <nvpair name="directory" value="/var/lib/mysql"/>
          <nvpair name="type" value="ext3"/>
        </attributes>
      </instance_attributes>
    </primitive>
    <primitive class="ocf" type="IPaddr2"
               provider="heartbeat" id="ip_mysql">
      <instance_attributes id="ip_mysql-instance_attributes">
        <attributes>
          <nvpair name="ip" value="10.9.42.1"/>
          <nvpair name="nic" value="eth0"/>
        </attributes>
      </instance_attributes>
    </primitive>
    <primitive class="lsb" type="mysqld"
               provider="heartbeat" id="mysqld"/>
  </group>
</resources>

Assuming you put these settings in a file named /tmp/resources.xml, here is how you would enable them:

cibadmin -U -x /tmp/resources.xml

After this, your configuration should be enabled. Heartbeat now selects a node on which it promotes the DRBD resource, and then starts the DRBD-backed resource group on that same node.
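
You can verify the result from any node with the CRM status tools. A minimal check, assuming the standard Heartbeat CRM command-line utilities (crm_mon, crm_resource) are available:

    # show the cluster status once, including which node holds the Master role
    crm_mon -1
    # locate the node currently running the MySQL resource group
    crm_resource -W -r rg_mysql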

Source

333.2 Global File System and OCFS2 (weight: 3)

Candidates should know how to install, maintain and troubleshoot installations using GFS and OCFS2. The following is a partial list of the used files, terms and utilities:

  • GFS2
  • Distributed Lock Manager

GFS2

The Global File System 2 (GFS2) is a shared-disk file system for Linux computer clusters. GFS2 differs from distributed file systems (such as AFS, Coda, or InterMezzo) because GFS2 allows all nodes to have direct concurrent access to the same shared block storage. In addition, GFS or GFS2 can also be used as a local filesystem.

GFS has no disconnected operating-mode, and no client or server roles. All nodes in a GFS cluster function as peers. Using GFS in a cluster requires hardware to allow access to the shared storage, and a lock manager to control access to the storage. The lock manager operates as a separate module: thus GFS and GFS2 can use the Distributed Lock Manager (DLM) for cluster configurations and the “nolock” lock manager for local filesystems. Older versions of GFS also support GULM, a server based lock manager which implements redundancy via failover.
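
As an illustration of how the lock manager choice surfaces in practice: the locking protocol is stored in the GFS2 superblock when the filesystem is created and can be overridden at mount time. The device name below is purely an example; the commands are standard gfs2-utils/mount usage, though options may vary slightly between versions:

    # show the locking protocol recorded in the superblock
    gfs2_tool sb /dev/vg0/gfs2vol proto
    # mount the filesystem on a single node without a cluster by overriding the protocol
    mount -t gfs2 -o lockproto=lock_nolock /dev/vg0/gfs2vol /mnt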

Source

Setting up GFS2

The first command you need to know for creating and modifying your cluster is the 'ccs_tool' command.

Below I will show you the necessary steps to create a cluster and then the GFS2 filesystem.
1. The first step is to install the necessary RPMs:

    yum -y install modcluster rgmanager gfs2 gfs2-utils lvm2-cluster cman

2. The second step is to create a cluster on gfs1:

    ccs_tool create GFStestCluster
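
ccs_tool create writes an initial /etc/cluster/cluster.conf. The generated skeleton looks roughly like the following (the exact elements and attributes may differ between versions):

    <?xml version="1.0"?>
    <cluster name="GFStestCluster" config_version="1">
      <clusternodes/>
      <fencedevices/>
      <rm/>
    </cluster>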

3. Now that the cluster is created, we need to add the fencing devices.

  ( For simplicity you can just use fence_manual for each host: ccs_tool addfence -C gfs1_ipmi fence_manual )
  But if you are using VMware ESX like I am, you should use fence_vmware like so:
    ccs_tool addfence -C gfs1_vmware fence_vmware ipaddr=esxtest login=esxuser passwd=esxpass \
     vmlogin=root vmpasswd=esxpass port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"
    ccs_tool addfence -C gfs2_vmware fence_vmware ipaddr=esxtest login=esxuser passwd=esxpass \
     vmlogin=root vmpasswd=esxpass port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/gfs2/gfs2.vmx"
    ccs_tool addfence -C gfs3_vmware fence_vmware ipaddr=esxtest login=esxuser passwd=esxpass \
     vmlogin=root vmpasswd=esxpass port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/gfs3/gfs3.vmx"

4. Now that we have added the fencing devices, it is time to add the nodes:

    ccs_tool addnode -C gfs1 -n 1 -v 1 -f gfs1_vmware
    ccs_tool addnode -C gfs2 -n 2 -v 1 -f gfs2_vmware
    ccs_tool addnode -C gfs3 -n 3 -v 1 -f gfs3_vmware

5. Now we need to copy this configuration from gfs1 over to the other 2 nodes, or run the exact same commands as above on them:

    scp /etc/cluster/cluster.conf root@gfs2:/etc/cluster/cluster.conf
    scp /etc/cluster/cluster.conf root@gfs3:/etc/cluster/cluster.conf

6. You can verify the config on all 3 nodes by running the following commands:

    ccs_tool lsnode
    ccs_tool lsfence

7. Once you have either copied over the config or re-run the same commands on the other 2 nodes, you are ready to start the following daemons on all nodes in the cluster:

    /etc/init.d/cman start
    /etc/init.d/rgmanager start

8. You can now check the status of your cluster by running the commands below…

    clustat
    cman_tool status

9. If you want to test the VMware fencing you can do so as follows ( run the command below on the 1st node and use the 2nd node as the node to be fenced ):

  fence_vmware -a esxtest -l esxuser -p esxpass -L root -P esxpass \
   -n "/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/gfs2/gfs2.vmx" -v

10. Before we create the LVM2 volumes and proceed to GFS2, we need to enable clustering in LVM2.

    lvmconf --enable-cluster
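
What this command effectively does is switch LVM2 to cluster-wide locking through clvmd; after running it you should find a setting equivalent to the following in /etc/lvm/lvm.conf (excerpt shown, surrounding options omitted):

    global {
        # locking_type 3 = built-in clustered locking via clvmd
        locking_type = 3
    }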

11. Now it is time to create the LVM2 Volumes…

    pvcreate /dev/sdb
    vgcreate -c y mytest_gfs2 /dev/sdb
    lvcreate -n MyGFS2test -L 5G mytest_gfs2
    /etc/init.d/clvmd start
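
Once clvmd is running, you can check that the volume group is really marked as clustered; the clustered flag shows up as a 'c' at the end of the VG attribute string. A quick check, using the names from the steps above:

    # the vg_attr column should end in 'c', e.g. wz--nc
    vgs -o vg_name,vg_attr mytest_gfs2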

12. You should now also start clvmd on the other 2 nodes.

13. Once the above has been completed, you will need to create the GFS2 file system. Example below:

    mkfs.gfs2 -p <locking mechanism> -t <ClusterName>:<FilesystemName> \
     -j <number of journals == number of nodes in the cluster> <block device>
    mkfs.gfs2 -p lock_dlm -t GFStestCluster:MyTestGFS -j 3 /dev/mapper/mytest_gfs2-MyGFS2test
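
Because the number of journals limits how many nodes can mount the filesystem at the same time, you may need to add journals later when the cluster grows. Assuming the filesystem is mounted, something along these lines should work:

    # add one extra journal to the GFS2 filesystem mounted on /mnt
    gfs2_jadd -j 1 /mnt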

14. All we need to do on the 3 nodes is to mount the GFS2 file system.

    mount /dev/mapper/mytest_gfs2-MyGFS2test /mnt/

15. Once you have mounted your GFS2 file system, you can run the following commands:

    gfs2_tool list
    gfs2_tool df


Now it is time to wrap it up with some final commands…

1. Now that we have a fully functional cluster and a mountable GFS2 file system, we need to make sure all the necessary daemons start automatically at boot:

    chkconfig --level 345 rgmanager on
    chkconfig --level 345 clvmd on
    chkconfig --level 345 cman on
    chkconfig --level 345 gfs2 on

2. If you want the GFS2 file system to be mounted at startup, you can add this to /etc/fstab (make sure the /GFS mount point exists first):

    echo "/dev/mapper/mytest_gfs2-MyGFS2test /GFS gfs2 defaults,noatime,nodiratime 0 0" >> /etc/fstab

Source

Distributed Lock Manager

A distributed lock manager (DLM) provides distributed software applications with a means to synchronize their accesses to shared resources.

Lock management is a common cluster-infrastructure service that provides a mechanism for other cluster infrastructure components to synchronize their access to shared resources. In a Red Hat cluster, DLM (Distributed Lock Manager) is the lock manager.

A lock manager is a traffic cop who controls access to resources in the cluster, such as access to a GFS file system. You need it because without a lock manager, there would be no control over access to your shared storage, and the nodes in the cluster would corrupt each other's data.

As implied in its name, DLM is a distributed lock manager and runs in each cluster node; lock management is distributed across all nodes in the cluster. GFS2 and CLVM use locks from the lock manager. GFS2 uses locks from the lock manager to synchronize access to file system metadata (on shared storage). CLVM uses locks from the lock manager to synchronize updates to LVM volumes and volume groups (also on shared storage). In addition, rgmanager uses DLM to synchronize service states.
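
As an illustration, on a running Red Hat cluster you can inspect the lockspaces DLM maintains for these components (assuming the cman/dlm userland tools are installed; the lockspace name used here is an example):

    # list the active DLM lockspaces, e.g. clvmd, rgmanager and one per GFS2 filesystem
    dlm_tool ls
    # dump the locks currently held in a given lockspace
    dlm_tool lockdump clvmd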

DLM Locking Model

The DLM locking model provides a rich set of locking modes and both synchronous and asynchronous execution. An application acquires a lock on a lock resource. A one-to-many relationship exists between lock resources and locks: a single lock resource can have multiple locks associated with it.

A lock resource can correspond to an actual object, such as a file, a data structure, a database, or an executable routine, but it does not have to correspond to one of these things. The object you associate with a lock resource determines the granularity of the lock. For example, locking an entire database is considered locking at coarse granularity. Locking each item in a database is considered locking at a fine granularity.

The DLM locking model supports:

  • Six locking modes that increasingly restrict access to a resource (the individual modes are named below)
  • The promotion and demotion of locks through conversion
  • Synchronous completion of lock requests
  • Asynchronous completion
  • Global data through lock value blocks
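
For reference, the six lock modes referred to above are, from least to most restrictive, conventionally listed as follows (these names come from the VMS lock manager lineage that the Linux DLM follows):

  • NL — Null (holds a place in the lock queue, grants no access)
  • CR — Concurrent Read
  • CW — Concurrent Write
  • PR — Protected Read (shared read lock)
  • PW — Protected Write (update lock)
  • EX — Exclusive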

The DLM provides its own mechanisms to support its locking features, such as inter-node communication to manage lock traffic and recovery protocols to re-master locks after a node failure or to migrate locks when a node joins the cluster. However, the DLM does not provide mechanisms to actually manage the cluster itself. Therefore the DLM expects to operate in a cluster in conjunction with another cluster infrastructure environment that provides the following minimum requirements:

  • The node is a part of a cluster.
  • All nodes agree on cluster membership and the cluster has quorum.
  • An IP address must be available with which to communicate with the DLM on a node. Normally the DLM uses TCP/IP for inter-node communications, which restricts it to a single IP address per node (though this can be made more redundant using the bonding driver). The DLM can be configured to use SCTP as its inter-node transport, which allows multiple IP addresses per node.

The DLM works with any cluster infrastructure environment that provides the minimum requirements listed above. The choice of an open source or closed source environment is up to the user. However, the DLM's main limitation is the amount of testing performed with different environments.

Source

Lock States

A lock state indicates the current status of a lock request. A lock is always in one of three states:

  • Granted — The lock request succeeded and attained the requested mode.
  • Converting — A client attempted to change the lock mode and the new mode is incompatible with an existing lock.
  • Blocked — The request for a new lock could not be granted because conflicting locks exist.

A lock's state is determined by its requested mode and the modes of the other locks on the same resource.

Source

333.3 Other Clustered File Systems (weight: 1)

Candidates should have an awareness of other clustered filesystems available in a Linux environment. The following is a partial list of the used files, terms and utilities:

  • Coda
  • AFS
  • GlusterFS

Coda

Coda is a distributed file system developed as a research project at Carnegie Mellon University since 1987 under the direction of Mahadev Satyanarayanan. It descended directly from an older version of AFS (AFS-2) and offers many similar features. The InterMezzo file system was inspired by Coda. Coda is still under development, though the focus has shifted from research to creating a robust product for commercial use.

Coda has many features that are desirable for network file systems, and several features not found elsewhere.

  1. Disconnected operation for mobile computing
  2. Is freely available under a liberal license
  3. High performance through client side persistent caching
  4. Server replication
  5. Security model for authentication, encryption and access control
  6. Continued operation during partial network failures in server network
  7. Network bandwidth adaptation
  8. Good scalability
  9. Well defined semantics of sharing, even in the presence of network failures

Coda uses a local cache to provide access to server data when the network connection is lost. During normal operation, a user reads and writes to the file system normally, while the client fetches, or “hoards”, all of the data the user has listed as important in the event of network disconnection. If the network connection is lost, the Coda client serves data from this local cache and logs all updates. This operating state is called disconnected operation. Upon network reconnection, the client moves to reintegration state; it sends logged updates to the servers. Then it transitions back to normal connected-mode operation.

Also different from AFS is Coda's data replication method. AFS uses a pessimistic replication strategy with its files, only allowing one read/write server to receive updates and all other servers acting as read-only replicas. Coda allows all servers to receive updates, allowing for a greater availability of server data in the event of network partitions, a case which AFS cannot handle.

These unique features introduce the possibility of semantically diverging copies of the same files or directories, known as “conflicts”. Disconnected operation's local updates can potentially clash with other connected users' updates on the same objects, preventing reintegration. Optimistic replication can potentially cause concurrent updates to different servers on the same object, preventing replication. The former case is called a “local/global” conflict, and the latter case a “server/server” conflict. Coda has extensive repair tools, both manual and automated, to handle and repair both types of conflicts.

Source

AFS

The Andrew File System (AFS) is a distributed networked file system which uses a set of trusted servers to present a homogeneous, location-transparent file name space to all the client workstations. It was developed by Carnegie Mellon University as part of the Andrew Project. It is named after Andrew Carnegie and Andrew Mellon. Its primary use is in distributed computing.

AFS has several benefits over traditional networked file systems, particularly in the areas of security and scalability. It is not uncommon for enterprise AFS deployments to exceed 25,000 clients. AFS uses Kerberos for authentication, and implements access control lists on directories for users and groups. Each client caches files on the local filesystem for increased speed on subsequent requests for the same file. This also allows limited filesystem access in the event of a server crash or a network outage.

Read and write operations on an open file are directed only to the locally cached copy. When a modified file is closed, the changed portions are copied back to the file server. Cache consistency is maintained by a callback mechanism. When a file is cached, the server makes a note of this and promises to inform the client if the file is updated by someone else. Callbacks are discarded and must be re-established after any client, server, or network failure, including a time-out. Re-establishing a callback involves a status check and does not require re-reading the file itself.

A consequence of the file locking strategy is that AFS does not support large shared databases or record updating within files shared between client systems. This was a deliberate design decision based on the perceived needs of the university computing environment. It leads, for example, to the use of a single file per message in the original email system for the Andrew Project, the Andrew Message System, rather than a single file per mailbox. See file locking (AFS and buffered I/O Problems) for handling shared databases.

A significant feature of AFS is the volume, a tree of files, sub-directories and AFS mountpoints (links to other AFS volumes). Volumes are created by administrators and linked at a specific named path in an AFS cell. Once created, users of the filesystem may create directories and files as usual without concern for the physical location of the volume. A volume may have a quota assigned to it in order to limit the amount of space consumed. As needed, AFS administrators can move that volume to another server and disk location without the need to notify users; indeed the operation can occur while files in that volume are being used.
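
To make the volume concept more concrete, these are the kinds of commands an OpenAFS administrator would use; the server, partition, volume and path names here are purely illustrative:

    # create a volume on a file server partition
    vos create fs1.example.com /vicepa user.jdoe
    # mount it at a path inside the cell's shared name space
    fs mkmount /afs/example.com/user/jdoe user.jdoe
    # limit the volume to about 500 MB (quota is given in kilobyte blocks)
    fs setquota /afs/example.com/user/jdoe 500000
    # later, move the volume to another server/partition while it remains in use
    vos move user.jdoe fs1.example.com /vicepa fs2.example.com /vicepb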

AFS volumes can be replicated to read-only cloned copies. When accessing files in a read-only volume, a client system will retrieve data from a particular read-only copy. If at some point that copy becomes unavailable, clients will look for any of the remaining copies. Again, users of that data are unaware of the location of the read-only copy; administrators can create and relocate such copies as needed. The AFS command suite guarantees that all read-only volumes contain exact copies of the original read-write volume at the time the read-only copy was created.

The file name space on an Andrew workstation is partitioned into a shared and local name space. The shared name space (usually mounted as /afs on the Unix filesystem) is identical on all workstations. The local name space is unique to each workstation. It only contains temporary files needed for workstation initialization and symbolic links to files in the shared name space.

The Andrew File System heavily influenced Version 4 of Sun Microsystems' popular Network File System (NFS). Additionally, a variant of AFS, the Distributed File System (DFS) was adopted by the Open Software Foundation in 1989 as part of their Distributed Computing Environment.

Source

GlusterFS

GlusterFS is a scale-out NAS file system. It is free software, with some parts licensed under the GNU GPL v3 while others are dual licensed under either GPL v2 or the LGPL v3. It aggregates various storage servers over Ethernet or Infiniband RDMA interconnect into one large parallel network file system. GlusterFS is based on a stackable user space design. It has found a variety of applications including cloud computing, streaming media services, and content delivery networks. GlusterFS was developed originally by Gluster, Inc., then by Red Hat, Inc., after their purchase of Gluster in 2011.

GlusterFS has a client and server component. Servers are typically deployed as storage bricks, with each server running a glusterfsd daemon to export a local file system as a volume. The glusterfs client process, which connects to servers with a custom protocol over TCP/IP, InfiniBand or SDP, creates composite virtual volumes from multiple remote servers using stackable translators. By default, files are stored whole, but striping of files across multiple remote volumes is also supported. The final volume may then be mounted by the client host using its own native protocol via the FUSE mechanism, using the NFSv3 protocol via a built-in server translator, or accessed via the gfapi client library. Native-protocol mounts may then be re-exported e.g. via the kernel NFSv4 server, SAMBA, or the object-based OpenStack Storage (Swift) protocol using the “UFO” (Unified File and Object) translator.
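
A minimal sketch of the brick/volume/client relationship described above, using hypothetical host and path names and assuming the GlusterFS 3.1+ command line:

    # on one of the servers: aggregate two bricks into a replicated volume
    gluster volume create testvol replica 2 server1:/export/brick1 server2:/export/brick1
    gluster volume start testvol
    # on a client: mount the volume through the native FUSE client
    mount -t glusterfs server1:/testvol /mnt/gluster
    # or through the built-in NFSv3 server translator
    mount -t nfs -o vers=3 server1:/testvol /mnt/gluster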

Most of the functionality of GlusterFS is implemented as translators, including:

  • File-based mirroring and replication
  • File-based striping
  • File-based load balancing
  • Volume failover
  • Scheduling and disk caching
  • Storage quotas

The GlusterFS server is intentionally kept simple: it exports an existing directory as-is, leaving it up to client-side translators to structure the store. The clients themselves are stateless, do not communicate with each other, and are expected to have translator configurations consistent with each other. GlusterFS relies on an elastic hashing algorithm, rather than using either a centralized or distributed metadata model. With version 3.1 and later of GlusterFS, volumes can be added, deleted, or migrated dynamically, helping to avoid configuration coherency problems, and allowing GlusterFS to scale up to several petabytes on commodity hardware by avoiding bottlenecks that normally affect more tightly-coupled distributed file systems.
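
Growing such a volume online, as described above, might look like this (hypothetical names again; the rebalance step redistributes existing files over the new brick layout):

    # add another pair of bricks to the replica-2 volume and rebalance
    gluster volume add-brick testvol server3:/export/brick1 server4:/export/brick1
    gluster volume rebalance testvol start
    gluster volume rebalance testvol status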

Source

Acknowledgments

Most of the information in this document was collected from different sites on the internet and was copied, in modified or unmodified form. Some text was created by me. The copyright of the text in this document remains with its original owners and is in no way claimed by me. If you wrote some of the text that was copied, I would like to thank you for your excellent work.

Nothing in this document should be published for commercial purposes without the permission of the original copyright owners.

For questions about this document or if you want to help keep this document up-to-date, you can contact me at webmaster@universe-network.net

 