LPI 304-200: Virtualization

In this document you will find information on the different objectives of the LPIC 304 exam. Before using this document, check on the LPI site whether the objectives are still the same. This document is provided as a study aid and is in no way a guarantee for passing the exam. Try to gain some practical knowledge and really understand how things work; that should help.

Topic 330: Virtualization

330.1 Virtualization Concepts and Theory (weight: 8)

Description: Candidates should know and understand the general concepts, theory and terminology of Virtualization. This includes Xen, KVM and libvirt terminology.

Key Knowledge Areas:

  • Terminology
  • Pros and Cons of Virtualization
  • Variations of Virtual Machine Monitors
  • Migration of Physical to Virtual Machines
  • Migration of Virtual Machines between Host systems
  • Cloud Computing

The following is a partial list of the used files, terms and utilities:

  • Hypervisor
  • Hardware Virtual Machine (HVM)
  • Paravirtualization (PV)
  • Container Virtualization
  • Emulation and Simulation
  • CPU flags
  • /proc/cpuinfo
  • Migration (P2V, V2V)
  • IaaS, PaaS, SaaS

Terminology

Hypervisor
In computing, a hypervisor, also called a virtual machine monitor (VMM), is one of many hardware virtualization techniques that allow multiple operating systems, termed guests, to run concurrently on a host computer. It is so named because it is conceptually one level higher than a supervisory program. The hypervisor presents the guest operating systems with a virtual operating platform and manages the execution of the guest operating systems. Multiple instances of a variety of operating systems may share the virtualized hardware resources. Hypervisors are very commonly installed on server hardware, with the function of running guest operating systems that themselves act as servers.

HVM (Hardware Virtual Machine)
In computing, hardware-assisted virtualization is a platform virtualization approach that enables efficient full virtualization using help from hardware capabilities, primarily from the host processors. Full virtualization is used to simulate a complete hardware environment, or virtual machine, in which an unmodified guest operating system (using the same instruction set as the host machine) executes in complete isolation. Hardware-assisted virtualization was added to x86 processors (Intel VT-x or AMD-V) in 2006.

Hardware-assisted virtualization is also known as accelerated virtualization; Xen calls it hardware virtual machine (HVM), Virtual Iron calls it native virtualization.

PV (Paravirtualization)
In computing, paravirtualization is a virtualization technique that presents a software interface to virtual machines that is similar but not identical to that of the underlying hardware.

The intent of the modified interface is to reduce the portion of the guest's execution time spent performing operations which are substantially more difficult to run in a virtual environment compared to a non-virtualized environment. Paravirtualization provides specially defined 'hooks' that allow the guest(s) and host to request and acknowledge these tasks, which would otherwise be executed in the virtual domain (where execution performance is worse). A successful paravirtualized platform may allow the virtual machine monitor (VMM) to be simpler (by relocating execution of critical tasks from the virtual domain to the host domain), and/or reduce the overall performance degradation of machine execution inside the virtual guest.

Paravirtualization requires the guest operating system to be explicitly ported for the para-API; a conventional OS distribution that is not paravirtualization-aware cannot run on top of a paravirtualizing VMM. However, even in cases where the operating system cannot be modified, components may still be available that enable many of the significant performance advantages of paravirtualization; for example, the XenWindowsGplPv project provides a kit of paravirtualization-aware device drivers, licensed under the terms of the GPL, that are intended to be installed into a Microsoft Windows virtual guest running on the Xen hypervisor.

domains
Domain or virtual machine. A virtual machine (VM) is a software implementation of a machine (i.e. a computer) that executes programs like a physical machine. Virtual machines are separated into two major categories, based on their use and degree of correspondence to any real machine. A system virtual machine provides a complete system platform which supports the execution of a complete operating system (OS). In contrast, a process virtual machine is designed to run a single program, which means that it supports a single process. An essential characteristic of a virtual machine is that the software running inside is limited to the resources and abstractions provided by the virtual machine—it cannot break out of its virtual environment.

A virtual machine was originally defined by Popek and Goldberg as “an efficient, isolated duplicate of a real machine”. Current use includes virtual machines which have no direct correspondence to any real hardware.

emulation and simulation
In integrated circuit design, hardware emulation is the process of imitating the behavior of one or more pieces of hardware (typically a system under design) with another piece of hardware, typically a special purpose emulation system. The emulation model is usually based on RTL (e.g. Verilog) source code, which is compiled into the format used by emulation system. The goal is normally debugging and functional verification of the system being designed. Often an emulator is fast enough to be plugged into a working target system in place of a yet-to-be-built chip, so the whole system can be debugged with live data. This is a specific case of in-circuit emulation.

Sometimes hardware emulation can be confused with hardware devices such as expansion cards with hardware processors that assist functions of software emulation, such as older daughterboards with x86 chips to allow x86 OSes to run on motherboards of different processor families.

A computer simulation, a computer model, or a computational model is a computer program, or network of computers, that attempts to simulate an abstract model of a particular system. Computer simulations have become a useful part of mathematical modeling of many natural systems in physics (computational physics), astrophysics, chemistry and biology, human systems in economics, psychology, social science, and engineering. Simulation of a system is represented as the running of the system's model. It can be used to explore and gain new insights into new technology, and to estimate the performance of systems too complex for analytical solutions.

CPU flags
These flags indicate the features supported by a CPU. Relevant flags for virtualization:

  • HVM Hardware support for virtual machines (Xen abbreviation for AMD SVM / Intel VMX)
  • SVM Secure Virtual Machine. (AMD’s virtualization extensions to the 64-bit x86 architecture, equivalent to Intel’s VMX, both also known as HVM in the Xen hypervisor.)
  • VMX Intel’s equivalent to AMD’s SVM

More CPU flags can be found in /proc/cpuinfo (see the example below).

Pros and Cons of Virtualization

System virtual machine advantages:

  • multiple OS environments can co-exist on the same computer, in strong isolation from each other
  • the virtual machine can provide an instruction set architecture (ISA) that is somewhat different from that of the real machine
  • application provisioning, maintenance, high availability and disaster recovery


The main disadvantages of VMs are:

  • a virtual machine is less efficient than a real machine when it accesses the hardware indirectly
  • when multiple VMs are concurrently running on the same physical host, each VM may exhibit a varying and unstable performance (Speed of Execution, and not results), which highly depends on the workload imposed on the system by other VMs, unless proper techniques are used for temporal isolation among virtual machines.


Variations of Virtual Machine Monitors

The software that creates a virtual machine environment in a computer. In a regular, non-virtual environment, the operating system is the master control program, which manages the execution of all applications and acts as an interface between the applications and the hardware. The OS has the highest privilege level in the machine, known as “ring 0” (see ring).

In a virtual machine environment, the virtual machine monitor (VMM) becomes the master control program with the highest privilege level, and the VMM manages one or more operating systems, now referred to as “guest operating systems.” Each guest OS manages its own applications as it normally does in a non-virtual environment, except that it has been isolated in the computer by the VMM. Each guest OS with its applications is known as a “virtual machine” and is sometimes called a “guest OS stack.”

Prior to the introduction of hardware support for virtualization, the VMM could only use software techniques for virtualizing x86 processors and providing virtual hardware. This software approach used binary translation (BT) for instruction set virtualization and shadow page tables for memory management unit virtualization. Today, both Intel and AMD provide hardware support for CPU virtualization with Intel VT-x and AMD-V, respectively. More recently they added support for memory management unit (MMU) virtualization with Intel EPT and AMD RVI. In the remainder of this section, hardware support for CPU virtualization is referred to as hardware virtualization (HV), hardware support for MMU virtualization as hwMMU, and software memory management unit virtualization as swMMU.

For some guests and hardware configurations the VMM may choose to virtualize the CPU and MMU using:

  • no hardware support (BT + swMMU),
  • HV and hwMMU (VT-x + EPT),
  • HV only (VT-x + swMMU).

The method of virtualization that the VMware VMM chooses for a particular guest on a certain platform is known as the monitor execution mode or simply monitor mode. On modern x86 CPUs the VMM has an option of choosing from several possible monitor modes. However, not all modes provide similar performance. A lot depends on the available CPU features and the guest OS behavior. VMware ESX identifies the hardware platform and chooses a default monitor mode for a particular guest on that platform. This decision is made by the VMM based on the available CPU features on a platform and the guest behavior on that platform.
Source: Virtual Machine Monitor Execution Modes in VMware vSphere 4.0

Migration of Physical to Virtual Machines

Migration of Virtual Machines between Host systems

Cloud Computing

Container Virtualization

/proc/cpuinfo

/proc/cpuinfo

root@richard:~# cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 Quad CPU           @ 2.40GHz
stepping	: 7
microcode	: 0x66
cpu MHz		: 1596.000
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow
bogomips	: 4799.95
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

How can I tell if I have Intel VT or AMD-V?
With a recent enough Linux kernel, run the command:

egrep '^flags.*(vmx|svm)' /proc/cpuinfo
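
If the command prints one or more flags lines, the CPU supports hardware virtualization (vmx for Intel VT-x, svm for AMD-V). As a quick alternative check, assuming the util-linux lscpu tool is installed, the following should also report the virtualization extension:

lscpu | grep -i virtualization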

Migration (P2V, V2V)

IaaS, PaaS, SaaS

330.2 Xen (weight: 9)

Description: Candidates should be able to install, configure, maintain, migrate and troubleshoot Xen installations. The focus is on Xen version 4.x.

Key Knowledge Areas:

  • Xen architecture, networking and storage
  • Xen configuration
  • Xen utilities
  • Troubleshooting Xen installations
  • Basic knowledge of XAPI
  • Awareness of XenStore
  • Awareness of Xen Boot Parameters
  • Awareness of the xm utility

The following is a partial list of the used files, terms and utilities:

  • Domain0 (Dom0), DomainU (DomU)
  • PV-DomU, HVM-DomU
  • /etc/xen/
  • xl
  • xl.cfg
  • xl.conf
  • xe
  • xentop

Xen architecture, networking and storage

Xen configuration

Xen utilities

Troubleshooting Xen installations

Basic knowledge of XAPI

Awareness of XenStore

Awareness of Xen Boot Parameters

Awareness of the xm utility

Listing Guest System Status
The status of the host and guest systems may be viewed at any time using the list option of the xm tool. For example:

xm list

The above command will display output containing a line for the host system and a line for each guest similar to the following:

Name                                      ID   Mem VCPUs      State   Time(s)
Domain-0                                   0   389     1     r-----   1414.9
XenFed                                         305     1               349.9
myFedoraXen                                    300     1                 0.0
myXenGuest                                 6   300     1     -b----     10.6

The state column uses a single character to specify the current state of the corresponding guest. These are as follows:

    r - running - The domain is currently running and healthy 

    b - blocked - The domain is blocked, and not running or runnable. This can be caused because the domain 
                  is waiting on IO (a traditional wait state) or has gone to sleep because there was nothing 
                  else for it to do. 

    p - paused - The domain has been paused, typically as a result of the administrator running the xm pause
                 command. When in a paused state the domain will still consume allocated resources like memory, 
                 but will not be eligible for scheduling by the Xen hypervisor. 

    s - shutdown - The guest has requested to be shutdown, rebooted or suspended, and the domain is in 
                   the process of being destroyed in response. 

    c - crashed - The domain has crashed. Usually this state can only occur if the domain has been 
                  configured not to restart on crash. 

    d - dying - The domain is in process of dying, but hasn't completely shutdown or crashed. 


Starting a Xen Guest System
A guest operating system can be started using the xm tool combined with the start option followed by the name of the guest operating system to be launched. For example:

su -
xm start myGuestOS


Connecting to a Running Xen Guest System
Once the guest operating system has started, a connection to the guest may be established using either the vncviewer tool or the virt-manager console. To use virt-manager, select Applications→System Tools→Virtual Machine Manager, select the desired system and click Open.

To connect using vncviewer, enter the following command in a terminal window:

vncviewer

When prompted for a server enter localhost:5900. A VNC window will subsequently appear containing the running guest system.
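
The same server string can usually be passed directly on the command line instead of waiting for the prompt (display/port handling varies slightly between VNC viewers):

vncviewer localhost:5900
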
Shutting Down a Guest System
The shutdown option of the xm tool is used to shutdown a guest operating system:

xm shutdown guestName

where guestName is the name of the guest system to be shut down.

Note that the shutdown option allows the guest operating system to perform an orderly shutdown when it receives the shutdown instruction. To instantly stop a guest operating system the destroy option may be used (with all the attendant risks of filesystem damage and data loss):

xm destroy myGuestOS


Pausing and Resuming a Guest System
A guest system can be paused and resumed using the xm tool's pause and resume options. For example, to pause a specific system named myXenGuest:

xm pause myXenGuest

Similarly, to resume the paused system:

xm resume myXenGuest

Note that a paused session will be lost if the host system is rebooted. Also, be aware that a paused system continues to reside in memory. To save a session such that it no longer takes up memory and can be restored to its exact state after a reboot, it is necessary to either suspend and resume or save and restore the guest.

Suspending and Resuming a Guest OS
A running guest operating system can be suspended and resumed using the xm utility. When suspended, the current status of the guest operating system is written to disk and removed from system memory. A suspended system may subsequently be restored at any time (including after a host system reboot):

To suspend a guest OS named myGuestOS:

xm suspend myGuestOS

To restore a suspended guest OS:

xm resume myGuestOS


Saving and Restoring Xen Guest Systems
Saving and restoring of a Xen guest operating system is similar to suspending with the exception that the file used to contain the suspended operating system memory image can be specified by the user:

To save a guest:

xm save myGuestOS path_to_save_file

To restore a saved guest operating system session:

xm restore path_to_save_file


Rebooting a Guest System
To reboot a guest operating system:

xm reboot myGuestOS


Configuring the Memory Assigned to a Xen Guest OS
To configure the memory assigned to a guest OS, use the mem-set option of the xm command. For example, the following command reduces the memory allocated to a guest system named myGuestOS to 256 MB:

xm mem-set myGuestOS 256

Note that acceptable memory settings must fall within the maximum memory available to the domain. This maximum may be increased using the mem-max option of xm.
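
For example, assuming a guest named myGuestOS, the maximum could be raised before increasing the current allocation (the values are in MB and purely illustrative):

xm mem-max myGuestOS 1024
xm mem-set myGuestOS 512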

Migrating a Domain to a Different Host
The migrate option allows a Xen managed domain to be migrated to a different physical server.

In order to use migrate, Xend must already be running on the other host machine and must be running the same version of Xen as the local host system. In addition, the remote host system must have the migration TCP port open and accepting connections from the source host. Finally, there must be sufficient resources for the domain to run (memory, disk space, etc.).

xm migrate domainName host

Optional flags available with this command are:

-l, --live           Use live migration.
-p=portnum, --port=portnum
                     Use specified port for migration.
-r=MBIT, --resource=MBIT
                     Set level of resource usage for migration.
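
For example, a live migration of a domain to another host might look like this (the domain name and destination address are made up):

xm migrate --live myGuestOS 192.168.0.20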

Domain0 (Dom0), DomainU (DomU)

Privileged domain (“dom0”) - the only virtual machine which by default has direct access to hardware. From the dom0 the hypervisor can be managed and unprivileged domains (“domU”) can be launched.
The dom0 domain is typically a modified version of Linux, NetBSD or Solaris. User domains may either be unmodified open-source or proprietary operating systems, such as Microsoft Windows (if the host processor supports x86 virtualization, e.g. Intel VT-x or AMD-V), or modified, para-virtualized operating systems with special drivers that support enhanced Xen features.
HostOS=Dom0
GuestOS=DomU

PV-DomU, HVM-DomU

PV-DomU = paravirtualized (PV) guest. PV guests are made Xen-aware and can therefore be optimized for Xen.
HVM-DomU = hardware-assisted virtualized guest. This requires a CPU with Intel VT or AMD-V.

/etc/xen/

Location where the Xen configuration is stored, as well as the configuration files for the individual virtual machines.
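
As an illustration, a minimal xl.cfg-style configuration for a paravirtualized guest could look like the following (the name, image path and bridge name are assumptions and must match your environment):

# /etc/xen/guest1.cfg - minimal PV guest (illustrative)
name       = "guest1"
memory     = 512
vcpus      = 1
bootloader = "pygrub"
disk       = [ 'file:/var/lib/xen/images/guest1.img,xvda,w' ]
vif        = [ 'bridge=xenbr0' ]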

xl

xl.cfg

xl.conf

xe

xentop

xentop

The xentop utility is included in all versions of XenServer. It displays real-time information about a Xen system and its running domains, using a semi-graphical interface to present the details in a more readable format.
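
For example, the options listed below can be combined to capture ten one-second samples in batch mode, which is handy for logging:

xentop -d 1 -b -i 10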

 OPTIONS

-h, --help
Show help and exit

-V, --version
Show version information and exit

-d, --delay=SECONDS
Seconds between updates (default 3)

-n, --networks
Show network information

-x, --vbds
Show vbd block device data

-r, --repeat-header
Repeat table header before each domain

-v, --vcpus
Show VCPU data

-b, --batch
Redirect output data to stdout (batch mode)

-i, --iterations=ITERATIONS
Maximum number of updates that xentop should produce before ending

INTERACTIVE COMMANDS

All interactive commands are case-insensitive.

D
Set delay between updates

N
Toggle display of network information

Q, Esc
Quit

R
Toggle table header before each domain

S
Cycle sort order

V
Toggle display of VCPU information

Arrows
Scroll domain display 

330.3 KVM (weight: 9)

Description: Candidates should be able to install, configure, maintain, migrate and troubleshoot KVM installations.

Key Knowledge Areas:

  • KVM architecture, networking and storage
  • KVM configuration
  • KVM utilities
  • Troubleshooting KVM installations

The following is a partial list of the used files, terms and utilities:

  • Kernel modules: kvm, kvm-intel and kvm-amd
  • /etc/kvm/
  • /dev/kvm
  • kvm
  • KVM monitor
  • qemu
  • qemu-img

Kernel modules: kvm, kvm-intel and kvm-amd

KVM (for Kernel-based Virtual Machine) is a full virtualization solution for Linux on x86 hardware containing virtualization extensions (Intel VT or AMD-V). It consists of a loadable kernel module, kvm.ko, that provides the core virtualization infrastructure and a processor specific module, kvm-intel.ko or kvm-amd.ko. KVM also requires a modified QEMU although work is underway to get the required changes upstream.

Using KVM, one can run multiple virtual machines running unmodified Linux or Windows images. Each virtual machine has private virtualized hardware: a network card, disk, graphics adapter, etc.

modprobe kvm
modprobe kvm_intel

or

modprobe kvm
modprobe kvm_amd
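
To verify that the modules are actually loaded, a simple check such as the following can be used:

lsmod | grep kvm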

/etc/kvm/

Location for storing vm configuration and scripts.

/dev/kvm

By itself, KVM does not perform any emulation. Instead, it exposes the /dev/kvm interface, which a userspace host can then use to:

  • Set up the guest VM's address space. The host must also supply a firmware image (usually a custom BIOS when emulating PCs) that the guest can use to bootstrap into its main OS.
  • Feed the guest simulated I/O.
  • Map the guest's video display back onto the host.

On Linux, QEMU version 0.10.1 and later is one such userspace host. QEMU uses KVM when available to virtualize guests at near-native speeds, but otherwise falls back to software-only emulation.
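
If the kvm module is loaded, the /dev/kvm character device should exist; a quick check:

ls -l /dev/kvm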

Using KVM directly

While the rest of this documentation focuses on using KVM through libvirt, it is also possible to work with KVM directly. This is not the recommended way, as it is more cumbersome, but it can be very useful at times.

KVM is very similar to QEMU and it is possible to run machines from the command line.

The basic syntax is:

kvm -m 512 -hda disk.img -cdrom ubuntu.iso -boot d -smp 2
  • -m = memory (in MB)
  • -hda = first hard drive
    • You can use a number of image file types including .img, .cow
    • You can also boot a hard drive. Be careful with this option as you do not want to boot the host root partition
      • Syntax -hda /dev/sda
      • This will call your grub menu from your MBR when you boot kvm.
  • -cdrom can be an iso image or a CD/DVD drive.
  • -boot [a|c|d|n] boot from floppy (a), hard disk (c), CD-ROM (d), or network (n)
  • -smp = number of CPU
  • -alt-grab changes the mouse grab key combination from Ctrl-Alt to Ctrl-Alt-Shift (very practical if you often use key combinations like Ctrl-Alt-Del or Windows-E)


KVM monitor

The kvm_stat command is a Python script which retrieves runtime statistics from the kvm kernel module. It can be used to diagnose guest behavior visible to kvm, in particular performance-related issues with guests. Currently, the reported statistics are for the entire system; the behavior of all running guests is reported.

The kvm_stat command requires that the kvm kernel module is loaded and debugfs is mounted. If either of these requirements is not met, the command will output the required steps to enable debugfs or the kvm module. For example:

# kvm_stat
Please mount debugfs ('mount -t debugfs debugfs /sys/kernel/debug')
and ensure the kvm modules are loaded

Mount debugfs if required:

# mount -t debugfs debugfs /sys/kernel/debug

kvm_stat output
The kvm_stat command outputs statistics for all guests and the host. The output is updated until the command is terminated (using Ctrl+C or the q key).

# kvm_stat

kvm statistics

efer_reload                 94       0
exits                  4003074   31272
fpu_reload             1313881   10796
halt_exits               14050     259
halt_wakeup               4496     203
host_state_reload	1638354   24893
hypercalls                   0       0
insn_emulation         1093850    1909
insn_emulation_fail          0       0
invlpg                   75569       0
io_exits               1596984   24509
irq_exits                21013     363
irq_injections           48039    1222
irq_window               24656     870
largepages                   0       0
mmio_exits               11873       0
mmu_cache_miss           42565       8
mmu_flooded              14752       0
mmu_pde_zapped           58730       0
mmu_pte_updated              6       0
mmu_pte_write           138795       0
mmu_recycled                 0       0
mmu_shadow_zapped        40358       0
mmu_unsync                 793       0
nmi_injections               0       0
nmi_window                   0       0
pf_fixed                697731    3150
pf_guest                279349       0
remote_tlb_flush             5       0
request_irq                  0       0
signal_exits                 1       0
tlb_flush               200190       0

Explanation of variables:

efer_reload

    The number of Extended Feature Enable Register (EFER) reloads.
exits

    The count of all VMEXIT calls.
fpu_reload

    The number of times a VMENTRY reloaded the FPU state. The fpu_reload is incremented when a guest is 
    using the Floating Point Unit (FPU).
halt_exits

    Number of guest exits due to halt calls. This type of exit is usually seen when a guest is idle.
halt_wakeup

    Number of wakeups from a halt.
host_state_reload

    Count of full reloads of the host state (currently tallies MSR setup and guest MSR reads).
hypercalls

    Number of guest hypervisor service calls.
insn_emulation

    Number of guest instructions emulated by the host.
insn_emulation_fail

    Number of failed insn_emulation attempts.
io_exits

    Number of guest exits from I/O port accesses.
irq_exits

    Number of guest exits due to external interrupts.
irq_injections

    Number of interrupts sent to guests.
irq_window

    Number of guest exits from an outstanding interrupt window.
largepages

    Number of large pages currently in use.
mmio_exits

    Number of guest exits due to memory mapped I/O (MMIO) accesses.
mmu_cache_miss

    Number of KVM MMU shadow pages created.
mmu_flooded

    Detection count of excessive write operations to an MMU page. This counts detected 
    write operations, not individual write operations.
mmu_pde_zapped

    Number of page directory entry (PDE) destruction operations.
mmu_pte_updated

    Number of page table entry (PTE) destruction operations.
mmu_pte_write

    Number of guest page table entry (PTE) write operations.
mmu_recycled

    Number of shadow pages that can be reclaimed.
mmu_shadow_zapped

    Number of invalidated shadow pages.
mmu_unsync

    Number of non-synchronized pages which are not yet unlinked.
nmi_injections

    Number of Non-maskable Interrupt (NMI) injections to the guest.
nmi_window

    Number of guest exits from (outstanding) Non-maskable Interrupt (NMI) windows.
pf_fixed

    Number of fixed (non-paging) page table entry (PTE) maps.
pf_guest

    Number of page faults injected into guests.
remote_tlb_flush

    Number of remote (sibling CPU) Translation Lookaside Buffer (TLB) flush requests.
request_irq

    Number of guest interrupt window request exits.
signal_exits

    Number of guest exits due to pending signals from the host.
tlb_flush

    Number of tlb_flush operations performed by the hypervisor.

Source

kvm networking

There are two parts to networking within QEMU:

  • the virtual network device that is provided to the guest (e.g. a PCI network card).
  • the network backend that interacts with the emulated NIC (e.g. puts packets onto the host's network).

There are a range of options for each part.

Creating a network backend
There are a number of network backends to choose from depending on your environment. Create a network backend like this:

-netdev TYPE,id=NAME,...

The id option gives the name by which the virtual network device and the network backend are associated with each other. If you want multiple virtual network devices inside the guest they each need their own network backend. The name is used to distinguish backends from each other and must be used even when only one backend is specified.
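
As a sketch, two emulated NICs, each paired with their own user-mode backend, could be defined like this (the disk image name is just a placeholder):

qemu -hda disk.img \
     -netdev user,id=net0 -device e1000,netdev=net0 \
     -netdev user,id=net1 -device e1000,netdev=net1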

Network backend types
In most cases, if you don't have any specific networking requirements other than being able to access a web page from your guest, user networking (slirp) is a good choice. However, if you are looking to run any kind of network service or have your guest participate in a network in any meaningful way, tap is usually the best choice.

User Networking (SLIRP)
This is the default networking backend and generally is the easiest to use. It does not require root / Administrator privileges. It has the following limitations:

  • there is a lot of overhead so the performance is poor
  • ICMP traffic does not work (so you cannot use ping within a guest)
  • the guest is not directly accessible from the host or the external network

User Networking is implemented using “slirp”, which provides a full TCP/IP stack within QEMU and uses that stack to implement a virtual NAT'd network.

You can configure User Networking using the -netdev user command line option.

Adding the following to the qemu command line will change the network configuration to use 192.168.76.0/24 instead of the default (10.0.2.0/24) and will start guest DHCP allocation from 9 (instead of 15):

-netdev user,id=mynet0,net=192.168.76.0/24,dhcpstart=192.168.76.9

You can isolate the guest from the host (and broader network) using the restrict option. For example -netdev user,id=mynet0,restrict=y or -netdev type=user,id=mynet0,restrict=yes will restrict networking to just the guest and any virtual devices. This can be used to prevent software running inside the guest from phoning home while still providing a network inside the guest. You can selectively override this using hostfwd and guestfwd options.
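
A common use of hostfwd is to reach a service inside the guest through a port on the host; for example, forwarding host port 2222 to the guest's SSH port (the port numbers are arbitrary):

-netdev user,id=mynet0,hostfwd=tcp::2222-:22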

TODO:

-netdev user,id=mynet0,dns=xxx

-netdev user,id=mynet0,tftp=xxx,bootfile=yyy

-netdev user,id=mynet0,smb=xxx,smbserver=yyy

-netdev user,id=mynet0,hostfwd=hostip:hostport-guestip:guestport

-netdev user,id=mynet0,guestfwd=

-netdev user,id=mynet0,host=xxx,hostname=yyy


Tap
The tap networking backend makes use of a tap networking device in the host. It offers very good performance and can be configured to create virtually any type of network topology. Unfortunately, it requires configuration of that network topology in the host which tends to be different depending on the operating system you are using. Generally speaking, it also requires that you have root privileges.

-netdev tap,id=mynet0
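
A more complete sketch, assuming a pre-configured tap0 interface on the host and a guest driver for virtio, might pair the tap backend with a virtio NIC:

-netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no -device virtio-net-pci,netdev=mynet0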


VDE
The VDE networking backend uses the Virtual Distributed Ethernet infrastructure to network guests. Unless you specifically know that you want to use VDE, it is probably not the right backend to use.

Socket
The socket networking backend, together with QEMU VLANs, allows you to create a network of guests that can see each other. It's primarily useful in extending the network created by Documentation/Networking/Slirp to multiple virtual machines. In general, if you want to have multiple guests communicate, tap is a better choice unless you do not have root access to the host environment.

-netdev socket,id=mynet0 


Creating a virtual network device
The virtual network device that you choose depends on your needs and the guest environment (i.e. the hardware that you are emulating). For example, if you are emulating a particular embedded board, then you should use the virtual network device that matches that embedded board's configuration.

On machines that have a PCI bus, there is a wider range of options. The e1000 is the default network adapter in qemu. The rtl8139 is the default network adapter in qemu-kvm. In both projects, the virtio-net (para-virtualised) network adapter has the best performance, but requires special guest driver support.

Use the -device option to add a particular virtual network device to your virtual machine:

-device TYPE,netdev=NAME

The netdev is the name of a previously defined -netdev. The virtual network device will be associated with this network backend.

Note that there are other device options to select alternative devices, or to change some aspect of the device. For example, you might want something like: -device DEVNAME,netdev=NET-ID,mac=MACADDR,DEV-OPTS, where DEVNAME is the device (e.g. i82559c for an Intel i82559C Ethernet device), NET-ID is the network identifier to attach the device to (see the discussion of -netdev above), MACADDR is the MAC address for the device, and DEV-OPTS are any additional device options that you may wish to pass (e.g. bus=PCI-BUS,addr=DEVFN to control the PCI device address), if supported by the device.
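
For instance, a user-mode backend combined with a virtio NIC and an explicit MAC address could be written as (MAC address chosen arbitrarily):

-netdev user,id=mynet0 -device virtio-net-pci,netdev=mynet0,mac=52:54:00:12:34:56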

Use -device ? to get a list of the devices (including network devices) you can add using the -device option for a particular guest. Remember that ? is a shell metacharacter, so you may need to use -device \? on the command-line.

Monitoring Networking
You can monitor the network configuration using the info network and info usernet monitor commands.

You can capture network traffic from within qemu using the -net dump command line option. See Stefan Hajnoczi's blog post on this feature.

The legacy -net option
QEMU previously used the -net nic option instead of -device DEVNAME and -net TYPE instead of -netdev TYPE. This is considered obsolete since QEMU 0.12, although it continues to work.

The legacy syntax to create virtual network devices is:

-net nic,model=MODEL

You can use -net nic,model=? to get a list of valid network devices that you can pass to the -net nic option. Note that these model names are different from the -device ? names and are therefore only useful if you are using the -net nic,model=MODEL syntax. [If you'd like to know all of the virtual network devices that are currently provided in QEMU, a search for “NetClientInfo” in the source code may be useful.]

QEMU “VLANs”
The obsolete -net syntax automatically created an emulated hub (called a QEMU “VLAN”, for virtual LAN) that forwards traffic from any device connected to it to every other device on the “VLAN”. It is not an 802.1q VLAN, just an isolated network segment. When creating multiple network devices using the -net syntax, you generally want to specify different vlan ids. The exception is when dealing with the socket backend. For example:

-net user,vlan=0 -net nic,vlan=0 -net user,vlan=1 -net nic,vlan=1  

kvm monitor

When QEMU is running, it provides a monitor console for interacting with QEMU. Through various commands, the monitor allows you to inspect the running guest OS, change removable media and USB devices, take screenshots and audio grabs, and control various aspects of the virtual machine.

The monitor is accessed from within QEMU by holding down the Control and Alt keys and pressing Shift-2. Once in the monitor, Shift-1 switches back to the guest OS. Typing help or ? in the monitor brings up a list of all commands. Alternatively, the monitor can be redirected using the -monitor <dev> command line option. Using -monitor stdio will send the monitor to the standard output; this is most useful when using qemu on the command line.

Help and information

help

  • help [command] or ? [command]

With no arguments, the help command lists all commands available. For more detail about another command, type help command, e.g.

(qemu) help info

On a small screen / VM window, the list of commands will scroll off the screen too quickly to let you read them. To scroll back and forth so that you can read the whole list, hold down the control key and press Page Up and Page Down.

info

  • info option

Show information on some aspect of the guest OS. Available options are:

  • block – block devices such as hard drives, floppy drives, cdrom
  • blockstats – read and write statistics on block devices
  • capture – active capturing (audio grabs)
  • history – console command history
  • irq – statistics on interrupts (if compiled into QEMU)
  • jit – statistics on QEMU's Just In Time compiler
  • kqemu – whether the kqemu kernel module is being utilised
  • mem – list the active virtual memory mappings
  • mice – mouse on the guest that is receiving events
  • network – network devices and VLANs
  • pci – PCI devices being emulated
  • pcmcia – PCMCIA card devices
  • pic – state of i8259 (PIC)
  • profile – info on the internal profiler, if compiled into QEMU
  • registers – the CPU registers
  • snapshots – list the VM snapshots
  • tlb – list the TLB (Translation Lookaside Buffer), i.e. mappings between physical memory and virtual memory
  • usb – USB devices on the virtual USB hub
  • usbhost – USB devices on the host OS
  • uuid – Unique id of the VM
  • version – QEMU version number
  • vnc – VNC information


Devices

change

  • change device setting

The change command allows you to change removable media (like CD-ROMs), change the display options for a VNC, and change the password used on a VNC.

When you need to change the disc in a CD or DVD drive, or switch between different .iso files, find the name of the CD or DVD drive using info block and use change to make the change.

(qemu) info block
ide0-hd0: type=hd removable=0 file=/path/to/winxp.img
ide0-hd1: type=hd removable=0 file=/path/to/pagefile.raw
ide1-hd1: type=hd removable=0 file=/path/to/testing_data.img
ide1-cd0: type=cdrom removable=1 locked=0 file=/dev/sr0 ro=1 drv=host_device
floppy0: type=floppy removable=1 locked=0 [not inserted]
sd0: type=floppy removable=1 locked=0 [not inserted]
(qemu) change ide1-cd0 /path/to/my.iso
(qemu) change ide1-cd0 /dev/sr0 host_device

eject

  • eject [-f] device

Use the eject command to release the device or file connected to the removable media device specified. The -f parameter can be used to force the eject if the device initially refuses.

usb_add
Add a host file as a USB flash device (you need to create the host file in advance: dd if=/dev/zero of=/tmp/disk.usb bs=1024k count=32):
usb_add disk:/tmp/disk.usb

usb_del
Use info usb to get the USB device list:

(qemu) info usb
Device 0.1, Speed 480 Mb/s, Product XXXXXX
Device 0.2, Speed 12 Mb/s, Product XXXXX

(qemu) usb_del 0.2

This deletes the device.

sendkey keys
You can emulate keyboard events through the sendkey command. The syntax is: sendkey keys. To get a list of available keys, type sendkey followed by [Tab]. Example: sendkey ctrl-alt-f1

Screen and audio grabs

screendump

  • screendump filename

Capture a screendump and save into a PPM image file.

Virtual machine

commit

  • commit device or commit all

When running QEMU with the -snapshot option, commit changes to the device, or all devices.

quit

  • quit or q

Quit QEMU immediately.

savevm

  • savevm name

Save the virtual machine under the tag 'name'. Not all disk image formats support this; raw does not, but qcow2 does.

loadvm

  • loadvm name

Load the virtual machine tagged 'name'. This can also be done on the command line: -loadvm name

With the info snapshots command, you can request a list of available machines.
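
A short monitor session illustrating this (the snapshot tag is arbitrary, and the disk image must be in a format that supports snapshots, such as qcow2):

(qemu) savevm clean-install
(qemu) info snapshots
(qemu) loadvm clean-install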

stop
Suspend execution of the VM.

cont
Reverse a previous stop command; resume execution of the VM.

system_reset
This has an effect similar to the physical reset button on a PC. Warning: Filesystems may be left in an unclean state.

system_powerdown
This has an effect similar to the physical power button on a modern PC. The VM will get an ACPI shutdown request and usually shutdown cleanly.

log

  • log option

logfile

  • logfile filename

Write logs to the specified file instead of the default, /tmp/qemu.log.

gdbserver
Starts a remote debugger session for the GNU debugger (gdb). To connect to it from the host machine, run the following commands:

shell$ gdb qemuKernelFile
(gdb) target remote localhost:1234

x
x /format address
Displays memory at the specified virtual address using the specified format.
Refer to the xp section for details on format and address.

xp
xp /format address
Displays memory at the specified physical address using the specified format.
format: Used to specify the output format for the displayed memory. The format is broken down as /[count][data_format][size]

  • count: number of items to display (base 10)
  • data_format: 'x' for hex, 'd' for decimal, 'u' for unsigned decimal, 'o' for octal, 'c' for char and 'i' for (disassembled) processor instructions
  • size: 'b' for 8 bits, 'h' for 16 bits, 'w' for 32 bits or 'g' for 64 bits. On x86 'h' and 'w' can select instruction disassembly code formats.


address:

  • Direct address, for example: 0x20000
  • Register, for example: $eip

Example - Display 3 instructions on an x86 processor starting at the current instruction:

(qemu) xp /3i $eip

Example - Display the last 20 words on the stack for an x86 processor:

(qemu) xp /20wx $esp

print
Print (or p) evaluates and prints the expression given to it. The result will be printed in hexadecimal, but decimal can also be used in the expression. If the result overflows, it will wrap around. To use the value of a CPU register, use $<register name>. The name of the register should be lower case. You can see the registers with the info registers command.
Example of qemu emulating an i386:

(qemu) print 16
0x10
(qemu) print 16 + 0x10
0x20
(qemu) print $eax
0xc02e4000
(qemu) print $eax + 2
0xc02e4002
(qemu) print ($eax + 2) * 2
0x805c8004
(qemu) print 0x80000000 * 2
0

kvm storage

Devices and media:

  • Floppy, CD-ROM, USB stick, SD card, hard disk

Host storage:

  • Flat files (img, iso)
    • Also over NFS
  • CD-ROM host device (/dev/cdrom)
  • Block devices (/dev/sda3, LVM volumes, iSCSI LUNs)
  • Distributed storage (Sheepdog, Ceph)


Supported image formats:

  • QCOW2, QED – QEMU
  • VMDK – VMware
  • VHD – Microsoft
  • VDI – VirtualBox

Features that various image formats provide:

  • Sparse images
  • Backing files (delta images)
  • Encryption
  • Compression
  • Snapshots


qemu -drive
   if=ide|virtio|scsi,
   file=path/to/img,
   cache=writethrough|writeback|none|unsafe
  • Storage interface is set with if=
  • Path to image file or device is set with file=
  • Caching mode is set with cache=
qemu -drive file=install-disc-1.iso,media=cdrom ...


QEMU supports a wide variety of storage formats and back-ends. The easiest to use are the raw and qcow2 formats, but for the best performance it is best to use a raw partition. You can create either a logical volume or a partition and assign it to the guest:

 qemu -drive file=/dev/mapper/ImagesVolumeGroup-Guest1,cache=none,if=virtio

QEMU also supports a wide variety of caching modes. If you're using raw volumes or partitions, it is best to avoid the cache completely, which reduces data copies and bus traffic:

 qemu -drive file=/dev/mapper/ImagesVolumeGroup-Guest1,cache=none,if=virtio

As with networking, QEMU supports several storage interfaces. The default, IDE, is highly supported by guests but may be slow, especially with disk arrays. If your guest supports it, use the virtio interface:

 qemu -drive file=/dev/mapper/ImagesVolumeGroup-Guest1,cache=none,if=virtio

Don't use the btrfs filesystem on the host for the image files. It will result in low I/O performance, and the KVM guest may even freeze under heavy I/O load in the guest.

Virtual FAT filesystem (VVFAT)
QEMU can emulate a virtual drive with a FAT filesystem. It is an easy way to share files between the guest and host.
It works by prepending fat: to a directory name. By default the drive is read-only; if you need to make it writable, append rw: to that prefix (i.e. fat:rw:).

Example:

qemu -drive file=fat:rw:some/directory ...

WARNING: keep in mind that QEMU makes the virtual FAT table once, when adding the device, and then doesn't update it in response to changes to the specified directory made by the host system. If you modify the directory while the VM is running, QEMU might get confused.

Cache policies

QEMU can cache access to the disk image files, and it provides several methods to do so. This can be specified using the cache modifier.

Policy        Description
unsafe        Like writeback, but without performing an fsync.
writethrough  Data is written to disk and cache simultaneously. (default)
writeback     Data is written to disk when discarded from the cache.
none          Disable caching.

Example:

qemu -drive file=disk.img,cache=writeback ...


Creating an image
To set up your own guest OS image, you first need to create a blank disc image. QEMU has the qemu-img command for creating and manipulating disc images, and supports a variety of formats. If you don't tell it what format to use, it will use raw files. The “native” format for QEMU is qcow2, and this format offers some flexibility. Here we'll create a 3GB qcow2 image to install Windows XP on:

qemu-img create -f qcow2 winxp.img 3G
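
qemu-img can also inspect and convert images; for example (file names are illustrative):

qemu-img info winxp.img
qemu-img convert -f raw -O qcow2 disk.raw disk.qcow2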

The easiest way to install a guest OS is to create an ISO image of a boot CD/DVD and tell QEMU to boot off it. Many free operating systems can be downloaded from the Internet as bootable ISO images, and you can use them directly without having to burn them to disc.
Here we'll boot off an ISO image of a properly licensed Windows XP boot disc. We'll also give it 256MB of RAM, but we won't use the kqemu kernel module just yet because it causes problems during Windows XP installation.

qemu -m 256 -hda winxp.img -cdrom winxpsp2.iso -boot d


Copy on write
The “cow” part of qcow2 is an acronym for copy on write, a neat little trick that allows you to set up an image once and use it many times without changing it. This is ideal for developing and testing software, which generally requires a known stable environment to start off with. You can create your known stable environment in one image, and then create several disposable copy-on-write images to work in.

To start a new disposable environment based on a known good image, invoke the qemu-img command with the option -b and tell it what image to base its copy on. When you run QEMU using the disposable environment, all writes to the virtual disc will go to this disposable image, not the base copy.

qemu-img create -f qcow2 -b winxp.img test01.img 
qemu -m 256 -hda test01.img -kernel-kqemu &

The -b option is not supported by all versions of qemu-img (it is missing in version 0.12.5, for example). In that case use the backing_file option instead, as shown here:

qemu-img create -f qcow2 -o backing_file=winxp.img test01.img 

Source

qemu

QEMU is a generic and open source machine emulator and virtualizer.
Emulation:

  • For cross-compilation, development environments
  • Android Emulator, shipping in an Android SDK near you

Virtualization:

  • KVM and Xen use QEMU device emulation

330.4 Other Virtualization Solutions (weight: 3)

Description: Candidates should have some basic knowledge and experience with alternatives to Xen and KVM.

Key Knowledge Areas:

  • Basic knowledge of OpenVZ and LXC
  • Awareness of other virtualization technologies
  • Basic knowledge of virtualization provisioning tools

The following is a partial list of the used files, terms and utilities:

  • OpenVZ
  • VirtualBox
  • LXC
  • docker
  • packer
  • vagrant

OpenVZ

OpenVZ is not true virtualization but containerization, like FreeBSD jails. Technologies such as VMware and Xen are more flexible in that they virtualize the entire machine and can run multiple operating systems, at the expense of the greater overhead required to handle hardware virtualization. OpenVZ uses a single patched Linux kernel and can therefore run only Linux. Because it doesn't have the overhead of a true hypervisor, however, it is very fast and efficient. The disadvantage of this approach is the single kernel: all guests must function with the same kernel version that the host uses.

The advantage, however, is that memory allocation is soft: memory not used in one virtual environment can be used by others or for disk caching. OpenVZ uses a common file system, so each virtual environment is just a directory of files that is isolated using chroot; newer versions of OpenVZ also allow each container to have its own file system. A virtual machine can thus be cloned by simply copying the files in one directory to another, creating a config file for the new virtual machine and starting it.
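
In practice containers are managed with the vzctl and vzlist tools. A minimal lifecycle sketch, assuming an OS template has already been downloaded to the template cache (the container ID, template name, address and hostname below are made up):

vzctl create 101 --ostemplate centos-6-x86_64
vzctl set 101 --ipadd 192.168.0.101 --hostname ct101.example.com --save
vzctl start 101
vzctl enter 101
vzlist -a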

Kernel

The OpenVZ kernel is a Linux kernel, modified to add support for OpenVZ containers. The modified kernel provides virtualization, isolation, resource management, and checkpointing.

Virtualization and isolation
Each container is a separate entity, and behaves largely as a physical server would. Each has its own:

  • Files : System libraries, applications, virtualized /proc and /sys, virtualized locks, etc.
  • Users and groups : Each container has its own root user, as well as other users and groups.
  • Process tree : A container only sees its own processes (starting from init). PIDs are virtualized, so that the init PID is 1 as it should be.
  • Network : Virtual network device, which allows a container to have its own IP addresses, as well as a set of netfilter (iptables), and routing rules.

  • Devices : If needed, any container can be granted access to real devices like network interfaces, serial ports, disk partitions, etc.
  • IPC objects : Shared memory, semaphores, messages.

Resource management

OpenVZ resource management consists of three components: two-level disk quota, fair CPU scheduler, and user beancounters. These resources can be changed during container run time, eliminating the need to reboot.

Two-level disk quota

Each container can have its own disk quotas, measured in terms of disk blocks and inodes (roughly number of files). Within the container, it is possible to use standard tools to set UNIX per-user and per-group disk quotas.
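
For example, the first-level quota for a container can be set from the host with vzctl (soft:hard limits; the values are illustrative):

vzctl set 101 --diskspace 10G:11G --save
vzctl set 101 --diskinodes 200000:220000 --save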

CPU scheduler
The CPU scheduler in OpenVZ is a two-level implementation of fair-share scheduling strategy.
On the first level, the scheduler decides which container it is to give the CPU time slice to, based on per-container cpuunits values. On the second level the standard Linux scheduler decides which process to run in that container, using standard Linux process priorities.
It is possible to set different values for the CPUs in each container. Real CPU time will be distributed proportionally to these values.
Strict limits, such as 10% of total CPU time, are also possible.
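
As an illustration, both the proportional share and a hard limit can be set per container (values are arbitrary):

vzctl set 101 --cpuunits 1000 --cpulimit 10 --save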

I/O scheduler
Similar to the CPU scheduler described above, the I/O scheduler in OpenVZ is also two-level, utilizing Jens Axboe's CFQ I/O scheduler on its second level.
Each container is assigned an I/O priority, and the scheduler distributes the available I/O bandwidth according to the priorities assigned. Thus no single container can saturate an I/O channel.

User Beancounters
User Beancounters is a set of per-container counters, limits, and guarantees. There is a set of about 20 parameters which is meant to control all the aspects of container operation. This is meant to prevent a single container from monopolizing system resources.
These resources primarily consist of memory and various in-kernel objects such as IPC shared memory segments, and network buffers. Each resource can be seen from /proc/user_beancounters and has five values associated with it: current usage, maximum usage (for the lifetime of a container), barrier, limit, and fail counter. The meaning of barrier and limit is parameter-dependent; in short, those can be thought of as a soft limit and a hard limit. If any resource hits the limit, the fail counter for it is increased. This allows the owner to detect problems by monitoring /proc/user_beancounters in the container.

Value         Meaning
lockedpages   The memory not allowed to be swapped out (locked with the mlock() system call), in pages.
shmpages      The total size of shared memory (including IPC, shared anonymous mappings and tmpfs objects) allocated by the processes of a particular VPS, in pages.
privvmpages   The size of private (or potentially private) memory allocated by an application. The memory that is always shared among different applications is not included in this resource parameter.
numfile       The number of files opened by all VPS processes.
numflock      The number of file locks created by all VPS processes.
numpty        The number of pseudo-terminals, such as an ssh session, the screen or xterm applications, etc.
numsiginfo    The number of siginfo structures (essentially, this parameter limits the size of the signal delivery queue).
dcachesize    The total size of dentry and inode structures locked in the memory.
physpages     The total size of RAM used by the VPS processes. This is an accounting-only parameter currently. It shows the usage of RAM by the VPS. For the memory pages used by several different VPSs (mappings of shared libraries, for example), only the corresponding fraction of a page is charged to each VPS. The sum of the physpages usage for all VPSs corresponds to the total number of pages used in the system by all the accounted users.
numiptent     The number of IP packet filtering entries.
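
The current counters can be inspected from within a container (or for all containers from the host), and individual limits can be adjusted with vzctl, for example (the barrier:limit values are illustrative):

cat /proc/user_beancounters
vzctl set 101 --privvmpages 262144:287744 --save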


Checkpointing and live migration
A live migration and checkpointing feature was released for OpenVZ in the middle of April 2006. This makes it possible to move a container from one physical server to another without shutting down the container. The process is known as checkpointing: a container is frozen and its whole state is saved to a file on disk. This file can then be transferred to another machine and a container can be unfrozen (restored) there; the delay is roughly a few seconds. Because state is usually preserved completely, this pause may appear to be an ordinary computational delay.
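
On OpenVZ hosts this is typically driven by the vzmigrate tool; a rough example (the destination host name is made up):

vzmigrate --online dest.example.com 101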

OpenVZ distinct features

Scalability
As OpenVZ employs a single kernel model, it is as scalable as the Linux kernel; that is, it supports up to 4096 CPUs and up to 64 GiB of RAM on 32-bit with PAE. Please note that 64-bit kernels are strongly recommended for production. A single container can scale up to the whole physical system, i.e. use all the CPUs and all the RAM.

Performance
The virtualization overhead observed in OpenVZ is minimal, so more computing power is available for each container.

Density
By decreasing the overhead required for each container, it is possible to serve more containers from a given physical server, so long as the computational demands do not exceed the physical availability.

Mass-management
An administrator (i.e. root) of an OpenVZ physical server (also known as a hardware node or host system) can see all the running processes and files of all the containers on the system, and this has convenience implications. Some fixes (such as a kernel update) will affect all containers automatically, while other changes can simply be “pushed” to all the containers by a simple shell script.
Compare this with managing a VMware- or Xen-based virtualized environment: in order to apply a security update to 10 virtual servers, one either needs a more elaborate pull system (on all the virtual servers) for such updates, or an administrator is required to log in to each virtual server and apply the update. This makes OpenVZ more convenient in those cases where a pull system has not been or can not be implemented.

Limitations

OpenVZ restricts access to /dev devices to a small subset. A container may be impacted by not having access to devices that are used not to provide access to physical hardware, but to add or configure kernel-level features.

/dev/loopN is often restricted in deployments, as it relies on a limited pool of kernel threads. Its absence restricts the ability to mount disk images. Some workarounds exist using FUSE.

OpenVZ is limited to providing only some VPN technologies based on PPP (such as PPTP/L2TP) and TUN/TAP. IPsec is not supported inside containers, including L2TP secured with IPsec.

Full virtualization solutions are free of these limitations.
Source

VirtualBox

Oracle VM VirtualBox (formerly Sun VirtualBox, Sun xVM VirtualBox and innotek VirtualBox) is an x86 virtualization software package, created by software company Innotek GmbH, purchased in 2008 by Sun Microsystems, and now developed by Oracle Corporation as part of its family of virtualization products. Oracle VM VirtualBox is installed on an existing host operating system as an application; this host application allows additional guest operating systems, each known as a Guest OS, to be loaded and run, each with its own virtual environment.

Supported host operating systems include Linux, Mac OS X, Windows XP, Windows Vista, Windows 7, Windows 8, Solaris, and OpenSolaris; there is also a port to FreeBSD. Supported guest operating systems include versions and derivations of Windows, Linux, BSD, OS/2, Solaris and others. Since release 3.2.0, VirtualBox also allows limited virtualization of Mac OS X guests on Apple hardware, though OSX86 can also be installed using VirtualBox.

Since version 4.1, Windows guests on supported hardware can take advantage of the recently implemented WDDM driver included in the guest additions; this allows Windows Aero to be enabled along with Direct3D support.

Emulated environment

Multiple guest OSs can be loaded under the host operating system (host OS). Each guest can be started, paused and stopped independently within its own virtual machine (VM). The user can independently configure each VM and run it under a choice of software-based virtualization or hardware-assisted virtualization if the underlying host hardware supports this. The host OS and guest OSs and applications can communicate with each other through a number of mechanisms, including a common clipboard and a virtualized network facility. Guest VMs can also directly communicate with each other if configured to do so.

Software-based virtualization

In the absence of hardware-assisted virtualization, VirtualBox adopts a standard software-based virtualization approach. This mode supports 32-bit guest OSs which run in rings 0 and 3 of the Intel ring architecture.

  • The guest OS code, running in ring 0, is reconfigured to execute in ring 1 on the host hardware. Because this code contains many privileged instructions which cannot run natively in ring 1, VirtualBox employs a Code Scanning and Analysis Manager (CSAM) to scan the ring 0 code recursively before its first execution to identify problematic instructions and then calls the Patch Manager (PATM) to perform in-situ patching. This replaces the instruction with a jump to a VM-safe equivalent compiled code fragment in hypervisor memory.
  • The guest user-mode code, running in ring 3, generally runs directly on the host hardware at ring 3.

In both cases, VirtualBox uses CSAM and PATM to inspect and patch the offending instructions whenever a fault occurs. VirtualBox also contains a dynamic recompiler, based on QEMU to recompile any real mode or protected mode code entirely (e.g. BIOS code, a DOS guest, or any operating system startup).
Using these techniques, VirtualBox can achieve a performance that is comparable to that of VMware.

Hardware-assisted virtualization

VirtualBox supports both Intel's VT-x and AMD's AMD-V hardware virtualization. Making use of these facilities, VirtualBox can run each guest VM in its own separate address space; the guest OS ring 0 code runs on the host at ring 0 in VMX non-root mode rather than in ring 1.

Some guests, including 64-bit guests, SMP guests and certain proprietary OSs, are only supported by VirtualBox on hosts with hardware-assisted virtualization.
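
A quick way to confirm that the host CPU exposes VT-x/AMD-V, and to enable it for a particular VM, might look like this (the VM name "myvm" is a placeholder):

    # a non-zero count means the CPU advertises VT-x (vmx) or AMD-V (svm)
    egrep -c '(vmx|svm)' /proc/cpuinfo
    # enable hardware virtualization (and nested paging) for an existing VM
    VBoxManage modifyvm "myvm" --hwvirtex on --nestedpaging on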

Device virtualization

Hard disks are emulated in one of three disk image formats: a VirtualBox-specific container format, called “Virtual Disk Image” (VDI), whose images are stored as files (with a .vdi suffix) on the host operating system; VMware Virtual Machine Disk Format (VMDK); and Microsoft Virtual PC VHD format. A VirtualBox virtual machine can, therefore, use disks that were created in VMware or Microsoft Virtual PC, as well as its own native format. VirtualBox can also connect to iSCSI targets and to raw partitions on the host, using either as virtual hard disks. VirtualBox emulates IDE (PIIX4 and ICH6 controllers), SCSI, SATA (ICH8M controller) and SAS controllers to which hard drives can be attached.
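
As a sketch of how such a disk might be created and attached from the command line (the VM name, file name and size are placeholders):

    # create a roughly 10 GB dynamically allocated VDI image
    VBoxManage createhd --filename myvm.vdi --size 10240 --format VDI
    # add a SATA controller to the VM and attach the new disk to it
    VBoxManage storagectl "myvm" --name "SATA" --add sata
    VBoxManage storageattach "myvm" --storagectl "SATA" --port 0 --device 0 --type hdd --medium myvm.vdi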

Both ISO images and host-connected physical devices can be mounted as CD/DVD drives. For example, the DVD image of a Linux distribution can be downloaded and used directly by VirtualBox.

By default VirtualBox provides graphics support through a custom virtual graphics card that is VESA compatible. The Guest Additions for Windows, Linux, Solaris, OpenSolaris, or OS/2 guests include a special video driver that increases video performance and includes additional features, such as automatically adjusting the guest resolution when resizing the VM window, or desktop composition via virtualized WDDM drivers .

For an Ethernet network adapter, VirtualBox virtualizes these Network Interface Cards: AMD PCnet PCI II (Am79C970A), AMD PCnet-Fast III (Am79C973), Intel Pro/1000 MT Desktop (82540EM), Intel Pro/1000 MT Server (82545EM), and Intel Pro/1000 T Server (82543GC).[25] The emulated network cards allow most guest OSs to run without the need to find and install drivers for networking hardware as they are shipped as part of the guest OS. A special paravirtualized network adapter is also available, which improves network performance by eliminating the need to match a specific hardware interface, but requires special driver support in the guest. (Many distributions of Linux are shipped with this driver included.) By default, VirtualBox uses NAT through which Internet software for end users such as Firefox or ssh can operate. Bridged networking via a host network adapter or virtual networks between guests can also be configured. Up to 36 network adapters can be attached simultaneously, but only four are configurable through the graphical interface.
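
Switching a VM's first adapter between NAT and bridged mode could, for example, be done like this (the VM name and host interface are placeholders):

    # default NAT mode
    VBoxManage modifyvm "myvm" --nic1 nat
    # bridged mode via the host interface eth0
    VBoxManage modifyvm "myvm" --nic1 bridged --bridgeadapter1 eth0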

For a sound card, VirtualBox virtualizes Intel HD Audio, Intel ICH AC'97 device and SoundBlaster 16 cards.

A USB 1.1 controller is emulated so that any USB devices attached to the host can be seen in the guest. The closed-source extension pack adds a USB 2.0 controller and, if VirtualBox acts as an RDP server, it can also use USB devices on the remote RDP client as if they were connected to the host, although only if the client supports this VirtualBox-specific extension (Oracle provides clients for Solaris, Linux and Sun Ray thin clients that can do this, and has promised support for other platforms in future versions).

Virtual Disk Image

VirtualBox uses its own format for storage containers – Virtual Disk Image (VDI). VirtualBox also supports other well-known storage formats[30] such as VMDK (used in particular by VMware) as well as the VHD format used by Microsoft.

VirtualBox's command-line utility VBoxManage includes options for cloning disks and importing and exporting file systems, however, it does not include a tool for increasing the size of the filesystem within a VDI container: this can be achieved in many ways with third-party tools (e.g. CloneVDI provides a GUI for cloning and increasing the size [31]) or in the guest OS itself.
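
For example, cloning a VDI and then enlarging the container (but not the filesystem inside it) might look like this; the file names and new size are placeholders:

    # clone an existing image
    VBoxManage clonehd myvm.vdi myvm-copy.vdi
    # grow the container to 20480 MB; the guest filesystem still has to be resized separately
    VBoxManage modifyhd myvm.vdi --resize 20480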

VirtualBox has supported Open Virtualization Format (OVF) since version 2.2.0 (April 2009).
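
Exporting and importing an appliance in OVF/OVA format is also handled by VBoxManage; a minimal sketch (VM and file names are placeholders):

    VBoxManage export myvm -o myvm.ova
    VBoxManage import myvm.ova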
Source

LXC

TODO

docker

TODO

packer

TODO

vagrant

TODO

330.5 Libvirt Virtual Machine Management (weight: 9)

Description: Candidates should have basic knowledge and experience with the libvirt library and commonly available tools.

Key Knowledge Areas:

  • libvirt architecture, networking and storage
  • Basic technical knowledge of libvirt and virsh
  • Awareness of oVirt

The following is a partial list of the used files, terms and utilities:

  • libvirtd
  • /etc/libvirt/
  • virsh
  • oVirt

libvirtd

TODO

/etc/libvirt/

TODO

virsh

TODO

oVirt

TODO

330.6 Cloud Management Tools (weight: 2)

Description: Candidates should have basic feature knowledge of commonly available cloud management tools.

Key Knowledge Areas:

  • Basic feature knowledge of OpenStack and CloudStack
  • Awareness of Eucalyptus and OpenNebula

The following is a partial list of the used files, terms and utilities:

  • OpenStack
  • CloudStack
  • Eucalyptus
  • OpenNebula

OpenStack

TODO

CloudStack

TODO

Eucalyptus

TODO

OpenNebula

TODO

Topic 334: High Availability Cluster Management

334.1 High Availability Concepts and Theory (weight: 5)

Description: Candidates should understand the properties and design approaches of high availability clusters.

Key Knowledge Areas:

  • Understand the most important cluster architectures.
  • Understand recovery and cluster reorganization mechanisms.
  • Design an appropriate cluster architecture for a given purpose.
  • Application aspects of high availability.
  • Operational considerations of high availability.

The following is a partial list of the used files, terms and utilities:

  • Active/Passive Cluster, Active/Active Cluster
  • Failover Cluster, Load Balanced Cluster
  • Shared-Nothing Cluster, Shared-Disk Cluster
  • Cluster resources
  • Cluster services
  • Quorum
  • Fencing
  • Split brain
  • Redundancy
  • Mean Time Before Failure (MTBF)
  • Mean Time To Repair (MTTR)
  • Service Level Agreement (SLA)
  • Desaster Recovery
  • Replication
  • Session handling

334.2 Load Balanced Clusters (weight: 6)

Description: Candidates should know how to install, configure, maintain and troubleshoot LVS. This includes the configuration and use of keepalived and ldirectord. Candidates should further be able to install, configure, maintain and troubleshoot HAProxy.

Key Knowledge Areas:

  • Understanding of LVS / IPVS.
  • Basic knowledge of VRRP.
  • Configuration of keepalived.
  • Configuration of ldirectord.
  • Backend server network configuration.
  • Understanding of HAProxy.
  • Configuration of HAProxy.

The following is a partial list of the used files, terms and utilities:

  • ipvsadm
  • syncd
  • LVS Forwarding (NAT, Direct Routing, Tunneling, Local Node)
  • connection scheduling algorithms
  • keepalived configuration file
  • ldirectord configuration file
  • genhash
  • HAProxy configuration file
  • load balancing algorithms
  • ACLs

334.3 Failover Clusters (weight: 6)

Description: Candidates should have experience in the installation, configuration, maintenance and troubleshooting of a Pacemaker cluster. This includes the use of Corosync. The focus is on Pacemaker 1.1 for Corosync 2.x.

Key Knowledge Areas:

  • Pacemaker architecture and components (CIB, CRMd, PEngine, LRMd, DC, STONITHd).
  • Pacemaker cluster configuration.
  • Resource classes (OCF, LSB, Systemd, Upstart, Service, STONITH, Nagios).
  • Resource rules and constraints (location, order, colocation).
  • Advanced resource features (templates, groups, clone resources, multi-state resources).
  • Pacemaker management using pcs.
  • Pacemaker management using crmsh.
  • Configuration and Management of corosync in conjunction with Pacemaker.
  • Awareness of other cluster engines (OpenAIS, Heartbeat, CMAN).

The following is a partial list of the used files, terms and utilities:

  • pcs
  • crm
  • crm_mon
  • crm_verify
  • crm_simulate
  • crm_shadow
  • crm_resource
  • crm_attribute
  • crm_node
  • crm_standby
  • cibadmin
  • corosync.conf
  • authkey
  • corosync-cfgtool
  • corosync-cmapctl
  • corosync-quorumtool
  • stonith_admin

334.4 High Availability in Enterprise Linux Distributions (weight: 1)

Description: Candidates should be aware of how enterprise Linux distributions integrate High Availability technologies.

Key Knowledge Areas:

  • Basic knowledge of Red Hat Enterprise Linux High Availability Add-On.
  • Basic knowledge of SUSE Linux Enterprise High Availability Extension.

The following is a partial list of the used files, terms and utilities:

  • Distribution specific configuration tools
  • Integration of cluster engines, load balancers, storage technology, cluster filesystems, etc.

Topic 335: High Availability Cluster Storage

335.1 DRBD / cLVM (weight: 3)

Description: Candidates are expected to have the experience and knowledge to install, configure, maintain and troubleshoot DRBD devices. This includes integration with Pacemaker. DRBD configuration of version 8.4.x is covered. Candidates are further expected to be able to manage LVM configuration within a shared storage cluster.

Key Knowledge Areas:

  • Understanding of DRBD resources, states and replication modes.
  • Configuration of DRBD resources, networking, disks and devices.
  • Configuration of DRBD automatic recovery and error handling.
  • Management of DRBD using drbdadm.
  • Basic knowledge of drbdsetup and drbdmeta.
  • Integration of DRBD with Pacemaker.
  • cLVM
  • Integration of cLVM with Pacemaker.

The following is a partial list of the used files, terms and utilities:

  • Protocol A, B and C
  • Primary, Secondary
  • Three-way replication
  • drbd kernel module
  • drbdadm
  • drbdsetup
  • drbdmeta
  • /etc/drbd.conf
  • /proc/drbd
  • LVM2
  • clvmd
  • vgchange, vgs

DRBD w/Pacemaker

Basic configuration

The most common way to configure DRBD to replicate a volume between two fixed nodes, using IP addresses statically assigned on each.

Setting up DRBD

Please refer to the DRBD docs on how to install it and set it up.
From now on, we will assume that you've set up DRBD and that it is working (test it with the DRBD init script outside Pacemaker's control). If not, debug this first.
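
For reference, a minimal DRBD 8.4 resource definition matching the drbd0 name used below, placed in /etc/drbd.conf or an included file under /etc/drbd.d/, might look like this; the backing disk, IP addresses and port are assumptions to adapt to your setup:

resource drbd0 {
  device    /dev/drbd0;
  disk      /dev/sdb1;
  meta-disk internal;
  net {
    protocol C;
  }
  on xen-1 {
    address 192.168.1.1:7789;
  }
  on xen-2 {
    address 192.168.1.2:7789;
  }
}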

Configuring the resource in the CIB

In the crm shell, you first have to create the primitive resource and then embed that into the master resource.

crm commands

configure

primitive drbd0 ocf:heartbeat:drbd \ 
 params drbd_resource=drbd0 \
 op monitor role=Master interval=59s timeout=30s \
 op monitor role=Slave interval=60s timeout=30s

ms ms-drbd0 drbd0 \
 meta clone-max=2 notify=true globally-unique=false target-role=stopped

commit

quit

The primitive DRBD resource, similar to what you would have used to configure drbddisk, is now embedded in a complex object master. This specifies the abilities and limitations of DRBD: there can be only two instances (clone-max), one per node (clone-node-max), and only one master at a time (master-max). The notify attribute specifies that DRBD needs to be told about what happens to its peer; globally-unique set to false lets Pacemaker know that the instances cannot be told apart on a single node.

Note that we're creating the resource in stopped state first, so that we can finish configuring its constraints and dependencies before activating it.

Specifying the nodes where the DRBD RA can be run

If you have a two node cluster, you could skip this step, because obviously, it can only run on those two. If you want to run drbd0 on two out of more nodes only, you will have to tell the cluster about this constraint:

crm configure location ms-drbd0-placement ms-drbd0 rule -inf: \#uname ne xen-1 and \#uname ne xen-2

This will tell the Policy Engine that, first, ms-drbd0 cannot run anywhere except on xen-1 or xen-2. Second, it tells the PE that it can run on those two.

Note: This assumes a symmetric cluster. If your cluster is asymmetric, you will have to invert the rules (Don't worry - if you do not specifically configure asymmetric, your cluster is symmetric by default).

Preferring a node to run the master role

With the configuration so far, the cluster would pick a node to promote DRBD on. If you want to prefer a node to run the master role (xen-1 in this example), you can express that like this:

crm configure location ms-drbd0-master-on-xen-1 ms-drbd0 rule role=master 100: \#uname eq xen-1

You can now activate the DRBD resource:

crm resource start ms-drbd0

It should be started and promoted on one of the two nodes - or, if you specified a constraint as shown above, on the node you preferred.
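
To see on which node the master role actually ended up, a one-shot status check with crm_mon (shipped with Pacemaker) is usually enough:

crm_mon -1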

Referencing the master or slave resource in constraints

DRBD is rarely useful by itself; you will probably want to run a service on top of it. Or, very likely, you want to mount the filesystem on the master side.

Let us assume that you've created an ext3 filesystem on /dev/drbd0, which you now want managed by the cluster as well. The filesystem resource object is straightforward, and if you have any experience with configuring Pacemaker at all, it will look rather familiar:

crm configure primitive fs0 ocf:heartbeat:Filesystem params fstype=ext3 directory=/mnt/share1 \
 device=/dev/drbd0 meta target-role=stopped

Make sure that the various settings match your setup. Again, this object has been created as stopped first.

Now the interesting bits. Obviously, the filesystem should only be mounted on the same node where drbd0 is in primary state, and only after drbd0 has been promoted, which is expressed in these two constraints:

crm commands

configure

order ms-drbd0-before-fs0 mandatory: ms-drbd0:promote fs0:start

colocation fs0-on-ms-drbd0 inf: fs0 ms-drbd0:Master

commit

quit

Et voila! You now can activate the filesystem resource and it'll be mounted at the proper time in the proper place.

crm resource start fs0

Just as this was done with a single filesystem resource, this can be done with a group: In a lot of cases, you will not just want a filesystem, but also an IP-address and some sort of daemon to run on top of the DRBD master. Put those resources in a group, use the constraints above and replace fs0 with the name of your group. The following example includes an apache webserver.

crm commands

configure

primitive drbd0 ocf:heartbeat:drbd \
 params drbd_resource=drbd0 \
 op monitor role=Master interval=59s timeout=30s \
 op monitor role=Slave interval=60s timeout=30s

ms ms-drbd0 drbd0 \
 meta clone-max=2 notify=true globally-unique=false target-role=stopped 

primitive fs0 ocf:heartbeat:Filesystem \ 
 params fstype=ext3 directory=/usr/local/apache/htdocs device=/dev/drbd0

primitive webserver ocf:heartbeat:apache \
 params configfile=/usr/local/apache/conf/httpd.conf httpd=/usr/local/apache/bin/httpd port=80 \ 
 op monitor interval=30s timeout=30s

primitive virtual-ip ocf:heartbeat:IPaddr2 \
 params ip=10.0.0.1 broadcast=10.0.0.255 nic=eth0 cidr_netmask=24 \
 op monitor interval=21s timeout=5s

group apache-group fs0 webserver virtual-ip

order ms-drbd0-before-apache-group mandatory: ms-drbd0:promote apache-group:start

colocation apache-group-on-ms-drbd0 inf: apache-group ms-drbd0:Master

location ms-drbd0-master-on-xen-1 ms-drbd0 rule role=master 100: #uname eq xen-1

commit

end

resource start ms-drbd0

quit

This will load the drbd module on both nodes and promote the instance on xen-1. After successful promotion, it will first mount /dev/drbd0 to /usr/local/apache/htdocs, then start the apache webserver and in the end configure the service IP-address 10.0.0.1/24 on network card eth0.

Moving the master role to a different node

If you want to move the DRBD master role to the other node, you should not attempt to just move the master role itself. On top of DRBD, you will probably have a Filesystem resource or a resource group with your application/Filesystem/IP-address (remember, DRBD isn't usually useful by itself). If you want to move the master role, you can accomplish that by moving the resource that is co-located with the DRBD master (and properly ordered). This can be done with the crm shell or crm_resource. Given the group example from above, you would use

crm resource migrate apache-group [hostname] 

This will stop all resources in the group, demote the current master, promote the other DRBD instance and start the group after successful promotion.
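
Note that crm resource migrate works by inserting a location constraint that pins the group to the target node; once the move has completed you will probably want to remove that constraint again, for example:

crm resource unmigrate apache-group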

Keeping the master role on a network connected node

It is most likely desirable to keep the master role on a node with a working network connection. This assumes you are familiar with pingd. If you have configured pingd, all you need is an rsc_location constraint for the master role which looks at the pingd attribute of the node.
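
If pingd is not configured yet, a minimal sketch using the ocf:pacemaker:ping agent could look like the following; the resource names, ping target address and multiplier are placeholders:

crm configure primitive pingd ocf:pacemaker:ping \
 params host_list=192.168.1.254 multiplier=100 \
 op monitor interval=15s timeout=60s

crm configure clone cl-pingd pingd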

crm configure location ms-drbd-0_master_on_connected_node ms-drbd0 \
 rule role=master -inf: not_defined pingd or pingd lte 0

This will force the master role off of any node with a pingd attribute value less than or equal to 0, or without a pingd attribute at all.

Note: This will prevent the master role and all its colocated resources from running at all if all your nodes lose network connection to the ping nodes.

If you don't want that, you can also configure a different score value than -INFINITY, but that requires cluster-individual score-maths depending on your number of resources, stickiness values and constraint scores.

Source

DRBD w/heartbeat

Heartbeat R1-style configuration

In R1-style clusters, Heartbeat keeps its complete configuration in three simple configuration files:

  • /etc/ha.d/ha.cf, as described in the section called “The ha.cf file”.
  • /etc/ha.d/authkeys, as described in the section called “The authkeys file”.
  • /etc/ha.d/haresources — the resource configuration file, described below.

The haresources file
The following is an example of a Heartbeat R1-compatible resource configuration involving a MySQL database backed by DRBD:

bob drbddisk::mysql Filesystem::/dev/drbd0::/var/lib/mysql::ext3 \
    10.9.42.1 mysql

This resource configuration contains one resource group whose home node (the node where its resources are expected to run under normal circumstances) is named bob. Consequently, this resource group would be considered the local resource group on host bob, whereas it would be the foreign resource group on its peer host.

The resource group includes a DRBD resource named mysql, which will be promoted to the primary role by the cluster manager (specifically, the drbddisk resource agent) on whichever node is currently the active node. Of course, a corresponding resource must exist and be configured in /etc/drbd.conf for this to work.

That DRBD resource translates to the block device named /dev/drbd0, which contains an ext3 filesystem that is to be mounted at /var/lib/mysql (the default location for MySQL data files).

The resource group also contains a service IP address, 10.9.42.1. Heartbeat will make sure that this IP address is configured and available on whichever node is currently active.

Finally, Heartbeat will use the LSB resource agent named mysql in order to start the MySQL daemon, which will then find its data files at /var/lib/mysql and be able to listen on the service IP address, 10.9.42.1.

It is important to understand that the resources listed in the haresources file are always evaluated from left to right when resources are being started, and from right to left when they are being stopped.

Source

Heartbeat CRM configuration

In CRM clusters, Heartbeat keeps part of configuration in the following configuration files:

  • /etc/ha.d/ha.cf, as described in the section called “The ha.cf file”. You must include the following line in this configuration file to enable CRM mode:
    crm yes
  • /etc/ha.d/authkeys. The contents of this file are the same as for R1 style clusters. See the section called “The authkeys file” for details.

The remainder of the cluster configuration is maintained in the Cluster Information Base (CIB), covered in detail in the following section. Unlike the two configuration files mentioned above, the CIB need not be distributed manually among cluster nodes; the Heartbeat services take care of that automatically.

The Cluster Information Base
The Cluster Information Base (CIB) is kept in one XML file, /var/lib/heartbeat/crm/cib.xml. It is, however, not recommended to edit the contents of this file directly, except in the case of creating a new cluster configuration from scratch. Instead, Heartbeat comes with both command-line applications and a GUI to modify the CIB.

The CIB actually contains both the cluster configuration (which is persistent and is kept in the cib.xml file), and information about the current cluster status (which is volatile). Status information, too, may be queried using either the Heartbeat command-line tools or the Heartbeat GUI.

After creating a new Heartbeat CRM cluster — that is, creating the ha.cf and authkeys files, distributing them among cluster nodes, starting Heartbeat services, and waiting for nodes to establish intra-cluster communications — a new, empty CIB is created automatically. Its contents will be similar to this:

<cib>
   <configuration>
     <crm_config>
       <cluster_property_set id="cib-bootstrap-options">
         <attributes/>
       </cluster_property_set>
     </crm_config>
     <nodes>
       <node uname="alice" type="normal"
             id="f11899c3-ed6e-4e63-abae-b9af90c62283"/>
       <node uname="bob" type="normal"
             id="663bae4d-44a0-407f-ac14-389150407159"/>
     </nodes>
     <resources/>
     <constraints/>
   </configuration>
 </cib>

The exact format and contents of this file are documented at length on the Linux-HA web site, but for practical purposes it is important to understand that this cluster has two nodes named alice and bob, and that neither any resources nor any resource constraints have been configured at this point.

Adding a DRBD-backed service to the cluster configuration

This section explains how to enable a DRBD-backed service in a Heartbeat CRM cluster. The examples used in this section mimic, in functionality, those described in the section called “Heartbeat resources”, dealing with R1-style Heartbeat clusters.

The complexity of the configuration steps described in this section may seem overwhelming to some, particularly those having previously dealt only with R1-style Heartbeat configurations. While the configuration of Heartbeat CRM clusters is indeed complex (and sometimes not very user-friendly), the CRM's advantages may outweigh those of R1-style clusters. Which approach to follow is entirely up to the administrator's discretion.
Using the drbddisk resource agent in a Heartbeat CRM configuration

Even though you are using Heartbeat in CRM mode, you may still utilize R1-compatible resource agents such as drbddisk. This resource agent provides no secondary node monitoring, and ensures only resource promotion and demotion.

In order to enable a DRBD-backed configuration for a MySQL database in a Heartbeat CRM cluster with drbddisk, you would use a configuration like this:

<group ordered="true" collocated="true" id="rg_mysql">
  <primitive class="heartbeat" type="drbddisk"
             provider="heartbeat" id="drbddisk_mysql">
    <meta_attributes>
      <attributes>
        <nvpair name="target_role" value="started"/>
      </attributes>
    </meta_attributes>
    <instance_attributes>
      <attributes>
        <nvpair name="1" value="mysql"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive class="ocf" type="Filesystem"
             provider="heartbeat" id="fs_mysql">
    <instance_attributes>
      <attributes>
        <nvpair name="device" value="/dev/drbd0"/>
        <nvpair name="directory" value="/var/lib/mysql"/>
        <nvpair name="type" value="ext3"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive class="ocf" type="IPaddr2"
             provider="heartbeat" id="ip_mysql">
    <instance_attributes>
      <attributes>
        <nvpair name="ip" value="192.168.42.1"/>
        <nvpair name="cidr_netmask" value="24"/>
        <nvpair name="nic" value="eth0"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive class="lsb" type="mysqld"
             provider="heartbeat" id="mysqld"/>
</group>

Assuming you created this configuration in a temporary file named /tmp/hb_mysql.xml, you would add this resource group to the cluster configuration using the following command (on any cluster node):

cibadmin -o resources -C -x /tmp/hb_mysql.xml

After this, Heartbeat will automatically propagate the newly-configured resource group to all cluster nodes.

Using the drbd OCF resource agent in a Heartbeat CRM configuration

The drbd resource agent is a “pure-bred” OCF RA which provides Master/Slave capability, allowing Heartbeat to start and monitor the DRBD resource on multiple nodes and promoting and demoting as needed. You must, however, understand that the drbd RA disconnects and detaches all DRBD resources it manages on Heartbeat shutdown, and also upon enabling standby mode for a node.

In order to enable a DRBD-backed configuration for a MySQL database in a Heartbeat CRM cluster with the drbd OCF resource agent, you must create both the necessary resources, and Heartbeat constraints to ensure your service only starts on a previously promoted DRBD resource. It is recommended that you start with the constraints, such as shown in this example:

<constraints>
  <rsc_order id="mysql_after_drbd" from="rg_mysql" action="start"
             to="ms_drbd_mysql" to_action="promote" type="after"/>
  <rsc_colocation id="mysql_on_drbd" to="ms_drbd_mysql"
                  to_role="master" from="rg_mysql" score="INFINITY"/>
</constraints>

Assuming you put these settings in a file named /tmp/constraints.xml, here is how you would enable them:

cibadmin -U -x /tmp/constraints.xml

Subsequently, you would create your relevant resources:

<resources>
  <master_slave id="ms_drbd_mysql">
    <meta_attributes id="ms_drbd_mysql-meta_attributes">
      <attributes>
        <nvpair name="notify" value="yes"/>
        <nvpair name="globally_unique" value="false"/>
      </attributes>
    </meta_attributes>
    <primitive id="drbd_mysql" class="ocf" provider="heartbeat"
        type="drbd">
      <instance_attributes id="ms_drbd_mysql-instance_attributes">
        <attributes>
          <nvpair name="drbd_resource" value="mysql"/>
        </attributes>
      </instance_attributes>
      <operations id="ms_drbd_mysql-operations">
        <op id="ms_drbd_mysql-monitor-master"
	    name="monitor" interval="29s"
            timeout="10s" role="Master"/>
        <op id="ms_drbd_mysql-monitor-slave"
            name="monitor" interval="30s"
            timeout="10s" role="Slave"/>
      </operations>
    </primitive>
  </master_slave>
  <group id="rg_mysql">
    <primitive class="ocf" type="Filesystem"
               provider="heartbeat" id="fs_mysql">
      <instance_attributes id="fs_mysql-instance_attributes">
        <attributes>
          <nvpair name="device" value="/dev/drbd0"/>
          <nvpair name="directory" value="/var/lib/mysql"/>
          <nvpair name="type" value="ext3"/>
        </attributes>
      </instance_attributes>
    </primitive>
    <primitive class="ocf" type="IPaddr2"
               provider="heartbeat" id="ip_mysql">
      <instance_attributes id="ip_mysql-instance_attributes">
        <attributes>
          <nvpair name="ip" value="10.9.42.1"/>
          <nvpair name="nic" value="eth0"/>
        </attributes>
      </instance_attributes>
    </primitive>
    <primitive class="lsb" type="mysqld"
               provider="heartbeat" id="mysqld"/>
  </group>
</resources>

Assuming you put these settings in a file named /tmp/resources.xml, here is how you would enable them:

cibadmin -U -x /tmp/resources.xml

After this, your configuration should be enabled. Heartbeat now selects a node on which it promotes the DRBD resource, and then starts the DRBD-backed resource group on that same node.

Source

335.2 Clustered File Systems (weight: 3)

Description: Candidates should know how to install, maintain and troubleshoot installations using GFS2 and OCFS2. This includes integration with Pacemaker as well as awareness of other clustered filesystems available in a Linux environment.

Key Knowledge Areas:

  • Understand the principles of cluster file systems.
  • Create, maintain and troubleshoot GFS2 file systems in a cluster.
  • Create, maintain and troubleshoot OCFS2 file systems in a cluster.
  • Integration of GFS2 and OCFS2 with Pacemaker.
  • Awareness of the O2CB cluster stack.
  • Awareness of other commonly used clustered file systems.

The following is a partial list of the used files, terms and utilities:

  • Distributed Lock Manager (DLM)
  • mkfs.gfs2
  • mount.gfs2
  • fsck.gfs2
  • gfs2_grow
  • gfs2_edit
  • gfs2_jadd
  • mkfs.ocfs2
  • mount.ocfs2
  • fsck.ocfs2
  • tunefs.ocfs2
  • mounted.ocfs2
  • o2info
  • o2image
  • CephFS
  • GlusterFS
  • AFS

GFS2

The Global File System 2 (GFS2) is a shared-disk file system for Linux computer clusters. GFS2 differs from distributed file systems (such as AFS, Coda, or InterMezzo) because GFS2 allows all nodes to have direct concurrent access to the same shared block storage. In addition, GFS or GFS2 can also be used as a local filesystem.

GFS has no disconnected operating-mode, and no client or server roles. All nodes in a GFS cluster function as peers. Using GFS in a cluster requires hardware to allow access to the shared storage, and a lock manager to control access to the storage. The lock manager operates as a separate module: thus GFS and GFS2 can use the Distributed Lock Manager (DLM) for cluster configurations and the “nolock” lock manager for local filesystems. Older versions of GFS also support GULM, a server based lock manager which implements redundancy via failover.

Source

Setting up GFS2

The first command you need to know for creating and modifying your cluster is the ccs_tool command.

Below I will show you the necessary steps to create a cluster and then the GFS2 filesystem.
1. First step is to install the necessary RPMs:

    yum -y install modcluster rgmanager gfs2 gfs2-utils lvm2-cluster cman

2. Second step is to create a cluster on gfs1

    ccs_tool create GFStestCluster

3. Now that the cluster is created, we will now need to add the fencing devices.

  ( For simplicity you can just use fence_manual for each host: ccs_tool addfence -C gfs1_ipmi fence_manual )
  But if you are using VMware ESX like I am, you should use fence_vmware like so:
    ccs_tool addfence -C gfs1_vmware fence_vmware ipaddr=esxtest login=esxuser passwd=esxpass \
     vmlogin=root vmpasswd=esxpass port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/eagle1/gfs1.vmx"
    ccs_tool addfence -C gfs2_vmware fence_vmware ipaddr=esxtest login=esxuser passwd=esxpass \
     vmlogin=root vmpasswd=esxpass port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/gfs2/gfs2.vmx"
    ccs_tool addfence -C gfs3_vmware fence_vmware ipaddr=esxtest login=esxuser passwd=esxpass \
     vmlogin=root vmpasswd=esxpass port="/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/gfs3/gfs3.vmx"

4. Now that we added the Fencing devices, it is time to add the nodes..

    ccs_tool addnode -C gfs1 -n 1 -v 1 -f gfs1_vmware
    ccs_tool addnode -C gfs2 -n 2 -v 1 -f gfs2_vmware
    ccs_tool addnode -C gfs3 -n 3 -v 1 -f gfs3_vmware

5. Now we need to copy this configuration over to the other 2 nodes from gfs1 or we can run the exact same commands above on the other 2 nodes..

    scp /etc/cluster/cluster.conf root@gfs2:/etc/cluster/cluster.conf
    scp /etc/cluster/cluster.conf root@gfs3:/etc/cluster/cluster.conf

6. You can verify the config on all 3 nodes by running the following commands below..

    ccs_tool lsnode
    ccs_tool lsfence

7. Once you have either copied over the configs or re-run the same commands above on the other 2 nodes, you are ready to start the following daemons on all the nodes in the cluster:

    /etc/init.d/cman start
    /etc/init.d/rgmanager start

8. You can now check the status of your cluster by running the commands below…

    clustat
    cman_tool status

9. If you want to test the VMware fencing, you can do so as follows (run the command below on the 1st node and use the 2nd node as the node to be fenced):

  fence_vmware -a esxtest -l esxuser -p esxpass -L root -P esxpass \
   -n "/vmfs/volumes/49086551-c64fd83c-0401-001e0bcd6848/gfs2/gfs2.vmx" -v

10. Before we start to create the LVM2 volumes and proceed to GFS2, we will need to enable clustering in LVM2.

    lvmconf --enable-cluster

11. Now it is time to create the LVM2 Volumes…

    pvcreate /dev/sdb
    vgcreate -c y mytest_gfs2 /dev/sdb
    lvcreate -n MyGFS2test -L 5G mytest_gfs2
    /etc/init.d/clvmd start

12. You should now also start clvmd on the other 2 nodes.

13. Once the above has been completed, you will now need to create the GFS2 file system. Example below:

    mkfs -t gfs2 -p <locking protocol> -t <ClusterName>:<FilesystemName> \
     -j <journals needed == number of nodes in cluster> <block device>
    mkfs -t gfs2 -p lock_dlm -t GFStestCluster:MyTestGFS -j 3 /dev/mapper/mytest_gfs2-MyGFS2test

14. All we need to do on the 3 nodes is to mount the GFS2 file system.

    mount /dev/mapper/mytest_gfs2-MyGFS2test /mnt/

15. Once you have mounted your GFS2 file system, you can run the following commands:

    gfs2_tool list
    gfs2_tool df


Now it is time to wrap it up with some final commands…

1. Now that we have a fully functional cluster and a mountable GFS2 file system, we need to make sure all the necessary daemons start up with the cluster..

    chkconfig --level 345 rgmanager on
    chkconfig --level 345 clvmd on
    chkconfig --level 345 cman on
    chkconfig --level 345 gfs2 on

2. If you want the GFS2 file system to be mounted at startup you can add this to /etc/fstab..

    echo "/dev/mapper/mytest_gfs2-MyGFS2test /GFS gfs2 defaults,noatime,nodiratime 0 0" >> /etc/fstab

Source

Distributed Lock Manager

A distributed lock manager (DLM) provides distributed software applications with a means to synchronize their accesses to shared resources.

Lock management is a common cluster-infrastructure service that provides a mechanism for other cluster infrastructure components to synchronize their access to shared resources. In a Red Hat cluster, DLM (Distributed Lock Manager) is the lock manager.

A lock manager is a traffic cop who controls access to resources in the cluster, such as access to a GFS file system. You need it because without a lock manager, there would be no control over access to your shared storage, and the nodes in the cluster would corrupt each other's data.

As implied in its name, DLM is a distributed lock manager and runs in each cluster node; lock management is distributed across all nodes in the cluster. GFS2 and CLVM use locks from the lock manager. GFS2 uses locks from the lock manager to synchronize access to file system metadata (on shared storage). CLVM uses locks from the lock manager to synchronize updates to LVM volumes and volume groups (also on shared storage). In addition, rgmanager uses DLM to synchronize service states.
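
As a small illustration (the volume group name myvg is a placeholder), marking an LVM volume group as clustered and checking the flag uses the vgchange and vgs tools listed in the objectives:

    # mark the volume group as clustered so clvmd/DLM coordinates metadata updates
    vgchange -c y myvg
    # a 'c' in the last position of the attr field indicates a clustered VG
    vgs -o vg_name,vg_attr myvg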

DLM Locking Model

The DLM locking model provides a rich set of locking modes and both synchronous and asynchronous execution. An application acquires a lock on a lock resource. A one-to-many relationship exists between lock resources and locks: a single lock resource can have multiple locks associated with it.

A lock resource can correspond to an actual object, such as a file, a data structure, a database, or an executable routine, but it does not have to correspond to one of these things. The object you associate with a lock resource determines the granularity of the lock. For example, locking an entire database is considered locking at coarse granularity. Locking each item in a database is considered locking at a fine granularity.

The DLM locking model supports:

  • Six locking modes that increasingly restrict access to a resource
  • The promotion and demotion of locks through conversion
  • Synchronous completion of lock requests
  • Asynchronous completion
  • Global data through lock value blocks

The DLM provides its own mechanisms to support its locking features, such as inter-node communication to manage lock traffic and recovery protocols to re-master locks after a node failure or to migrate locks when a node joins the cluster. However, the DLM does not provide mechanisms to actually manage the cluster itself. Therefore the DLM expects to operate in a cluster in conjunction with another cluster infrastructure environment that provides the following minimum requirements:

  • The node is a part of a cluster.
  • All nodes agree on cluster membership and have quorum.
  • An IP address must communicate with the DLM on a node. Normally the DLM uses TCP/IP for inter-node communications, which restricts it to a single IP address per node (though this can be made more redundant using the bonding driver). The DLM can be configured to use SCTP as its inter-node transport, which allows multiple IP addresses per node.

The DLM works with any cluster infrastructure environment that provides the minimum requirements listed above. The choice of an open source or closed source environment is up to the user. However, the DLM’s main limitation is the amount of testing performed with different environments.

Source

Lock States

A lock state indicates the current status of a lock request. A lock is always in one of three states:

  • Granted — The lock request succeeded and attained the requested mode.
  • Converting — A client attempted to change the lock mode and the new mode is incompatible with an existing lock.
  • Blocked — The request for a new lock could not be granted because conflicting locks exist.

A lock's state is determined by its requested mode and the modes of the other locks on the same resource.

Source

Coda

Coda is a distributed file system developed as a research project at Carnegie Mellon University since 1987 under the direction of Mahadev Satyanarayanan. It descended directly from an older version of AFS (AFS-2) and offers many similar features. The InterMezzo file system was inspired by Coda. Coda is still under development, though the focus has shifted from research to creating a robust product for commercial use.

Coda has many features that are desirable for network file systems, and several features not found elsewhere.

  1. Disconnected operation for mobile computing
  2. Is freely available under a liberal license
  3. High performance through client side persistent caching
  4. Server replication
  5. Security model for authentication, encryption and access control
  6. Continued operation during partial network failures in server network
  7. Network bandwidth adaptation
  8. Good scalability
  9. Well defined semantics of sharing, even in the presence of network failures

Coda uses a local cache to provide access to server data when the network connection is lost. During normal operation, a user reads and writes to the file system normally, while the client fetches, or “hoards”, all of the data the user has listed as important in the event of network disconnection. If the network connection is lost, the Coda client serves data from this local cache and logs all updates. This operating state is called disconnected operation. Upon network reconnection, the client moves to reintegration state; it sends logged updates to the servers. Then it transitions back to normal connected-mode operation.

Also different from AFS is Coda's data replication method. AFS uses a pessimistic replication strategy with its files, only allowing one read/write server to receive updates and all other servers acting as read-only replicas. Coda allows all servers to receive updates, allowing for a greater availability of server data in the event of network partitions, a case which AFS cannot handle.

These unique features introduce the possibility of semantically diverging copies of the same files or directories, known as “conflicts”. Disconnected operation's local updates can potentially clash with other connected users' updates on the same objects, preventing reintegration. Optimistic replication can potentially cause concurrent updates to different servers on the same object, preventing replication. The former case is called a “local/global” conflict, and the latter case a “server/server” conflict. Coda has extensive repair tools, both manual and automated, to handle and repair both types of conflicts.
Source

AFS

The Andrew File System (AFS) is a distributed networked file system which uses a set of trusted servers to present a homogeneous, location-transparent file name space to all the client workstations. It was developed by Carnegie Mellon University as part of the Andrew Project. It is named after Andrew Carnegie and Andrew Mellon. Its primary use is in distributed computing.

AFS has several benefits over traditional networked file systems, particularly in the areas of security and scalability. It is not uncommon for enterprise AFS deployments to exceed 25,000 clients. AFS uses Kerberos for authentication, and implements access control lists on directories for users and groups. Each client caches files on the local filesystem for increased speed on subsequent requests for the same file. This also allows limited filesystem access in the event of a server crash or a network outage.

Read and write operations on an open file are directed only to the locally cached copy. When a modified file is closed, the changed portions are copied back to the file server. Cache consistency is maintained by callback mechanism. When a file is cached, the server makes a note of this and promises to inform the client if the file is updated by someone else. Callbacks are discarded and must be re-established after any client, server, or network failure, including a time-out. Re-establishing a callback involves a status check and does not require re-reading the file itself.

A consequence of the file locking strategy is that AFS does not support large shared databases or record updating within files shared between client systems. This was a deliberate design decision based on the perceived needs of the university computing environment. It leads, for example, to the use of a single file per message in the original email system for the Andrew Project, the Andrew Message System, rather than a single file per mailbox. See file locking (AFS and buffered I/O Problems) for handling shared databases.

A significant feature of AFS is the volume, a tree of files, sub-directories and AFS mountpoints (links to other AFS volumes). Volumes are created by administrators and linked at a specific named path in an AFS cell. Once created, users of the filesystem may create directories and files as usual without concern for the physical location of the volume. A volume may have a quota assigned to it in order to limit the amount of space consumed. As needed, AFS administrators can move that volume to another server and disk location without the need to notify users; indeed the operation can occur while files in that volume are being used.

AFS volumes can be replicated to read-only cloned copies. When accessing files in a read-only volume, a client system will retrieve data from a particular read-only copy. If at some point that copy becomes unavailable, clients will look for any of the remaining copies. Again, users of that data are unaware of the location of the read-only copy; administrators can create and relocate such copies as needed. The AFS command suite guarantees that all read-only volumes contain exact copies of the original read-write volume at the time the read-only copy was created.
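
As a hedged sketch of how volumes and read-only replicas are handled in an OpenAFS cell (the server, partition and volume names here are invented), the vos command suite is typically used:

    # create a read/write volume on server fs1, partition /vicepa
    vos create fs1.example.com /vicepa home.alice
    # define a read-only replication site and push a snapshot to it
    vos addsite fs1.example.com /vicepa home.alice
    vos release home.alice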

The file name space on an Andrew workstation is partitioned into a shared and local name space. The shared name space (usually mounted as /afs on the Unix filesystem) is identical on all workstations. The local name space is unique to each workstation. It only contains temporary files needed for workstation initialization and symbolic links to files in the shared name space.

The Andrew File System heavily influenced Version 4 of Sun Microsystems' popular Network File System (NFS). Additionally, a variant of AFS, the Distributed File System (DFS) was adopted by the Open Software Foundation in 1989 as part of their Distributed Computing Environment.

Source

GlusterFS

GlusterFS is a scale-out NAS file system. It is free software, with some parts licensed under the GNU GPL v3 while others are dual licensed under either GPL v2 or the LGPL v3. It aggregates various storage servers over Ethernet or Infiniband RDMA interconnect into one large parallel network file system. GlusterFS is based on a stackable user space design. It has found a variety of applications including cloud computing, streaming media services, and content delivery networks. GlusterFS was developed originally by Gluster, Inc., then by Red Hat, Inc., after their purchase of Gluster in 2011.

GlusterFS has a client and server component. Servers are typically deployed as storage bricks, with each server running a glusterfsd daemon to export a local file system as a volume. The glusterfs client process, which connects to servers with a custom protocol over TCP/IP, InfiniBand or SDP, creates composite virtual volumes from multiple remote servers using stackable translators. By default, files are stored whole, but striping of files across multiple remote volumes is also supported. The final volume may then be mounted by the client host using its own native protocol via the FUSE mechanism, using NFS v3 protocol using a built-in server translator, or accessed via gfapi client library. Native-protocol mounts may then be re-exported e.g. via the kernel NFSv4 server, SAMBA, or the object-based OpenStack Storage (Swift) protocol using the “UFO” (Unified File and Object) translator.

Most of the functionality of GlusterFS is implemented as translators, including:

  • File-based mirroring and replication
  • File-based striping
  • File-based load balancing
  • Volume failover
  • Scheduling and disk caching
  • Storage quotas

The GlusterFS server is intentionally kept simple: it exports an existing directory as-is, leaving it up to client-side translators to structure the store. The clients themselves are stateless, do not communicate with each other, and are expected to have translator configurations consistent with each other. GlusterFS relies on an elastic hashing algorithm, rather than using either a centralized or distributed metadata model. With version 3.1 and later of GlusterFS, volumes can be added, deleted, or migrated dynamically, helping to avoid configuration coherency problems, and allowing GlusterFS to scale up to several petabytes on commodity hardware by avoiding bottlenecks that normally affect more tightly-coupled distributed file systems.
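
A minimal replicated volume, as a sketch with invented host names and brick paths, would be created and mounted roughly like this:

    gluster peer probe server2
    gluster volume create gv0 replica 2 server1:/data/brick1/gv0 server2:/data/brick1/gv0
    gluster volume start gv0
    # mount the volume on a client via FUSE
    mount -t glusterfs server1:/gv0 /mnt/gv0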

Source

CephFS

In computing, Ceph (pronounced /ˈsɛf/ or /ˈkɛf/), a free-software storage platform, implements object storage on a single distributed computer cluster, and provides interfaces for object-, block- and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, scalable to the exabyte level, and freely available.
Ceph replicates data and makes it fault-tolerant, using commodity hardware and requiring no specific hardware support. As a result of its design, the system is both self-healing and self-managing, aiming to minimize administration time and other costs.
On April 21, 2016, the Ceph development team released “Jewel”, the first Ceph release in which CephFS is considered stable. The CephFS repair and disaster recovery tools are feature-complete, though snapshots, multiple active metadata servers and some other functionality are disabled by default.

Design
A high-level overview of Ceph's internal organization: {{:wiki:certification:ceph_components.svg.png?400|}}

Ceph employs four distinct kinds of daemons:

  • Cluster monitors (ceph-mon) that keep track of active and failed cluster nodes
  • Metadata servers (ceph-mds) that store the metadata of inodes and directories
  • Object storage devices (ceph-osd) that actually store the content of files in an XFS file system.
  • Representational state transfer (RESTful) gateways (ceph-rgw) that expose the object storage layer as an interface compatible with Amazon S3 or OpenStack Swift APIs

All of these are fully distributed, and may run on the same set of servers. Clients directly interact with all of them. Ceph does striping of individual files across multiple nodes to achieve higher throughput, similarly to how RAID0 stripes partitions across multiple hard drives. Adaptive load balancing is supported whereby frequently accessed objects are replicated over more nodes. As of December 2014, XFS is the recommended underlying filesystem type for production environments, while Btrfs is recommended for non-production environments. ext4 filesystems are not recommended because of resulting limitations on the maximum RADOS objects length.

Object storage
Ceph implements distributed object storage. Ceph’s software libraries provide client applications with direct access to the reliable autonomic distributed object store (RADOS) object-based storage system, and also provide a foundation for some of Ceph’s features, including RADOS Block Device (RBD), RADOS Gateway, and the Ceph File System.
The librados software libraries provide access in C, C++, Java, PHP, and Python. The RADOS Gateway also exposes the object store as a RESTful interface which can present as both native Amazon S3 and OpenStack Swift APIs.

Block storage
Ceph’s object storage system allows users to mount Ceph as a thin-provisioned block device. When an application writes data to Ceph using a block device, Ceph automatically stripes and replicates the data across the cluster. Ceph's RADOS Block Device (RBD) also integrates with Kernel-based Virtual Machines (KVMs).
Ceph RBD interfaces with the same Ceph object storage system that provides the librados interface and the CephFS file system, and it stores block device images as objects. Since RBD is built on librados, RBD inherits librados's abilities, including read-only snapshots and revert to snapshot. By striping images across the cluster, Ceph improves read access performance for large block device images.
The block device can be virtualized, providing block storage to virtual machines, in virtualization platforms such as Apache CloudStack, OpenStack, OpenNebula, Ganeti, and Proxmox Virtual Environment.
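
Assuming a working Ceph cluster and client keyring (the pool and image names below are placeholders), a block device is typically created and mapped with the rbd tool:

    # create a 4 GB image in the default 'rbd' pool
    rbd create --size 4096 rbd/vm-disk1
    # map it on a client; it appears as /dev/rbdX
    rbd map rbd/vm-disk1
    rbd showmapped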

File system
Ceph’s file system (CephFS) runs on top of the same object storage system that provides object storage and block device interfaces. The Ceph metadata server cluster provides a service that maps the directories and file names of the file system to objects stored within RADOS clusters. The metadata server cluster can expand or contract, and it can rebalance the file system dynamically to distribute data evenly among cluster hosts. This ensures high performance and prevents heavy loads on specific hosts within the cluster.
Clients mount the POSIX-compatible file system using a Linux kernel client. On March 19, 2010, Linus Torvalds merged the Ceph client into Linux kernel version 2.6.34[13] which was released on May 16, 2010. An older FUSE-based client is also available. The servers run as regular Unix daemons.
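
Mounting CephFS with the kernel client might look like this (the monitor address, credentials and mount point are placeholders); ceph-fuse is the FUSE-based alternative:

    mount -t ceph mon1.example.com:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
    # or, using the FUSE client
    ceph-fuse /mnt/cephfs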
Source

Acknowledgments

Most of the information in this document was collected from different sites on the internet and was copied (un)modified. Some text was created by me. The copyright of the text in this document remains with its owners and is in no way claimed by me. If you wrote some of the text we copied, I would like to thank you for your excellent work.

Nothing in this document should be published for commercial purposes without gaining the permission of the original copyright owners.

For questions about this document or if you want to help keep this document up-to-date, you can contact me at webmaster@universe-network.net
