Elepffwdsff

korsygfhrtzangaiide Elepffwdsff

Kexec/Kdump HOWTO

Introduction

Kexec and kdump are new features in the 2.6 mainstream kernel. These features
are included in Red Hat Enterprise Linux 5. The purpose of these features
is to ensure faster boot up and creation of reliable kernel vmcores for
diagnostic purposes.

Overview

Kexec

Kexec is a fastboot mechanism which allows booting a Linux kernel from the
context of already running kernel without going through BIOS. BIOS can be very
time consuming especially on the big servers with lots of peripherals. This can
save a lot of time for developers who end up booting a machine numerous times.

Kdump

Kdump is a new kernel crash dumping mechanism and is very reliable because
the crash dump is captured from the context of a freshly booted kernel and
not from the context of the crashed kernel. Kdump uses kexec to boot into
a second kernel whenever system crashes. This second kernel, often called
a capture kernel, boots with very little memory and captures the dump image.

The first kernel reserves a section of memory that the second kernel uses
to boot. Kexec enables booting the capture kernel without going through BIOS
hence contents of first kernel's memory are preserved, which is essentially
the kernel crash dump.

Kdump is supported on the i686, x86_64, ia64 and ppc64 platforms. The
standard kernel and capture kernel are one in the same on i686, x86_64,
ia64 and ppc64.

If you're reading this document, you should already have kexec-tools
installed. If not, you install it via the following command:

# yum install kexec-tools

Now load a kernel with kexec:

# kver=`uname -r` # kexec -l /boot/vmlinuz-$kver
    --initrd=/boot/initrd-$kver.img \
        --command-line="`cat /proc/cmdline`"

NOTE: The above will boot you back into the kernel you're currently running,
if you want to load a different kernel, substitute it in place of `uname -r`.

Now reboot your system, taking note that it should bypass the BIOS:

# reboot

How to configure kdump:

Again, we assume if you're reading this document, you should already have
kexec-tools installed. If not, you install it via the following command:

# yum install kexec-tools

To be able to do much of anything interesting in the way of debug analysis,
you'll also need to install the kernel-debuginfo package, of the same arch
as your running kernel, and the crash utility:

# yum --enablerepo=\*debuginfo install kernel-debuginfo.$(uname -m) crash

Next up, we need to modify some boot parameters to reserve a chunk of memory for
the capture kernel. With the help of grubby, it's very easy to append
"crashkernel=128M" to the end of your kernel boot parameters. Note that the X
values are such that X = the amount of memory to reserve for the capture kernel.
And based on arch and system configuration, one might require more than 128M to
be reserved for kdump. One need to experiment and test kdump, if 128M is not
sufficient, try reserving more memory.

# grubby --args="crashkernel=128M" --update-kernel=/boot/vmlinuz-`uname -r`

Note that there is an alternative form in which to specify a crashkernel
memory reservation, in the event that more control is needed over the size and
placement of the reserved memory.  The format is:

crashkernel=range1:size1[,range2:size2,...][@offset]

Where range<n> specifies a range of values that are matched against the amount
of physical RAM present in the system, and the corresponding size<n> value
specifies the amount of kexec memory to reserve.  For example:

crashkernel=512M-2G:64M,2G-:128M

This line tells kexec to reserve 64M of ram if the system contains between
512M and 2G of physical memory.  If the system contains 2G or more of physical
memory, 128M should be reserved.

You can also use the default crashkernel=auto to let kernel set the
crashkernel size.

crashkernel=auto indicates a best effort estimation for usual use cases,
however one still needs do a test to ensure that the kernel reserved
memory size is enough.

NOTE:
When a debug variant kernel is used as the capture kernel and the
primary kernel was booted with 'crashkernel=auto' set in the bootargs,
the capture kernel boot can fail.

A debug variant kernel usually is the same stable kernel with some
debug options enabled which uses much more memory in the kdump kernel.
Thus when you use 'crashkernel=auto', kdump kernel will likely run out
of memory.

So it is not advisable to use a debug variant kernel as the capture
kernel when primary kernel is booted with 'crashkernel=auto' set in
bootargs.

Besides, since kdump needs to access /proc/kallsyms during a kernel
loading if KASLR is enabled, check /proc/sys/kernel/kptr_restrict to
make sure that the content of /proc/kallsyms is exposed correctly.
We recommend to set the value of kptr_restrict to '1'. Otherwise
capture kernel loading could fail.

After making said changes, reboot your system, so that the X MB of memory is
left untouched by the normal system, reserved for the capture kernel. Take note
that the output of 'free -m' will show X MB less memory than without this
parameter, which is expected. You may be able to get by with less than 128M, but
testing with only 64M has proven unreliable of late. On ia64, as much as 512M
may be required.

Now that you've got that reserved memory region set up, you want to turn on
the kdump init script:

# chkconfig kdump on

Then, start up kdump as well:

# systemctl start kdump.service

This should load your kernel-kdump image via kexec, leaving the system ready
to capture a vmcore upon crashing. To test this out, you can force-crash
your system by echo'ing a c into /proc/sysrq-trigger:

# echo c > /proc/sysrq-trigger

You should see some panic output, followed by the system restarting into
the kdump kernel. When the boot process gets to the point where it starts
the kdump service, your vmcore should be copied out to disk (by default,
in /var/crash/<YYYY-MM-DD-HH:MM>/vmcore), then the system rebooted back into
your normal kernel.

Once back to your normal kernel, you can use the previously installed crash
kernel in conjunction with the previously installed kernel-debuginfo to
perform postmortem analysis:

# crash /usr/lib/debug/lib/modules/2.6.17-1.2621.el5/vmlinux
    /var/crash/2006-08-23-15:34/vmcore

crash> bt

and so on...

Notes:

When kdump starts, the kdump kernel is loaded together with the kdump
initramfs. To save memory usage and disk space, the kdump initramfs is
generated strictly against the system it will run on, and contains the
minimum set of kernel modules and utilities to boot the machine to a stage
where the dump target could be mounted.

With kdump service enabled, kdumpctl will try to detect possible system
change and rebuild the kdump initramfs if needed. But it can not guarantee
to cover every possible case. So after a hardware change, disk migration,
storage setup update or any similar system level changes, it's highly
recommended to rebuild the initramfs manually with following command:

# kdumpctl rebuild

Saving vmcore-dmesg.txt
----------------------
Kernel log bufferes are one of the most important information available
in vmcore. Now before saving vmcore, kernel log bufferes are extracted
from /proc/vmcore and saved into a file vmcore-dmesg.txt. After
vmcore-dmesg.txt, vmcore is saved. Destination disk and directory for
vmcore-dmesg.txt is same as vmcore. Note that kernel log buffers will
not be available if dump target is raw device.

Dump Triggering methods:

This section talks about the various ways, other than a Kernel Panic, in which
Kdump can be triggered. The following methods assume that Kdump is configured
on your system, with the scripts enabled as described in the section above.

1) AltSysRq C

Kdump can be triggered with the combination of the 'Alt','SysRq' and 'C'
keyboard keys. Please refer to the following link for more details:

http://kbase.redhat.com/faq/FAQ_43_5559.shtm

In addition, on PowerPC boxes, Kdump can also be triggered via Hardware
Management Console(HMC) using 'Ctrl', 'O' and 'C' keyboard keys.

2) NMI_WATCHDOG

In case a machine has a hard hang, it is quite possible that it does not
respond to keyboard interrupts. As a result 'Alt-SysRq' keys will not help
trigger a dump. In such scenarios Nmi Watchdog feature can prove to be useful.
The following link has more details on configuring Nmi watchdog option.

http://kbase.redhat.com/faq/FAQ_85_9129.shtm

Once this feature has been enabled in the kernel, any lockups will result in an
OOPs message to be generated, followed by Kdump being triggered.

3) Kernel OOPs

If we want to generate a dump everytime the Kernel OOPses, we can achieve this
by setting the 'Panic On OOPs' option as follows:

# echo 1 > /proc/sys/kernel/panic_on_oops

This is enabled by default on RHEL5.

4) NMI(Non maskable interrupt) button

In cases where the system is in a hung state, and is not accepting keyboard
interrupts, using NMI button for triggering Kdump can be very useful. NMI
button is present on most of the newer x86 and x86_64 machines. Please refer
to the User guides/manuals to locate the button, though in most occasions it
is not very well documented. In most cases it is hidden behind a small hole
on the front or back panel of the machine. You could use a toothpick or some
other non-conducting probe to press the button.

For example, on the IBM X series 366 machine, the NMI button is located behind
a small hole on the bottom center of the rear panel.

To enable this method of dump triggering using NMI button, you will need to set
the 'unknown_nmi_panic' option as follows:

# echo 1 > /proc/sys/kernel/unknown_nmi_panic

5) PowerPC specific methods:

On IBM PowerPC machines, issuing a soft reset invokes the XMON debugger(if
XMON is configured). To configure XMON one needs to compile the kernel with
the CONFIG_XMON and CONFIG_XMON_DEFAULT options, or by compiling with
CONFIG_XMON and booting the kernel with xmon=on option.

Following are the ways to remotely issue a soft reset on PowerPC boxes, which
would drop you to XMON. Pressing a 'X' (capital alphabet X) followed by an
'Enter' here will trigger the dump.

5.1) HMC

Hardware Management Console(HMC) available on Power4 and Power5 machines allow
partitions to be reset remotely. This is specially useful in hang situations
where the system is not accepting any keyboard inputs.

Once you have HMC configured, the following steps will enable you to trigger
Kdump via a soft reset:

On Power4
  Using GUI

* In the right pane, right click on the partition you wish to dump.
    * Select "Operating System->Reset".
    * Select "Soft Reset".
    * Select "Yes".

Using HMC Commandline

# reset_partition -m <machine> -p <partition> -t soft

On Power5
  Using GUI

* In the right pane, right click on the partition you wish to dump.
    * Select "Restart Partition".
    * Select "Dump".
    * Select "OK".

Using HMC Commandline

# chsysstate -m <managed system name> -n <lpar name> -o dumprestart -r lpar

5.2) Blade Management Console for Blade Center

To initiate a dump operation, go to Power/Restart option under "Blade Tasks" in
the Blade Management Console. Select the corresponding blade for which you want
to initate the dump and then click "Restart blade with NMI". This issues a
system reset and invokes xmon debugger.

Advanced Setups:

In addition to being able to capture a vmcore to your system's local file
system, kdump can be configured to capture a vmcore to a number of other
locations, including a raw disk partition, a dedicated file system, an NFS
mounted file system, or a remote system via ssh/scp. Additional options
exist for specifying the relative path under which the dump is captured,
what to do if the capture fails, and for compressing and filtering the dump
(so as to produce smaller, more manageable, vmcore files).

In theory, dumping to a location other than the local file system should be
safer than kdump's default setup, as its possible the default setup will try
dumping to a file system that has become corrupted. The raw disk partition and
dedicated file system options allow you to still dump to the local system,
but without having to remount your possibly corrupted file system(s),
thereby decreasing the chance a vmcore won't be captured. Dumping to an
NFS server or remote system via ssh/scp also has this advantage, as well
as allowing for the centralization of vmcore files, should you have several
systems from which you'd like to obtain vmcore files. Of course, note that
these configurations could present problems if your network is unreliable.

Advanced setups are configured via modifications to /etc/kdump.conf,
which out of the box, is fairly well documented itself. Any alterations to
/etc/kdump.conf should be followed by a restart of the kdump service, so
the changes can be incorporated in the kdump initrd. Restarting the kdump
service is as simple as '/sbin/systemctl restart kdump.service'.

Note that kdump.conf is used as a configuration mechanism for capturing dump
files from the initramfs (in the interests of safety), the root file system is
mounted, and the init process is started, only as a last resort if the
initramfs fails to capture the vmcore.  As such, configuration made in
/etc/kdump.conf is only applicable to capture recorded in the initramfs.  If
for any reason the init process is started on the root file system, only a
simple copying of the vmcore from /proc/vmcore to /var/crash/$DATE/vmcore will
be preformed.

For both local filesystem and nfs dump the dump target must be mounted before
building kdump initramfs. That means one needs to put an entry for the dump
file system in /etc/fstab so that after reboot when kdump service starts,
it can find the dump target and build initramfs instead of failing.
Usually the dump target should be used only for kdump. If you worry about
someone uses the filesystem for something else other than dumping vmcore
you can mount it as read-only. Mkdumprd will still remount it as read-write
for creating dump directory and will move it back to read-only afterwards.

Raw partition

Raw partition dumping requires that a disk partition in the system, at least
as large as the amount of memory in the system, be left unformatted. Assuming
/dev/vg/lv_kdump is left unformatted, kdump.conf can be configured with
'raw /dev/vg/lv_kdump', and the vmcore file will be copied via dd directly
onto partition /dev/vg/lv_kdump. Restart the kdump service via
'/sbin/systemctl restart kdump.service' to commit this change to your kdump
initrd. Dump target should be persistent device name, such as lvm or device
mapper canonical name.

Dedicated file system

Similar to raw partition dumping, you can format a partition with the file
system of your choice, Again, it should be at least as large as the amount
of memory in the system. Assuming it should be at least as large as the
amount of memory in the system. Assuming /dev/vg/lv_kdump has been
formatted ext4, specify 'ext4 /dev/vg/lv_kdump' in kdump.conf, and a
vmcore file will be copied onto the file system after it has been mounted.
Dumping to a dedicated partition has the advantage that you can dump multiple
vmcores to the file system, space permitting, without overwriting previous ones,
as would be the case in a raw partition setup. Restart the kdump service via
'/sbin/systemctl restart kdump.service' to commit this change to
your kdump initrd.  Note that for local file systems ext4 and ext2 are
supported as dumpable targets.  Kdump will not prevent you from specifying
other filesystems, and they will most likely work, but their operation
cannot be guaranteed.  for instance specifying a vfat filesystem or msdos
filesystem will result in a successful load of the kdump service, but during
crash recovery, the dump will fail if the system has more than 2GB of memory
(since vfat and msdos filesystems do not support more than 2GB files).
Be careful of your filesystem selection when using this target.

It is recommended to use persistent device names or UUID/LABEL for file system
dumps. One example of persistent device is /dev/vg/<devname>.

NFS mount

Dumping over NFS requires an NFS server configured to export a file system
with full read/write access for the root user. All operations done within
the kdump initial ramdisk are done as root, and to write out a vmcore file,
we obviously must be able to write to the NFS mount. Configuring an NFS
server is outside the scope of this document, but either the no_root_squash
or anonuid options on the NFS server side are likely of interest to permit
the kdump initrd operations write to the NFS mount as root.

Assuming your're exporting /dump on the machine nfs-server.example.com,
once the mount is properly configured, specify it in kdump.conf, via
'nfs nfs-server.example.com:/dump'. The server portion can be specified either
by host name or IP address. Following a system crash, the kdump initrd will
mount the NFS mount and copy out the vmcore to your NFS server. Restart the
kdump service via '/sbin/systemctl restart kdump.service' to commit this change
to your kdump initrd.

Special mount via "dracut_args"

You can utilize "dracut_args" to pass "--mount" to kdump, see dracut manpage
about the format of "--mount" for details. If there is any "--mount" specified
via "dracut_args", kdump will build it as the mount target without doing any
validation (mounting or checking like mount options, fs size, save path, etc),
so you must test it to ensure all the correctness. You cannot use other targets
in /etc/kdump.conf if you use "--mount" in "dracut_args". You also cannot specify
mutliple "--mount" targets via "dracut_args".

One use case of "--mount" in "dracut_args" is you do not want to mount dump target
before kdump service startup, for example, to reduce the burden of the shared nfs
server. Such as the example below:
dracut_args --mount "192.168.1.1:/share /mnt/test nfs4 defaults"

NOTE:
- <mountpoint> must be specified as an absolute path.

Remote system via ssh/scp

Dumping over ssh/scp requires setting up passwordless ssh keys for every
machine you wish to have dump via this method. First up, configure kdump.conf
for ssh/scp dumping, adding a config line of 'ssh user@server', where 'user'
can be any user on the target system you choose, and 'server' is the host
name or IP address of the target system. Using a dedicated, restricted user
account on the target system is recommended, as there will be keyless ssh
access to this account.

Once kdump.conf is appropriately configured, issue the command
'kdumpctl propagate' to automatically set up the ssh host keys and transmit
the necessary bits to the target server. You'll have to type in 'yes'
to accept the host key for your targer server if this is the first time
you've connected to it, and then input the target system user's password
to send over the necessary ssh key file. Restart the kdump service via
'/sbin/systemctl restart kdump.service' to commit this change to your kdump initrd.

Path
====
"path" represents the file system path in which vmcore will be saved. In
fact kdump creates a directory $hostip-$date with-in "path" and saves
vmcore there. So practically dump is saved in $path/$hostip-$date/. To
simplify discussion further, if we say dump will be saved in $path, it
is implied that kdump will create another directory inside path and
save vmcore there.

If a dump target is specified in kdump.conf, then "path" is relative to the
specified dump target. For example, if dump target is "ext4 /dev/sda", then
dump will be saved in "$path" directory on /dev/sda.

Same is the case for nfs dump. If user specified "nfs foo.com:/export/tmp/"
as dump target, then dump will effectively be saved in
"foo.com:/export/tmp/var/crash/" directory.

Interpretation of path changes a bit if user has not specified a dump
target explicitly in kdump.conf. In this case, "path" represents the
absolute path from root. And dump target and adjusted path are arrived
at automatically depending on what's mounted in the current system.

Following are few examples.

path /var/crash/
----------------
Assuming there is no disk mounted on /var/ or on /var/crash, dump will
be saved on disk backing rootfs in directory /var/crash.

path /var/crash/ (A separate disk mounted on /var)
--------------------------------------------------
Say a disk /dev/sdb is mouted on /var. In this case dump target will
become /dev/sdb and path will become "/crash" and dump will be saved
on "sdb:/crash/" directory.

path /var/crash/ (NFS mounted on /var)
-------------------------------------
Say foo.com:/export/tmp is mounted on /var. In this case dump target is
nfs server and path will be adjusted to "/crash" and dump will be saved to
foo.com:/export/tmp/crash/ directory.

Kdump boot directory
====================
Usually kdump kernel is the same as 1st kernel. So kdump will try to find
kdump kernel under /boot according to /proc/cmdline. E.g we execute below
command and get an output:
	cat /proc/cmdline
	BOOT_IMAGE=/xxx/vmlinuz-3.yyy.zzz  root=xxxx .....
Then kdump kernel will be /boot/xxx/vmlinuz-3.yyy.zzz.
However a variable KDUMP_BOOTDIR in /etc/sysconfig/kdump is provided to
user if kdump kernel is put in a different directory.

Kdump Post-Capture Executable

It is possible to specify a custom script or binary you wish to run following
an attempt to capture a vmcore. The executable is passed an exit code from
the capture process, which can be used to trigger different actions from
within your post-capture executable.

Kdump Pre-Capture Executable

It is possible to specify a custom script or binary you wish to run before
capturing a vmcore. Exit status of this binary is interpreted:
0 - continue with dump process as usual
non 0 - reboot the system

Extra Binaries

If you have specific binaries or scripts you want to have made available
within your kdump initrd, you can specify them by their full path, and they
will be included in your kdump initrd, along with all dependent libraries.
This may be particularly useful for those running post-capture scripts that
rely on other binaries.

Extra Modules

By default, only the bare minimum of kernel modules will be included in your
kdump initrd. Should you wish to capture your vmcore files to a non-boot-path
storage device, such as an iscsi target disk or clustered file system, you may
need to manually specify additional kernel modules to load into your kdump
initrd.

Default action
==============
Default action specifies what to do when dump to configured dump target
fails. By default, default action is "reboot" and that is system reboots
if attempt to save dump to dump target fails.

There are other default actions available though.

- dump_to_rootfs
	This option tries to mount root and save dump on root filesystem
	in a path specified by "path". This option will generally make
	sense when dump target is not root filesystem. For example, if
	dump is being saved over network using "ssh" then one can specify
	default to "dump_to_rootfs" to try saving dump to root filesystem
	if dump over network fails.

- shell
	Drop into a shell session inside initramfs.
- halt
	Halt system after failure
- poweroff
	Poweroff system after failure.

Compression and filtering

The 'core_collector' parameter in kdump.conf allows you to specify a custom
dump capture method. The most common alternate method is makedumpfile, which
is a dump filtering and compression utility provided with kexec-tools. On
some architectures, it can drastically reduce the size of your vmcore files,
which becomes very useful on systems with large amounts of memory.

A typical setup is 'core_collector makedumpfile -F -l --message-level 1 -d 31',
but check the output of '/sbin/makedumpfile --help' for a list of all available
options (-i and -g don't need to be specified, they're automatically taken care
of). Note that use of makedumpfile requires that the kernel-debuginfo package
corresponding with your running kernel be installed.

Core collector command format depends on dump target type. Typically for
filesystem (local/remote), core_collector should accept two arguments.
First one is source file and second one is target file. For ex.

ex1.
---
core_collector "cp --sparse=always"

Above will effectively be translated to:

cp --sparse=always /proc/vmcore <dest-path>/vmcore

ex2.
---
core_collector "makedumpfile -l --message-level 1 -d 31"

Above will effectively be translated to:

makedumpfile -l --message-level 1 -d 31 /proc/vmcore <dest-path>/vmcore

For dump targets like raw and ssh, in general, core collector should expect
one argument (source file) and should output the processed core on standard
output (There is one exception of "scp", discussed later). This standard
output will be saved to destination using appropriate commands.

raw dumps core_collector examples:
---------
ex3.
---
core_collector "cat"