187 lines
		
	
	
		
			6.0 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			187 lines
		
	
	
		
			6.0 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
.. hwpoison:
 | 
						|
 | 
						|
========
 | 
						|
hwpoison
 | 
						|
========
 | 
						|
 | 
						|
What is hwpoison?
 | 
						|
=================
 | 
						|
 | 
						|
Upcoming Intel CPUs have support for recovering from some memory errors
 | 
						|
(``MCA recovery``). This requires the OS to declare a page "poisoned",
 | 
						|
kill the processes associated with it and avoid using it in the future.
 | 
						|
 | 
						|
This patchkit implements the necessary infrastructure in the VM.
 | 
						|
 | 
						|
To quote the overview comment:
 | 
						|
 | 
						|
 * High level machine check handler. Handles pages reported by the
 | 
						|
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 | 
						|
 * failure.
 | 
						|
 *
 | 
						|
 * This focusses on pages detected as corrupted in the background.
 | 
						|
 * When the current CPU tries to consume corruption the currently
 | 
						|
 * running process can just be killed directly instead. This implies
 | 
						|
 * that if the error cannot be handled for some reason it's safe to
 | 
						|
 * just ignore it because no corruption has been consumed yet. Instead
 | 
						|
 * when that happens another machine check will happen.
 | 
						|
 *
 | 
						|
 * Handles page cache pages in various states. The tricky part
 | 
						|
 * here is that we can access any page asynchronous to other VM
 | 
						|
 * users, because memory failures could happen anytime and anywhere,
 | 
						|
 * possibly violating some of their assumptions. This is why this code
 | 
						|
 * has to be extremely careful. Generally it tries to use normal locking
 | 
						|
 * rules, as in get the standard locks, even if that means the
 | 
						|
 * error handling takes potentially a long time.
 | 
						|
 *
 | 
						|
 * Some of the operations here are somewhat inefficient and have non
 | 
						|
 * linear algorithmic complexity, because the data structures have not
 | 
						|
 * been optimized for this case. This is in particular the case
 | 
						|
 * for the mapping from a vma to a process. Since this case is expected
 | 
						|
 * to be rare we hope we can get away with this.
 | 
						|
 | 
						|
The code consists of a the high level handler in mm/memory-failure.c,
 | 
						|
a new page poison bit and various checks in the VM to handle poisoned
 | 
						|
pages.
 | 
						|
 | 
						|
The main target right now is KVM guests, but it works for all kinds
 | 
						|
of applications. KVM support requires a recent qemu-kvm release.
 | 
						|
 | 
						|
For the KVM use there was need for a new signal type so that
 | 
						|
KVM can inject the machine check into the guest with the proper
 | 
						|
address. This in theory allows other applications to handle
 | 
						|
memory failures too. The expection is that near all applications
 | 
						|
won't do that, but some very specialized ones might.
 | 
						|
 | 
						|
Failure recovery modes
 | 
						|
======================
 | 
						|
 | 
						|
There are two (actually three) modes memory failure recovery can be in:
 | 
						|
 | 
						|
vm.memory_failure_recovery sysctl set to zero:
 | 
						|
	All memory failures cause a panic. Do not attempt recovery.
 | 
						|
	(on x86 this can be also affected by the tolerant level of the
 | 
						|
	MCE subsystem)
 | 
						|
 | 
						|
early kill
 | 
						|
	(can be controlled globally and per process)
 | 
						|
	Send SIGBUS to the application as soon as the error is detected
 | 
						|
	This allows applications who can process memory errors in a gentle
 | 
						|
	way (e.g. drop affected object)
 | 
						|
	This is the mode used by KVM qemu.
 | 
						|
 | 
						|
late kill
 | 
						|
	Send SIGBUS when the application runs into the corrupted page.
 | 
						|
	This is best for memory error unaware applications and default
 | 
						|
	Note some pages are always handled as late kill.
 | 
						|
 | 
						|
User control
 | 
						|
============
 | 
						|
 | 
						|
vm.memory_failure_recovery
 | 
						|
	See sysctl.txt
 | 
						|
 | 
						|
vm.memory_failure_early_kill
 | 
						|
	Enable early kill mode globally
 | 
						|
 | 
						|
PR_MCE_KILL
 | 
						|
	Set early/late kill mode/revert to system default
 | 
						|
 | 
						|
	arg1: PR_MCE_KILL_CLEAR:
 | 
						|
		Revert to system default
 | 
						|
	arg1: PR_MCE_KILL_SET:
 | 
						|
		arg2 defines thread specific mode
 | 
						|
 | 
						|
		PR_MCE_KILL_EARLY:
 | 
						|
			Early kill
 | 
						|
		PR_MCE_KILL_LATE:
 | 
						|
			Late kill
 | 
						|
		PR_MCE_KILL_DEFAULT
 | 
						|
			Use system global default
 | 
						|
 | 
						|
	Note that if you want to have a dedicated thread which handles
 | 
						|
	the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
 | 
						|
	call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise,
 | 
						|
	the SIGBUS is sent to the main thread.
 | 
						|
 | 
						|
PR_MCE_KILL_GET
 | 
						|
	return current mode
 | 
						|
 | 
						|
Testing
 | 
						|
=======
 | 
						|
 | 
						|
* madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the
 | 
						|
  process for testing
 | 
						|
 | 
						|
* hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/``
 | 
						|
 | 
						|
  corrupt-pfn
 | 
						|
	Inject hwpoison fault at PFN echoed into this file. This does
 | 
						|
	some early filtering to avoid corrupted unintended pages in test suites.
 | 
						|
 | 
						|
  unpoison-pfn
 | 
						|
	Software-unpoison page at PFN echoed into this file. This way
 | 
						|
	a page can be reused again.  This only works for Linux
 | 
						|
	injected failures, not for real memory failures.
 | 
						|
 | 
						|
  Note these injection interfaces are not stable and might change between
 | 
						|
  kernel versions
 | 
						|
 | 
						|
  corrupt-filter-dev-major, corrupt-filter-dev-minor
 | 
						|
	Only handle memory failures to pages associated with the file
 | 
						|
	system defined by block device major/minor.  -1U is the
 | 
						|
	wildcard value.  This should be only used for testing with
 | 
						|
	artificial injection.
 | 
						|
 | 
						|
  corrupt-filter-memcg
 | 
						|
	Limit injection to pages owned by memgroup. Specified by inode
 | 
						|
	number of the memcg.
 | 
						|
 | 
						|
	Example::
 | 
						|
 | 
						|
		mkdir /sys/fs/cgroup/mem/hwpoison
 | 
						|
 | 
						|
	        usemem -m 100 -s 1000 &
 | 
						|
		echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
 | 
						|
 | 
						|
		memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
 | 
						|
		echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
 | 
						|
 | 
						|
		page-types -p `pidof init`   --hwpoison  # shall do nothing
 | 
						|
		page-types -p `pidof usemem` --hwpoison  # poison its pages
 | 
						|
 | 
						|
  corrupt-filter-flags-mask, corrupt-filter-flags-value
 | 
						|
	When specified, only poison pages if ((page_flags & mask) ==
 | 
						|
	value).  This allows stress testing of many kinds of
 | 
						|
	pages. The page_flags are the same as in /proc/kpageflags. The
 | 
						|
	flag bits are defined in include/linux/kernel-page-flags.h and
 | 
						|
	documented in Documentation/admin-guide/mm/pagemap.rst
 | 
						|
 | 
						|
* Architecture specific MCE injector
 | 
						|
 | 
						|
  x86 has mce-inject, mce-test
 | 
						|
 | 
						|
  Some portable hwpoison test programs in mce-test, see below.
 | 
						|
 | 
						|
References
 | 
						|
==========
 | 
						|
 | 
						|
http://halobates.de/mce-lc09-2.pdf
 | 
						|
	Overview presentation from LinuxCon 09
 | 
						|
 | 
						|
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
 | 
						|
	Test suite (hwpoison specific portable tests in tsrc)
 | 
						|
 | 
						|
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
 | 
						|
	x86 specific injector
 | 
						|
 | 
						|
 | 
						|
Limitations
 | 
						|
===========
 | 
						|
- Not all page types are supported and never will. Most kernel internal
 | 
						|
  objects cannot be recovered, only LRU pages for now.
 | 
						|
- Right now hugepage support is missing.
 | 
						|
 | 
						|
---
 | 
						|
Andi Kleen, Oct 2009
 |