| Cluster-wide Power-up/power-down race avoidance algorithm
 | |
| =========================================================
 | |
| 
 | |
| This file documents the algorithm which is used to coordinate CPU and
 | |
| cluster setup and teardown operations and to manage hardware coherency
 | |
| controls safely.
 | |
| 
 | |
| The section "Rationale" explains what the algorithm is for and why it is
 | |
| needed.  "Basic model" explains general concepts using a simplified view
 | |
| of the system.  The other sections explain the actual details of the
 | |
| algorithm in use.
 | |
| 
 | |
| 
 | |
| Rationale
 | |
| ---------
 | |
| 
 | |
| In a system containing multiple CPUs, it is desirable to have the
 | |
| ability to turn off individual CPUs when the system is idle, reducing
 | |
| power consumption and thermal dissipation.
 | |
| 
 | |
| In a system containing multiple clusters of CPUs, it is also desirable
 | |
| to have the ability to turn off entire clusters.
 | |
| 
 | |
| Turning entire clusters off and on is a risky business, because it
 | |
| involves performing potentially destructive operations affecting a group
 | |
| of independently running CPUs, while the OS continues to run.  This
 | |
| means that we need some coordination in order to ensure that critical
 | |
| cluster-level operations are only performed when it is truly safe to do
 | |
| so.
 | |
| 
 | |
| Simple locking may not be sufficient to solve this problem, because
 | |
| mechanisms like Linux spinlocks may rely on coherency mechanisms which
 | |
| are not immediately enabled when a cluster powers up.  Since enabling or
 | |
| disabling those mechanisms may itself be a non-atomic operation (such as
 | |
| writing some hardware registers and invalidating large caches), other
 | |
| methods of coordination are required in order to guarantee safe
 | |
| power-down and power-up at the cluster level.
 | |
| 
 | |
| This document describes a coherent-memory-based protocol for performing
 | |
| the needed coordination.  It aims to be as
 | |
| lightweight as possible, while providing the required safety properties.
 | |
| 
 | |
| 
 | |
| Basic model
 | |
| -----------
 | |
| 
 | |
| Each cluster and CPU is assigned a state, as follows:
 | |
| 
 | |
| 	DOWN
 | |
| 	COMING_UP
 | |
| 	UP
 | |
| 	GOING_DOWN
 | |
| 
 | |
| 	    +---------> UP ----------+
 | |
| 	    |                        v
 | |
| 
 | |
| 	COMING_UP                GOING_DOWN
 | |
| 
 | |
| 	    ^                        |
 | |
| 	    +--------- DOWN <--------+
 | |
| 
 | |
| 
 | |
| DOWN:	The CPU or cluster is not coherent, and is either powered off or
 | |
| 	suspended, or is ready to be powered off or suspended.
 | |
| 
 | |
| COMING_UP: The CPU or cluster has committed to moving to the UP state.
 | |
| 	It may be part way through the process of initialisation and
 | |
| 	enabling coherency.
 | |
| 
 | |
| UP:	The CPU or cluster is active and coherent at the hardware
 | |
| 	level.  A CPU in this state is not necessarily being used
 | |
| 	actively by the kernel.
 | |
| 
 | |
| GOING_DOWN: The CPU or cluster has committed to moving to the DOWN
 | |
| 	state.  It may be part way through the process of teardown and
 | |
| 	coherency exit.
 | |
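
The four states and the cycle between them can be written down as a small
lookup table.  This is only an illustrative sketch of the basic model: the
names State, NEXT_STATE, advance() and is_coherent() are hypothetical and
do not appear in the kernel sources.

```python
from enum import Enum


class State(Enum):
    DOWN = "DOWN"
    COMING_UP = "COMING_UP"
    UP = "UP"
    GOING_DOWN = "GOING_DOWN"


# Each state has exactly one successor, forming the cycle in the diagram.
NEXT_STATE = {
    State.DOWN: State.COMING_UP,
    State.COMING_UP: State.UP,
    State.UP: State.GOING_DOWN,
    State.GOING_DOWN: State.DOWN,
}


def advance(state):
    """Return the only state that may follow `state` in the basic model."""
    return NEXT_STATE[state]


def is_coherent(state):
    """Only UP is described as coherent; DOWN is explicitly not coherent,
    and the two intermediate states may be part way through a change."""
    return state is State.UP
```

Starting from State.DOWN, four calls to advance() visit COMING_UP, UP and
GOING_DOWN in that order and then return to DOWN.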
| 
 | |
| 
 | |
| Each CPU has one of these states assigned to it at any point in time.
 | |
| The CPU states are described in the "CPU state" section, below.
 | |
| 
 | |
| Each cluster is also assigned a state, but it is necessary to split the
 | |
| state value into two parts (the "cluster" state and "inbound" state) and
 | |
| to introduce additional states in order to avoid races between different
 | |
| CPUs in the cluster simultaneously modifying the state.  The cluster-
 | |
| level states are described in the "Cluster state" section.
 | |
| 
 | |
| To help distinguish the CPU states from cluster states in this
 | |
| discussion, the state names are given a CPU_ prefix for the CPU states,
 | |
| and a CLUSTER_ or INBOUND_ prefix for the cluster states.
 | |
| 
 | |
| 
 | |
| CPU state
 | |
| ---------
 | |
| 
 | |
| In this algorithm, each individual core in a multi-core processor is
 | |
| referred to as a "CPU".  CPUs are assumed to be single-threaded:
 | |
| therefore, a CPU can only be doing one thing at a single point in time.
 | |
| 
 | |
| This means that CPUs fit the basic model closely.
 | |
| 
 | |
| The algorithm defines the following states for each CPU in the system:
 | |
| 
 | |
| 	CPU_DOWN
 | |
| 	CPU_COMING_UP
 | |
| 	CPU_UP
 | |
| 	CPU_GOING_DOWN
 | |
| 
 | |
| 	 cluster setup and
 | |
| 	CPU setup complete          policy decision
 | |
| 	      +-----------> CPU_UP ------------+
 | |
| 	      |                                v
 | |
| 
 | |
| 	CPU_COMING_UP                   CPU_GOING_DOWN
 | |
| 
 | |
| 	      ^                                |
 | |
| 	      +----------- CPU_DOWN <----------+
 | |
| 	 policy decision           CPU teardown complete
 | |
| 	or hardware event
 | |
| 
 | |
| 
 | |
| The definitions of the four states correspond closely to the states of
 | |
| the basic model.
 | |
| 
 | |
| Transitions between states occur as follows.
 | |
| 
 | |
| A trigger event (spontaneous) means that the CPU can transition to the
 | |
| next state as a result of making local progress only, with no
 | |
| requirement for any external event to happen.
 | |
| 
 | |
| 
 | |
| CPU_DOWN:
 | |
| 
 | |
| 	A CPU reaches the CPU_DOWN state when it is ready for
 | |
| 	power-down.  On reaching this state, the CPU will typically
 | |
| 	power itself down or suspend itself, via a WFI instruction or a
 | |
| 	firmware call.
 | |
| 
 | |
| 	Next state:	CPU_COMING_UP
 | |
| 	Conditions:	none
 | |
| 
 | |
| 	Trigger events:
 | |
| 
 | |
| 		a) an explicit hardware power-up operation, resulting
 | |
| 		   from a policy decision on another CPU;
 | |
| 
 | |
| 		b) a hardware event, such as an interrupt.
 | |
| 
 | |
| 
 | |
| CPU_COMING_UP:
 | |
| 
 | |
| 	A CPU cannot start participating in hardware coherency until the
 | |
| 	cluster is set up and coherent.  If the cluster is not ready,
 | |
| 	then the CPU will wait in the CPU_COMING_UP state until the
 | |
| 	cluster has been set up.
 | |
| 
 | |
| 	Next state:	CPU_UP
 | |
| 	Conditions:	The CPU's parent cluster must be in CLUSTER_UP.
 | |
| 	Trigger events:	Transition of the parent cluster to CLUSTER_UP.
 | |
| 
 | |
| 	Refer to the "Cluster state" section for a description of the
 | |
| 	CLUSTER_UP state.
 | |
| 
 | |
| 
 | |
| CPU_UP:
 | |
| 	When a CPU reaches the CPU_UP state, it is safe for the CPU to
 | |
| 	start participating in local coherency.
 | |
| 
 | |
| 	This is done by jumping to the kernel's CPU resume code.
 | |
| 
 | |
| 	Note that the definition of this state is slightly different
 | |
| 	from the basic model definition: CPU_UP does not mean that the
 | |
| 	CPU is coherent yet, but it does mean that it is safe to resume
 | |
| 	the kernel.  The kernel handles the rest of the resume
 | |
| 	procedure, so the remaining steps are not visible as part of the
 | |
| 	race avoidance algorithm.
 | |
| 
 | |
| 	The CPU remains in this state until an explicit policy decision
 | |
| 	is made to shut down or suspend the CPU.
 | |
| 
 | |
| 	Next state:	CPU_GOING_DOWN
 | |
| 	Conditions:	none
 | |
| 	Trigger events:	explicit policy decision
 | |
| 
 | |
| 
 | |
| CPU_GOING_DOWN:
 | |
| 
 | |
| 	While in this state, the CPU exits coherency, including any
 | |
| 	operations required to achieve this (such as cleaning data
 | |
| 	caches).
 | |
| 
 | |
| 	Next state:	CPU_DOWN
 | |
| 	Conditions:	local CPU teardown complete
 | |
| 	Trigger events:	(spontaneous)
 | |
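
The CPU transitions above can be summarised as a single step function.
This is a minimal sketch, assuming the trigger events and conditions are
passed in as boolean flags; cpu_step() and its parameter names are
hypothetical and are not part of the kernel implementation.

```python
from enum import Enum


class CpuState(Enum):
    CPU_DOWN = 0
    CPU_COMING_UP = 1
    CPU_UP = 2
    CPU_GOING_DOWN = 3


def cpu_step(state, *, woken=False, cluster_up=False,
             shutdown_requested=False, teardown_complete=False):
    """Return the next CPU state, or `state` unchanged if the condition
    or trigger event for leaving it has not occurred yet."""
    if state is CpuState.CPU_DOWN:
        # Trigger: explicit power-up operation or a hardware event.
        return CpuState.CPU_COMING_UP if woken else state
    if state is CpuState.CPU_COMING_UP:
        # Condition: the parent cluster must be in CLUSTER_UP.
        return CpuState.CPU_UP if cluster_up else state
    if state is CpuState.CPU_UP:
        # Trigger: explicit policy decision to shut down or suspend.
        return CpuState.CPU_GOING_DOWN if shutdown_requested else state
    # CPU_GOING_DOWN: spontaneous once local teardown is complete.
    return CpuState.CPU_DOWN if teardown_complete else state
```

Note that a CPU in CPU_COMING_UP stays there for as long as cluster_up is
false, which models the wait for the cluster to be set up.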
| 
 | |
| 
 | |
| Cluster state
 | |
| -------------
 | |
| 
 | |
| A cluster is a group of connected CPUs with some common resources.
 | |
| Because a cluster contains multiple CPUs, it can be doing multiple
 | |
| things at the same time.  This has some implications.  In particular, a
 | |
| CPU can start up while another CPU is tearing the cluster down.
 | |
| 
 | |
| In this discussion, the "outbound side" is the view of the cluster state
 | |
| as seen by a CPU tearing the cluster down.  The "inbound side" is the
 | |
| view of the cluster state as seen by a CPU setting the cluster up.
 | |
| 
 | |
| In order to enable safe coordination in such situations, it is important
 | |
| that a CPU which is setting up the cluster can advertise its state
 | |
| independently of the CPU which is tearing down the cluster.  For this
 | |
| reason, the cluster state is split into two parts:
 | |
| 
 | |
| 	"cluster" state: The global state of the cluster; or the state
 | |
| 		on the outbound side:
 | |
| 
 | |
| 		CLUSTER_DOWN
 | |
| 		CLUSTER_UP
 | |
| 		CLUSTER_GOING_DOWN
 | |
| 
 | |
| 	"inbound" state: The state of the cluster on the inbound side.
 | |
| 
 | |
| 		INBOUND_NOT_COMING_UP
 | |
| 		INBOUND_COMING_UP
 | |
| 
 | |
| 
 | |
| 	The different pairings of these states result in six possible
 | |
| 	states for the cluster as a whole:
 | |
| 
 | |
| 	                            CLUSTER_UP
 | |
| 	          +==========> INBOUND_NOT_COMING_UP -------------+
 | |
| 	          #                                               |
 | |
| 	                                                          |
 | |
| 	     CLUSTER_UP     <----+                                |
 | |
| 	  INBOUND_COMING_UP      |                                v
 | |
| 
 | |
| 	          ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
 | |
| 	          #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP
 | |
| 
 | |
| 	    CLUSTER_DOWN         |                                |
 | |
| 	  INBOUND_COMING_UP <----+                                |
 | |
| 	                                                          |
 | |
| 	          ^                                               |
 | |
| 	          +===========     CLUSTER_DOWN      <------------+
 | |
| 	                       INBOUND_NOT_COMING_UP
 | |
| 
 | |
| 	Transitions -----> can only be made by the outbound CPU, and
 | |
| 	only involve changes to the "cluster" state.
 | |
| 
 | |
| 	Transitions ===##> can only be made by the inbound CPU, and only
 | |
| 	involve changes to the "inbound" state, except where there is no
 | |
| 	further transition possible on the outbound side (i.e., the
 | |
| 	outbound CPU has put the cluster into the CLUSTER_DOWN state).
 | |
| 
 | |
| 	The race avoidance algorithm does not provide a way to determine
 | |
| 	which exact CPUs within the cluster play these roles.  This must
 | |
| 	be decided in advance by some other means.  Refer to the section
 | |
| 	"Last man and First man selection" for more explanation.
 | |
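
The two-part state and the ownership rules for the two kinds of transition
can be sketched as follows.  The check below only tests which part of the
state a given side is allowed to change; it does not test whether the
transition itself is one of the allowed ones.  The names ALL_STATES and
respects_ownership() are hypothetical, chosen for illustration only.

```python
from itertools import product

CLUSTER_STATES = ("CLUSTER_DOWN", "CLUSTER_UP", "CLUSTER_GOING_DOWN")
INBOUND_STATES = ("INBOUND_NOT_COMING_UP", "INBOUND_COMING_UP")

# Three "cluster" values paired with two "inbound" values: six states.
ALL_STATES = [f"{c}/{i}" for c, i in product(CLUSTER_STATES, INBOUND_STATES)]


def respects_ownership(side, old, new):
    """Check that `side` only changes the part of the state it owns."""
    old_cluster, old_inbound = old.split("/")
    new_cluster, new_inbound = new.split("/")
    if side == "outbound":
        # The outbound CPU may only change the "cluster" part.
        return new_inbound == old_inbound
    if side == "inbound":
        # The inbound CPU may only change the "inbound" part, unless the
        # outbound side has already reached CLUSTER_DOWN.
        return new_cluster == old_cluster or old_cluster == "CLUSTER_DOWN"
    raise ValueError(f"unknown side: {side!r}")
```

For example, the inbound side may move CLUSTER_DOWN/INBOUND_COMING_UP to
CLUSTER_UP/INBOUND_COMING_UP, but it may not change the "cluster" part
while the cluster is in CLUSTER_GOING_DOWN.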
| 
 | |
| 
 | |
| 	CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
 | |
| 	cluster can actually be powered down.
 | |
| 
 | |
| 	The parallelism of the inbound and outbound CPUs is observed by
 | |
| 	the existence of two different paths from CLUSTER_GOING_DOWN/
 | |
| 	INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
 | |
| 	model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
 | |
| 	COMING_UP in the basic model).  The second path avoids cluster
 | |
| 	teardown completely.
 | |
| 
 | |
| 	CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
 | |
| 	model.  The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
 | |
| 	is trivial and merely resets the state machine ready for the
 | |
| 	next cycle.
 | |
| 
 | |
| 	Details of the allowable transitions follow.
 | |
| 
 | |
| 	The next state in each case is notated
 | |
| 
 | |
| 		<cluster state>/<inbound state> (<transitioner>)
 | |
| 
 | |
| 	where the <transitioner> is the side on which the transition
 | |
| 	can occur; either the inbound or the outbound side.
 | |
| 
 | |
| 
 | |
| CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
 | |
| 
 | |
| 	Next state:	CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
 | |
| 	Conditions:	none
 | |
| 	Trigger events:
 | |
| 
 | |
| 		a) an explicit hardware power-up operation, resulting
 | |
| 		   from a policy decision on another CPU;
 | |
| 
 | |
| 		b) a hardware event, such as an interrupt.
 | |
| 
 | |
| 
 | |
| CLUSTER_DOWN/INBOUND_COMING_UP:
 | |
| 
 | |
| 	In this state, an inbound CPU sets up the cluster, including
 | |
| 	enabling of hardware coherency at the cluster level and any
 | |
| 	other operations (such as cache invalidation) which are required
 | |
| 	in order to achieve this.
 | |
| 
 | |
| 	The purpose of this state is to do sufficient cluster-level
 | |
| 	setup to enable other CPUs in the cluster to enter coherency
 | |
| 	safely.
 | |
| 
 | |
| 	Next state:	CLUSTER_UP/INBOUND_COMING_UP (inbound)
 | |
| 	Conditions:	cluster-level setup and hardware coherency complete
 | |
| 	Trigger events:	(spontaneous)
 | |
| 
 | |
| 
 | |
| CLUSTER_UP/INBOUND_COMING_UP:
 | |
| 
 | |
| 	Cluster-level setup is complete and hardware coherency is
 | |
| 	enabled for the cluster.  Other CPUs in the cluster can safely
 | |
| 	enter coherency.
 | |
| 
 | |
| 	This is a transient state, leading immediately to
 | |
| 	CLUSTER_UP/INBOUND_NOT_COMING_UP.  All other CPUs on the cluster
 | |
| 	should treat these two states as equivalent.
 | |
| 
 | |
| 	Next state:	CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
 | |
| 	Conditions:	none
 | |
| 	Trigger events:	(spontaneous)
 | |
| 
 | |
| 
 | |
| CLUSTER_UP/INBOUND_NOT_COMING_UP:
 | |
| 
 | |
| 	Cluster-level setup is complete and hardware coherency is
 | |
| 	enabled for the cluster.  Other CPUs in the cluster can safely
 | |
| 	enter coherency.
 | |
| 
 | |
| 	The cluster will remain in this state until a policy decision is
 | |
| 	made to power the cluster down.
 | |
| 
 | |
| 	Next state:	CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
 | |
| 	Conditions:	none
 | |
| 	Trigger events:	policy decision to power down the cluster
 | |
| 
 | |
| 
 | |
| CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:
 | |
| 
 | |
| 	An outbound CPU is tearing the cluster down.  The selected CPU
 | |
| 	must wait in this state until all CPUs in the cluster are in the
 | |
| 	CPU_DOWN state.
 | |
| 
 | |
| 	When all CPUs are in the CPU_DOWN state, the cluster can be torn
 | |
| 	down, for example by cleaning data caches and exiting
 | |
| 	cluster-level coherency.
 | |
| 
 | |
| 	To avoid unnecessary teardown operations, the outbound CPU
 | |
| 	should check the inbound cluster state for asynchronous
 | |
| 	transitions to INBOUND_COMING_UP.  Alternatively, individual
 | |
| 	CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.
 | |
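
The check described above might look like the sketch below.  The text
offers the two checks as alternatives; combining them in one function is
an assumption made here purely for illustration, and
outbound_should_abort() is a hypothetical name.

```python
def outbound_should_abort(inbound_state, cpu_states):
    """Return True if the outbound CPU has a reason to skip teardown.

    inbound_state: the "inbound" part of the cluster state.
    cpu_states: the states of the other CPUs in the cluster.
    """
    # First option: an inbound CPU has advertised that it is coming up.
    if inbound_state == "INBOUND_COMING_UP":
        return True
    # Alternative: some individual CPU has started, or finished, coming up.
    return any(s in ("CPU_COMING_UP", "CPU_UP") for s in cpu_states)
```

A result of True only means that teardown would probably be wasted work;
what the outbound CPU does next is described under
CLUSTER_GOING_DOWN/INBOUND_COMING_UP.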
| 
 | |
| 
 | |
| 	Next states:
 | |
| 
 | |
| 	CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
 | |
| 		Conditions:	cluster torn down and ready to power off
 | |
| 		Trigger events:	(spontaneous)
 | |
| 
 | |
| 	CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
 | |
| 		Conditions:	none
 | |
| 		Trigger events:
 | |
| 
 | |
| 			a) an explicit hardware power-up operation,
 | |
| 			   resulting from a policy decision on another
 | |
| 			   CPU;
 | |
| 
 | |
| 			b) a hardware event, such as an interrupt.
 | |
| 
 | |
| 
 | |
| CLUSTER_GOING_DOWN/INBOUND_COMING_UP:
 | |
| 
 | |
| 	The cluster is (or was) being torn down, but another CPU has
 | |
| 	come online in the meantime and is trying to set up the cluster
 | |
| 	again.
 | |
| 
 | |
| 	If the outbound CPU observes this state, it has two choices:
 | |
| 
 | |
| 		a) back out of teardown, restoring the cluster to the
 | |
| 		   CLUSTER_UP state;
 | |
| 
 | |
| 		b) finish tearing the cluster down and put the cluster
 | |
| 		   in the CLUSTER_DOWN state; the inbound CPU will
 | |
| 		   set up the cluster again from there.
 | |
| 
 | |
| 	Choice (a) permits the removal of some latency by avoiding
 | |
| 	unnecessary teardown and setup operations in situations where
 | |
| 	the cluster is not really going to be powered down.
 | |
| 
 | |
| 
 | |
| 	Next states:
 | |
| 
 | |
| 	CLUSTER_UP/INBOUND_COMING_UP (outbound)
 | |
| 		Conditions:	cluster-level setup and hardware
 | |
| 				coherency complete
 | |
| 		Trigger events:	(spontaneous)
 | |
| 
 | |
| 	CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
 | |
| 		Conditions:	cluster torn down and ready to power off
 | |
| 		Trigger events:	(spontaneous)
 | |
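
The allowable transitions listed in this section can be collected into one
table, keyed by the current state and annotated with the side that makes
each transition.  This is a sketch for checking the state machine on
paper, not the kernel's data structure; TRANSITIONS, next_states() and
can_power_off() are hypothetical names.

```python
# current state -> list of (next state, transitioner)
TRANSITIONS = {
    "CLUSTER_DOWN/INBOUND_NOT_COMING_UP": [
        ("CLUSTER_DOWN/INBOUND_COMING_UP", "inbound"),
    ],
    "CLUSTER_DOWN/INBOUND_COMING_UP": [
        ("CLUSTER_UP/INBOUND_COMING_UP", "inbound"),
    ],
    "CLUSTER_UP/INBOUND_COMING_UP": [
        ("CLUSTER_UP/INBOUND_NOT_COMING_UP", "inbound"),
    ],
    "CLUSTER_UP/INBOUND_NOT_COMING_UP": [
        ("CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP", "outbound"),
    ],
    "CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP": [
        ("CLUSTER_DOWN/INBOUND_NOT_COMING_UP", "outbound"),
        ("CLUSTER_GOING_DOWN/INBOUND_COMING_UP", "inbound"),
    ],
    "CLUSTER_GOING_DOWN/INBOUND_COMING_UP": [
        ("CLUSTER_UP/INBOUND_COMING_UP", "outbound"),    # back out
        ("CLUSTER_DOWN/INBOUND_COMING_UP", "outbound"),  # finish teardown
    ],
}


def next_states(state, side):
    """List the states that `side` may move the cluster to from `state`."""
    return [new for new, who in TRANSITIONS[state] if who == side]


def can_power_off(state):
    """Only one combined state allows the cluster to be powered down."""
    return state == "CLUSTER_DOWN/INBOUND_NOT_COMING_UP"
```

In CLUSTER_GOING_DOWN/INBOUND_COMING_UP the inbound side has no transition
of its own: next_states() returns an empty list for it, so the inbound CPU
has to wait for the outbound CPU to pick one of its two choices.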
| 
 | |
| 
 | |
| Last man and First man selection
 | |
| --------------------------------
 | |
| 
 | |
| The CPU which performs cluster tear-down operations on the outbound side
 | |
| is commonly referred to as the "last man".
 | |
| 
 | |
| The CPU which performs cluster setup on the inbound side is commonly
 | |
| referred to as the "first man".
 | |
| 
 | |
| The race avoidance algorithm documented above does not provide a
 | |
| mechanism to choose which CPUs should play these roles.
 | |
| 
 | |
| 
 | |
| Last man:
 | |
| 
 | |
| When shutting down the cluster, all the CPUs involved are initially
 | |
| executing Linux and hence coherent.  Therefore, ordinary spinlocks can
 | |
| be used to select a last man safely, before the CPUs become
 | |
| non-coherent.
 | |
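
One way to select a last man under an ordinary lock is to keep a count of
the CPUs still running and let the CPU which brings it to zero take the
role.  The sketch below simulates this with threads standing in for CPUs
and threading.Lock standing in for a kernel spinlock.  The use count and
all the names here are assumptions for illustration; the text above only
states that spinlocks can be used.

```python
import threading


class ClusterUseCount:
    """Count of CPUs still up in one cluster, protected by a lock."""

    def __init__(self, n_cpus):
        self.lock = threading.Lock()  # stands in for a kernel spinlock
        self.cpus_up = n_cpus

    def cpu_going_down(self):
        """Called by each CPU while still coherent.
        Returns True only for the CPU that becomes the last man."""
        with self.lock:
            self.cpus_up -= 1
            return self.cpus_up == 0


def simulate_shutdown(n_cpus):
    """Run one thread per CPU and collect each CPU's last-man result."""
    cluster = ClusterUseCount(n_cpus)
    results = []
    results_lock = threading.Lock()

    def cpu_thread():
        is_last = cluster.cpu_going_down()
        with results_lock:
            results.append(is_last)

    threads = [threading.Thread(target=cpu_thread) for _ in range(n_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Whatever order the threads run in, exactly one entry in the list returned
by simulate_shutdown() is True.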
| 
 | |
| 
 | |
| First man:
 | |
| 
 | |
| Because CPUs may power up asynchronously in response to external wake-up
 | |
| events, a dynamic mechanism is needed to make sure that only one CPU
 | |
| attempts to play the first man role and do the cluster-level
 | |
| initialisation: any other CPUs must wait for this to complete before
 | |
| proceeding.
 | |
| 
 | |
| Cluster-level initialisation may involve actions such as configuring
 | |
| coherency controls in the bus fabric.
 | |
| 
 | |
| The current implementation in mcpm_head.S uses a separate mutual exclusion
 | |
| mechanism to do this arbitration.  This mechanism is documented in
 | |
| detail in vlocks.txt.
 | |
| 
 | |
| 
 | |
| Features and Limitations
 | |
| ------------------------
 | |
| 
 | |
| Implementation:
 | |
| 
 | |
| 	The current ARM-based implementation is split between
 | |
| 	arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
 | |
| 	arch/arm/common/mcpm_entry.c (everything else):
 | |
| 
 | |
| 	__mcpm_cpu_going_down() signals the transition of a CPU to the
 | |
| 		CPU_GOING_DOWN state.
 | |
| 
 | |
| 	__mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
 | |
| 		state.
 | |
| 
 | |
| 	A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
 | |
| 		low-level power-up code in mcpm_head.S.  This could
 | |
| 		involve CPU-specific setup code, but in the current
 | |
| 		implementation it does not.
 | |
| 
 | |
| 	__mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
 | |
| 		handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
 | |
| 		and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
 | |
| 		the case of an aborted cluster power-down).
 | |
| 
 | |
| 		These functions are more complex than the __mcpm_cpu_*()
 | |
| 		functions due to the extra inter-CPU coordination which
 | |
| 		is needed for safe transitions at the cluster level.
 | |
| 
 | |
| 	A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
 | |
| 		the low-level power-up code in mcpm_head.S.  This
 | |
| 		typically involves platform-specific setup code,
 | |
| 		provided by the platform-specific power_up_setup
 | |
| 		function registered via mcpm_sync_init.
 | |
| 
 | |
| Deep topologies:
 | |
| 
 | |
| 	As currently described and implemented, the algorithm does not
 | |
| 	support CPU topologies involving more than two levels (i.e.,
 | |
| 	clusters of clusters are not supported).  The algorithm could be
 | |
| 	extended by replicating the cluster-level states for the
 | |
| 	additional topological levels, and modifying the transition
 | |
| 	rules for the intermediate (non-outermost) cluster levels.
 | |
| 
 | |
| 
 | |
| Colophon
 | |
| --------
 | |
| 
 | |
| Originally created and documented by Dave Martin for Linaro Limited, in
 | |
| collaboration with Nicolas Pitre and Achin Gupta.
 | |
| 
 | |
| Copyright (C) 2012-2013  Linaro Limited
 | |
| Distributed under the terms of Version 2 of the GNU General Public
 | |
| License, as defined in linux/COPYING.
 | 
