395 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			395 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| Ethernet switch device driver model (switchdev)
 | |
| ===============================================
 | |
| Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
 | |
| Copyright (c) 2014-2015 Scott Feldman <sfeldma@gmail.com>
 | |
| 
 | |
| 
 | |
| The Ethernet switch device driver model (switchdev) is an in-kernel driver
 | |
| model for switch devices which offload the forwarding (data) plane from the
 | |
| kernel.
 | |
| 
 | |
| Figure 1 is a block diagram showing the components of the switchdev model for
 | |
| an example setup using a data-center-class switch ASIC chip.  Other setups
 | |
| with SR-IOV or soft switches, such as OVS, are possible.
 | |
| 
 | |
| 
 | |
|                              User-space tools
 | |
| 
 | |
|        user space                   |
 | |
|       +-------------------------------------------------------------------+
 | |
|        kernel                       | Netlink
 | |
|                                     |
 | |
|                      +--------------+-------------------------------+
 | |
|                      |         Network stack                        |
 | |
|                      |           (Linux)                            |
 | |
|                      |                                              |
 | |
|                      +----------------------------------------------+
 | |
| 
 | |
|                            sw1p2     sw1p4     sw1p6
 | |
|                       sw1p1  +  sw1p3  +  sw1p5  +          eth1
 | |
|                         +    |    +    |    +    |            +
 | |
|                         |    |    |    |    |    |            |
 | |
|                      +--+----+----+----+----+----+---+  +-----+-----+
 | |
|                      |         Switch driver         |  |    mgmt   |
 | |
|                      |        (this document)        |  |   driver  |
 | |
|                      |                               |  |           |
 | |
|                      +--------------+----------------+  +-----------+
 | |
|                                     |
 | |
|        kernel                       | HW bus (eg PCI)
 | |
|       +-------------------------------------------------------------------+
 | |
|        hardware                     |
 | |
|                      +--------------+----------------+
 | |
|                      |         Switch device (sw1)   |
 | |
|                      |  +----+                       +--------+
 | |
|                      |  |    v offloaded data path   | mgmt port
 | |
|                      |  |    |                       |
 | |
|                      +--|----|----+----+----+----+---+
 | |
|                         |    |    |    |    |    |
 | |
|                         +    +    +    +    +    +
 | |
|                        p1   p2   p3   p4   p5   p6
 | |
| 
 | |
|                              front-panel ports
 | |
| 
 | |
| 
 | |
|                                     Fig 1.
 | |
| 
 | |
| 
 | |
| Include Files
 | |
| -------------
 | |
| 
 | |
| #include <linux/netdevice.h>
 | |
| #include <net/switchdev.h>
 | |
| 
 | |
| 
 | |
| Configuration
 | |
| -------------
 | |
| 
 | |
| Use "depends NET_SWITCHDEV" in driver's Kconfig to ensure switchdev model
 | |
| support is built for driver.
 | |
| 
 | |
| 
 | |
| Switch Ports
 | |
| ------------
 | |
| 
 | |
| On switchdev driver initialization, the driver will allocate and register a
 | |
| struct net_device (using register_netdev()) for each enumerated physical switch
 | |
| port, called the port netdev.  A port netdev is the software representation of
 | |
| the physical port and provides a conduit for control traffic to/from the
 | |
| controller (the kernel) and the network, as well as an anchor point for higher
 | |
| level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers.  Using
 | |
| standard netdev tools (iproute2, ethtool, etc), the port netdev can also
 | |
| provide to the user access to the physical properties of the switch port such
 | |
| as PHY link state and I/O statistics.
 | |
| 
 | |
| There is (currently) no higher-level kernel object for the switch beyond the
 | |
| port netdevs.  All of the switchdev driver ops are netdev ops or switchdev ops.
 | |
| 
 | |
| A switch management port is outside the scope of the switchdev driver model.
 | |
| Typically, the management port is not participating in offloaded data plane and
 | |
| is loaded with a different driver, such as a NIC driver, on the management port
 | |
| device.
 | |
| 
 | |
| Switch ID
 | |
| ^^^^^^^^^
 | |
| 
 | |
| The switchdev driver must implement the switchdev op switchdev_port_attr_get
 | |
| for SWITCHDEV_ATTR_ID_PORT_PARENT_ID for each port netdev, returning the same
 | |
| physical ID for each port of a switch.  The ID must be unique between switches
 | |
| on the same system.  The ID does not need to be unique between switches on
 | |
| different systems.
 | |
| 
 | |
| The switch ID is used to locate ports on a switch and to know if aggregated
 | |
| ports belong to the same switch.
 | |
| 
 | |
| Port Netdev Naming
 | |
| ^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Udev rules should be used for port netdev naming, using some unique attribute
 | |
| of the port as a key, for example the port MAC address or the port PHYS name.
 | |
| Hard-coding of kernel netdev names within the driver is discouraged; let the
 | |
| kernel pick the default netdev name, and let udev set the final name based on a
 | |
| port attribute.
 | |
| 
 | |
| Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
 | |
| useful for dynamically-named ports where the device names its ports based on
 | |
| external configuration.  For example, if a physical 40G port is split logically
 | |
| into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
 | |
| name for each port using port PHYS name.  The udev rule would be:
 | |
| 
 | |
| SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
 | |
| 	ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"
 | |
| 
 | |
| Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
 | |
| is the port name or ID, and Z is the sub-port name or ID.  For example, sw1p1s0
 | |
| would be sub-port 0 on port 1 on switch 1.
 | |
| 
 | |
| Port Features
 | |
| ^^^^^^^^^^^^^
 | |
| 
 | |
| NETIF_F_NETNS_LOCAL
 | |
| 
 | |
| If the switchdev driver (and device) only supports offloading of the default
 | |
| network namespace (netns), the driver should set this feature flag to prevent
 | |
| the port netdev from being moved out of the default netns.  A netns-aware
 | |
| driver/device would not set this flag and be responsible for partitioning
 | |
| hardware to preserve netns containment.  This means hardware cannot forward
 | |
| traffic from a port in one namespace to another port in another namespace.
 | |
| 
 | |
| Port Topology
 | |
| ^^^^^^^^^^^^^
 | |
| 
 | |
| The port netdevs representing the physical switch ports can be organized into
 | |
| higher-level switching constructs.  The default construct is a standalone
 | |
| router port, used to offload L3 forwarding.  Two or more ports can be bonded
 | |
| together to form a LAG.  Two or more ports (or LAGs) can be bridged to bridge
 | |
| L2 networks.  VLANs can be applied to sub-divide L2 networks.  L2-over-L3
 | |
| tunnels can be built on ports.  These constructs are built using standard Linux
 | |
| tools such as the bridge driver, the bonding/team drivers, and netlink-based
 | |
| tools such as iproute2.
 | |
| 
 | |
| The switchdev driver can know a particular port's position in the topology by
 | |
| monitoring NETDEV_CHANGEUPPER notifications.  For example, a port moved into a
 | |
| bond will see it's upper master change.  If that bond is moved into a bridge,
 | |
| the bond's upper master will change.  And so on.  The driver will track such
 | |
| movements to know what position a port is in in the overall topology by
 | |
| registering for netdevice events and acting on NETDEV_CHANGEUPPER.
 | |
| 
 | |
| L2 Forwarding Offload
 | |
| ---------------------
 | |
| 
 | |
| The idea is to offload the L2 data forwarding (switching) path from the kernel
 | |
| to the switchdev device by mirroring bridge FDB entries down to the device.  An
 | |
| FDB entry is the {port, MAC, VLAN} tuple forwarding destination.
 | |
| 
 | |
| To offloading L2 bridging, the switchdev driver/device should support:
 | |
| 
 | |
| 	- Static FDB entries installed on a bridge port
 | |
| 	- Notification of learned/forgotten src mac/vlans from device
 | |
| 	- STP state changes on the port
 | |
| 	- VLAN flooding of multicast/broadcast and unknown unicast packets
 | |
| 
 | |
| Static FDB Entries
 | |
| ^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump
 | |
| to support static FDB entries installed to the device.  Static bridge FDB
 | |
| entries are installed, for example, using iproute2 bridge cmd:
 | |
| 
 | |
| 	bridge fdb add ADDR dev DEV [vlan VID] [self]
 | |
| 
 | |
| The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx
 | |
| ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using
 | |
| switchdev_port_obj_xxx ops.
 | |
| 
 | |
| XXX: what should be done if offloading this rule to hardware fails (for
 | |
| example, due to full capacity in hardware tables) ?
 | |
| 
 | |
| Note: by default, the bridge does not filter on VLAN and only bridges untagged
 | |
| traffic.  To enable VLAN support, turn on VLAN filtering:
 | |
| 
 | |
| 	echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering
 | |
| 
 | |
| Notification of Learned/Forgotten Source MAC/VLANs
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| The switch device will learn/forget source MAC address/VLAN on ingress packets
 | |
| and notify the switch driver of the mac/vlan/port tuples.  The switch driver,
 | |
| in turn, will notify the bridge driver using the switchdev notifier call:
 | |
| 
 | |
| 	err = call_switchdev_notifiers(val, dev, info);
 | |
| 
 | |
| Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
 | |
| forgetting, and info points to a struct switchdev_notifier_fdb_info.  On
 | |
| SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
 | |
| bridge's FDB and mark the entry as NTF_EXT_LEARNED.  The iproute2 bridge
 | |
| command will label these entries "offload":
 | |
| 
 | |
| 	$ bridge fdb
 | |
| 	52:54:00:12:35:01 dev sw1p1 master br0 permanent
 | |
| 	00:02:00:00:02:00 dev sw1p1 master br0 offload
 | |
| 	00:02:00:00:02:00 dev sw1p1 self
 | |
| 	52:54:00:12:35:02 dev sw1p2 master br0 permanent
 | |
| 	00:02:00:00:03:00 dev sw1p2 master br0 offload
 | |
| 	00:02:00:00:03:00 dev sw1p2 self
 | |
| 	33:33:00:00:00:01 dev eth0 self permanent
 | |
| 	01:00:5e:00:00:01 dev eth0 self permanent
 | |
| 	33:33:ff:00:00:00 dev eth0 self permanent
 | |
| 	01:80:c2:00:00:0e dev eth0 self permanent
 | |
| 	33:33:00:00:00:01 dev br0 self permanent
 | |
| 	01:00:5e:00:00:01 dev br0 self permanent
 | |
| 	33:33:ff:12:35:01 dev br0 self permanent
 | |
| 
 | |
| Learning on the port should be disabled on the bridge using the bridge command:
 | |
| 
 | |
| 	bridge link set dev DEV learning off
 | |
| 
 | |
| Learning on the device port should be enabled, as well as learning_sync:
 | |
| 
 | |
| 	bridge link set dev DEV learning on self
 | |
| 	bridge link set dev DEV learning_sync on self
 | |
| 
 | |
| Learning_sync attribute enables syncing of the learned/forgotten FDB entry to
 | |
| the bridge's FDB.  It's possible, but not optimal, to enable learning on the
 | |
| device port and on the bridge port, and disable learning_sync.
 | |
| 
 | |
| To support learning and learning_sync port attributes, the driver implements
 | |
| switchdev op switchdev_port_attr_get/set for
 | |
| SWITCHDEV_ATTR_PORT_ID_BRIDGE_FLAGS. The driver should initialize the attributes
 | |
| to the hardware defaults.
 | |
| 
 | |
| FDB Ageing
 | |
| ^^^^^^^^^^
 | |
| 
 | |
| The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is
 | |
| the responsibility of the port driver/device to age out these entries.  If the
 | |
| port device supports ageing, when the FDB entry expires, it will notify the
 | |
| driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL.  If the
 | |
| device does not support ageing, the driver can simulate ageing using a
 | |
| garbage collection timer to monitor FDB entries.  Expired entries will be
 | |
| notified to the bridge using SWITCHDEV_FDB_DEL.  See rocker driver for
 | |
| example of driver running ageing timer.
 | |
| 
 | |
| To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB
 | |
| entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...).  The
 | |
| notification will reset the FDB entry's last-used time to now.  The driver
 | |
| should rate limit refresh notifications, for example, no more than once a
 | |
| second.  (The last-used time is visible using the bridge -s fdb option).
 | |
| 
 | |
| STP State Change on Port
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Internally or with a third-party STP protocol implementation (e.g. mstpd), the
 | |
| bridge driver maintains the STP state for ports, and will notify the switch
 | |
| driver of STP state change on a port using the switchdev op
 | |
| switchdev_attr_port_set for SWITCHDEV_ATTR_PORT_ID_STP_UPDATE.
 | |
| 
 | |
| State is one of BR_STATE_*.  The switch driver can use STP state updates to
 | |
| update ingress packet filter list for the port.  For example, if port is
 | |
| DISABLED, no packets should pass, but if port moves to BLOCKED, then STP BPDUs
 | |
| and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass.
 | |
| 
 | |
| Note that STP BDPUs are untagged and STP state applies to all VLANs on the port
 | |
| so packet filters should be applied consistently across untagged and tagged
 | |
| VLANs on the port.
 | |
| 
 | |
| Flooding L2 domain
 | |
| ^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| For a given L2 VLAN domain, the switch device should flood multicast/broadcast
 | |
| and unknown unicast packets to all ports in domain, if allowed by port's
 | |
| current STP state.  The switch driver, knowing which ports are within which
 | |
| vlan L2 domain, can program the switch device for flooding.  The packet may
 | |
| be sent to the port netdev for processing by the bridge driver.  The
 | |
| bridge should not reflood the packet to the same ports the device flooded,
 | |
| otherwise there will be duplicate packets on the wire.
 | |
| 
 | |
| To avoid duplicate packets, the switch driver should mark a packet as already
 | |
| forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark
 | |
| the skb using the ingress bridge port's mark and prevent it from being forwarded
 | |
| through any bridge port with the same mark.
 | |
| 
 | |
| It is possible for the switch device to not handle flooding and push the
 | |
| packets up to the bridge driver for flooding.  This is not ideal as the number
 | |
| of ports scale in the L2 domain as the device is much more efficient at
 | |
| flooding packets that software.
 | |
| 
 | |
| If supported by the device, flood control can be offloaded to it, preventing
 | |
| certain netdevs from flooding unicast traffic for which there is no FDB entry.
 | |
| 
 | |
| IGMP Snooping
 | |
| ^^^^^^^^^^^^^
 | |
| 
 | |
| In order to support IGMP snooping, the port netdevs should trap to the bridge
 | |
| driver all IGMP join and leave messages.
 | |
| The bridge multicast module will notify port netdevs on every multicast group
 | |
| changed whether it is static configured or dynamically joined/leave.
 | |
| The hardware implementation should be forwarding all registered multicast
 | |
| traffic groups only to the configured ports.
 | |
| 
 | |
| L3 Routing Offload
 | |
| ------------------
 | |
| 
 | |
| Offloading L3 routing requires that device be programmed with FIB entries from
 | |
| the kernel, with the device doing the FIB lookup and forwarding.  The device
 | |
| does a longest prefix match (LPM) on FIB entries matching route prefix and
 | |
| forwards the packet to the matching FIB entry's nexthop(s) egress ports.
 | |
| 
 | |
| To program the device, the driver has to register a FIB notifier handler
 | |
| using register_fib_notifier. The following events are available:
 | |
| FIB_EVENT_ENTRY_ADD: used for both adding a new FIB entry to the device,
 | |
|                      or modifying an existing entry on the device.
 | |
| FIB_EVENT_ENTRY_DEL: used for removing a FIB entry
 | |
| FIB_EVENT_RULE_ADD, FIB_EVENT_RULE_DEL: used to propagate FIB rule changes
 | |
| 
 | |
| FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass:
 | |
| 
 | |
| 	struct fib_entry_notifier_info {
 | |
| 		struct fib_notifier_info info; /* must be first */
 | |
| 		u32 dst;
 | |
| 		int dst_len;
 | |
| 		struct fib_info *fi;
 | |
| 		u8 tos;
 | |
| 		u8 type;
 | |
| 		u32 tb_id;
 | |
| 		u32 nlflags;
 | |
| 	};
 | |
| 
 | |
| to add/modify/delete IPv4 dst/dest_len prefix on table tb_id.  The *fi
 | |
| structure holds details on the route and route's nexthops.  *dev is one of the
 | |
| port netdevs mentioned in the route's next hop list.
 | |
| 
 | |
| Routes offloaded to the device are labeled with "offload" in the ip route
 | |
| listing:
 | |
| 
 | |
| 	$ ip route show
 | |
| 	default via 192.168.0.2 dev eth0
 | |
| 	11.0.0.0/30 dev sw1p1  proto kernel  scope link  src 11.0.0.2 offload
 | |
| 	11.0.0.4/30 via 11.0.0.1 dev sw1p1  proto zebra  metric 20 offload
 | |
| 	11.0.0.8/30 dev sw1p2  proto kernel  scope link  src 11.0.0.10 offload
 | |
| 	11.0.0.12/30 via 11.0.0.9 dev sw1p2  proto zebra  metric 20 offload
 | |
| 	12.0.0.2  proto zebra  metric 30 offload
 | |
| 		nexthop via 11.0.0.1  dev sw1p1 weight 1
 | |
| 		nexthop via 11.0.0.9  dev sw1p2 weight 1
 | |
| 	12.0.0.3 via 11.0.0.1 dev sw1p1  proto zebra  metric 20 offload
 | |
| 	12.0.0.4 via 11.0.0.9 dev sw1p2  proto zebra  metric 20 offload
 | |
| 	192.168.0.0/24 dev eth0  proto kernel  scope link  src 192.168.0.15
 | |
| 
 | |
| The "offload" flag is set in case at least one device offloads the FIB entry.
 | |
| 
 | |
| XXX: add/mod/del IPv6 FIB API
 | |
| 
 | |
| Nexthop Resolution
 | |
| ^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for
 | |
| the switch device to forward the packet with the correct dst mac address, the
 | |
| nexthop gateways must be resolved to the neighbor's mac address.  Neighbor mac
 | |
| address discovery comes via the ARP (or ND) process and is available via the
 | |
| arp_tbl neighbor table.  To resolve the routes nexthop gateways, the driver
 | |
| should trigger the kernel's neighbor resolution process.  See the rocker
 | |
| driver's rocker_port_ipv4_resolve() for an example.
 | |
| 
 | |
| The driver can monitor for updates to arp_tbl using the netevent notifier
 | |
| NETEVENT_NEIGH_UPDATE.  The device can be programmed with resolved nexthops
 | |
| for the routes as arp_tbl updates.  The driver implements ndo_neigh_destroy
 | |
| to know when arp_tbl neighbor entries are purged from the port.
 | |
| 
 | |
| Transaction item queue
 | |
| ^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| For switchdev ops attr_set and obj_add, there is a 2 phase transaction model
 | |
| used. First phase is to "prepare" anything needed, including various checks,
 | |
| memory allocation, etc. The goal is to handle the stuff that is not unlikely
 | |
| to fail here. The second phase is to "commit" the actual changes.
 | |
| 
 | |
| Switchdev provides an infrastructure for sharing items (for example memory
 | |
| allocations) between the two phases.
 | |
| 
 | |
| The object created by a driver in "prepare" phase and it is queued up by:
 | |
| switchdev_trans_item_enqueue()
 | |
| During the "commit" phase, the driver gets the object by:
 | |
| switchdev_trans_item_dequeue()
 | |
| 
 | |
| If a transaction is aborted during "prepare" phase, switchdev code will handle
 | |
| cleanup of the queued-up objects.
 | 
