From 0a9627f2649a02bea165cfd529d7bcb625c2fcad Mon Sep 17 00:00:00 2001
From: Tom Herbert
Date: Tue, 16 Mar 2010 08:03:29 +0000
Subject: rps: Receive Packet Steering

This patch implements software receive side packet steering (RPS). RPS distributes the load of received packet processing across multiple CPUs.

Problem statement: Protocol processing done in the NAPI context for received packets is serialized per device queue and becomes a bottleneck under high packet load. This substantially limits pps that can be achieved on a single queue NIC and provides no scaling with multiple cores.

This solution queues packets early on in the receive path on the backlog queues of other CPUs. This allows protocol processing (e.g. IP and TCP) to be performed on packets in parallel. For each device (or each receive queue in a multi-queue device) a mask of CPUs is set to indicate the CPUs that can process packets. A CPU is selected on a per packet basis by hashing contents of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index into the CPU mask. The IPI mechanism is used to raise networking receive softirqs between CPUs. This effectively emulates in software what a multi-queue NIC can provide, but is generic, requiring no device support.

Many devices now provide a hash over the 4-tuple on a per packet basis (e.g. the Toeplitz hash). This patch allows drivers to set the HW reported hash in an skb field, and that value in turn is used to index into the RPS maps. Using the HW generated hash can avoid cache misses on the packet when steering it to a remote CPU.

The CPU mask is set on a per device and per queue basis in the sysfs variable /sys/class/net//queues/rx-/rps_cpus. This is a set of canonical bit maps for receive queues in the device (numbered by ). If a device does not support multi-queue, a single variable is used for the device (rx-0).

Generally, we have found this technique increases pps capabilities of a single queue device with good CPU utilization.
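The per-packet CPU selection described in the message above can be sketched in userspace C. This is a minimal illustration, not the kernel code: `flow_key`, `cpu_map`, `flow_hash()` and `select_cpu()` are hypothetical names, the mixing function is a toy stand-in for whatever hash the stack or the NIC provides, and scaling the hash with a multiply and shift is an assumed way of turning a 32-bit hash into a map index.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow key: the TCP/UDP 4-tuple named in the commit message. */
struct flow_key {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
};

/* Userspace stand-in for an RPS map: the CPUs enabled in the queue's mask. */
struct cpu_map {
	unsigned int len;
	uint16_t cpus[8];
};

/* Toy mixing function (an assumption, not the kernel's hash). */
uint32_t flow_hash(const struct flow_key *k)
{
	uint32_t h = k->saddr ^ (k->daddr * 0x9e3779b1u);

	h ^= ((uint32_t)k->sport << 16) | k->dport;
	h ^= h >> 16;
	h *= 0x85ebca6bu;
	h ^= h >> 13;
	return h;
}

/* Scale the 32-bit hash onto [0, len) and return the CPU stored there. */
uint16_t select_cpu(const struct cpu_map *map, uint32_t hash)
{
	assert(map->len > 0 && map->len <= 8);
	return map->cpus[((uint64_t)hash * map->len) >> 32];
}
```

Because the hash depends only on the flow key, every packet of a flow maps to the same CPU for as long as the map is unchanged, which is what keeps packets of one flow in order.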
Optimal settings for the CPU mask seem to depend on architectures and cache hierarchy. Below are some results running 500 instances of netperf TCP_RR test with 1 byte req. and resp. Results show cumulative transaction rate and system CPU utilization.

e1000e on 8 core Intel
   Without RPS: 108K tps at 33% CPU
   With RPS:    311K tps at 64% CPU

forcedeth on 16 core AMD
   Without RPS: 156K tps at 15% CPU
   With RPS:    404K tps at 49% CPU

bnx2x on 16 core AMD
   Without RPS: 567K tps at 61% CPU (4 HW RX queues)
   Without RPS: 738K tps at 96% CPU (8 HW RX queues)
   With RPS:    854K tps at 76% CPU (4 HW RX queues)

Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy. Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet. In a lightly loaded server this overhead may eliminate the advantages of increased parallelism, and possibly cause some relative performance degradation. We have found that masks that are cache aware (share same caches with the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically, however whenever the mask is changed this introduces the possibility of generating out of order packets. It's probably best not to change the masks too frequently.

Signed-off-by: Tom Herbert

 include/linux/netdevice.h |  32 ++++-
 include/linux/skbuff.h    |   3 +
 net/core/dev.c            | 335 +++++++++++++++++++++++++++++++++++++--------
 net/core/net-sysfs.c      | 225 ++++++++++++++++++++++++++++++-
 net/core/skbuff.c         |   2 +
 5 files changed, 538 insertions(+), 59 deletions(-)

Signed-off-by: Eric Dumazet
Signed-off-by: David S.
Miller --- include/linux/netdevice.h | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) (limited to 'include/linux/netdevice.h') diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index c79a88b..de1a52b 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -223,6 +223,7 @@ struct netif_rx_stats { unsigned dropped; unsigned time_squeeze; unsigned cpu_collision; + unsigned received_rps; }; DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat); @@ -530,6 +531,24 @@ struct netdev_queue { unsigned long tx_dropped; } ____cacheline_aligned_in_smp; +/* + * This structure holds an RPS map which can be of variable length. The + * map is an array of CPUs. + */ +struct rps_map { + unsigned int len; + struct rcu_head rcu; + u16 cpus[0]; +}; +#define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16))) + +/* This structure contains an instance of an RX queue. */ +struct netdev_rx_queue { + struct rps_map *rps_map; + struct kobject kobj; + struct netdev_rx_queue *first; + atomic_t count; +} ____cacheline_aligned_in_smp; /* * This structure defines the management hooks for network devices. 
@@ -878,6 +897,13 @@ struct net_device { unsigned char broadcast[MAX_ADDR_LEN]; /* hw bcast add */ + struct kset *queues_kset; + + struct netdev_rx_queue *_rx; + + /* Number of RX queues allocated at alloc_netdev_mq() time */ + unsigned int num_rx_queues; + struct netdev_queue rx_queue; struct netdev_queue *_tx ____cacheline_aligned_in_smp; @@ -1311,14 +1337,16 @@ static inline int unregister_gifconf(unsigned int family) */ struct softnet_data { struct Qdisc *output_queue; - struct sk_buff_head input_pkt_queue; struct list_head poll_list; struct sk_buff *completion_queue; + /* Elements below can be accessed between CPUs for RPS */ + struct call_single_data csd ____cacheline_aligned_in_smp; + struct sk_buff_head input_pkt_queue; struct napi_struct backlog; }; -DECLARE_PER_CPU(struct softnet_data,softnet_data); +DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data); #define HAVE_NETIF_QUEUE -- cgit v1.1 From 1e94d72feab025b8f7c55d07020602f82f3a97dd Mon Sep 17 00:00:00 2001 From: Tom Herbert Date: Thu, 18 Mar 2010 17:45:44 -0700 Subject: rps: Fixed build with CONFIG_SMP not enabled. Signed-off-by: Tom Herbert Signed-off-by: David S. 
Miller --- include/linux/netdevice.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/linux/netdevice.h') diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index de1a52b..726ecd1 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1341,7 +1341,9 @@ struct softnet_data { struct sk_buff *completion_queue; /* Elements below can be accessed between CPUs for RPS */ +#ifdef CONFIG_SMP struct call_single_data csd ____cacheline_aligned_in_smp; +#endif struct sk_buff_head input_pkt_queue; struct napi_struct backlog; }; -- cgit v1.1 From 3ca5b4042ecae5e73c59de62e4ac0db31c10e0f8 Mon Sep 17 00:00:00 2001 From: Jiri Pirko Date: Wed, 10 Mar 2010 10:29:35 +0000 Subject: bonding: check return value of nofitier when changing type This patch adds the possibility to refuse the bonding type change for other subsystems (such as for example bridge, vlan, etc.) Signed-off-by: Jiri Pirko Signed-off-by: David S. Miller --- include/linux/netdevice.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux/netdevice.h') diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 726ecd1..813bed7 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2005,7 +2005,7 @@ extern void __dev_addr_unsync(struct dev_addr_list **to, int *to_count, struct extern int dev_set_promiscuity(struct net_device *dev, int inc); extern int dev_set_allmulti(struct net_device *dev, int inc); extern void netdev_state_change(struct net_device *dev); -extern void netdev_bonding_change(struct net_device *dev, +extern int netdev_bonding_change(struct net_device *dev, unsigned long event); extern void netdev_features_change(struct net_device *dev); /* Load a device via the kmod */ -- cgit v1.1 From 32a806c194ea112cfab00f558482dd97bee5e44e Mon Sep 17 00:00:00 2001 From: Jiri Pirko Date: Fri, 19 Mar 2010 04:00:23 +0000 Subject: bonding: flush unicast and multicast lists when changing type After the type change, 
addresses in unicast and multicast lists wouldn't make sense, not to mention possibly different lengths. So flush both lists here.

Note "dev_addr_discard" will very soon be replaced by "dev_mc_flush" (once the mc_list conversion is done).

Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9fc6ee8..c96c41e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1994,10 +1994,12 @@ extern int dev_unicast_delete(struct net_device *dev, void *addr);
 extern int dev_unicast_add(struct net_device *dev, void *addr);
 extern int dev_unicast_sync(struct net_device *to, struct net_device *from);
 extern void dev_unicast_unsync(struct net_device *to, struct net_device *from);
+extern void dev_unicast_flush(struct net_device *dev);
 extern int dev_mc_delete(struct net_device *dev, void *addr, int alen, int all);
 extern int dev_mc_add(struct net_device *dev, void *addr, int alen, int newonly);
 extern int dev_mc_sync(struct net_device *to, struct net_device *from);
 extern void dev_mc_unsync(struct net_device *to, struct net_device *from);
+extern void dev_addr_discard(struct net_device *dev);
 extern int __dev_addr_delete(struct dev_addr_list **list, int *count, void *addr, int alen, int all);
 extern int __dev_addr_add(struct dev_addr_list **list, int *count, void *addr, int alen, int newonly);
 extern int __dev_addr_sync(struct dev_addr_list **to, int *to_count, struct dev_addr_list **from, int *from_count);
--
cgit v1.1

From df3345457a7a174dfb5872a070af80d456985038 Mon Sep 17 00:00:00 2001
From: Eric Dumazet
Date: Wed, 24 Mar 2010 19:13:54 +0000
Subject: rps: add CONFIG_RPS

RPS currently depends on SMP and SYSFS.

Adding a CONFIG_RPS makes sense in case this requirement changes in the future. This patch saves about 1500 bytes of kernel text in case SMP is on but SYSFS is off.
Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- include/linux/netdevice.h | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'include/linux/netdevice.h') diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index c96c41e..53c272f 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -531,6 +531,7 @@ struct netdev_queue { unsigned long tx_dropped; } ____cacheline_aligned_in_smp; +#ifdef CONFIG_RPS /* * This structure holds an RPS map which can be of variable length. The * map is an array of CPUs. @@ -549,6 +550,7 @@ struct netdev_rx_queue { struct netdev_rx_queue *first; atomic_t count; } ____cacheline_aligned_in_smp; +#endif /* * This structure defines the management hooks for network devices. @@ -897,12 +899,14 @@ struct net_device { unsigned char broadcast[MAX_ADDR_LEN]; /* hw bcast add */ +#ifdef CONFIG_RPS struct kset *queues_kset; struct netdev_rx_queue *_rx; /* Number of RX queues allocated at alloc_netdev_mq() time */ unsigned int num_rx_queues; +#endif struct netdev_queue rx_queue; -- cgit v1.1 From b00fabb4020d17bda4bea59507e09fadf573088d Mon Sep 17 00:00:00 2001 From: stephen hemminger Date: Mon, 29 Mar 2010 14:47:27 +0000 Subject: netdev: ethtool RXHASH flag This adds ethtool and device feature flag to allow control of receive hashing offload. Signed-off-by: Stephen Hemminger Acked-by: Jeff Garzik Signed-off-by: David S. 
Miller --- include/linux/netdevice.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux/netdevice.h') diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 53c272f..b5670ab 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -785,6 +785,7 @@ struct net_device { #define NETIF_F_SCTP_CSUM (1 << 25) /* SCTP checksum offload */ #define NETIF_F_FCOE_MTU (1 << 26) /* Supports max FCoE MTU, 2158 bytes*/ #define NETIF_F_NTUPLE (1 << 27) /* N-tuple filters supported */ +#define NETIF_F_RXHASH (1 << 28) /* Receive hashing offload */ /* Segmentation offload features */ #define NETIF_F_GSO_SHIFT 16 -- cgit v1.1 From a748ee2426817a95b1f03012d8f339c45c722ae1 Mon Sep 17 00:00:00 2001 From: Jiri Pirko Date: Thu, 1 Apr 2010 21:22:09 +0000 Subject: net: move address list functions to a separate file +little renaming of unicast functions to be smooth with multicast ones Signed-off-by: Jiri Pirko Signed-off-by: David S. Miller --- include/linux/netdevice.h | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) (limited to 'include/linux/netdevice.h') diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b5670ab..60f0c83 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1991,15 +1991,20 @@ extern int dev_addr_add_multiple(struct net_device *to_dev, extern int dev_addr_del_multiple(struct net_device *to_dev, struct net_device *from_dev, unsigned char addr_type); +extern void dev_addr_flush(struct net_device *dev); +extern int dev_addr_init(struct net_device *dev); + +/* Functions used for unicast addresses handling */ +extern int dev_uc_add(struct net_device *dev, unsigned char *addr); +extern int dev_uc_del(struct net_device *dev, unsigned char *addr); +extern int dev_uc_sync(struct net_device *to, struct net_device *from); +extern void dev_uc_unsync(struct net_device *to, struct net_device *from); +extern void dev_uc_flush(struct net_device *dev); +extern void 
dev_uc_init(struct net_device *dev); /* Functions used for secondary unicast and multicast support */ extern void dev_set_rx_mode(struct net_device *dev); extern void __dev_set_rx_mode(struct net_device *dev); -extern int dev_unicast_delete(struct net_device *dev, void *addr); -extern int dev_unicast_add(struct net_device *dev, void *addr); -extern int dev_unicast_sync(struct net_device *to, struct net_device *from); -extern void dev_unicast_unsync(struct net_device *to, struct net_device *from); -extern void dev_unicast_flush(struct net_device *dev); extern int dev_mc_delete(struct net_device *dev, void *addr, int alen, int all); extern int dev_mc_add(struct net_device *dev, void *addr, int alen, int newonly); extern int dev_mc_sync(struct net_device *to, struct net_device *from); -- cgit v1.1 From 22bedad3ce112d5ca1eaf043d4990fa2ed698c87 Mon Sep 17 00:00:00 2001 From: Jiri Pirko Date: Thu, 1 Apr 2010 21:22:57 +0000 Subject: net: convert multicast list to list_head Converts the list and the core manipulating with it to be the same as uc_list. +uses two functions for adding/removing mc address (normal and "global" variant) instead of a function parameter. +removes dev_mcast.c completely. +exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for manipulation with lists on a sandbox (used in bonding and 80211 drivers) Signed-off-by: Jiri Pirko Signed-off-by: David S. 
Miller --- include/linux/netdevice.h | 82 ++++++++++++++++++++++++----------------------- 1 file changed, 42 insertions(+), 40 deletions(-) (limited to 'include/linux/netdevice.h') diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 60f0c83..a343a21 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -228,25 +228,6 @@ struct netif_rx_stats { DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat); -struct dev_addr_list { - struct dev_addr_list *next; - u8 da_addr[MAX_ADDR_LEN]; - u8 da_addrlen; - u8 da_synced; - int da_users; - int da_gusers; -}; - -/* - * We tag multicasts with these structures. - */ - -#define dev_mc_list dev_addr_list -#define dmi_addr da_addr -#define dmi_addrlen da_addrlen -#define dmi_users da_users -#define dmi_gusers da_gusers - struct netdev_hw_addr { struct list_head list; unsigned char addr[MAX_ADDR_LEN]; @@ -255,8 +236,10 @@ struct netdev_hw_addr { #define NETDEV_HW_ADDR_T_SAN 2 #define NETDEV_HW_ADDR_T_SLAVE 3 #define NETDEV_HW_ADDR_T_UNICAST 4 +#define NETDEV_HW_ADDR_T_MULTICAST 5 int refcount; bool synced; + bool global_use; struct rcu_head rcu_head; }; @@ -265,16 +248,20 @@ struct netdev_hw_addr_list { int count; }; -#define netdev_uc_count(dev) ((dev)->uc.count) -#define netdev_uc_empty(dev) ((dev)->uc.count == 0) -#define netdev_for_each_uc_addr(ha, dev) \ - list_for_each_entry(ha, &dev->uc.list, list) +#define netdev_hw_addr_list_count(l) ((l)->count) +#define netdev_hw_addr_list_empty(l) (netdev_hw_addr_list_count(l) == 0) +#define netdev_hw_addr_list_for_each(ha, l) \ + list_for_each_entry(ha, &(l)->list, list) -#define netdev_mc_count(dev) ((dev)->mc_count) -#define netdev_mc_empty(dev) (netdev_mc_count(dev) == 0) +#define netdev_uc_count(dev) netdev_hw_addr_list_count(&(dev)->uc) +#define netdev_uc_empty(dev) netdev_hw_addr_list_empty(&(dev)->uc) +#define netdev_for_each_uc_addr(ha, dev) \ + netdev_hw_addr_list_for_each(ha, &(dev)->uc) +#define netdev_mc_count(dev) 
netdev_hw_addr_list_count(&(dev)->mc) +#define netdev_mc_empty(dev) netdev_hw_addr_list_empty(&(dev)->mc) #define netdev_for_each_mc_addr(mclist, dev) \ - for (mclist = dev->mc_list; mclist; mclist = mclist->next) + netdev_hw_addr_list_for_each(ha, &(dev)->mc) struct hh_cache { struct hh_cache *hh_next; /* Next entry */ @@ -862,12 +849,10 @@ struct net_device { unsigned char addr_len; /* hardware address length */ unsigned short dev_id; /* for shared network cards */ - struct netdev_hw_addr_list uc; /* Secondary unicast - mac addresses */ - int uc_promisc; spinlock_t addr_list_lock; - struct dev_addr_list *mc_list; /* Multicast mac addresses */ - int mc_count; /* Number of installed mcasts */ + struct netdev_hw_addr_list uc; /* Unicast mac addresses */ + struct netdev_hw_addr_list mc; /* Multicast mac addresses */ + int uc_promisc; unsigned int promiscuity; unsigned int allmulti; @@ -1980,6 +1965,22 @@ extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name, extern int register_netdev(struct net_device *dev); extern void unregister_netdev(struct net_device *dev); +/* General hardware address lists handling functions */ +extern int __hw_addr_add_multiple(struct netdev_hw_addr_list *to_list, + struct netdev_hw_addr_list *from_list, + int addr_len, unsigned char addr_type); +extern void __hw_addr_del_multiple(struct netdev_hw_addr_list *to_list, + struct netdev_hw_addr_list *from_list, + int addr_len, unsigned char addr_type); +extern int __hw_addr_sync(struct netdev_hw_addr_list *to_list, + struct netdev_hw_addr_list *from_list, + int addr_len); +extern void __hw_addr_unsync(struct netdev_hw_addr_list *to_list, + struct netdev_hw_addr_list *from_list, + int addr_len); +extern void __hw_addr_flush(struct netdev_hw_addr_list *list); +extern void __hw_addr_init(struct netdev_hw_addr_list *list); + /* Functions used for device addresses handling */ extern int dev_addr_add(struct net_device *dev, unsigned char *addr, unsigned char addr_type); @@ 
-2002,18 +2003,19 @@ extern void dev_uc_unsync(struct net_device *to, struct net_device *from); extern void dev_uc_flush(struct net_device *dev); extern void dev_uc_init(struct net_device *dev); +/* Functions used for multicast addresses handling */ +extern int dev_mc_add(struct net_device *dev, unsigned char *addr); +extern int dev_mc_add_global(struct net_device *dev, unsigned char *addr); +extern int dev_mc_del(struct net_device *dev, unsigned char *addr); +extern int dev_mc_del_global(struct net_device *dev, unsigned char *addr); +extern int dev_mc_sync(struct net_device *to, struct net_device *from); +extern void dev_mc_unsync(struct net_device *to, struct net_device *from); +extern void dev_mc_flush(struct net_device *dev); +extern void dev_mc_init(struct net_device *dev); + /* Functions used for secondary unicast and multicast support */ extern void dev_set_rx_mode(struct net_device *dev); extern void __dev_set_rx_mode(struct net_device *dev); -extern int dev_mc_delete(struct net_device *dev, void *addr, int alen, int all); -extern int dev_mc_add(struct net_device *dev, void *addr, int alen, int newonly); -extern int dev_mc_sync(struct net_device *to, struct net_device *from); -extern void dev_mc_unsync(struct net_device *to, struct net_device *from); -extern void dev_addr_discard(struct net_device *dev); -extern int __dev_addr_delete(struct dev_addr_list **list, int *count, void *addr, int alen, int all); -extern int __dev_addr_add(struct dev_addr_list **list, int *count, void *addr, int alen, int newonly); -extern int __dev_addr_sync(struct dev_addr_list **to, int *to_count, struct dev_addr_list **from, int *from_count); -extern void __dev_addr_unsync(struct dev_addr_list **to, int *to_count, struct dev_addr_list **from, int *from_count); extern int dev_set_promiscuity(struct net_device *dev, int inc); extern int dev_set_allmulti(struct net_device *dev, int inc); extern void netdev_state_change(struct net_device *dev); -- cgit v1.1 From 
18e225f257663c59ff9d4482f07ffd06361fc2ec Mon Sep 17 00:00:00 2001 From: Pavel Roskin Date: Wed, 7 Apr 2010 16:40:09 -0700 Subject: net: fix definition of netdev_for_each_mc_addr() The first argument should be called ha, not mclist. All callers use the name "ha", but if they used a different name, there would be a compile error. Signed-off-by: Pavel Roskin Signed-off-by: David S. Miller --- include/linux/netdevice.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux/netdevice.h') diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index a343a21..d1a21b5 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -260,7 +260,7 @@ struct netdev_hw_addr_list { #define netdev_mc_count(dev) netdev_hw_addr_list_count(&(dev)->mc) #define netdev_mc_empty(dev) netdev_hw_addr_list_empty(&(dev)->mc) -#define netdev_for_each_mc_addr(mclist, dev) \ +#define netdev_for_each_mc_addr(ha, dev) \ netdev_hw_addr_list_for_each(ha, &(dev)->mc) struct hh_cache { -- cgit v1.1 From acbbc07145b919248c410e1852b953d385be5c97 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Sun, 11 Apr 2010 06:56:11 +0000 Subject: net: uninline skb_bond_should_drop() skb_bond_should_drop() is too big to be inlined. This patch reduces kernel text size, and its compilation time as well (shrinking include/linux/netdevice.h) Signed-off-by: Eric Dumazet Signed-off-by: David S. 
Miller --- include/linux/netdevice.h | 48 ++++------------------------------------------- 1 file changed, 4 insertions(+), 44 deletions(-) (limited to 'include/linux/netdevice.h') diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index d1a21b5..470f7c9 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2089,54 +2089,14 @@ static inline void netif_set_gso_max_size(struct net_device *dev, dev->gso_max_size = size; } -static inline void skb_bond_set_mac_by_master(struct sk_buff *skb, - struct net_device *master) -{ - if (skb->pkt_type == PACKET_HOST) { - u16 *dest = (u16 *) eth_hdr(skb)->h_dest; - - memcpy(dest, master->dev_addr, ETH_ALEN); - } -} +extern int __skb_bond_should_drop(struct sk_buff *skb, + struct net_device *master); -/* On bonding slaves other than the currently active slave, suppress - * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and - * ARP on active-backup slaves with arp_validate enabled. - */ static inline int skb_bond_should_drop(struct sk_buff *skb, struct net_device *master) { - if (master) { - struct net_device *dev = skb->dev; - - if (master->priv_flags & IFF_MASTER_ARPMON) - dev->last_rx = jiffies; - - if ((master->priv_flags & IFF_MASTER_ALB) && master->br_port) { - /* Do address unmangle. The local destination address - * will be always the one master has. Provides the right - * functionality in a bridge. 
- */ - skb_bond_set_mac_by_master(skb, master); - } - - if (dev->priv_flags & IFF_SLAVE_INACTIVE) { - if ((dev->priv_flags & IFF_SLAVE_NEEDARP) && - skb->protocol == __cpu_to_be16(ETH_P_ARP)) - return 0; - - if (master->priv_flags & IFF_MASTER_ALB) { - if (skb->pkt_type != PACKET_BROADCAST && - skb->pkt_type != PACKET_MULTICAST) - return 0; - } - if (master->priv_flags & IFF_MASTER_8023AD && - skb->protocol == __cpu_to_be16(ETH_P_SLOW)) - return 0; - - return 1; - } - } + if (master) + return __skb_bond_should_drop(skb, master); return 0; } -- cgit v1.1 From fd793d8905720595caede6bd26c5df6c0ecd37f8 Mon Sep 17 00:00:00 2001 From: Changli Gao Date: Thu, 15 Apr 2010 00:16:59 -0700 Subject: net: CONFIG_SMP should be CONFIG_RPS Signed-off-by: Changli Gao Signed-off-by: David S. Miller --- include/linux/netdevice.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux/netdevice.h') diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 470f7c9..55c2086 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1331,7 +1331,7 @@ struct softnet_data { struct sk_buff *completion_queue; /* Elements below can be accessed between CPUs for RPS */ -#ifdef CONFIG_SMP +#ifdef CONFIG_RPS struct call_single_data csd ____cacheline_aligned_in_smp; #endif struct sk_buff_head input_pkt_queue; -- cgit v1.1 From fec5e652e58fa6017b2c9e06466cb2a6538de5b4 Mon Sep 17 00:00:00 2001 From: Tom Herbert Date: Fri, 16 Apr 2010 16:01:27 -0700 Subject: rfs: Receive Flow Steering This patch implements receive flow steering (RFS). RFS steers received packets for layer 3 and 4 processing to the CPU where the application for the corresponding flow is running. RFS is an extension of Receive Packet Steering (RPS). The basic idea of RFS is that when an application calls recvmsg (or sendmsg) the application's running CPU is stored in a hash table that is indexed by the connection's rxhash which is stored in the socket structure. 
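As a rough userspace model of the hash table just described: the application's CPU is written into a bucket chosen by masking the connection's rxhash, and the receive path later reads the same bucket. The names `sock_flow_table`, `record_flow()`, `desired_cpu()`, `NO_CPU` and `TABLE_ENTRIES` are illustrative assumptions, and the table size is arbitrary. Treating a hash of 0 as "no hash available" follows the `if (table && hash)` guard in this patch's `rps_record_sock_flow()`.

```c
#include <assert.h>
#include <stdint.h>

#define NO_CPU 0xffffu      /* marker for "no CPU recorded" */
#define TABLE_ENTRIES 256u  /* illustrative size; must be a power of two */

/* Userspace stand-in for the global socket flow table. */
struct sock_flow_table {
	uint32_t mask;                /* TABLE_ENTRIES - 1 */
	uint16_t ents[TABLE_ENTRIES]; /* desired CPU per hash bucket */
};

void table_init(struct sock_flow_table *t)
{
	/* Masking only works as a modulo when the size is a power of two. */
	assert((TABLE_ENTRIES & (TABLE_ENTRIES - 1)) == 0);
	t->mask = TABLE_ENTRIES - 1;
	for (uint32_t i = 0; i < TABLE_ENTRIES; i++)
		t->ents[i] = NO_CPU;
}

/* recvmsg/sendmsg side: remember the CPU the application is running on. */
void record_flow(struct sock_flow_table *t, uint32_t rxhash, uint16_t cpu)
{
	if (rxhash) /* a hash of 0 is treated as "no hash available" */
		t->ents[rxhash & t->mask] = cpu;
}

/* Receive side: which CPU does this flow want, if any? */
uint16_t desired_cpu(const struct sock_flow_table *t, uint32_t rxhash)
{
	return rxhash ? t->ents[rxhash & t->mask] : NO_CPU;
}
```

Two flows whose hashes share the low bits land in the same bucket and overwrite each other's entry. In this model that is harmless because the entry is only a steering hint.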
The rxhash is passed in skbs received on the connection from netif_receive_skb. For each received packet, the associated rxhash is used to look up the CPU in the hash table; if a valid CPU is set, the packet is steered to that CPU using the RPS mechanisms.

The complication with this simple approach is that it would potentially allow out-of-order (OOO) packets. If threads are thrashing around CPUs or multiple threads are trying to read from the same sockets, a quickly changing CPU value in the hash table could cause rampant OOO packets; we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_flow_table is a global hash table. Each entry is just a CPU number and it is populated in recvmsg and sendmsg as described above. This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue. Each entry contains a CPU and a tail queue counter. The CPU is the "current" CPU for a matching flow. The tail queue counter holds the value of a tail queue counter for the associated CPU's backlog queue at the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented on dequeue, and so a queue tail counter is computed as queue head count + queue length. When a packet is enqueued on a backlog queue, the current value of the queue tail counter is saved in the hash entry of the rps_dev_flow_table.

And now the trick: when selecting the CPU for RPS (get_rps_cpu) the rps_sock_flow table and the rps_dev_flow table for the RX queue are consulted. When the desired CPU for the flow (found in the rps_sock_flow table) does not match the current CPU (found in the rps_dev_flow table), the current CPU is changed to the desired CPU if one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- The current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the rps_dev_flow table.
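The switch rule in the list above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel's get_rps_cpu(): `backlog`, `dev_flow`, `steer()`, `enqueue()`, `dequeue()`, `NO_CPU` and `MAX_CPUS` are hypothetical names, and the signed-difference comparison is an assumed way of keeping the head/tail test valid if the counters wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_CPU 0xffffu
#define MAX_CPUS 4

/* Per-CPU backlog state: the head counter is incremented on every dequeue. */
struct backlog {
	bool online;
	unsigned int head; /* packets dequeued so far */
	unsigned int qlen; /* packets currently queued */
};

/* One per-queue flow entry: current CPU plus tail counter at last enqueue. */
struct dev_flow {
	uint16_t cpu;
	unsigned int last_qtail;
};

/* Pick the CPU for the next packet of a flow, switching only when safe. */
uint16_t steer(struct dev_flow *f, uint16_t desired,
	       const struct backlog cpus[MAX_CPUS])
{
	assert(desired == NO_CPU || desired < MAX_CPUS);
	if (desired != NO_CPU && desired != f->cpu) {
		bool safe = f->cpu == NO_CPU ||
			    !cpus[f->cpu].online ||
			    /* signed difference tolerates counter wraparound */
			    (int)(cpus[f->cpu].head - f->last_qtail) >= 0;
		if (safe)
			f->cpu = desired;
	}
	return f->cpu;
}

/* Enqueue on the flow's current CPU and save the queue tail (head + length). */
void enqueue(struct dev_flow *f, struct backlog cpus[MAX_CPUS])
{
	assert(f->cpu < MAX_CPUS);
	struct backlog *b = &cpus[f->cpu];

	b->qlen++;
	f->last_qtail = b->head + b->qlen;
}

/* Dequeue one packet from a CPU's backlog. */
void dequeue(struct backlog *b)
{
	assert(b->qlen > 0);
	b->qlen--;
	b->head++;
}
```

In this model a flow with a packet still queued on CPU 1 stays on CPU 1 even when the application has moved to CPU 2. The switch happens only once CPU 1's head counter has caught up with the saved tail, so the old CPU can never deliver a packet after the new CPU has started on the flow.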
The head-counter comparison checks whether the queue head has advanced beyond the last packet that was enqueued using this table entry. This guarantees that all packets queued using this entry have been dequeued, thus preserving in-order delivery.

Making each queue have its own rps_dev_flow table has two advantages: 1) the tail queue counters will be written on each receive, so keeping the table local to the interrupting CPU is good for locality. 2) this allows lockless access to the table: the CPU number and queue tail counter need to be accessed together under mutual exclusion from netif_receive_skb, and we assume that this is only called from device napi_poll, which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets. It should be usable for other flow oriented protocols.

There are two configuration parameters for RFS. The "rps_flow_entries" kernel init parameter sets the number of entries in the rps_sock_flow_table; the per-rxqueue sysfs entry "rps_flow_cnt" contains the number of entries in the rps_dev_flow table for the rxqueue. Both are rounded to a power of two.

The obvious benefit of RFS (over just RPS) is that it achieves CPU locality between the receive processing for a flow and the application's processing; this can result in increased performance (higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application load, and other factors. On simple benchmarks, we don't necessarily see improvement and sometimes see degradation. However, for more complex benchmarks and for applications where cache pressure is much higher this technique seems to perform very well.

Below are some benchmark results which show the potential benefit of this patch. The netperf test has 500 instances of netperf TCP_RR test with 1 byte req. and resp. The RPC test is a request/response test similar in structure to the netperf RR test with 100 threads on each host, but does more work in userspace than netperf.

e1000e on 8 core Intel
   No RFS or RPS              104K tps at 30% CPU
   No RFS (best RPS config):  290K tps at 63% CPU
   RFS                        303K tps at 61% CPU

RPC test         tps    CPU%   50/90/99% usec latency   Latency StdDev
   No RFS/RPS    103K   48%    757/900/3185             4472.35
   RPS only:     174K   73%    415/993/2468             491.66
   RFS           223K   73%    379/651/1382             315.61

Signed-off-by: Tom Herbert
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 69 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 68 insertions(+), 1 deletion(-)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 55c2086..649a025 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,14 +530,73 @@ struct rps_map {
 };
 #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
 
+/*
+ * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
+ * tail pointer for that CPU's input queue at the time of last enqueue.
+ */
+struct rps_dev_flow {
+	u16 cpu;
+	u16 fill;
+	unsigned int last_qtail;
+};
+
+/*
+ * The rps_dev_flow_table structure contains a table of flow mappings.
+ */
+struct rps_dev_flow_table {
+	unsigned int mask;
+	struct rcu_head rcu;
+	struct work_struct free_work;
+	struct rps_dev_flow flows[0];
+};
+#define RPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_dev_flow_table) + \
+    (_num * sizeof(struct rps_dev_flow)))
+
+/*
+ * The rps_sock_flow_table contains mappings of flows to the last CPU
+ * on which they were processed by the application (set in recvmsg).
+ */ +struct rps_sock_flow_table { + unsigned int mask; + u16 ents[0]; +}; +#define RPS_SOCK_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_sock_flow_table) + \ + (_num * sizeof(u16))) + +#define RPS_NO_CPU 0xffff + +static inline void rps_record_sock_flow(struct rps_sock_flow_table *table, + u32 hash) +{ + if (table && hash) { + unsigned int cpu, index = hash & table->mask; + + /* We only give a hint, preemption can change cpu under us */ + cpu = raw_smp_processor_id(); + + if (table->ents[index] != cpu) + table->ents[index] = cpu; + } +} + +static inline void rps_reset_sock_flow(struct rps_sock_flow_table *table, + u32 hash) +{ + if (table && hash) + table->ents[hash & table->mask] = RPS_NO_CPU; +} + +extern struct rps_sock_flow_table *rps_sock_flow_table; + /* This structure contains an instance of an RX queue. */ struct netdev_rx_queue { struct rps_map *rps_map; + struct rps_dev_flow_table *rps_flow_table; struct kobject kobj; struct netdev_rx_queue *first; atomic_t count; } ____cacheline_aligned_in_smp; -#endif +#endif /* CONFIG_RPS */ /* * This structure defines the management hooks for network devices. @@ -1333,11 +1392,19 @@ struct softnet_data { /* Elements below can be accessed between CPUs for RPS */ #ifdef CONFIG_RPS struct call_single_data csd ____cacheline_aligned_in_smp; + unsigned int input_queue_head; #endif struct sk_buff_head input_pkt_queue; struct napi_struct backlog; }; +static inline void incr_input_queue_head(struct softnet_data *queue) +{ +#ifdef CONFIG_RPS + queue->input_queue_head++; +#endif +} + DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data); #define HAVE_NETIF_QUEUE -- cgit v1.1 From 88751275b8e867d756e4f86ae92afe0232de129f Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Mon, 19 Apr 2010 05:07:33 +0000 Subject: rps: shortcut net_rps_action() net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if RPS is not active. 

Tom Herbert used two bitmasks to hold information needed to send IPI,
but a single LIFO list seems more appropriate.

Move all RPS logic into net_rps_action() to clean up net_rx_action() code
(remove two ifdefs)

Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.

In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 649a025..83ab3da 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1381,17 +1381,20 @@ static inline int unregister_gifconf(unsigned int family)
 }
 
 /*
- * Incoming packets are placed on per-cpu queues so that
- * no locking is needed.
+ * Incoming packets are placed on per-cpu queues
  */
 struct softnet_data {
 	struct Qdisc		*output_queue;
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
-	/* Elements below can be accessed between CPUs for RPS */
 #ifdef CONFIG_RPS
+	struct softnet_data	*rps_ipi_list;
+
+	/* Elements below can be accessed between CPUs for RPS */
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
+	struct softnet_data	*rps_ipi_next;
+	unsigned int		cpu;
 	unsigned int		input_queue_head;
 #endif
 	struct sk_buff_head	input_pkt_queue;
--
cgit v1.1

From e36fa2f7e92f25aab2e3d787dcfe3590817f19d3 Mon Sep 17 00:00:00 2001
From: Eric Dumazet
Date: Mon, 19 Apr 2010 21:17:14 +0000
Subject: rps: cleanups

struct softnet_data holds many queues, so consistently using the name
"sd" instead of "queue" is better.

Adds a rps_ipi_queued() helper to clean up enqueue_to_backlog()

Adds a _and_irq_disable suffix to net_rps_action() name, as David
suggested.

incr_input_queue_head() becomes input_queue_head_incr()

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 83ab3da..3c5ed5f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1401,10 +1401,10 @@ struct softnet_data {
 	struct napi_struct	backlog;
 };
 
-static inline void incr_input_queue_head(struct softnet_data *queue)
+static inline void input_queue_head_incr(struct softnet_data *sd)
 {
 #ifdef CONFIG_RPS
-	queue->input_queue_head++;
+	sd->input_queue_head++;
 #endif
 }
 
--
cgit v1.1

From a9cbd588fdb71ea415754c885e2f9f03e6bf1ba0 Mon Sep 17 00:00:00 2001
From: Changli Gao
Date: Mon, 26 Apr 2010 23:06:24 +0000
Subject: net: reimplement softnet_data.output_queue as a FIFO queue

reimplement softnet_data.output_queue as a FIFO queue to keep the
fairness among the qdiscs rescheduled.

Signed-off-by: Changli Gao
Acked-by: Eric Dumazet
----
 include/linux/netdevice.h | 1 +
 net/core/dev.c | 22 ++++++++++++----------
 2 files changed, 13 insertions(+), 10 deletions(-)
Signed-off-by: David S.
Miller
---
 include/linux/netdevice.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3c5ed5f..c04ca24 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1385,6 +1385,7 @@ static inline int unregister_gifconf(unsigned int family)
  */
 struct softnet_data {
 	struct Qdisc		*output_queue;
+	struct Qdisc		**output_queue_tailp;
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
--
cgit v1.1

From 6e7676c1a76aed6e957611d8d7a9e5592e23aeba Mon Sep 17 00:00:00 2001
From: Changli Gao
Date: Tue, 27 Apr 2010 15:07:33 -0700
Subject: net: batch skb dequeueing from softnet input_pkt_queue

batch skb dequeueing from softnet input_pkt_queue to reduce potential
lock contention when RPS is enabled.

Note: in the worst case, the number of packets in a softnet_data may be
double netdev_max_backlog.

Signed-off-by: Changli Gao
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c04ca24..40d4c20 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1388,6 +1388,7 @@ struct softnet_data {
 	struct Qdisc		**output_queue_tailp;
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
+	struct sk_buff_head	process_queue;
 
 #ifdef CONFIG_RPS
 	struct softnet_data	*rps_ipi_list;
@@ -1402,10 +1403,11 @@ struct softnet_data {
 	struct napi_struct	backlog;
 };
 
-static inline void input_queue_head_incr(struct softnet_data *sd)
+static inline void input_queue_head_add(struct softnet_data *sd,
+					unsigned int len)
 {
 #ifdef CONFIG_RPS
-	sd->input_queue_head++;
+	sd->input_queue_head += len;
 #endif
 }
 
--
cgit v1.1

From dee42870a423ad485129f43cddfe7275479f11d8 Mon Sep 17 00:00:00 2001
From: Changli Gao
Date: Sun, 2 May 2010 05:42:16 +0000
Subject: net:
fix softnet_stat

Per cpu variable softnet_data.total was shared between IRQ and SoftIRQ
context without any protection. And enqueue_to_backlog should update the
netdev_rx_stat of the target CPU.

This patch renames softnet_data.total to softnet_data.processed: the
number of packets processed in upper levels (IP stacks).

softnet_stat data is moved into softnet_data.

Signed-off-by: Changli Gao
----
 include/linux/netdevice.h | 17 +++++++----------
 net/core/dev.c | 26 ++++++++++++--------------
 net/sched/sch_generic.c | 2 +-
 3 files changed, 20 insertions(+), 25 deletions(-)
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 40d4c20..c39938f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -218,16 +218,6 @@ struct neighbour;
 struct neigh_parms;
 struct sk_buff;
 
-struct netif_rx_stats {
-	unsigned total;
-	unsigned dropped;
-	unsigned time_squeeze;
-	unsigned cpu_collision;
-	unsigned received_rps;
-};
-
-DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat);
-
 struct netdev_hw_addr {
 	struct list_head	list;
 	unsigned char		addr[MAX_ADDR_LEN];
@@ -1390,6 +1380,12 @@ struct softnet_data {
 	struct sk_buff		*completion_queue;
 	struct sk_buff_head	process_queue;
 
+	/* stats */
+	unsigned		processed;
+	unsigned		time_squeeze;
+	unsigned		cpu_collision;
+	unsigned		received_rps;
+
 #ifdef CONFIG_RPS
 	struct softnet_data	*rps_ipi_list;
 
@@ -1399,6 +1395,7 @@ struct softnet_data {
 	unsigned int		cpu;
 	unsigned int		input_queue_head;
 #endif
+	unsigned		dropped;
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 };
--
cgit v1.1

From cd7b5396e7e4d10c51116f59f414ff90312af8d4 Mon Sep 17 00:00:00 2001
From: "David S.
Miller"
Date: Sun, 2 May 2010 22:27:59 -0700
Subject: net: Use explicit "unsigned int" instead of plain "unsigned" in netdevice.h

Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c39938f..98112fb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -878,7 +878,7 @@ struct net_device {
 	unsigned char		operstate; /* RFC2863 operstate */
 	unsigned char		link_mode; /* mapping policy to operstate */
 
-	unsigned		mtu;	/* interface MTU value */
+	unsigned int		mtu;	/* interface MTU value */
 	unsigned short		type;	/* interface hardware type */
 	unsigned short		hard_header_len;	/* hardware hdr length */
 
@@ -1381,10 +1381,10 @@ struct softnet_data {
 	struct sk_buff_head	process_queue;
 
 	/* stats */
-	unsigned		processed;
-	unsigned		time_squeeze;
-	unsigned		cpu_collision;
-	unsigned		received_rps;
+	unsigned int		processed;
+	unsigned int		time_squeeze;
+	unsigned int		cpu_collision;
+	unsigned int		received_rps;
 
 #ifdef CONFIG_RPS
 	struct softnet_data	*rps_ipi_list;
--
cgit v1.1

From 0e34e93177fb1f642cab080e0bde664c06c7183a Mon Sep 17 00:00:00 2001
From: WANG Cong
Date: Thu, 6 May 2010 00:47:21 -0700
Subject: netpoll: add generic support for bridge and bonding devices

This whole patchset is for adding netpoll support to bridge and bonding
devices. I already tested it for bridge, bonding, bridge over bonding,
and bonding over bridge. It looks fine now.

To make bridge and bonding support netpoll, we need to adjust
some netpoll generic code. This patch does the following things:

1) introduce two new priv_flags for struct net_device:
   IFF_IN_NETPOLL which identifies we are processing a netpoll;
   IFF_DISABLE_NETPOLL is used to disable netpoll support for a device
   at run-time;

2) introduce one new method for netdev_ops:
   ->ndo_netpoll_cleanup() is used to clean up netpoll when a device is
   removed.

3) introduce netpoll_poll_dev() which takes a struct net_device * parameter;
   export netpoll_send_skb() and netpoll_poll_dev() which will be used later;

4) hide a pointer to struct netpoll in struct netpoll_info, ditto.

5) introduce ->real_dev for struct netpoll.

6) introduce a new status NETDEV_BONDING_DESLAVE, which is used to disable
   netconsole before releasing a slave, to avoid deadlocks.

Cc: David Miller
Cc: Neil Horman
Signed-off-by: WANG Cong
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 98112fb..69022d4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -724,6 +724,7 @@ struct net_device_ops {
 						        unsigned short vid);
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	void			(*ndo_poll_controller)(struct net_device *dev);
+	void			(*ndo_netpoll_cleanup)(struct net_device *dev);
 #endif
 	int			(*ndo_set_vf_mac)(struct net_device *dev,
 						  int queue, u8 *mac);
--
cgit v1.1

From 3b098e2d7c693796cc4dffb07caa249fc0f70771 Mon Sep 17 00:00:00 2001
From: Eric Dumazet
Date: Sat, 15 May 2010 23:57:10 -0700
Subject: net: Consistent skb timestamping

With RPS inclusion, skb timestamping is not consistent in RX path.

If netif_receive_skb() is used, it's deferred until after RPS dispatch.

If netif_rx() is used, it's done before RPS dispatch.

This can give strange tcpdump timestamp results.

I think timestamping should be done as soon as possible in the receive
path, to get meaningful values (i.e. timestamps taken at the time the
packet was delivered by the NIC driver to our stack), even if NAPI
already can defer timestamping a bit (RPS can help to reduce the gap)

Tom Herbert prefers to sample timestamps after RPS dispatch. In case
sampling is expensive (HPET/acpi_pm on x86), this makes sense.
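
The two sampling points discussed in this commit message, and the netdev_tstamp_prequeue switch the patch adds to choose between them, can be modelled in a few lines of user-space C. This is a hedged sketch and not the kernel code: the struct, the fake clock and the function names are all hypothetical, and the only assumption carried over is that a packet is stamped at most once, either on entry to the stack or when the target CPU processes its backlog.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packet: tstamp == 0 means "not stamped yet". */
struct pkt {
	uint64_t tstamp;
};

/* Models the netdev_tstamp_prequeue sysctl (1 = stamp before queueing). */
static int tstamp_prequeue = 1;

/* Hypothetical clock source so the model stays deterministic. */
static uint64_t fake_clock;

/* Stamp a packet only if nobody stamped it earlier. */
static void stamp_if_unset(struct pkt *p)
{
	if (!p->tstamp)
		p->tstamp = fake_clock;
}

/* The driver hands the packet to the stack, before RPS dispatch. */
static void rx_entry(struct pkt *p)
{
	if (tstamp_prequeue)
		stamp_if_unset(p);
}

/* The target CPU processes its backlog, after RPS dispatch. */
static void backlog_process(struct pkt *p)
{
	if (!tstamp_prequeue)
		stamp_if_unset(p);
}
```

With the flag set to 1 the packet carries the time it entered the stack; with 0 it carries the later time at which the remote CPU picked it up, which moves the cost of reading the clock off the CPU that received the packet.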

Let admins switch from one mode to another, using a new sysctl,
/proc/sys/net/core/netdev_tstamp_prequeue

Its default value (1) means timestamps are taken as soon as possible,
before backlog queueing, giving accurate timestamps.

Setting it to 0 samples timestamps when processing the backlog, after
RPS dispatch, to lower the load on the pre-RPS cpu.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
---
 include/linux/netdevice.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 69022d4..c1b2341 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2100,6 +2100,7 @@ extern const struct net_device_stats *dev_get_stats(struct net_device *dev);
 extern void		dev_txq_stats_fold(const struct net_device *dev, struct net_device_stats *stats);
 
 extern int		netdev_max_backlog;
+extern int		netdev_tstamp_prequeue;
 extern int		weight_p;
 extern int		netdev_set_master(struct net_device *dev, struct net_device *master);
 extern int skb_checksum_help(struct sk_buff *skb);
--
cgit v1.1

From 57b610805ce92dbd79fc97509f80fa5391b99623 Mon Sep 17 00:00:00 2001
From: Scott Feldman
Date: Mon, 17 May 2010 22:49:55 -0700
Subject: net: Add netlink support for virtual port management (was iovnl)

Add new netdev ops ndo_{set|get}_vf_port to allow setting of
port-profile on a netdev interface. Extends netlink socket RTM_SETLINK/
RTM_GETLINK with two new sub msgs called IFLA_VF_PORTS and IFLA_PORT_SELF
(added to end of IFLA_cmd list). These are both nested attributes
using this layout:

    [IFLA_NUM_VF]
    [IFLA_VF_PORTS]
            [IFLA_VF_PORT]
                    [IFLA_PORT_*], ...
            [IFLA_VF_PORT]
                    [IFLA_PORT_*], ...
            ...
    [IFLA_PORT_SELF]
            [IFLA_PORT_*], ...

These attributes are designed to be set and get symmetrically. VF_PORTS
is a list of VF_PORTs, one for each VF, when dealing with an SR-IOV
device.
PORT_SELF is for the PF of the SR-IOV device, in case it wants
to also have a port-profile, or for the case where the VF==PF, like in
enic patch 2/2 of this patch set.

A port-profile is used to configure/enable the external switch virtual port
backing the netdev interface, not to configure the host-facing side of the
netdev. A port-profile is an identifier known to the switch. How
port-profiles are installed on the switch or how available port-profiles
are made known to the host is outside the scope of this patch.

There are two types of port-profile specs in the netlink msg. The first
spec is for the 802.1Qbg (pre-)standard VDP protocol. The second spec is
for devices that run a protocol similar to VDP but in firmware, thus
hiding the protocol details. In either case, the specs have much in common
and it makes sense to define the netlink msg as the union of the two specs.
For example, both specs have a notion of associating/disassociating a
port-profile. And both specs require some information from the hypervisor
manager, such as client port instance ID.

The general flow is the port-profile is applied to a host netdev interface
using RTM_SETLINK, the receiver of the RTM_SETLINK msg communicates with the
switch, and the switch virtual port backing the host netdev interface is
configured/enabled based on the settings defined by the port-profile. What
those settings comprise, and how those settings are managed is again
outside the scope of this patch, since this patch only deals with the
first step in the flow.

Signed-off-by: Scott Feldman
Signed-off-by: Roopa Prabhu
Signed-off-by: David S.
Miller
---
 include/linux/netdevice.h | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'include/linux/netdevice.h')

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c1b2341..c3487a6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -686,6 +686,9 @@ struct netdev_rx_queue {
  * int (*ndo_set_vf_tx_rate)(struct net_device *dev, int vf, int rate);
  * int (*ndo_get_vf_config)(struct net_device *dev,
  *			    int vf, struct ifla_vf_info *ivf);
+ * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
+ *			  struct nlattr *port[]);
+ * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
  */
 #define HAVE_NET_DEVICE_OPS
 struct net_device_ops {
@@ -735,6 +738,11 @@ struct net_device_ops {
 	int			(*ndo_get_vf_config)(struct net_device *dev,
 						     int vf,
 						     struct ifla_vf_info *ivf);
+	int			(*ndo_set_vf_port)(struct net_device *dev,
+						   int vf,
+						   struct nlattr *port[]);
+	int			(*ndo_get_vf_port)(struct net_device *dev,
+						   int vf, struct sk_buff *skb);
 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	int			(*ndo_fcoe_enable)(struct net_device *dev);
 	int			(*ndo_fcoe_disable)(struct net_device *dev);
--
cgit v1.1
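
The nested [IFLA_VF_PORTS] / [IFLA_VF_PORT] / [IFLA_PORT_*] layout from the virtual port commit message can be illustrated with a small user-space encoder. This is a sketch under assumptions rather than the kernel's netlink code: the attribute type numbers in the enum are made up for the example (the real IFLA_* values are defined in the kernel's if_link.h), the helper names are hypothetical, and the header only mirrors the length/type shape of struct nlattr with 4-byte padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Made-up type numbers, for illustration only. */
enum { X_VF_PORTS = 1, X_VF_PORT, X_PORT_VF, X_PORT_PROFILE };

/* Length/type header with the same shape as struct nlattr. */
struct attr_hdr {
	uint16_t len;	/* header + payload, excluding padding */
	uint16_t type;
};

#define ALIGN4(n) (((n) + (size_t)3) & ~(size_t)3)

struct msg {
	unsigned char buf[256];
	size_t used;
};

/* Append one attribute: header, payload, zero padding to 4 bytes. */
static void put_attr(struct msg *m, uint16_t type, const void *data,
		     size_t dlen)
{
	struct attr_hdr h;
	size_t total = ALIGN4(sizeof(h) + dlen);

	assert(m->used + total <= sizeof(m->buf));
	h.len = (uint16_t)(sizeof(h) + dlen);
	h.type = type;
	memset(m->buf + m->used, 0, total);
	memcpy(m->buf + m->used, &h, sizeof(h));
	if (dlen)
		memcpy(m->buf + m->used + sizeof(h), data, dlen);
	m->used += total;
}

/* Open a nested attribute and return the offset of its header. */
static size_t nest_start(struct msg *m, uint16_t type)
{
	size_t off = m->used;

	put_attr(m, type, NULL, 0);
	return off;
}

/* Close a nested attribute by patching in its final length. */
static void nest_end(struct msg *m, size_t off)
{
	struct attr_hdr h;

	memcpy(&h, m->buf + off, sizeof(h));
	h.len = (uint16_t)(m->used - off);
	memcpy(m->buf + off, &h, sizeof(h));
}

/* One VF_PORT nest per VF, each holding the VF index and a profile name. */
static size_t build_vf_ports(struct msg *m, uint32_t num_vfs,
			     const char *profile)
{
	size_t ports = nest_start(m, X_VF_PORTS);
	uint32_t vf;

	for (vf = 0; vf < num_vfs; vf++) {
		size_t port = nest_start(m, X_VF_PORT);

		put_attr(m, X_PORT_VF, &vf, sizeof(vf));
		put_attr(m, X_PORT_PROFILE, profile, strlen(profile) + 1);
		nest_end(m, port);
	}
	nest_end(m, ports);
	return m->used;
}
```

A PORT_SELF nest would follow the same start/put/end pattern at the top level. Patching the length when a nest is closed is what lets the same message be walked symmetrically on the get side.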