aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4/ip_output.c
Commit message (Collapse)AuthorAgeFilesLines
* net: core: Support UID-based routing.Lorenzo Colitti2016-03-111-1/+2
| | | | | | | | | | | | | | This contains the following commits: 1. 0149763 net: core: Add a UID range to fib rules. 2. 1650474 net: core: Use the socket UID in routing lookups. 3. 0b16771 net: ipv4: Add the UID to the route cache. 4. ee058f1 net: core: Add a RTA_UID attribute to routes. This is so that userspace can do per-UID route lookups. Bug: 15413527 Change-Id: I1285474c6734614d3bda6f61d88dfe89a4af7892 Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
* ipv4: use a 64bit load/store in output pathEric Dumazet2016-03-111-4/+17
| | | | | | | | | | | | | | gcc compiler is smart enough to use a single load/store if we memcpy(dptr, sptr, 8) on x86_64, regardless of CONFIG_CC_OPTIMIZE_FOR_SIZE In IP header, daddr immediately follows saddr, this wont change in the future. We only need to make sure our flowi4 (saddr,daddr) fields wont break the rule. Change-Id: Iad9c8fd9121ec84c2599b013badaebba92db7c39 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAITEric Dumazet2016-03-111-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | There is a long standing bug in linux tcp stack, about ACK messages sent on behalf of TIME_WAIT sockets. In the IP header of the ACK message, we choose to reflect TOS field of incoming message, and this might break some setups. Example of things that were broken : - Routing using TOS as a selector - Firewalls - Trafic classification / shaping We now remember in timewait structure the inet tos field and use it in ACK generation, and route lookup. Notes : - We still reflect incoming TOS in RST messages. - We could extend MuraliRaja Muniraju patch to report TOS value in netlink messages for TIME_WAIT sockets. - A patch is needed for IPv6 Change-Id: Ic7ad8a7b858de181bfe2a789c472f84955397d4c Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip: generate unique IP identificator if local fragmentation is allowedAnsis Atteka2013-10-131-3/+3
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 703133de331a7a7df47f31fb9de51dc6f68a9de8 ] If local fragmentation is allowed, then ip_select_ident() and ip_select_ident_more() need to generate unique IDs to ensure correct defragmentation on the peer. For example, if IPsec (tunnel mode) has to encrypt large skbs that have local_df bit set, then all IP fragments that belonged to different ESP datagrams would have used the same identificator. If one of these IP fragments would get lost or reordered, then peer could possibly stitch together wrong IP fragments that did not belong to the same datagram. This would lead to a packet loss or data corruption. Signed-off-by: Ansis Atteka <aatteka@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: fix NULL dereferences in check_peer_redir()Eric Dumazet2012-02-131-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d3aaeb38c40e5a6c08dd31a1b64da65c4352be36, along with dependent backports of commits: 69cce1d1404968f78b177a0314f5822d5afdbbfb 9de79c127cccecb11ae6a21ab1499e87aa222880 218fa90f072e4aeff9003d57e390857f4f35513e 580da35a31f91a594f3090b7a2c39b85cb051a12 f7e57044eeb1841847c24aa06766c8290c202583 e049f28883126c689cf95859480d9ee4ab23b7fa ] Gergely Kalman reported crashes in check_peer_redir(). It appears commit f39925dbde778 (ipv4: Cache learned redirect information in inetpeer.) added a race, leading to possible NULL ptr dereference. Since we can now change dst neighbour, we should make sure a reader can safely use a neighbour. Add RCU protection to dst neighbour, and make sure check_peer_redir() can be called safely by different cpus in parallel. As neighbours are already freed after one RCU grace period, this patch should not add typical RCU penalty (cache cold effects) Many thanks to Gergely for providing a pretty report pointing to the bug. Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ipv4: Constrain UFO fragment sizes to multiples of 8 bytesBill Sommerfeld2011-08-151-3/+3
| | | | | | | | | | | | | | | [ Upstream commit d9be4f7a6f5a8da3133b832eca41c3591420b1ca ] Because the ip fragment offset field counts 8-byte chunks, ip fragments other than the last must contain a multiple of 8 bytes of payload. ip_ufo_append_data wasn't respecting this constraint and, depending on the MTU and ip option sizes, could create malformed non-final fragments. Google-Bug-Id: 5009328 Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* ipv4: Don't use ufo handling on later transformed packetsSteffen Klassert2011-07-011-1/+1
| | | | | | | | | | | We might call ip_ufo_append_data() for packets that will be IPsec transformed later. This function should be used just for real udp packets. So we check for rt->dst.header_len which is only nonzero on IPsec handling and call ip_ufo_append_data() just if rt->dst.header_len is zero. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Fix IPsec slowpath fragmentation problemSteffen Klassert2011-06-271-5/+5
| | | | | | | | | | | | | ip_append_data() builds packets based on the mtu from dst_mtu(rt->dst.path). On IPsec the effective mtu is lower because we need to add the protocol headers and trailers later when we do the IPsec transformations. So after the IPsec transformations the packet might be too big, which leads to a slowpath fragmentation then. This patch fixes this by building the packets based on the lower IPsec mtu from dst_mtu(&rt->dst) and adapts the exthdr handling to this. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Fix packet size calculation in __ip_append_dataSteffen Klassert2011-06-271-5/+2
| | | | | | | | | | | | | Git commit 59104f06 (ip: take care of last fragment in ip_append_data) added a check to see if we exceed the mtu when we add trailer_len. However, the mtu is already subtracted by the trailer length when the xfrm transfomation bundles are set up. So IPsec packets with mtu size get fragmented, or if the DF bit is set the packets will not be send even though they match the mtu perfectly fine. This patch actually reverts commit 59104f06. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Fix packet size calculation for raw IPsec packets in __ip_append_dataSteffen Klassert2011-06-091-3/+3
| | | | | | | | | | | | | | We assume that transhdrlen is positive on the first fragment which is wrong for raw packets. So we don't add exthdrlen to the packet size for raw packets. This leads to a reallocation on IPsec because we have not enough headroom on the skb to place the IPsec headers. This patch fixes this by adding exthdrlen to the packet size whenever the send queue of the socket is empty. This issue was introduced with git commit 1470ddf7 (inet: Remove explicit write references to sk/inet in ip_append_data) Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Always call ip_options_build() after rest of IP header is filled in.David S. Miller2011-05-131-4/+5
| | | | | | | This will allow ip_options_build() to reliably look at the values of iph->{daddr,saddr} Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Pass explicit daddr arg to ip_send_reply().David S. Miller2011-05-101-4/+3
| | | | | | This eliminates an access to rt->rt_src. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Pass flow key down into ip_append_*().David S. Miller2011-05-081-8/+10
| | | | | | This way rt->rt_dst accesses are unnecessary. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Pass flow keys down into datagram packet building engine.David S. Miller2011-05-081-20/+19
| | | | | | | | | This way ip_output.c no longer needs rt->rt_{src,dst}. We already have these keys sitting, ready and waiting, on the stack or in a socket structure. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit().David S. Miller2011-05-081-4/+4
| | | | | | Now we can pick it out of the provided flow key. Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: Pass flowi to ->queue_xmit().David S. Miller2011-05-081-2/+2
| | | | | | | | | | | This allows us to acquire the exact route keying information from the protocol, however that might be managed. It handles all of the possibilities, from the simplest case of storing the key in inet->cork.fl to the more complex setup SCTP has where individual transports determine the flow. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Use cork flow in ip_queue_xmit()David S. Miller2011-05-081-2/+3
| | | | | | | | | | | All invokers of ip_queue_xmit() must make certain that the socket is locked. All of SCTP, TCP, DCCP, and L2TP now make sure this is the case. Therefore we can use the cork flow during output route lookup in ip_queue_xmit() when the socket route check fails. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Initialize cork->opt using NULL not 0.David S. Miller2011-05-061-1/+1
| | | | | | Noticed by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Initialize on-stack cork more efficiently.David S. Miller2011-05-061-1/+4
| | | | | | | | | | ip_setup_cork() explicitly initializes every member of inet_cork except flags, addr, and opt. So we can simply set those three members to zero instead of using a memset() via an empty struct assignment. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
* inet: Decrease overhead of on-stack inet_cork.David S. Miller2011-05-061-10/+12
| | | | | | | | | | | | | | | | When we fast path datagram sends to avoid locking by putting the inet_cork on the stack we use up lots of space that isn't necessary. This is because inet_cork contains a "struct flowi" which isn't used in these code paths. Split inet_cork to two parts, "inet_cork" and "inet_cork_full". Only the latter of which has the "struct flowi" and is what is stored in inet_sock. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
* ipv4: In ip_build_and_send_pkt() use 'saddr' and 'daddr' args passed in.David S. Miller2011-05-041-2/+2
| | | | | | | | | | | Instead of rt->rt_{dst,src} The only tricky part is source route option handling. If the source route option is enabled we can't just use plain 'daddr', we have to use opt->opt.faddr. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Make caller provide on-stack flow key to ip_route_output_ports().David S. Miller2011-05-031-1/+2
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: add RCU protection to inet->optEric Dumazet2011-04-281-23/+21
| | | | | | | | | | | | | | | | | | | | | | | We lack proper synchronization to manipulate inet->opt ip_options Problem is ip_make_skb() calls ip_setup_cork() and ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options), without any protection against another thread manipulating inet->opt. Another thread can change inet->opt pointer and free old one under us. Use RCU to protect inet->opt (changed to inet->inet_opt). Instead of handling atomic refcounts, just copy ip_options when necessary, to avoid cache line dirtying. We cant insert an rcu_head in struct ip_options since its included in skb->cb[], so this patch is large because I had to introduce a new ip_options_rcu structure. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2011-04-111-1/+1
|\ | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/smsc911x.c
| * Fix common misspellingsLucas De Marchi2011-03-311-1/+1
| | | | | | | | | | | | Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
* | ipv4: Use flowi4_init_output() in ip_send_reply()David S. Miller2011-03-311-10/+8
|/ | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Put fl4_* macros to struct flowi4 and use them again.David S. Miller2011-03-121-2/+2
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Use flowi4 in public route lookup interfaces.David S. Miller2011-03-121-11/+11
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Make flowi ports AF dependent.David S. Miller2011-03-121-2/+2
| | | | | | | | | | | | Create two sets of port member accessors, one set prefixed by fl4_* and the other prefixed by fl6_* This will let us to create AF optimal flow instances. It will work because every context in which we access the ports, we have to be fully aware of which AF the flowi is anyways. Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Put flowi_* prefix on AF independent members of struct flowiDavid S. Miller2011-03-121-8/+10
| | | | | | | | | | I intend to turn struct flowi into a union of AF specific flowi structs. There will be a common structure that each variant includes first, much like struct sock_common. This is the first step to move in that direction. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Create and use route lookup helpers.David S. Miller2011-03-121-20/+13
| | | | | | | The idea here is this minimizes the number of places one has to edit in order to make changes to how flows are defined and used. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Make output route lookup return rtable directly.David S. Miller2011-03-021-2/+4
| | | | | | Instead of on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: Replace left-over references to inet->corkHerbert Xu2011-03-011-2/+2
| | | | | | | | The patch to replace inet->cork with cork left out two spots in __ip_append_data that can result in bogus packet construction. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Kill can_sleep arg to ip_route_output_flow()David S. Miller2011-03-011-1/+1
| | | | | | This boolean state is now available in the flow flags. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep"David S. Miller2011-03-011-1/+1
| | | | | | Since that is what the current vague "flags" argument means. Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: Add ip_make_skb and ip_finish_skbHerbert Xu2011-03-011-14/+51
| | | | | | | | | | | | | | | | | | | | | | This patch adds the helper ip_make_skb which is like ip_append_data and ip_push_pending_frames all rolled into one, except that it does not send the skb produced. The sending part is carried out by ip_send_skb, which the transport protocol can call after it has tweaked the skb. It is meant to be called in cases where corking is not used should have a one-to-one correspondence to sendmsg. This patch also adds the helper ip_finish_skb which is meant to be replace ip_push_pending_frames when corking is required. Previously the protocol stack would peek at the socket write queue and add its header to the first packet. With ip_finish_skb, the protocol stack can directly operate on the final skb instead, just like the non-corking case with ip_make_skb. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: Remove explicit write references to sk/inet in ip_append_dataHerbert Xu2011-03-011-98/+140
| | | | | | | | | | | | | | In order to allow simultaneous calls to ip_append_data on the same socket, it must not modify any shared state in sk or inet (other than those that are designed to allow that such as atomic counters). This patch abstracts out write references to sk and inet_sk in ip_append_data and its friends so that we may use the underlying code in parallel. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: Remove unused sk_sndmsg_* from UFOHerbert Xu2011-03-011-1/+0
| | | | | | | | | | UFO doesn't really use the sk_sndmsg_* parameters so touching them is pointless. It can't use them anyway since the whole point of UFO is to use the original pages without copying. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Don't pre-seed hoplimit metric.David S. Miller2010-12-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Always go through a new ip4_dst_hoplimit() helper, just like ipv6. This allowed several simplifications: 1) The interim dst_metric_hoplimit() can go as it's no longer userd. 2) The sysctl_ip_default_ttl entry no longer needs to use ipv4_doint_and_flush, since the sysctl is not cached in routing cache metrics any longer. 3) ipv4_doint_and_flush no longer needs to be exported and therefore can be marked static. When ipv4_doint_and_flush_strategy was removed some time ago, the external declaration in ip.h was mistakenly left around so kill that off too. We have to move the sysctl_ip_default_ttl declaration into ipv4's route cache definition header net/route.h, because currently net/ip.h (where the declaration lives now) has a back dependency on net/route.h Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Abstract RTAX_HOPLIMIT metric accesses behind helper.David S. Miller2010-12-121-1/+1
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* net: use the macros defined for the members of flowiChangli Gao2010-11-171-15/+10
| | | | | | | Use the macros defined for the members of flowi to clean the code up. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2010-09-271-6/+13
|\ | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/qlcnic/qlcnic_init.c net/ipv4/ip_output.c
| * ip: fix truesize mismatch in ip fragmentationEric Dumazet2010-09-211-6/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Special care should be taken when slow path is hit in ip_fragment() : When walking through frags, we transfert truesize ownership from skb to frags. Then if we hit a slow_path condition, we must undo this or risk uncharging frags->truesize twice, and in the end, having negative socket sk_wmem_alloc counter, or even freeing socket sooner than expected. Many thanks to Nick Bowler, who provided a very clean bug report and test program. Thanks to Jarek for reviewing my first patch and providing a V2 While Nick bisection pointed to commit 2b85a34e911 (net: No more expensive sock_hold()/sock_put() on each tx), underlying bug is older (2.6.12-rc5) A side effect is to extend work done in commit b2722b1c3a893e (ip_fragment: also adjust skb->truesize for packets not owned by a socket) to ipv6 as well. Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com> Tested-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip: take care of last fragment in ip_append_dataEric Dumazet2010-09-241-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While investigating a bit, I found ip_fragment() slow path was taken because ip_append_data() provides following layout for a send(MTU + N*(MTU - 20)) syscall : - one skb with 1500 (mtu) bytes - N fragments of 1480 (mtu-20) bytes (before adding IP header) last fragment gets 17 bytes of trail data because of following bit: if (datalen == length + fraggap) alloclen += rt->dst.trailer_len; Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm... another bug ?) In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu, so we take slow path, building another skb chain. In order to avoid taking slow path, we should correct ip_append_data() to make sure last fragment has real trail space, under mtu... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ip_append_data() optimEric Dumazet2010-08-241-4/+3
| | | | | | | | | | | | | | Compiler is not smart enough to avoid a conditional branch. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Rename skb_has_frags to skb_has_frag_listDavid S. Miller2010-08-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | SKBs can be "fragmented" in two ways, via a page array (called skb_shinfo(skb)->frags[]) and via a list of SKBs (called skb_shinfo(skb)->frag_list). Since skb_has_frags() tests the latter, it's name is confusing since it sounds more like it's testing the former. Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: simplify flags for tx timestampingOliver Hartkopp2010-08-191-3/+3
|/ | | | | | | | | | | | | This patch removes the abstraction introduced by the union skb_shared_tx in the shared skb data. The access of the different union elements at several places led to some confusion about accessing the shared tx_flags e.g. in skb_orphan_try(). http://marc.info/?l=linux-netdev&m=128084897415886&w=2 Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twiceChangli Gao2010-08-021-4/+2
| | | | | | | | | | | | | | 6c79bf0f2440fd250c8fce8d9b82fcf03d4e8350 subtracts PPPOE_SES_HLEN from mtu at the front of ip_fragment(). So the later subtraction should be removed. The MTU of 802.1q is also 1500, so MTU should not be changed. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Bart De Schuymer <bdschuym@pandora.bo> ---- net/ipv4/ip_output.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) Signed-off-by: Bart De Schuymer <bdschuym@pandora.bo> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/ipv4: EXPORT_SYMBOL cleanupsEric Dumazet2010-07-121-6/+3
| | | | | | | | | CodingStyle cleanups EXPORT_SYMBOL should immediately follow the symbol declaration. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/ipv4/ip_output.c: Removal of unused variable in ip_fragment()George Kadianakis2010-07-071-2/+1
| | | | | | | Removal of unused integer variable in ip_fragment(). Signed-off-by: George Kadianakis <desnacked@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>