aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4
Commit message (Collapse)AuthorAgeFilesLines
* tcp: allow splice() to build full TSO packetsEric Dumazet2012-04-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ This combines upstream commit 2f53384424251c06038ae612e56231b96ab610ee and the follow-on bug fix commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 ] vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096) The call to tcp_push() at the end of do_tcp_sendpages() forces an immediate xmit when pipe is not already filled, and tso_fragment() try to split these skb to MSS multiples. 4096 bytes are usually split in a skb with 2 MSS, and a remaining sub-mss skb (assuming MTU=1500) This makes slow start suboptimal because many small frames are sent to qdisc/driver layers instead of big ones (constrained by cwnd and packets in flight of course) In fact, applications using sendmsg() (adding an additional memory copy) instead of vmsplice()/splice()/sendfile() are a bit faster because of this anomaly, especially if serving small files in environments with large initial [c]wnd. Call tcp_push() only if MSG_MORE is not set in the flags parameter. This bit is automatically provided by splice() internals but for the last page, or on all pages if user specified SPLICE_F_MORE splice() flag. In some workloads, this can reduce number of sent logical packets by an order of magnitude, making zero-copy TCP actually faster than one-copy :) Reported-by: Tom Herbert <therbert@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: fix syncookie regressionEric Dumazet2012-03-232-17/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit dfd25ffffc132c00070eed64200e8950da5d7e9d ] commit ea4fc0d619 (ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit()) added a serious regression on synflood handling. Simon Kirby discovered a successful connection was delayed by 20 seconds before being responsive. In my tests, I discovered that xmit frames were lost, and needed ~4 retransmits and a socket dst rebuild before being really sent. In case of syncookie initiated connection, we use a different path to initialize the socket dst, and inet->cork.fl.u.ip4 is left cleared. As ip_queue_xmit() now depends on inet flow being setup, fix this by copying the temp flowi4 we use in cookie_v4_check(). Reported-by: Simon Kirby <sim@netnation.com> Bisected-by: Simon Kirby <sim@netnation.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: fix tcp_shift_skb_data() to not shift SACKed data below snd_unaNeal Cardwell2012-03-191-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 4648dc97af9d496218a05353b0e442b3dfa6aaab ] This commit fixes tcp_shift_skb_data() so that it does not shift SACKed data below snd_una. This fixes an issue whose symptoms exactly match reports showing tp->sacked_out going negative since 3.3.0-rc4 (see "WARNING: at net/ipv4/tcp_input.c:3418" thread on netdev). Since 2008 (832d11c5cd076abc0aa1eaf7be96c81d1a59ce41) tcp_shift_skb_data() had been shifting SACKed ranges that were below snd_una. It checked that the *end* of the skb it was about to shift from was above snd_una, but did not check that the end of the actual shifted range was above snd_una; this commit adds that check. Shifting SACKed ranges below snd_una is problematic because for such ranges tcp_sacktag_one() short-circuits: it does not declare anything as SACKed and does not increase sacked_out. Before the fixes in commits cc9a672ee522d4805495b98680f4a3db5d0a0af9 and daef52bab1fd26e24e8e9578f8fb33ba1d0cb412, shifting SACKed ranges below snd_una happened to work because tcp_shifted_skb() was always (incorrectly) passing in to tcp_sacktag_one() an skb whose end_seq tcp_shift_skb_data() had already guaranteed was beyond snd_una. Hence tcp_sacktag_one() never short-circuited and always increased tp->sacked_out in this case. After those two fixes, my testing has verified that shifting SACKed ranges below snd_una could cause tp->sacked_out to go negative with the following sequence of events: (1) tcp_shift_skb_data() sees an skb whose end_seq is beyond snd_una, then shifts a prefix of that skb that is below snd_una (2) tcp_shifted_skb() increments the packet count of the already-SACKed prev sk_buff (3) tcp_sacktag_one() sees the end of the new SACKed range is below snd_una, so it short-circuits and doesn't increase tp->sacked_out (5) tcp_clean_rtx_queue() sees the SACKed skb has been ACKed, decrements tp->sacked_out by this "inflated" pcount that was missing a matching increase in tp->sacked_out, and hence tp->sacked_out underflows to a u32 like 0xFFFFFFFF, which casted to s32 is negative. (6) this leads to the warnings seen in the recent "WARNING: at net/ipv4/tcp_input.c:3418" thread on the netdev list; e.g.: tcp_input.c:3418 WARN_ON((int)tp->sacked_out < 0); More generally, I think this bug can be tickled in some cases where two or more ACKs from the receiver are lost and then a DSACK arrives that is immediately above an existing SACKed skb in the write queue. This fix changes tcp_shift_skb_data() to abort this sequence at step (1) in the scenario above by noticing that the bytes are below snd_una and not shifting them. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: don't fragment SACKed skbs in tcp_mark_head_lost()Neal Cardwell2012-03-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c0638c247f559e1a16ee79e54df14bca2cb679ea ] In tcp_mark_head_lost() we should not attempt to fragment a SACKed skb to mark the first portion as lost. This is for two primary reasons: (1) tcp_shifted_skb() coalesces adjacent regions of SACKed skbs. When doing this, it preserves the sum of their packet counts in order to reflect the real-world dynamics on the wire. But given that skbs can have remainders that do not align to MSS boundaries, this packet count preservation means that for SACKed skbs there is not necessarily a direct linear relationship between tcp_skb_pcount(skb) and skb->len. Thus tcp_mark_head_lost()'s previous attempts to fragment off and mark as lost a prefix of length (packets - oldcnt)*mss from SACKed skbs were leading to occasional failures of the WARN_ON(len > skb->len) in tcp_fragment() (which used to be a BUG_ON(); see the recent "crash in tcp_fragment" thread on netdev). (2) there is no real point in fragmenting off part of a SACKed skb and calling tcp_skb_mark_lost() on it, since tcp_skb_mark_lost() is a NOP for SACKed skbs. Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: fix false reordering signal in tcp_shifted_skbNeal Cardwell2012-03-191-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 4c90d3b30334833450ccbb02f452d4972a3c3c3f ] When tcp_shifted_skb() shifts bytes from the skb that is currently pointed to by 'highest_sack' then the increment of TCP_SKB_CB(skb)->seq implicitly advances tcp_highest_sack_seq(). This implicit advancement, combined with the recent fix to pass the correct SACKed range into tcp_sacktag_one(), caused tcp_sacktag_one() to think that the newly SACKed range was before the tcp_highest_sack_seq(), leading to a call to tcp_update_reordering() with a degree of reordering matching the size of the newly SACKed range (typically just 1 packet, which is a NOP, but potentially larger). This commit fixes this by simply calling tcp_sacktag_one() before the TCP_SKB_CB(skb)->seq advancement that can advance our notion of the highest SACKed sequence. Correspondingly, we can simplify the code a little now that tcp_shifted_skb() should update the lost_cnt_hint in all cases where skb == tp->lost_skb_hint. Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ipsec: be careful of non existing mac headersEric Dumazet2012-03-192-8/+3
| | | | | | | | | | | | | | | | | [ Upstream commit 03606895cd98c0a628b17324fd7b5ff15db7e3cd ] Niccolo Belli reported ipsec crashes in case we handle a frame without mac header (atm in his case) Before copying mac header, better make sure it is present. Bugzilla reference: https://bugzilla.kernel.org/show_bug.cgi?id=42809 Reported-by: Niccolò Belli <darkbasic@linuxsystems.it> Tested-by: Niccolò Belli <darkbasic@linuxsystems.it> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ipv4: fix redirect handlingEric Dumazet2012-02-291-51/+58
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9cc20b268a5a14f5e57b8ad405a83513ab0d78dc ] commit f39925dbde77 (ipv4: Cache learned redirect information in inetpeer.) introduced a regression in ICMP redirect handling. It assumed ipv4_dst_check() would be called because all possible routes were attached to the inetpeer we modify in ip_rt_redirect(), but thats not true. commit 7cc9150ebe (route: fix ICMP redirect validation) tried to fix this but solution was not complete. (It fixed only one route) So we must lookup existing routes (including different TOS values) and call check_peer_redir() on them. Reported-by: Ivan Zahariev <famzah@icdsoft.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* route: fix ICMP redirect validationFlavio Leitner2012-02-291-5/+31
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 7cc9150ebe8ec06cafea9f1c10d92ddacf88d8ae ] The commit f39925dbde7788cfb96419c0f092b086aa325c0f (ipv4: Cache learned redirect information in inetpeer.) removed some ICMP packet validations which are required by RFC 1122, section 3.2.2.2: ... A Redirect message SHOULD be silently discarded if the new gateway address it specifies is not on the same connected (sub-) net through which the Redirect arrived [INTRO:2, Appendix A], or if the source of the Redirect is not the current first-hop gateway for the specified destination (see Section 3.3.1). Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: fix tcp_shifted_skb() adjustment of lost_cnt_hint for FACKNeal Cardwell2012-02-291-0/+4
| | | | | | | | | | | | | | | | | [ Upstream commit 0af2a0d0576205dda778d25c6c344fc6508fc81d ] This commit ensures that lost_cnt_hint is correctly updated in tcp_shifted_skb() for FACK TCP senders. The lost_cnt_hint adjustment in tcp_sacktag_one() only applies to non-FACK senders, so FACK senders need their own adjustment. This applies the spirit of 1e5289e121372a3494402b1b131b41bfe1cf9b7f - except now that the sequence range passed into tcp_sacktag_one() is correct we need only have a special case adjustment for FACK. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: fix range tcp_shifted_skb() passes to tcp_sacktag_one()Neal Cardwell2012-02-291-9/+10
| | | | | | | | | | | | | | | | | | | | [ Upstream commit daef52bab1fd26e24e8e9578f8fb33ba1d0cb412 ] Fix the newly-SACKed range to be the range of newly-shifted bytes. Previously - since 832d11c5cd076abc0aa1eaf7be96c81d1a59ce41 - tcp_shifted_skb() incorrectly called tcp_sacktag_one() with the start and end sequence numbers of the skb it passes in set to the range just beyond the range that is newly-SACKed. This commit also removes a special-case adjustment to lost_cnt_hint in tcp_shifted_skb() since the pre-existing adjustment of lost_cnt_hint in tcp_sacktag_one() now properly handles this things now that the correct start sequence number is passed in. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: allow tcp_sacktag_one() to tag ranges not aligned with skbsNeal Cardwell2012-02-291-14/+22
| | | | | | | | | | | | | | | [ Upstream commit cc9a672ee522d4805495b98680f4a3db5d0a0af9 ] This commit allows callers of tcp_sacktag_one() to pass in sequence ranges that do not align with skb boundaries, as tcp_shifted_skb() needs to do in an upcoming fix in this patch series. In fact, now tcp_sacktag_one() does not need to depend on an input skb at all, which makes its semantics and dependencies more clear. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp_v4_send_reset: binding oif to iif in no sock caseShawn Lu2012-02-291-0/+5
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e2446eaab5585555a38ea0df4e01ff313dbb4ac9 ] Binding RST packet outgoing interface to incoming interface for tcp v4 when there is no socket associate with it. when sk is not NULL, using sk->sk_bound_dev_if instead. (suggested by Eric Dumazet). This has few benefits: 1. tcp_v6_send_reset already did that. 2. This helps tcp connect with SO_BINDTODEVICE set. When connection is lost, we still able to sending out RST using same interface. 3. we are sending reply, it is most likely to be succeed if iif is used Signed-off-by: Shawn Lu <shawn.lu@ericsson.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: Don't proxy arp respond if iif == rt->dst.dev if private VLAN is disabledThomas Graf2012-02-291-1/+2
| | | | | | | | | | | | | | | | | | | [ Upstream commit 70620c46ac2b45c24b0f22002fdf5ddd1f7daf81 ] Commit 653241 (net: RFC3069, private VLAN proxy arp support) changed the behavior of arp proxy to send arp replies back out on the interface the request came in even if the private VLAN feature is disabled. Previously we checked rt->dst.dev != skb->dev for in scenarios, when proxy arp is enabled on for the netdevice and also when individual proxy neighbour entries have been added. This patch adds the check back for the pneigh_lookup() scenario. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ipv4: Fix wrong order of ip_rt_get_source() and update iph->daddr.Li Wei2012-02-291-1/+1
| | | | | | | | | | | | | | [ Upstream commit 5dc7883f2a7c25f8df40d7479687153558cd531b ] This patch fix a bug which introduced by commit ac8a4810 (ipv4: Save nexthop address of LSRR/SSRR option to IPCB.).In that patch, we saved the nexthop of SRR in ip_option->nexthop and update iph->daddr until we get to ip_forward_options(), but we need to update it before ip_rt_get_source(), otherwise we may get a wrong src. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ipv4: Save nexthop address of LSRR/SSRR option to IPCB.Li Wei2012-02-292-3/+4
| | | | | | | | | | | | | | | | [ Upstream commit ac8a48106be49c422575ddc7531b776f8eb49610 ] We can not update iph->daddr in ip_options_rcv_srr(), It is too early. When some exception ocurred later (eg. in ip_forward() when goto sr_failed) we need the ip header be identical to the original one as ICMP need it. Add a field 'nexthop' in struct ip_options to save nexthop of LSRR or SSRR option. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ipv4: fix for ip_options_rcv_srr() daddr update.Li Wei2012-02-291-0/+1
| | | | | | | | | | | | [ Upstream commit b12f62efb8ec0b9523bdb6c2d412c07193086de9 ] When opt->srr_is_hit is set skb_rtable(skb) has been updated for 'nexthop' and iph->daddr should always equals to skb_rtable->rt_dst holds, We need update iph->daddr either. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: fix NULL dereferences in check_peer_redir()Eric Dumazet2012-02-134-28/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d3aaeb38c40e5a6c08dd31a1b64da65c4352be36, along with dependent backports of commits: 69cce1d1404968f78b177a0314f5822d5afdbbfb 9de79c127cccecb11ae6a21ab1499e87aa222880 218fa90f072e4aeff9003d57e390857f4f35513e 580da35a31f91a594f3090b7a2c39b85cb051a12 f7e57044eeb1841847c24aa06766c8290c202583 e049f28883126c689cf95859480d9ee4ab23b7fa ] Gergely Kalman reported crashes in check_peer_redir(). It appears commit f39925dbde778 (ipv4: Cache learned redirect information in inetpeer.) added a race, leading to possible NULL ptr dereference. Since we can now change dst neighbour, we should make sure a reader can safely use a neighbour. Add RCU protection to dst neighbour, and make sure check_peer_redir() can be called safely by different cpus in parallel. As neighbours are already freed after one RCU grace period, this patch should not add typical RCU penalty (cache cold effects) Many thanks to Gergely for providing a pretty report pointing to the bug. Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: md5: using remote adress for md5 lookup in rst packetshawnlu2012-02-031-1/+1
| | | | | | | | | | | | [ Upstream commit 8a622e71f58ec9f092fc99eacae0e6cf14f6e742 ] md5 key is added in socket through remote address. remote address should be used in finding md5 key when sending out reset packet. Signed-off-by: shawnlu <shawn.lu@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: fix tcp_trim_head() to adjust segment count with skb MSSNeal Cardwell2012-02-031-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 5b35e1e6e9ca651e6b291c96d1106043c9af314a ] This commit fixes tcp_trim_head() to recalculate the number of segments in the skb with the skb's existing MSS, so trimming the head causes the skb segment count to be monotonically non-increasing - it should stay the same or go down, but not increase. Previously tcp_trim_head() used the current MSS of the connection. But if there was a decrease in MSS between original transmission and ACK (e.g. due to PMTUD), this could cause tcp_trim_head() to counter-intuitively increase the segment count when trimming bytes off the head of an skb. This violated assumptions in tcp_tso_acked() that tcp_trim_head() only decreases the packet count, so that packets_acked in tcp_tso_acked() could underflow, leading tcp_clean_rtx_queue() to pass u32 pkts_acked values as large as 0xffffffff to ca_ops->pkts_acked(). As an aside, if tcp_trim_head() had really wanted the skb to reflect the current MSS, it should have called tcp_set_skb_tso_segs() unconditionally, since a decrease in MSS would mean that a single-packet skb should now be sliced into multiple segments. Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ah: Don't return NET_XMIT_DROP on input.Nick Bowler2012-02-031-2/+0
| | | | | | | | | | | | | | | | commit 4b90a603a1b21d63cf743cc833680cb195a729f6 upstream. When the ahash driver returns -EBUSY, AH4/6 input functions return NET_XMIT_DROP, presumably copied from the output code path. But returning transmit codes on input doesn't make a lot of sense. Since NET_XMIT_DROP is a positive int, this gets interpreted as the next header type (i.e., success). As that can only end badly, remove the check. Signed-off-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ah: Read nexthdr value before overwriting it in ahash input callback.Nick Bowler2012-01-251-2/+2
| | | | | | | | | | | | | commit b7ea81a58adc123a4e980cb0eff9eb5c144b5dc7 upstream. The AH4/6 ahash input callbacks read out the nexthdr field from the AH header *after* they overwrite that header. This is obviously not going to end well. Fix it up. Signed-off-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* ah: Correctly pass error codes in ahash output callback.Nick Bowler2012-01-251-2/+0
| | | | | | | | | | | | | | | | | | commit 069294e813ed5f27f82613b027609bcda5f1b914 upstream. The AH4/6 ahash output callbacks pass nexthdr to xfrm_output_resume instead of the error code. This appears to be a copy+paste error from the input case, where nexthdr is expected. This causes the driver to continuously add AH headers to the datagram until either an allocation fails and the packet is dropped or the ahash driver hits a synchronous fallback and the resulting monstrosity is transmitted. Correct this issue by simply passing the error code unadulterated. Signed-off-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* igmp: Avoid zero delay when receiving odd mixture of IGMP queriesBen Hutchings2012-01-121-0/+2
| | | | | | | | | | | | | | | commit a8c1f65c79cbbb2f7da782d4c9d15639a9b94b27 upstream. Commit 5b7c84066733c5dfb0e4016d939757b38de189e4 ('ipv4: correct IGMP behavior on v3 query during v2-compatibility mode') added yet another case for query parsing, which can result in max_delay = 0. Substitute a value of 1, as in the usual v3 case. Reported-by: Simon McVittie <smcv@debian.org> References: http://bugs.debian.org/654876 Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: using prefetch requires including prefetch.hStephen Rothwell2012-01-061-0/+1
| | | | | | | | | | [ Upstream commit b9eda06f80b0db61a73bd87c6b0eb67d8aca55ad ] Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* ipv4: reintroduce route cache garbage collectorEric Dumazet2012-01-061-0/+106
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 9f28a2fc0bd77511f649c0a788c7bf9a5fd04edb ] Commit 2c8cec5c10b (ipv4: Cache learned PMTU information in inetpeer) removed IP route cache garbage collector a bit too soon, as this gc was responsible for expired routes cleanup, releasing their neighbour reference. As pointed out by Robert Gladewitz, recent kernels can fill and exhaust their neighbour cache. Reintroduce the garbage collection, since we'll have to wait our neighbour lookups become refcount-less to not depend on this stuff. Reported-by: Robert Gladewitz <gladewitz@gmx.de> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* ipv4: flush route cache after change accept_localWeiping Pan2012-01-061-0/+5
| | | | | | | | | | | | [ Upstream commit d01ff0a049f749e0bf10a35bb23edd012718c8c2 ] After reset ipv4_devconf->data[IPV4_DEVCONF_ACCEPT_LOCAL] to 0, we should flush route cache, or it will continue receive packets with local source address, which should be dropped. Signed-off-by: Weiping Pan <panweiping3@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* net: have ipconfig not wait if no dev is availableGerlando Falauto2012-01-061-0/+4
| | | | | | | | | | | | | | | | | | [ Upstream commit cd7816d14953c8af910af5bb92f488b0b277e29d ] previous commit 3fb72f1e6e6165c5f495e8dc11c5bbd14c73385c makes IP-Config wait for carrier on at least one network device. Before waiting (predefined value 120s), check that at least one device was successfully brought up. Otherwise (e.g. buggy bootloader which does not set the MAC address) there is no point in waiting for carrier. Cc: Micha Nelissen <micha@neli.hopto.org> Cc: Holger Brunck <holger.brunck@keymile.com> Signed-off-by: Gerlando Falauto <gerlando.falauto@keymile.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* ipip, sit: copy parms.name after register_netdeviceTed Feng2012-01-061-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 72b36015ba43a3cca5303f5534d2c3e1899eae29 upstream. Same fix as 731abb9cb2 for ipip and sit tunnel. Commit 1c5cae815d removed an explicit call to dev_alloc_name in ipip_tunnel_locate and ipip6_tunnel_locate, because register_netdevice will now create a valid name, however the tunnel keeps a copy of the name in the private parms structure. Fix this by copying the name back after register_netdevice has successfully returned. This shows up if you do a simple tunnel add, followed by a tunnel show: $ sudo ip tunnel add mode ipip remote 10.2.20.211 $ ip tunnel tunl0: ip/ip remote any local any ttl inherit nopmtudisc tunl%d: ip/ip remote 10.2.20.211 local any ttl inherit $ sudo ip tunnel add mode sit remote 10.2.20.212 $ ip tunnel sit0: ipv6/ip remote any local any ttl 64 nopmtudisc 6rd-prefix 2002::/16 sit%d: ioctl 89f8 failed: No such device sit%d: ipv6/ip remote 10.2.20.212 local any ttl inherit Signed-off-by: Ted Feng <artisdom@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* tcp: properly update lost_cnt_hint during shiftingYan, Zheng2011-11-111-3/+1
| | | | | | | | | | | | | | | | | [ Upstream commit 1e5289e121372a3494402b1b131b41bfe1cf9b7f ] lost_skb_hint is used by tcp_mark_head_lost() to mark the first unhandled skb. lost_cnt_hint is the number of packets or sacked packets before the lost_skb_hint; When shifting a skb that is before the lost_skb_hint, if tcp_is_fack() is ture, the skb has already been counted in the lost_cnt_hint; if tcp_is_fack() is false, tcp_sacktag_one() will increase the lost_cnt_hint. So tcp_shifted_skb() does not need to adjust the lost_cnt_hint by itself. When shifting a skb that is equal to lost_skb_hint, the shifted packets will not be counted by tcp_mark_head_lost(). So tcp_shifted_skb() should adjust the lost_cnt_hint even tcp_is_fack(tp) is true. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* tcp: properly handle md5sig_pool referencesYan, Zheng2011-11-111-4/+7
| | | | | | | | | | | | | | [ Upstream commit 260fcbeb1ae9e768a44c9925338fbacb0d7e5ba9 ] tcp_v4_clear_md5_list() assumes that multiple tcp md5sig peers only hold one reference to md5sig_pool. but tcp_v4_md5_do_add() increases use count of md5sig_pool for each peer. This patch makes tcp_v4_md5_do_add() only increases use count for the first tcp md5sig peer. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* ipv4: fix ipsec forward performance regressionYan, Zheng2011-11-111-7/+7
| | | | | | | | | | | | | [ Upstream commit b73233960a59ee66e09d642f13d0592b13651e94 ] There is bug in commit 5e2b61f(ipv4: Remove flowi from struct rtable). It makes xfrm4_fill_dst() modify wrong data structure. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Reported-by: Kim Phillips <kim.phillips@freescale.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* tcp: initialize variable ecn_ok in syncookies pathMike Waychison2011-10-031-1/+1
| | | | | | | | | | | | | | | [ Upstream commit f0e3d0689da401f7d1981c2777a714ba295ea5ff ] Using a gcc 4.4.3, warnings are emitted for a possibly uninitialized use of ecn_ok. This can happen if cookie_check_timestamp() returns due to not having seen a timestamp. Defaulting to ecn off seems like a reasonable thing to do in this case, so initialized ecn_ok to false. Signed-off-by: Mike Waychison <mikew@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* tcp: fix validation of D-SACKZheng Yan2011-10-031-1/+1
| | | | | | | | | | | | [ Upstream commit f779b2d60ab95c17f1e025778ed0df3ec2f05d75 ] D-SACK is allowed to reside below snd_una. But the corresponding check in tcp_is_sackblock_valid() is the exact opposite. It looks like a typo. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* netfilter: TCP and raw fix for ip_route_me_harderJulian Anastasov2011-10-031-10/+8
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 797fd3913abf2f7036003ab8d3d019cbea41affd ] TCP in some cases uses different global (raw) socket to send RST and ACK. The transparent flag is not set there. Currently, it is a problem for rerouting after the previous change. Fix it by simplifying the checks in ip_route_me_harder and use FLOWI_FLAG_ANYSRC even for sockets. It looks safe because the initial routing allowed this source address to be used and now we just have to make sure the packet is rerouted. As a side effect this also allows rerouting for normal raw sockets that use spoofed source addresses which was not possible even before we eliminated the ip_route_input call. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* mcast: Fix source address selection for multicast listener reportYan, Zheng2011-10-031-1/+1
| | | | | | | | | | | [ Upstream commit e05c4ad3ed874ee4f5e2c969e55d318ec654332c ] Should check use count of include mode filter instead of total number of include mode filters. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* ipv4: some rt_iif -> rt_route_iif conversionsJulian Anastasov2011-10-031-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 97a804102021431fa6fa33c21c85df762b0f5cb9 ] As rt_iif represents input device even for packets coming from loopback with output route, it is not an unique key specific to input routes. Now rt_route_iif has such role, it was fl.iif in 2.6.38, so better to change the checks at some places to save CPU cycles and to restore 2.6.38 semantics. compare_keys: - input routes: only rt_route_iif matters, rt_iif is same - output routes: only rt_oif matters, rt_iif is not used for matching in __ip_route_output_key - now we are back to 2.6.38 state ip_route_input_common: - matching rt_route_iif implies input route - compared to 2.6.38 we eliminated one rth->fl.oif check because it was not needed even for 2.6.38 compare_hash_inputs: Only the change here is not an optimization, it has effect only for output routes. I assume I'm restoring the original intention to ignore oif, it was using fl.iif - now we are back to 2.6.38 state Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* gre: fix improper error handlingxeb@mail.ru2011-08-151-15/+6
| | | | | | | | | | | | [ Upstream commit 559fafb94ad9e4cd8774f39241917c57396f9fc5 ] Fix improper protocol err_handler, current implementation is fully unapplicable and may cause kernel crash due to double kfree_skb. Signed-off-by: Dmitry Kozlov <xeb@mail.ru> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* ipv4: use RT_TOS after some rt_tos conversionsJulian Anastasov2011-08-152-2/+2
| | | | | | | | | | [ Upstream commit b0fe4a31849063fcac0bdc93716ca92615e93f57 ] rt_tos was changed to iph->tos but it must be filtered by RT_TOS Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* IPv4: Send gratuitous ARP for secondary IP addresses alsoZoltan Kiss2011-08-151-8/+8
| | | | | | | | | | | | [ Upstream commit b76d0789c92a816a5539dc14232a700b8d62a53a ] If a device event generates gratuitous ARP messages, only primary address is used for sending. This patch iterates through the whole list. Tested with 2 IP addresses configuration on bonding interface. Signed-off-by: Zoltan Kiss <schaman@sch.bme.hu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* net: adjust array indexJulia Lawall2011-08-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a1889c0d2039a53ae04abb9f20c62500bd312bf3 ] Convert array index from the loop bound to the loop index. A simplified version of the semantic patch that fixes this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression e1,e2,ar; @@ for(e1 = 0; e1 < e2; e1++) { <... ar[ - e2 + e1 ] ...> } // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* ipv4: fix the reusing of routing cache entriesJulian Anastasov2011-08-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d547f727df86059104af2234804fdd538e112015 ] compare_keys and ip_route_input_common rely on rt_oif for distinguishing of input and output routes with same keys values. But sometimes the input route has also same hash chain (keyed by iif != 0) with the output routes (keyed by orig_oif=0). Problem visible if running with small number of rhash_entries. Fix them to use rt_route_iif instead. By this way input route can not be returned to users that request output route. The patch fixes the ip_rt_bug errors that were reported in ip_local_out context, mostly for 255.255.255.255 destinations. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* ipv4: Constrain UFO fragment sizes to multiples of 8 bytesBill Sommerfeld2011-08-151-3/+3
| | | | | | | | | | | | | | | [ Upstream commit d9be4f7a6f5a8da3133b832eca41c3591420b1ca ] Because the ip fragment offset field counts 8-byte chunks, ip fragments other than the last must contain a multiple of 8 bytes of payload. ip_ufo_append_data wasn't respecting this constraint and, depending on the MTU and ip option sizes, could create malformed non-final fragments. Google-Bug-Id: 5009328 Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* icmp: Fix regression in nexthop resolution during replies.David S. Miller2011-08-151-6/+8
| | | | | | | | | | | | | | [ Upstream commit 415b3334a21aa67806c52d1acf4e72e14f7f402f ] icmp_route_lookup() uses the wrong flow parameters if the reverse session route lookup isn't used. So do not commit to the re-decoded flow until we actually make a final decision to use a real route saved in 'rt2'. Reported-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* net: Compute protocol sequence numbers and fragment IDs using MD5.David S. Miller2011-08-155-0/+5
| | | | | | | | | | | | | | | | | | | | | Computers have become a lot faster since we compromised on the partial MD4 hash which we use currently for performance reasons. MD5 is a much safer choice, and is inline with both RFC1948 and other ISS generators (OpenBSD, Solaris, etc.) Furthermore, only having 24-bits of the sequence number be truly unpredictable is a very serious limitation. So the periodic regeneration and 8-bit counter have been removed. We compute and use a full 32-bit sequence number. For ipv6, DCCP was found to use a 32-bit truncated initial sequence number (it needs 43-bits) and that is fixed here as well. Reported-by: Dan Kaminsky <dan@doxpara.com> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* net: refine {udp|tcp|sctp}_mem limitsEric Dumazet2011-07-072-16/+4
| | | | | | | | | | | | | | | | | Current tcp/udp/sctp global memory limits are not taking into account hugepages allocations, and allow 50% of ram to be used by buffers of a single protocol [ not counting space used by sockets / inodes ...] Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram per protocol, and a minimum of 128 pages. Heavy duty machines sysadmins probably need to tweak limits anyway. References: https://bugzilla.stlinux.com/show_bug.cgi?id=38032 Reported-by: starlight <starlight@binnacle.cx> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: bind() fix error return on wrong address familyMarcus Meissner2011-07-041-1/+3
| | | | | | | | | | | | | | | | | | | | | | Hi, Reinhard Max also pointed out that the error should EAFNOSUPPORT according to POSIX. The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN. Other protocols error values in their af bind() methods in current mainline git as far as a brief look shows: EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25, No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip Ciao, Marcus Signed-off-by: Marcus Meissner <meissner@suse.de> Cc: Reinhard Max <max@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* xfrm4: Don't call icmp_send on local errorSteffen Klassert2011-07-011-1/+6
| | | | | | | | | Calling icmp_send() on a local message size error leads to an incorrect update of the path mtu. So use ip_local_error() instead to notify the socket about the error. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Don't use ufo handling on later transformed packetsSteffen Klassert2011-07-011-1/+1
| | | | | | | | | | | We might call ip_ufo_append_data() for packets that will be IPsec transformed later. This function should be used just for real udp packets. So we check for rt->dst.header_len which is only nonzero on IPsec handling and call ip_ufo_append_data() just if rt->dst.header_len is zero. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netfilter: Fix ip_route_me_harder triggering ip_rt_bugJulian Anastasov2011-06-292-48/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid creating input routes with ip_route_me_harder. It does not work for locally generated packets. Instead, restrict sockets to provide valid saddr for output route (or unicast saddr for transparent proxy). For other traffic allow saddr to be unicast or local but if callers forget to check saddr type use 0 for the output route. The resulting handling should be: - REJECT TCP: - in INPUT we can provide addr_type = RTN_LOCAL but better allow rejecting traffic delivered with local route (no IP address => use RTN_UNSPEC to allow also RTN_UNICAST). - FORWARD: RTN_UNSPEC => allow RTN_LOCAL/RTN_UNICAST saddr, add fix to ignore RTN_BROADCAST and RTN_MULTICAST - OUTPUT: RTN_UNSPEC - NAT, mangle, ip_queue, nf_ip_reroute: RTN_UNSPEC in LOCAL_OUT - IPVS: - use RTN_LOCAL in LOCAL_OUT and FORWARD after SNAT to restrict saddr to be local Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Fix IPsec slowpath fragmentation problemSteffen Klassert2011-06-271-5/+5
| | | | | | | | | | | | | ip_append_data() builds packets based on the mtu from dst_mtu(rt->dst.path). On IPsec the effective mtu is lower because we need to add the protocol headers and trailers later when we do the IPsec transformations. So after the IPsec transformations the packet might be too big, which leads to a slowpath fragmentation then. This patch fixes this by building the packets based on the lower IPsec mtu from dst_mtu(&rt->dst) and adapts the exthdr handling to this. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>