BGP
Building Reliable Networks with the Border Gateway Protocol
By Iljitsch van Beijnum
September 2002, ISBN 0-596-00254-8, 304 pages

Chapter 6
Traffic Engineering

Being multihomed means you have two (or more) routes to any destination connected to the Internet. This means you need a way to decide which route is better. When left to its own devices, a BGP router will try to send traffic over the route with the shortest AS path. Depending on the connectivity of your upstream ISPs and traffic patterns, this will suit the available bandwidth of the respective connections to varying degrees. Even though bandwidth is getting cheaper all the time, it's usually advantageous to try to balance the traffic so that it takes advantage of all the available bandwidth in a multihomed setup. Thus, if BGP decides that most of the outgoing traffic should go through the smallest pipe, you will have to tell it that this isn't what you want by tweaking one or more BGP attributes. Ideally, more traffic will then flow over the under-used connection. At the same time, you'll want the traffic to take the best route to a destination, if possible, whatever "best" may be. This type of activity is called traffic engineering.
Engineering outgoing traffic is the easy part, because you have control over what your own routers do. It's harder to get incoming traffic balanced properly over the available connections. At the end of the chapter, there is a discussion of queuing, traffic shaping, and traffic policing techniques that can be used to maximize network performance under low-bandwidth conditions.
The examples in this chapter all assume a network with Autonomous System number 60055 multihomed to two ISPs: ISP A (AS 40077) and ISP B (AS 50066). The way ISP A and ISP B interconnect with other ASes differs from example to example, however. See Figure 6-1.
Figure 6-1. Network used for examples in this chapter
Unlike in Chapter 5, the connections to both ISP A and ISP B terminate at the same router, so the settings for both ISPs can be shown side by side.
TIP: The examples in this chapter show only the commands necessary to perform the function being discussed. You also need to configure filters and other features discussed in Chapter 5 to arrive at a working configuration.
Knowing Which Route Is Best
Traditionally, there have been three figures that describe the quality of a connection: bandwidth, delay, and packet loss. A connection with high bandwidth, low delay, and low packet loss is obviously better than one with low bandwidth, high delay, and high packet loss. But which is better: a 45-Mbps satellite connection with a 300-millisecond delay,[1] or a 1544-Kbps terrestrial connection with a 3-ms delay? There is of course no easy answer: it depends. A protocol such as Telnet uses very little bandwidth, but when the user presses a key, he has to wait for the packet containing the input character to travel over the network and for the packet containing the response to travel back again before it shows up on his screen. So any delay of more than a few dozen milliseconds is immediately noticeable. With FTP, the time individual packets take to traverse the line is of little importance. What counts is the total time the file transfer takes, which depends mostly on the bandwidth of the connection. Both low-bandwidth, delay-sensitive applications such as Telnet and bandwidth-hungry, delay-tolerant ones such as FTP suffer from packet loss, because both typically use TCP. TCP uses complex algorithms to optimize performance (especially data throughput) for every possible combination of delay, bandwidth, and packet loss. TCP assumes packet loss indicates congestion in the network, so it slows down when packets are lost. (The TCP congestion management algorithms are discussed at the end of this chapter.)
Finding High- and Low-Quality Routes
There are no hard and fast rules about which quality makes a certain route better than another: application requirements are different, and the interaction between the bandwidth, delay, and packet loss parameters complicates matters further. But that doesn't mean there is nothing you can do. It may be hard to determine which of two good routes is best, but it isn't hard to determine that a route is bad. Some experimentation with the traceroute program can bring to the surface interesting information on which you can base your route selection policy.
traceroute
Doing some traceroutes to destinations that are reachable over a certain path will often reveal some or all of the following information, which you may want to use to select the preferred route to certain destinations:
- Congestion
- Congestion lowers the available bandwidth and increases delay and packet loss, so it's important to avoid routes over congested paths. Congestion shows up in a traceroute in two ways: there is usually packet loss, and the round-trip times are often inconsistent: most of the time high, but sometimes low.
- Distance
- Two routes can have different paths. Some networks interconnect only at relatively few locations, so they may have to transport traffic over long distances to get it to its destination. Others have better interconnection, so the traffic doesn't have to take a detour. There may be reasons not to prefer the more direct route, such as lower bandwidth or congestion, but generally a shorter geographic path is better. Every 100 kilometers or 60 miles of fiber adds about a millisecond to the round-trip time, because light travels at a speed of approximately 207,000 km or 129,000 miles per second in fiber. A trans-Atlantic or transcontinental detour can easily add up to more than 100-ms extra delay.
- The distance of a path isn't directly visible in traceroute (apart from the higher delay), but many networks are helpful enough to give their router descriptive names, so it's possible to deduce the geographic path to some extent. Also, this information may be available in BGP itself in the form of communities that indicate where a route was learned.
- Hops
- In general, the number of hops that shows up on a traceroute isn't too important. But each hop potentially adds additional delay, because packets have to wait in a queue before they are transmitted, and the extra equipment in a path means that a failure somewhere along the way is more likely. So all else being equal, paths with fewer hops are slightly better. On the other hand, paths with few hops probably use some kind of layer 2 switching, such as frame relay or ATM, which adds another layer of complexity to the network.
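The rule of thumb in the Distance item can be checked with a little arithmetic. The sketch below assumes the propagation speed quoted above (207,000 km per second) and ignores queuing and equipment delay; the function name and the 11,000-km detour are illustrative assumptions, not measurements.

```python
# Propagation speed of light in fiber, as quoted in the text (km per second).
FIBER_SPEED_KM_PER_S = 207_000


def round_trip_ms(one_way_km: float) -> float:
    """Round-trip propagation delay in milliseconds for a fiber path."""
    return 2 * one_way_km / FIBER_SPEED_KM_PER_S * 1000


if __name__ == "__main__":
    # 100 km of fiber: close to one millisecond of round-trip time.
    print(f"100 km path: {round_trip_ms(100):.2f} ms round trip")
    # Hypothetical detour that adds 11,000 km of fiber in each direction.
    print(f"11,000 km detour: {round_trip_ms(11_000):.0f} ms extra round trip")
```

Propagation delay is only a lower bound; the delay seen in a real traceroute also includes queuing and processing time at each hop.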
It may be necessary to temporarily reroute outgoing traffic to observe the properties of alternative paths. See the section "Setting the Local Preference" for more details on how to do this. Example 6-1 is the output of a traceroute to a somewhat congested destination. (Parts of the domain names and IP addresses have been removed for brevity.)
Example 6-1: traceroute showing some congestion
    traceroute to g.root-servers.net (192.112.36.4), 30 hops max, 40 byte packets
     1 208.100 (208.100) 0.602 ms 0.511 ms 0.498 ms
     2 63.1 (63.1) 0.306 ms 0.272 ms 0.415 ms
     3 pos3-2.gw2.dca8 (157.58) 0.982 ms 0.957 ms 0.967 ms
     4 0.so-3-0.XL2.DCA8 (46.94) 1.116 ms 1.104 ms 1.095 ms
     5 0.so-7-0.XL2.DCA6 (46.25) 2.456 ms 2.672 ms 2.444 ms
     6 POS7-0.BR4.DCA6 (52.233) 2.408 ms 2.356 ms 2.369 ms
     7 204.98 (204.98) 3.296 ms 3.395 ms 3.314 ms
     8 wdc-core-01.inet (205.37) 3.344 ms 3.320 ms 3.293 ms
     9 wdc-edge-05.inet (205.82) 3.465 ms 3.354 ms 3.295 ms
    10 63.222 (63.222) 5.929 ms 226.962 ms 11.260 ms
    11 198.50 (198.50) 52.694 ms 79.870 ms 33.990 ms
    12 G.ROOT-SERVERS.NET (192.4) 23.892 ms * 19.495 ms

Beginning at hop 10, there is a sudden increase in delay, and the delay becomes inconsistent from one probe packet to another. If this were the result of distance, the increase in delay would be across the board; light always travels at the same speed. Also, both the source and destination of the trace are in Virginia in this example. A lot of interconnection takes place there, so a detour is unlikely. The delay stabilizes at the last hop, so the earlier delay figures are probably due to high CPU load on the two hops just before the last, where the ICMP processing took a relatively long time. But the 20-ms delay with a 4-ms (20%) difference between the two delay figures on the last line indicates either a rather slow connection or high queuing delays. There is no packet loss, however, other than the asterisk in the middle of the last line. A missing answer to the second probe on a line usually indicates that the responding host or router limits the number of responses to probe packets per unit of time. So the level of congestion seen here isn't high.
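The reasoning above (consistent delay suggests distance, inconsistent delay suggests congestion or a busy router) can be roughed out in a few lines. This is only a sketch: the round-trip times are copied from hops 9 through 12 of Example 6-1, and the 2-ms threshold is an arbitrary assumption rather than an established cut-off.

```python
from typing import Dict, List, Optional

# Round-trip times in ms for the last hops of Example 6-1.
# None stands for a probe that got no answer (the asterisk).
HOPS: Dict[int, List[Optional[float]]] = {
    9: [3.465, 3.354, 3.295],
    10: [5.929, 226.962, 11.260],
    11: [52.694, 79.870, 33.990],
    12: [23.892, None, 19.495],
}


def spread_ms(rtts: List[Optional[float]]) -> float:
    """Difference between the slowest and fastest answered probe."""
    answered = [r for r in rtts if r is not None]
    return max(answered) - min(answered) if answered else 0.0


def looks_inconsistent(rtts: List[Optional[float]], threshold_ms: float = 2.0) -> bool:
    """Flag a hop whose probes differ by more than threshold_ms (assumed cut-off)."""
    return spread_ms(rtts) > threshold_ms


if __name__ == "__main__":
    for hop, rtts in HOPS.items():
        lost = sum(r is None for r in rtts)
        label = "inconsistent" if looks_inconsistent(rtts) else "steady"
        print(f"hop {hop:2}: spread {spread_ms(rtts):8.3f} ms, unanswered {lost}, {label}")
```

A flagged intermediate hop is not proof of congestion, since a router may simply be slow to generate ICMP replies; the figures for the final hop say more about the path as a whole.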
Is the Highest-Bandwidth Route Best?
As you may have noticed, bandwidth isn't on my list of route-selection criteria. Obviously, bandwidth can be a decisive factor. For instance, if you connect to the vBNS or another high bandwidth network, you'll want to take advantage of this connection. You should then probably prefer all routes over the high bandwidth link. But in most cases, you will connect to ISPs that have many times the bandwidth you require, even if one ISP has even more than another. And if one of your connections is faster than another, this doesn't mean you'll want to use the faster connection for all traffic. For instance, balancing traffic so three quarters of it flows over a 4.5-Mbps fractional T3 connection and a quarter flows over a T1 results in an effective bandwidth of 6 Mbps. Using the fractional T3 connection for all traffic means no individual stream or session is limited to the maximum bandwidth of the slower T1 line, but it limits the total available bandwidth to that of the 4.5-Mbps line.
If you know in advance the IP addresses that high bandwidth applications will connect to, you can prefer routes to those address ranges over the higher bandwidth connection. If these addresses are not known, or the list is too long, you'll have to look at other factors when configuring route selection parameters on the router.
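The 6-Mbps figure above follows from the traffic split: each link fills up when the total load reaches its capacity divided by its share of the traffic, and the first link to fill sets the limit. The sketch below works through that arithmetic; the function name is illustrative, and the T1 is rounded to 1.5 Mbps as the text's arithmetic implies.

```python
from typing import Sequence


def max_total_mbps(capacities: Sequence[float], shares: Sequence[float]) -> float:
    """Highest total load (Mbps) the links can carry when traffic is split
    according to shares (fractions that sum to 1). Links with a zero share
    carry nothing and are skipped."""
    return min(cap / share for cap, share in zip(capacities, shares) if share > 0)


FRACTIONAL_T3 = 4.5  # Mbps, from the text
T1 = 1.5             # Mbps, rounded (assumption; a T1 is 1.544 Mbps)

if __name__ == "__main__":
    # Three quarters over the fractional T3, one quarter over the T1.
    print(max_total_mbps([FRACTIONAL_T3, T1], [0.75, 0.25]))
    # Everything over the fractional T3.
    print(max_total_mbps([FRACTIONAL_T3, T1], [1.0, 0.0]))
```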
Route Maps
Cisco IOS provides the network administrator with route maps to modify the Local Preference, AS path, and MED prior to inclusion of a route in the BGP table and the subsequent best route selection process, or before a route is propagated to a neighbor. A route map is much like an if-then construction in a programming language. First, a match line is applied. If the route matches, the set lines that follow are applied. The route is then passed to the BGP table or to the neighbor, depending on whether the route map is set for incoming or outgoing route updates. Route maps are grouped together using a tag or name, and each route map has a sequence number. The route map with the lowest sequence number is evaluated first. When all route maps that share the same tag have been evaluated but there is no match, or when there is an explicit deny, the route is rejected and not entered into the BGP table or announced to the neighbor. Here are the match criteria most relevant to BGP processing:
- as-path
- Using an AS path access list
- community
- Using a community list
- ip address
- Using an access list or prefix list
- ip next-hop
- Using an access list or prefix list
- metric
- The Multi Exit Discriminator
Actions that can be taken using the set part of a route map include:
- as-path
- Prepending extra AS numbers to the AS path
- comm-list
- Deleting communities
- community
- Adding or replacing communities
- dampening
- Setting the flap-dampening parameters
- ip next-hop
- Setting the next hop address
- local-preference
- Setting the Local Preference
- metric
- Setting or changing the MED metric
- weight
- Setting the weight value
More information on route maps is available in the Cisco documentation, available over the Web at http://www.cisco.com.
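The if-then behavior described in this section can be modeled in a few lines of Python. This is a simplified sketch of the evaluation order only (lowest sequence number first, first match wins, rejection when nothing matches), not a description of how IOS is implemented; the `Entry` class, the dictionary-based route, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Route = Dict[str, object]  # e.g. {"as_path": [50066, 30088], "local_pref": 100}


@dataclass
class Entry:
    seq: int                                          # sequence number
    permit: bool                                      # permit or deny
    match: Optional[Callable[[Route], bool]] = None   # None matches every route
    sets: Dict[str, object] = field(default_factory=dict)


def apply_route_map(entries: List[Entry], route: Route) -> Optional[Route]:
    """Return the (possibly modified) route, or None if it is rejected."""
    for entry in sorted(entries, key=lambda e: e.seq):
        if entry.match is None or entry.match(route):
            if not entry.permit:
                return None           # explicit deny
            return {**route, **entry.sets}
    return None                       # nothing matched: route is rejected


if __name__ == "__main__":
    # Hypothetical map: lower the preference of paths through AS 30088.
    example_map = [
        Entry(20, True, sets={"local_pref": 110}),
        Entry(10, True, match=lambda r: 30088 in r["as_path"],
              sets={"local_pref": 80}),
    ]
    print(apply_route_map(example_map, {"as_path": [50066, 30088, 20099]}))
    print(apply_route_map(example_map, {"as_path": [50066, 1070]}))
```

Note that entry 10 is evaluated before entry 20 even though it is listed second: only the sequence number determines the order.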
Setting the Local Preference
The easiest and most effective way to influence the BGP path selection process is to adjust the Local Preference. This works well when certain routes are always better than others, for instance:
- Routes over the main connection are preferred over routes that use a slower backup connection.
- Routes from peering connections are preferred over routes from transit connections.
- Routes directly to customers are preferred over external routes.
Example 6-2 shows part of a BGP configuration where the routes received from both peers receive different Local Preference values.
Example 6-2: Setting the Local Preference for all routes received from a BGP neighbor
    !
    router bgp 60055
     neighbor 192.0.254.17 remote-as 40077
     neighbor 192.0.254.17 route-map ispa-in in
     neighbor 219.2.19.1 remote-as 50066
     neighbor 219.2.19.1 route-map ispb-in in
    !
    route-map ispa-in permit 10
     set local-preference 90
    !
    route-map ispb-in permit 10
     set local-preference 110
    !

The permit keyword in the route-map statement means matched routes will be permitted to enter the BGP table or be propagated to the neighbor; a deny route map will filter out all routes matching the match clause. The number 10 is the sequence number, used to apply the different route maps with the same tag in the right sequence. In this case, there is only one route map for each tag (ispa-in and ispb-in), so the sequence number doesn't do anything.

Since we want to match all routes, there is no need to supply a match clause for the route maps. Both route maps just use a set clause to set the Local Preference for every route that is received from the respective neighbor. This has the effect that if ISP B has a route to a destination, this route will always be preferred over the route ISP A has to the same destination. Routes from ISP A will be used only if there is no matching route over ISP B. This would be a good routing policy if traffic over ISP B is a lot cheaper than traffic over ISP A. Example 6-3 shows the BGP table after applying the route maps.
Example 6-3: Partial BGP table with different Local Preferences
    BR1#show ip bgp
    BGP table version is 619734, local router ID is 192.0.254.18
    Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
    Origin codes: i - IGP, e - EGP, ? - incomplete

       Network          Next Hop          Metric LocPrf Path
    *  7.7.0.0/19       192.0.254.17                 90 40077 397 i
    *>                  219.2.19.1                  110 50066 5703 397 i
    *  8.8.0.0/16       192.0.254.17                 90 40077 30021 i
    *>                  219.2.19.1                  110 50066 30021 i
    *  9.9.0.0/20       192.0.254.17                 90 40077 5930 1070 i
    *>                  219.2.19.1                  110 50066 1070 i

Under normal circumstances, the router would choose the route over AS 40077 for network 7.7.0.0/19 because the path over AS 50066 is longer. The higher Local Preference has precedence over the AS path length, however, so the route over ISP B is selected, as indicated with a > character. For 8.8.0.0/16, the AS path is the same length, so under other circumstances the decision would have come down to the tie-breaking rules. And for 9.9.0.0/20, the higher Local Preference doesn't really make a difference, because the route over AS 50066 has a shorter path anyway.

This policy works well as long as ISP B provides good connectivity to all destinations on the Net. But maybe ISP B peers with AS 30088 over a heavily congested connection, as shown in Figure 6-2.
Figure 6-2. The congestion between ISP B and AS 30088
In this case, routes that traverse AS 30088 should be avoided. This is accomplished in Example 6-4.
Example 6-4: Setting the Local Preference depending on AS path
    !
    ip as-path access-list 4 permit _30088_
    ip as-path access-list 4 deny .*
    !
    route-map ispa-in permit 10
     set local-preference 90
    !
    route-map ispb-in permit 10
     match as-path 4
     set local-preference 80
    !
    route-map ispb-in permit 20
     set local-preference 110
    !

The route map ispb-in permit 10 uses a match clause pointing to AS path access list 4 to find all routes with AS number 30088 in their AS path. The underscore characters before and after the AS number match a space or the beginning or end of the path. For a five-digit AS number, this doesn't make a difference, but the regular expression "3008" matches not only paths containing AS number 3008, but also paths containing AS numbers such as 13008, 30080, 30081 and so on.

The routes received from ISP B that match AS path access list 4 are assigned a Local Preference value of 80. Routes not matching AS path access list 4 will be evaluated by the ispb-in permit 20 route map. There, they are always matched and assigned a Local Preference of 110.

Example 6-5 shows the result of applying these route maps for a route to a customer of AS 30088. Under normal circumstances, the second route would have been chosen because the path is shorter, but the modified Local Preference values make sure the first route is used.
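The effect of the underscores in the AS path regular expression is easy to try out away from the router. The sketch below approximates the IOS underscore with a Python pattern that accepts a delimiter or either end of the path; this translation and the helper name are assumptions made for illustration, and IOS regular expressions differ from Python's in other details.

```python
import re

# Rough Python stand-in for the IOS "_" metacharacter: a delimiter (space,
# comma, brace, parenthesis) or the beginning or end of the path.
UNDERSCORE = r"(?:^|$|[ ,{}()])"


def as_path_matches(ios_regex: str, as_path: str) -> bool:
    """Search as_path (AS numbers separated by spaces) with an IOS-style regex."""
    return re.search(ios_regex.replace("_", UNDERSCORE), as_path) is not None


if __name__ == "__main__":
    # With underscores, only the complete AS number matches.
    print(as_path_matches("_30088_", "50066 30088 20099"))  # AS 30088 in the middle
    print(as_path_matches("_3008_", "50066 30088 20099"))   # 3008 is only a substring
    # Without underscores, any AS number containing the digits matches.
    print(as_path_matches("3008", "13008 20099"))
```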
Example 6-5: The result of Local Preference manipulation
    BR1#show ip bgp 221.169.0.0
    BGP routing table entry for 221.169.0.0/20, version 239188
    Paths: (2 available, best #1)
      Not advertised to any peer
      40077 1800 30088 20099
        192.0.254.17 from 192.0.254.17 (192.0.254.17)
          Origin IGP, metric 20, localpref 90, valid, external, best, ref 2
      50066 30088 20099
        219.2.19.1 from 219.2.19.1 (219.2.13.237)
          Origin IGP, localpref 80, valid, external, ref 2

Manipulating Inbound AS Paths
Bypassing the AS path length comparison and the possible subsequent steps by setting the Local Preference isn't always the most appropriate way to influence the route-selection process. For instance, a route with twelve ASes in the path will be preferred over one with a single AS in the path if the Local Preference is higher, but it's hard to imagine a situation in which a path that's so much longer is still preferable. An alternative is to manipulate the way the router evaluates the AS path or to manipulate the AS path itself. Bay (now Nortel) routers allow a weight to be set for each AS, and the total weight of the path is calculated for each route. Cisco and most other vendors lack such an elegant and powerful mechanism, but they usually allow some sort of direct manipulation of the path. The usual way to do this is by prepending your own AS number to the beginning of the path one or more times. The path is then announced to external BGP peers in its modified state, which may not be desirable, so this technique is mostly suited for multihomed end-user networks and not for ISPs. Example 6-6 shows route maps to modify the AS path rather than the Local Preference as was done in Example 6-4. The ispb-in permit 10 route map prepends the path for paths that match AS path access list 4 because they contain AS 30088. Then the second ispb-in route map matches all remaining routes (without the need for either a match or a set clause), so they are included in the BGP table without modifications.

Example 6-6: Prepending the AS path

    !
    ip as-path access-list 4 permit _30088_
    ip as-path access-list 4 deny .*
    !
    route-map ispa-in permit 10
     set as-path prepend 60055
    !
    route-map ispb-in permit 10
     match as-path 4
     set as-path prepend 60055 60055 60055
    !
    route-map ispb-in permit 20
    !

As a result of these AS path manipulations, more traffic will flow over ISP B, since the path over ISP A is now longer. For some destinations, however, the longer path over ISP A may still be shorter, or the paths over A and B may be the same length, so that BGP has to employ the tie-breaking rules to select the best route. Example 6-7 shows the result for a route over AS 30088. Originally, the route over ISP B was shorter. But this route had its path prepended with the local AS number three times and the route over ISP A just once, so the route over ISP A is preferred.
Example 6-7: The result of AS path manipulation
    BR1#show ip bgp 221.169.0.0
    BGP routing table entry for 221.169.0.0/20, version 247873
    Paths: (2 available, best #1)
      Not advertised to any peer
      60055 40077 1800 30088 20099
        192.0.254.17 from 192.0.254.17 (192.0.253.83)
          Origin IGP, metric 20, localpref 100, valid, external, best, ref 2
      60055 60055 60055 50066 30088 20099
        219.2.19.1 from 219.2.19.1 (219.2.13.237)
          Origin IGP, localpref 100, valid, external, ref 2

Note that the metric (MED) for the route over ISP A is 20, while the route over ISP B doesn't have a metric. Default IOS behavior is to treat a route without a MED metric as having a MED with the value 0. This may be changed to the opposite behavior (which conforms to IETF recommendations) using the bgp bestpath med missing-as-worst command in recent IOS versions. A missing MED then equals the highest (worst) possible value, as the command suggests. To me, the IETF behavior makes slightly more sense, but if you want to use MEDs, it's a good idea to make sure the routes actually have a MED set and do not depend on default behavior.
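The interplay between Local Preference, AS path length, and MED in the preceding examples can be sketched as a comparison function. This is a deliberately reduced model that assumes only these three steps exist and that MEDs are compared between routes from different ASes; the real selection process has more steps (weight, origin, tie-breakers, and so on), and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_MED = 2**32 - 1  # highest (worst) possible MED value


@dataclass
class Path:
    next_hop: str
    as_path: List[int]
    local_pref: int = 100        # value assumed when none is set
    med: Optional[int] = None    # None means the route carries no MED


def best_path(paths: List[Path], missing_as_worst: bool = False) -> Path:
    """Pick a path using three steps only: highest Local Preference,
    then shortest AS path, then lowest MED."""
    def med_value(p: Path) -> int:
        if p.med is None:
            return MAX_MED if missing_as_worst else 0
        return p.med

    return min(paths, key=lambda p: (-p.local_pref, len(p.as_path), med_value(p)))


if __name__ == "__main__":
    # Example 6-5: Local Preference decides, despite the longer path.
    ex5 = [Path("192.0.254.17", [40077, 1800, 30088, 20099], 90, 20),
           Path("219.2.19.1", [50066, 30088, 20099], 80)]
    print(best_path(ex5).next_hop)
    # Example 6-7: equal Local Preference, so the prepended path length decides.
    ex7 = [Path("192.0.254.17", [60055, 40077, 1800, 30088, 20099], med=20),
           Path("219.2.19.1", [60055, 60055, 60055, 50066, 30088, 20099])]
    print(best_path(ex7).next_hop)
```

With two hypothetical paths of equal length, one with MED 20 and one without a MED, the `missing_as_worst` flag flips the outcome, which is why setting MEDs explicitly is the safer choice.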
Inbound Communities
Depending on your upstream ISP, incoming routes may be "colored" with several communities. This can work both ways: later in this chapter, we'll see how setting communities for the routes you send to an ISP can trigger actions inside the ISP's network. Many ISPs use communities to convey information about the origin of routes. This information can include whether the route was received from a customer, a peer or an upstream ISP, or the location where the route was learned. The next example is based on the following:
- The AS 60055 network is located in Chicago.
- ISP A (AS 40077) is a national network connecting to MAE East but not to the Chicago NAP, and it doesn't use communities.
- ISP B (AS 50066) is a regional ISP that connects to the Ameritech (Chicago) NAP and to MAE East in Virginia.
- Routes ISP B learns at the Chicago NAP have the community 50066:3001.
- Routes ISP B learns at MAE East have the community 50066:3002.
- ISP B's connection to the Chicago NAP is excellent, but their connection to MAE East is somewhat congested.
This situation is depicted in Figure 6-3. The width of the lines connecting both ISPs to the interconnect locations indicates the available bandwidth.
Figure 6-3. Example national and regional ISP connectivity
Routes over the Chicago NAP through ISP B are most likely a lot better than routes to the same destinations over ISP A because of ISP A's lack of local or regional interconnection with other networks. It makes sense to assign a higher Local Preference to ISP B's Chicago NAP routes. If the paths for routes to destinations behind MAE East are the same length, the path over ISP A should be preferred, because ISP A's connection to MAE East isn't congested. On the other hand, if ISP A's route to such a destination is much longer, it's probably better to suffer some congestion over ISP B than to take the scenic route over ISP A. This can be accomplished by assigning a default MED metric of 10 to all routes (overwriting the existing MED, if there was one), except routes from ISP B over MAE East; those get a metric of 20. Example 6-8 implements this routing policy.
Example 6-8: Using communities to help select the best route
    !
    router bgp 60055
     bgp always-compare-med
    !
    ip bgp-community new-format
    ip community-list 1 permit 50066:3001
    ip community-list 1 deny
    ip community-list 2 permit 50066:3002
    ip community-list 2 deny
    !
    route-map ispa-in permit 10
     set metric 10
    !
    route-map ispb-in permit 10
     match community 1
     set metric 10
     set local-preference 120
    !
    route-map ispb-in permit 20
     match community 2
     set metric 20
    !
    route-map ispb-in permit 30
     set metric 10
    !

The bgp always-compare-med command makes the router take the MED metric into account when comparing routes even when the two routes to a destination aren't received from the same AS. The ip bgp-community new-format command makes the router show all community-related information in the AS:nn format. Without it, communities are shown as single, very large numbers. Example 6-9 shows part of the BGP table after the BGP sessions have been reset and the new route maps have been applied.

Example 6-9: Partial listing of the BGP table

    BR1#show ip bgp
    BGP table version is 620121, local router ID is 192.0.254.18
    Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
    Origin codes: i - IGP, e - EGP, ? - incomplete

       Network          Next Hop          Metric LocPrf Path
    *  7.7.0.0/19       192.0.254.17          10        40077 397 i
    *>                  219.2.19.1            10    120 50066 5703 397 i
    *> 8.8.0.0/16       192.0.254.17          10        40077 30021 i
    *                   219.2.19.1            20        50066 30021 i
    *  9.9.0.0/20       192.0.254.17          10        40077 5930 1070 i
    *>                  219.2.19.1            20        50066 1070 i

The first network in this example, 7.7.0.0/19, has a shorter path over ISP A (AS 40077), but it has the community 50066:3001 (not visible in this example) because ISP B (AS 50066) learned the route in Chicago, and the route map ispb-in has changed the Local Preference to 120. The route over ISP A has an empty Local Preference value, which is treated as a value of 100. Thus the route over ISP B is preferred.

ISPs A and B both peer with AS 30021 (network 8.8.0.0/16) at MAE East, so the route from ISP B contains the community 50066:3002, and the MED is changed to 20. The Local Preference and AS path length are the same for both ISP A and ISP B, so the MED is the deciding factor, and the router selects the route over ISP A.

The situation for network 9.9.0.0/20 is similar to that of network 8.8.0.0/16: ISP B also learns this route at MAE East. But ISP A doesn't directly peer with AS 1070, which explains the longer path. So the route over ISP B is selected because its path is shorter.

RPSL Routing Policy
Example 6-10 shows the Routing Policy Specification Language (RPSL) version of the routing policy for the configuration listed in Example 6-8 for inclusion in a Routing Registry.
Example 6-10: RPSL routing policy with communities
    aut-num: AS60055
    import:  from AS40077
             action pref = 2; med = 10;
             accept ANY
    import:  from AS50066
             action pref = 1; med = 10;
             accept community(50066:3001);
             action pref = 2; med = 20;
             accept community(50066:3002);
             action pref = 2; med = 10;
             accept ANY;
    export:  to AS40077 announce AS60055
    export:  to AS50066 announce AS60055
    default: to AS40077
    default: to AS50066

The import: clauses are executed from top to bottom, so if a route has both communities 50066:3001 and 50066:3002 set, it matches the first clause and receives a pref of 1 and a med of 10. Note that the pref keyword in RPSL isn't the same as the Local Preference: a lower pref is more preferred, while for Local Preference, a higher value is more preferred. In this policy, Local Preference 100 is translated into pref 2, and Local Preference 120 becomes pref 1.

BGP Load Balancing
When a single router has two connections to the same AS, it's possible to load-balance outgoing traffic over those connections by instructing the router to insert more than one route with the same NLRI into the routing table. Depending on the switching mode the router uses, half the packets will flow over one connection and the other half over the other (per packet load balancing), or half the destination IP addresses will be routed over one connection and the other half over the other (per destination load balancing). Load balancing is enabled by setting maximum-paths to a value higher than one (the maximum is six):
    !
    router bgp 60055
     maximum-paths 4
    !

With this setting in effect, up to four routes are entered into the routing table, as long as the routes are received from routers in the same AS, and the AS paths and MED metrics are identical. The maximum-paths keyword applies to all BGP peers: it isn't possible to enable load balancing for some peers and not for others. However, it's simple to prevent load balancing by giving each incoming route a different MED.
Load balancing can work in both directions only if there are multiple connections between one router at one end and one router at the other end. This means that both connections are unavailable if the router on either side fails, creating two single points of failure, unless there are other connections (terminating at other routers) in addition to the ones eligible for load balancing. There is no requirement that load balancing be enabled on both ends. For instance, if both connections terminate at different routers at your ISP, it isn't possible to load-balance your incoming traffic, but as long as both connections terminate on one router at your end, you can still configure load balancing for outgoing traffic.
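The eligibility rule for load balancing described in this section (identical AS path, which implies the same neighboring AS, and identical MED) can be expressed as a simple filter. This sketch models only that rule; the next hop addresses are hypothetical, and the names are illustrative rather than anything IOS exposes.

```python
from typing import List, NamedTuple, Optional, Tuple


class Route(NamedTuple):
    next_hop: str
    as_path: Tuple[int, ...]
    med: Optional[int]


def multipath_set(best: Route, others: List[Route], maximum_paths: int) -> List[Route]:
    """Routes installed for one prefix: the best route plus other routes
    with an identical AS path and MED, up to maximum_paths in total."""
    eligible = [best] + [
        r for r in others
        if r.as_path == best.as_path and r.med == best.med
    ]
    return eligible[:maximum_paths]


if __name__ == "__main__":
    # Two hypothetical links to ISP A carrying the same route.
    link1 = Route("192.0.254.17", (40077, 397), 10)
    link2 = Route("192.0.254.21", (40077, 397), 10)
    print(len(multipath_set(link1, [link2], maximum_paths=4)))
    # A different MED on the second link prevents load balancing.
    print(len(multipath_set(link1, [link2._replace(med=20)], maximum_paths=4)))
```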
Traffic Engineering for Incoming Traffic
Because the local router determines the route taken by outgoing packets, it isn't difficult to balance outbound traffic over multiple connections. The situation for inbound traffic is different. There are only a few routes you can influence to shift incoming traffic patterns: one for each address block for each ISP you connect to, instead of tens of thousands for outgoing traffic. In the typical multihoming case, with one address block and two ISPs, this leaves you with just two routes that can be manipulated to change inbound traffic distribution. This manipulation can take the form of:
- Setting the MED
- Prepending the AS path
- Setting outbound communities
You can also decide to "cheat" and break up a single address block that would normally be announced as a single route into several smaller blocks, so you can announce each separately, with different properties, for more fine-grained control.
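Splitting a block into smaller announcements is plain prefix arithmetic, which Python's standard ipaddress module can illustrate. The address block below is a placeholder taken from a range reserved for benchmarking, not a real assignment, and the helper name is hypothetical.

```python
import ipaddress
from typing import List


def split_block(cidr: str, pieces: int = 2) -> List[ipaddress.IPv4Network]:
    """Split an IPv4 block into equal subnets; pieces must be a power of two (2 or more)."""
    if pieces < 2 or pieces & (pieces - 1):
        raise ValueError("pieces must be a power of two, at least 2")
    network = ipaddress.ip_network(cidr)
    extra_bits = pieces.bit_length() - 1   # 2 -> 1 bit, 4 -> 2 bits, ...
    return list(network.subnets(prefixlen_diff=extra_bits))


if __name__ == "__main__":
    # Placeholder /19; each half could then be announced with different properties.
    for subnet in split_block("198.18.0.0/19", 2):
        print(subnet)
```

Each resulting prefix can then be given its own MED, prepending, or communities in an outbound route map, alongside or instead of the covering block.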
Setting the MED
The MED metric is intended to be used only between two neighboring ASes. It isn't communicated to ASes beyond the neighboring AS. For this reason, the use of the MED in balancing incoming traffic is limited to the situation where there is more than one connection between two ASes: setting a higher MED for one route will make the traffic flow over the other. This is useful when one of the connections is of a much higher bandwidth, and the second one is a lower-bandwidth backup. Because you don't know whether the bgp bestpath med missing-as-worst command is in effect on the router terminating your connections at the other end, always set MEDs for the routes over both connections, as is shown in Example 6-11.
Example 6-11: Setting outbound MED values
    !
    router bgp 60055
     neighbor 192.0.254.17 remote-as 40077
     neighbor 192.0.254.17 route-map ispa-out out
     neighbor 219.2.19.1 remote-as 50066
     neighbor 219.2.19.1 route-map ispb-out out
    !
    route-map ispa-out permit 10
     set metric 10
    !
    route-map ispb-out permit 10
     set metric 20
    !

We are now trying to influence incoming traffic, so we have to manipulate outgoing routing updates and apply the route maps to the neighbors using the neighbor ... route-map ... out command.
TIP: The MED metric you see in the BGP table is never announced to eBGP neighbors. If you want a neighbor to receive a MED, you have to configure an outbound route map to set the MED for this neighbor.
Prepending Outbound AS Paths
When you bring up your second BGP session, you soon get to see how much traffic your routes attract over both ISPs. In many cases, the traffic will be distributed fairly equally over both connections, or one connection receives more traffic but there is enough spare capacity (for inbound traffic) so this isn't a problem. But maybe one connection attracts more traffic than it can handle, or you have one big pipe and a smaller one, and the traffic volumes are equal (or at least they try to be). Under these circumstances, you'll want to shift part of the incoming traffic load from one connection to the other. The most powerful option for changing incoming traffic patterns is making the AS path longer. This is effective, because the path is preserved between ASes, and BGP implementations use the path length early in the route selection algorithm. The biggest problem with making the AS path longer by prepending your own AS number to the path one or more extra times is that it may be too effective. Example 6-12 shows a configuration that prepends the path for the routes announced to an upstream ISP.
Example 6-12: Prepending the path for outbound routes
!router bgp 60055neighbor 219.2.19.1 remote-as 50066neighbor 219.2.19.1 route-map ispb-out out!route-map ispb-out permit 10set as-path prepend 60055!The way the route map works should be familiar by now. Rather than applying the route map for incoming routes, the
ispb-out permit 10route map is used for outbound route updates. The number 10 is superfluous here, because there is only a single route map, but the router adds it to the configuration if you don't type it in yourself. The set clause adds 60055 to all routes.Make sure all routes with prepended paths are accepted by your ISP and upstream networks. It isn't uncommon for the AS path filters that ISPs use to filter routes from customers not to allow path prepending. There usually isn't a good reason for this; it's just that allowing path prepending makes for more complex filters. If, after configuring path prepending, you use one or more looking glasses to see if they now receive your route in its prepended state, you may see only the unprepended route over your other ISP. It isn't always clear whether this means the route wasn't accepted, or routers further upstream just prefer the unprepended path over your other ISP because of the shorter AS path. The only way to make sure is to temporarily disable the BGP session to the nonprepended ISP:
!
router bgp 60055
 neighbor 192.0.254.17 shutdown
!
If the prepended route doesn't show up on remote looking glasses, or remote destinations on the Net become unreachable after shutting down your unprepended ISP, there must be a filter somewhere. Don't forget to let route propagation settle for a minute or two before drawing conclusions. You can determine where filtering takes place by tracerouting to an unreachable destination. The ASes that show up in the traceroute don't filter the prepended AS path. The filter must be between the last AS that shows up in the traceroute and the first one that doesn't. If your ISP is the one filtering out prepended routes, you can ask them to change their filters, but if the problem is further upstream, there is probably not a lot you can do. Don't forget to reenable the BGP session to your other ISP:
!
router bgp 60055
 no neighbor 192.0.254.17 shutdown
!
TIP: The filter that prohibits routes with prepended paths from finding their way may be located inside your own router. It's best always to allow prepending your own AS, even if you don't plan on prepending in the near future:
ip as-path access-list 2 permit ^(60055_)*$
This regular expression matches all AS paths consisting of the beginning of the line (^), zero or more (()*) times the AS number followed by a space or other delimiter (_), and then the end of the line ($). In other words: an empty AS path, or an AS path that contains only the AS number 60055, one or more times.
The Effect of AS Path Prepending
Suppose you multihome to two ISPs that are similar: they interconnect at mostly the same NAPs and Internet Exchanges, and they peer with mostly the same networks. Under these circumstances, other networks see two similar paths for the routes you announce. Figure 6-4 shows an example of this.
Figure 6-4. Multihoming to similar ISPs
If AS E (the example multihomed network, AS 60055) wants to receive more traffic over ISP A and thus makes the AS path longer over ISP B, the majority of traffic will flow over ISP A, which now has the shorter path. In a situation with two similar ISPs, AS path prepending gives AS E only three choices:
- The default traffic distribution, which may or may not be balanced
- Longer path over ISP A: majority of traffic comes in over ISP B
- Longer path over ISP B: majority of traffic comes in over ISP A
Table 6-1 shows which route is preferred in the situation shown in Figure 6-4 without path prepending, with prepending the path to ISP A, and with prepending the path to ISP B.
Table 6-1: Prepended paths over similar ISPs
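| | AS X | AS Y | AS Z | Traffic distribution |
|---|---|---|---|---|
| Prepend to A | AEE / **BE** | AEE / **BE** | AEE / **BE** | ISP A: 15%, ISP B: 85% |
| No prepending | AE / **BE** | **AE** / BE | AE / **BE** | ISP A: 40%, ISP B: 60% |
| Prepend to B | **AE** / BEE | **AE** / BEE | **AE** / BEE | ISP A: 90%, ISP B: 10% |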
For the purposes of calculating the traffic distribution, it's assumed that A always handles 15% of the traffic, B always 10%, and ASes X, Y, and Z are all the source of 25% of incoming traffic. ASes with "even" letters (X, Z) prefer to send traffic over ISP B when the paths are of equal length; "odd" ASes (Y) prefer ISP A in this example. The preferred path is listed in bold type in the table.
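The arithmetic behind Table 6-1 can be checked with a small sketch. It assumes only the traffic shares and tie-breaking preferences stated above; the function and variable names are illustrative and not part of any router software.

```python
# Share of incoming traffic per source, in percent, as stated in the text.
SOURCES = {"A": 15, "B": 10, "X": 25, "Y": 25, "Z": 25}
# Which ISP each remote AS prefers when both paths are equally long.
TIE_BREAK = {"X": "B", "Y": "A", "Z": "B"}


def distribution(prepend_a=0, prepend_b=0):
    """Return (percent via ISP A, percent via ISP B) for the given prepend counts."""
    # Path as seen by X, Y, and Z: the ISP, AS E, plus any prepended copies of E.
    len_a = 2 + prepend_a
    len_b = 2 + prepend_b
    # Traffic originating inside each ISP always arrives over that ISP.
    via = {"A": SOURCES["A"], "B": SOURCES["B"]}
    for remote_as in ("X", "Y", "Z"):
        if len_a < len_b:
            choice = "A"
        elif len_b < len_a:
            choice = "B"
        else:
            choice = TIE_BREAK[remote_as]
        via[choice] += SOURCES[remote_as]
    return via["A"], via["B"]


for label, counts in [("Prepend to A", (1, 0)),
                      ("No prepending", (0, 0)),
                      ("Prepend to B", (0, 1))]:
    a, b = distribution(*counts)
    print(f"{label}: ISP A {a}%, ISP B {b}%")
```

Because all three remote ASes see paths of the same length, a single prepend moves all of them at once, which is why there is nothing between the 40/60 and 90/10 splits.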
When the two ISPs are not as similar, increasing the length of the AS path has a more gradual effect, because the paths over ISPs A and B aren't the same for all networks. Figure 6-5 shows multihoming to dissimilar ISPs.
Figure 6-5. Multihoming to dissimilar ISPs
In this example, ISP B is a much smaller ISP that doesn't peer with networks X and Y, but rather buys transit service from AS C to reach those networks. Networks V and W don't peer directly with ISP A, so even if the path over ISP B becomes a lot longer, they'll still prefer to send traffic to AS E over ISP B. Because this is a peering link, sending the traffic over this route is cheaper. Network Z, on the other hand, will immediately route traffic over ISP A when the path over ISP B is prepended, because the connections to both A and B are peering links. Table 6-2 shows the possible traffic distribution using prepending.
Table 6-2: Prepended paths over dissimilar ISPs
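| | AS C | AS V | AS W | AS X | AS Y | AS Z | Traffic distribution |
|---|---|---|---|---|---|---|---|
| 2 to ISP A | BE | BE | BE / CBE | AEEE / CBE | AEEE / CBE | AEEE / BE | A: 15%, B: 85% |
| 1 to ISP A | BE | BE | BE / CBE | AEE / CBE | AEE / CBE | AEE / BE | A: 35%, B: 65% |
| No prepending | BE | BE | BE / CBE | AE / CBE | AE / CBE | AE / BE | A: 55%, B: 45% |
| 1 to ISP B | BEE | BEE | BEE / CBEE | AE / CBEE | AE / CBEE | AE / BEE | A: 75%, B: 25% |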
The traffic distribution in this example is 15% from ISP A, 5% from ISP B and ASes V and W, 10% from AS C, and 20% from ASes X, Y, and Z.
TIP: All else being equal, it's a good idea to select dissimilar ISPs, for instance, one tier-1 ISP that peers with all the other large networks, and one tier-2 ISP that peers with many small networks. This way, you have a wide range of traffic engineering options.
Setting Outbound Communities
In many cases you'll want to prepend the path for certain upstream networks or peers of a transit ISP and not for others. For instance, if two of your ISPs have a transit network in common, you might want to have one ISP announce a prepended path to this transit network without changing the path that other transit networks and peers see over that ISP. To avoid spending a lot of time implementing this type of policy upon customer request, many ISPs provide their customers (and sometimes their peers) with a list of communities that trigger actions such as path prepending and setting the Local Preference. This can then be done for each route individually.
Well-known communities
Communities were introduced in BGP by RFC 1997. This RFC also defines the three well-known communities listed in Table 6-3.
Table 6-3: Well-known communities
| Well-known community | Action |
|---|---|
| no-export (0xFFFFFF01) | Don't advertise this route to eBGP peers. |
| no-advertise (0xFFFFFF02) | Don't advertise this route to any peers, iBGP, or eBGP. |
| no-export-subconfed (0xFFFFFF03) | Advertise this route to iBGP peers with the same AS number, but not to other confederation members. |
These communities can be useful under certain circumstances, for example, if an ISP wants to set the no-export community on routes it sends to a customer to make sure the customer doesn't accidentally announce the ISP's routes to another ISP. If the customer provides transit services to customers of his own, however, they won't receive the route unless the original customer of the ISP removes the no-export community.
WARNING: Setting the no-export, no-advertise, or no-export-subconfed communities can have the (possibly unwanted) side effect that no routes are announced, even if there are other routes that would otherwise be eligible for announcement. For instance, if you set the no-advertise community on routes announced to ISP B, other customers of ISP B won't see these routes because they aren't advertised. This is as intended. But routes with the same NLRI that ISP B has learned from ISP A will not be advertised either, because ISP B considers the directly received routes with the no-advertise community best, and only the best route is eligible for further announcement over BGP.
Common community actions
ISPs accepting communities provide their customers with a list of communities they use and what action is taken for each community. It's possible to set several communities for a single route, but the results may be unpredictable if your ISP doesn't expect this. Many networks list the communities they accept in their AUT-NUM object in the Routing Registry they use. Table 6-4 shows a fairly typical list of communities an ISP might accept.
Table 6-4: An example of communities an ISP accepts
| Community | Action |
|---|---|
| 50066:70 | Set Local Preference to 70. |
| 50066:90 | Set Local Preference to 90. |
| 50066:110 | Set Local Preference to 110. |
| 50066:5010 | Announce this route for transit to ISP C. |
| 50066:5020 | Don't announce this route to transit ISP or peer C. |
| 50066:5040 | Prepend AS path to C once. |
| 50066:5041 | Prepend AS path to C twice. |
| 50066:5042 | Prepend AS path to C three times. |
| 50066:10040 | Prepend AS path once at interconnect point I. |
| 50066:10041 | Prepend AS path twice at interconnect point I. |
| 50066:10042 | Prepend AS path three times at interconnect point I. |
Some ISPs require you to set a community indicating a route should be announced for transit, 50066:5010 in this example. This isn't common: most networks do the opposite and allow you to set a community indicating the route shouldn't be announced to transit networks. Be sure to notice the subtle difference between not announcing for transit and not announcing to transit networks: in the first case, the potential transit network still receives the announcement, but the route is treated as a peering route and not announced to peers and upstream networks of the transit network. In the latter case, the transit network doesn't get to hear the route at all.
Influencing the Local Preference in Upstream ASes
The MED is specifically intended to inform an upstream ISP that one connection should be preferred over another, but today this is often done with communities. Using communities instead of the MED may have benefits internal to the ISP network. For example, the ISP is then free to use the MED for another purpose, as we did for outbound traffic engineering in the beginning of this chapter. And, unlike the MED, using a community to set the Local Preference inside an ISP network also makes it possible to use a link to an ISP as a backup for a link to another ISP. When the Local Preference is set sufficiently low for the intended backup connection, the ISP it connects to will completely ignore the route and always send traffic over the other ISP as long as there is a route present over this ISP. This can't be accomplished with the MED; the AS path length overrides it, the MED isn't communicated from AS to AS, and by default, the MED is looked at only when two connections terminate at the same AS.
The impact of changing the Local Preference depends on the Local Preference values an ISP uses for routes learned from transit, peers, and customers. In this example, that would be 80 for transit, 100 for peers, and 120 for customers. If you have a fast main connection to this ISP along with a slower backup connection, you'll probably want to set community 50066:110 for routes announced over the backup connection. This makes sure all traffic flows over the main connection and the backup connection is used only when the main connection is unavailable. It's also possible to do this when you connect to two different ISPs: to ISP A with a main connection, and to ISP B with a backup connection. Then you'll want to set community 50066:90 or even 50066:70 so ISP B sends all traffic for you over a peering or transit connection to ISP A.
Example 6-13 shows a configuration for the router terminating the backup connection to ISP B. The main, high-bandwidth connection terminates at another router.
Example 6-13: Setting a community to indicate a backup route
!
router bgp 60055
 neighbor 219.2.19.1 remote-as 50066
 neighbor 219.2.19.1 route-map ispb-backup-out out
 neighbor 219.2.19.1 send-community
!
route-map ispb-backup-out permit 10
 set community 50066:5010 50066:70
!
The community 50066:5010 makes this route eligible for announcement as a transit route over AS C, but the community 50066:70 makes sure this route has the lowest possible Local Preference in ISP B's network. Thus, ISP B won't use it as long as there is any other route with the same NLRI (prefix), even if this means routing the traffic over ISP A.
TIP: By default, Cisco routers accept incoming communities but don't transmit them over iBGP or eBGP. The send-community command enables sending communities to a neighbor.
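Why a low Local Preference turns a connection into a pure backup can be seen in a heavily simplified sketch of ISP B's route selection. It models only three steps (Local Preference, AS path length, MED); real BGP implementations have more steps and, by default, compare the MED only between routes from the same neighboring AS. The Route class and its labels are hypothetical, and the Local Preference values are the example values used in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    learned_from: str   # hypothetical label for the neighbor
    local_pref: int
    as_path: tuple      # AS numbers, nearest AS first
    med: int = 0


def best_route(routes):
    """Highest Local Preference wins, then shortest AS path, then lowest MED."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path), r.med))


# ISP B's view of a prefix originated by AS 60055.
via_isp_a = Route("peer ISP A", local_pref=100, as_path=(40077, 60055))
direct_default = Route("customer AS 60055", local_pref=120, as_path=(60055,))
direct_backup = Route("customer AS 60055, 50066:70 set", local_pref=70,
                      as_path=(60055,))

print(best_route([via_isp_a, direct_default]).learned_from)
print(best_route([via_isp_a, direct_backup]).learned_from)
```

With the default customer Local Preference, the direct route wins. With 50066:70 set, the route over ISP A wins even though its AS path is longer, because the Local Preference is compared before the path length.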
Prepending the AS Path
Some smaller ISPs have path-prepending communities for each peer, but even medium-sized ISPs peer with many networks, so this soon gets out of hand. More often, an ISP has communities to prepend the path to each of its transit ISPs individually, as well as communities to prepend the path for an entire interconnect point. Our example ISP B (AS 50066) accepts path prepending communities for transit ISP C and interconnect point I.
AS W in Figure 6-5 (shown earlier in this chapter) connects both directly to ISP B and also over transit AS C and then ISP B. Supposing the peering link between AS W and ISP B is congested, we'll want incoming traffic from AS W to flow over AS C. This is accomplished by prepending the path ISP B announces to AS W twice, as is done in the configuration in Example 6-14.
Example 6-14: Setting a community to prepend the path
!
router bgp 60055
 neighbor 219.2.19.1 remote-as 50066
 neighbor 219.2.19.1 route-map ispb-out out
 neighbor 219.2.19.1 send-community
!
route-map ispb-out permit 10
 set community 50066:5010 50066:10041
!
Unfortunately, it's not possible to do something similar for outbound traffic: this will still flow over the congested connection between ISP B and AS W. This isn't the case if it's routed over ISP A and not over ISP B, of course, as is done in Example 6-4 earlier in this chapter. Also, setting community 50066:10041 means the path is prepended twice towards all peers at this interconnect point. This may be undesirable, for instance if AS Z connects to ISP B over the same interconnect point as AS W. AS Z now sees the path C B B E over ISP B and the much shorter path A E over ISP A, so traffic from AS Z will now come in over the connection to ISP A.
Announcing More Specific Routes
When prepending the path and setting communities for outbound routes aren't enough to balance incoming traffic, there is a last resort: announcing more specific routes. This will inflate the global routing table, so announcing more specific routes should be done only when absolutely necessary. Because a more specific route always takes precedence over a less specific route, this technique always works, as long as the more specifics are accepted by your ISP and a reasonable number of their upstream (transit or peer) networks.
TIP: Announcing more specifics is also useful when someone else announces your address block (by mistake, or by your request but no longer needed) and you don't want to wait for them to fix this.
Consider the situation outlined back in Figure 6-4. If the routers for ISP A consistently use a lower router ID (which defaults to the highest IP address in the box) than those of ISP B, it's possible that nearly all traffic comes in over ISP A. The AS paths are all the same length, and the tie-breaking rules favor the route from the neighbor with the lowest router ID. Prepending the path won't help: all traffic then comes in over ISP B. If neither A nor B allows selective prepending using communities, balancing the traffic is possible only by announcing more specific routes. For instance, if your address block is 220.37.0.0/20 (16 Class C's: 220.37.0 through 220.37.15), you could announce 220.37.0.0/21 to ISP A and 220.37.8.0/21 to ISP B. This way, all traffic to the Class C nets 220.37.0 through 220.37.7 comes in over ISP A, and all traffic to Class C nets 220.37.8 through 220.37.15 over ISP B. Example 6-15 shows a configuration that accomplishes this.
Example 6-15: Announcing more specific routes
!
router bgp 60055
 network 220.37.0.0 mask 255.255.240.0
 network 220.37.0.0 mask 255.255.248.0
 network 220.37.8.0 mask 255.255.248.0
 neighbor 192.0.254.17 remote-as 40077
 neighbor 192.0.254.17 description BGP session to ISP A
 neighbor 192.0.254.17 prefix-list ispa-ms out
 neighbor 219.2.19.1 remote-as 50066
 neighbor 219.2.19.1 description BGP session to ISP B
 neighbor 219.2.19.1 prefix-list ispb-ms out
!
ip route 220.37.0.0 255.255.240.0 Null0
ip route 220.37.0.0 255.255.248.0 Null0
ip route 220.37.8.0 255.255.248.0 Null0
!
ip prefix-list ispa-ms description outbound filter for ISP A
ip prefix-list ispa-ms seq 5 permit 220.37.0.0/20
ip prefix-list ispa-ms seq 10 permit 220.37.0.0/21
ip prefix-list ispa-ms seq 15 deny 220.37.8.0/21
ip prefix-list ispb-ms description outbound filter for ISP B
ip prefix-list ispb-ms seq 5 permit 220.37.0.0/20
ip prefix-list ispb-ms seq 10 deny 220.37.0.0/21
ip prefix-list ispb-ms seq 15 permit 220.37.8.0/21
!
To announce the two more specific routes in addition to the original /20 route (as a fallback in case the more specifics are filtered), each route must be listed in the BGP configuration with a network statement, and there must be matching local (pull up) routes, as provided by the ip route ... Null0 statements. The prefix lists limit the routes announced to ISP A to 220.37.0.0/20 and 220.37.0.0/21, and those announced to ISP B to 220.37.0.0/20 and 220.37.8.0/21. Having two routes with the same address part isn't a problem: the NLRI consists of both the address and the prefix parts of a route. Two routes are considered different if either differs.
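The split of the /20 and the longest-match behavior that makes this technique work can be illustrated with Python's standard ipaddress module. This is only a sketch of the principle: the announced table and the inbound_isps helper are hypothetical and assume that a remote network has received every announced prefix unfiltered.

```python
import ipaddress

block = ipaddress.ip_network("220.37.0.0/20")
# Split the /20 into its two /21 halves.
lower, upper = block.subnets(prefixlen_diff=1)
print(lower, upper)

# Prefixes announced over each ISP, mirroring the two prefix lists.
announced = {"ISP A": [block, lower], "ISP B": [block, upper]}


def inbound_isps(address):
    """Return the ISPs whose announcement is the longest match for address."""
    addr = ipaddress.ip_address(address)
    candidates = [(net.prefixlen, isp)
                  for isp, nets in announced.items()
                  for net in nets if addr in net]
    if not candidates:
        return []
    longest = max(prefixlen for prefixlen, _ in candidates)
    return sorted(isp for prefixlen, isp in candidates if prefixlen == longest)


print(inbound_isps("220.37.3.10"))
print(inbound_isps("220.37.12.1"))
```

The /20 covers both addresses over both ISPs, but each address also falls inside exactly one /21, and the longer prefix decides which ISP carries the traffic.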
Figure 6-6. Propagation of more specific routes
Figure 6-6 shows the propagation of routes, and Example 6-16 shows how these routes might show up in the BGP table of a remote AS. (Don't forget to register ROUTE objects in the Routing Registry of your choice for the more specific routes.)
Example 6-16: More specific routes as seen by a remote AS
BR1#show ip bgp
BGP table version is 933017, local router ID is 195.30.2.198
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network          Next Hop       [...] Path
*> 220.37.0.0/21    192.0.254.17   [...] 40077 60055 i
*> 220.37.0.0/20    192.0.254.17   [...] 40077 60055 i
*                   219.2.19.1     [...] 50066 30077 60055 i
*> 220.37.8.0/21    219.2.19.1     [...] 50066 123 456 60055 i
As you can see, there are two routes for the /20, but only a single route for each of the more specifics. Also, the path over ISP B (AS 50066) for the /20 is shorter than the path of the /21: apparently, AS 30077 filters out the more specific route and allows only the /20 from AS 50066. But ASes 123 and 456 don't filter, so there is still a route for the /21. And since it's the most specific route, this is the one that is actually used, as the routing table for this remote network shows in Example 6-17.
Example 6-17: More specific routes in the routing table
BR1#show ip route 220.37.0.0
Routing entry for 220.37.0.0, 3 known subnets
  Variably subnetted with 2 masks
B       220.37.0.0/20 [20/0] via 192.0.254.17, 1d12h
B       220.37.0.0/21 [20/0] via 192.0.254.17, 1d12h
B       220.37.8.0/21 [20/0] via 219.2.19.1, 1d16h
Note that the 220.37.0.0/20 route is actually in the routing table, although it will never be used for forwarding as long as both 220.37.0.0/21 and 220.37.8.0/21 are available, because those cover the exact same address range. The "B" indicates that the route was learned from BGP, and 20/0 is the administrative distance (20, the default for eBGP) and metric (0, a missing MED).
TIP: When deploying IP addresses, try to avoid putting all high-bandwidth hosts in the same or nearby subnets. Putting half the high-bandwidth hosts in the first /24 and the rest in the second is a better idea. If you ever need to announce more specifics to balance incoming traffic, it's a lot easier to announce two /21s out of a /20 or announce just a /24 separately in addition to what's normally announced, rather than having to announce very specific routes (prefixes longer than /24) or do some renumbering within your own address range in a hurry.
Queuing, Traffic Shaping, and Policing
Traffic engineering works only if you have bandwidth to spare on one of your connections. Even the most sophisticated traffic balancing techniques won't help you when there is just too much traffic. When the output queues for interfaces start filling up, interactive protocols start noticing delays, and bulk protocols start noticing lower throughput. The best way to handle this would be to get more bandwidth, but with some smart queuing techniques, it's possible to increase performance for some protocols or sessions without hurting others very much. Or just give way to "important" packets and let less important traffic suffer. There are three ways to accomplish this: special queuing strategies, traffic shaping, and rate limiting. Before choosing one, you should know how each interacts with TCP.
Nearly all applications that run over the Internet use the TCP (RFC 793) "on top of" IP. IP can only transmit packets of a limited size, and the packets may arrive corrupted by bit errors on the communications medium, in the wrong order, or not at all. Also, IP provides no way for applications to address a specific program running on the destination host. All this missing functionality is implemented in TCP. The characteristics of TCP are:
- "Stream" interface
  Any and all bytes the application writes to the stream come out in the same order at the application running on the remote host. There is no packet size limit: TCP breaks up the communication into packets as needed.
- Integrity and reliability
  TCP performs a checksum calculation over every segment (packet) and throws away the segment if the checksum fails. It keeps resending packets until the data is received (and acknowledged) successfully by the other end, or until it becomes apparent that the communications channel is unusable, and the connection times out.
- Multiplexing
  TCP implements "ports" to multiplex different communication streams between two hosts, so applications can address a specific application running on the remote host. For instance, web servers usually live on port 80. When a web browser contacts a server, it also selects a source port number so that the web page can be sent back to this port, and the page will end up with the right browser process. Well-known server ports are usually (but not always) below 1024; client source ports are semirandomly selected from a range starting at 1024 or higher.
- Congestion control
  Finally, TCP provides congestion control: it makes sure it doesn't waste resources by sending more traffic than the network can successfully carry to the remote host.
Most of what TCP does falls outside the scope of this book, so it won't be discussed here.[2] It's good to know about the congestion control mechanisms TCP employs, however, because they have a strong impact on the traffic patterns on the network.
TCP Congestion Control
Apart from the basic self-timing that happens because TCP uses a windowing system where only a limited amount of data may be in transit at any time, there are four additional congestion-related mechanisms in TCP: slow start, congestion avoidance, fast retransmit, and fast recovery. These algorithms are documented in RFC 2001.
Slow start
When a TCP connection is initiated, the other side tells the local TCP how much data it's prepared to buffer. This is the "advertised window." Setting up a connection takes three packets: an initial packet with the SYN control bit set (a "SYN packet"), a reply from the target host with both the SYN and ACK bits set, and a final packet from the initiating host back to the target acknowledging the SYN/ACK packet. This is the three-way handshake.
After the three-way handshake, the local (and remote) TCP may transmit data until the advertised window is full. Then it has to wait for an acknowledgment (ACK) for some of this data before it can continue transmitting. When the remote TCP advertises a large window, the local TCP doesn't send a full window's worth of data at once: there may be a low-bandwidth connection somewhere in the path between the two hosts, and the router that terminates this connection may be unable to buffer such a large amount of data until it can traverse the slow connection. Thus, the sending TCP uses a congestion window in addition to the advertised window. The congestion window is initialized as one maximum segment size, and it grows by one maximum segment size each time an ACK is received, so it roughly doubles every round trip. If the segment size is 1460 bytes (which corresponds to a 1500-byte Ethernet packet minus IP and TCP headers), and the receiver advertises an 8192-byte window, the sending TCP initializes the congestion window to 1460 bytes, transmits the first packet, and waits for an ACK. When the first ACK is received, the congestion window is increased to 2920 bytes, and two packets are transmitted. When the first one of these is ACKed, the congestion window becomes 4380 bytes, so three packets may now be in transit. One packet is still unacknowledged, so two new packets are transmitted. Each further ACK adds another 1460 bytes, and after the fifth ACK the congestion window (8760 bytes) exceeds the advertised window, so from then on it's the advertised window that limits the amount of unacknowledged data allowed to be underway.
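The window growth during slow start can be traced with a few lines of Python. This is a sketch that assumes the per-ACK rule from RFC 2001 (each ACK enlarges the congestion window by one maximum segment size), one ACK per segment, and no packet loss; the function name is illustrative.

```python
MSS = 1460                 # maximum segment size in bytes
ADVERTISED_WINDOW = 8192   # receiver's advertised window in bytes


def slow_start_trace(acks):
    """Return the congestion window after each of the first `acks` ACKs."""
    cwnd = MSS
    trace = []
    for _ in range(acks):
        cwnd += MSS  # each ACK adds one segment to the congestion window
        trace.append(cwnd)
    return trace


for n, cwnd in enumerate(slow_start_trace(5), start=1):
    # The sender may have at most min(cwnd, advertised window) bytes in transit.
    usable = min(cwnd, ADVERTISED_WINDOW)
    print(f"ACK {n}: cwnd={cwnd} bytes, usable window={usable} bytes")
```

Since every segment in a window is eventually ACKed, adding one segment per ACK doubles the window about once per round trip until the advertised window becomes the limit.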
Congestion avoidance
Congestion avoidance introduces another variable: the slow start threshold size (ssthresh). When a connection is initialized, the ssthresh is set to 65,535 bytes (the maximum possible advertised window). As long as no data is lost, the slow start algorithm is used until the congestion window reaches its full size. If TCP receives an out-of-order ACK, however, congestion avoidance comes into play. An out-of-order ACK is an acknowledgment for data that was already acknowledged before. This happens when a packet gets lost: the receiving TCP sends an ACK for the data up to the lost packet, indicating, "I'm still waiting for the data following what I'm ACKing now." TCP ACKs are cumulative: it isn't possible to say "I got bytes 1000-1499, but I'm missing 500-999."
Upon receiving a duplicate ACK, the sending TCP assumes the unacknowledged data has been lost because of congestion, and the ssthresh and also the congestion window are set to half of the current window size, as long as this is at least two times the maximum segment size. After this, the congestion window is allowed to grow only very slowly, to avoid immediate return of the congestion. If the sending TCP doesn't see any ACKs at all for some period of time, it assumes massive congestion and triggers slow start, in addition to lowering the ssthresh. So as long as the congestion window is smaller than or equal to the ssthresh, slow start is executed (congestion window grows by one segment per ACK, roughly doubling every round trip), and after that congestion avoidance (congestion window grows slowly).
Fast retransmit and fast recovery
When TCP receives three out-of-order ACKs in a row, it assumes that just a single packet was lost. (One or two out-of-order ACKs are likely to be the result of packet reordering on the network.) It then retransmits the packet it thinks has been lost, without waiting for the regular retransmit timer to expire. The ssthresh is set as per congestion avoidance, but the congestion window is set to the ssthresh plus three maximum segments: this is the amount of data that was successfully received by the other end, as indicated by the out-of-order ACKs. The result is that TCP slows down a bit, but not too much, because there is obviously still a reasonable amount of data coming through.
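The two reactions to loss described above can be put side by side in a short sketch. It follows the rules as stated in this section (ssthresh is half the current window but at least two segments; fast recovery adds three segments; a timeout falls back to slow start). The function names and the example window of 32 segments are illustrative assumptions.

```python
MSS = 1460  # maximum segment size in bytes


def halved_threshold(cwnd, advertised_window):
    """New ssthresh: half the current window, but at least two segments."""
    current_window = min(cwnd, advertised_window)
    return max(current_window // 2, 2 * MSS)


def on_three_duplicate_acks(cwnd, advertised_window):
    """Fast retransmit and fast recovery: return (ssthresh, new cwnd)."""
    ssthresh = halved_threshold(cwnd, advertised_window)
    return ssthresh, ssthresh + 3 * MSS


def on_timeout(cwnd, advertised_window):
    """No ACKs at all: lower ssthresh and restart slow start from one segment."""
    return halved_threshold(cwnd, advertised_window), MSS


cwnd = 32 * MSS  # hypothetical window at the moment of loss
print(on_three_duplicate_acks(cwnd, 65535))
print(on_timeout(cwnd, 65535))
```

The difference in the resulting congestion window shows why a single lost packet costs much less than a timeout.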
TCP Under Packet Loss and Delay Conditions
The result of these four mechanisms is that TCP slows down a lot when multiple packets are lost. The problem is even worse when the round-trip times are long, because the use of windows limits TCP's throughput to one window per round-trip time. This means that even with the maximum window size of just under 64 KB (without the TCP high-performance extensions enabled), TCP performance over a transcontinental circuit with a round-trip delay of 70 ms will not exceed about 900 kilobytes per second (roughly 7.5 Mbps). When a packet is lost, this speed is nearly halved, and it takes hundreds of successfully acknowledged packets to get back up to the original window size. So even sporadic packet loss can bring down the effectively used bandwidth for a single TCP session over a high-delay path. This means that packet loss can be tolerated only on low-delay connections, and only as long as those connections are not part of a high-delay path.
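The window-per-round-trip limit is easy to compute. The sketch below uses the 65,535-byte maximum window and the 70 ms round-trip time from the example; the function name is illustrative. Note the units: the result is in bits per second, and one window of just under 64 KB per 70 ms works out to a little over 900 kilobytes per second.

```python
def max_tcp_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on throughput: one full window per round-trip time."""
    return window_bytes * 8 / rtt_seconds


bps = max_tcp_throughput_bps(65535, 0.070)
print(f"{bps / 1e6:.2f} Mbps")
print(f"{bps / 8 / 1000:.0f} kilobytes per second")

# Halving the window after a lost packet halves the bound as well.
print(f"{max_tcp_throughput_bps(65535 // 2, 0.070) / 1e6:.2f} Mbps")
```

Doubling the round-trip time has the same effect as halving the window, which is why loss hurts most on long-delay paths.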
The behavior of the two main categories of non-TCP applications under packet loss conditions is different. These categories are multimedia (streaming audio and video) and applications based on small transactions that don't need a lot of overhead, such as DNS. Streaming audio and video are generally not too sensitive to packet loss, although the audio/video quality will suffer slightly. For things like DNS lookups, packet loss slows down individual transactions a lot (they time out and have to be repeated), but the performance penalty doesn't carry over to transactions that didn't lose packets themselves. Because non-TCP applications don't really react to packet loss, they often exacerbate the congestion by continuing to send more traffic than the connection can handle.
Although some lost packets are the result of bit errors on the physical medium or temporary routing inconsistencies, the typical reason packets are lost is congestion: too much traffic. If a router has a single OC-3 (155 Mbps) connection to a popular destination, and 200 Mbps of traffic comes in for this destination, something has to give. The first thing the router will do is to put packets that can't be transmitted immediately in a queue. IP traffic tends to have a lot of bursts: traffic can get high for short periods of time ranging from a fraction of a second to a few seconds. The queue helps smooth out these bursts, at the expense of some additional delay for the queued packets, but at least they're not lost. If the excessive traffic volume persists, the queue fills up. The router has no other choice than to discard any additional packets that come in when the queue is full. This is called a "tail drop." The TCP anti-congestion measures are designed to avoid exactly this situation, so in most cases, all the TCP sessions will slow down so the congestion clears up for the most part. If the congestion is bad, however, this may not be enough. If a connection is used for many short-lived TCP sessions (such as web or email traffic), the sheer number of initial packets (when TCP is still in slow start) may be enough to cause congestion. Non-TCP applications can also easily cause congestion because they lack TCP's sophisticated congestion-avoidance techniques.
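How a queue absorbs a short burst but overflows under sustained overload can be shown with a coarse simulation. This is a sketch under simplifying assumptions: traffic is counted in arbitrary but consistent units per tick (for example, megabits per second), and the queue limits of 40 and 100 units are hypothetical values chosen for illustration.

```python
def simulate_tail_drop(offered_per_tick, capacity, queue_limit):
    """Return (delivered, dropped, still_queued) for a FIFO queue with tail drop."""
    queue = delivered = dropped = 0
    for offered in offered_per_tick:
        backlog = queue + offered                  # waiting traffic plus new arrivals
        sent = min(backlog, capacity)              # the link transmits what it can
        backlog -= sent
        overflow = max(0, backlog - queue_limit)   # tail drop: no room in the queue
        queue = backlog - overflow
        delivered += sent
        dropped += overflow
    return delivered, dropped, queue


# Sustained overload: 200 units offered per tick to a 155-unit link.
print(simulate_tail_drop([200] * 10, capacity=155, queue_limit=40))
# A short burst that the queue can smooth out without loss.
print(simulate_tail_drop([100, 250, 100, 100], capacity=155, queue_limit=100))
```

In the burst case everything is eventually delivered at the cost of some queuing delay; under sustained overload the excess is dropped no matter how large the queue is, since a bigger queue only postpones the first drop.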
Queuing
Queuing happens only when the interface is busy. As long as the interface is idle, packets will be transmitted without special treatment. Regular queues invariably employ the first in, first out (FIFO) principle: the packet that has been waiting the longest is transmitted first. When the queue is full, and additional packets come in, tail drops happen. More sophisticated queuing mechanisms usually employ several queues. Packets are classified by user-configurable means and then placed in the appropriate queue. Then, when the interface is ready to transmit, a queue from which the next packet will be transmitted is selected as per the queuing algorithm. Cisco routers support several queuing strategies: FIFO, WFQ, RED, priority, and custom queuing. Note that special queuing mechanisms have effect only when it's not possible immediately to transmit a packet over the output interface. If the interface is idle and there are no queued packets, the new packet is transmitted immediately.
First in, first out
FIFO queuing is the most basic queuing strategy: packets are transmitted in the same order they come in. This is the default for fast interfaces. FIFO queuing is enabled by removing all other queuing mechanisms:
!
interface Serial0
 no fair-queue
!
Weighted fair queuing
WFQ tries to allocate bandwidth fairly to different conversations (typically TCP sessions) so high-bandwidth sessions don't get to monopolize the connection. WFQ is the default for lower-bandwidth interfaces. It can be enabled with:
!
interface Serial0
 fair-queue
!
Random early detect
RED starts to drop packets as the output queue fills up, in order to trigger congestion-avoidance in TCP. The sessions with the most traffic are most likely to experience a dropped packet, so those are the ones that slow down the most. Weighted random early detect (WRED) takes the priority value in the IP header into account and starts dropping low-priority packets earlier than their higher-priority counterparts. Unlike WFQ, priority, and custom queuing, RED doesn't need much processing time and can be used on high-speed interfaces. It needs a transmit queue bigger than the default 40-packet queue to be able to start dropping packets early and avoid tail drops.
!
interface Ethernet0
 random-detect
 hold-queue 200 out
!
TIP: In RFC 2309, the IETF recommends using RED for Internet routers.
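To make the early-drop idea concrete, here is a minimal sketch of the drop decision, assuming the classic linear RED profile: no drops below a minimum threshold, a drop probability that rises linearly up to a maximum threshold, and every packet dropped beyond that. The function name, the thresholds of 20 and 40 packets, and the maximum probability of 0.1 are illustrative assumptions, not router defaults. Real implementations also work from a moving average of the queue depth, which this sketch simply takes as input.

```python
def red_drop_probability(avg_queue, min_th=20, max_th=40, max_p=0.1):
    """Drop probability for a given average queue depth, in packets.

    Thresholds and max_p are illustrative values, not router defaults.
    """
    if avg_queue < min_th:
        return 0.0                  # short queue: never drop
    if avg_queue >= max_th:
        return 1.0                  # queue too long: drop everything
    # In between, the probability rises linearly from 0 to max_p.
    return max_p * (avg_queue - min_th) / (max_th - min_th)


probabilities = [red_drop_probability(q) for q in (10, 20, 30, 40)]
```

Because every arriving packet faces the same drop probability, a session that sends more packets is proportionally more likely to lose one, which is why the busiest sessions slow down the most. The thresholds have to sit well below the hold-queue size, or tail drops will occur before early drops get a chance to work.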
Priority queuing
This queuing strategy allows traffic to be classified as high, medium, normal, or low priority. If there is any high-priority traffic, it's transmitted first, then medium-priority traffic, and so on. This can slow down lower-priority traffic a lot, or even block it completely if there is enough higher-priority traffic to fill the entire bandwidth capacity. Example 6-18 enables priority queuing and assigns a medium (higher than normal) priority to DNS traffic and a low priority to FTP.
Example 6-18: Enabling priority queuing
!
interface Serial0
 priority-group 1
!
priority-list 1 protocol ip medium udp domain
priority-list 1 protocol ip low tcp ftp
priority-list 1 protocol ip low tcp ftp-data
!
Custom queuing
Custom queuing has a large number of queues and transmits a configurable amount of data from a queue before proceeding to the next. This queuing strategy makes it possible to guarantee a minimum amount of bandwidth for certain traffic types, while at the same time making the bandwidth that is left unused available to other traffic types. Example 6-19 assigns 75% of the bandwidth to WWW traffic, 5% to the DNS, and 20% to all other traffic.
Example 6-19: Enabling custom queuing
!
interface Serial0
 custom-queue-list 1
!
queue-list 1 protocol ip 1 tcp www
queue-list 1 protocol ip 2 udp domain
queue-list 1 default 3
queue-list 1 queue 1 byte-count 7500
queue-list 1 queue 2 byte-count 500
queue-list 1 queue 3 byte-count 2000
!
If there is more WWW traffic than can fit in 75% of the interface bandwidth, and the non-WWW/non-DNS traffic requires only 5%, the unused 15% is reallocated to WWW traffic so that no bandwidth is wasted.
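The reallocation of unused bandwidth can be worked through with a small calculation. The sketch below computes weighted max-min shares, which is an idealized view of what byte-count round robin achieves and not a packet-level simulation of the router. The function name allocate and the demand figures are assumptions for the example: WWW wants the whole link, while DNS and the remaining traffic each want 5%.

```python
def allocate(weights, demands, capacity=1.0):
    """Weighted max-min shares: an idealized view of custom queuing.

    weights -- configured byte count per queue
    demands -- fraction of the link bandwidth each traffic type wants
    """
    shares = {name: 0.0 for name in weights}
    active = set(weights)
    remaining = capacity
    while active and remaining > 1e-12:
        total_weight = sum(weights[n] for n in active)
        # Queues whose demand fits inside their weighted slice.
        satisfied = {
            n for n in active
            if demands[n] - shares[n] <= remaining * weights[n] / total_weight
        }
        if not satisfied:
            # Everyone wants more than their slice: split by weight and stop.
            for n in active:
                shares[n] += remaining * weights[n] / total_weight
            break
        for n in satisfied:
            remaining -= demands[n] - shares[n]
            shares[n] = demands[n]
        active -= satisfied
    return shares


byte_counts = {"www": 7500, "dns": 500, "other": 2000}
wanted = {"www": 1.0, "dns": 0.05, "other": 0.05}
shares = allocate(byte_counts, wanted)
```

Under these assumptions, DNS and the other traffic each receive the 5% they ask for and WWW ends up with 90%: its configured 75% plus the 15% that the default queue leaves unused.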
Traffic Shaping and Rate Limiting
With traffic shaping, all the traffic for an interface, or just that matching a certain access list, is counted. This happens regardless of whether the interface is idle or packets are queued for transmission. When the traffic reaches a user-configurable bandwidth threshold, additional packets are put in a queue and delayed, so bandwidth use is limited to the configured amount.
Rate limiting, sometimes referred to as traffic policing or committed access rate (CAR), is similar to traffic shaping, but instead of being delayed, the excess traffic is treated differently from regular traffic in a user-configurable way. A common way to handle the excess traffic is simply to drop it, but it's also possible to do other things, such as lowering the priority field in the IP header. Example 6-20 enables traffic shaping for one interface and rate limiting for another.
Example 6-20: Enabling traffic shaping and rate limiting
!
interface Serial0
 traffic-shape rate 128000 8000 8000 1000
!
interface Serial1
 rate-limit output 128000 8000 8000 conform-action transmit exceed-action drop
!
Both the traffic-shape rate and the rate-limit output commands take the bandwidth limit, in bits per second, as their next argument. The other figures are burst and buffer sizes. For most applications, allowing bursts isn't desirable (TCP performance is even a bit worse when there is room for bursts), so for traffic shaping, you can leave them out; for rate limiting, you can set them to the minimum of 8000.
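The difference between the two mechanisms is easy to see in a simplified model. The sketch below assumes a token bucket for rate limiting and plain pacing, with no burst allowance and an unlimited buffer, for traffic shaping. The function names police and shape are hypothetical, and the test traffic of 20 back-to-back 1000-byte packets is made up for the example.

```python
def police(packets, rate_bps, burst_bytes):
    """Rate limiting: return the packets that conform; the rest are dropped.

    packets -- list of (arrival_time_seconds, size_bytes), sorted by time
    """
    tokens = burst_bytes            # the bucket starts full
    last = 0.0
    sent = []
    for arrival, size in packets:
        # Refill at rate_bps / 8 bytes per second, capped at the burst size.
        tokens = min(burst_bytes, tokens + (arrival - last) * rate_bps / 8)
        last = arrival
        if tokens >= size:
            tokens -= size
            sent.append((arrival, size))
        # Otherwise the packet exceeds the rate and is dropped.
    return sent


def shape(packets, rate_bps):
    """Traffic shaping: delay packets so the output never exceeds rate_bps."""
    next_free = 0.0
    out = []
    for arrival, size in packets:
        depart = max(arrival, next_free)
        next_free = depart + size * 8 / rate_bps
        out.append((depart, size))
    return out


burst = [(0.0, 1000)] * 20          # 20 back-to-back 1000-byte packets
policed = police(burst, 128000, 8000)
shaped = shape(burst, 128000)
```

With a limit of 128000 bps and a burst size of 8000 bytes, the policer lets the first 8 packets through and drops the other 12. The shaper delivers all 20, with the last one leaving almost 1.2 seconds after it arrived.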
Traffic shaping and rate limiting are often used to limit a customer's available bandwidth when a customer buys a certain amount of bandwidth that is lower than that of the interface that connects them. This isn't a good use of rate limiting, however, because it drops a lot of packets, which makes TCP think there is congestion. So it slows down, but after a while it tries to pick up the pace again, and then there is more packet loss, and so on. Traffic shaping, on the other hand, just slows the packets down so TCP adapts to the available bandwidth. Example 6-21 shows the FTP performance over a connection that is rate-limited to 128 Kbps.
Example 6-21: FTP over a 128-Kbps rate-limited connection
ftp> put testfile
local: testfile remote: testfile
150 Opening BINARY mode data connection for 'testfile'.
100% |**********************************| 373 KB 00:00 ETA
226 Transfer complete.
382332 bytes sent in 35.61 seconds (10.48 KB/s)
The TCP performance is only 84 Kbps, about two thirds of the available bandwidth. Example 6-22 is the same transfer over the same connection, but now with traffic shaping to 128 Kbps in effect.
Example 6-22: FTP over a 128-Kbps traffic-shaped connection
ftp> put testfile
local: testfile remote: testfile
150 Opening BINARY mode data connection for 'testfile'.
100% |**********************************| 373 KB 00:00 ETA
226 Transfer complete.
382332 bytes sent in 24.73 seconds (15.10 KB/s)
The performance is now 121 Kbps, which is just a few percent under the maximum possible bandwidth, considering TCP, IP, and datalink overhead.
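The 84 and 121 Kbps figures correspond to the KB/s values reported by the FTP client multiplied by 8. The client's numbers are consistent with a kilobyte of 1024 bytes (382332 bytes is about 373 KB), so converting the raw byte counts with 1000 bits to the kilobit gives slightly higher results. The short calculation below shows this; the helper name kbps is just for illustration.

```python
def kbps(total_bytes, seconds):
    """Throughput in kilobits per second (1 kilobit = 1000 bits)."""
    return total_bytes * 8 / seconds / 1000


policed = kbps(382332, 35.61)       # rate-limited transfer, Example 6-21
shaped = kbps(382332, 24.73)        # traffic-shaped transfer, Example 6-22
policed_share = policed / 128       # fraction of the 128-Kbps limit
shaped_share = shaped / 128
```

Computed this way, the rate-limited transfer runs at roughly 86 Kbps and the traffic-shaped transfer at roughly 124 Kbps. The conclusion is the same either way: about two thirds of the configured bandwidth with rate limiting, against nearly all of it with traffic shaping.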
Apart from combating denial-of-service attacks, as discussed in Chapter 11, rate limiting has another potential use, because unlike traffic shaping and the different queuing mechanisms, it can also be applied to incoming traffic. When an ISP and a customer agree on a certain bandwidth use, the ISP can easily use traffic shaping to make sure the customer doesn't receive more incoming traffic than the agreed-upon bandwidth. But since it's impossible to traffic shape packets coming in on an interface, the customer is responsible for traffic shaping their outgoing traffic. To make sure the customer doesn't send out more traffic than agreed, the ISP can implement additional rate limiting for incoming traffic.
1. "Delay" usually means the time from the moment a bit enters a circuit on one end until the moment it appears at the other end, but it's used interchangeably with round trip time (RTT) here, as the time it takes to receive a reply for a packet sent.
2. W. Richard Stevens' book TCP/IP Illustrated, Volume 1: The Protocols (Addison-Wesley) has an excellent description of TCP internals.
© 2001, O'Reilly & Associates, Inc.