
IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 19, NO. 1, FEBRUARY 2011

Fast Simulation of Service Availability in Mesh Networks With Dynamic Path Restoration
Adrian E. Conway, Senior Member, IEEE
Abstract: A fast simulation technique based on importance sampling is developed for the analysis of path service availability in mesh networks with dynamic path restoration. The method combines the simulation of the path rerouting algorithm with a dynamic path-failure importance sampling (DPFS) scheme to estimate path availabilities efficiently. In DPFS, the failure rates of network elements are biased at increased rates until path failures are observed under rerouting. The simulated model uses failure equivalence groups, with finite/infinite sources of failure events and finite/infinite pools of repair personnel, to facilitate the modeling of bidirectional link failures, multiple in-series link cuts, optical amplifier failures along links, node failures, and more general geographically distributed failure scenarios. The analysis of a large mesh network example demonstrates the practicality of the technique.

Index Terms: Availability, biasing, failure, importance, mesh, model, network, path, restoration, risk, sampling, simulation.

I. INTRODUCTION

The concept of a mesh network architecture is being adopted increasingly in the field in the development and deployment of new networks or in the replacement, migration, or evolution of existing networks. In a generic mesh network, a set of nodes is interconnected with links following an arbitrary topology. The routes of end-to-end paths (i.e., end-to-end physical or virtual circuits) over the links can be arbitrary. The routes of backup or protection paths can also be arbitrary and even be generated dynamically. This generality is in contrast to traditional network architectures that are typically more rigid in form with fault tolerance provided using, for example, rings, extra dedicated protection links, or preestablished protection connections. Advantages of mesh networking include the enabling of more general routing schemes, more flexible traffic engineering, simplification of network operations and management functions, more cost-effective use of redundant network capacity, the enabling of more general self-configuration and self-healing mechanisms, and potentially higher levels of service availability. The mesh networking concept is also general in scope, so it can be applied physically in different physical parts of networks as well as logically in different logical layers.

Manuscript received July 17, 2009; revised January 24, 2010 and June 10, 2010; accepted June 11, 2010; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor S. Subramaniam. Date of publication July 12, 2010; date of current version February 18, 2011. This paper is published under a joint IEEE-Verizon copyright agreement. The author is with Verizon Laboratories, Waltham, MA 02451 USA (e-mail: aec@ieee.org). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNET.2010.2053382

For example, it can be applied physically in optical backbone networks [1], [2], medium haul networks, access networks [3], fixed/mobile wireless networks [4], or cognitive radio networks [5]. It can be applied logically in peer-to-peer networks, overlay networks, or dynamic virtual-circuit networks that use MPLS. As the mesh networking concept is deployed increasingly in the field, it is necessary for design, engineering, and service provisioning to be able to quantify the service availability that will be realized in such networks. Such quantification is all the more important as enterprises, organizations, systems, devices, and individuals are all becoming increasingly dependent on the continuous operation of networks for computer communications. Furthermore, network service providers must now provide many of their customers with essentially guaranteed service as specified in service level agreements (SLA) in terms of measures such as service availability. Hence, this quantification can be all the more important when there are SLA with binding legal agreements and potential penalties involved. A related problem is the quantification of risk in SLA [6].

Although network availability analysis is a well-established discipline, the analysis of mesh networks is a challenging problem in general since the inherent generality of a mesh network complicates substantially the structure of the availability modeling problem. A number of recent works have addressed the availability of paths in mesh networks. For example, in [7], Markov chain analysis and discrete-event simulation are used to analyze a mesh network with dedicated-resource or shared-resource bandwidth protection schemes. In [8]-[10], analytical models are developed to analyze WDM mesh networks with different protection schemes, including schemes with link sharing. In [11], a continuous-time Markov model is used to analyze the case of shared protection.
In [12], a calculation method using the concept of restoration-aware connection availability is developed to analyze mesh networks with dedicated or shared backup capacity. In [13], a heuristic technique is developed for computing end-to-end circuit availability for optimal routing and wavelength assignment. Other related works may be found surveyed in [11]. Little work, if any, appears to have been concerned with mesh networks having general dynamic path restoration.

In this paper, we consider the general problem of analyzing path availability in mesh networks with dynamic path restoration, where failover paths are determined dynamically, on the fly, by an algorithm in real time based on the current state of the network. In this general problem, the size of the state-space and the structural complexity of the system generally preclude the use of analytical modeling techniques. Direct simulation can also be very challenging, or even impractical, when the sets of network element failure events that lead to loss of end-to-end

1063-6692/$26.00 © 2010 IEEE

CONWAY: FAST SIMULATION OF SERVICE AVAILABILITY IN MESH NETWORKS WITH DYNAMIC PATH RESTORATION


path service occur very rarely. To tackle the problem in a practical and general way, we develop a fast, efficient Markov Monte Carlo simulation technique for the analysis of service availability in a general mesh network model with a general dynamic path restoration method. In the model, it is assumed that there is a given set of initial end-to-end paths that carry end-to-end traffic demands. When there are one or more network element failures, the affected paths are rerouted dynamically by a given rerouting algorithm that generates alternate routes to use. As element repairs are made and the initial routes become available again for use, the rerouted paths may revert to their respective original routes. The model also uses the concept of a failure equivalence group (FEG), consisting of failure event sources and pools of repair personnel, to account for multiple in-series link cuts, optical amplifier failures along each link, as well as bidirectional link failures, node failures, or more general geographically distributed failure scenarios. The FEG is a generalization of the concept of the Shared Risk Link Group (SRLG) used in optical and GMPLS networking [14]-[16].

The mesh network simulation technique developed here combines the simulation of any specific dynamic path restoration algorithm with a dynamic importance sampling (DIS) [17] variance reduction technique tailored specifically to the mesh network problem at hand. The DIS method developed here is called dynamic path-failure importance sampling (DPFS). In DPFS, the failure rates of network elements are biased at increased rates until path failures are observed to occur under the given rerouting algorithm. This enables the efficient simulation of path availability by reducing substantially the simulation run-length needed to meet a desired confidence interval requirement. The application of DIS to mesh networks with dynamic path restoration appears to be a novel application of the importance sampling technique.
Hitherto developed importance sampling simulation methods for networks have, for example, assumed fixed routing and focused on other measures such as call blocking (see, e.g., [18]) or packet/cell loss (see, e.g., [19]). An attractive feature of the developed simulation technique is that it fully takes into account the details of any path rerouting algorithm together with the generality of the defined mesh network model. The simulation method, however, provides confidence intervals on availability estimates as opposed to exact or approximate analytical results.

The paper is organized as follows. In the following section, we first define the general mesh network model. The assumed general dynamic path restoration method is then defined in Section III. The modeling of failures and repairs with FEG is developed in Section IV. The simulation of path availability, the application of importance sampling, and the DPFS method are developed in Sections V-VII, respectively. Section VIII presents: 1) a small mesh network simulation example to validate the theory and demonstrate the effectiveness of DPFS; and 2) a large example to demonstrate the effectiveness in a problem size that is typical of one arising in practice.

II. MESH NETWORK MODEL

The generic mesh network model considered here is formulated in terms of nodes, links, circuits, and paths. As shown figuratively in Fig. 1, a unidirectional circuit runs over one or

Fig. 1. Model structure in terms of nodes, links, circuits, and paths.

more unidirectional links. A unidirectional path runs over one or more unidirectional circuits.

The network is composed of $L$ unidirectional point-to-point links. A link can correspond to, for example, an optical fiber or a wireless radio channel. The bandwidth of link $l$ is $B_l$ bits/s. A link may be in an operational or a failed state due to, e.g., a fiber cut, a failed optical amplifier, or a failed radio. A failed link has no available bandwidth. The instantaneous available bandwidth of link $l$ at time $t$ is denoted by $b_l(t)$. The initial condition is $b_l(0) = B_l$. A bidirectional link is modeled using a pair of unidirectional links.

A circuit is defined to be a generic unidirectional connection between two nodes over a set of interconnected links. In an optical mesh network, a circuit can correspond to a wavelength or lightpath between a pair of nodes. The end-to-end wavelength may run over (be switched through) one or more connected links. The end-to-end wavelength may also be composed of different concatenated wavelengths if there is static wavelength conversion at switching nodes. In a virtual-circuit mesh network, such as an MPLS mesh network, a circuit can correspond to a static virtual circuit between a pair of nodes. The total number of circuits is $J$. The total bandwidth in circuit $j$ is $W_j$ bits/s. A circuit consumes the bandwidth in each of the links that it uses. The circuit routing matrix is defined to be $R = [r_{jl}]$, where $r_{jl} = 1$ if circuit $j$ uses link $l$, and 0 otherwise. The circuit routing matrix is assumed to be static in time. A link may be used by more than one circuit. The bandwidth of a circuit must necessarily be less than or equal to the bandwidth of any of the links that the circuit uses, i.e., $W_j \le \min_{l:\, r_{jl}=1} B_l$. The sum of the bandwidths of the circuits that use link $l$ must necessarily be less than or equal to the link bandwidth $B_l$, i.e.,

$$\sum_{j=1}^{J} r_{jl} W_j \le B_l \quad \text{for } l = 1, \dots, L.$$

If a circuit uses a link that is in a failed state, then the circuit is considered to be in a failed state with no available bandwidth. The instantaneous available bandwidth of circuit $j$ at time $t$ is denoted by $w_j(t)$, where $0 \le w_j(t) \le W_j$. The initial condition is $w_j(0) = W_j$. A bidirectional circuit is modeled using a pair of unidirectional circuits. The two directions of a bidirectional circuit need not follow the same route over the links of the network.
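The two bandwidth constraints on the circuit routing matrix can be checked mechanically. The following is a minimal sketch, not part of the original paper; the function name and the list-of-lists representation of the routing matrix are assumptions made for illustration.

```python
def check_circuit_routing(r, circuit_bw, link_bw):
    """Return a list of violated constraints for a circuit routing matrix.

    r[j][l] is 1 if circuit j uses link l, else 0 (hypothetical layout).
    circuit_bw[j] and link_bw[l] are bandwidths in bits/s.
    """
    problems = []
    for j, row in enumerate(r):
        for l, used in enumerate(row):
            # A circuit cannot be wider than any link it uses.
            if used and circuit_bw[j] > link_bw[l]:
                problems.append(f"circuit {j} exceeds link {l}")
    for l in range(len(link_bw)):
        # Circuits sharing a link cannot oversubscribe it.
        load = sum(circuit_bw[j] for j, row in enumerate(r) if row[l])
        if load > link_bw[l]:
            problems.append(f"link {l} oversubscribed")
    return problems


# Two circuits over three 10 Gb/s links; both circuits share the middle link.
r = [[1, 1, 0], [0, 1, 1]]
print(check_circuit_routing(r, [4e9, 6e9], [10e9, 10e9, 10e9]))
```

An empty list means that both the per-circuit and the per-link constraints hold for the given static routing.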


A path is defined to be a generic unidirectional end-to-end connection between two nodes over a set of interconnected generic circuits. For example, in an optical mesh network, a path can correspond to an end-to-end circuit-switched connection over a set of wavelengths or lightpaths. In a virtual-circuit mesh network, such as an MPLS mesh network, a path can correspond to an end-to-end virtual circuit [or label switched path (LSP)] over a set of wavelengths, lightpaths, or static virtual circuits. The total number of paths is $P$. The required bandwidth of path $p$ is $D_p$ bits/s. A path consumes the bandwidth in each of the circuits that it uses. A circuit may be used by more than one path. The routing of a path in terms of working circuits may change in time as circuit failures occur due to link failures and paths are rerouted. If a working route for a path cannot be found with the assumed dynamic path restoration method, then the path is no longer operational. Let $x_p(t) = 1$ if path $p$ is operational at time $t$, and 0 otherwise. It is assumed that all paths are operational initially, i.e., $x_p(0) = 1$ for $p = 1, \dots, P$. The state of the path routing at time $t$ is given by the path routing matrix $Y(t) = [y_{pj}(t)]$, where $y_{pj}(t) = 1$ if path $p$ uses circuit $j$ at time $t$, and 0 otherwise. The routing of all the paths is necessarily always subject to the available bandwidth $w_j(t)$ of each circuit $j$, i.e.,

$$\sum_{p=1}^{P} y_{pj}(t) D_p \le w_j(t) \quad \text{for } j = 1, \dots, J,$$

where $J$ is the total number of circuits. The initial path matrix $Y(0)$ is assumed to be provided. The initial routes for the paths can be determined by an algorithm of the least-cost path type (e.g., shortest paths, minimum-hop paths), load balancing type, or other type. A bidirectional path is modeled using a pair of unidirectional paths. The two directions of a bidirectional path need not follow the same route through the circuits or links of the network.

III. DYNAMIC PATH RESTORATION

The failure of a particular link results in the failure of all circuits that use the link. The failure of a circuit can lead to the possible failure of a path.
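The propagation of a link failure to circuits, and from circuits to the paths that need rerouting, can be sketched as follows. This is an illustrative sketch only; the function names and the list-based matrices are assumptions, and the restoration step itself is left out.

```python
def circuit_bandwidth(r, circuit_bw, link_up):
    """Available bandwidth per circuit: full if every link it uses is up, else 0."""
    return [
        bw if all(link_up[l] for l, used in enumerate(row) if used) else 0.0
        for row, bw in zip(r, circuit_bw)
    ]


def affected_paths(y, avail):
    """Indices of paths whose current route uses at least one failed circuit."""
    return [
        p for p, row in enumerate(y)
        if any(used and avail[j] == 0.0 for j, used in enumerate(row))
    ]


r = [[1, 0], [0, 1]]   # circuit j rides link j only
y = [[1, 0]]           # the single path rides circuit 0
avail = circuit_bandwidth(r, [1e9, 1e9], [False, True])  # link 0 has failed
print(avail, affected_paths(y, avail))
```

Here the failure of link 0 zeroes the bandwidth of circuit 0, so path 0 is reported as affected and would be handed to the restoration algorithm.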
When a path experiences a circuit failure, the dynamic mesh network will attempt to reroute the affected path over circuits that are operational. The process of rerouting a failed path in response to a failed circuit is called dynamic path restoration. The type of restoration method assumed here is quite general. It may be one that finds the next shortest route in terms of working circuits, subject to the prevailing circuit bandwidth constraints. It may be one that finds a route that maximizes the minimum remaining capacity over all working circuits in the network. The restoration method could also reroute some or all paths in the network (network repacking) to maximize some objective function. The method could also reroute paths in response to the completion of link repairs.

The dynamics of the path restoration algorithm are defined generally as follows. Let the state of the bandwidth of the circuits at time $t$ be defined by the vector $w(t) = (w_1(t), \dots, w_J(t))$. As already defined, the path routing matrix at time $t$ is given by $Y(t)$. Now, suppose that at time $t$ there is a link failure or repair event that causes the new state of the circuits to become $w(t^+)$.

The general function $F$ of the path restoration algorithm is then to determine a new path routing matrix $Y(t^+)$ based on $Y(t)$, subject to the circuit bandwidth state $w(t^+)$, i.e., $Y(t^+) = F(Y(t), w(t^+))$. If a path $p$ is affected by a failure event and the path cannot be rerouted, then $x_p(t^+)$ is set to 0, and the path routing matrix entries for path $p$ become irrelevant.

It is to be noted that the routes of rerouted paths can depend upon the previously realized ordered sequence of link failure and repair events. When all links return to an operational state, the path routing matrix could be different from $Y(0)$. However, we assume in the formulation developed here that the dynamic path restoration method is such that all paths return to their initial routes $Y(0)$ once all links become operational, i.e., if $w(t) = (W_1, \dots, W_J)$, then $Y(t) = Y(0)$. This reflects what may be seen typically, though not necessarily, in practice since there are usually some established desired routes under normal operating conditions.

It is also to be noted that, in the above defined path restoration dynamics, we do not account for the time that paths may not be operational while the path restoration algorithm is finding new path routes to use. However, this rerouting time is, in practice, typically orders of magnitude smaller than the time to repair links (e.g., cable cuts), and therefore, the main contribution to path unavailability is the time to repair physical links themselves. Today's optical mesh networks can have restoration times of the order of seconds or less [20]. The average repair time of cable cuts is, in practice, typically of the order of several hours (see, e.g., [21] and [22]).

IV. FAILURE AND REPAIR MODELING WITH FEG

We now define the modeling of failures and repairs in the network. To enable the construction of mesh network models with features that can reflect network characteristics commonly seen in practice, we define and use the concept of a failure equivalence group (FEG).
A FEG is defined to be a particular subset of unidirectional links together with an associated failure and repair process. A particular link may belong to one or more groups. At any point in time, each group is in either an operational or a failed state. When a group is in a failed state, all the unidirectional links in the group are unusable. A unidirectional link is usable if and only if all the groups to which it belongs are operational. Each group experiences the arrival of failure events that cause the group to be in a failed state. The failure events in a particular group are repaired by a finite or infinite pool of repair personnel that is dedicated to the group. When a group is operational and a failure event arrives in the group, the group enters the failed state and the repair of the failure event is started by a repair person. While in the failed state, a group may also experience additional independent arrivals of failure events. The additional failure events may be repaired by additional repair persons in parallel or placed in a repair queue. In general, as shown in Fig. 2, the failure and repair process for each group is modeled as a dedicated finite or infinite source multiserver queue, with the number of servers corresponding to the population of the repair personnel associated with the group.
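The rule that a link is usable if and only if all of its groups are operational can be illustrated with a small sketch. The group and link names below are hypothetical and chosen only to show a shared cable and a node failure group; they do not come from the paper.

```python
def usable_links(link_groups, failures):
    """Links whose groups are all operational.

    link_groups maps a link name to the FEG names it belongs to.
    failures maps a FEG name to its number of unrepaired failure events.
    """
    return sorted(
        link for link, groups in link_groups.items()
        if all(failures.get(g, 0) == 0 for g in groups)
    )


link_groups = {
    "A->B": ["cable_AB", "node_B"],   # both directions share one cable
    "B->A": ["cable_AB", "node_B"],
    "B->C": ["cable_BC", "node_B"],   # every link adjacent to node B
    "A->C": ["cable_AC"],
}
# Two outstanding cuts on the A-B cable take down both directions at once.
print(usable_links(link_groups, {"cable_AB": 2}))
```

A group stays failed while its count of unrepaired events is positive, which is why the count, rather than a flag, is kept per group.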


Fig. 2. Failures and repairs in a failure equivalence group.

Whenever the repair of all outstanding failure events in a group has been completed, the group reenters the operational state.

It is to be noted that the FEG construct is closely related to the concept of a Shared Risk Link Group (SRLG) [14]-[16]. A SRLG is defined in [14] as a set of links or optical lines sharing a common physical resource (including fiber links/sub-segment/segment/trunk), i.e., sharing a common risk. The FEG is also a group of network elements, i.e., unidirectional links, that shares a common risk, i.e., failure events. The FEG construct, however, is more general than the SRLG since it also includes the specification of a failure event (risk) arrival process and an event (risk) repair process.

The generality of the FEG construct provides a unified way to create network models that can faithfully represent many of the common failure and repair characteristics of mesh networks. Bidirectional link failure and its repair can be modeled by associating a group with a particular pair of unidirectional links. In networks, the different directions of links between a particular pair of nodes are frequently in the same cable or conduit, or just physically adjacent. In these cases, cuts will result in the simultaneous failure of the links in the different directions. Hence, there is a need to model such physical link dependencies explicitly while still allowing for purely independent unidirectional links that one may also have in mesh networks.

The FEG construct also enables one to model the possibility in practice of having multiple simultaneous cuts in series in a particular unidirectional or bidirectional link. That is, because a link is physically distributed, a second or third cut, or even more cuts, may occur in a particular link before a first cut and any other subsequent cuts are repaired. This can be modeled by using a FEG with an infinite source of failure events. The assumed population of the repair personnel may be finite, as is of course always the case in reality, or infinite for the sake of modeling simplicity.

The construct also enables the modeling of the failure and repair of optical amplifiers (e.g., EDFA) that are typically spaced along fibers to boost optical signal levels. This can be modeled

by using a FEG with a finite source of failure events, where the population of the source corresponds to the number of amplifiers in the link. The population of the repair personnel in this case is finite, at most equal to the number of amplifiers. To simultaneously model cuts and amplifier failures and repairs on a link, we can associate the link with two different FEG, one to model the cuts and the other to model the amplifier failures. In this way, we can realistically model links with cuts and amplifier failures.

Node failures and their repair can also be modeled with FEG by simply associating a group with all the unidirectional links adjacent to a particular node. A node failure can correspond to the destruction of a network node by natural or man-made causes. Using FEG, we can also model any set of simultaneous unidirectional link and node failures. Such additional general failure modeling can correspond to geographically distributed physical failure events such as ice storms, windstorms, or floods. In the case of such physical failure events, the FEG source population and repair pool population may both be set to a size of 1. The preventative maintenance of network elements can also be modeled using FEG.

The failure and repair modeling with FEG is now developed more mathematically. The number of groups is $G$. The failure and repair processes of the groups are assumed to be independent and Markovian. The assumption of independent groups is quite reasonable in the present mesh network modeling context since the links or amplifiers in a particular group will typically be separated geographically from those in other groups and separate repair personnel are typically located in different geographical areas. The failure arrival process of a group may correspond to either an infinite or a finite source, as shown in Fig. 2. The maximum possible number of failure events in group $g$ is $M_g$. In the case of an infinite source, the failure event arrival rate of group $g$ is $\lambda_g$ and $M_g = \infty$. In the case of a finite source, the number of sources is $M_g$, and the arrival rate for each source is $\lambda_g$. The repair rate of a group $g$ failure by a repair person is $\mu_g$. Let $\rho_g = \lambda_g / \mu_g$. The state of the groups at time $t$ is given by the random variable $N(t) = (N_1(t), \dots, N_G(t))$, where $N_g(t)$ is the number of group $g$ failure events at time $t$ that have not been repaired. If $N_g(t) = 0$, then group $g$ is in the operational state; otherwise, it is in the failed state. The number of repair personnel associated with group $g$ is $R_g$. The group failure and repair process forms a continuous-time Markov chain with state-space $\{ n = (n_1, \dots, n_G) : 0 \le n_g \le M_g \}$ and initial state $N(0) = 0$. The group process determines the links that are operational at a point in time. Hence, the group process is what drives the available bandwidth in the circuits and, consequently, the rerouting of the paths.

The joint steady-state probability distribution $\pi(n)$ of the group process, where $n = (n_1, \dots, n_G)$ and $0 \le n_g \le M_g$, is known since the groups are independent and each group process corresponds to a Markovian queue. The distribution is given by the product-form

$$\pi(n) = \prod_{g=1}^{G} \pi_g(n_g) \qquad (1)$$

where $\pi_g(n_g)$ corresponds to the steady-state distribution of a type of queue. For example: 1) if we


have a finite source with $M_g = R_g = 1$, then $\pi_g(n_g) = \rho_g^{n_g} / (1 + \rho_g)$ for $n_g \in \{0, 1\}$ (see [23, Sec. 3.6]); 2) if we have an infinite source with $R_g = 1$, then $\pi_g(n_g) = (1 - \rho_g)\rho_g^{n_g}$, $\rho_g < 1$ (see [23, Sec. 3.2]); 3) if we have an infinite source with an infinite repair pool ($R_g = \infty$), then $\pi_g(n_g) = e^{-\rho_g} \rho_g^{n_g} / n_g!$ (see [23, Sec. 3.4]); and 4) if we have a finite source and $1 \le R_g \le M_g$, then (see [23, Sec. 3.9])

$$\pi_g(n_g) = \begin{cases} \pi_g(0) \binom{M_g}{n_g} \rho_g^{n_g}, & 0 \le n_g \le R_g \\ \pi_g(0) \binom{M_g}{n_g} \dfrac{n_g!}{R_g!\, R_g^{\,n_g - R_g}} \rho_g^{n_g}, & R_g \le n_g \le M_g \end{cases}$$

where $\pi_g(0)$ follows from normalization. Here, for group $g$, $\lambda_g$ is the failure rate, $\mu_g$ is the repair rate, $\rho_g = \lambda_g / \mu_g$, $M_g$ is the source population, and $R_g$ is the number of repair personnel.

The form of $\pi_g(n_g)$ for other related Markovian queues may also be determined. Case 1) is the simplest in which a group simply experiences a single failure event that is then repaired. Case 2) models multiple cuts on a link with a single repair person. Case 3) models multiple cuts on a link with no limit on the number of repair personnel. Case 4) models $M_g$ optical amplifiers on a link with $R_g$ repair personnel.

V. SIMULATION OF PATH AVAILABILITY

The service measure of central interest here is the path availability $A_p$, $p = 1, \dots, P$, defined to be the average proportion of time that path $p$ is operational in steady state. Let $T$ be the random variable of the recurrence time of the state $0$. This state is a regenerative state of the model since it has been assumed in Section III that the path restoration is such that all paths return to their initial routes $Y(0)$ when all links become operational. The average time $E[T_p]$ that path $p$ is operational during the recurrence time is

$$E[T_p] = E\left[ \int_0^T x_p(t)\, dt \right]$$

where $E$ denotes expectation and, as defined in Section II, $x_p(t) = 1$ if path $p$ is operational at time $t$, and 0 otherwise. The path availability is given by $A_p = E[T_p] / E[T]$ (see, e.g., [24]). Conveniently, an explicit expression for $E[T]$ can be obtained here since $\pi(0)$ is known. This avoids the need to estimate $E[T]$. The mean sojourn time in state $0$ is $1/\Lambda$, where $\Lambda = \sum_{g=1}^{G} \Lambda_g$, with $\Lambda_g = \lambda_g$ if group $g$ is an infinite source, and $\Lambda_g = M_g \lambda_g$ if group $g$ is a finite source. We may write $\pi(0) = (1/\Lambda) / E[T]$ or $E[T] = 1 / (\Lambda \pi(0))$. Hence,

$$A_p = \Lambda\, \pi(0)\, E[T_p].$$

Since $\pi(0)$ is known from (1), the remaining unknown here is now $E[T_p]$, which is to be estimated with simulation.

The method of estimating the average operational time in a recurrence time is to use regenerative simulation with state $0$ as the regenerative state. Rather than simulating the continuous-time Markov chain (CTMC) of the group process, we simulate the associated embedded discrete-time Markov chain (DTMC). In [17], it is shown that we can estimate steady-state measures of a CTMC by simulating the corresponding DTMC. The deterministic holding times in the states of the DTMC are set to the corresponding mean state holding times in the CTMC. Simulating the DTMC is also guaranteed to reduce the variance in estimating steady-state measures [17].

The possible state transitions and associated transition rates in the CTMC out of a state $n$ are shown in Fig. 3.

Fig. 3. Possible transitions and rates out of state n in the CTMC.

The DTMC transition probability $p(n, n + e_g)$ from state $n$ to $n + e_g$, corresponding to a group failure event arrival, where $n = (n_1, \dots, n_G)$, $e_g$ is a unit vector pointing in the direction $g$, $g = 1, \dots, G$, and $n_g < M_g$, is, therefore, given by

$$p(n, n + e_g) = \lambda_g(n)\, h(n)$$

where $\lambda_g(n)$ is the transition rate in the CTMC for a group $g$ failure event arrival, given that the current state is $n$, and $h(n)$ is the mean holding time in state $n$ of the CTMC. The DTMC transition probability $p(n, n - e_g)$ from state $n$ to $n - e_g$, corresponding to a group failure event repair, where $n_g > 0$, is, therefore, given by

$$p(n, n - e_g) = \mu_g(n)\, h(n)$$

where $\mu_g(n)$ is the transition rate in the CTMC for a group $g$ failure event repair, given that the current state is $n$. The mean holding time in state $n$ of the CTMC is given by

$$h(n) = \frac{1}{\sum_{g=1}^{G} \left[ \lambda_g(n) + \mu_g(n) \right]}.$$

More explicitly, for example, in the case of an M/M/1//1 group, $\lambda_g(n) = \lambda_g$ if $n_g = 0$, and 0 otherwise, and $\mu_g(n) = \mu_g$ if $n_g = 1$, and 0 otherwise. In an M/M/1 group, $\lambda_g(n) = \lambda_g$ and $\mu_g(n) = \mu_g$ if $n_g > 0$, and 0 otherwise. In an M/M/$\infty$ type of group, $\lambda_g(n) = \lambda_g$ and $\mu_g(n) = n_g \mu_g$. In an M/M/$R_g$//$M_g$ group, $\lambda_g(n) = (M_g - n_g)\lambda_g$ and $\mu_g(n) = \min(n_g, R_g)\,\mu_g$.

Now, let $K$ be the discrete random variable of the recurrence time for state $0$ in the DTMC, i.e., the number of DTMC state transitions in a tour from state $0$ and back to state $0$. Let $n(k)$ be the random variable of the DTMC state at time epoch $k$, where $k = 0, \dots, K$, $n(0) = 0$, and $n(K) = 0$. Let $w(k)$ be the state of the circuit bandwidths at time epoch $k$ in the DTMC, and let $Y(k)$ be the state of the path routing at time epoch $k$ in the DTMC. The state of the circuits and paths do not change during the holding time in a state. When there is a transition out of a state due to a group failure event or a repair, the state of the circuits becomes $w(k+1)$ and that of the paths becomes $Y(k+1)$, where $Y(k+1) = F(Y(k), w(k+1))$, $k = 0, \dots, K-1$. Let $x_p(k) = 1$ if path $p$ is operational at time epoch $k$ in the DTMC under $w(k)$ and $Y(k)$, and 0 otherwise.

We may now relate $E[T_p]$ in the CTMC to that in the DTMC. Since $A_p$ is a steady-state measure, we know from [17] that it can be estimated from the DTMC. Hence, we may write

$$E[T_p] = E\left[ \sum_{k=0}^{K-1} x_p(k)\, h(n(k)) \right].$$
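The per-group rates, the embedded-DTMC transition probabilities, and the finite-source stationary distribution can be sketched in a few lines. This is a hedged illustration under the Markovian assumptions stated above; the function names, the dictionary-based group parameters, and the convention that `None` means an infinite source or an infinite repair pool are all assumptions of this sketch, not the paper's notation.

```python
def group_rates(n_g, lam, mu, sources=None, repairers=None):
    """Failure and repair rates of one FEG holding n_g unrepaired failures.

    sources=None means an infinite source; repairers=None means an
    infinite repair pool (conventions of this sketch).
    """
    fail = lam if sources is None else (sources - n_g) * lam
    busy = n_g if repairers is None else min(n_g, repairers)
    return fail, busy * mu


def dtmc_step(n, groups):
    """Embedded-DTMC transition probabilities and mean holding time out of n."""
    rates = {}
    for g, (n_g, params) in enumerate(zip(n, groups)):
        fail, repair = group_rates(n_g, **params)
        if fail > 0:
            rates[(g, +1)] = fail      # failure event arrives in group g
        if repair > 0:
            rates[(g, -1)] = repair    # one failure event in group g is repaired
    hold = 1.0 / sum(rates.values())   # mean CTMC holding time in state n
    return {move: rate * hold for move, rate in rates.items()}, hold


def finite_source_pi(lam, mu, sources, repairers):
    """Stationary distribution of a finite-source group via birth-death balance."""
    weights = [1.0]
    for k in range(1, sources + 1):
        birth = (sources - (k - 1)) * lam
        death = min(k, repairers) * mu
        weights.append(weights[-1] * birth / death)
    total = sum(weights)
    return [w / total for w in weights]


groups = [
    {"lam": 0.01, "mu": 1.0, "sources": 1, "repairers": 1},  # single failure, single repairer
    {"lam": 0.02, "mu": 2.0},                                # infinite source, infinite pool
]
probs, hold = dtmc_step((1, 0), groups)
print(probs, hold)
print(finite_source_pi(0.01, 1.0, 1, 1))
```

Computing the stationary weights from the balance equations, rather than from the closed forms, keeps one code path for any finite source population and repair pool size.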

Now, let $\Omega_K$ be the set of all possible tours of length $K$ in the DTMC starting at state $0$ and returning back to state $0$ in $K$ steps, where a tour is $\omega = (n(0), n(1), \dots, n(K))$, $n(k)$ is the DTMC state at time epoch $k$, $n(k) = (n_1(k), \dots, n_G(k))$, and $n_g(k)$ is the number of group $g$ failures at time epoch $k$ that have not been repaired. Let $P(\omega)$ be the probability of realizing tour $\omega$. Then, we may write

$$E[T_p] = \sum_{K} \sum_{\omega \in \Omega_K} T_p(\omega)\, P(\omega)$$

where $T_p(\omega) = \sum_{k=0}^{K-1} x_p(k)\, h(n(k))$, with $x_p(k) = 1$ if path $p$ is operational at epoch $k$ and $h(n)$ the mean holding time in state $n$. Hence, if we simulate the DTMC using Markov Monte Carlo simulation starting at state $0$ until it returns to state $0$, then an estimate of $E[T_p]$ is given by $\sum_{k=0}^{K-1} x_p(k)\, h(n(k))$, where $K$ is the realized number of steps in the tour in the DTMC.

VI. SIMULATION WITH IMPORTANCE SAMPLING

The direct simulation of the DTMC to estimate $E[T_p]$ with Markov Monte Carlo may be very time-consuming, or even impractical, since the failure rates of the groups will, in practice, usually be much smaller than the repair rates. To reduce the simulation time needed, i.e., the number of independent regenerations, we can use importance sampling [17], [25]. In importance sampling, the state transition probabilities $p(n, n')$ in the DTMC are modified to the values $p'(n, n')$ so that group failure events are more likely to arrive. We may write

$$E[T_p] = \sum_{K} \sum_{\omega \in \Omega_K} T_p(\omega)\, L(\omega)\, P'(\omega)$$

where

$$P'(\omega) = \prod_{k=0}^{K-1} p'(n(k), n(k+1))$$

and

$$L(\omega) = \frac{P(\omega)}{P'(\omega)} = \prod_{k=0}^{K-1} \frac{p(n(k), n(k+1))}{p'(n(k), n(k+1))}.$$

The term $L(\omega)$ is known as the likelihood ratio. Hence, if we simulate the modified DTMC starting at state $0$ until it returns to state $0$, then an estimate of $E[T_p]$ is given by $T_p(\omega)\, L(\omega)$, where $K$ is the realized number of steps in the tour in the modified DTMC.

The manner in which the DTMC transition probabilities can be modified to $p'(n, n')$ is very general [17]. The modifications may be static or dynamic, i.e., on the fly as a function of state or time. A dynamic method is known as dynamic importance sampling (DIS) [17]. In the following, we develop a DIS method tailored specifically to the problem of simulating mesh networks with dynamic path restoration.

VII. DYNAMIC PATH-FAILURE SAMPLING

The DIS simulation method developed here for estimating path availabilities in mesh networks with dynamic path restoration is called dynamic path-failure importance sampling (DPFS). In DPFS, the goal is to bias the system state trajectory specifically toward path failures that the restoration algorithm is unable to restore. This is achieved by setting the failure rates in the FEG at an increased level until path failures are observed to occur (or state $0$ is reached). Once a path failure is realized, the group failure rates are set back to their original values. Note that if we simply biased failure rates until a group failure occurred, then this would not be effective in general since, under dynamic path restoration, the failure of a group does not always necessarily lead to the failure of paths.

In DPFS, we define the group failure bias to be a constant $\beta$, $\beta \ge 1$, such that the failure rate $\lambda_g$ is increased to $\beta \lambda_g$ for $g = 1, \dots, G$. We also define a target failure rate ratio $\alpha$, $\alpha > 0$. The target $\alpha$ is the desired ratio of the sum of the biased group failure rates $\sum_{g} \beta \lambda_g$ and the sum of the group repair rates $\sum_{g} \mu_g$. If the target is $\alpha$, then the group failure bias is given by

$$\beta = \alpha\, \frac{\sum_{g=1}^{G} \mu_g}{\sum_{g=1}^{G} \lambda_g}.$$

The value of $\alpha$ is set by the user. A reasonable value to use can be determined with some trial simulation runs. The DPFS method is then to simulate the DTMC with the biased failure rates starting from state $0$ until a path failure is observed (or state $0$ is reached). Once a path failure is observed, the bias is set to 1.0 (i.e., turned off). The system then returns to state $0$ after all group repairs have been made. This regenerative simulation process is repeated independently, always starting again from state $0$, until a required number of independent regenerations have been completed.

The DPFS simulation method for the mesh network with dynamic path restoration rerouting function $F$ is summarized below. In the following, the number of independent regenerations is given by $I$, the estimate of $E[T_p]$ obtained in regeneration $i$ is denoted by $T_p^{(i)}$, the mean estimate of $E[T_p]$ is denoted by $\hat{T}_p$, and the estimate of the availability of path $p$ is denoted by $\hat{A}_p$, $p = 1, \dots, P$.

Mesh Network Simulation with DPFS:

Choose the target failure rate ratio $\alpha$. Set the bias $\beta = \alpha \sum_{g} \mu_g / \sum_{g} \lambda_g$.

For

{ . Set . and paths state to and . { . .

Set the initial state to Initialize circuits state For For Set While For and

: Set failure rate : Set

(turns bias on). .


If

for any ,

: to

For : Set failure rate of group (turns bias off).

Randomly sample the next state transition out of state in the DTMC: New state is . Set Update Update For Set } For } For For : : . . : . . : Update . , .

Note that, in the above simulation method, the complexity of each regenerative tour is dominated by the complexity of the particular path rerouting method that is being simulated. The complexity of randomly sampling each state transition out of state is of the order of with binary search since there are at most transitions out of a state in the DTMC. The number of times that the path rerouting method is invoked in a tour depends on the actual DTMC state transitions that are realized in the tour.

VIII. MESH NETWORK EXAMPLES

We present two mesh network examples. The first is a small test network to validate the theory and demonstrate the efficiency of DPFS. The second is a larger network example to illustrate the practical application of DPFS to a modeling problem of a size that can typically be encountered in practice. In both examples, we assume that the initial paths are the shortest paths and that the dynamic path restoration algorithm reroutes a path by determining the next shortest operational route using Dijkstra's algorithm [26], subject to the prevailing circuit bandwidth constraints. As in Section III, we assume that all paths return to the initial paths once all links become operational, i.e., when the group process returns to the regenerative state . During the time that at least one link is in a failed state, we assume in the examples that a working path is rerouted to an alternate working route only when there is a circuit failure in the current route of the working path. Before a path is rerouted, the bandwidth is released in each of the working circuits along the current route of the path. Working paths that have no circuit failures in their current route are not rerouted. Also, working paths are not rerouted in response to link repairs, at least until all links become operational. These assumptions reflect what may typically, though not necessarily, be done in practice to avoid potential unnecessary operational errors in rerouting paths that are currently working. We also assume in the examples that paths, which could not be rerouted due to prevailing bandwidth constraints, are not rerouted later in response to other subsequent link failures or repairs, at least until all links become operational. Note that other path rerouting assumptions may, of course, be adopted to model other cases.

A. Small Network Example

The small network consists of a pair of unidirectional links ( ) from node A to node B, two circuits ( ), and one path ( ), as shown in Fig. 4. The initial path uses the first circuit, i.e., . It is assumed that the links fail independently, i.e., each link belongs to a separate FEG ( ), and that there is at most one failure in a link (e.g., each link could be a radio channel). If the first link is the first to fail and the second link is operational, then the path is routed over the circuit that uses the second link. If the second link should then fail before the first link is repaired, then the path is no longer operational until both links have been repaired. If the second link is the first to fail and the first link then fails, then the path is no longer operational until both links have been repaired. Once all links are operational, the group process returns to the state (0,0) and the routing is assumed to return to .

In the small example, we assume , , , , and we set the target failure rate ratio . To obtain a confidence interval on the estimated path availability, we simulate the DTMC until we complete at least 10 000 state transitions and then compute a confidence interval based on the regenerative tours that have been completed. With DPFS, the obtained 99% confidence interval for the path unavailability is (1.434 , 1.319 ). With the failure biasing turned off, the obtained 99% confidence interval is (2.921 , 0). For the sake of comparison, the exact path unavailability obtained with analysis (see the Appendix) is 1.349 . Hence, in this case, the confidence interval width is reduced by a factor of about 25 with DPFS. If, in the case where failure biasing is turned off, we extend the simulation to complete at least 10 million state transitions, then the 99% confidence interval is (1.438 , 1.338 ). This is approximately what it was with DPFS with only 10 000 state transitions. Hence, in this example, DPFS reduces the simulation run-length by a factor of about 1000. This demonstrates how DPFS can reduce substantially the simulation run-length needed to meet a given confidence interval width.

Fig. 4. Small mesh network example.

B. Large Network Example

For a large example, we adopt the 18-node IP backbone network topology presented in [27] and shown in Fig. 5. In this network example, there are 33 (bidirectional) links between the nodes. Each bidirectional link is modeled with a pair of unidirectional links. A unidirectional circuit is associated with each unidirectional link. Each circuit is assumed to have a total bandwidth of 192 units. A unidirectional path is associated with each unidirectional circuit. In addition, there is a pair of unidirectional paths assumed between nodes 1 and 5, nodes 1 and 13, nodes 6 and 15, and nodes 13 and 16. Each path is assumed to have a required bandwidth of 48 units. Hence, in the network, there are 66 unidirectional links, 66 circuits, and 74 paths. To model link cuts, we associate one FEG of the type with each pair of unidirectional links. To model optical amplifier failures on the links, we associate an additional FEG of the type with each pair of unidirectional links. Hence, in the network, we have a total of 66 FEG. We assume that if an amplifier fails, then both directions of the bidirectional link are no longer operational. This models what happens typically, though not necessarily, in practice, where a link is effectively taken out of service by an upper layer protocol when either direction fails. The mesh network model developed here, however, is sufficiently general that it can also accommodate the case where an amplifier failure only affects one direction. The link cut and amplifier mean repair times are assumed to be 4 h. The amplifier spacings are assumed to be 100 km. The link cut rates are assumed to be one cut per year per 800 km (as in [21]). The amplifier mean time to failure is assumed to be five years. The distances between the nodes are assumed to be the approximate geographic distances between the nodes (cities) shown in Fig. 5. The target failure rate ratio is set at . The path rerouting algorithm is shortest-paths using Dijkstra's algorithm.

Fig. 5. Large mesh network example.

Fig. 6. Path unavailability estimates with DPFS.
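The failure and repair assumptions stated for the large example can be turned into per-link rates with a few lines of arithmetic. The sketch below is illustrative only: the function name `link_rates_per_hour`, the use of 8760 hours per year, the rule that a link of length d carries floor(d/100) amplifiers, and the 1000 km example distance are all assumptions made here, not values taken from the paper (the actual link distances are read from Fig. 5).

```python
HOURS_PER_YEAR = 8760.0  # assumption: non-leap year


def link_rates_per_hour(distance_km):
    """Return (cut_rate, amplifier_rate, repair_rate) per hour for one bidirectional link.

    Uses the stated assumptions: one cut per year per 800 km, amplifiers
    spaced 100 km apart with a five-year mean time to failure, and a 4 h
    mean repair time. The amplifier count is an assumed floor(d / 100).
    """
    cut_rate = (distance_km / 800.0) / HOURS_PER_YEAR
    n_amplifiers = int(distance_km // 100.0)
    amplifier_rate = n_amplifiers / (5.0 * HOURS_PER_YEAR)
    repair_rate = 1.0 / 4.0
    return cut_rate, amplifier_rate, repair_rate


if __name__ == "__main__":
    cut, amp, rep = link_rates_per_hour(1000.0)  # hypothetical 1000 km link
    print("cuts per year:", cut * HOURS_PER_YEAR)
    print("amplifier failures per year:", amp * HOURS_PER_YEAR)
    print("failure-to-repair rate ratio:", (cut + amp) / rep)
```

For a link of this length the combined failure rate is roughly three orders of magnitude below the repair rate, which is the regime in which unbiased simulation rarely observes a path failure.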
The ordering of any rerouting of paths is randomized so that paths with a lower index do not have an inherent advantage due to their specific path enumeration. Simulating the large network with DPFS for 10 million state transitions, we obtain the path unavailabilities shown in Fig. 6. The relative 99% confidence interval widths (width/unavailability estimate) for the path unavailability estimates with DPFS are shown in Fig. 7. The total simulation wall clock runtime on a conventional laptop computer was 4234 s. As can be seen from Fig. 7, with DPFS we obtain reasonable confidence interval widths on all of the path unavailabilities even though some of them are very small, of the order of . The largest relative width is 1.88. Simulating the DTMC for 10 million state transitions with the DPFS failure rate biasing turned off, we obtain the relative 99% confidence interval widths shown in Fig. 8. The largest relative width is 0.625, but the 30 zero values plotted in Fig. 8 correspond to paths for which no path failure time was ever observed in the entire simulation run. Hence, without DPFS, we are unable to obtain any estimate for 30 of the path unavailabilities. This demonstrates the effectiveness of the DPFS method in estimating the unavailability of highly available paths in mesh networks with dynamic path restoration.

Fig. 7. Relative 99% confidence interval widths with DPFS.

Fig. 8. Relative 99% confidence interval widths with DPFS failure rate biasing turned off.

IX. CONCLUDING REMARKS

The DPFS simulation technique developed here is a practical and effective method for estimating service availability in mesh networks with dynamic path restoration. It enables one to obtain useful confidence interval widths on path service availabilities in reasonable simulation run times. The developed failure and repair modeling with FEG is sufficiently general so that it can be used to faithfully represent many of the types of failure and repair mechanisms that appear in practice. The assumed path restoration algorithm is sufficiently general to accommodate almost any algorithm, at least ones that return paths to their initial paths once all element repairs have been made.

There are several directions in which the present work can be extended. The formulation of the simulation could be recast in terms of independent replications to accommodate restoration algorithms that do not necessarily return paths to their initial paths. The DPFS method could be modified to turn the failure biasing off only in response to specific path failures. This could be useful when only a specific path, or a subset of paths, is of interest. The DPFS method could also be modified to turn the failure biasing off in response to particular path failures, where the particular path is chosen by cycling through all the paths in the network. Such a stratified DPFS sampling scheme could be beneficial in cases where the path availabilities are highly imbalanced. The formulation of the simulation could also be simplified in the degenerate case where the path restoration method is simply one that uses dedicated or shared failover paths that have been predetermined offline.
The failure and repair modeling in Section IV could also be extended to accommodate dependencies between the FEG queues to model, for example, shared pools of repair personnel. However, with such generalizations, an explicit expression for may not be available and would have to be computed or at least estimated.

APPENDIX

The path availability in the small network example of Section VIII can be derived analytically. Consider the continuous-time Markov chain corresponding to the group process, as shown in Fig. 9. The states (0,0), (0,1), (1,0), and (1,1), are labeled 1, 2, 3, and 4, respectively. Let be the transition probability from state to in the corresponding embedded DTMC. We have , , , , , , , . Let be the mean holding time in state . We have , , , and . Now, define state 1 to be the regenerative state. We now determine analytically the downtime in a recurrence time. There are six state trajectory cases to be considered.

Fig. 9. Continuous-time Markov chain of group process in the small network example.

Case 1) The trajectory is 1 to 3 to 1 and the downtime is 0.
Case 2) The trajectory is 1 to 2 to 1 and the downtime is 0.
Case 3) The trajectory is 1 to 3 to 4, followed by tours from 4 to 3 to 4 and tours from 4 to 2 to 4 (with the combination of tours done in any order) and then, finally, going from 4 to 3 to 1. The downtime is .
Case 4) The trajectory is 1 to 3 to 4, followed by tours from 4 to 3 to 4 and tours from 4 to 2 to 4 (with the combination of tours done in any order) and then, finally, going from 4 to 2 to 1. The downtime is .
Case 5) The trajectory is 1 to 2 to 4, followed by tours from 4 to 3 to 4 and tours from 4 to 2 to 4 (with the combination of tours done in any order) and then, finally, going from 4 to 3 to 1. The downtime is the same as in Case 3.
Case 6) The trajectory is 1 to 2 to 4, followed by tours from 4 to 3 to 4 and tours from 4 to 2 to 4 (with the combination of tours done in any order) and then, finally, going from 4 to 2 to 1. The downtime is the same as in Case 4.

Taking into account these cases, the mean downtime in the recurrence time is given by

The probability of state 1 is , where and . Hence, the path availability in the small example network is given by
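
The regenerative argument of this appendix can be checked numerically. The sketch below is one reading of the small example, not the author's code: the path goes down on first entry to state 4 and stays down until state 1 is reached again, link repairs are assumed independent, and the mean recurrence time of state 1 is taken as its mean holding time divided by its steady-state probability. The names `exact_unavailability` and `dpfs_unavailability`, the rate values, and the direct multiplier `bias` are hypothetical; the paper derives its bias from a target failure rate ratio whose definition is not reproduced here.

```python
import random


def exact_unavailability(lam1, lam2, mu1, mu2):
    """Mean downtime per recurrence of state 1 divided by the mean recurrence time."""
    # Mean holding times in states 1..4 = (0,0), (0,1), (1,0), (1,1).
    h1 = 1.0 / (lam1 + lam2)
    h2 = 1.0 / (lam1 + mu2)   # link 2 failed
    h3 = 1.0 / (lam2 + mu1)   # link 1 failed
    h4 = 1.0 / (mu1 + mu2)
    # Embedded-DTMC transition probabilities used below.
    p12, p13 = lam2 * h1, lam1 * h1
    p24, p34 = lam1 * h2, lam2 * h3
    p42, p43 = mu1 * h4, mu2 * h4
    # Mean time from entering state 4 until the return to state 1.
    t4 = (h4 + p42 * h2 + p43 * h3) / (1.0 - p42 * p24 - p43 * p34)
    mean_downtime = (p12 * p24 + p13 * p34) * t4
    pi1 = (mu1 / (lam1 + mu1)) * (mu2 / (lam2 + mu2))
    return mean_downtime / (h1 / pi1)


def dpfs_unavailability(lam, mu, bias, tours, seed=1):
    """Ratio estimate from biased regenerative tours weighted by the likelihood ratio."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(tours):
        failed = [False, False]
        path_down = False
        beta = bias               # failure biasing is on at the start of each tour
        ratio = 1.0               # likelihood ratio accumulated over the tour
        downtime = cycle = 0.0
        while True:
            true = [mu[i] if failed[i] else lam[i] for i in (0, 1)]
            used = [mu[i] if failed[i] else beta * lam[i] for i in (0, 1)]
            hold = 1.0 / sum(true)        # mean holding time under the true rates
            cycle += hold
            if path_down:
                downtime += hold
            i = 0 if rng.random() * sum(used) < used[0] else 1
            ratio *= (true[i] / sum(true)) / (used[i] / sum(used))
            failed[i] = not failed[i]
            if all(failed):               # path failure observed: turn the bias off
                path_down, beta = True, 1.0
            if not any(failed):           # back in the regenerative state (0,0)
                break
        num += ratio * downtime
        den += ratio * cycle
    return num / den


if __name__ == "__main__":
    exact = exact_unavailability(1e-3, 1e-3, 1.0, 1.0)   # hypothetical rates
    estimate = dpfs_unavailability((1e-3, 1e-3), (1.0, 1.0), bias=500.0, tours=20000)
    print(exact, estimate)
```

With equal failure rates λ and equal repair rates μ, the exact expression reduces to λ²(λ + 3μ)/(λ + μ)³, which gives a quick consistency check on the general formula.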

REFERENCES
[1] J. P. Lang and J. Drake, "Mesh network resiliency using GMPLS," Proc. IEEE, vol. 90, no. 9, pp. 1559–1564, Sep. 2002.
[2] L. Song and B. Mukherjee, "Accumulated-downtime-oriented restoration strategy with service differentiation in survivable WDM mesh networks," IEEE J. Opt. Commun. Netw., vol. 1, no. 1, pp. 113–124, Jun. 2009.
[3] G. Egeland and P. E. Engelstad, "The reliability and availability of wireless backhaul mesh networks," in Proc. IEEE Int. Symp. Wireless Commun. Syst., Oct. 2008, pp. 178–183.


[4] M. M. Campista et al., "Routing metrics and protocols for wireless mesh networks," IEEE Netw., vol. 22, no. 1, pp. 6–12, Jan./Feb. 2008.
[5] I. Pefkianakis, S. H. Y. Wong, and S. Lu, "SAMER: Spectrum aware mesh routing in cognitive radio networks," in Proc. 3rd IEEE DySPAN, Oct. 2008, pp. 1–5.
[6] M. Xia, J. H. Choi, and T. Wang, "Risk assessment in SLA-based WDM backbone networks," in Proc. OFC, Mar. 2009, pp. 1–3.
[7] H. Naser and H. T. Mouftah, "Availability analysis and simulation of mesh restoration networks," in Proc. 9th ISCC, 2004, vol. 2, pp. 779–785.
[8] L. Song and B. Mukherjee, "New approaches for dynamic routing with availability guarantee for differentiated services in survivable mesh networks: The roles of primary-backup link sharing and multiple backup paths," in Proc. IEEE Globecom, 2006, OPN07-6.
[9] L. Song, J. Zhang, and B. Mukherjee, "Dynamic provisioning with availability guarantee for differentiated services in survivable mesh networks," IEEE J. Sel. Areas Commun., vol. 25, no. 3, pt. supplement, pp. 35–43, Apr. 2007.
[10] L. Song and B. Mukherjee, "On the study of multiple backups and primary-backup link sharing for dynamic service provisioning in survivable WDM mesh networks," IEEE J. Sel. Areas Commun., vol. 26, no. 6, pt. supplement, pp. 84–91, Aug. 2008.
[11] J. Zhang, K. Zhu, H. Zang, N. S. Matloff, and B. Mukherjee, "Availability-aware provisioning strategies for differentiated protection services in wavelength-convertible WDM mesh networks," IEEE/ACM Trans. Netw., vol. 15, no. 5, pp. 1177–1190, Oct. 2007.
[12] L. Zhou, M. Held, and U. Sennhauser, "Connection availability analysis of shared backup path-protected mesh networks," J. Lightw. Technol., vol. 25, no. 5, pp. 1111–1119, May 2007.
[13] Z. Pandi, M. Tacca, A. Fumagalli, and L. Wosinska, "Dynamic provisioning of availability-constrained optical circuits in the presence of optical node failures," J. Lightw. Technol., vol. 24, no. 9, pp. 3268–3279, Sep. 2006.
[14] D. Papadimitriou, F. Poppe, J. Jones, S. Venkatachalam, S. Dharanikota, R. Jain, R. Hartani, D. Griffith, and Y. Xue, "Inference of Shared Risk Link Groups," IETF, draft, 2001 [Online]. Available: draft-many-inference-srlg-02
[15] J. Strand, A. L. Chiu, and R. Tkach, "Issues for routing in the optical layer," IEEE Commun. Mag., vol. 39, no. 2, pp. 81–87, Feb. 2001.
[16] D. Xu, Y. Xiong, and C. Qiao, "Failure protection in layered networks with shared risk link groups," IEEE Netw., vol. 18, no. 3, pp. 36–41, May–Jun. 2004.
[17] A. Goyal, P. Shanhabuddin, P. Heidelberger, V. F. Nicola, and P. W. Glynn, "A unified framework for simulating Markovian models of highly dependable systems," IEEE Trans. Comput., vol. 41, no. 1, pp. 36–51, Jan. 1992.
[18] L. L. H. Andrew, "Fast simulation of wavelength continuous WDM networks," IEEE/ACM Trans. Netw., vol. 12, no. 4, pp. 759–765, Aug. 2004.
[19] J. K. Townsend, Z. Haraszti, J. A. Freebersyser, and M. Devetsikiotis, "Simulation of rare events in communications networks," IEEE Commun. Mag., vol. 36, no. 8, pp. 36–41, Aug. 1998.

[20] G. L. J. Yates, R. Doverspike, and D. Wang, "Experiments in fast restoration using GMPLS in optical/electronic mesh networks," in Proc. OFC, Mar. 2001, vol. 4, pp. PD34-1–PD34-3.
[21] H. C. Cankaya, A. Lardies, and G. W. Ester, "Availability aware cost modeling of mesh architectures for long-haul networks," in Proc. 9th ISCC, 2004, vol. 2, pp. 766–771.
[22] M. To and P. Neusy, "Unavailability analysis of long-haul networks," IEEE J. Sel. Areas Commun., vol. 12, no. 1, pp. 100–109, Jan. 1994.
[23] L. Kleinrock, Queueing Systems. New York: Wiley, 1975, vol. 1, Theory.
[24] M. A. Crane and D. L. Iglehart, "Simulating stable stochastic systems, III: Regenerative processes and discrete event simulations," Oper. Res., vol. 23, no. 1, pp. 33–45, 1975.
[25] J. M. Hammersley and D. C. Handscomb, Monte Carlo Methods. London, U.K.: Methuen, 1964.
[26] "OSPF version 2," IETF, RFC 2328, 1998.
[27] C. Fraleigh, S. Moon, B. Lyles, C. Cotton, M. Khan, D. Moll, R. Rockell, T. Seely, and C. Diot, "Packet-level traffic measurements from the Sprint IP backbone," IEEE Netw., vol. 1, no. 6, pp. 6–16, Nov.–Dec. 2003.

Adrian E. Conway (SM'94) received the B.A.Sc. degree in electrical engineering from the University of Ottawa, Ottawa, ON, Canada, in 1981; the M.Sc. and D.I.C. degrees in communications engineering from Imperial College, University of London, London, England, in 1983; and the Ph.D. degree in electrical engineering from the University of Ottawa in 1986. In 1986, he was a Post-Doctoral Visitor with the IBM T. J. Watson Research Center, Hawthorne, NY. From 1986 to 1987, he was a Visiting Assistant Professor with the Department of Electrical Engineering, McGill University, Montréal, QC, Canada. From 1987 to 1996, he was a Principal Member of Technical Staff with GTE Laboratories Incorporated, Waltham, MA. In 1995, he was a Visiting Professor with the Computer Science Department, Université Pierre et Marie Curie, Paris, France, on sabbatical leave from GTE. In 1996, he was also a Lecturer with the Computer Science Department, Boston University, Boston, MA. From 1996 to 1998, he was a Research Staff Member with Racal-Datacom, Fort Lauderdale, FL. At Racal, he also worked jointly with NCC Ltd., Israel, under a BIRD Foundation grant. From 1998 to 2000, he was a Senior Engineer with GTE Internetworking, Waltham, MA. In 2000, he was a Principal Software Engineer with Infolibria, Waltham, MA. Since 2000, he has been a Distinguished Member of Technical Staff with Verizon Laboratories, Waltham, MA. He has also been a Part-Time Lecturer with Northeastern University, Boston, MA, teaching in the Graduate School of Engineering in 2005 and the College of Computer and Information Science from 2007 to 2009. He has over 50 research publications in journals and conference proceedings and seven U.S. Patents. He is an author of the book Queueing Networks - Exact Computational Algorithms (MIT Press, 1989). Dr. Conway is a Member of the Editorial Board of IEEE Network.
