Designing a high availability subnetwork to support availability differentiation

In a recent series of papers [1-3], we studied how to provide both high levels of availability and service differentiation to traffic flows in a cost-efficient manner. The basic idea is to embed, at the physical layer, a high-availability set of links and nodes (termed the spine) in the network topology. The spine supports protection and routing in providing differentiated resilience classes with varying levels of end-to-end availability. In this paper, we present an optimization model of the spine design problem that considers link availability and the cost of upgrading it. Numerical results show the advantages of modifying the availability of a subset of the network topology to provide Quality of Resilience (QoR) classes.


I. INTRODUCTION
The predominant approach to supporting multiple levels of resilience is to assign different protection and restoration schemes, at a given technology layer, to flows according to their Quality of Resilience (QoR) classes [4]. However, such a scenario-based approach may yield a resilience level below what is needed, or an inefficient use of network resources. Availability-based routing was proposed to provision services with explicit availability constraints in the routing algorithm, combined with protection schemes (e.g., alternate path routing with dedicated protection [5] and shared protection [6]). Another line of research achieves differentiation over time in order to minimize both resource usage and the risk of not meeting availability targets [7,8]. The main drawback of this approach is the need to maintain failure status for all flows and to make routing decisions in real time.
In general, existing approaches have several limitations. First, the range of availability classes they offer, and the spacing between those classes, are rather narrow; both need to be extended to cover a wider range of classes. Second, the high availability required by mission-critical services (i.e., four to six 9's) might not be achievable with basic protection schemes such as 1+1 [9]. For example, in [10], the 1+1 recovery scheme of the gold class is insufficient to support extremely high availability levels (e.g., five or six 9's). Possible options to improve availability include a richer dedicated protection configuration (e.g., two disjoint backup paths for each working path) or reserving enough sharable spare capacity to restore traffic after multiple simultaneous failures (e.g., dual-failure shared backup path protection [11]). Both approaches, however, use network resources inefficiently and are also constrained by the diversity of the network: in some cases, new links and possibly nodes must be added to the topology to support additional parallel routes. Yet expanding a nationwide backbone network solely to improve availability is difficult to justify economically.
An alternative way to achieve high availability is to improve the availability of network components. The authors of [12,13] optimize network availability by improving the availability of a subset of physical links via shielding. In addition, Botton et al. [14] study a network design problem in which a subset of edges can be upgraded, at a given cost, to be more reliable. They show that using a set of more reliable edges as a substitute for edge-disjoint path pairs can improve overall resource efficiency. These approaches, however, do not support resilience differentiation. The third limitation of existing approaches concerns their application to layered networks. With few exceptions, most existing approaches suffer from cross-layer mapping issues [15]: without full knowledge of the physical layer and of the mappings between layers, no hard availability guarantees can be provided (e.g., because of fault propagation).
Our approach to providing high and differentiated levels of availability stems from Birnbaum's importance measure. According to this measure, in a parallel configuration, improving the component with the higher availability yields the best overall availability [16]. To illustrate this, assume a flow f is routed over a working path (WP) and a disjoint backup path (BP) with availabilities $A^{WP}_f = 0.99$ and $A^{BP}_f = 0.90$, respectively. The end-to-end availability $A_{e-e}$ of flow f depends on $A^{WP}_f$ and $A^{BP}_f$ and is calculated as a parallel configuration:

$$A_{e-e} = 1 - \left(1 - A^{WP}_f\right)\left(1 - A^{BP}_f\right)$$

Assume we want to strengthen one or both paths by adding $\Delta a$ availability units, with three options: $(A^{WP}_f + \Delta a)$, $(A^{BP}_f + \Delta a)$, or $(A^{WP}_f + \Delta a/2$ and $A^{BP}_f + \Delta a/2)$. Figure 1 plots the end-to-end availability of flow f for the three options. It shows that improving $A^{WP}_f$ alone achieves a better end-to-end availability $A_{e-e}$ than either of the other options. Hence, highly available components forming a working path, combined in parallel with a lower-availability backup path, can achieve better overall availability than improving both paths homogeneously.

Definition. The spine is a substructure with comparatively higher availability embedded into a network at the physical layer to improve the overall network availability without substantial modifications to the topology.
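As a quick numerical check of the three options, the sketch below evaluates the parallel-configuration formula for the example values $A^{WP}_f = 0.99$ and $A^{BP}_f = 0.90$. The step $\Delta a = 0.005$ is a hypothetical choice (the text does not fix one), and the helper `birnbaum_importance` simply evaluates the partial derivative $\partial A_{e-e} / \partial A^{WP}_f = 1 - A^{BP}_f$ (and symmetrically for the BP).

```python
def parallel_availability(a_wp: float, a_bp: float) -> float:
    """End-to-end availability of a WP and a disjoint BP in parallel."""
    return 1.0 - (1.0 - a_wp) * (1.0 - a_bp)


def birnbaum_importance(a_other: float) -> float:
    """Birnbaum importance of one branch of a two-branch parallel system.

    dA/dA_i = 1 - A_other, so the more available branch has the larger
    importance, because its partner is the less available one.
    """
    return 1.0 - a_other


A_WP, A_BP = 0.99, 0.90   # values from the example in the text
DELTA_A = 0.005           # hypothetical number of added availability units

options = {
    "improve WP only": parallel_availability(A_WP + DELTA_A, A_BP),
    "improve BP only": parallel_availability(A_WP, A_BP + DELTA_A),
    "split evenly": parallel_availability(A_WP + DELTA_A / 2, A_BP + DELTA_A / 2),
}

print(f"baseline: A_e-e = {parallel_availability(A_WP, A_BP):.6f}")
for name, a in options.items():
    print(f"{name:>16}: A_e-e = {a:.6f}")
print("importance of WP:", birnbaum_importance(A_BP))  # 1 - A_BP
print("importance of BP:", birnbaum_importance(A_WP))  # 1 - A_WP
```

With these values the WP has ten times the importance of the BP, which is why spending the whole $\Delta a$ on the WP wins.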
Our approach requires designing a network with heterogeneous link availabilities such that a substructure of the network has relatively higher availability values. This high-availability substructure is termed the spine. The spine would connect the nodes whose traffic needs a high level of availability and provide a basis for differentiated resilience classes. For example, traffic of the highest QoR class could be routed on the spine or use the spine as a backup path. The nodes, link interfaces, and links on the spine would have higher availability than the equipment that is not part of the spine. This provides availability differentiation at the physical level, which can be leveraged by restoration techniques, logical virtual topology routing, cross-layer mapping, and other methods to further differentiate resilience classes and provide an extended range of availability guarantees.
In our previous work [1-3], we designed the spine as a subnetwork with the layout of a minimum spanning tree embedded at the physical layer, and used this subnetwork to route connections of the higher-availability QoR classes. We provided heuristics based on graph structural properties to find candidate spines and studied the design performance. We showed numerically that the spine approach widens the availability range of network flows and achieves high availability values that cannot be achieved with standard protection configurations. In [2,3], we assumed that all links on the spine have the same availability ($a_S$) and, similarly, all links off the spine have the same availability ($a_O$), with $a_O < a_S$. Our sensitivity analysis in [3] shows that either modifying the availability improvement step $\Delta$ or considering heterogeneous link availabilities changes the ranking of the best spines with respect to the availability metrics considered. In follow-on work, we showed that with spine-aware cross-layer routing, the spine concept provides availability differentiation in multilayer networks with upper-layer [17] or lower-layer restoration [18]. In this paper, we revisit the spine concept and formulate the spine design problem as a mixed integer linear programming (MILP) problem that considers the cost of upgrading link availability. For simplicity, we consider the uncapacitated network case, noting that our previous work [17,18] showed that the spine approach requires only modest capacity increases, with the amount depending on the percentage of highest-QoR-class traffic.
The remainder of the paper is organized as follows.In Section II, we provide some background on the spine and describe our model.In Section III, we present the spine design optimization problem.In Section IV, we conduct a numerical study to evaluate our model and show a sample of the results.We conclude our paper in Section V.

II. THE SPINE MODEL
Here we adopt an optimization approach to determining the spine that takes into account factors such as the possibility of improving links and a target level of availability. The problem can be stated as follows. Given a physical network $G_P = (V_P, E_P)$ with a set of nodes $V_P$ and a set of physical links $E_P$, a set $K$ of improvement options for each link with associated cost $c^k_{ij}$, $(i,j) \in E_P$, $k \in K$, and a set of end-to-end connections/flows $(s,t) \in F$ that need high availability, determine the subnetwork forming the spine, $G_S = (V_S, E_S)$ with $G_S \subset G_P$, that minimizes the total cost while achieving an availability target.

A. Incremental Link Availability Model
For a given network, each link is assigned an initial availability value $a_{ij}$ based on its length, with longer links being less reliable. Specifically, we use the distance-based link availability formula of [19]. The link availability is calculated as $a_{ij} = a_{c_{ij}} \times a_{t_{ij}}$, where $a_{t_{ij}}$ is the product of the availabilities of the cable-end equipment (e.g., OXCs, ROADMs), and $a_{c_{ij}}$ is the fiber cable availability, computed as in [19] from the cable cut rate CC, the mean time between failures (MTBF), and the mean time to repair (MTTR), with MTBF and MTTR in hours.
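Because the exact cable formula of [19] is not reproduced here, the following sketch only illustrates the structure $a_{ij} = a_{c_{ij}} \times a_{t_{ij}}$ under a loudly labeled assumption: the number of cable cuts per year is taken to grow linearly with length (a hypothetical rate per 1000 km), and cable availability is the steady-state ratio MTBF/(MTBF + MTTR). The cut rate, MTTR, and equipment availabilities are illustrative values, not those used in the study.

```python
import math

HOURS_PER_YEAR = 8760.0


def cable_availability(length_km: float, cuts_per_1000km_year: float, mttr_h: float) -> float:
    """Steady-state cable availability, MTBF / (MTBF + MTTR).

    Assumption (illustrative, not the exact formula of [19]): the number of
    cable cuts per year grows linearly with the cable length.
    """
    cuts_per_year = cuts_per_1000km_year * length_km / 1000.0
    mtbf_h = HOURS_PER_YEAR / cuts_per_year
    return mtbf_h / (mtbf_h + mttr_h)


def link_availability(length_km: float, terminal_avails: list,
                      cuts_per_1000km_year: float = 4.0, mttr_h: float = 12.0) -> float:
    """a_ij = a_c * a_t, with a_t the product of the cable-end equipment availabilities."""
    a_c = cable_availability(length_km, cuts_per_1000km_year, mttr_h)
    a_t = math.prod(terminal_avails)
    return a_c * a_t


# Hypothetical parameters: two cable-end devices with availability 0.99999 each.
for length in (100.0, 300.0, 600.0):
    a = link_availability(length, [0.99999, 0.99999])
    print(f"{length:5.0f} km: a_ij = {a:.6f}, downtime = {(1 - a) * HOURS_PER_YEAR:.2f} h/year")
```

The only property the model relies on is visible here: longer links start with lower availability, which creates the heterogeneity the spine design exploits.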
We study the scenario in which each link in a network can be purposely strengthened, either by increasing its MTBF, for example by altering the cable deployment method (e.g., burying an aerial cable) [20,21] or adding physical protection [12,13], or by reducing its MTTR through focused maintenance and repair efforts [20,22]. For each link, one can collect the possible options to improve its availability; each option results in a different availability level and cost. Specifically, if link e spans node pair (i,j) and has availability $a_{ij}$, then using method k its availability can be improved to $a^k_{ij}$ at cost $c^k_{ij}$, whereas using method k+1, which costs $c^{k+1}_{ij}$, its availability is improved to $a^{k+1}_{ij}$. We assume K possible availability values ($a^k_{ij}$, $k = 1, 2, \dots, K$) with $a^1_{ij} = a_{ij}$. For each value k, the corresponding unavailability is reduced by a fixed step. Reducing a link's unavailability is equivalent to reducing its expected downtime. Note that, in reality, the K options might not have fixed downtime differences, either within or across links; here, we choose a fixed step for illustration purposes. In practice, links will likely not all have the same number of options, as this depends on several factors (e.g., the terrain, cable type, and associated cost); we assume that they do in order to simplify the model. The cost associated with each improvement step k is calculated by a cost function. A precise formula for the cost of availability is difficult to determine in practice. Instead, many researchers rely on known mathematical models (e.g., constant, linear, quadratic) to relate cost to availability [23]. Note also that there are diminishing returns, in that the cost of improving a link grows as its availability gets higher [22]. For example, improving an already highly available link (e.g., from 0.999 to 0.9999) typically costs more than improving a link with moderate availability (e.g., from 0.9 to 0.9009) by the same amount.

We consider the following cost functions, $f_c$, to compute the cost of improving link availability per unit of length. The first cost function, $f_{c1}$, is a polynomial in the availability improvement with exponent α; it captures the idea that the greater the improvement in availability, the larger the cost. The second cost function, $f_{c2}$, is a polynomial in the availability improvement $\Delta a^k_{ij} = a^k_{ij} - a^1_{ij}$, but is also weighted by the unavailability of the link; hence, for equal $\Delta a^k_{ij}$, it compounds the cost for the link with higher availability. This formula is very similar to $f_2$ in [24]. The third cost function, $f_{c3}$, is derived from $f_1$ in [24] and models the notion that the impact of cost on the improved availability decreases exponentially.
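The diminishing-returns argument can be made concrete: both upgrades in the example add 0.0009 availability units and save the same expected downtime (about 7.9 hours/year), but the first cuts the link's unavailability by less than 1% while the second must cut it tenfold. The sketch computes these quantities and, purely for illustration, a hypothetical unavailability-weighted quadratic cost (α = 2); that cost form is an assumption and is not the paper's $f_{c2}$.

```python
HOURS_PER_YEAR = 8760.0


def downtime_h(a: float) -> float:
    """Expected downtime in hours per year for availability a."""
    return (1.0 - a) * HOURS_PER_YEAR


def hypothetical_weighted_cost(a_before: float, a_after: float, alpha: float = 2.0) -> float:
    """Illustrative cost per unit length: improvement divided by the remaining
    unavailability, raised to alpha. This is an assumed form, NOT the paper's f_c2."""
    return ((a_after - a_before) / (1.0 - a_after)) ** alpha


for before, after in [(0.9, 0.9009), (0.999, 0.9999)]:
    saved = downtime_h(before) - downtime_h(after)
    reduction = (1.0 - before) / (1.0 - after)  # factor by which unavailability shrinks
    cost = hypothetical_weighted_cost(before, after)
    print(f"{before} -> {after}: saves {saved:.3f} h/year, "
          f"unavailability divided by {reduction:.3f}, hypothetical cost {cost:.4g}")
```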
The length factor is included in the upgrade cost by scaling the per-unit-length cost given by $f_c$ by $d_{ij}$, the length of link (i,j). The exponent α in (3) and (4) was set to 2 to impose quadratic growth of the cost. Figure 2 shows the Polska network topology and the availability options for three different links with K = 7. Each table in the figure shows the availability levels of a link and the corresponding cost under the different cost functions. In the figure, k = 1 denotes the initial link availability. The case k = 2 models transferring maintenance capabilities between links. Therefore, option k = 2 has a lower availability than the initial value ($a^2_{ij} < a^1_{ij}$) and a negative cost ($c^2_{ij} < 0$): the expected downtime of a link with k = 2 (i.e., an off-spine link) increases, and its negative cost reduces the total cost C. In turn, the availability of other links (i.e., on the spine) can be improved by relocating maintenance and repair capabilities, taking advantage of the operational expenditure transferred from the degraded links.

Parameters:
$H_G$: total number of (undirected) links required by the shortest (min-hop) path pairs.

Variables:
$x_{ij}$: binary variable indicating whether link (i,j) is selected on the spine ($x_{ij} = 1$) or not ($x_{ij} = 0$).
$x^{st}_{ij}$ ($y^{st}_{ij}$): binary variable denoting whether physical link (i,j) is used to route the WP (BP) of connection (s,t).
$r^k_{ij}$: binary variable indicating whether method k is used for link (i,j).
$p^{st}_{ij}$ ($q^{st}_{ij}$): continuous variable denoting the unavailability of link (i,j) given that it is on the WP (BP) of connection (s,t).

B. Optimization Model Formulation
The spine design problem is to find the best combination of links forming the spine and to select the improvement option for every link so as to achieve a target end-to-end availability at minimum cost. We route the WP of every high-QoR flow on the spine and protect it with a fully link-disjoint BP. This ensures that all high-QoR-priority traffic supported by the spine can be given 1+1 dedicated protection. Note that this also enables 1:N shared protection; however, this topic is left for future work. Furthermore, we assume that the QoR class needing high availability has traffic between all node pairs (e.g., a full mesh of one-unit demands between every node pair); hence, we adopt a minimum spanning tree (MST) structure for the spine. The availability of high-QoR-class flows is constrained to be greater than or equal to target values ($a_{wp}$, $a_{bp}$). Instead of looking for the spine that maximizes the average availability, here we require the minimum WP availability on the spine to be greater than or equal to a target availability $a_{wp}$. As shown earlier, increasing the availability of the working path improves the end-to-end availability more effectively than improving all links at once. Finally, the objective C of the design problem is to minimize the total cost of embedding the spine and improving flow availabilities to reach or exceed the target values.
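Given the notation above, the spine design problem can be formulated as follows:

$$\min\; C = \sum_{(i,j) \in E_P,\; i<j}\; \sum_{k \in K} c^k_{ij}\, r^k_{ij}$$

s.t.

WP and BP computation:
$$\sum_{j:(i,j) \in E_P} x^{st}_{ij} - \sum_{j:(j,i) \in E_P} x^{st}_{ji} = \begin{cases} 1 & \text{if } i = s \\ -1 & \text{if } i = t \\ 0 & \text{otherwise} \end{cases} \quad \forall (s,t) \in F,\ i \in V_P \tag{8}$$
$$\sum_{j:(i,j) \in E_P} y^{st}_{ij} - \sum_{j:(j,i) \in E_P} y^{st}_{ji} = \begin{cases} 1 & \text{if } i = s \\ -1 & \text{if } i = t \\ 0 & \text{otherwise} \end{cases} \quad \forall (s,t) \in F,\ i \in V_P \tag{9}$$

Hop-count constraint:
$$H_S = \sum_{(s,t) \in F}\ \sum_{(i,j) \in E_P} \left(x^{st}_{ij} + y^{st}_{ij}\right) \le \delta\, H_G \tag{14}$$

MST formation:
$$x^{st}_{ij} \le x_{ij} \quad \forall (s,t) \in F,\ (i,j) \in E_P \tag{15}$$
$$x_{ij} = x_{ji} \quad \forall (i,j) \in E_P \tag{16}$$
$$\sum_{(i,j) \in E_P,\; i<j} x_{ij} = |V_P| - 1 \tag{17}$$

Availability constraints:
$$\sum_{k \in K} r^k_{ij} = 1 \quad \forall (i,j) \in E_P \tag{18}$$
$$r^k_{ij} = r^k_{ji} \quad \forall (i,j) \in E_P,\ k \in K \tag{19}$$
$$p^{st}_{ij} = x^{st}_{ij} \sum_{k \in K} \left(1 - a^k_{ij}\right) r^k_{ij} \quad \forall (s,t) \in F,\ (i,j) \in E_P \tag{20}$$
$$q^{st}_{ij} = y^{st}_{ij} \sum_{k \in K} \left(1 - a^k_{ij}\right) r^k_{ij} \quad \forall (s,t) \in F,\ (i,j) \in E_P \tag{21}$$

Flow availability targets:
$$A^{WP}_{st} = 1 - \sum_{(i,j) \in E_P} p^{st}_{ij} \ge a_{wp} \quad \forall (s,t) \in F \tag{22}$$
$$A^{BP}_{st} = 1 - \sum_{(i,j) \in E_P} q^{st}_{ij} \ge a_{bp} \quad \forall (s,t) \in F \tag{23}$$

Variables:
$$x_{ij},\ x^{st}_{ij},\ y^{st}_{ij},\ r^k_{ij} \in \{0, 1\} \tag{24}$$
$$p^{st}_{ij},\ q^{st}_{ij} \ge 0 \tag{25}$$

The core of the formulation is the flow conservation constraints (8) and (9), which find primary and backup paths for all flows. A flow conservation constraint pushes one unit of demand along a path between the two end nodes of a given flow. Constraint sets (10)-(12) ensure loop-free routing. Constraint set (13) ensures that, for each flow, the primary and backup paths are fully link-disjoint. The total hop count of all path pairs, $H_S$, is bounded by (14), where δ is a scaling factor and $H_G$ is the total hop count of the shortest (min-hop) path pairs between all node pairs in the network.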
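A small checker makes the roles of these constraint families concrete. It is a hypothetical helper operating on explicit arc lists rather than on solver variables: `conservation_ok` tests the node balance imposed by (8) and (9), and `fully_link_disjoint` tests link-disjointness in the undirected sense (our reading of "fully link-disjoint" for an undirected network). The last example shows why the loop-free constraints are needed: a detached cycle added to a valid path still satisfies flow conservation.

```python
from collections import defaultdict


def conservation_ok(arcs, s, t, nodes):
    """Check flow conservation as in (8)/(9): net outflow +1 at s, -1 at t, 0 elsewhere."""
    net = defaultdict(int)
    for i, j in arcs:
        net[i] += 1   # arc leaves i
        net[j] -= 1   # arc enters j
    return all(net[v] == (1 if v == s else -1 if v == t else 0) for v in nodes)


def fully_link_disjoint(wp_arcs, bp_arcs):
    """True if the BP uses no undirected link of the WP, in either direction."""
    wp_links = {frozenset(arc) for arc in wp_arcs}
    return all(frozenset(arc) not in wp_links for arc in bp_arcs)


# Hypothetical 5-node example for flow (s, t) = (1, 3).
NODES = [1, 2, 3, 4, 5]
wp = [(1, 2), (2, 3)]
bp = [(1, 4), (4, 3)]
bad_bp = [(1, 2), (2, 4), (4, 3)]        # reuses link {1, 2}
wp_with_cycle = wp + [(4, 5), (5, 4)]    # conserves flow but contains a loop

print(conservation_ok(wp, 1, 3, NODES), conservation_ok(bp, 1, 3, NODES))
print(fully_link_disjoint(wp, bp), fully_link_disjoint(wp, bad_bp))
print(conservation_ok(wp_with_cycle, 1, 3, NODES))
```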
Each link used by the primary path of any flow is considered an on-spine link. Constraint (15) enforces this by setting the spine link selector variable $x_{ij}$ of a link to 1 if the link is used by the primary path of at least one flow. Because different flows (s,t) traverse links in different directions, (15) may set $x_{ij}$, $x_{ji}$, or both to 1; since the network is undirected, constraint (16) is required to make them equal. Constraint (17) then limits the number of (undirected) links selected for the spine to $|V_P| - 1$, the number of links in a spanning tree. Next, among the availability constraints, constraint set (18) ensures that exactly one improvement method is selected for each link, and constraint (19) requires that a link has the same improvement method in both directions. Constraint sets (20) and (21) relate the unavailability of a flow's WP and BP to the unavailability of each link along the path: variable $p^{st}_{ij}$ ($q^{st}_{ij}$) takes the link's unavailability value only if the WP (BP) of flow (s,t) is routed through link (i,j), and is zero otherwise. These two constraint sets turn the optimization problem into an integer nonlinear programming (INLP) model because of the products of two variables, namely $x^{st}_{ij}$ with $r^k_{ij}$ in (20) and $y^{st}_{ij}$ with $r^k_{ij}$ in (21). Note that the availability of a single path can be computed by multiplying the availabilities of the links along it, but this introduces a further nonlinearity. Instead, we use the approximate unavailability formula for a system connected in series, $u^{st} \approx \sum_{(i,j)} u^{st}_{ij}$. Hence, the WP availability is computed as $1 - \sum_{(i,j)} p^{st}_{ij}$, and the BP availability is computed in the same way from $q^{st}_{ij}$. Constraints (22) and (23) require that a flow's WP and BP availabilities be at least the target values $a_{wp}$ and $a_{bp}$, respectively. Lastly, constraint sets (24) and (25) declare the binary and continuous variables.
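The two modeling devices described above can be checked numerically. The sketch compares the exact series availability of a path (the product of link availabilities) with the first-order approximation $1 - \sum u_{ij}$, which never overestimates the exact value. It then verifies that a standard linearization of the product $p = x \cdot u$ of a binary $x$ and a quantity $0 \le u \le 1$ (here $u$ plays the role of $\sum_k (1 - a^k_{ij}) r^k_{ij}$), namely $p \ge 0$, $p \ge u - (1 - x)$, $p \le u$, $p \le x$, pins $p$ to exactly $x u$. This is a generic textbook linearization shown for illustration; the link availabilities are hypothetical, and the inequalities are not necessarily the exact form of the constraints that replace (20) and (21).

```python
import itertools
import math


def exact_path_availability(link_avails):
    """Series system: the path is up only if every link on it is up."""
    return math.prod(link_avails)


def approx_path_availability(link_avails):
    """First-order approximation: path unavailability ~ sum of link unavailabilities."""
    return 1.0 - sum(1.0 - a for a in link_avails)


def linearized_interval(x, u):
    """Feasible range of p under p >= 0, p >= u - (1 - x), p <= u, p <= x."""
    lower = max(0.0, u - (1 - x))
    upper = min(u, float(x))
    return lower, upper


path = [0.9995, 0.999, 0.9998, 0.9992]  # hypothetical link availabilities
exact = exact_path_availability(path)
approx = approx_path_availability(path)
print(f"exact {exact:.8f}, approximate {approx:.8f}, gap {exact - approx:.2e}")

# For binary x and u in [0, 1], the four inequalities leave exactly one value: p = x * u.
for x, u in itertools.product([0, 1], [0.0, 0.0005, 0.002, 1.0]):
    lower, upper = linearized_interval(x, u)
    assert math.isclose(lower, x * u) and math.isclose(upper, x * u)
print("linearization reproduces p = x * u on all test points")
```

Because the approximation is conservative, a WP that satisfies $1 - \sum p^{st}_{ij} \ge a_{wp}$ also meets the target under the exact product formula.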
To remove the nonlinearity of the INLP, constraint set (20) can be replaced with constraint sets (26)-(28), which together provide the same function as (20). Similarly, constraint set (21), which computes the BP unavailability, can be replaced with constraint sets (29)-(31).

IV. NUMERICAL STUDY
We studied the Polska, Spain, and Italia14 network topologies drawn from the published literature (see [25]), but due to space limitations we report results only for the Polska network. We set K = 7, and δ in constraint (14) is set to 110%, allowing at most a 10% increase in total resources over those required by the shortest path pairs, $H_G$ (i.e., $H_S \le 1.1 H_G$). We solved the spine design optimization problem for the test networks using AMPL/Gurobi. The availability target for the WP, $a_{wp}$ in (22), is set to $a_{wp} \in \{0.99, 0.995, 0.996, 0.9964\}$, denoted $a_{wp1}$ to $a_{wp4}$. For each $a_{wp}$ setting, we solved the model for the three cost functions. The solution time to reach a 0% integrality gap ranged from 11 to 687 seconds. Here we ignore the $a_{bp}$ constraint. Note that increasing the target value $a_{wp}$ increases the total cost of the design and may also result in a different spine layout.
Figure 3 shows the spine layouts obtained for the Polska network as the target $a_{wp}$ increases. The spine topology varies slightly as the target availability $a_{wp}$ or the cost function changes. However, a persistent substructure appears in almost all of the spines, e.g., the star-like substructure rooted at node 3. Table I lists the graph-theoretic structural properties of the spines: $eb_S$, the average edge betweenness centrality of the spine; $ed_S$, the average edge degree of the spine; $h_S$, the average shortest path length on the spine; and $di_S$, the diameter of the spine. The first row shows the corresponding measures for the full physical-layer graph $G_P$ with no spine. Observe that the spines tend to have a comparatively small edge betweenness $eb_S$ and average shortest path length $h_S$, and a large edge degree $ed_S$. Only in a few cases does a measure match the minimum (or maximum) value found by generating all MSTs of the network, as given in [3]. However, these results are consistent with the findings of the heuristic algorithm (reported in [2,3]) with respect to the spine that maximizes the average availability. Moreover, while the spine layout might be attributed to the structural importance of the links and nodes, it is also shaped by the cost associated with the links and their availability, as well as by the hop-count constraint.
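Three of the Table I measures are easy to reproduce for a tree-shaped spine. The sketch below is a hypothetical helper (not the code used to produce Table I) that computes the average shortest path length $h_S$, the diameter $di_S$, and the average edge betweenness $eb_S$, using the tree property that an edge separating $k$ nodes from the other $n - k$ lies on the unique path of exactly $k(n - k)$ node pairs. Betweenness is left unnormalized here, which may differ from the normalization behind Table I, and $ed_S$ is omitted because its exact definition is not given in this section.

```python
from collections import deque
from itertools import combinations


def bfs_hops(adj, src):
    """Hop distance from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def side_size(adj, start, cut):
    """Number of nodes reachable from start once the undirected edge `cut` is removed."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen and {u, v} != set(cut):
                seen.add(v)
                stack.append(v)
    return len(seen)


def spine_metrics(nodes, edges):
    """Return (eb_S, h_S, di_S) for a spine that is a spanning tree."""
    assert len(edges) == len(nodes) - 1, "expects a tree (|V| - 1 edges)"
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    hops = {n: bfs_hops(adj, n) for n in nodes}
    pair_hops = [hops[u][v] for u, v in combinations(nodes, 2)]
    h_s = sum(pair_hops) / len(pair_hops)  # average shortest path length
    di_s = max(pair_hops)                  # diameter
    n = len(nodes)
    eb = []
    for u, v in edges:
        k = side_size(adj, u, (u, v))      # nodes on u's side of the edge
        eb.append(k * (n - k))             # pairs whose unique path uses the edge
    eb_s = sum(eb) / len(eb)               # average (unnormalized) edge betweenness
    return eb_s, h_s, di_s


star = spine_metrics([1, 2, 3, 4, 5], [(3, 1), (3, 2), (3, 4), (3, 5)])
path = spine_metrics([1, 2, 3, 4, 5], [(1, 2), (2, 3), (3, 4), (4, 5)])
print("star rooted at node 3:", star)
print("path on the same nodes:", path)
```

On the same five nodes, the star has smaller $eb_S$, $h_S$, and $di_S$ than the path, in line with the star-like substructures observed in the optimal spines.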
Spines with similar layouts, within and across cost functions, can nonetheless have different link availabilities and improvement types. This is illustrated by Figure 4, which shows the downtime per year of each link versus its length, together with the improvement method/type k selected for each link, for the cases of Figure 3. In the figures, each circle represents a link, and the number inside the circle is the improvement method/type k. Red circles represent spine links and blue circles off-spine links. For example, the spines obtained with $a_{wp4} = 0.9964$ for cost functions $f_{c1}$ and $f_{c2}$ have identical layouts, as shown in Figures 3d and 3h, but the improvement methods k of their links differ, as shown in Figures 4d and 4h. However, at $a_{wp4}$ the spine and the selected methods are identical for $f_{c2}$ and $f_{c3}$, whereas they were completely different for $a_{wp1}$, as shown in Figures 3e and 3i. One can also see that, for the same cost function, different link improvement methods k can be selected as the WP availability target $a_{wp}$ increases. For example, the first three spines obtained for the Polska network with cost function $f_{c1}$, shown in Figures 3a to 3c, have different link improvement assignments as $a_{wp}$ changes, as shown in the downtime and availability assignments of Figures 4a to 4c. Initially, shorter links (i.e., with higher availability and lower improvement cost) are favored as spine links, thus exploiting the existing heterogeneity. As the availability target increases, more expensive links are selected to meet the more stringent requirement. For example, consider how the spine layout changes from that of Figure 3a with $a_{wp1}$ to that of Figure 3d, which achieves $a_{wp4}$. Note that link (2,3) (the third longest link) is selected to be on the spine despite its high cost. Table I also shows that the spine for $a_{wp4}$ has better structural measures (e.g., a smaller $di_S$) than the spine for $a_{wp1}$. The results for the Spain and Italia14 networks show similar behavior. Notice that the off-spine links are assigned type k = 2, which is less reliable but frees budget to upgrade the spine links from k = 1 to higher-quality types (i.e., k > 2).

In addition to the structural properties, we compare the average expected flow downtime $d_S = (1 - A_S) \times 8760$ hours/year and the average expected WP downtime $d_{WP} = (1 - A^{WP}_S) \times 8760$ hours/year for the different scenarios and cost functions. We also include the corresponding downtime of an equivalent network with no spine that uses link-disjoint path pairs. In the no-spine case, all links are improved using the same method k, and the total cost C is computed accordingly. Figure 5 shows the average expected WP and end-to-end flow downtimes for the Polska network for the different cost functions and values of $a_{wp}$. Results are shown for three cases: the no-spine baseline (dotted line); the case in which the MTTR of off-spine links may be relaxed; and the case in which relaxing the MTTR of off-spine links is forbidden. Several observations can be made from the figures. First, compared with the no-spine model, the downtime improves significantly with cost function $f_{c1}$, slightly with $f_{c2}$, and not at all with $f_{c3}$. Second, comparing the relaxable and non-relaxable MTTR cases shows a significant cost saving when the MTTR of off-spine links is relaxed. The results for the Spain network were similar to those for the Polska network. However, the denser Italia14 network achieves lower downtime in all cases and across all cost functions; these results are not shown due to space limitations.
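To put the WP targets in the downtime units used in Figure 5, the conversion $d = (1 - A) \times 8760$ hours/year can be applied to each $a_{wp}$. Since $a_{wp}$ is a lower bound on the availability of every spine WP (as computed by the model's series approximation), the result is an upper bound on each WP's expected downtime, not the averages plotted in the figure. The labels `a_wp1` to `a_wp4` follow the ordering of the targets in the text.

```python
HOURS_PER_YEAR = 8760.0

# WP availability targets used in the numerical study, in increasing order.
A_WP_TARGETS = {"a_wp1": 0.99, "a_wp2": 0.995, "a_wp3": 0.996, "a_wp4": 0.9964}


def expected_downtime_hours(availability: float) -> float:
    """Expected downtime per year, d = (1 - A) * 8760."""
    return (1.0 - availability) * HOURS_PER_YEAR


bounds = {label: expected_downtime_hours(a) for label, a in A_WP_TARGETS.items()}
for label, hours in bounds.items():
    print(f"{label} = {A_WP_TARGETS[label]}: WP downtime at most {hours:.2f} h/year")
```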
Recall that the spine concept aims to create different levels of availability while also meeting the most stringent availability requirements. Figure 6 shows the expected downtime of each path type for the optimal spines obtained for the Polska network, as a box plot for each scenario. The upper and lower edges of each box represent the third and first quartiles of the values, respectively; the middle bar (in red) represents the median; and the upper and lower whiskers represent the maximum and minimum downtime values across all paths, respectively. Note that even for the lowest-cost spine (i.e., $a_{wp1}$ with relaxable MTTR), three different availability classes result from using only one protection scheme. The lowest class can be given an unprotected path equivalent to the backup path, with a large expected downtime. The middle class is routed on an unprotected path on the spine, which achieves a shorter expected downtime than the lowest class. The highest class is routed on the spine and protected by a link-disjoint backup path, and its expected downtime is minimal. The graph also shows that the target availability controls the downtime of the highest class: because the WP of this class is routed on the spine, its downtime decreases as the target availability increases. The spacing between the availability levels is mainly determined by the range of link availabilities (initial and improved) and the WP target availability.
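The three classes can be illustrated with a single hypothetical node pair: one spine WP and one link-disjoint off-spine BP yield three distinct downtime levels with a single protection scheme. The availabilities below are assumptions chosen for illustration (the WP is set exactly at the $a_{wp4}$ target), not values read from Figure 6; the protected class uses the parallel-configuration availability $1 - (1 - A^{WP})(1 - A^{BP})$.

```python
HOURS_PER_YEAR = 8760.0

# Hypothetical path availabilities for one node pair (illustrative only).
A_SPINE_WP = 0.9964   # WP on the spine, at the a_wp4 target
A_OFF_BP = 0.98       # link-disjoint backup path using off-spine links


def downtime(a: float) -> float:
    """Expected downtime in hours per year."""
    return (1.0 - a) * HOURS_PER_YEAR


classes = {
    "low: unprotected off-spine path": A_OFF_BP,
    "middle: unprotected spine path": A_SPINE_WP,
    "high: spine WP + disjoint BP (1+1)": 1.0 - (1.0 - A_SPINE_WP) * (1.0 - A_OFF_BP),
}
for name, a in classes.items():
    print(f"{name:<36} A = {a:.6f}  downtime = {downtime(a):8.3f} h/year")
```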

V. CONCLUSIONS
In this paper, we revisited the spine concept of embedding a subgraph with higher availability in a network, together with protection mechanisms, to improve the overall end-to-end availability. We provided an optimization-based formulation for designing the spine that takes into account that link availability can be upgraded at a given cost. The design problem exploits existing link availability heterogeneity and the upgradeability of links to achieve a target flow availability while minimizing the total cost. Our results demonstrate the efficiency of the spine model in terms of average flow availability and its potential advantage over the shortest-path model with no spine. This efficiency, however, depends primarily on network density and on the distribution of link improvement costs.

Fig. 1: Improving availability in a parallel configuration

Fig. 5: Average expected WP and end-to-end flow downtime/year versus cost for the Polska network

TABLE I: Structural properties of the spines.