
Ant Colony Optimization and Stochastic Gradient Descent

Nicolas Meuleau and Marco Dorigo
IRIDIA, Université Libre de Bruxelles
Avenue Franklin Roosevelt 50, CP 194/6, B-1050 Brussels, Belgium
nmeuleau@iridia.ulb.ac.be, mdorigo@ulb.ac.be
Technical Report No. TR/IRIDIA/2000-36
December 2000
Abstract

In this paper, we study the relationship between the two techniques known as ant colony optimization (ACO) and stochastic gradient descent (SGD). More precisely, we show that empirically designed ACO algorithms approximate stochastic gradient descent, and we propose provably convergent implementations of stochastic gradient descent that belong to the family of ACO algorithms. We then use this insight to explore the mutual contribution of the two techniques.

1 Introduction
The study of self-organization in social insects as a source of inspiration for novel distributed forms of computation is a promising area of AI known as ant algorithms (or sometimes as swarm intelligence) that is enjoying growing popularity [2, 3, 4, 6]. A particularly successful form of ant algorithms is the one inspired by the ant colonies' foraging behavior. In these algorithms, applied to combinatorial optimization problems, a number of artificial ants are given a small set of simple reactive behaviors that model the behavior of real ants. Artificial ants are then left free to move on an appropriate graph representation of the considered problem: they probabilistically build a solution to the problem and then deposit on the graph some artificial pheromone that will bias the probabilistic solution construction activity of future ants. The amount of pheromone deposited and the way it is used to build solutions are such that the overall search process is biased towards the generation of approximate solutions of improving quality. The historically first example of an algorithm inspired by the ants' foraging behavior was ant system (AS) [8], and its first application was to the Traveling Salesman Problem (TSP), a well-known NP-hard problem [10]. As a follow-up of AS, a number of similar algorithms, each one trying either to improve performance or to make AS better fit a

particular class of problems, were developed. Currently, many successful applications of such algorithms exist for both academic and real-world combinatorial optimization problems; often the AS-based algorithms provide state-of-the-art performance. As a consequence of this success, the ant colony optimization (ACO) metaheuristic was recently defined [7] to put in a common framework all the algorithms that can be considered as offspring of AS. In the following, we first briefly review ant system, using the TSP as example problem, and we show that AS is indeed closely related to the technique known as stochastic gradient descent (SGD), which has been extensively used in machine learning [11]. We then propose an ACO algorithm that converges with probability 1 to a local optimum. In Sec. 3 we generalize the reasoning so that it can be applied to any ACO algorithm for any combinatorial optimization problem. Finally, we discuss how the relationship between ACO and SGD evidenced in this paper can favor cross-fertilization.

2 Ant System as an Instance of Stochastic Gradient Descent


Ant system is a simple distributed algorithm that can be applied to any (constrained) minimum cost path problem on a graph. In this paper we use the asymmetric TSP (ATSP) as an example to present our arguments.

2.1 Ant System


The ATSP can be defined as follows. Let $N = \{1, 2, \ldots, n\}$ be a set of cities and $D = (d_{ij})_{i,j \in N}$ be a distance matrix, with $d_{ij} > 0$ and, in general, $d_{ij} \neq d_{ji}$. We will denote by $\Sigma$ the set of acyclic paths of length $n$ (in terms of the number of cities crossed). The ATSP is the problem¹ of finding a path $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_n) \in \Sigma$ that minimizes the length of the corresponding tour, defined as

$$L(\sigma) = \sum_{t=1}^{n-1} d_{\sigma_t \sigma_{t+1}} + d_{\sigma_n \sigma_1}.$$

The main variables of the AS algorithm are the pheromone trails $\tau_{ij}$ associated with each ordered pair of cities $(i, j)$, $i \neq j$. Let $\tau$ be the bidimensional vector gathering all the $\tau_{ij}$s. The basic principle of AS is to simulate artificial ants that use the pheromone trails to build a random tour. Once its tour is completed, each ant makes a backward trip following the same path and updates the pheromones on its way back. Finally, the pheromone trails partially evaporate, that is, they decrease by a factor $\rho \in (0, 1)$ called the evaporation rate. The behavior of each ant can be summarized as follows:

Forward:

- Draw the start city $\sigma_1$ at random, uniformly over $N$.
- At each step $t < n$, after following the path $\sigma^t = (\sigma_1, \ldots, \sigma_t)$, draw the next city $\sigma_{t+1}$ at random following

$$\Pr(\sigma_{t+1} = j \mid \sigma^t, \tau) = \begin{cases} \dfrac{\tau_{\sigma_t j}}{\sum_{k \notin \sigma^t} \tau_{\sigma_t k}} & \text{if } j \notin \sigma^t, \\[2mm] 0 & \text{otherwise,} \end{cases} \qquad (1)$$

  where $j \in \sigma^t$ means that the acyclic path $\sigma^t$ traverses city $j$.

Backward:

- After generating the path $\sigma = (\sigma_1, \ldots, \sigma_n)$, reinforce the pheromone trails $\tau_{\sigma_t \sigma_{t+1}}$ for each $t \in \{1, \ldots, n-1\}$, and $\tau_{\sigma_n \sigma_1}$, by the amount $1 / L(\sigma)$.

¹ If the distance matrix is symmetric, that is, if $d_{ij} = d_{ji}$ for all $i$ and $j$, then the ATSP reduces to the TSP.

Therefore, the pheromone update rule in an asynchronous implementation of AS is, for each ant and each pair of cities $(i, j)$,

$$\tau_{ij} \leftarrow (1 - \rho)\,\tau_{ij} + \frac{\delta_{ij}(\sigma)}{L(\sigma)},$$

where $\delta_{ij}(\sigma) = 1$ if $j$ is the immediate successor of $i$ in the tour associated to $\sigma$, and 0 otherwise. This simple framework gave rise to a great number of algorithms that have recently been put in a unifying framework called the ACO metaheuristic [7]. ACO is composed of three main procedures. In the first one, artificial ants probabilistically construct feasible solutions to the considered problem by moving on its graph representation. In this phase the construction process can be biased by previous experience memorized in the form of pheromone trails and by heuristic information available about the considered problem (see discussion in Sec. 4.1). The second phase, briefly discussed in Sec. 4.2, is optional: here the solutions generated by the artificial ants can be taken to their local optima by a suitable local search routine. In the last phase, pheromone trails are updated, either by the ants retracing their path backward or by other suitable processes. At the end of this section, we modify the pheromone update rule of AS so that the new algorithm performs stochastic gradient descent. All the forward components of the algorithm are unchanged; we just modify the way ants update pheromones on their backward trip.
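To make the forward/backward mechanics concrete, the following is a minimal Python sketch of one AS iteration for the ATSP. It is our own illustration rather than part of the original algorithm description: the function names, the dictionary representation of the trails, and the choice to evaporate before depositing are assumptions of the sketch.

```python
import random

def as_forward(tau, n, rng=random):
    """Build one random tour with the AS proportional rule of Eq. (1).
    `tau` maps ordered city pairs (i, j) to pheromone values; cities are 0..n-1."""
    tour = [rng.randrange(n)]                          # start city drawn uniformly
    while len(tour) < n:
        i = tour[-1]
        candidates = [j for j in range(n) if j not in tour]
        weights = [tau[(i, j)] for j in candidates]
        r, acc = rng.random() * sum(weights), 0.0
        chosen = candidates[-1]                        # fallback guards against rounding
        for j, w in zip(candidates, weights):          # roulette-wheel selection
            acc += w
            if r < acc:
                chosen = j
                break
        tour.append(chosen)
    return tour

def as_backward(tau, tour, dist, rho):
    """Evaporate all trails by the factor rho, then reinforce every edge of the
    tour (including the closing edge) by 1/L, where L is the tour length."""
    n = len(tour)
    L = sum(dist[tour[t]][tour[(t + 1) % n]] for t in range(n))
    for key in tau:
        tau[key] *= 1.0 - rho
    for t in range(n):
        tau[(tour[t], tour[(t + 1) % n])] += 1.0 / L
    return L
```

Uniform initial trails can be set with, for example, tau = {(i, j): 1.0 for i in range(n) for j in range(n) if i != j}.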

2.2 Stochastic Gradient Descent


First, we will consider maximizing the expected value of the inverse of the length of an ant's forward trip, given the current pheromone trails. We will climb stochastically the gradient of the error defined as:

$$E(\tau) = \mathbb{E}\!\left[\frac{1}{L(\sigma)} \,\middle|\, \tau\right] = \sum_{\sigma \in \Sigma} \Pr(\sigma \mid \tau)\, \frac{1}{L(\sigma)}.$$

Note that the expectation is conditional on $\tau$ because the probability of a given tour happening depends on the current pheromone trail vector $\tau$, while the local error $1/L(\sigma)$ does not depend on the weights $\tau$. Then we have

$$\frac{\partial E}{\partial \tau_{ij}}(\tau) = \sum_{\sigma \in \Sigma} \frac{1}{L(\sigma)}\, \frac{\partial \Pr(\sigma \mid \tau)}{\partial \tau_{ij}}$$

for each pair of cities $(i, j)$. The probability of a given path is equal to the product of the probabilities of all the elementary events that compose it: if $\sigma = (\sigma_1, \ldots, \sigma_n)$, then

$$\Pr(\sigma \mid \tau) = \Pr(\sigma_1) \prod_{t=1}^{n-1} \Pr(\sigma_{t+1} \mid \sigma^t, \tau),$$

where $\sigma^t$ is equal to $\sigma$ truncated after step $t$: $\sigma^t = (\sigma_1, \ldots, \sigma_t)$. Therefore,

$$\ln \Pr(\sigma \mid \tau) = \ln \Pr(\sigma_1) + \sum_{t=1}^{n-1} \ln \Pr(\sigma_{t+1} \mid \sigma^t, \tau)$$

and

$$\frac{\partial \Pr(\sigma \mid \tau)}{\partial \tau_{ij}} = \Pr(\sigma \mid \tau)\, \frac{\partial \ln \Pr(\sigma \mid \tau)}{\partial \tau_{ij}}.$$

Here we have supposed that $\Pr(\sigma \mid \tau) > 0$, which is always true because $\sigma_{t+1} \notin \sigma^t$ for all $t$, as $\sigma$ is an acyclic path, and because the pheromone trails never fall to 0 in the original AS algorithm (however, we will see later that this is a problem for the new algorithm). Define $T_{ij}(\sigma)$ as

$$T_{ij}(\sigma) = \frac{\partial \ln \Pr(\sigma \mid \tau)}{\partial \tau_{ij}} = \sum_{t=1}^{n-1} \frac{\partial \ln \Pr(\sigma_{t+1} \mid \sigma^t, \tau)}{\partial \tau_{ij}},$$

then

$$\frac{\partial E}{\partial \tau_{ij}}(\tau) = \mathbb{E}\!\left[\frac{T_{ij}(\sigma)}{L(\sigma)} \,\middle|\, \tau\right]. \qquad (2)$$

We will see later how to calculate the $T_{ij}(\sigma)$s. We can already outline the basis of the SGD algorithm. Climbing the gradient of the error corresponds to updating $\tau$ iteratively in the direction of the gradient of $E$:

$$\tau \leftarrow \tau + \alpha\, \nabla_{\tau} E(\tau),$$

that is,

$$\tau_{ij} \leftarrow \tau_{ij} + \alpha\, \frac{\partial E}{\partial \tau_{ij}}(\tau)$$

for each individual weight $\tau_{ij}$, where $\alpha$ is the step-size parameter or learning rate. Using the previous results, the gradient of the error may be expressed as

$$\frac{\partial E}{\partial \tau_{ij}}(\tau) = \sum_{\sigma \in \Sigma} \Pr(\sigma \mid \tau)\, \frac{T_{ij}(\sigma)}{L(\sigma)}.$$
Therefore we could do exact gradient ascent in the space of pheromone trails by enumerating all possible paths $\sigma \in \Sigma$ and calculating, for each of them, the probability $\Pr(\sigma \mid \tau)$ that an ant follows this path during its forward trip (given the current pheromone trails), the length of the corresponding tour $L(\sigma)$, and the variable $T_{ij}(\sigma)$ for each pair $(i, j)$. However, the size of $\Sigma$ grows exponentially with the number of cities, so that this approach quickly becomes infeasible. Moreover, exact gradient descent performs poorly in many complex domains because it gets trapped in the first local optimum on its way. In stochastic gradient descent, an unbiased estimate of the gradient is used instead of the true gradient. In our ant algorithm, if we draw one or a few tours following $\Pr(\cdot \mid \tau)$, calculate their contributions to the gradient, and average these contributions, then the result is a random vector whose expected value is equal to the gradient. In other words, it is an unbiased estimate of the gradient that can be used to update the weights. It can be shown that, if the learning rate is decreased at an appropriate rate, and if the gradient estimate satisfies some regularity and boundedness conditions, then SGD converges with probability 1 to a local optimum of the error [1]. Moreover, there is a little bit of the idea of simulated annealing in SGD: sometimes we make bad moves, because the gradient estimate is wrong, but they allow jumping out of bad local optima. Therefore SGD usually performs better than exact gradient descent in multimodal search spaces. The basis of our comparison between ACO and SGD is the analogy between the action of sending an ant forward and that of sampling a solution from $\Pr(\cdot \mid \tau)$: during its forward trip, the action of an ant is precisely to sample a solution following this probability distribution. Therefore, the forward component of AS can be used in an SGD algorithm as well; we just have to change the weight update rules. We show below that the updates associated with a sampled tour are very similar in the two algorithms.
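The update loop just described can be written down in a few lines. This is only a schematic sketch: sample_gradient_estimate stands for any procedure returning an unbiased estimate of the gradient (such as the ant-based one derived in the next section), and the harmonic step-size schedule is merely one common choice, not one prescribed by this paper.

```python
def stochastic_gradient_ascent(w, sample_gradient_estimate, n_steps):
    """Generic stochastic gradient ascent on a weight dictionary `w`.
    `sample_gradient_estimate(w)` must return a dict holding an unbiased
    estimate of the gradient of the objective at the current weights."""
    for m in range(1, n_steps + 1):
        alpha = 1.0 / m                     # decreasing learning rate (one common choice)
        g = sample_gradient_estimate(w)
        for k, g_k in g.items():
            w[k] += alpha * g_k             # ascent step; weights absent from g stay untouched
    return w
```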

2.3 A First ACO/SGD Algorithm


Given a path $\sigma$ and a pair of cities $(i, j)$, the problem is to calculate the trace $T_{ij}(\sigma)$ that appears in Equation (2). Using Equation (1), it is easy to verify that:

- if $i \neq \sigma_t$ or $j \in \sigma^t$, then $\Pr(\sigma_{t+1} \mid \sigma^t, \tau)$ is independent of $\tau_{ij}$ and

$$\frac{\partial \ln \Pr(\sigma_{t+1} \mid \sigma^t, \tau)}{\partial \tau_{ij}} = 0;$$

- if $i = \sigma_t$ and $j = \sigma_{t+1}$, then

$$\frac{\partial \ln \Pr(\sigma_{t+1} \mid \sigma^t, \tau)}{\partial \tau_{ij}} = \frac{1 - \Pr(\sigma_{t+1} \mid \sigma^t, \tau)}{\tau_{ij}};$$

- if $i = \sigma_t$, $j \notin \sigma^t$, and $j \neq \sigma_{t+1}$, then

$$\frac{\partial \ln \Pr(\sigma_{t+1} \mid \sigma^t, \tau)}{\partial \tau_{ij}} = -\frac{\Pr(j \mid \sigma^t, \tau)}{\tau_{ij}}.$$

We see that the weight update corresponding to a tour can be performed by the ant that sampled the tour during a backward trip. When returning to the $t$-th city of the tour, the ant increases the pheromone trail $\tau_{\sigma_t \sigma_{t+1}}$ by the amount

$$\frac{\alpha}{L(\sigma)} \cdot \frac{1 - \Pr(\sigma_{t+1} \mid \sigma^t, \tau)}{\tau_{\sigma_t \sigma_{t+1}}}.$$

Then the pheromone trails $\tau_{\sigma_t j}$, for all $j \notin \sigma^t$ with $j \neq \sigma_{t+1}$, evaporate, that is, they are decreased by the amount

$$\frac{\alpha}{L(\sigma)} \cdot \frac{\Pr(j \mid \sigma^t, \tau)}{\tau_{\sigma_t j}}.$$

The ants must be a little bit more intelligent than in AS. They need to remember not only the tour they followed, but also the probability of choosing each city at each step of the tour. If they do not have such a memory, they can always recompute the probabilities using the pheromone trails, but the trails must not have changed in between (due to another ant updating pheromones). Therefore, this approach will work only in a sequential implementation of the algorithm. The previous results show that SGD of the error $E(\tau) = \mathbb{E}[1/L(\sigma) \mid \tau]$ is very close to the original AS algorithm. The main differences are that:

- the reinforcement of a pheromone trail is inversely proportional to its current value;
- $\tau_{\sigma_n \sigma_1}$ is not reinforced;
- the evaporation of a pheromone trail depends on the reward, it is inversely proportional to its current value, and it is proportional to the probability of choosing $j$ when we were in $\sigma_t$;
- the pheromone trails that are not used during the forward trip do not evaporate. If we had already visited $j$ when we were in $\sigma_t$, then $\tau_{\sigma_t j}$ was not used during the forward trip. As a consequence, it does not evaporate after the backward trip. This makes sense, since if $\tau_{\sigma_t j}$ is not used during the generation of a tour, then this tour provides no information about the good direction in which to move $\tau_{\sigma_t j}$.

The last point is very important; it will be true in any application of SGD. It implies that the weight update associated with a forward trip (i.e., a sampled solution) can always be performed in a backward trip following the same path. This is the basis of the ant implementations of SGD presented in Sec. 3. There are a few problems with the SGD algorithm we just defined. First, the evaporation rule may bring the weights to or below 0. Negative pheromone trails do not really make sense. Moreover, we supposed the pheromones (strictly) positive when calculating the gradient. When some pheromones are 0, the analytical expression of the gradient is more complex. An empirical solution to this problem consists of artificially preventing the weights from falling below a given value $\epsilon > 0$. However, there is another problem with this algorithm: the gradient estimate and its variance tend to infinity when some $\tau_{ij}$ tend to 0, so we cannot prove convergence with probability one of the algorithm. In the next section, we present a new implementation of SGD that can be proved to converge to a local optimum with probability 1, and that belongs to the family of ACO algorithms.
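As an illustration, here is a sketch of the backward trip of this first ACO/SGD algorithm, assuming the ant has recorded, for every step t, the selection probabilities of all cities that were still admissible (the probs argument). The clipping at a small eps reflects the empirical fix mentioned above; all names are ours.

```python
def sgd_backward_proportional(tau, tour, probs, L, alpha, eps=1e-6):
    """Backward update of the first ACO/SGD algorithm (Sec. 2.3), for the
    proportional selection rule of Eq. (1). `probs[t]` maps every city j that
    was admissible at step t to Pr(j | sigma^t, tau) as used on the forward trip."""
    for t in range(len(tour) - 1):
        i, chosen = tour[t], tour[t + 1]
        for j, p in probs[t].items():
            if j == chosen:
                # reinforcement: inversely proportional to the current trail value
                tau[(i, j)] += (alpha / L) * (1.0 - p) / tau[(i, j)]
            else:
                # evaporation: only trails used on the forward trip evaporate
                tau[(i, j)] -= (alpha / L) * p / tau[(i, j)]
                tau[(i, j)] = max(tau[(i, j)], eps)   # empirical fix: keep trails positive
    return tau
```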

2.4 A Convergent ACO/SGD Algorithm


The main change is the rule for choosing the next city as a function of the current pheromone trails. The proportional rule of Equation (1) is replaced by the Boltzmann law:

$$\Pr(\sigma_{t+1} = j \mid \sigma^t, \tau) = \begin{cases} \dfrac{e^{\tau_{\sigma_t j}}}{\sum_{k \notin \sigma^t} e^{\tau_{\sigma_t k}}} & \text{if } j \notin \sigma^t, \\[2mm] 0 & \text{otherwise.} \end{cases}$$

The derivation presented in Sec. 2.2 is still valid; the only changes are in the calculation of the traces (Sec. 2.3):

- if $i \neq \sigma_t$ or $j \in \sigma^t$, then $\dfrac{\partial \ln \Pr(\sigma_{t+1} \mid \sigma^t, \tau)}{\partial \tau_{ij}} = 0$;
- if $i = \sigma_t$ and $j = \sigma_{t+1}$, then $\dfrac{\partial \ln \Pr(\sigma_{t+1} \mid \sigma^t, \tau)}{\partial \tau_{ij}} = 1 - \Pr(\sigma_{t+1} \mid \sigma^t, \tau)$;
- if $i = \sigma_t$, $j \notin \sigma^t$, and $j \neq \sigma_{t+1}$, then $\dfrac{\partial \ln \Pr(\sigma_{t+1} \mid \sigma^t, \tau)}{\partial \tau_{ij}} = -\Pr(j \mid \sigma^t, \tau)$.
The resulting update rules are very similar to the AS rules. The reinforcement rule is

$$\tau_{\sigma_t \sigma_{t+1}} \leftarrow \tau_{\sigma_t \sigma_{t+1}} + \frac{\alpha}{L(\sigma)}\,\bigl(1 - \Pr(\sigma_{t+1} \mid \sigma^t, \tau)\bigr)$$

and the evaporation rule is

$$\tau_{\sigma_t j} \leftarrow \tau_{\sigma_t j} - \frac{\alpha}{L(\sigma)}\, \Pr(j \mid \sigma^t, \tau) \qquad \text{for all } j \notin \sigma^t,\; j \neq \sigma_{t+1}.$$
The main differences with the AS update rules are that:

- $\tau_{\sigma_n \sigma_1}$ is not reinforced;
- evaporation does not concern weights that have not been used during the forward trip;
- the evaporation of a pheromone trail depends on the reward and is proportional to the probability of choosing $j$ when we were in $\sigma_t$.

Ants need the same memory capacity as in the previous SGD algorithm. This new algorithm does not have the drawbacks of the previous one. The weights are unconstrained and can take any real value. Moreover, the gradient estimate is uniformly bounded: each of its components is at most $1/L^{*}$ in absolute value, where $L^{*}$ is the length of the shortest tour. Therefore, it may be shown that, if the learning rate is decreased in such a way that

$$\sum_{m=1}^{\infty} \alpha_m = \infty \qquad \text{and} \qquad \sum_{m=1}^{\infty} \alpha_m^2 < \infty,$$

where $\alpha_m$ is the learning rate used by the $m$-th ant, then the algorithm converges with probability 1 to a local optimum of the error [1].²
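A sketch of the two ingredients of the convergent variant follows: the forward selection rule becomes a softmax of the trails leaving the current city, and the backward trip applies the bounded traces derived above. Subtracting the maximum before exponentiating is only a numerical-stability device, and the function names are ours.

```python
import math

def boltzmann_probs(tau, i, unvisited):
    """Softmax of the pheromone trails leaving city i, over the unvisited cities."""
    m = max(tau[(i, j)] for j in unvisited)          # for numerical stability only
    exps = {j: math.exp(tau[(i, j)] - m) for j in unvisited}
    z = sum(exps.values())
    return {j: e / z for j, e in exps.items()}

def sgd_backward_boltzmann(tau, tour, probs, L, alpha):
    """Backward update of the convergent ACO/SGD algorithm, maximizing E[1/L]."""
    for t in range(len(tour) - 1):
        i, chosen = tour[t], tour[t + 1]
        for j, p in probs[t].items():
            if j == chosen:
                tau[(i, j)] += (alpha / L) * (1.0 - p)   # reinforcement
            else:
                tau[(i, j)] -= (alpha / L) * p           # evaporation of used trails only
    return tau
```

A forward step can then draw the next city with, for example, random.choices(list(p), weights=list(p.values()))[0], where p is the dictionary returned by boltzmann_probs.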

3 Generalization
First, it is easy to modify the algorithm so that it optimizes other criteria than $\mathbb{E}[1/L(\sigma) \mid \tau]$. For instance, if we want to minimize the expected tour length $\mathbb{E}[L(\sigma) \mid \tau]$, then the reinforcement rule of the convergent SGD algorithm becomes

$$\tau_{\sigma_t \sigma_{t+1}} \leftarrow \tau_{\sigma_t \sigma_{t+1}} - \alpha\, L(\sigma)\,\bigl(1 - \Pr(\sigma_{t+1} \mid \sigma^t, \tau)\bigr)$$

and the evaporation rule becomes

$$\tau_{\sigma_t j} \leftarrow \tau_{\sigma_t j} + \alpha\, L(\sigma)\, \Pr(j \mid \sigma^t, \tau) \qquad \text{for all } j \notin \sigma^t,\; j \neq \sigma_{t+1}.$$
Note that now we are really descending, and not climbing, the gradient of the error function. The choice of the objective function determines the topology of the search space. Therefore, the performance of SGD may vary with different objective functions, even if these functions have the same local and global optima. More generally, the same approach can be applied to every ACO algorithm and combinatorial optimization problem. The generic ACO/SGD approach to a given combinatorial optimization problem with solution set $S$ and objective function $f$ can be summarized as follows. First, design a stochastic controller that uses a set of weights $w$ to generate solutions in an iterative way. Then define the error function as the expected value of a solution produced by the controller, given the current weights:

$$E(w) = \mathbb{E}\bigl[f(s) \mid w\bigr].$$

Solutions are generated by an iterative stochastic process that follows a finite number of steps. Let $X$ be the set of trajectories that the controller may follow when generating a solution, and let $g : X \to S$ be the function that assigns to a given trajectory the solution it produces. The error may be rewritten as:

$$E(w) = \sum_{x \in X} \Pr(x \mid w)\, f\bigl(g(x)\bigr).$$

² More precisely, the gradient tends to 0 as the number of steps tends to infinity, but the weights may not converge.

The gradient of the error is then the expectation over all possible trajectories of the value of the solution produced times an eligibility trace:

$$\frac{\partial E}{\partial w_k}(w) = \mathbb{E}\bigl[f(g(x))\, T_k(x) \mid w\bigr]$$

for each individual weight $w_k$. The trace $T_k(x) = \partial \ln \Pr(x \mid w) / \partial w_k$ is the sum of the partial derivatives, with respect to $w_k$, of the log-probability of every elementary event that composes the trajectory $x$. Stochastic gradient descent can thus be performed by sampling a few trajectories (which can be seen as sending a few ants forward in the set of weights) and calculating their contribution to the gradient. The corresponding weight updates can be performed by the ants during a backward trip in the set of weights, because weights that are not used during a forward trip do not have to be updated. If the controller uses the Boltzmann law to make random choices during solution generation, then the gradient estimate is uniformly bounded and convergence to a local optimum with probability 1 can be proven.
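The generic scheme reduces to a single update step once the problem-specific parts are abstracted away. In this schematic sketch, sample_trajectory, trace, and solution_value are placeholders for the controller, the eligibility-trace computation, and the composition f(g(x)); none of these names come from the paper.

```python
def generic_aco_sgd_step(w, sample_trajectory, trace, solution_value, alpha, minimize=False):
    """One step of the generic ACO/SGD scheme of Sec. 3.
    sample_trajectory(w): draws a trajectory x following Pr(. | w);
    trace(x, w): returns, for each weight used along x, the sum over the steps
        of the partial derivative of the log-probability of the choice made;
    solution_value(x): returns f(g(x)), the objective value of the solution built."""
    x = sample_trajectory(w)
    f = solution_value(x)
    sign = -1.0 if minimize else 1.0           # descend or climb the gradient
    for k, t_k in trace(x, w).items():         # weights not used by x are not touched
        w[k] += sign * alpha * f * t_k
    return x, f
```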

4 Mutual Contributions
The interest of establishing connections between ACO and SGD is manifold. On one side, many questions posed in the study of ACO algorithms appear in a new light under the SGD interpretation. For instance, the problem of determining the number of ants to send before updating the pheromones appears to be closely related to the problem of choosing the momentum in artificial neural networks [11]. Also, a new way of proving convergence is proposed. On the other side, ACO algorithms show that SGD can be implemented in a completely distributed and parallel way, using only stigmergic information.³ Finally, several improvements to the basic ACO scheme presented in Sec. 2.1 have been proposed and may be transposed to SGD algorithms. The most efficient of these are the use of heuristic information and the combination of ACO with a discrete local search algorithm. At the end of this section, we examine how these two modifications may be integrated and understood in the framework of SGD.

4.1 Using Heuristic Information


The most efficient implementations of the ACO metaheuristic combine information from the pheromone trails and heuristic information when generating solutions. For instance, in the case of an application of AS to the ATSP, the city selection rule (1) is modified so that, if $j \notin \sigma^t$, then

$$\Pr(\sigma_{t+1} = j \mid \sigma^t, \tau) = \frac{[\tau_{\sigma_t j}]^{a}\, [\eta(\sigma_t, j)]^{b}}{\sum_{k \notin \sigma^t} [\tau_{\sigma_t k}]^{a}\, [\eta(\sigma_t, k)]^{b}},$$

where $\eta(i, j)$ is the heuristic information associated with the pair $(i, j)$, and $a$ and $b$ are two (positive) parameters that determine the relative influence of pheromone trails and
³ Stigmergy is a particular form of indirect communication used by social insects to coordinate their activities. Its role in the definition of ant algorithms is discussed at length in [7, 6, 2, 3].

heuristic information. In general, the function $\eta$ reflects heuristic information about the good way to continue a partial tour. For instance, in the case of AS and the ATSP, a common choice is $\eta(i, j) = 1/d_{ij}$ for all $(i, j)$. In this case, the closest unvisited cities have a larger probability of being chosen than without heuristic information. Moreover, in the successful applications of ACO to non-stationary (time-varying) problems such as AntNet [5], the function $\eta$ is used to provide information about the current state of the problem to the algorithm. There are several ways of integrating a similar mechanism in our SGD algorithm. First, we may recalculate the gradient to take into account the new selection rule. It leads to weight update rules where the variation of each $\tau_{ij}$ is proportional to the corresponding derivative of the log-probability under the new rule. What we change in this case is the way the weights encode a probability distribution on solutions, that is, the structure of the controller. Therefore, the objective function, seen as a function from weight vectors to real numbers, also changes. Hence, the performance of the algorithm may vary. Another solution is to keep the basic update rules unchanged and to consider the use of the heuristic information as a bias in our search. We are now using a biased estimate of the gradient instead of the true gradient. The Monte Carlo technique of importance sampling [12] can then be used to suppress this bias. Stated simply, using the notation of Sec. 3, we multiply the contribution of each forward trajectory $x$ by the coefficient

$$c(x) = \frac{\Pr(x \mid w)}{\Pr_h(x \mid w)},$$

where $\Pr(x \mid w)$ is the probability of generating $x$ without using the heuristic information, and $\Pr_h(x \mid w)$ is the probability of sampling $x$ when using the heuristic information. Because

$$\sum_{x \in X} \Pr_h(x \mid w)\, c(x)\, f(g(x))\, T_k(x) = \sum_{x \in X} \Pr(x \mid w)\, f(g(x))\, T_k(x) = \frac{\partial E}{\partial w_k}(w),$$

the estimate is still unbiased. In the framework of the ATSP, the overall effect of importance sampling is the following: when an unvisited city $j$ is close to the current city $\sigma_t$, it has a bigger chance of being chosen than without heuristic information. In counterpart, if $j$ wins the random drawing, then the reinforcement of the pheromone trail $\tau_{\sigma_t j}$ is smaller than the reinforcement it would have received if a more distant city had won the drawing (provided that the total length of the tour does not change). This approach bears some similarities with the previous one: in both cases, we sample trajectories following the biased distribution and we get an unbiased estimate of the gradient. However, the objective function differs from one algorithm to the other: importance sampling optimizes the same function as the algorithm without heuristic information. Therefore, the two alternatives may exhibit different performances. Moreover, the two estimates may have different variances, and thus different accuracies. In particular, the importance sampling estimate often becomes unstable when the sampling distribution (in our case $\Pr_h(\cdot \mid w)$) differs a lot from the target distribution ($\Pr(\cdot \mid w)$), that is, for large values of $b$. Finally, a third possible approach is to accept the bias induced by the use of heuristic information and not try to correct it. In other words, we keep the update rules absolutely unchanged, despite the fact that the solution generation rule has changed. This may be interesting for two reasons:


- First, recent (unpublished) results in the theory of Monte Carlo estimation show that there is a dilemma between bias and variance, and that the biased estimate may produce a smaller estimation error than the unbiased importance sampling estimate when the number of samples is small.
- More importantly, we may hope, in the case of SGD, that the bias will be efficient, that is, that it will help the algorithm get out of bad local optima that attract unbiased gradient descent.

Note that this third approach is the closest to the actual implementations of ACO with heuristic information. Further research is needed to determine which of these three solutions performs best. Empirical results with ACO algorithms suggest that the third can be an efficient way to improve SGD algorithms.
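For the second option (importance sampling), the correction amounts to one multiplicative coefficient per sampled trajectory. A minimal sketch, with names of our own choosing and log-probabilities used only to avoid numerical underflow on long trajectories:

```python
import math

def corrected_contribution(trace, value, logp_plain, logp_heuristic):
    """Importance-sampling correction of Sec. 4.1. The gradient contribution of
    a trajectory sampled with the heuristic-biased rule (value * trace) is
    rescaled by Pr(x | no heuristic) / Pr(x | heuristic), which restores an
    unbiased estimate of the gradient of the original objective."""
    c = math.exp(logp_plain - logp_heuristic)      # importance-sampling coefficient
    return {k: c * value * t_k for k, t_k in trace.items()}
```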

4.2 Using Discrete Local Search


The ACO metaheuristic is often used in conjunction with local search [13, 9]. In this approach, an ACO algorithm generates starting points for a discrete local search algorithm. Each ant produces a solution, say $s$, which is then transformed into another solution, say $s'$, by the local search. Then the pheromones are updated. As our goal is to maximize the quality of the final solution $s'$, pheromone updates must be proportional to the quality of $s'$, not $s$. Given this, there are still two ways of updating the pheromones:

- either we reinforce the pheromones corresponding to the final solution $s'$; in other words, we do as if the solution $s'$ had been generated directly by the ant algorithm, without the help of the local search (in this approach, we suppose that there is a mapping between the solution set $S$ and the set of possible forward trajectories $X$);
- or we reinforce the pheromones corresponding to the intermediate solution $s$.

By analogy with similar procedures in the area of genetic algorithms [14], we call the first alternative the Lamarckian approach, and the second the Darwinian approach. There are several arguments supporting the Lamarckian approach. For instance, one could think that, if we can teach the better solution $s'$ directly to the ant algorithm, it would be stupid to teach it only the worse solution $s$. In practice, only this alternative has been used. In the case of SGD, however, the Darwinian approach may make more sense. It is easy to show that, if we try to maximize the expected value of the solution produced by the local search algorithm, then the update rule of an SGD algorithm is to reinforce the pheromones corresponding to the intermediate solution $s$ proportionally to the value of the final solution $s'$. The formal framework developed in Sec. 3 can be used for this calculation, the effect of the local search being modeled in the function $g$. Having understood this, we can derive qualitative arguments in favor of the Darwinian approach. For instance, if the good starting points of the local search are very far from the corresponding local optima in the topology of the gradient algorithm, then the Darwinian approach could outperform the Lamarckian one.

Notwithstanding these theoretical results, practical experiments with ACO and SGD algorithms show that the Lamarckian approach very often performs better than the Darwinian one. The observed efficiency of the Lamarckian approach can be explained by the idea of an efficient bias: the gradient estimate of the Lamarckian algorithms is biased, but it is probable that this bias has a positive effect on performance because it helps the algorithms jump out of low-value local optima. Further research is needed to validate this hypothesis.
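In code, the Lamarckian and Darwinian policies differ only in which trajectory the backward update is applied along. A minimal sketch, where backward_update stands for any of the SGD updates sketched earlier and all names are ours:

```python
def update_with_local_search(w, x_sampled, x_improved, final_value, backward_update,
                             lamarckian=True):
    """Pheromone update when the ant algorithm is hybridized with local search.
    `x_sampled` is the trajectory the ant actually followed, `x_improved` a
    trajectory encoding the locally optimized solution, and `final_value` the
    objective value of that final solution; backward_update(w, trajectory, value)
    applies the chosen SGD update along the given trajectory."""
    if lamarckian:
        # Lamarckian: credit the edges of the locally optimized solution
        backward_update(w, x_improved, final_value)
    else:
        # Darwinian: credit the trajectory actually sampled, but proportionally
        # to the value of the solution obtained after local search
        backward_update(w, x_sampled, final_value)
    return w
```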

5 Conclusion
We have shown that AS is very similar to SGD. We have proposed a provably convergent implementation of SGD that enters into the framework of ACO algorithms. Then we have outlined a general ACO/SGD algorithm for combinatorial optimization. The performance of this algorithm depends crucially on some basic choices such as the structure of the controller and the objective function. More research is needed to understand what the good choices are for a given problem. This work allows a better understanding of the mechanisms at work in ACO algorithms, and of some important issues in the theory of these algorithms. Moreover, we have used this analogy to propose improvements to the basic scheme of SGD based on biasing the search using heuristic information and discrete local search algorithms. An empirical study of these propositions is in progress and the results will be the subject of a later publication.

Acknowledgments
This work was supported by a Marie Curie Fellowship awarded to Nicolas Meuleau (Contract No. HPMFCT-2000-00230). Marco Dorigo acknowledges support from the Belgian FNRS, of which he is a Senior Research Associate. This work was also partially supported by the Metaheuristics Network, a Research Training Network funded by the Improving Human Potential programme of the CEC, grant HPRN-CT-1999-00106. The information provided is the sole responsibility of the authors and does not reflect the Community's opinion. The Community is not responsible for any use that might be made of data appearing in this publication.

References
[1] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[2] E. Bonabeau, M. Dorigo, and G. Theraulaz. Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press, New York, NY, 1999.
[3] E. Bonabeau, M. Dorigo, and G. Theraulaz. Inspiration for optimization from social insect behavior. Nature, 406:39-42, 2000.
[4] E. Bonabeau and G. Theraulaz. Swarm smarts. Scientific American, 282(3):54-61, 2000.

[5] G. Di Caro and M. Dorigo. AntNet: Distributed stigmergetic control for communications networks. Journal of Artificial Intelligence Research, 9:317-365, 1998.
[6] M. Dorigo, E. Bonabeau, and G. Theraulaz. Ant algorithms and stigmergy. Future Generation Computer Systems, 16(8):851-871, 2000.
[7] M. Dorigo, G. Di Caro, and L. M. Gambardella. Ant algorithms for discrete optimization. Artificial Life, 5(2):137-172, 1999.
[8] M. Dorigo, V. Maniezzo, and A. Colorni. The Ant System: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics - Part B, 26(1):29-41, 1996.
[9] L. M. Gambardella and M. Dorigo. Ant Colony System hybridized with a new local search for the sequential ordering problem. INFORMS Journal on Computing, 12(3):237-255, 2000.
[10] E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan, and D. B. Shmoys. The Travelling Salesman Problem. John Wiley & Sons, Chichester, UK, 1985.
[11] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.
[12] R. Y. Rubinstein. Simulation and the Monte Carlo Method. Wiley, New York, NY, 1981.
[13] T. Stützle and H. H. Hoos. The MAX-MIN Ant System and local search for the traveling salesman problem. In Proceedings of the 1997 IEEE International Conference on Evolutionary Computation (ICEC'97), pages 309-314. IEEE Press, Piscataway, NJ, 1997.
[14] D. Whitley, S. Gordon, and K. Mathias. Lamarckian evolution, the Baldwin effect and function optimization. In Proceedings of PPSN-III, Third International Conference on Parallel Problem Solving from Nature, pages 6-15. Springer-Verlag, Berlin, Germany, 1994.

