
CS 6700: Assignment 1: Reinforcement Learning

Chapter 1
Due on Wednesday, January 21, 2015

L.Aravind Srinivas


Contents
Problem 1

Problem 2

Problem 3

Problem 4

Problem 5


Problem 1
Self-Play: Suppose, instead of playing against a random opponent, the reinforcement learning algorithm
played against itself. What do you think would happen in this case? Would it learn a different way of
playing?
This is quite an ambiguous question. Let us call the RL agent A. When A plays as its own opponent, the question arises of how to value the states it encounters while moving as the second player. This matters for the following reason: in the algorithm described in the book, the update to the value of the state reached by a greedy move is made after the opponent plays. So, viewing the agent as the second player, the updates to the states available to it as the second player must be made after the agent moves as the first player; conversely, viewing it as the first player, the updates must be made after the agent moves as the second player. This would mean the agent assigns values to all possible states of the tic-tac-toe game, irrespective of whether it occupies them as X or as O, which is not logically correct. To avoid this, I will assume that the valuation done by the O-playing side of the agent is done as if there were a separate O-playing agent (even though this is self-play), and does not affect the X-playing agent's value lookup table. This may seem trivial, but it is important for removing the ambiguity that arises when both players are the same.
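To make this concrete, here is a minimal sketch (in Python, with names of my own choosing) of keeping two independent value tables for the two roles during self-play; the backup itself follows the book's temporal-difference rule.

```python
from collections import defaultdict

# Separate value tables for the two roles of the self-playing agent.
# Keys are board states (e.g. a tuple of 9 cells); values are estimated
# probabilities of winning for that role. 0.5 is the default initial value.
value_X = defaultdict(lambda: 0.5)
value_O = defaultdict(lambda: 0.5)

ALPHA = 0.1  # step-size parameter (assumed value)

def td_update(values, state, next_state):
    """Move the value of `state` a fraction ALPHA toward the value of the
    state observed after the opponent's reply (the book's update rule):
        V(s) <- V(s) + alpha * (V(s') - V(s))
    """
    values[state] += ALPHA * (values[next_state] - values[state])
```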
An important thing to remember is that a draw and a loss are valued the same (0) in our algorithm.
Now we get into the problem. Initially (in the first game) the moves are essentially random, and the result would most likely be a draw. As more games are played, more results are seen, different states are explored, and greedy choices are made.
A loss and a draw for A as the X player are the same as far as the value assigned to the terminal state is concerned. However, for the opponent A (the O player), the corresponding win and loss are not the same: a win is obviously better for it. So, for the opponent A, the correct strategy from a position where either a win or a draw is reachable is to go for the win.
In any other case, the X-playing agent will go for the win if a win is possible. Essentially, when the X-playing agent loses, that outcome corresponds to a favorable state for the O-playing agent (the replica agent), so in future games the O-playing agent will try to force play toward such states. That is, if the X-playing agent randomly explores into a no-win state, the O-playing agent greedily steers toward its winning states. This makes the X-playing agent slightly vulnerable when exploring, but not greatly so, because the same no-win state is not explored often enough for the O-playing agent's greedy values to converge and fully exploit the wrong exploration. Similarly, when the O-playing agent explores instead of moving greedily, the X-playing agent will prefer states from which it can greedily reach already highly valued states, turning an earlier no-win situation into a win thanks to the O-playing agent's random exploratory mistake.


Setting the analysis of exploration aside, the analysis of greedy moves shows that greedy choices toward winning positions can force wins initially, but as time progresses the opponent also starts moving greedily according to its own valuations, and so even greedy play tends to converge to draws rather than wins. The agent thus learns to force a win whenever a win is possible, which holds against any random opponent. But during self-play training, games soon become draws. This is because, when a win is not possible, the agent simply plays to maximise value, without distinguishing whether it is playing for a win or a draw, while the opponent (a copy of itself) is also improving. This is precisely what I tried to explain through the no-win situation: a non-winnable state for player A is a good state with respect to the opponent A, who has learned quite well. A loss and a draw make no difference to the X-playing agent, which plays aimlessly, whereas the O-playing agent tries to force a win. However, because of earlier losses and experience, the X-playing agent implicitly avoids loss-incurring paths in order to maximise its value, while the O-playing agent also tries to avoid mistakes, and this ends up as a "let me get a draw" strategy for non-winnable positions.
To conclude: yes, it does learn a different way of playing. Exploration is partly taken care of by playing against a learning opponent. It more or less learns to play optimally against itself, but that is clearly not a globally optimal way of playing. If updates are made for exploratory moves, it becomes good at exploring mainly opponent-losing situations and at taking advantage of the opponent's wrong explorations. Beyond that, it is good at making greedy choices that maximise its accumulated value, though it does not really know why it is doing so when the positions are non-winnable (a kind of short-sighted far-sightedness).
Note: However, I think this self-play algorithm would be truly optimal if a draw were rewarded 0, a loss -1, and a win +1.
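As a minimal sketch of this alternative reward assignment (helper name is my own), the only change is the value assigned at terminal states, which then seeds the same backups as before:

```python
def terminal_value(result):
    """Suggested alternative terminal values for self-play: win +1, draw 0,
    loss -1, instead of treating a draw and a loss identically (both 0).
    These seed the value table; the usual backup rule propagates them."""
    return {"win": 1.0, "draw": 0.0, "loss": -1.0}[result]
```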


Problem 2
Symmetries: Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the reinforcement learning algorithm described above to take advantage of this? In what ways would this improve it? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?
As far as amending the reinforcement learning algorithm goes, the 3 x 3 tic-tac-toe board has symmetries under horizontal reflection, vertical reflection, and reflection along its two diagonals. For a player who recognizes and exploits these symmetries, up to 8 positions are equivalent to a single state. As an example, suppose we have an X in squares (1,1) and (3,1) and an O in squares (2,1) and (1,2). This position can be reflected about the second row and about the second column to get two extra states, and the state obtained by reflecting about the second column can be reflected about the second row again. It can also be reflected about the diagonal passing through (1,1) to get another state, which can in turn be reflected about the second row and the second column. Any other transformation results in one of these 8 states. Therefore we can say that, roughly, every tic-tac-toe state has 8 symmetrically equivalent states. I say roughly because, for configurations that are already symmetric with respect to the 3 x 3 board, some reflections are redundant; but such states are few, and in general we can assume that around 8 equivalent states can be grouped together as one state. So, for a player who can recognize and exploit symmetries, all 8 of these states must have the same value. The amendment to the reinforcement learning algorithm should therefore be as follows:
Play as before (with the same update rules), but ensure that all states symmetrically equivalent to a given state share the same value (estimated probability of winning from that state). This can be done by hashing symmetrically equivalent states to the same key and updating the value of that whole bucket of states at once.
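A minimal sketch of such a canonicalizing hash, assuming the board is stored as a tuple of 9 cells in row-major order (my own representation, not prescribed by the book):

```python
def rotate(board):
    """90-degree clockwise rotation of a 3x3 board stored row-major."""
    return tuple(board[6 - 3 * (i % 3) + i // 3] for i in range(9))

def mirror(board):
    """Reflection about the vertical axis (swap left and right columns)."""
    return tuple(board[3 * (i // 3) + 2 - i % 3] for i in range(9))

def canonical(board):
    """Canonical representative of the 8 symmetric images of a board:
    the lexicographically smallest one. Using this as the hash key makes
    all symmetrically equivalent states share one value-table entry."""
    images = []
    b = board
    for _ in range(4):
        images.append(b)
        images.append(mirror(b))
        b = rotate(b)
    return min(images)

# Value lookups and updates then go through the canonical key, e.g.:
#   values[canonical(state)] += ALPHA * (values[canonical(next_state)]
#                                        - values[canonical(state)])
```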
In what way would this improve things? It would speed up convergence of the reinforcement learning algorithm, because by visiting just a few states we simultaneously update the values of several states (the state space has been compressed by roughly a factor of 8). With a compressed state space we can expect the winning probabilities to converge faster. Apart from speed, we also save memory: the whole learning procedure is effectively a dynamic-programming-style table of values, and the less we have to store and update per iteration, the better. We do not need to hash each state separately under its own key; we can club the equivalent states together under one bucket. In short, the amended learning with symmetry improves both time and memory.
Now suppose the opponent did not take advantage of symmetries. Then the above amendment to the learning algorithm might actually be bad. A simple reason: assume the opponent is like a random player who hardly understands symmetries. Then for the 8 positions that are equivalent as far as the first player is concerned, the second player does not see them as equivalent and responds to them differently. Call the set of symmetric states at that point S. For some states in S the second player might play perfectly from then on, while for other states in S he might play imperfectly. As the first player, I should be able to take advantage of those imperfections wherever they exist. If I enforce the symmetry-equality constraint, I get nowhere toward exploiting the opponent's imperfections, and in the extreme this could even make me a worse player against a random opponent. Therefore, I think taking advantage of symmetries against an opponent who does not is not the right strategy.


In such a case, symmetrically equivalent positions need not have the same value. What we can do is run two separate agents: one that takes advantage of symmetries and one that does not. For a random opponent, assume he follows some stochastic mixture of using and not using symmetries. We can keep playing games with the two agents, estimate that stochasticity, and use it to form a weighted combination of the values learnt by the two agents for each state.
To be clearer: agent 1 plays under the constraint that symmetric states have the same value. It plays several games against the random opponent and keeps updating state values under that constraint, while also keeping track of how much the random opponent has used symmetry during those games. Similarly, agent 2 plays under the strategy that symmetric states need not have the same value; it plays several games against the random opponent, updates the value of each state separately, and keeps track of how much the random opponent is not using symmetry.
Using the (approximately) estimated stochasticity of the opponent, the final value of each state is then obtained from the values estimated by the two agents, with that stochasticity serving as the weights in a weighted average.
Note: I do not know how to actually keep track of how much the random opponent has or has not used symmetry. I suspect even equal-weight averaging would not be bad, so playing a large number of games and taking a plain average (weight 0.5 for each agent) could do decently by itself. This still makes allowance for the opponent's possibility of using symmetry while also being random, and could be a decently good strategy.
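A minimal sketch of this weighted combination, reusing the canonical() helper from the earlier sketch (the table layout and default value 0.5 are my own assumptions):

```python
def combine_values(values_sym, values_nosym, weight_sym=0.5):
    """Weighted average of the two agents' value estimates for each state.
    `values_sym` is keyed by canonical states, `values_nosym` by raw states.
    `weight_sym` is the estimated probability that the opponent plays
    symmetry-consistently; 0.5 gives the equal-weight fallback."""
    return {s: weight_sym * values_sym.get(canonical(s), 0.5)
               + (1 - weight_sym) * values_nosym.get(s, 0.5)
            for s in values_nosym}
```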

Problem 3
Greedy Play: Suppose the reinforcement learning player was greedy, that is, it always played the move that
brought it to the position that it rated the best. Would it learn to play better, or worse, than a nongreedy
player? What problems might occur?
It will most likely learn to play worse than a non-greedy player. A key problem with a purely greedy reinforcement learning player is that it might regard a non-optimal state as the best option at some point and keep choosing it, rather than exploring possibly better states. To elaborate: say there are many choices for the next state, all initialised with an equal winning probability of 0.5. Say the agent happened to choose state A and went on to win, so A's value was updated to something greater than 0.5 (namely 0.5(α + 1), i.e. 0.5 + α(1 − 0.5)). From then on, the agent will keep making that originally random choice (all states had value 0.5 initially) as its greedy choice. The value of state A will keep rising toward 1 and will converge to its winning probability under this play; say it is 0.7. However, there might be an alternative, exploratory choice of a state B from which the winning probability is 0.9. Because the agent is greedy, it will never come to explore state B, and the final winning probabilities for all states will not be correct. The greedily chosen state's value will converge, but whether it is the optimal choice we can never know if we remain greedy; only by exploring can we find out whether there is a state with a better winning probability. In cases like this we would never get to explore. This is a typical example of the explore-exploit dilemma showing why exploitation alone is not good.
A solution for this is to explore with a small probability ε, and to update even the exploratory moves with the same rule we use for greedy moves. This is explained further in the next question.
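A minimal sketch of such ε-greedy move selection (the helper names are my own; `values` is the value lookup table from before):

```python
import random

EPSILON = 0.1  # exploration probability (assumed value)

def choose_move(legal_next_states, values):
    """Epsilon-greedy selection: with probability EPSILON pick a random
    next state (exploration), otherwise pick the next state with the
    highest current value estimate (exploitation)."""
    if random.random() < EPSILON:
        return random.choice(legal_next_states)
    return max(legal_next_states, key=lambda s: values.get(s, 0.5))
```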


Problem 4
Learning from Exploration: Suppose learning updates occurred after all moves, including exploratory
moves. If the step-size parameter is appropriately reduced over time, then the state values would converge
to a set of probabilities. What are the two sets of probabilities computed when we do, and when we do
not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of
probabilities might be better to learn? Which would result in more wins?
The set of probabilities learnt when we do learn from exploratory moves corresponds to the values (winning probabilities) of each position taking into account that the agent explores as well as plays greedily. That is, each state's value converges to a winning probability that reflects both exploration and exploitation (greediness). The set of probabilities learnt when we do NOT learn from exploratory moves corresponds to the values of each state under the assumption that only greedy moves are made from that state onwards. In that case exploration opens up new states and possibilities, but its effect is never attributed to the value of the state you explore from. Let me illustrate this with an example:
Assume you are in a state X, from which you must select your move. Among the possible next states are two states A and B, with A currently valued higher than B, say 0.7 versus 0.6; let state X itself have value 0.51. Moving to state A would be the greedy move. Suppose the agent decides to explore and goes to state B instead, and suppose further that B's actual winning probability is higher than A's, though the agent has not discovered this yet. Moving greedily from state B and winning gives an incremental update to B's value, say to 0.65. However, we do not change state X's value on the move to state B. So, having exploratory moves but not updating on them produces winning probabilities that, for each state, stand for the probability of winning from that state provided you play greedily from that state onwards: even though I explored state B from X, only greedy moves are used to update X.
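A minimal sketch of the two update variants, using the same (assumed) value-table representation as before:

```python
ALPHA = 0.1  # step-size parameter (assumed value)

def backup(values, state, next_state, was_exploratory, learn_from_exploration):
    """Apply the backup V(s) <- V(s) + alpha * (V(s') - V(s)).
    If `learn_from_exploration` is False, exploratory moves are skipped,
    so V(state) reflects purely greedy continuation; if True, the value
    also absorbs what exploration discovers (e.g. state B above)."""
    if was_exploratory and not learn_from_exploration:
        return  # do not attribute the exploratory outcome to `state`
    values[state] += ALPHA * (values[next_state] - values[state])
```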
As a side note, even though state B is ultimately better than state A, the agent will find this hard to discover from its present values: it has exploited A heavily and pushed its value to 0.7 while B sits at only 0.65, so only the random ε probability of exploring from X can lead it to B repeatedly (which is itself hard, since there are states other than A and B that could also be explored from X).
So the value of X is computed as if you play only greedily after reaching X. But this is not accurate, because you will in fact continue to explore with probability ε; you do not play greedily alone.
This is why the other method of value updates is better. You are making exploratory moves, and it only makes sense to account for their effect in helping you reach the goal from each state. In the example above, state X is what led you to find state B, so it is natural to give it more value; in fact, that is the very definition of value.
After convergence, the two sets of probabilities may well produce different winning paths. For example, suppose that to reach state X the agent was in a state Z and also had the option of moving to a state Y. If the actual winning probabilities of X and Y are not too different, the extra value that X gains from what exploration finds beneath it can help differentiate the two even better.
So, if we do continue to make exploratory moves, it makes sense to account for them in the value of the parent state. This would also result in more wins: because we attach value to what exploration finds, we are more likely to discover beneficial paths (as in the example) and to estimate the actual winning probability from each state more accurately. The second set of probabilities is therefore better to learn and will produce more wins, provided the agent always continues to explore.


Problem 5
Other Improvements: Can you think of other ways to improve the reinforcement learning player? Can
you think of any better way to solve the tic-tac-toe problem as posed?
The answer to this question depends on the goal of the reinforcement learning agent. If it is playing against one specific opponent (assume he has some imperfections), then I can let one agent play against that opponent, learn how the opponent makes mistakes, and incorporate that as a priori information for another agent. This provides better initial values for each state, so the agent suffers less from the explore-exploit dilemma. Beyond this, one can also estimate the stochasticity of the opponent with some algorithm and use those findings when formulating the initial values.
Against an arbitrary opponent, the above strategy will not work. In that case I would do something like the following: play against a simulated random opponent for a number of games; save the agent, but continue playing against the random opponent; and keep saving the agent periodically after some predecided number of games. With the multiple saved agents we can then organize a sort of round-robin tournament in which each agent plays the others, with the least experienced agents playing among themselves first and subsequently playing the more experienced agents. In this way we obtain multiple well-trained agents that are close to an optimal way of playing against any opponent. This could also be done with various simulations of the initial random opponent, to further improve the playing.
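A rough sketch of this checkpointing and round-robin scheme, where `make_agent` and `play_one_game` are hypothetical helpers (not defined here):

```python
import copy

def train_with_checkpoints(make_agent, random_opponent, n_games=10000,
                           checkpoint_every=1000):
    """Train one agent against a random opponent, saving a snapshot every
    `checkpoint_every` games; returns snapshots ordered from least to most
    experienced."""
    agent = make_agent()
    snapshots = []
    for game in range(1, n_games + 1):
        play_one_game(agent, random_opponent)   # assumed game-playing helper
        if game % checkpoint_every == 0:
            snapshots.append(copy.deepcopy(agent))
    return snapshots

def round_robin(snapshots, games_per_pair=100):
    """Round-robin stage: every snapshot plays every other, with the least
    experienced pairings coming first (snapshots are ordered by experience)."""
    for i in range(len(snapshots)):
        for j in range(i + 1, len(snapshots)):
            for _ in range(games_per_pair):
                play_one_game(snapshots[i], snapshots[j])
```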
As far as the learning algorithm goes, I think the value updates could be carried all the way back to the initial state, with decreasing importance, and including exploratory moves. That is, we give credit to the whole sequence of moves, but with decreasing weight as we move back up the sequence. This could lead to faster learning, since greediness and exploration together produce good estimates and, because those estimates are carried back, convergence at the earlier levels might come sooner. However, this can also go wrong, so the hyperparameters would need a lot of tuning.
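A minimal sketch of such a decayed backward pass over one game's visited states (ALPHA and DECAY are assumed hyperparameter values, not prescribed anywhere above):

```python
ALPHA = 0.1   # step size (assumed value)
DECAY = 0.7   # per-step decay of credit as we move back up the game (assumed)

def backup_whole_game(values, episode_states, terminal_value):
    """Carry the final outcome back through the whole sequence of states
    visited in a game (including states reached by exploratory moves),
    with geometrically decreasing weight for earlier states."""
    target = terminal_value
    weight = 1.0
    for state in reversed(episode_states):
        values[state] += weight * ALPHA * (target - values[state])
        target = values[state]   # the next-older state is pulled toward this one
        weight *= DECAY          # decreasing importance up the sequence
```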
We can also try to model how α and ε should vary over time, differently for different opponents. The symmetry idea (from the second problem) can also be tried.
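For instance, one simple and purely illustrative way to model such variation is a decaying schedule for both parameters (the functional forms and constants here are assumptions, not derived from anything above):

```python
def alpha_schedule(t, alpha0=0.5, tau=1000.0):
    """One possible model for the step size: decay toward zero over time."""
    return alpha0 / (1.0 + t / tau)

def epsilon_schedule(t, eps0=0.2, eps_min=0.01, tau=1000.0):
    """One possible model for exploration: decay, but keep a floor so the
    agent never stops exploring entirely."""
    return max(eps_min, eps0 / (1.0 + t / tau))
```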
