You are on page 1of 9

http://lambda.uta.edu/cse5317/notes/node9.

html

Deterministic finite-state machine


From Wikipedia, the free encyclopedia
Jump to: navigation, search

In the theory of computation, a deterministic finite state machine (also known as


deterministic finite state automaton (DFSA) or deterministic finite automaton (DFA)
) is a finite state machine where for each pair of state and input symbol there is one and
only one transition to a next state. DFAs recognize the set of regular languages and no
other languages.

A DFA will take in a string of input symbols. For each input symbol it will then transition
to a state given by following a transition function. When the last input symbol has been
received it will either accept or reject the string depending on whether the DFA is in an
accepting state or a non-accepting state.

Advantages and disadvantages


DFAs are one of the most practical models of computation, since there is a trivial linear
time, constant-space, online algorithm to simulate a DFA on a stream of input. Given two
DFAs there are efficient algorithms to find a DFA recognizing:

• the union of the two DFAs


• the intersection of the two DFAs
• complements of the languages the DFAs recognize

Because DFAs can be reduced to a canonical form (minimal DFAs), there are also
efficient algorithms to determine:

• whether a DFA accepts any strings


• whether a DFA accepts all strings
• whether two DFAs recognize the same language
• the DFA with a minimum number of states for a particular regular language

DFAs are equivalent in computing power to nondeterministic finite automata.

On the other hand, Finite State Automata are of strictly limited power in the languages
they can recognize — many simple languages, including any problem that requires more
than constant space to solve, cannot be recognized by a DFA. The classical example of a
simply described language that no DFA can recognize is bracket language, that is,
language that consists of properly paired brackets, such as (()()). More formally the
language consisting of strings of the form anbn — some finite number of a's, followed by
an equal number of b's. If there is no limit to recursion (i.e., you can always embed
another pair of brackets inside) it would require an infinite amount of states to recognize
http://en.wikipedia.org/wiki/Deterministic_finite-
state_machine#References

Converting a Regular Expression into a Deterministic


Finite Automaton
The task of a scanner generator, such as flex, is to generate the transition tables or to
synthesize the scanner program given a scanner specification (in the form of a set of
REs). So it needs to convert a RE into a DFA. This is accomplished in two steps: first it
converts a RE into a non-deterministic finite automaton (NFA) and then it converts the
NFA into a DFA.

A NFA is similar to a DFA but it also permits multiple transitions over the same character
and transitions over . The first type indicates that, when reading the common character
associated with these transitions, we have more than one choice; the NFA succeeds if at
least one of these choices succeeds. The transition doesn't consume any input
characters, so you may jump to another state for free.

Clearly DFAs are a subset of NFAs. But it turns out that DFAs and NFAs have the same
expressive power. The problem is that when converting a NFA to a DFA we may get an
exponential blowup in the number of states.

We will first learn how to convert a RE into a NFA. This is the easy part. There are only 5
rules, one for each type of RE:

As it can been shown inductively, the above rules construct NFAs with only one final
state. For example, the third rule indicates that, to construct the NFA for the RE AB, we
construct the NFAs for A and B, which are represented as two boxes with one start state
and one final state for each box. Then the NFA for AB is constructed by connecting the
final state of A to the start state of B using an empty transition.

For example, the RE (a| b)c is mapped to the following NFA:

The next step is to convert a NFA to a DFA (called subset construction). Suppose that you
assign a number to each NFA state. The DFA states generated by subset construction have
sets of numbers, instead of just one number. For example, a DFA state may have been
assigned the set {5, 6, 8}. This indicates that arriving to the state labeled {5, 6, 8} in the
DFA is the same as arriving to the state 5, the state 6, or the state 8 in the NFA when
parsing the same input. (Recall that a particular input sequence when parsed by a DFA,
leads to a unique state, while when parsed by a NFA it may lead to multiple states.)

First we need to handle transitions that lead to other states for free (without consuming
any input). These are the transitions. We define the closure of a NFA node as the set of
all the nodes reachable by this node using zero, one, or more transitions. For example,
The closure of node 1 in the left figure below

is the set {1, 2}. The start state of the constructed DFA is labeled by the closure of the
NFA start state. For every DFA state labeled by some set {s1,..., sn} and for every
character c in the language alphabet, you find all the states reachable by s1, s2, ..., or sn
using c arrows and you union together the closures of these nodes. If this set is not the
label of any other node in the DFA constructed so far, you create a new DFA node with
this label. For example, node {1, 2} in the DFA above has an arrow to a {3, 4, 5} for the
character a since the NFA node 3 can be reached by 1 on a and nodes 4 and 5 can be
reached by 2. The b arrow for node {1, 2} goes to the error node which is associated with
an empty set of NFA nodes.

The following NFA recognizes (a| b)*(abb | a+b), even though it wasn't constructed with
the above RE-to-NFA rules. It has the following DFA:
2.2 Deterministic Finite Automata (DFAs)
A DFA represents a finite state machine that recognizes a RE. For example, the following
DFA:

recognizes (abc+)+. A finite automaton consists of a finite set of states, a set of transitions
(moves), one start state, and a set of final states (accepting states). In addition, a DFA has
a unique transition for every state-character combination. For example, the previous
figure has 4 states, state 1 is the start state, and state 4 is the only final state.

A DFA accepts a string if starting from the start state and moving from state to state, each
time following the arrow that corresponds the current input character, it reaches a final
state when the entire input string is consumed. Otherwise, it rejects the string.

The previous figure represents a DFA even though it is not complete (ie, not all state-
character transitions have been drawn). The complete DFA is:

but it is very common to ignore state 0 (called the error state) since it is implied. (The
arrows with two or more characters indicate transitions in case of any of these
characters.) The error state serves as a black hole, which doesn't let you escape.
A DFA is represented by a transition table T, which gives the next state T[s, c] for a state
s and a character c. For example, the T for the DFA above is:

a b c
0 0 0 0
1 2 0 0
2 0 3 0
3 0 0 4
4 2 0 4

Suppose that we want to build a scanner for the REs:

for - keyword = for


identifier = [a - z][a - z0 - 9]*

The corresponding DFA has 4 final states: one to accept the for-keyword and 3 to accept
an identifier:

(the error state is omitted again). Notice that for each state and for each character, there is
a single transition

Converting Deterministic Finite Automata to


Regular Expressions
Christoph Neumann
Mar 16, 2005
Abstract
This paper explores three techniques for converting deterministic _-
nite automata (DFA) to regular expressions and compares the usefulness
of each technique. The techniques examined are the transitive closure
method, the state removal method, and the Brzozowski algebraic method.
1 Background
Kleene's seminal article de_nes regular expressions and their relationship
to _nite automata [7]. Kleene proves the equivalence of _nite automata
and regular expressions thereby providing us with the _rst technique, the
transitive closure method, for converting DFAs to regular expressions.
Later Brzozowski expanded on Kleene's method by introducing the notion
of derivatives of regular expressions [3], but his paper passed into obscurity
until G. Berry and R. Sethi brought Brzozowski's paper to the forefront
in [2]1. The state removal method appears in [4] but Linz presents a more
straightforward method in [8].
2 De_nitions
We will using the Moore model [9] for _nite automata. Given an automaton
M with an input alphabet Ak = f0; 1; _ _ _ ; k 1g, M has m
states Ms = fq1; q2; _ _ _ ; qmg where q_ is the starting state of M and
Mf = fqf1 ; qf2 ; _ _ _ ; qfng are the n _nal states of M where n _ m. For
clarity, we will use letters to represent each value in the alphabet A instead
of using the numeric representation (a = 0; b = 1; etc), and for convenience
we will assume q1 is the starting state unless otherwise noted.
We will use Kleene's de_nition [7] of regular expressions. Regular
expressions are de_ned recursively as:
1. The symbols 0; 1; _ _ _ ; k 1; _; and _ are regular expressions, where
_ is the empty string and _ is the empty set.
1However, e_ciently converting regular expressions to automata is the focus of the G.
Berry and R. Sethi paper, not converting DFAs to regular expressions
1
2. Set Union: Given the regular expressions x and y, the union of x
and y, expressed as x + y, is a regular expression.
3. Concatenation: Given the regular expressions x and y, the concatenation
(or product) of x and y, expressed as xy is a regular
expression.
4. Iteration: Given the regular expression x, the iteration (or star) of
x, expressed as x_, is a regular expression.
5. Given a regular expression x, (x) is a regular expression.
6. All regular expressions can be constructed by a _nite application of
rules 1-5.
The regular expressions in rule 1 are terminals. The concatenation of
terminals is a string. For a given regular expression x, there is a set X
which contains all the strings represented by x. We will use x and X interchangeably.
A single string is simply represented by the set containing
that single string.
We will also refer to and use the following identities:
(ab)c = a(bc) = abc
_x = x_ = x
_x = x_ = _
_+x=x
_ + x_ = x_
(_ + x)_ = x_
Since + is commutative, all the commutative versions apply.
3 Transitive Closure Method
Suppose the given DFA M is to be represented as a regular expression.
q1 q2 q3 q4
bca
Figure 1: A very simple automaton.
Consider the automaton in _gure 1. The input for edge in the automaton
is a regular expression. Quite simply, the regular expression for
the transition from q1 to q2 is b, the transition from q2 to q3 is c and so
on. Furthermore, the regular expression representing the transition from
q1 to q3 is the concatenation of the regular expressions thus forming bc.
Thusly, we can _nd the regular expression for the automaton to be bca
since that expression is the concatenation of all of the transitions from
the starting state q1 to the _nal state q4.
More generally, for a path from q_ to qf , the concatenation of the regular
expression for each transition in the path forms a regular expression
that represents the same string as the path from q_ to qf in the automaton.
Supposing there exists only one unique path in automaton M from
2
q_ to qf , there exists only one regular expression R such that R represents
the same string as the DFA M. However, this is a trivial automaton, let
us examine how to expand this to a more general case.
q1 q2
b
ab
Figure 2: A more complicated automaton.
Now consider the DFA in _gure 2. It is clear that multiple paths exist
from q1 to q2. We cannot derive a simple regular expression to represent
the DFA, however using the other operators (union and iteration) we can
build on our previous approach to create a construction that works for
all types of DFA. We will forgo the proof and leave it to Kleene [7]. The
proof is also demonstrated by Hopcroft and Ullman [5] and explained
quite clearly by Salomaa [10].
Suppose regular expression Rij represents the set of all strings that
transition the automaton M from qi to qj . Furthermore, suppose Rk
ij
represents the set of all strings that transition the automaton M from qi to
qj without passing through any state higher than qk. We can construct Rij
by successively constructing R1
ij ;R2
ij ; _ _ _ ;Rm
ij . Rk
ij is recursively de_ned
as:
Rk
ij = Rk 1
ik (Rk 1
kk )_Rk 1
kj + Rk 1
ij
assuming we have initialized R0
ij to be:
R0
ij =8<
:
r if i 6= j and r transitions M from qi to qj
r + _ if i = j and r transitions M from qi to qj
_ otherwise
As we can see, this successive construction builds up regular expressions
until we have Rij . We can then construct a regular expression
representing M as the union of all R_f where q_ is the starting state and
f 2 Mf (the _nal states for M).
This technique is similar in nature to the all-pairs shortest path problem.
The only di_erence being that we are taking the union and concatenation
of regular expressions instead of summing up distances. This
solution is of the same form as transitive closure and belongs to the constellation
of problems associated with closed semirings.
The chief problem of the transitive closure approach is that it creates
very large regular expressions. Examining the formula for an Rk
ij , it is
clear the signi_cant length is due to the repeated union of concatenated
terms. Even by using the previous identities, we still have long expressions.
4 State Removal Method
The state removal approach identi_es patterns within the graph and removes
states, building up regular expressions along each transition. The
3
advantage of this technique over the transitive closure method is that it
is easier to visualize. This technique is described by Du and Ko [4], but
we examine a much simpler approach is given by Linz [8].
First, any multi-edges are uni_ed into a single edge that contains the
union of inputs. Suppose from q2 to q5 there is an edge a and an edge
b, those would be uni_ed into one edge from q2 to q5 that has the value
a + b.
Now, consider a subgraph of the automaton M which is of the form
given in _gure 3. State q may be removed and the automaton may be
reduced to the form in _gure 4. The pattern may still be applied if edges
are missing. For an edge that is missing, leave out the corresponding edge
in _gure 4.
This process repeats until the automaton is of the form in _gure 5.
Then by direct calculation, the regular expression is:
r = r1 _ r2(r4 + r3r1 _ r2) _ :
qi q qj
ab
dc
e
Figure 3: Desired pattern for state removal.
qi qj
ae_b
ce_d
ae_d ce_d
Figure 4: Results after state removal.
q1 qf
r2
r3
r1 r4
Figure 5: Final form.
4
5 Brzozowski Algebraic Method
Brzozowski method [3]2 takes a unique approach to generating regular
expressions. We create a system of regular expressions with one regular
expression unknown for each state in M, and then we solve the system
for R_ where R_ is the regular expression associated with starting state
q_. These equations are the characteristic equations of M.
Constructing the characteristic equations is straightforward. For each
state qi in M, the equation for Ri is a union of terms. Each term can be
constructed like so: for a transition a from qi to qj , the term is aRj . If
Ri is a _nal state, _ is also one of the terms. This leads to a system of
equations in the form:
R1 = a1R1 + a2R2 + _ _ _
R2 = a1R1 + a2R2 + _ _ _
R3 = a1R1 + a2R2 + _ _ _ + _
...
=
...
Rm = a1R1 + a2R2 + _ _ _ + _
where ax = _ if there is no transition from Ri to Rj .
The system can be solved via straightforward substitution, except
when an unknown appears on both the right and left hand side of the
equation. This situation occurs when there is a self loop for state qi. Arden's
theorem [1] is the key to solving these situations. The theorem is as
follows:
Given an equation of the form X = AX + B where _ =2 A, the
equation has the solution X = A_B.
We use this equation to isolate Ri on the left hand size and successively
substitute Ri into the another equation. We repeat the process until we
have found R_ with no unknowns on the right hand side.
For example, consider again the automaton in _gure 2. The characteristic
equations are as follows (where R_ = R1):
R1 = aR1 + bR2
R2 = bR2 + _
We solve for R2 using Arden's theorem and the previously mentioned
identities:
R2 = bR2 + _
= b__
= b_
We substitute into R1 and solve:
R1 = aR1 + b(b_)
= aR1 + bb_
= a_(bb_)
= a_bb_
Thus, the regular expression for the automaton in _gure 2 is a_bb_.
2Kain [6] explains this method in more detail and gives illustrative examples.
5
6 Conclusions
The state removal approach seems useful for determining regular expressions
via manual inspection, but is not as straightforward to implement as
the transitive closure approach and the algebraic approach. The transitive
closure approach gives a clear and simple implementation, but tends to
create very long regular expressions. The algebraic approach is elegant,
leans toward a recursive approach, and generates reasonably compact regular
expressions. Brzozowski's method is particularly suited for recursion
oriented languages, such as functional languages, where the transitive closure

approach would be cumbersome to implement.

You might also like