
Natural Language Processing Part 1: Words

Finite state morphology


Maria Holmqvist, marho@ida.liu.se

Part 1 Designing an FST for lexical parsing of Swedish nouns


The task was to design a finite state transducer for translating Swedish
noun word forms into a lexical description that contains the lemma,
part-of-speech category, gender, definiteness, number and case. An FST was
written for each of the five declensions of Swedish, listed here with
example nouns:
Declension   Examples
1            flicka, ros
2            pojke, spik
3            bild, sko
4            byte
5            hus

To combine these five transducers into one, a simple procedure was used:
all states and transitions were included in one large transducer for all
noun declensions, with every FST given the same start and end state. This
larger FST for all nouns contains 36 states.
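
The merging step can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl implementation described in this report: the function name combine, the dictionary representation and the two toy transducers are assumptions made for the example. Each FST is assumed to use state 0 as its start state and to have exactly one accepting state.

```python
def combine(fsts):
    """Merge FSTs so that they share one start state (0) and one end state.

    Each FST is a pair (transitions, final). transitions maps a state
    number to a list of (input, output, next_state) triples.
    """
    fresh = 1          # next unused state number in the merged machine
    pending = []       # (renaming, transitions, final) for every FST
    for transitions, final in fsts:
        states = set(transitions)
        for arcs in transitions.values():
            states.update(nxt for _, _, nxt in arcs)
        rename = {0: 0}                      # the start state is shared
        for s in sorted(states - {0, final}):
            rename[s] = fresh                # internal states get new numbers
            fresh += 1
        pending.append((rename, transitions, final))

    shared_final = fresh                     # one accepting state for all
    merged = {0: [], shared_final: []}
    for rename, transitions, final in pending:
        rename[final] = shared_final
        for state, arcs in transitions.items():
            merged.setdefault(rename[state], []).extend(
                (inp, out, rename[nxt]) for inp, out, nxt in arcs
            )
    return merged, shared_final


# Two toy transducers (hypothetical, much smaller than the real ones).
hus = ({0: [("hus", "hus", 1)], 1: [("E", " N NEU", 2)], 2: []}, 2)
byte = ({0: [("byte", "byte", 1)], 1: [("E", " N NEU", 2)], 2: []}, 2)
merged, final = combine([hus, byte])
```

In this sketch the merged machine has one start state, one end state and the renumbered internal states of each input FST, so the start state simply fans out into one branch per lemma.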

Part 2 Implementing the Finite state transducer


The lexical parser is made up of two components, (i) the transducer
description and (ii) a program for processing an input and traversing
through the states of the transducer.
FST description
The description of the finite state transducer is a simplified version of the
table format in Jurafsky and Martin (2000). Only the significant cells from
the table are included in this description, i.e., only the valid transitions
between states. Table 1 contains the description of the FST for inflections
of the noun ros. The first column contains all states and the rest of the row
contains alternative paths from each state. The path from a state has the
format x:y:n, where x is the accepted input, y the output and n the new
state. Starting at state 0, the only valid input is ros. All other inputs will
result in a failed analysis. When encountering input ros, ros will be
output and we move to state 1. In state 1, the valid input is or, and so on.
Compare this description with the FST figure above.
Three special symbols are used in the description. The symbol E is used
as the epsilon symbol. If E stands in input position, it means that we
move down this path without looking at the input. If E is in output
position, nothing will be output. The #-symbol is used to denote
end-of-string. The states in the first column are marked with a :-symbol
if they are accepting states.

State   Legal transitions
0       ros:ros:1
1       E: N UTR SG:2     or: N UTR PL:3
2       en: DEF:4         E: INDEF:5
3       na: DEF:4         E: INDEF:4
4       #: NOM:6          s#: GEN:6
5       #: NOM:6          #: GEN:6
6:
Table 1. The FST for ros
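
Reading a description in the x:y:n format could look roughly like the Python sketch below. It is not the Perl program from this report. It assumes that the fields on each row are separated by tabs, which the report does not specify, and the names parse_fst and ROS_DESCRIPTION are made up for the example. Each path is split at its first and last colon, so that outputs containing spaces, such as " N UTR SG", are kept intact.

```python
def parse_fst(text):
    """Parse an FST description into (transitions, accepting).

    Assumed layout: one state per row, fields separated by tabs, and
    each path written as input:output:next_state.
    """
    transitions = {}
    accepting = set()
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        label = fields[0].strip()
        if label.endswith(":"):          # a trailing ':' marks an accepting state
            state = int(label[:-1])
            accepting.add(state)
        else:
            state = int(label)
        arcs = []
        for path in fields[1:]:
            if not path.strip():
                continue
            inp, rest = path.split(":", 1)     # input is before the first ':'
            out, nxt = rest.rsplit(":", 1)     # next state is after the last ':'
            arcs.append((inp.strip(), out, int(nxt)))
        transitions[state] = arcs
    return transitions, accepting


# Table 1 written out with tab-separated fields (assumed file layout).
ROS_DESCRIPTION = "\n".join([
    "0\tros:ros:1",
    "1\tE: N UTR SG:2\tor: N UTR PL:3",
    "2\ten: DEF:4\tE: INDEF:5",
    "3\tna: DEF:4\tE: INDEF:4",
    "4\t#: NOM:6\ts#: GEN:6",
    "5\t#: NOM:6\t#: GEN:6",
    "6:",
])
transitions, accepting = parse_fst(ROS_DESCRIPTION)
```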

The description of finite state transducers for the five noun declensions in
Swedish can be found here:
ros.fst, flicka.fst, spik.fst, pojke.fst, bild.fst, sko.fst, byte.fst, hus.fst,
and the combined FST here:
noun.fst
Implementation
The program for transforming a noun from morphological to lexical level
was implemented in Perl and can be found here. When the user specifies a
word, the program will output the lexical description(s) of the word or else
produce Failed.
The FST-description is supplied to the program as a command line
argument.
> perl fst.pl noun.fst
Write a word and press Enter. (q = quit):
spik
spik N UTR SG INDEF NOM
Ambiguous input
The program will produce all possible morphological analyses of
ambiguous word forms like ros and hus:
ros
ros N UTR SG INDEF NOM
ros N UTR SG INDEF GEN

hus
hus N NEU SG INDEF NOM
hus N NEU SG INDEF GEN
hus N NEU PL INDEF NOM
hus N NEU PL INDEF GEN

This is done by keeping a stack of all transitions in progress and then
processing each transition on the stack one step forward in the FST and
then putting this new result back on the stack. We can exemplify this by
analysing the word form ros and using the FST for ros in Table 1.
After seeing input ros we will be in state 1. Since there has been no
ambiguity so far we only have one transition in progress on our stack.
This transition contains three pieces of information: the remaining input
(#), output so far (ros) and the current state (1):
Stack:

#, ros, 1

After two E-transitions our stack still contains only one alternative:
Stack:

#, ros N UTR SG INDEF, 5

We pop this transition from the stack and for each alternative path given
the remaining input # we create a new transition and put it on the stack.
Both paths consume the #, so the remaining input is now empty:
Stack:

, ros N UTR SG INDEF NOM, 6
, ros N UTR SG INDEF GEN, 6

In the next round we pop the first transition and find that the newly
produced states are accepting states and that there is no input left to
process. The analysis was a success and the output strings are printed.
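
The stack-based traversal walked through above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Perl program itself: the function name analyse and the dictionary encoding of Table 1 are made up for the example, and the sketch assumes the FST contains no cycles of E-transitions, since such a cycle would make the loop run forever.

```python
# Table 1 encoded as state -> list of (input, output, next_state).
ROS = {
    0: [("ros", "ros", 1)],
    1: [("E", " N UTR SG", 2), ("or", " N UTR PL", 3)],
    2: [("en", " DEF", 4), ("E", " INDEF", 5)],
    3: [("na", " DEF", 4), ("E", " INDEF", 4)],
    4: [("#", " NOM", 6), ("s#", " GEN", 6)],
    5: [("#", " NOM", 6), ("#", " GEN", 6)],
    6: [],
}
ACCEPTING = {6}


def analyse(word, transitions, accepting, start=0):
    """Return every lexical description the FST produces for word."""
    # Each stack entry: (remaining input, output so far, current state).
    stack = [(word + "#", "", start)]
    results = []
    while stack:
        remaining, output, state = stack.pop()
        if state in accepting and remaining == "":
            results.append(output)           # a complete analysis
            continue
        # Push in reverse so the first listed path is popped first.
        for inp, out, nxt in reversed(transitions.get(state, [])):
            emitted = "" if out == "E" else out
            if inp == "E":                   # epsilon: consume no input
                stack.append((remaining, output + emitted, nxt))
            elif remaining.startswith(inp):  # consume the matched input
                stack.append((remaining[len(inp):], output + emitted, nxt))
    return results


for analysis in analyse("ros", ROS, ACCEPTING) or ["Failed"]:
    print(analysis)
```

An entry that reaches a state with no matching path is simply dropped, so a word outside the FST yields an empty result list, which the last two lines report as Failed.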

References
Daniel Jurafsky and James H. Martin (2000). Speech and Language
Processing.
