Urdu Morphology, Orthography and Lexicon Extraction

Muhammad Humayoun
Department of Mathematics, University of Savoie, France
mhuma@univ-savoie.fr

Harald Hammarström, Aarne Ranta
Department of Computer Science, Chalmers University of Technology, Sweden
{harald2,aarne}@cs.chalmers.se

Abstract

Urdu is a challenging language because of, first, its Perso-Arabic script and, second, its morphological system, which has inherent grammatical forms and vocabulary of Arabic, Persian and the native languages of South Asia. This paper describes an implementation of the Urdu language as a software API, and we deal with orthography, morphology and the extraction of the lexicon. The morphology is implemented in a toolkit called Functional Morphology (Forsberg & Ranta, 2004), which is based on the idea of treating grammars as software libraries. Therefore this implementation could be reused in applications such as intelligent search of keywords, language training and infrastructure for syntax. We also present an implementation of a small part of Urdu syntax to demonstrate this reusability.

1 Introduction

Urdu is an Indo-Aryan language widely spoken in Pakistan, India, Jammu & Kashmir and Bangladesh. It is also spoken all over the world owing to the large South Asian diaspora. Urdu is closely related to Hindi and shares its morphology, syntax and almost all of its phonology with it. However, the two languages differ in their script, some of the phonology and vocabulary. Urdu has a strong Perso-Arabic influence in its vocabulary and is written in a cursive, context-sensitive Perso-Arabic script from right to left, whereas Hindi has a strong influence of Sanskrit and is written in the Devanagari script from left to right.

Despite the differences, both languages share grammar (morphology + syntax) and a huge amount of vocabulary. Urdu-Hindi together is the second most widely spoken language in the world, with 1,017,290,000 speakers (native + second language), after Chinese (Rahman, 2004).

Today the state-of-the-art technology for writing morphologies is to use special-purpose languages based on finite-state technology. The most well-known is XFST (Xerox Finite State Tool), which is based on regular expressions. In our opinion, these languages are still close to machine code. Therefore, we emphasize using a higher level language to capture the linguistic abstraction. That higher level code can then be translated into finite state code by some tool if required.

2 Goals and Contributions

Implementing a grammar for Urdu also requires dealing with orthography. In this paper we present the process of creating the following resources:

1) The morphology component as an open source software API having:
   i) A type system that covers the language abstraction of Urdu completely.
   ii) An inflection engine that covers word-and-paradigm morphological rules for Urdu for every word class.
   iii) Rules for automatic lexicon extraction using the extract tool (Forsberg, Hammarström & Ranta, 2006).
   iv) A lexicon of 4,816 words.
2) An orthography component containing the Unicode infrastructure for the Urdu morphology API to accommodate the Perso-Arabic script of Urdu, including a GUI application and useful tools.
3) A syntax component containing the implementation of a small part of Urdu syntax in Grammatical Framework (Ranta, 2004) by reusing the above-mentioned components.
The overall picture of the Urdu morphology is shown in the following diagram:

[Figure 1: Urdu Functional Morphology. Diagram labels: Infrastructure (Application + Demo; Urdu Script (Unicode enabled Urdu); Unicode; Transliteration; ASCII / Roman Urdu). Language Dependent Part (Urdu): Urdu Functional Morphology; Morphology (Types, Rules, Lexicon). FM API, Language Independent Module: Dictionary format, Analyzer, Synthesizer, Exporter.]

3 Urdu Orthography and Unicode Infrastructure

Urdu has an alphabet of 57 letters (Afzal & Hussain, 2001, p 2) and 15 diacritic marks (Aərɑb/Hərkɑt̪). Urdu orthography inherits some characteristics from Arabic, such as the optional use of diacritic marks. Similarly, the short vowels are not considered letters of their own but are applied above or below a consonant by using appropriate diacritics.

In this work, we decided to store all the data related to Urdu morphology in ASCII characters so that:
• It can be viewed and manipulated easily on different platforms.
• The same inflection engine can be reused for Hindi morphology in the future, just by adding a lexicon and a transliteration scheme for Hindi.

Since we want the end product to support the Unicode character set, a clear, strict and reversible transliteration scheme is defined for both the letters and the diacritic marks of the Urdu script. This transliteration function is one-to-one, i.e. one string in Urdu gives exactly one roman string and vice versa (but one Urdu character need not correspond to exactly one roman character). It takes Urdu script (represented by Unicode character codes) or Roman transliteration and transliterates it into the opposite system of writing (Urdu ↔ Roman). In a similar way, the corresponding phonetic string can also be retrieved unambiguously from the Roman or Urdu form. It is implemented in Java by using the Transliterator class of ICU4J (ICU4J 3.4, 2006), which is an open source API for software internationalization.

The roman correspondences to the Urdu letters were mostly chosen to be reminiscent of the prototypical pronunciation values of the ASCII characters. However, precedence is given to the widely used keys in Urdu keyboard layouts and to the characters used in roman Urdu on the Internet. This makes the transliteration easy for Urdu users to adapt to.

Following are some example words converted from Urdu script to the equivalent Roman transliteration:

       Urdu              Meaning     Roman
   1   کَوشِش (koʃɪʃ)    Struggle    k(a)wX(i)X
   2   بھاگ (bhɑg)       Run         b|hag
Table 1: Transliteration examples

Other than the Urdu transliteration component, the following utilities are developed as a part of the Unicode infrastructure:
• The Urdu Unicode Extractor
• The Urdu Keyboard Input Method
• The Main GUI application

The Urdu Extractor extracts Urdu Unicode text from web pages, and has further been used for the collection of the corpus (Section 6). An on-screen Urdu keyboard is developed so that a user can type Urdu without installing a specific Urdu keyboard. To render and display Urdu correctly, an Urdu font has been embedded inside this tool.

The above-mentioned tools are then combined and used to build a main GUI application that interfaces the Functional Morphology (henceforth FM) runtime system with Java to provide morphological analysis, both in Urdu script and in Roman. We interface four kinds of analyses, which are provided by FM as a part of its runtime system.

4 Functional Morphology Toolkit

FM is a toolkit for morphology development in Haskell (Forsberg & Ranta, 2004). It is based on the idea of using the high expressiveness of functional languages to define morphology. The use of Haskell gives access to powerful programming constructs and a high level of abstraction.
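The idea of defining a morphology directly in a functional language can be illustrated with a minimal, self-contained sketch. It is not FM code: the names `Noun`, `dropEnd`, `takeEnd`, `nounLarka` and `table` are assumptions made for illustration only. The rule it encodes is the noun group described in Section 5.1 (singular masculine nouns ending in a, h or e), written in the roman transliteration.

```haskell
-- Inflectional parameters as algebraic data types.
data Number = Sg | Pl        deriving (Show, Eq, Enum, Bounded)
data Case   = Nom | Obl | Voc deriving (Show, Eq, Enum, Bounded)

-- A noun is a finite function from its parameters to a word form.
-- (Simplified: FM itself wraps the result in its own string type.)
type Noun = Number -> Case -> String

-- Hypothetical helpers: drop or keep the last n characters.
dropEnd, takeEnd :: Int -> String -> String
dropEnd n s = take (length s - n) s
takeEnd n s = drop (length s - n) s

-- Paradigm for singular masculine nouns ending in a, h or e.
-- Endings replace the last letter, except after a final "e",
-- where they are appended.
nounLarka :: String -> Noun
nounLarka sg n c = case (n, c) of
  (Sg, Nom) -> sg
  (Sg, _)   -> stem ++ "E"
  (Pl, Nom) -> stem ++ "E"
  (Pl, Obl) -> stem ++ "wN"
  (Pl, Voc) -> stem ++ "w"
  where
    stem | takeEnd 1 sg == "e" = sg
         | otherwise           = dropEnd 1 sg

-- The full inflection table, obtained by enumerating all parameters.
table :: Noun -> [((Number, Case), String)]
table noun = [ ((n, c), noun n c) | n <- [minBound ..], c <- [minBound ..] ]
```

For instance, `nounLarka "l(a)R'ka" Pl Obl` evaluates to `"l(a)R'kwN"`, and `table` lists all six forms of a noun. Because the parameters are ordinary enumerable types, the table is complete by construction.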
This level of abstraction is very useful for capturing generalizations about a natural language.

The FM toolkit is a successful experiment in how morphology can be implemented in Haskell, with Haskell as the host language and FM acting as a domain-specific embedded language. The productivity and reliability of FM have already been demonstrated by successful implementations of the morphologies of Swedish, Italian, Russian, Spanish and Latin (FM, 2004).

4.1 Functional Morphology overview

FM consists of two parts: the language independent part and the language dependent part, as shown in Figure 1.

A morphology implementer takes the language independent part for granted. It consists of, first, an infrastructure for dictionary compilation; second, the runtime applications (analyzer, synthesizer); and third, the data export utility (Translator). The Translator can export a morphology in many formats, such as XFST & LEXC, SQL database and Grammatical Framework (henceforth GF) grammar source code, which widens its usability. FM consists of three main type classes: Param, Dict and Language. These type classes enable code reuse and provide generic algorithms for analysis, synthesis and code generation.

The language dependent part of the system is the one that a morphology implementer has to provide for a certain language, which is Urdu in our case. The implementation then appears as a new library on top of the language independent part of FM. The language dependent part consists of, first, a type system; second, an inflection engine; and third, a lexicon.

5 Urdu Morphology

In its morphology, Urdu is quite similar to other Indo-European languages, e.g. in having a concatenative inflectional morphological system. However, some differences can be found in the case of causative verbs, which also exhibit stem-internal changes in some cases. In this section we discuss the Urdu morphology and then present our solution, which expresses it in FM. In morphology, we do not deal with words that would require analysis across multi-token units (some loan derivational affixes), hence leaving them to be dealt with at the syntax level.

5.1 Nouns

In Urdu, nouns are inflected for number and case, and have an inherent gender that is either masculine or feminine. Urdu has three cases for nouns (Schmidt, 1999, p 7): nominative, oblique and vocative. The following types are defined for case and number:

    data Case = Nom | Obl | Voc
    data Number = Sg | Pl

The inflectional types Case and Number are then combined into a type named NounForm, which is further used to define a type Noun.

    data NounForm = NF Number Case
    type Noun = NounForm -> Str

All types are language dependent and, to be able to use the common API functions, they must be valid instances of the Param class. We do it in the following way:

    instance Param Case where values = enum
    instance Param Number where values = enum
    instance Param NounForm where
      values = [NF n c | n <- values, c <- values]

Nouns can be divided into different groups based on their inflection. We divide nouns on the basis of the ending letters of their singular forms. We started from the groups given by (Siddiqi, p 287, 289, 302-304) and (Schmidt, 1999, p 4). However, changes were made in order to group the words from a purely morphological perspective. This resulted in fifteen groups, including three groups for Arabic loan nouns and two for Persian loan nouns. We show one group below as a running example:

• Singular masculine nouns ending with (ا, a), (ہ, h) and (ع, e):

This group also includes the Arabic loan nouns ending with (ہ, h). According to this rule, first, if a word ends with the letter (ا, a) or (ہ, h) then:
• To make the plural nominative and singular oblique, the last letter is replaced by the letter (ے, E)
• To make the plural oblique, the last letter is replaced by the string (وں, wN) and
• To make the plural vocative, the last letter is replaced by the letter (و, w)

Second, if a word ends with (ع, e) then the rules remain the same as above, except that the above-mentioned letters are added at the end of the word without replacing any existing letter.

Following is a table displaying the forms of this group, e.g. (ləɽkɑ, boy).

               Nominative   Oblique   Vocative
   Singular    ləɽkɑ        ləɽke     ləɽke
   Plural      ləɽke        ləɽkoɳ    ləɽko
Table 2: An example noun group

This group is defined in the inflection engine in the following way:

    noun_lRka :: DictForm -> Noun
    noun_lRka lRka (NF n c) =
      mkNoun sg sg_obl pl pl_obl sg_voc pl_voc n c
      where
        sg     = lRka
        pl     = lRk ++ "E"
        sg_obl = pl            -- same form as the plural nominative (Table 2)
        sg_voc = pl            -- same form as the singular oblique (Table 2)
        pl_obl = lRk ++ "wN"
        pl_voc = lRk ++ "w"
        -- words ending in "e" keep their last letter
        lRk = if end == "e" then lRka else tk 1 lRka
        end = dp 1 lRka

This function generates the appropriate forms for the different cases and then passes them as parameters to a more generic function (mkNoun). It is defined in the following way:

    mkNoun :: String -> String -> String -> String ->
              String -> String -> Number -> Case -> String
    mkNoun sg sg_Obl pl pl_Obl sg_Voc pl_Voc n c =
      case (n,c) of
        (Sg, Nom) -> sg
        (Sg, Obl) -> sg_Obl
        (Sg, Voc) -> sg_Voc
        (Pl, Nom) -> pl
        (Pl, Obl) -> pl_Obl
        (Pl, Voc) -> pl_Voc

Then an interface function for this group is defined in BuildUrdu.hs, which is a coordinator between the type system, the inflection engine and the lexicon, in the following way:

    n1 :: DictForm -> Entry
    n1 df = masculine (noun_lRka df)

DictForm is a string type, and masculine is a function applied to the inflection functions that are written for masculine words. Then this interface function n1 is added in CommandsUrdu.hs to let it behave as a command for the lexicon, and words are added to the lexicon in the following way:

    n1 l(a)R'ka        (transliteration of ləɽkɑ)

5.2 Verbs

Urdu verbs are complex compared to the other word classes. The Urdu verb inflects for tense, mood, aspect, gender and number. Many verb auxiliaries are used to represent the tense, mood and aspect of a verb. Furthermore, these auxiliaries also inflect as normal verbs.

The Urdu verb shows direct and indirect causative behavior. In general, for each verb there exists at least one stem form, which may be intransitive, transitive, etc. This basic stem form then normally gives rise to two other forms (direct and indirect causatives) of that verb. These three forms are actually regular verbs. They conjugate independently and can have similar or different meanings from each other. For example, consider the verb (bən, be made):

                                    Root     Infinitive   Oblique
   Intransitive / (di)Transitive    bən      bənnɑ        bənne
   Direct Causative                 bənɑ     bənɑnɑ       bənɑne
   Indirect Causative               bənwɑ    bənwɑnɑ      bənwɑne
Table 3: An example verb

(bənnɑ, be made), (bənɑnɑ, to make/cause to make) and (bənwɑnɑ, cause to be made) in Table 3 are the masculine infinitive forms of three regular verbs.

5.2.1 Verb categories

From the perspective of morphology, we divide verbs into the following categories:
1) Verbs having only the basic stem form, while direct and indirect causatives do not exist
2) Verbs having the basic stem form as well as direct and indirect causatives. The direct and indirect causatives are made by:
   i) Rules
   ii) Irregulars
3) Verbs having only the basic and direct causative forms, while the indirect causative does not exist.
4) Verbs having only the basic and indirect causative forms, while the direct causative does not exist.

Morphologically, in Urdu, a verb inflects in:
• Gender (Masculine, Feminine)
• Number (Singular, Plural)
• Person (First, Second {casual, familiar, respectful}, Third)
• Mood & Tense (Subjunctive, Perfective, Imperfective)

We show the implementation of the second category as a running example. We define it in the type system in the following way:

    type Verb = VerbForm -> Str

    data VerbForm =
        VF Tense Person Number Gender
      | Caus1 Tense Person Number Gender
      | Caus2 Tense Person Number Gender
      | Inf | Caus1_Inf | Caus2_Inf
      | Inf_Fem | Caus1_Inf_Fem | Caus2_Inf_Fem
      | Inf_Obl | Caus1_Inf_Obl | Caus2_Inf_Obl
      | Root | Caus1_Root | Caus2_Root

    data Tense = Subj | Perf | Imperf

The first three constructors in the definition of VerbForm provide verb analysis for the basic verb form and the direct and indirect causative forms respectively. Similarly, Inf, Caus1_Inf and Caus2_Inf are the masculine infinitive forms; Inf_Fem, Caus1_Inf_Fem and Caus2_Inf_Fem are the feminine infinitive forms; Inf_Obl, Caus1_Inf_Obl and Caus2_Inf_Obl are the oblique infinitive forms; and Root, Caus1_Root and Caus2_Root are the root forms.

As described above, the verbs belonging to this category can be formed by rules; however, there is also a large number of irregular verbs. We show the worst-case function, written for irregular verbs:

    mkVerbCaus12 :: String -> String -> String -> Verb
    mkVerbCaus12 vInf caus1_inf caus2_inf =
      mkGenVerb root r1 r2 vInf caus1_inf caus2_inf
      where
        root = tk 2 vInf          -- drop the two-letter infinitive ending
        r1   = tk 2 caus1_inf
        r2   = tk 2 caus2_inf

In this function we provide the basic, direct and indirect causative forms as arguments. As the conjugation of verbs in Urdu is very regular, a complete inflection can be built from these three forms. An interface function v4 is defined for this group, and verbs are added to the lexicon as follows:

    v4 m(i)l'na m(i)lana m(i)l'wana

A general function mkGenVerb is used to produce the complete verb conjugation from the morphological point of view.

5.3 Adjectives, Adverbs and closed classes

In a similar way, adjectives, adverbs, pronouns, postpositions, particles and numerals have been implemented with a similar level of detail.

6 The Lexicon

A wide-coverage lexicon is a key part of any morphological implementation. Today, most lexicons are built manually, which is a very time-consuming task. We aim to build a lexicon automatically with minimal human effort. We use a tool named extract, which is primarily designed for morphologies developed in FM.

A morphology implementer provides a paradigm file and a corpus to the tool. The tool reads the rules for all paradigms, searches the corpus for words that fulfill the definition of a paradigm, and extracts them along with the name of the fulfilled paradigm.

For the extraction of the Urdu lexicon, the first step was to collect a reasonable amount of Urdu Unicode text to make a corpus. We developed an Urdu corpus of 1.5 million words from the news and literature domains (book banks and news on the Internet). It was tokenized on spaces and punctuation marks, keeping the diacritics. This yielded 63,700 unique words.

It is interesting to note that the unique words are considerably fewer (23.86 times fewer) than the total words in the corpus. This conforms well to our intuition that highly frequent items, such as postpositions, auxiliaries, particles and pronouns, account for most tokens in Urdu text.

We devised 26 rules (6 for verbs, 19 for nouns, 1 for adjectives) to write a paradigm file for Urdu. Let us look at the rule defined for irregular verbs that have basic, direct and indirect causative forms:

    paradigm v4 = x+"na" x+"ana" x+"wana"
      { x+"na" & (x+"ana" | x+"wana") };

Each match is output as the paradigm name followed by its word forms, in a format that is saved directly in the lexicon.
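The effect of such a rule can be sketched in a few lines of Haskell. This is not the extract tool and does not reproduce its algorithm: `extractV4` is a hypothetical function, and it assumes the rule is read as "propose an entry for every stem x such that x+"na" occurs in the corpus together with x+"ana" or x+"wana"". The sample words in the usage note are assumed unvocalized transliterations of the verb in Table 3.

```haskell
import Data.List (isSuffixOf, nub, sort)

-- Hypothetical stand-in for the v4 rule:
--   head forms:  x+"na"  x+"ana"  x+"wana"
--   constraint:  x+"na" & (x+"ana" | x+"wana")
-- Returns one lexicon line per stem that satisfies the constraint.
extractV4 :: [String] -> [String]
extractV4 corpus =
  [ unwords ["v4", x ++ "na", x ++ "ana", x ++ "wana"]
  | w <- vocab
  , "na" `isSuffixOf` w                    -- candidate for x+"na"
  , let x = take (length w - 2) w          -- strip the "na" ending
  , not (null x)
  , (x ++ "ana") `elem` vocab || (x ++ "wana") `elem` vocab
  ]
  where
    vocab = sort (nub corpus)              -- unique word forms of the corpus
```

Given the word list `["bnna", "bnana", "bnwana", "ktab", "lyna", "hwna"]`, only the stem `bn` satisfies the constraint, so the single entry `v4 bnna bnana bnwana` is proposed. Requiring both causative forms (replacing `||` with `&&`) would make the rule stricter, in the sense of strictness discussed in this section.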
Applied to our corpus, the rule v4 produces entries such as:

    v4 dyk|hna d(i)k|hana d(i)k|hwana

The tool is then applied to the corpus along with the paradigm file, resulting in an Urdu lexicon of 9,126 words. This result could vary with respect to, first, the occurrence of misspellings, foreign words, numeric expressions, pronouns etc. in the corpus; second, the knowledge of the lexical distribution of the language; and third, the level of strictness in the paradigm rules. Here strictness means a tighter definition of a paradigm rule by requiring more word forms as a condition.

Like most other Arabic script-based languages, Urdu is commonly written without diacritic marks, or with a varying number of them, in electronic and print media. This is a fundamental obstacle to obtaining a fully vocalized corpus from which to build a lexicon. This situation may also lead to the problem of having several versions of a word with different diacritic information in the automatically extracted lexicon. For example, for the word kɪt̪ɑb, we may get (کِتاب, kɪt̪ɑb) and (کتاب, kt̪ɑb) in the lexicon, which are two orthographically different words representing the same word. Therefore, in such cases, it is most desirable to save only one version of such words in the lexicon, with full diacritics.

However, words with different diacritic information are not always the same word. They may differ in their meanings, e.g. (تَیر, t̪ær, to swim) and (تِیر, t̪ir, arrow). In such cases, it is important to save all such words in the lexicon with full diacritics.

Further, since the use of Urdu on the Internet is relatively new, we also expected a relatively high number of spelling mistakes in the extracted corpus. Therefore, to be sure about the correctness of the lexicon with respect to the points raised above, we manually re-checked the lexicon word by word, and all incorrect entries were thrown away, resulting in a lexicon of 4,816 words generating 137,182 word forms. However, we did not add the missing diacritics to partly vocalized words, which can be seen as a fundamental limitation of our lexicon.

The manually checked lexicon (4,816 words) is approximately half (52.8%) of the extracted lexicon (9,126 words). We found that the incorrect entries are mostly due to, first, spelling mistakes; second, the lack of spaces between words or extra spaces inside words; and third, the use of foreign words, e.g. the use of Arabic and Persian text in Urdu, mostly in religious as well as slightly older literary text, where the text is normally supported by Quranic verses and Persian poetry. Similarly, text from the news domain shows a large number of proper nouns and foreign words taken from English.

Following are the results altogether:

                        Words        Diacritic words
   Corpus               1,520,000    23,696
   Unique               63,700       6,633
   Extracted lexicon    9,126        632
   Clean lexicon        4,816        415
Table 4: Results

7 Urdu Syntax

Despite the fact that Urdu is an Indo-European language, its syntax shows many differences from the other Indo-European languages, due to the inherent features of Arabic, Persian and the native languages of the Indo-Pak subcontinent. The pragmatically neutral constituent order in Urdu is SOV (Subject Object Verb).

To show the usability and effectiveness of our approach, we provide an implementation of a small part of Urdu syntax in GF, which is an open source special-purpose programming language for defining grammars.

We port the Urdu morphology API from FM to GF by using the data export utility of FM. We then apply some preprocessing and save the lexicon directly in Unicode Urdu for GF, and we build the syntax as a separate part of the system on top of the morphology.

In GF, a grammar is a combination of two parts: the abstract syntax and the concrete syntax. Below we show some functions of the abstract syntax from our implementation:

    fun UsePastS: NP -> VP -> S;
    fun UsePresS: NP -> VP -> S;

In our implementation, a sentence can be formed, for example:
• By combining a noun phrase (NP) and a verb phrase (VP).
• By adding a conjunction between two sentences.

We show the concrete syntax for the above two functions along with some explanation:
    lin UsePastS np vp =
      { s = np.s ! Nom ++ vp.s ! Past ! np.p ! np.n ! np.g } ;

This linearization rule states that the nominative form of a noun phrase can be combined with the verb phrase (which is a past tense auxiliary in this case), and that both must agree in their Person, Number and Gender parameters, e.g. (ye mera qələm t̪ha, It was my pen).

    lin UsePresS np vp = { s =
      np.s ! Obl ++ "کو" ++ vp.s ! Present ! np.p ! np.n ! np.g } ;

Similarly, this is the linearization rule for one of the functions responsible for building sentences having a noun phrase (which is a Pronoun + Postposition) and a verb phrase (which is a Verb + Auxiliary), e.g. (is ko kɪt̪ɑbeɳ leni heɳ, He/she is supposed to take the books).

In a similar fashion, we have implemented the noun phrases and verb phrases. We show some of the implemented rules for them below (see footnote 1 for the category names):

    DemPron -> Num -> CN -> NP     e.g. (ye d̪o kɪt̪ɑbeɳ, these two books), (wo aik kɪt̪ɑb, that one book) etc.
    DemPron -> PN -> NP            e.g. (wo Ali, that Ali) etc.
    NP -> PostP -> CN -> NP        e.g. (is ko kɪt̪ɑbeɳ, to him the books) etc.
    Verb_Aux -> VP                 e.g. (heɳ, are) etc.
    Verb -> Verb_Aux -> VP         e.g. (leni t̪hiɳ, was supposed to take) etc.

Footnote 1: Type-safe linguistic categories (Num: Numerals, CN: Common Noun, DemPron: Demonstrative Pronoun, VAux: Verb Auxiliary, PostP: Postposition).

GF follows an interlingua-based approach. Hence, for one abstract syntax we may provide concrete syntaxes for different languages, and GF can not only parse them but also translate from one concrete syntax to another, thus providing translation.

8 A Complete Example

We demonstrate the analysis of the following sentence as a complete example:

    اِس کو کتابیں لینی ہیں (is ko kɪt̪ɑbeɳ leni heɳ)
    Transliteration: a(i)s kw ktabyN lyny hyN

Morphological analysis:

    <اِس, a(i)s>
      yih_66.    +DemPron - Sg Obl - Pers3_Near
      mayN_68.   +PersPron - Sg Pers3_Near Obl -
    <کو, kw>
      kw_18.     +PostP -
    <کتابیں, ktabyN>
      ktab_824.  +N - Pl Nom - Fem
    <لینی, lyny>
      lyna_2.    +Verb - Inf_Fem -
    <ہیں, hyN>
      hwna_0.    +Verb_Aux - Present Pers1 Pl Masc -
      hwna_0.    +Verb_Aux - Present Pers1 Pl Fem -
      ....

Syntactic parsing:

    UsePresS (UseNP (UsePron mayN_68) kw_18 (UseN ktab_824))
             (UseVP lyna_2 hwna_0)

                          UsePresS
               +-------------+--------------+
             UseNP                        UseVP
      +--------+--------+            +------+------+
   UsePron   kw_18    UseN        lyna_2        hwna_0
      |                 |
   mayN_68          ktab_824
Figure 2: Syntax tree

9 Related Work

A large-scale on-going implementation of Urdu grammar is the Parallel Grammar project (Butt & King, 2002). In this project, the Urdu/Hindi morphology is based on Xerox finite state technology and relies on ASCII transliteration. The Urdu Localization Project is also an on-going project (Hussain, 2004). Its translation component is based on the LFG formalism.

A number of publications are available for the above-mentioned projects, but their implementations are not publicly available.

EMILLE was a three-year project in which a 97 million word corpus was generated for the South Asian languages. For Urdu, an automated part-of-speech tagger was further developed (Hardie, 2005), which was subsequently used to tag the Urdu corpus.

The CRL Language Resource Project (http://crl.nmsu.edu/Resources/lang_res) provides an Urdu Resource Package that contains an online Urdu-English dictionary and a morphological analyzer. However, the design decisions regarding
morphological implementation are not well documented.

A notable transliteration system for Urdu and Hindi is Abbas Malik's Hindi-Urdu Machine Transliteration System (Malik, 2006), in which the SAMPA transcription system is used.

10 Results

FM has many merits and strengths for the development and implementation of a linguistic model, and has proved to be a good choice for implementing Urdu morphology. Haskell gives us complete freedom to define Urdu morphology with great ease. Treating word classes and their parameters as algebraic data types, and the inflection tables (paradigms) for all word classes as finite functions satisfying completeness, makes this implementation elegant, modular, extensible and reusable. We demonstrated the usability of this work by implementing a fragment of Urdu syntax in GF.

However, we do not provide a fully vocalized lexicon, which is a fundamental limitation. Further, for the moment, for the analysis of words, the runtime system of FM requires an exact match of a word or its word forms. Therefore one cannot check whether any orthographically different versions of a word exist in the lexicon.

11 Conclusions and Future work

This work presents an understanding of the Urdu language (morphology + orthography + lexicon) as well as a simple and straightforward solution. Urdu is a challenging language, and FM meets this challenge adequately.

This project could be further enhanced with the following possible extensions:
• A component that matches partly vocalized input words with the canonical words in the lexicon, possibly returning multiple results.
• Algorithms that add the missing diacritics to partly vocalized words automatically.
• The remaining less frequent, very irregular groups of words (especially Arabic and Persian loan words) in the inflection engine, and a bigger coverage of the lexicon.
• A comprehensive implementation of Urdu syntax.
• This system can equally be used for Hindi (morphology + syntax) by providing a lexicon and a transliteration scheme for the Devanagari script.

12 References

M. Afzal, S. Hussain. 2001. Urdu Computing Standards: Development of Urdu Zabta Takhti (UZT 1.01). Proceedings of IEEE International Multi-topic Conference, Pakistan. pp. 216-222.

M. Butt, T. H. King. 2002. Urdu and the Parallel Grammar Project. In Proceedings of COLING-2002: Workshop on Asian Language Resources and International Standardization. pp. 39-45.

M. Forsberg, A. Ranta. 2004. Functional Morphology. ICFP'04, Proceedings of the Ninth ACM SIGPLAN International Conference of Functional Programming. pp. 213-223.

M. Forsberg, H. Hammarström, A. Ranta. 2006. Lexicon Extraction from Raw Text Data. In: Salakoski, T., Ginter, F., Pyysalo, S. and Pahikkala, T. (eds.) Advances in Natural Language Processing: Proceedings of the 5th International Conference, FinTAL, Finland, August 23-25, 2006. Springer, LNCS 4139. pp. 488-499.

A. Hardie. 2005. Automated part-of-speech analysis of Urdu: conceptual and technical issues. In: Yadava, Y., Bhattarai, G., Lohani, R. R., Prasain, B. and Parajuli, K. (eds.) Contemporary issues in Nepalese linguistics. Kathmandu: Linguistic Society of Nepal.

S. Hussain. 2004. Urdu Localization Project. COLING: Workshop on Computational Approaches to Arabic Script-based Languages, Geneva. pp. 80-81.

ICU4J 3.4. 2006. International Components for Unicode for Java. Version 3.6. http://icu.sourceforge.net

A. Malik. 2006. Hindi Urdu Machine Transliteration System. Master Thesis, University of Paris 7, France.

A. Ranta. 2004. Grammatical Framework: A Type-Theoretical Grammar Formalism. Journal of Functional Programming, 14(2):145-189.

R. L. Schmidt. 1999. Urdu: an Essential Grammar. Routledge Grammars.

A. Siddiqi. 1971. dʒɑmeʊl-qwɑʔid. Markazi Urdu Board, Pakistan.

T. Rahman. 2004. Language Policy and Localization in Pakistan: Proposal for a Paradigmatic Shift. Crossing the Digital Divide, SCALLA, 5-7 Jan 2004.
