
ELECTRONIC LANGUAGE RESOURCES

AND THEIR USE IN EDUCATION

Assist. Prof. Rositsa Dekova, PhD

The Paisii Hilendarsky University of Plovdiv


Department of English Studies
Electronic Language Resources
 Electronic dictionaries
 Monolingual & Bilingual
 Picture dictionaries
 Learning games
 Annotated Corpora
 BulNC
 BNC
 COCA
 Parallel corpora
 Bulgarian-English Parallel Corpus
 Lexical Semantic Databases
 WordNet
 FrameNet
24.3.2018 г.
Electronic dictionaries

Learner’s dictionaries and thesauri


 https://en.oxforddictionaries.com/

 https://dictionary.cambridge.org/

 http://www.collinsdictionary.com/

 http://www.merriam-webster.com/

 http://thesaurus.com/

 http://www.eurodict.com/

 http://www.lingvozone.com/free-online-dictionary

Picture dictionaries
 http://www.oxfordlearnersdictionaries.com/wordlist/english/pictures/pics_A-B/
 http://www.kidzwood.com/#a-z
 Pictorial dictionary with voice
 http://www.opdome.com/
 Online picture dictionary
 http://www.anglomaniacy.pl/index.html
 Vocabulary & Grammar
 Songs & Printables

Learning English can be fun
Crosswords and word searches
 http://puzzlemaker.discoveryeducation.com/WordSearchSetupForm.asp?campaign=flyout_teachers_puzzle_wordcross
 http://worksheets.theteacherscorner.net/make-your-own/word-search/
 http://www.teachers-direct.co.uk/resources/wordsearches/
 http://tools.atozteacherstuff.com/word-search-maker/wordsearch.php
 http://www.puzzle-maker.com/WS/
 http://www.armoredpenguin.com/crossword/
 http://www.abcya.com/make_a_word_search.htm
Word search example
African Animals

Word search generated by


http://puzzlemaker.discoveryeducation.com

Word search example

Word search from the database of


http://www.teachers-direct.co.uk/

Interactive games

 http://gamestolearnenglish.com/
 http://www.vocabulary.co.il/english-language-games/
 http://www.learninggamesforkids.com/vocabulary_games/foreign-languages.html
 http://www.eslgamesplus.com/
 http://www.abcya.com/
 ...
ANNOTATED CORPORA

Corpus – a large body of machine-readable, naturally occurring linguistic evidence.
Annotated Corpus – a corpus enhanced with various types of linguistic information:
 Morphological
 POS tagging
 Semantic
 tagging with word senses
 Syntactic
 tagging for syntactic information
 …
TEXT SEGMENTATION

 Electronic text is just a sequence of characters.


 Before any processing is done the text has to be
segmented into linguistic units, such as words,
punctuation, numbers, alphanumericals (H2O), etc.
 This process is called TOKENIZATION and the
segmented units are called TOKENS.
 The process of segmenting the text into sentences
is called SENTENCE SPLITTING.
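
The two segmentation steps can be sketched in a few lines of Python. This is only a minimal illustration: the regular expressions are assumptions chosen for the example (real tokenizers also handle abbreviations, clitics, URLs and so on), and the function names are hypothetical.

```python
import re

# Assumed token pattern: runs of letters/digits (words, numbers,
# alphanumericals such as H2O) or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
# Assumed sentence boundary: '.', '!' or '?' followed by whitespace.
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Naive SENTENCE SPLITTING on terminal punctuation."""
    return [s for s in SENTENCE_RE.split(text.strip()) if s]

def tokenize(sentence):
    """Naive TOKENIZATION into words, numbers, alphanumericals, punctuation."""
    return TOKEN_RE.findall(sentence)

text = "Plants need H2O and light. We should all plant one!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```

Note that the alphanumerical H2O stays a single token, while the final punctuation mark of each sentence becomes a token of its own.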

LEMMATIZATION

 Reduces inflectional forms, and sometimes
derivationally related forms, of a word to its
base or dictionary form, known as the lemma.
 For instance:

 am, are, is → be
 car, cars, car's, cars' → car
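
A dictionary lookup is the simplest way to picture lemmatization. The sketch below is an assumption-laden toy: the table holds only the two examples above, and the names LEMMA_TABLE and lemmatize are hypothetical. Real lemmatizers rely on large morphological lexicons and on the part of speech of each token.

```python
# Hypothetical lookup table built only from the examples on this slide.
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "car": "car", "cars": "car", "car's": "car", "cars'": "car",
}

def lemmatize(token):
    """Return the lemma from the table, or the lowercased token if unknown."""
    key = token.lower()
    return LEMMA_TABLE.get(key, key)

lemmas = [lemmatize(t) for t in ["Cars", "are", "car's", "is"]]
print(lemmas)
```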

PART-OF-SPEECH TAGGING

 Parts of speech
 The morphological and syntactic classes to which
words can be assigned.
 POS tagging
 Automatic assignment of descriptors called tags
to input tokens.

THE TAGSET
 The tagset includes all the tags that will be
used in the POS tagging.
 We could use a very coarse tagset:
 N, V, Adj, Adv, Prep...

 A more commonly used set is finer-grained:


 NN, NNS, NNP, NNPS, VB, VBG, VBN, VBP, VBZ…

 The level of granularity used in the tagset


directly affects the search possibilities.
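
How granularity affects search can be shown with a small sketch. The mapping from fine to coarse tags covers only the tags listed above, and the tagged sentence and the search function are hypothetical illustrations, not the interface of any particular corpus tool.

```python
# Fine-grained tags from the slide mapped onto the coarse classes N and V.
FINE_TO_COARSE = {
    "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N",
    "VB": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
}

# Hypothetical tagged sentence as (token, fine tag) pairs.
tagged = [("Plants", "NNS"), ("need", "VBP"), ("water", "NN")]

def search(tagged_tokens, tag, coarse=False):
    """Return tokens whose tag matches; coarse=True compares coarse classes."""
    hits = []
    for token, fine in tagged_tokens:
        label = FINE_TO_COARSE.get(fine, fine) if coarse else fine
        if label == tag:
            hits.append(token)
    return hits

all_nouns = search(tagged, "N", coarse=True)   # any noun
plural_nouns = search(tagged, "NNS")           # plural common nouns only
print(all_nouns, plural_nouns)
```

With the coarse tagset only the first query is possible; the distinction between singular and plural nouns is available only because the fine-grained tags encode it.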
EXAMPLE

 The_DT
 little_JJ
 boy_NN1
 quickly_RB
 ate_VVD
 the_DT
 green_JJ
 apple_NN1
 ./.
CASES OF AMBIGUITY

 They love summer vacations.


 Their love started in the summer.

 Plants need water and light.


 We should all plant one.

CASES OF AMBIGUITY

 They_PNP are_VBB flying_NN1_VBG planes_NN2 ./.
 Time_NN1_VVB flies_VBZ_NNS like_PRP_VVB an_AT0 arrow_NN1 ./.
Coreference Resolution
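
Ambiguous tokens carry more than one candidate tag, joined by underscores. The sketch below assumes exactly that word_TAG1_TAG2 notation (as used on this slide) and lists the tokens a tagger would still have to disambiguate; parse_tagged is a hypothetical helper.

```python
def parse_tagged(token):
    """Split 'word_TAG1_TAG2' into (word, [TAG1, TAG2])."""
    word, *tags = token.split("_")
    return word, tags

line = "Time_NN1_VVB flies_VBZ_NNS like_PRP_VVB an_AT0 arrow_NN1"
parsed = [parse_tagged(t) for t in line.split()]

# A token is ambiguous when it keeps more than one candidate tag.
ambiguous = [word for word, tags in parsed if len(tags) > 1]
print(ambiguous)
```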

Examples of Taggers and Parsers

 CLAWS WWW tagger (Free web tagging service for English)


 http://ucrel.lancs.ac.uk/claws/trial.html

 The Stanford Parser online


 http://nlp.stanford.edu:8080/parser/

 Shallow Parsing Demo


 Syntactic Tree Generator URL
 An app that builds syntactic trees from labelled
bracket notations.

Applications for formula reading

 www.readspeaker.com

 http://www.robobraille.org/robobraille-projects

 http://www.inftyproject.org/en/index.html

 http://lpf-esi.fe.up.pt/~audiomath/demo/AM1.htm

BRITISH NATIONAL CORPUS (BNC)

 A 100-million-word collection of samples of
written and spoken language from a wide
range of sources, designed to represent a
wide cross-section of late twentieth-century
British English.
 More info at: http://www.natcorp.ox.ac.uk/
 Available online at:
https://corpus.byu.edu/bnc/
BRITISH NATIONAL CORPUS (BNC)
 The written part of the BNC (90%) includes
 extracts from regional and national newspapers,
 specialist periodicals and journals for all ages and interests,
 academic books and popular fiction, published and unpublished
letters and memoranda,
 school and university essays, etc.

 The spoken part (10%) consists of


 orthographic transcriptions of unscripted informal conversations
(recorded by volunteers selected from different age groups, regions and
social classes in a demographically balanced way)
 spoken language collected in different contexts, ranging from
formal business or government meetings to radio shows and
phone-ins
THE CORPUS OF CONTEMPORARY
AMERICAN ENGLISH (COCA)

 The Corpus of Contemporary American


English (COCA) is the largest freely-available
corpus of English, and the only large and
balanced corpus of American English.
 The corpus was created by Mark Davies of

Brigham Young University.


 Available online at:
http://corpus.byu.edu/coca/
COCA

 Contains more than 450 million words of text and


is equally divided among spoken, fiction, popular
magazines, newspapers, and academic texts.
 Includes 20 million words for each year from 1990 to 2012.
 The corpus is also updated regularly (the most
recent texts are from Summer 2012).
 Suitable for looking at current, ongoing changes in
the language.
THE SEARCH ENGINE
 Searches for exact words or phrases, wildcards,
lemmas, part of speech, or any combinations of
these.
 Searches for surrounding words (collocates)
within a ten-word window.
 Limit searches by frequency and compare the
frequency of words, phrases, and grammatical
constructions:
 by genre or even between sub-genres (or domains)
 over time
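
The idea behind a collocate search can be sketched as a simple window count. This is not the COCA implementation: the function below is a hypothetical illustration that assumes the window extends the given number of tokens to each side of the node word, and the sample sentence is invented for the example.

```python
from collections import Counter

def collocates(tokens, node, window=10):
    """Count words occurring within `window` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

# Invented sample text; a real query would run over the corpus itself.
tokens = "the black cat sat near the black hole".split()
near_black = collocates(tokens, "black", window=1)
print(near_black.most_common())
```

Sorting the counter by frequency gives a list comparable in spirit to the collocate tables returned by the corpus interface.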

Results for collocates of black

Search for: Verb + Preposition

SEMANTICALLY-BASED QUERIES OF THE CORPUS

 Contrast and compare the collocates of two related


words (little/small, democrats/republicans,
men/women).
 Determine the difference in meaning or use between
these words.
 Find the frequency and distribution of synonyms for
nearly 60,000 words
 Compare the frequency in different genres.
 Create your own lists of semantically-related words, and
then use them directly as part of the query
SPECIFIC USES OF COCA
 To look at recent changes in English:
 morphology (new suffixes –friendly and –gate)
 syntax (including prescriptive rules, quotative like, so
not ADJ, the get passive, resultatives, and verb
complementation)
 semantics (such as changes in meaning with web,
green, or gay)
 lexis – including word and phrase frequency by year,
to produce lists of all words that have had large shifts
in frequency between specific historical periods.
PARALLEL CORPORA
 Alignment (of bitexts)
 Differences in grammatical structure
 with the sun not shining → нямаше слънце (lit. "there was no sun")
 Differences in lexical structure
 the thermometer walks inch by inch up to the top of the
glass → и термометърът пълзи сантиметър по
сантиметър до върха на скалата (lit. "and the thermometer
crawls centimetre by centimetre to the top of the scale")
 No lexicalization
 It wasn't the butler coming back. → Не беше икономът. (lit. "It was not the butler.")
 It’s this way → Положението е такова (lit. "The situation is such")
INTELLIGENT SEARCHES IN BG-EN PARALLEL CORPUS

 The Bulgarian National Corpus search engine is available at:


http://search.dcl.bas.bg/
 The syntax allows search by (combinations of) word forms,
grammatical tags, semantic relations.
 Thanks to the alignment, the corresponding sentences in parallel
documents are also accessible.
 The hits are paginated and the matches are highlighted.
 The user is able to view the detailed information for a given sentence
in the hit set - the sentence metadata, its context, and
correspondence(s) in the other languages.
 http://ibl.bas.bg/en/BGNC_search_en.htm (Search instructions)
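
Why alignment makes the corresponding sentences accessible can be seen in a minimal sketch. It does not use the BulNC query syntax: the list of pairs and the function search_parallel are assumptions made for the illustration, and the sentence pairs are the two "no lexicalization" examples from the PARALLEL CORPORA slide.

```python
# Aligned sentence pairs (English, Bulgarian); the list layout is assumed.
ALIGNED = [
    ("It wasn't the butler coming back.", "Не беше икономът."),
    ("It's this way", "Положението е такова"),
]

def search_parallel(pairs, word):
    """Return the aligned pairs whose English side contains `word`."""
    needle = word.lower()
    return [(en, bg) for en, bg in pairs if needle in en.lower().split()]

hits = search_parallel(ALIGNED, "butler")
for en, bg in hits:
    print(en, "|", bg)
```

Because each hit is stored together with its counterpart, a match on one side immediately yields the sentence in the other language.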
SEARCH ASSISTANT

LEXICAL SEMANTIC NETWORKS

Electronic language resources which define notions


through their relations with other notions.
 LSN are knowledge representation schemes involving nodes
and links (arcs or arrows) between nodes.
 The nodes represent objects or concepts.
 The links represent relations between nodes.
 The links are directed and labeled.
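
Nodes with directed, labeled links map naturally onto a small graph structure. The class below is a hypothetical sketch, not the storage format of any actual wordnet; the relation labels and sample concepts are chosen only for illustration.

```python
from collections import defaultdict

class SemanticNetwork:
    """Nodes are concepts; links are directed and carry a relation label."""

    def __init__(self):
        self.links = defaultdict(list)   # source node -> [(label, target node)]

    def add_link(self, source, label, target):
        self.links[source].append((label, target))

    def related(self, source, label):
        """Targets reachable from `source` over links with this label."""
        return [target for lab, target in self.links[source] if lab == label]

net = SemanticNetwork()
net.add_link("rose", "hypernym", "plant")
net.add_link("plant", "hypernym", "living organism")
net.add_link("branch", "part_of", "tree")
print(net.related("rose", "hypernym"))
```

Because the links are directed, asking for the hypernym of "plant" does not return "rose"; the inverse relation would have to be stored or derived separately.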

An example of classical taxonomy tree

boy = male child
girl = female child
man = male adult
woman = female adult
child = young human
adult = grown-up human

human
  adult [+adult]
    man [+male]
    woman [-male]
  child [-adult]
    boy [+male]
    girl [-male]
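
The binary features of the taxonomy can be written down as a small table and queried. This is a sketch under the assumption that each word is fully described by the two features [±adult] and [±male]; the names FEATURES and find are hypothetical.

```python
# Feature table for the four leaf nodes of the taxonomy.
FEATURES = {
    "man":   {"adult": True,  "male": True},
    "woman": {"adult": True,  "male": False},
    "boy":   {"adult": False, "male": True},
    "girl":  {"adult": False, "male": False},
}

def find(**wanted):
    """Return the words whose features match all requested values."""
    return sorted(
        word for word, feats in FEATURES.items()
        if all(feats[name] == value for name, value in wanted.items())
    )

print(find(adult=False))             # the hyponyms of 'child'
print(find(adult=True, male=True))   # 'man'
```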

Lattice structure with multiple classifications

WORDNET - http://wordnet.princeton.edu/

 The largest lexical semantic database of English


 Originally developed at Princeton University (Miller, 1990)
 EuroWordNet - http://www.illc.uva.nl/EuroWordNet/
 BalkaNet - http://www.dblab.upatras.gr/balkanet/index.htm
 Each wordnet represents a unique language-internal
system of lexicalizations
 In addition, the wordnets are linked to an Inter-Lingual-
Index, based on the Princeton wordnet

WORDNET STRUCTURE
 Nouns, verbs, adjectives and adverbs are grouped
into sets of cognitive synonyms (synsets), each
expressing a distinct concept.
 Each synset is linked to other synsets by means of a
small number of “conceptual relations.”
 WordNet really consists of four sub-nets, one each
for nouns, verbs, adjectives and adverbs, with few
cross-POS pointers.

WORDNET STRUCTURE
http://wordnet.princeton.edu/man/wngloss.7WN.html
 Each synonym set - SYNSET - encodes the relation of
equivalence between a number of lexical items –
LITERALS where each lexeme:
 has unique meaning (specified by the value of SENSE)
 pertains to one and the same part of speech
(specified as the value of POS)
 represents one and the same lexical meaning
(specified as the value of DEF - definition)
An example: learn (Wordnet)

BulNet http://dcl.bas.bg/bulnet/
 A lexical semantic network of Bulgarian
 comprises around 49,189 synonym sets
distributed into nine parts of speech
 open-class words: nouns, verbs, adjectives and
adverbs
 closed-class words: pronouns, prepositions,
conjunctions, particles and interjections

STRUCTURE
 Each synset is linked to its counterpart in PWN3.0 by
means of a unique identification number – ID.
 The common synsets in the Balkan languages are
marked as common concepts subsets – BCS.
 In the monolingual database a synset should be linked to
at least one other synset through an intralingual
relation.
 Non-obligatory information may also be encoded such
as examples of usage, stylistic, morphological or
syntactic properties.
RELATIONS IN BULNET
Synonymous sets are linked through various relations:
 SEMANTIC
 Synonymy, antonymy, hypernymy, hyponymy, meronymy,
holonymy, entailment, inclusion, causation, etc.
 MORPHOSEMANTIC

 BE IN STATE
 MORPHOLOGICAL
 DERIVED
 PARTICLE

 EXTRALINGUISTIC
SEMANTIC RELATIONS
SYNONYMY – a semantic relation of equivalence
between literals belonging to the same POS;
The synonyms form the synonym set also called
SYNSET.
For example:
 The lexical units
{auto:1, car:2, automobile:2, machine:3, motorcar:1}
form a synset as they refer to the same concept.
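
A synset with its literals, sense numbers, part of speech and definition can be modelled as a small record. The field names below mirror the labels LITERAL, SENSE, POS and DEF used in these slides, but the classes are a hypothetical sketch rather than the actual WordNet or BulNet file format, and the definition string is a placeholder, not a quoted gloss.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Literal:
    lemma: str
    sense: int          # value of SENSE

@dataclass(frozen=True)
class Synset:
    pos: str            # value of POS, shared by all literals
    definition: str     # value of DEF, shared by all literals
    literals: tuple     # the LITERALS

    def __str__(self):
        inner = ", ".join(f"{lit.lemma}:{lit.sense}" for lit in self.literals)
        return "{" + inner + "}"

car = Synset(
    pos="n",
    definition="<placeholder definition, not quoted from any wordnet>",
    literals=(Literal("auto", 1), Literal("car", 2), Literal("automobile", 2),
              Literal("machine", 3), Literal("motorcar", 1)),
)

def synsets_for(lemma, synsets):
    """All synsets that contain `lemma` as one of their literals."""
    return [s for s in synsets if any(lit.lemma == lemma for lit in s.literals)]

print(car)
```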
SEMANTIC RELATIONS
HYPERNYMY and HYPONYMY - semantic relations between
synsets, which correspond to the notion of class
inclusion: if W1 is a kind of W2, then W2 is a
hypernym of W1 and W1 is a hyponym of W2.
Example:
 rose < plant < living organism
Multi-parent relations:
 actress < actor
 actress < female.
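
Class inclusion is transitive, and a node may have more than one parent. The sketch below stores the examples above in a plain dictionary (an assumed representation, with hypothetical function names) and collects every direct and inherited hypernym.

```python
# Direct hypernyms taken from the examples above; a list allows multiple parents.
HYPERNYMS = {
    "rose": ["plant"],
    "plant": ["living organism"],
    "actress": ["actor", "female"],
}

def all_hypernyms(word):
    """Collect every direct and inherited hypernym (breadth-first)."""
    seen = []
    queue = list(HYPERNYMS.get(word, []))
    while queue:
        parent = queue.pop(0)
        if parent not in seen:
            seen.append(parent)
            queue.extend(HYPERNYMS.get(parent, []))
    return seen

def is_a(word, candidate):
    """True if `candidate` is a (possibly indirect) hypernym of `word`."""
    return candidate in all_hypernyms(word)

print(all_hypernyms("rose"))
print(all_hypernyms("actress"))
```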
SEMANTIC RELATIONS
ANTONYMY – a semantic relation of opposition,
established between two members belonging to
one and the same POS.
Examples:
 man - woman
Hyponyms of two antonyms (nouns) should also be
antonymous pair by pair:
 man - woman
 actor - actress
SEMANTIC RELATIONS
MERONYMY and HOLONYMY – semantic relations linking
synsets denoting wholes with those denoting their
parts: if W1 has a W2, and W2 is a part, portion or
member of W1, then W2 is a meronym of W1 and
W1 is a holonym of W2.
Examples:
 ...
MERONYMY may not always be reversible to HOLONYMY:
 tree - forest
TYPES OF MERONYMY
 PART OF:
 клон – дърво (branch – tree)
 книга – библиотека (book – library)
 MEMBER OF:
 дърво – гора (tree – forest)
 футболист – отбор – лига (football player – team – league)
 PORTION OF:
 капка – течност (drop – liquid)
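
The three types of meronymy can be kept apart by labeling each part-whole link. The sketch below uses English glosses of some of the Bulgarian examples above (the glosses are this sketch's own translations) and two hypothetical lookup functions, one for each direction of the relation.

```python
# (part, relation type, whole) triples; English glosses of the examples above.
MERONYM_LINKS = [
    ("branch", "part_of", "tree"),
    ("tree", "member_of", "forest"),
    ("football player", "member_of", "team"),
    ("team", "member_of", "league"),
    ("drop", "portion_of", "liquid"),
]

def holonyms(part, relation=None):
    """Wholes that `part` belongs to, optionally restricted to one type."""
    return [whole for p, rel, whole in MERONYM_LINKS
            if p == part and (relation is None or rel == relation)]

def meronyms(whole, relation=None):
    """Parts of `whole`, optionally restricted to one type."""
    return [p for p, rel, w in MERONYM_LINKS
            if w == whole and (relation is None or rel == relation)]

print(holonyms("tree"))    # what a tree belongs to
print(meronyms("tree"))    # what a tree consists of
```

The word "tree" shows why the relation type matters: it is a whole with respect to "branch" (PART OF) and a member with respect to "forest" (MEMBER OF).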
WORD-FORMING RELATIONS
Morpho-semantic relations
 BE IN STATE
 кост – костен (bone – made of bone)
Morphological relations
 DERIVED
 сервирам – сервитьор (to serve – waiter)
 PARTICLE
 видя – видян (to see – seen)

EXTRALINGUISTIC RELATIONS
 REGION DOMAIN
 степ – Русия (steppe – Russia)
 USAGE DOMAIN
 тиранти – множествено число (braces – plural number)
 CATEGORY DOMAIN
 гимнастически уред – гимнастика (gymnastic apparatus – gymnastics)
THE RELATIONS IN BULNET

The large number of relations encoded in the


Bulgarian wordnet effectively illustrates the
language's semantic and derivational
richness

This offers diverse opportunities for numerous


applications of the multilingual database.

APPLICATIONS

 options for synonym selection


 queries for semantic relations of a word in the
language's lexical system
 antonymy, holonymy, etc.

 explanatory definition queries


 translation equivalents for a lexical item

FRAMENET (Fillmore and Baker 2001, 2010)
 A lexical database of English that is both human-
and machine-readable.
 Based on annotated examples of how words are
used in actual texts.
 Tries to capture human insight into how a word
can be used and converts it into semantic
knowledge that is machine-readable.
 Available online at:
http://www.icsi.berkeley.edu/~framenet
FRAME SEMANTICS (Fillmore, 1976, 1985)

 A semantic frame is a structure used to define the
meaning of a word.
 Example frame: Cutting

 Frame elements are the separate elements which


make up a frame.
 An Agent cuts an Item into Pieces using an Instrument.
 Lexical units are the words that evoke a particular
frame.
 carve.v, chop.v, cube.v, cut.v, dice.v, fillet.v, mince.v,
pare.v, slice.v
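
The three notions (frame, frame elements, lexical units) fit into one small record. The class and the lookup function below are a hypothetical sketch for illustration and do not reproduce the FrameNet data format; the content of the Cutting frame is taken from this slide.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    name: str
    elements: list                      # frame elements
    lexical_units: list = field(default_factory=list)

CUTTING = Frame(
    name="Cutting",
    elements=["Agent", "Item", "Pieces", "Instrument"],
    lexical_units=["carve.v", "chop.v", "cube.v", "cut.v", "dice.v",
                   "fillet.v", "mince.v", "pare.v", "slice.v"],
)

def frames_evoked_by(lexical_unit, frames):
    """Names of the frames that list `lexical_unit` among their lexical units."""
    return [f.name for f in frames if lexical_unit in f.lexical_units]

print(frames_evoked_by("slice.v", [CUTTING]))
```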

FrameNet Data: Frame Index

FrameGrapher

Uses of Electronic Language Resources

 EDUCATION
 Intelligent searches for particular language
phenomena, i.e. search by (combinations of) word
forms, grammatical tags, semantic relations;
 Collocations;

 Word and phrase frequencies;

 Recent changes in the language;

 Translation equivalents;

 Semantic structure of the words and their use;

 etc.
References

 Davies, Mark. 2010. The Corpus of Contemporary American
English as the first reliable monitor corpus of English.
Literary and Linguistic Computing 25(4): 447-464. First
published online October 27, 2010.
 The British National Corpus, version 3 (BNC XML Edition).
2007. Distributed by Oxford University Computing Services on
behalf of the BNC Consortium. URL:
http://www.natcorp.ox.ac.uk/
 Reference Guide for the British National Corpus (XML Edition)
edited by Lou Burnard, February 2007. URL:
http://www.natcorp.ox.ac.uk/XMLedition/URG/

 Miller, George A. (1995). WordNet: A Lexical Database for English.


Communications of the ACM Vol. 38, No. 11: 39-41.
 Fellbaum, Christiane (1998, ed.) WordNet: An Electronic Lexical
Database. Cambridge, MA: MIT Press.
 Koeva, S., T. Tinchev and S. Mihov (2004). Bulgarian Wordnet - structure and
validation. Romanian Journal of Information Science and
Technology, Vol. 7, No. 1-2: 61-78. ISSN 1453-8245.
 Koeva, S. (2008). Derivational and morphosemantic relations in Bulgarian
Wordnet. In Intelligent Information Systems, XVI. Warsaw: Academic
Publishing House, 359-389. ISBN 978-93-60434-44-4.


 Ruppenhofer, J. et al. 2010. FrameNet II: Extended Theory and
Practice. https://framenet2.icsi.berkeley.edu/docs/r1.5/book.pdf
 Fillmore, Charles. Introduction to FrameNet.
https://framenet.icsi.berkeley.edu/fndrupal/sites/default/files/FNintroCJF.ppt
 Fillmore, Charles J. 1985. Frames and the Semantics of
Understanding. Quaderni di Semantica 6(2): pp. 222-53.

THANK YOU
FOR YOUR ATTENTION!
