Outline
Background: goals of corpus linguistics; types of corpora; applications of corpus analysis
As an illustration: exploring units of meaning; case study
Aims and objectives of project
Similar existing projects
Procedures
Current status
Sample corpora
Sample search
Chomskyan linguistics vs corpus linguistics
Langue (competence) vs parole (performance)
Ideal speaker/hearer vs complexity/variation
Language as innate mental faculty vs language as social phenomenon
Intuitive evidence vs empirical evidence
Universals vs differences
Grammar vs meaning
Basic tools
Corpus: a systematic collection of speech or writing, built according to explicit design criteria for a specific purpose
Cf. the broad EAGLES definition: a corpus can potentially contain any text type, incl. word lists, dictionaries, etc.
Concordancer: search engine (e.g. WordSmith; SARA)
Concordance: occurrences of a search item, displayed in a list with immediate context shown
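As a rough illustration of what a concordance looks like, the sketch below builds a keyword-in-context list in Python. The function name `concordance`, the simple letter-based tokenisation and the sample sentence are assumptions made for this example; it does not reproduce how WordSmith or SARA work.

```python
import re

def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    # Naive tokenisation: runs of letters and apostrophes (an assumption).
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Invented sample text, used only to show the display format.
sample = "At last the rain stopped. In the end we stayed at home, and at last we slept."
for left, kw, right in concordance(sample, "last"):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>25}  {kw}  {right}")
```

Aligning the search item in a central column is what lets a reader scan the immediate context on either side at a glance.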
Types of corpora
Written vs Spoken General vs Specialised
Monolingual vs Multilingual
Written corpora
Specialised corpora
First generation ma
Brown Corpus (1960s)
Empirical teaching data: authentic examples of language use
Reference source: answering learners' questions or explaining learner errors
Preparation of teaching materials, e.g. vocabulary lists, cloze tests
CALL: concordancing and data-driven learning
Using parallel texts to find suitable translation equivalents
Creation of translation databases or glossaries for domain-specific terminology, e.g. business, law, science
Exploring units of meaning in texts
Lexicography & lexical studies, e.g. relative word frequency
Language variation, e.g. linguistic features across registers
Grammar: corpora used as data to test hypotheses and syntactic theory
Pragmatics & discourse, e.g. CA of discourse features in spoken (conversational) data
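Relative word frequency, one of the lexical applications listed above, can be sketched in a few lines of Python. The function name `relative_frequencies`, the lower-casing and the letter-based tokenisation are assumptions for illustration; normalising to a rate per million tokens is a common convention but is not specified in these slides.

```python
import re
from collections import Counter

def relative_frequencies(text, per=1_000_000):
    """Return each word's frequency normalised to a rate per `per` tokens."""
    # Naive tokenisation on lower-cased letters and apostrophes (an assumption).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    # An empty text yields an empty dict, so no division by zero occurs.
    return {word: n * per / total for word, n in counts.items()}

# Invented sample sentence; a small `per` keeps the numbers readable.
freqs = relative_frequencies("The cat saw the dog", per=100)
print(freqs["the"], freqs["cat"])
```

Normalising by corpus size is what makes frequencies comparable across corpora of different lengths.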
What's the difference between "at last" and "in the end"? How is "hardly" used?
Translation
People interested in the meanings of texts, in how language is actually used in discourse
Meaning is a key problem for translation, language learning, information management
Language teaching (TEFL): vocabulary often introduced in the form of new single words
Words considered to be basic units of meaning
"If you dog a dog during the dog days of summer, you'll be a dog-tired dog catcher"
"Can I sit down? My dogs are barking"
Most lexical errors made by language learners result from failure to deal with ambiguities of single words
Notion of an Unambiguous Unit of Meaning (UUoM) necessary for understanding meaning
UUoM = keyword and all words in the context that contribute to making the word unambiguous
Compounds, idioms, multi-word units, collocations, set phrases
Often determined by a syntactic pattern
Adj + N: friendly fire, closing remarks
V + N: invite proposals, draw conclusions
Adv + Adj: politically correct, environmentally friendly
N + of + N: cause of death, proof of identity, code of practice, duty of care
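To show how a syntactic pattern can be used to pull candidate units of meaning out of a text, here is a minimal Python sketch for the N + of + N pattern. It is only an approximation: it uses no part-of-speech tagging, so any word can fill either slot, and the function name `of_phrases` and the sample text are invented for this example.

```python
import re
from collections import Counter

def of_phrases(text):
    """Count word + 'of' + word sequences as candidate N + of + N units."""
    # Assumption: no tagging, so matches are candidates, not confirmed nouns.
    pattern = re.compile(r"\b([a-z]+) of ([a-z]+)\b")
    return Counter(" of ".join(pair) for pair in pattern.findall(text.lower()))

# Invented sample text built around phrases from the list above.
sample_text = (
    "The cause of death was recorded. Proof of identity is required, "
    "and the cause of death was confirmed."
)
for phrase, n in of_phrases(sample_text).most_common():
    print(n, phrase)
```

Recurring candidates are the ones worth checking in a concordance to decide whether they really behave as set phrases.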
Case study
Search results
Represent fairly new concepts
Occur in the newer corpora (1990s) as units of meaning
Occur as entries in some of the online dictionaries only (not bilingual dictionaries)
New terminology and terms of common usage are not always recorded in dictionaries and termbanks
One way of using corpora for learning and translation: use corpus evidence to help students recognise units of meaning, and introduce the notion of units of meaning into language learning
To be used by staff and students in the department
For teaching, language learning and research purposes
A WWW interface via which users can freely access the language bank
With browse, search and concordance facilities
Ingredients of PULB
Sources: standard corpora, departmental collections
Medium: written texts, transcribed spoken data
Language types: native speaker, learner corpora
Languages: English, Chinese, Japanese, French, German
Genres: business, law, academia, media, social, literature
Target size: 30 million words (European) / characters (Asian)
Authentic examples of language use at your fingertips
Empirical teaching data covering different specialisms (ESP, EAP)
A ready-made collection of data waiting for you to work on
Saving on time and resources
Way of incorporating new methods and information technology into the department's teaching and research activities
Increase students' awareness of this rapidly developing methodology / branch of language studies (corpus linguistics, corpora studies)
Way of integrating theory with technology in the classroom
Train students to be more computer-literate
All of the above can motivate students to become active learners and help them learn the target language more effectively (cf. goals of DDL)
http://clwww.essex.ac.uk/w3c/
Access to corpora (Gutenberg texts, LOB, LOB-tagged)
Web interface for performing searches
Online tutorial and info on corpus linguistics
http://vlc.polyu.edu.hk/concordance/
Access to variety of corpora and texts (bilingual/parallel corpora, news, Bible, works of fiction)
Web interface for performing searches
Build a language bank with features that parallel those of similar sites (cf. VLC, Essex)
Bring together corpora and texts of various types and genres, of different languages
Make available different facilities for different categories of users (cf. legal considerations)
Provide on-site tutorial, corpora-based info
Allow searches in multiple texts / corpora simultaneously
Some form of parallel concordancing
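Searching several corpora at once can be pictured with the following Python sketch. It is not PULB's implementation: the function name `search_corpora`, the corpus labels and the sample sentences are all hypothetical, and each corpus is assumed to be available as a single string.

```python
import re

def search_corpora(corpora, keyword, width=30):
    """Search every corpus for a whole-word match; return (name, context) pairs."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    results = []
    for name, text in corpora.items():
        for match in pattern.finditer(text):
            # Keep up to `width` characters of context on each side of the hit.
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            results.append((name, text[start:end]))
    return results

# Hypothetical miniature corpora, invented for this example.
corpora = {
    "business": "The board will draw conclusions from the annual report.",
    "legal": "The court may draw conclusions from silence.",
    "academic": "Further research is needed.",
}
for name, context in search_corpora(corpora, "conclusions"):
    print(f"[{name}] {context}")
```

Tagging each hit with the corpus it came from lets the user compare usage across genres in a single result list.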
[Figure: diagram of corpus categories and holdings]
Categories: General corpora; Learner corpora; Specialised corpora; Spoken corpora
Labels: Legal English; Academic English; English Literature; HK spoken corpus; Conference speeches; Academic presentations; Workplace English; Business writing; Teaching reflections; Social interactions; Student work; BNC; ICE; BROWN
Procedures (i)
Business Corpus (Li and Bilbow)
Bilingual corpora (Xu)
ESP / EAP corpora (Forey)
Learner corpora (Sengupta)
Procedures (ii)
Clean up texts
E.g. duplications of text samples
E.g. structural features (headings, typographic features)
E.g. personal information found in data
To protect anonymity or privacy of authors and speakers
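Two of the clean-up steps above, removing duplicated samples and masking personal information, can be sketched in Python as follows. This is an illustration only, not the procedure used for PULB: the function name `clean_samples`, the placeholder tags `<EMAIL>` and `<NAME>`, the e-mail pattern and the sample texts are all assumptions.

```python
import re

# Rough e-mail pattern, an assumption for illustration; real data needs more care.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def clean_samples(samples, names=()):
    """Drop duplicate samples and mask e-mail addresses and listed names."""
    seen = set()
    cleaned = []
    for sample in samples:
        # Samples differing only in whitespace or case count as duplicates.
        key = " ".join(sample.split()).lower()
        if key in seen:
            continue
        seen.add(key)
        text = EMAIL.sub("<EMAIL>", sample)
        for name in names:
            text = re.sub(r"\b" + re.escape(name) + r"\b", "<NAME>", text)
        cleaned.append(text)
    return cleaned

# Invented sample data.
raw_samples = [
    "Dear Mr Chan, please reply to chan@example.com.",
    "dear mr chan,  please reply to chan@example.com.",
    "Thank you for your order.",
]
for line in clean_samples(raw_samples, names=["Chan"]):
    print(line)
```

Masking with a visible tag rather than deleting keeps the sentence structure intact for later linguistic analysis.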
Annotate texts
Provide descriptive information about each corpus
Provide descriptive information about the texts
Number, size, genre of subtexts
Bibliographic info (written text)
Ethnographic info (spoken data)
Compiler, time of compilation, type of collection
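One way to picture the descriptive information listed above is as a small record attached to each corpus. The Python sketch below is a hypothetical layout, not PULB's actual annotation scheme: the class names, field names and sample values are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TextRecord:
    title: str
    genre: str
    words: int
    # Bibliographic info for written text, ethnographic info for spoken data.
    source_info: str = ""

@dataclass
class CorpusDescription:
    name: str
    compiler: str
    compiled: str          # time of compilation
    collection_type: str   # type of collection
    texts: list = field(default_factory=list)

    def summary(self):
        """Number, total size and genres of the subtexts."""
        return {
            "texts": len(self.texts),
            "words": sum(t.words for t in self.texts),
            "genres": sorted({t.genre for t in self.texts}),
        }

# Placeholder values, invented for this example.
demo = CorpusDescription("Demo corpus", "A. Compiler", "2000", "departmental collection")
demo.texts.append(TextRecord("Annual report", "business", 1200, "Company X, 1999"))
demo.texts.append(TextRecord("Team meeting", "workplace", 800, "3 speakers, office"))
print(demo.summary())
```

Keeping the description separate from the text itself means the same header can drive both a browse page and a search filter.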
Procedures (iii)
PULB map
Browse facility
Search and concordance facilities
Tutorial / general information
Transplant PULB onto dept website for use by staff and students
Promote PULB among corpora community
Current status
Range of corpora totalling 12M+ words
Individual corpus descriptions
Index of corpora
Simple-to-use built-in concordancer
Available at http://langbank.engl.polyu.edu.hk/
PolyU Business Corpus (Eng, Chi, Jap)
BNC Sampler Corpus (Spoken, Written)
Corpus of Multilingual Texts
Corpus of Nursing and Health Science Texts
Learner Corpus of Essays and Reports
HK Bilingual Corpus of Legal and Documentary Texts
...
What would you like to see being incorporated into PULB?
Can you think of other ways in which PULB can be organised and structured?
How likely are you to make use of PULB in your teaching and research?
Do you have any suggestions for corpus studies based on available or potentially available corpora from PULB?
Do you know of similar projects being undertaken elsewhere that we can learn from?
Do you have collections of language data from past research projects that are (could be) presented as a corpus (corpora)?
Can we help you put your collections to good use?
Can we work together to incorporate your collections into PULB?
In terms of corpora
In terms of search facilities and supplementary information
Concluding remarks
Corpora represent a valuable but under-exploited resource for teaching and research
PULB aims to bring together various corpora under a single departmental archive, accessible via WWW
You can help us by contributing your ideas and/or your language collections
Please visit and test the PULB website at http://langbank.engl.polyu.edu.hk/ and provide us with feedback using the online evaluation form
Thank you very much
Social grooming
CLOZE
Business texts from: newspapers, government reports, company reports and brochures
Has been used for creating a bilingual English-Chinese business lexicon
English (c. 1.3M words)
Chinese (c. 1.2M words)
Japanese (c. 1.1M words)
Duplication