
In linguistics, researchers are often interested in a whole variety of a language rather than in an individual author or text.

In such cases, there are two options for data collection: analyzing every single utterance of that variety, or analyzing a smaller, constructed sample of it. The first option is impracticable because the number of utterances in a living language, such as English or German, is constantly increasing and theoretically infinite. To analyze every utterance in such a language would be an impossible task. It is therefore necessary to choose the second option and build a sample of the language variety in which we are interested.

Quantitative analyses may be carried out on any sample of text; however, the results can be misleading if one wants to generalize the findings from that sample to some larger population. In order to obtain a maximally representative sample, random sampling techniques are employed in corpus building. Applying random sampling techniques to corpus building requires precautionary measures to ensure that the population is represented as fully as possible. Biber (1993b) has provided an extensive account of this issue. He states that we first need to define the limits of our population, or sampling frame, before defining the sampling procedures.

For instance, in analyzing the written German of 1993, we will first have to define our sampling frame. There are two approaches to defining the sampling frame when building a corpus of written language. The first approach is to use a comprehensive bibliographical index for the written German of 1993. This may involve the complete contents of an index of works published in German in that year, e.g. the Deutsche National Bibliographie. This approach was taken by the Lancaster-Oslo/Bergen corpus. The second approach is to define the sampling frame as the contents of a specific library; e.g. we may draw our sample from all the German books published in 1993 that are held in Lancaster University library. This approach was used by the Brown corpus.
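Once a sampling frame has been fixed, drawing a simple random sample from it can be sketched as follows. This is only a minimal illustration: the frame here is a list of made-up placeholder titles standing in for the entries of a bibliographical index or library catalogue, and the function name `draw_random_sample`, the sample size, and the seed are all assumptions introduced for the example.

```python
import random


def draw_random_sample(sampling_frame, sample_size, seed=0):
    """Draw a simple random sample (without replacement) from a sampling frame."""
    if sample_size > len(sampling_frame):
        raise ValueError("sample_size exceeds the size of the sampling frame")
    rng = random.Random(seed)  # fixed seed so that the draw can be repeated
    return rng.sample(sampling_frame, sample_size)


# Hypothetical sampling frame: placeholder entries, not real publications.
sampling_frame = [f"title_{i:04d}" for i in range(1, 1001)]

# Every entry in the frame has the same chance of being selected.
sample = draw_random_sample(sampling_frame, 50)
```

Because the selection is made without replacement, no title can enter the sample twice.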

In the case of informal language such as conversation, these approaches will not be helpful because such language is not stored or indexed. Here demographic sampling can be used, which involves selecting informants of different ages, sexes, social classes, etc. Their everyday conversations are recorded and then used in the corpus. This is the method used to select spoken material for the British National Corpus. Biber (1993b) also highlights the importance of determining the hierarchical structure, or strata, of the population, such as its various genres. In this regard, the written German of 1993 may include genres such as newspaper reporting, romantic fiction, and scientific writing. He defends the representativeness of stratified sampling in comparison to probabilistic sampling in corpus building.
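The idea of stratified sampling can be illustrated with a short sketch in which the population is first divided into genres and a random sample is then drawn within each genre. The texts below are made-up placeholders, and the function name `stratified_sample` and the choice of drawing the same number of texts from every stratum are assumptions made for the example, not a procedure taken from Biber (1993b); sampling each genre in proportion to its size would be another possibility.

```python
import random
from collections import defaultdict


def stratified_sample(frame, per_stratum, seed=0):
    """Draw a random sample of up to `per_stratum` texts from each genre."""
    rng = random.Random(seed)  # fixed seed so that the draw can be repeated

    # Group the sampling frame into strata, one per genre.
    by_genre = defaultdict(list)
    for title, genre in frame:
        by_genre[genre].append(title)

    # Sample independently within each stratum.
    sample = {}
    for genre in sorted(by_genre):
        titles = by_genre[genre]
        k = min(per_stratum, len(titles))
        sample[genre] = rng.sample(titles, k)
    return sample


# Hypothetical frame of (title, genre) pairs: placeholders, not real texts.
genres = ["newspaper reporting", "romantic fiction", "scientific writing"]
frame = [(f"text_{i:03d}", genres[i % 3]) for i in range(300)]

corpus_sample = stratified_sample(frame, per_stratum=20)
```

Sampling within each stratum guarantees that every genre is present in the corpus, which a single random draw over the whole frame does not.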
