
Chapter 1 Basic Statistics

1.1: Introduction

Statistics (or statistical methods) is a systematic way of dealing with raw data, or a data set, in order to extract meaningful information. (What is meaningful depends on the context.) The study of Statistics started with methods for finding the mean and mode of a given data set. Later on, concepts like the median, variance and standard deviation (SD) were added and methods for determining them were developed. Still later, when the business communities required more information, other concepts such as quartiles, deciles and percentiles, and skewness and kurtosis, were added to the repertoire of statistical methods. All these concepts put together give a set of values which is supposed to provide the essential information about the given data set. In this Chapter, we will study those of the above concepts which are quite popular and often used.

1.2: Data Set

In Statistics, by a data set we mean a set or collection of numerical values, such as the one given below:

02, 10, 46, 62, 47, 20, 04, 08, 28, 35, 62, 44, 18, 04, 01, 09, 01, 04, 86, 91, 00, 129, 116, 03, 21, 02, 48, 106, 100, 09, 99, 05, 08, 08, 06, 25, 47, 60, 34, 00, 16, 35, 25, 29, 41, 07, 10, 02, 01, 05.

Such a set of values is also called raw data. (Each value is known as a datum and the collection of values as data.) The data are called raw because the values have no meaning or significance on their own, other than their numeric values. Their meaning depends on the context to which they are applied. We illustrate this by two examples.

Example (1): Suppose you are told that this set of values represents the number of runs scored by a batsman in 50 consecutive innings. Then the data set acquires a meaning.

Example (2): Again, suppose you are told that this set represents the quantity of rainfall (in mm) in a certain place over a period of 50 months (the rainfall is noted down on the same date of 50 consecutive months). Then the same data set acquires a meaning, though a different one.

So raw data by themselves convey no meaning, and their meaning varies from context to context.

1.3: The Mean of a Data Set

Given the data set of Section 1.2, we compute the mean or average (denoted by $\mu$ or $\bar{x}$) as follows: add all the 50 values and divide by 50. For the values as listed in Section 1.2, this gives

Sum = 1579
Mean = Sum / 50 = 31.58

Example (3): Let us go back to Example (1). In this interpretation the mean value of 31.58 represents the average number of runs scored by the batsman per innings.

Example (4): Let us go back to Example (2). In this interpretation the mean value of 31.58 represents the average quantity of rainfall (in mm) in that place per recorded day. The numeric value of the mean gives a good measure of excess or deficiency of rainfall, from which we can answer questions of the following type: On how many of the recorded days was the rainfall in excess of the average? We may answer this by counting, but an easier way is through a visual aid such as the diagram in Figure 1.

Figure 1: Plot of the data values, with a horizontal line at the mean value.
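The computation of this Section can be sketched in Python as follows. This is only an illustrative sketch: the variable names (`data`, `total`, `mean`, `above`) are our own, and the list simply retypes the 50 values of Section 1.2. The sketch adds the values, divides by their number, and then counts how many values lie above the mean, which is the question the diagram answers visually.

```python
# The 50 raw values of Section 1.2, retyped as integers.
data = [
    2, 10, 46, 62, 47, 20, 4, 8, 28, 35,
    62, 44, 18, 4, 1, 9, 1, 4, 86, 91,
    0, 129, 116, 3, 21, 2, 48, 106, 100, 9,
    99, 5, 8, 8, 6, 25, 47, 60, 34, 0,
    16, 35, 25, 29, 41, 7, 10, 2, 1, 5,
]

n = len(data)          # number of values in the data set
total = sum(data)      # add all the values
mean = total / n       # divide the sum by the number of values

# Values that lie strictly above the mean
# (the days with rainfall in excess of the average, in Example (2)).
above = [x for x in data if x > mean]

print("N =", n)
print("Sum =", total)
print("Mean =", round(mean, 2))
print("Values above the mean:", len(above))
```

Counting with a list comprehension is the programmatic counterpart of reading the points above the horizontal line off the plot.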

We give below one more example.

Example (5): Consider the following data, which represent the length (in cm) of the index finger (right hand) of 30 men:

6.2, 6.5, 5.8, 7.2, 7.0, 5.9, 6.8, 6.4, 6.0, 6.1, 5.6, 5.8, 7.2, 7.5, 7.0, 6.9, 6.3, 6.1, 6.2, 6.0, 7.1, 7.1, 6.9, 6.5, 6.8, 6.6, 7.0, 7.9, 5.4, 6.3.

Let us determine the mean. For the values as listed, Sum = 196.1 and N = 30. Therefore, mean = Sum / N = 6.54 (rounded to two decimal places).

You should draw a diagram like the one in Figure 1. Take the numbers 1 to 30 along the X axis and the data values along the Y axis. Plot the points and join them in sequence. Draw the horizontal line indicating the mean. Now answer the questions: What can you infer from this figure? Does it convey any meaningful information to you? Write down your answers and compare them with those of your friends.

1.4: The Mode of the Data Set

Consider the same data set given in Section 1.2. Pick out the maximum value (or values). In this Chapter, this maximum value is called the mode or modal value of the data set. (In standard statistical usage, the term mode refers instead to the value that occurs most frequently, which need not be the maximum.) For this data set, the modal value is 129. This value conveys the information that no value in the data set goes beyond the modal value. This is the peak value. A diagram like the one given below conveys this simple idea. See Figure 2.

Figure 2: Plot of the data values, with the mean value and the modal (peak) value marked.

Example (6): Let us go back to Example (1). The modal value indicates the highest number of runs scored by the batsman during the period. He scored this in his 22nd innings, since 129 is the 22nd value in the data set.

Example (7): Let us go back to Example (2). Here the modal value indicates the maximum quantity of rainfall in that place during the period. The maximum rainfall occurred in the 22nd month.

Example (8): Let us go back to Example (5). The highest value is 7.9. So, the modal value of this data set is 7.9.

In Examples (6), (7) and (8) the modal value occurs only once. Such a data set, or its plot, is said to be unimodal. Figure 2 shows a unimodal plot. In general, the modal value may occur several times. In such cases, we say that the data set, or its plot, is multimodal. Its plot will look like the one shown in Figure 3.
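The following Python sketch illustrates the idea of this Section on the data set of Section 1.2. It is a sketch under stated assumptions, not a definitive implementation: the names `peak`, `peak_positions` and `most_frequent` are our own. The first part finds the peak (maximum) value that this Section calls the modal value, how often it occurs, and where. The second part is an addition that is not in the text: it finds the most frequently occurring value or values, which is what the term mode means in standard statistical usage.

```python
from collections import Counter

# The 50 raw values of Section 1.2, retyped as integers.
data = [
    2, 10, 46, 62, 47, 20, 4, 8, 28, 35,
    62, 44, 18, 4, 1, 9, 1, 4, 86, 91,
    0, 129, 116, 3, 21, 2, 48, 106, 100, 9,
    99, 5, 8, 8, 6, 25, 47, 60, 34, 0,
    16, 35, 25, 29, 41, 7, 10, 2, 1, 5,
]

# Part 1: the peak value, called the modal value in this Section.
peak = max(data)
peak_count = data.count(peak)  # how many times the peak occurs
# 1-based positions (innings or months) at which the peak occurs.
peak_positions = [i + 1 for i, x in enumerate(data) if x == peak]

# Part 2 (not in the text): the most frequent value(s),
# i.e. the mode in the standard statistical sense.
counts = Counter(data)
highest_freq = max(counts.values())
most_frequent = sorted(x for x, c in counts.items() if c == highest_freq)

print("Peak value:", peak, "occurring", peak_count, "time(s)")
print("Positions of the peak:", peak_positions)
print("Most frequent value(s):", most_frequent, "each occurring", highest_freq, "times")
```

A data set whose peak occurs once gives `peak_count == 1`, which corresponds to the unimodal case described above.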

Figure 3: A multimodal plot, in which the modal value occurs several times.

1.5: Median of the Data Set

The value (which may or may not belong to the data set) that bifurcates the data set, that is, divides it into two equal halves, is called the median or median value. To be precise, arrange the values of the data set in increasing order and look at the middle value (or the two middle values). When the number of values is odd, there is a single middle value and it is the median. When the number of values is even, there are two middle values and the median is their average.

Example (9): For a given data set, we have the sorted values in ascending order as:

03, 10, 16, 18, 20, 21, 23, 25, 25, 29, 34, 34, 34, 34, 35, 35, 41, 41, 44, 44, 46, 46, 47, 47, 48, 50, 50, 50, 52, 52, 56, 57, 57, 59, 62, 63, 65, 66, 66, 69, 70, 72, 75, 77, 78, 80, 82, 82, 83, 96.

Since there are 50 values in the data set, we look at the 25th and 26th values. These are 48 and 50 respectively, so the median is given by the average of these two values; that is, by [48 + 50] / 2 = 49. Note that this value 49 of the median does not belong to the data set. Now, let us plot the data set in increasing order and draw the horizontal line at 49. (The student should do this.) You will get a plot like Figure 4. You will find that exactly 25 values (the first 25 values) lie below this line and exactly 25 (the last 25 values) lie above this line.

Example (10): Let us go back to Example (5). Arranging the data in increasing order, we get

5.4, 5.6, 5.8, 5.8, 5.9, 6.0, 6.0, 6.1, 6.1, 6.2, 6.2, 6.3, 6.3, 6.4, 6.5, 6.5, 6.6, 6.8, 6.8, 6.9, 6.9, 7.0, 7.0, 7.0, 7.1, 7.1, 7.2, 7.2, 7.5, 7.9.

Since there are 30 values in the data set, we look at the 15th and 16th values. Both are 6.5. Hence, the median value is given by the average of these two values; that is, by [6.5 + 6.5] / 2 = 6.5. Note that in this case, the median value belongs to the data set. You should draw a figure for this Example.

Figure 4: Plot of the sorted data values of Example (9), with a horizontal line at the median.
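The procedure of this Section can be sketched in Python as shown below. The function name `median` and the variable names are our own choices for illustration. The function sorts the values and averages the two middle values when the count is even, as in Examples (9) and (10); when the count is odd, it returns the single middle value.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)       # arrange in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Sorted data of Example (9).
example_9 = [
    3, 10, 16, 18, 20, 21, 23, 25, 25, 29,
    34, 34, 34, 34, 35, 35, 41, 41, 44, 44,
    46, 46, 47, 47, 48, 50, 50, 50, 52, 52,
    56, 57, 57, 59, 62, 63, 65, 66, 66, 69,
    70, 72, 75, 77, 78, 80, 82, 82, 83, 96,
]

# Finger lengths (in cm) of Example (5), in their original order.
fingers = [
    6.2, 6.5, 5.8, 7.2, 7.0, 5.9, 6.8, 6.4, 6.0, 6.1,
    5.6, 5.8, 7.2, 7.5, 7.0, 6.9, 6.3, 6.1, 6.2, 6.0,
    7.1, 7.1, 6.9, 6.5, 6.8, 6.6, 7.0, 7.9, 5.4, 6.3,
]

m9 = median(example_9)
m10 = median(fingers)

# Check that the median splits Example (9) into two equal halves.
below = sum(1 for x in example_9 if x < m9)
above = sum(1 for x in example_9 if x > m9)

print("Median of Example (9):", m9)
print("Values below / above:", below, "/", above)
print("Median of Example (10):", m10)
```

Sorting inside the function means the data need not be arranged beforehand, so the unsorted list of Example (5) can be passed directly.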

Let us digress briefly from our main ideas to discuss another important concept associated with any statistical analysis of a data set. It is called the frequency table. This idea is an important one and will reappear in later Chapters.

1.6: Frequency Table

Let us reconsider the data set of 50 values of Example (9). We have already arranged these values in increasing order. This arrangement shows the frequency of occurrence of each datum in the set, that is, the number of times each value appears. This is recorded as follows. See Table 1.

Table 1

Datum  Freq.    Datum  Freq.    Datum  Freq.    Datum  Freq.    Datum  Freq.
00     0        10     1        41     2        62     1        80     1
01     0        16     1        44     2        63     1        82     2
02     0        18     1        46     2        65     1        83     1
03     1        20     1        47     2        66     2        96     1
04     0        21     1        48     1        69     1
05     0        23     1        50     3        70     1
06     0        25     2        52     2        72     1
07     0        29     1        56     1        75     1
08     0        34     4        57     2        77     1
09     0        35     2        59     1        78     1

The total of these frequencies is 50, as it should be. This frequency table can be further compressed and expressed in a compact manner as shown below. See Table 2.

Table 2

Class       Frequency
00 -- 09    01
10 -- 19    03
20 -- 29    06
30 -- 39    06
40 -- 49    09
50 -- 59    09
60 -- 69    06
70 -- 79    05
80 -- 89    04
90 -- 99    01
Total       50
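Both tables can be generated from the data of Example (9), as the following Python sketch suggests. The names `freq`, `class_width` and `class_freq` are our own. The sketch first counts how often each datum occurs (as in Table 1) and then groups the values into classes of width 10 by their lower class limit (as in Table 2).

```python
from collections import Counter

# Sorted data of Example (9).
example_9 = [
    3, 10, 16, 18, 20, 21, 23, 25, 25, 29,
    34, 34, 34, 34, 35, 35, 41, 41, 44, 44,
    46, 46, 47, 47, 48, 50, 50, 50, 52, 52,
    56, 57, 57, 59, 62, 63, 65, 66, 66, 69,
    70, 72, 75, 77, 78, 80, 82, 82, 83, 96,
]

# Frequency of each individual datum (as in Table 1).
freq = Counter(example_9)

# Frequency of each class of width 10 (as in Table 2).
# A value x falls in the class whose lower limit is (x // 10) * 10.
class_width = 10
class_freq = Counter((x // class_width) * class_width for x in example_9)

print("Class       Frequency")
for lower in range(0, 100, class_width):
    upper = lower + class_width - 1
    print(f"{lower:02d} -- {upper:02d}    {class_freq[lower]:02d}")
print(f"Total       {sum(class_freq.values())}")
```

Integer division by the class width maps every value to the lower limit of its class, so no explicit list of class boundaries is needed.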

1.7: Histogram

Closely associated with a frequency table is its visual counterpart, called a histogram. The histogram of a frequency table displays the necessary information in a visually understandable form. Histograms are widely used in business discussions and in the assessment of national growth in the industrial and financial sectors. They are also used for many other purposes and applications. Just as a histogram is the visual counterpart of a frequency table, the frequency table is the tabular counterpart of a histogram. Given one, we can generate the other. We now illustrate the concept of a histogram by means of an example.

Example (11): Let us go back to the Example of the previous Section. The frequency table given there can be converted into a histogram as shown below. Here we plot the class values (given in the first column of Table 2) along the X axis and the corresponding frequencies along the Y axis. In the histogram given below, the middle values have been taken as 10, 20, 30, 40, 50, 60, 70, 80, and 90.

Histogram for Example (11)
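As a rough illustration of how a frequency table turns into a histogram, the following Python sketch prints a text histogram from the frequencies of Table 2. The function name `text_histogram` and the bar symbol are our own choices. Unlike the figure, the bars here run horizontally, with one row per class, because that is easier to print as text.

```python
# Classes and frequencies retyped from Table 2.
table_2 = [
    ("00 -- 09", 1),
    ("10 -- 19", 3),
    ("20 -- 29", 6),
    ("30 -- 39", 6),
    ("40 -- 49", 9),
    ("50 -- 59", 9),
    ("60 -- 69", 6),
    ("70 -- 79", 5),
    ("80 -- 89", 4),
    ("90 -- 99", 1),
]


def text_histogram(rows, symbol="#"):
    """Return one line per class: the label, a bar, and the frequency."""
    lines = []
    for label, frequency in rows:
        bar = symbol * frequency   # bar length equals the frequency
        lines.append(f"{label} | {bar} ({frequency})")
    return lines


histogram_lines = text_histogram(table_2)
for line in histogram_lines:
    print(line)
```

Reading the frequencies back from the bar lengths recovers Table 2, which is the sense in which each of the two can be generated from the other.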

1.8: Stem Plots or Node Charts

This is another visual aid based on the frequency table. It helps in taking decisions and in making assessments such as profit / loss, above average / below average, healthy / sick, etc.

1.9: Variance & Standard Deviation

Once we are given a data set and have found its mean, the variance is determined according to the viewpoint we take or the information available to us. If we have no information, we treat the data as a population and compute the variance of the data by the formula

$$\mathrm{Var} = \frac{\sum_{k} (x_k - m)^2}{N}$$

where N is the number of data values and m is the mean of the data set.

If we have information that the given data set represents a sample from a population, then we compute the variance from the formula

$$\mathrm{Var} = \frac{\sum_{k} (x_k - m)^2}{N - 1}$$

where N and m are as above.

The standard deviation of the data set is given by the positive square root of the variance. In the first case, where the data is considered as a population, it is denoted by $\sigma$, and in the second case it is denoted by s. That is, we have

$$\sigma = +\sqrt{\frac{\sum_{k} (x_k - m)^2}{N}} \qquad \text{and} \qquad s = +\sqrt{\frac{\sum_{k} (x_k - m)^2}{N - 1}}$$

Example: Let us go back to Example (5). We have the length (in cm) of the index finger (right hand) of 30 men:

6.2, 6.5, 5.8, 7.2, 7.0, 5.9, 6.8, 6.4, 6.0, 6.1, 5.6, 5.8, 7.2, 7.5, 7.0, 6.9, 6.3, 6.1, 6.2, 6.0, 7.1, 7.1, 6.9, 6.5, 6.8, 6.6, 7.0, 7.9, 5.4, 6.3.

The sum of these values is 196.1, so the mean is m = 196.1 / 30 = 6.54 (rounded to two decimal places). We now compute the sum of the squared deviations of the data values from the mean:

(0.34)² + (0.04)² + (0.74)² + (0.66)² + (0.46)² + (0.64)² + (0.26)² + (0.14)² + (0.54)² + (0.44)² + (0.94)² + (0.74)² + (0.66)² + (0.96)² + (0.46)² + (0.36)² + (0.24)² + (0.44)² + (0.34)² + (0.54)² + (0.56)² + (0.56)² + (0.36)² + (0.04)² + (0.26)² + (0.06)² + (0.46)² + (1.36)² + (1.14)² + (0.24)² = 10.330

Now, if we consider this as our population, then Var = 10.330 / 30 = 0.3443 and hence SD = Sqrt(0.3443) = 0.5868. But if we consider the data set as a sample from some population, then Var = 10.330 / 29 = 0.3562 and hence SD = Sqrt(0.3562) = 0.5968. This completes our example.

We close with the formulas for computing the variance and the standard deviation from a frequency table. We start with a frequency table whose columns are x_k and f_k. Here x_k is the mid point of the class interval and f_k is the corresponding frequency. Let the mean be m, and let N be the total of the frequencies f_k. Then we have the formulas:

(1) $\mathrm{Var} = \dfrac{\sum_{k} f_k (x_k - m)^2}{N}$

(2) $\mathrm{Var} = \dfrac{\sum_{k} f_k (x_k - m)^2}{N - 1}$

(3) The SD corresponding to (1) is Sqrt(Var), and

(4) the SD corresponding to (2) is Sqrt(Var).
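The formulas of this Section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation. The function names `variance` and `grouped_variance` and the `sample` flag are our own. For the frequency table version we assume that the mean m is itself computed from the table as the frequency-weighted average of the mid points, and that the mid points of the classes of Table 2 are 4.5, 14.5, ..., 94.5; neither choice is stated in the text.

```python
import math


def variance(values, sample=False):
    """Variance of raw data: divide by N (population) or N - 1 (sample)."""
    n = len(values)
    m = sum(values) / n                           # the mean
    ss = sum((x - m) ** 2 for x in values)        # sum of squared deviations
    return ss / (n - 1) if sample else ss / n


def grouped_variance(midpoints, freqs, sample=False):
    """Variance from a frequency table with mid points x_k and frequencies f_k."""
    n = sum(freqs)                                # N = total frequency
    # Assumption: the mean is the frequency-weighted average of the mid points.
    m = sum(f * x for x, f in zip(midpoints, freqs)) / n
    ss = sum(f * (x - m) ** 2 for x, f in zip(midpoints, freqs))
    return ss / (n - 1) if sample else ss / n


# Finger lengths (in cm) of Example (5).
fingers = [
    6.2, 6.5, 5.8, 7.2, 7.0, 5.9, 6.8, 6.4, 6.0, 6.1,
    5.6, 5.8, 7.2, 7.5, 7.0, 6.9, 6.3, 6.1, 6.2, 6.0,
    7.1, 7.1, 6.9, 6.5, 6.8, 6.6, 7.0, 7.9, 5.4, 6.3,
]

pop_var = variance(fingers)                 # data treated as a population
samp_var = variance(fingers, sample=True)   # data treated as a sample
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# Table 2, with assumed class mid points 4.5, 14.5, ..., 94.5.
midpoints = [4.5 + 10 * k for k in range(10)]
freqs = [1, 3, 6, 6, 9, 9, 6, 5, 4, 1]
table_var = grouped_variance(midpoints, freqs)
table_sd = math.sqrt(table_var)

print("Population variance and SD:", round(pop_var, 4), round(pop_sd, 4))
print("Sample variance and SD:", round(samp_var, 4), round(samp_sd, 4))
print("Variance and SD from Table 2:", round(table_var, 2), round(table_sd, 2))
```

The code uses the unrounded mean, so its results may differ in the last decimal place from a hand computation that rounds the mean first. The same two functions can be used as a starting point for the exercises.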

Exercises:

1. Consider the data set of Section 1.2. Compute the variance and the SD.
2. For the data set of Example (9), compute the variance and the SD.
3. For the frequency table (Table 2) of Section 1.6, compute the variance and the SD.
