
Basic Data Exploration with Statistics

Question 1 (1)
Which of the following is a discrete random variable?
I. The number of Hindi films that receive a national award annually.
II. The number of Parliament Seats in India.
III. The average weight of a randomly selected group of students from Sunday
Night School.
1. I only
2. II only
3. III only
4. I and II
Question 2 (1)
Which central tendency measure will best suit a categorical variable?
1. Mean
2. Median
3. Mode
4. None of the above
Question 3 (1)
A sample comprises six observations: -5, -3, -1, 1, 3, 5. Calculate its standard deviation.
1. 14
2. 11.66
3. 3.74
4. 3.42
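The sample standard deviation asked for in Question 3 can be checked with a minimal sketch like the one below. It assumes the n - 1 (sample) denominator, since the question describes the data as a sample; the variable names are illustrative only.

```python
import math
import statistics

observations = [-5, -3, -1, 1, 3, 5]

# Sample standard deviation: sum of squared deviations divided by n - 1.
mean = sum(observations) / len(observations)
squared_deviations = sum((x - mean) ** 2 for x in observations)
sample_sd = math.sqrt(squared_deviations / (len(observations) - 1))

# statistics.stdev uses the same n - 1 denominator, so both values should agree.
print(round(sample_sd, 2), round(statistics.stdev(observations), 2))
```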
For Questions 4 and 5: Refer to the following caselet.
A data researcher has received the following table for data exploration. The fields in the table are
outlined below. Using the information provided for the fields, help the researcher understand the
table:
Name: Name of the loan applicant
Gender: Gender of the loan applicant
DOB: The birth date of the loan applicant in MMDDYYYY format
Address: Full Postal Address of the loan applicant
Dependents: Number of Dependents that the loan applicant is providing for
Income: Annual Income of the Loan Applicant
FixedCosts: The total amount that the loan applicant pays on a monthly basis to service loans that
the applicant has outstanding.
SavPercent: The approximate percentage of annual income that can be saved on an annual basis.
Age: Age of the loan applicant
Question 4 (1)
Which of the following fields are of the Categorical type?
1. Name, Gender, DOB, Address, Age
2. Name, Gender, Address
3. Name, Gender, DOB, Address
4. Name, Gender, Address, Age
Question 5 (1)
To perform any sort of analysis the researcher wants to work with all the fields as they are. Is this the
right thing to do? Select the best possible answer from the options given below.
1. Yes, all the fields are independent of each other and give information that is useful for
analysis.
2. No, Name can be replaced by a unique id. It will aid analysis by removing a text field
that does not give information that can be analyzed.
3. No. DOB can be removed since we are considering Age.
4. Both 2 and 3
Question 6 (2)
Which of the following statements is true for a bi-modal data distribution that is symmetrical (no skew)?
1. The mean lies to the left of both the modes.
2. The mean lies between both the modes.
3. The mean lies to the right of both the modes.
4. Insufficient Information
Question 7 (2)
Which of the following statistics measures variability?
1. Mean
2. Median
3. Mode
4. Range

Sampling and Estimation

Question 8 (1)
A financial analyst is conducting a survey to estimate the risk and return profile of stocks in
general based on a minimum market capitalization. The analyst is hence sampling from a list of
10,000 stocks. The stocks belong to various industries: 2000 stocks are from the BFSI industry,
2000 stocks are from the IT industry, 2000 stocks are from the Metals industry, 2000 stocks are
from the Oil and Gas industry, and the rest of the stocks belong to various other industries. The
analyst selects a sample of 100 stocks by randomly sampling 20 stocks from each industry type.
Are the stocks selected using Simple Random Sampling?
1. No, because each stock in the sample was randomly sampled.
2. Yes, because each stock in the sample had an equal chance of being sampled.
3. Yes, because stocks of every industry were equally represented in the sample.
4. No, because every possible 100-stock sample did not have an equal chance of being
chosen.
Question 9 (1)
Which of the following statements is true?
I. The mid-point of a confidence interval is the population parameter.
II. The greater the level of confidence, the smaller the confidence interval.
III. The confidence interval is a type of point estimate.
1. I only
2. II only
3. III only
4. None of the above
Question 10 (1)
The height of all the football players in a school is recorded. The average height of a football player is 70
inches with a standard deviation of 1.5 inches. If Alice's z-score is 2.4, what was her height?
1. 66.4 inches
2. 67.6 inches
3. 70 inches
4. 73.6 inches
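The relationship between a z-score and the underlying value in Question 10 can be sketched as follows. The helper name `value_from_z` is hypothetical; it simply rearranges z = (x - mean) / sd.

```python
def value_from_z(mean, sd, z):
    """Recover the raw value x from its z-score: x = mean + z * sd."""
    return mean + z * sd


# Figures from the question: mean 70 inches, standard deviation 1.5 inches, z = 2.4.
height = value_from_z(mean=70, sd=1.5, z=2.4)
print(round(height, 1))
```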
Question 11 (1)
Which of the following statements is consistent with the Central Limit Theorem?
1. The mean of the sample is the same as the mean of the population 95% of the time, if
the level of confidence is 95%.
2. The sampling distribution of the mean is normal only if the original distribution of
the population is normal.
3. The sampling distribution of the sample mean is approximately normal regardless of the
distribution of the original data.
4. None of the above.
Question 12 (1)
The width of the confidence interval does not depend on which of the following?
1. Population Size
2. Sample Size
3. Population Standard Deviation
4. Level of Confidence
Question 13 (1)
Sheeksha is conducting a survey to test a hypothesis. An increase in the sample size will decrease which
of the following?
1. The power of the hypothesis test
2. The probability of making a Type I Error
3. The probability of making a Type II Error
4. Both 2 and 3
Question 14 (2)
Four hundred MBA graduates were randomly selected for a survey on income after the completion of
the MBA. Among the participants, the mean annual salary (in 00,000s) was 3.1, and the standard
deviation was 0.6. Calculate the margin of error, assuming a 95% confidence level.
1. 0.03
2. 0.06
3. 0.1176
4. 1.960
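A margin of error of the kind asked for in Question 14 can be sketched as below. The sketch assumes the usual large-sample formula z * s / sqrt(n) and the two-sided 95% critical value 1.96, neither of which is stated in the question itself.

```python
import math

n = 400        # sample size from the question
s = 0.6        # sample standard deviation from the question
z_95 = 1.96    # assumed two-sided 95% critical value of the standard normal

# Margin of error for the mean: critical value times the standard error.
standard_error = s / math.sqrt(n)
margin_of_error = z_95 * standard_error
print(round(margin_of_error, 4))
```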
Question 15 (2)
A data set has a sample mean of 50 and a sample standard deviation of 11. What percent of the data
would you expect to fall between 39 and 61, assuming that the data distribution is symmetric?
1. 68 percent
2. 81.5 percent
3. 95 percent
4. 99.7 percent
Question 16 (3)
Sixa Bank wants to recalibrate its pay scale. Hence it wants to do a sample survey of the salaries of Vice-
Presidents of various other Banks. The Bank wants its 95% confidence interval to be in the range of +/-
INR 10,000. How many salaries do they need to sample given that the salary of a Vice-President is
normally distributed with a standard deviation of INR 2,500?
1. 2401
2. 625
3. 1
4. Insufficient Information
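The sample-size reasoning in Question 16 can be sketched as follows, assuming the standard formula n = (z * sigma / E)^2 rounded up to a whole observation, with z = 1.96 for 95% confidence. The function name `required_sample_size` is illustrative only.

```python
import math


def required_sample_size(sigma, margin, z=1.96):
    """Smallest whole n for which z * sigma / sqrt(n) does not exceed the margin."""
    return math.ceil((z * sigma / margin) ** 2)


# Figures from the question: sigma = INR 2,500, desired margin = INR 10,000.
n_needed = required_sample_size(sigma=2500, margin=10000)
print(n_needed)
```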

Predictive Analytics: Linear Regression

Question 17 (1)
In the following diagram, four dependent variables Y1, Y2, Y3 and Y4 are plotted against the independent
variable X (which ranges from 1 to 20). Determine which of the four dependent variables will be LEAST
amenable to regression analysis.
1. Y1
2. Y2
3. Y3
4. Y4
[Figure: Y1, Y2, Y3 and Y4 plotted against X; X axis from 0 to 25, Y axis from -100 to 250]

Question 18 (1)
Which of the following statements is least likely to be correct in the case of simple linear regression?
1. There are always as many points above the regression line as there are below the
regression line.
2. The mean of the dependent variable and the mean of the independent variable will
always lie on the regression line.
3. The mean of the estimated values of Y is the same as the mean of the observed values of
Y.
4. The regression line minimizes the sum of the squared residuals.

Question 19 (1)
In a simple linear regression, the sample correlation between two variables, y and x, was found to be
0.81. Using this sample data, the least squares regression line was computed to be ŷ = 10.5 + 3x. Based
on these facts calculate the percentage of observed variation in y explained by a linear relationship with
x.
1. -0.81
2. 0.81
3. 0.656
4. None of these
Question 20 (1)
For the scatter plot given below what is the best estimate for correlation?
[Figure: scatter plot; X axis from 0 to 15, Y axis from 0 to 100]

1. -0.98
2. -0.3
3. 0.3
4. 0.98
Question 21 (1)
Which of the following is not an assumption of the error terms in a simple linear regression model?
1. The error terms are normally distributed.
2. The error terms have a constant variance.
3. The mean of the error terms is zero.
4. The standard deviation of the error terms is one.
Question 22 (1)
If the error terms of a simple linear regression are independent, then which of the following
characteristics are the residuals most likely to possess?
1. Autocorrelation
2. Error terms are cyclical when plotted against time.
3. Alternating positive and negative error terms when plotted against time.
4. No pattern of error terms when plotted against time.
Question 23 (2)
A multi-national company wants to estimate revenue given the number of employees. The regression
line hence estimates y (revenue) based on x (number of employees). Currently the revenue is measured
in US Dollars, the company now wants to estimate revenue in terms of Euros. If the change from Dollars
to Euros takes place in the measurement of y then which of the following parameters will not be
affected?
1. The estimated intercept parameter.
2. The total sum of squares for the regression.
3. R squared for the regression.
4. The regression line minimizes the sum of the squared residuals.
Question 24 (2)
A multiple linear regression predicts the value of house prices based on various variables such as the
area of the house in square feet, the suburb in which the house is located, etc. If you add another
explanatory variable then which of the following results will take place?
1. The explained variability will increase or stay the same.
2. The unexplained variability will increase.
3. The total variability will increase.
4. The coefficient of determination will decrease.
For Questions 25 to 26: Refer to the following caselet. (3 marks each)

Data on the heights of 100 pairs of father and son were collected to study variations in
height. The supposition is that the height of the parents affects the height of the child.
The height for both was measured in cms. A simple linear regression was then
performed on these values. The results for the regression are given below. Use this
regression data to answer the following questions.

Regression Statistics
Multiple R           0.815694
R Square             0.665356
Adjusted R Square    0.661942
Standard Error       9.243665
Observations         100

ANOVA
              df    SS          MS          F           Significance F
Regression    1     16648.93    16648.93    194.8489    4.97E-25
Residual      98    8373.644    85.44534
Total         99    25022.58

                Coefficients   Standard Error   t Stat      P-value     Lower 95%   Upper 95%
Intercept       -0.36653       12.68229         -0.0289     0.977002    -25.5341    24.80105
X Variable 1    1.046896       0.074999         13.95883    4.97E-25    0.898063    1.195729
Question 25 (3)
What statistical conclusion should you make about the regression analysis?
1. Since 194.8489 > F table value, the analysis can be used to estimate the height of the son
given the height of the father.
2. Since 12.6822 > t table value, the analysis can be used to estimate the height of the son
given the height of the father.
3. Since 13.9588 > t table value, the analysis can be used to estimate the height of the son
given the height of the father.
4. Since -0.0289 > t table value, the analysis cannot be used to estimate the height of the son
given the height of the father.
Question 26 (3)
Calculate the 95% confidence interval for the height of a son whose father is 175 cm tall.
1. 164.72 - 200.96
2. 173.60 - 192.08
3. 165.76 - 184.24
4. 156.88 - 193.96
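One way to approach Question 26 is sketched below. It is an approximation under stated assumptions: the point prediction comes from the fitted coefficients in the regression output, and the interval half-width is taken as 1.96 times the regression standard error, rather than the exact t-based prediction interval, which would need the mean and spread of the fathers' heights (not given in the caselet).

```python
# Coefficients and standard error copied from the regression output in the caselet.
intercept = -0.36653
slope = 1.046896
regression_standard_error = 9.243665
z_95 = 1.96  # assumed normal critical value for 95% confidence

father_height_cm = 175

# Point prediction from the fitted line.
predicted_son_height = intercept + slope * father_height_cm

# Approximate interval: prediction plus or minus z times the standard error.
half_width = z_95 * regression_standard_error
lower = predicted_son_height - half_width
upper = predicted_son_height + half_width
print(round(lower, 2), round(upper, 2))
```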

Predictive Analytics: Logistic Regression

Question 27 (1)
Which of the following best describes Odds Ratio?
1. The ratio of the probability of an event not happening to the probability of the event
happening.
2. The probability of an event occurring.
3. The ratio of the odds after a unit change in the predictor to the original odds.
4. The ratio of the probability of an event happening to the probability of the event not
happening.
Question 28 (1)
The Log Likelihood Function, (-2 Log Likelihood or -2LL) and the Chi-square are common statistical
measures used in logistic regression. If our independent variables have a relationship to the dependent
variable then we will improve our ability to predict the dependent variable accurately. In such a scenario
what can be said about the values of the Log Likelihood function and the Chi-Square in comparison to
their values without any independent variables?
     Log Likelihood    Chi Square
1.   Decreases         Decreases
2.   Decreases         Increases
3.   Increases         Decreases
4.   Increases         Increases
Question 29 (1)
In any binary logistic regression:
1. The dependent variable is continuous.
2. The dependent variable is divided into two equal subcategories.
3. The dependent variable consists of two categories.
4. There is no dependent variable.
Question 30 (1)
A null model in logistic regression means
1. There is no independent variable in the model
2. There is a single independent variable in the model
3. There is no dependent variable in the model
4. None of the Above
Additional Information for Questions 31 to 32: Refer to the following caselet.
The following table shows the results of a multivariable logistic regression analysis on data from a study
with 7034 participants for whom 9 covariates / variables were measured 12 years back at baseline. The
dependent variable was whether the person has diabetes (coded as one) or does not have diabetes
(coded as zero) now i.e. after 12 years of the variables being measured.
Variable                    Definition            bi        P value   RRi     CI for RRi
Sex                         M=0, F=1              -1.588    <0.001    0.20    0.14 to 0.29
Age                         Years                 0.081     <0.001    1.08    1.07 to 1.10
Height                      inches                -0.053    <0.05     0.95    0.95 to 1.00
Diastolic Blood Pressure    mm Hg                 0.009     <0.02     1.01    1.00 to 1.02
Systolic Blood Pressure     mm Hg                 0.006     >0.05     1.01    1.01 to 1.02
Cholesterol                 mg/ml                 0.007     <0.001    1.01    1.00 to 1.01
ECG abnormal                Yes=1, No=0           0.854     <0.001    2.35    1.67 to 3.31
Relative weight             (100wt/median wt)%    1.359     <0.001    3.89    1.89 to 8.00
Alcohol consumption         oz/month              -0.059    >0.05     0.94    0.88 to 1.01
Constant term a = -5.370

Question 31 (2)
Which of the following is true?
1. Cholesterol level shows a statistically insignificant result and a large effect size
2. Cholesterol level shows a statistically significant result and a large effect size
3. Cholesterol level shows a statistically insignificant result and a small effect size
4. Cholesterol level shows a statistically significant result and a small effect size
Question 32 (2)
Which of the following variables may be dropped from the current model?
1. Systolic blood pressure, Alcohol consumption, height
2. Systolic blood pressure, Alcohol consumption
3. Diastolic blood pressure, Age, Sex, Relative weight
4. Systolic blood pressure, Diastolic blood pressure, Cholesterol

Predictive Analytics: Forecasting

Question 33 (1)
Which of the following statements is least likely to be true for Forecasting?
1. Forecasts are rarely perfect.
2. Forecasting models need to be regularly updated.
3. Forecasts for a group of items are more accurate than forecasts for individual items.
4. Short range forecasts are less accurate than long range forecasts.
Question 34 (1)
Before starting any time series analysis, which of the following actions will most likely help in
understanding the data?
1. Performing preliminary regression calculations.
2. Calculating a basic moving average.
3. Plotting the data on a graph.
4. Identifying relevant correlated variables.
Question 35 (1)
Which of the following is a defining trait of Naïve Forecasts?
1. They are based only on past values of the variable.
2. They are short-term forecasts.
3. They are long-term forecasts.
4. They usually result in incorrect forecasts.
Question 36 (1)
While conducting time-series analysis, which source of variation can be estimated by the ratio-to-trend
method?
1. Cyclical
2. Trend
3. Seasonal
4. Irregular
Question 37 (1)
Which of the following actions will most likely lead to the maximum smoothing effect?
1. Taking a moving average based on a small number of periods.
2. Performing exponential smoothing with a small weight value.
3. Using the root-mean-square error as an indicator of forecasting error.
4. Using the mean absolute deviation as an indicator of forecasting error.
Question 38 (2)
Regression analysis is used to estimate the linear relationship between the natural logarithm of the
variable to be forecast and time. In such a scenario, which of the following best defines the slope
estimate computed?
1. The linear trend.
2. The natural logarithm of the rate of growth.
3. The natural logarithm of one plus the rate of growth.
4. The natural logarithm of the square root of the rate of growth.
Question 39 (2)
The table below gives the demand for alphonso mangoes for the past six months. Using this data,
forecast the demand for alphonso mangoes for the next month using the four period moving average.
Period 1 2 3 4 5 6
Demand 28 30 32 30 34 28
1. 30
2. 31
3. 32
4. 33
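The four-period moving average forecast in Question 39 can be sketched as follows. The function name `moving_average_forecast` is illustrative; the forecast for the next period is simply the mean of the most recent four observations.

```python
def moving_average_forecast(history, periods):
    """Forecast the next value as the mean of the last `periods` observations."""
    window = history[-periods:]
    return sum(window) / len(window)


# Demand for periods 1 to 6, as given in the question.
demand = [28, 30, 32, 30, 34, 28]
next_month = moving_average_forecast(demand, periods=4)
print(next_month)
```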
Question 40 (3)
The table below gives the weekly demand for DVD Rentals for three weeks. Using this data you need to
forecast the demand for DVD Rentals for the fourth week. What are the values of A, B and C in this
table?
Day of the Week   Week1   Seasonal Index   Week2   Seasonal Index   Week3   Seasonal Index   Avg. Index   Week4
Monday            80                       84                       88                       B
Tuesday           90                       95                       99
Wednesday         90                       94                       98
Thursday          90      A                94                       100
Friday            110                      116                      121
Saturday          120                      126                      133                                   C
Sunday            120                      126                      131
Total             700                      735                      770
Average           100                      105                      110                                   120

     A      B      C
1.   90     84     126.33
2.   0.9    0.8    144
3.   1.1    0.8    132
4.   0.9    0.8    132
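The seasonal-index method behind Question 40 can be sketched as below. The sketch assumes that each seasonal index is the day's demand divided by that week's daily average, that the average index is the mean of the three weekly indices, and that the Week 4 forecast is the average index multiplied by the Week 4 daily average of 120 shown in the table. The dictionary layout and names are illustrative only.

```python
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Daily demand for the three observed weeks, read from the table.
weeks = [
    [80, 90, 90, 90, 110, 120, 120],
    [84, 95, 94, 94, 116, 126, 126],
    [88, 99, 98, 100, 121, 133, 131],
]
week4_daily_average = 120  # given in the Average row of the table

# Seasonal index = day's demand / that week's daily average.
indices = []
for week in weeks:
    weekly_average = sum(week) / len(week)
    indices.append([demand / weekly_average for demand in week])

# Average index per day across the three weeks, then scale to Week 4.
average_index = {
    day: sum(week_idx[i] for week_idx in indices) / len(indices)
    for i, day in enumerate(days)
}
week4_forecast = {day: idx * week4_daily_average for day, idx in average_index.items()}

thursday_week1_index = indices[0][days.index("Thursday")]
print(round(thursday_week1_index, 2))
print(round(average_index["Monday"], 2))
print(round(week4_forecast["Saturday"]))
```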

Market Basket Analysis

Question 41 (1)
For a market basket Association Rule to have business value it is crucial that the Association Rule has
(have)
1. Confidence
2. Support
3. Both
4. At least one
Question 42 (1)
Association Rules exist for both Soap → Deodorant and Deodorant → Soap, i.e. both these association
rules are satisfying some minimum support and minimum confidence. What can we say about their
support and confidence levels?
1. Both the Association Rules have the same support and confidence.
2. Both the Association Rules may have different support levels and different confidence
levels.
3. The support levels will be the same but the confidence levels may be different.
4. The confidence levels will be the same but the support levels may be different.
Question 43 (1)
Which of the following factors is the least likely to contribute as a bottleneck to the Apriori algorithm?
1. The number of association rules
2. The number of scans required
3. The computation of support counting for candidates
4. The number of generated candidates
Question 44 (1)
The confidence associated with the Association Rule Computer → Speaker,
1. Does not change with any changes in the frequency of Computer.
2. Does not change with any changes in the frequency of Speaker.
3. Decreases with an increase in the frequency of Speaker
4. Increases with an increase in the frequency of Speaker
Question 45 (1)
When the Apriori algorithm was applied on a database the number of frequent 1-itemsets was found to
be 10, then the number of candidate 2-itemsets before pruning will be:
1. 5
2. 20
3. 45
4. 90
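The candidate count in Question 45 can be sketched as follows, assuming the usual Apriori step in which every unordered pair of frequent 1-itemsets becomes a candidate 2-itemset before pruning. The item labels are placeholders.

```python
import math
from itertools import combinations

frequent_1_itemsets = [f"item{i}" for i in range(10)]  # placeholder item labels

# Every unordered pair of frequent items is a candidate 2-itemset.
candidate_2_itemsets = list(combinations(frequent_1_itemsets, 2))

# The count matches the binomial coefficient "10 choose 2".
print(len(candidate_2_itemsets), math.comb(len(frequent_1_itemsets), 2))
```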
Question 46 (2)
In a transactional database, the lift measure of the items Pasta and Chicken is equal to 0.5. Which of the
following statements can be interpreted from this lift measure?
1. Consumers that buy pasta are more likely to buy chicken
2. Consumers that buy chicken are more likely to buy pasta
3. Buying chicken and buying pasta are independent activities
4. Consumers that buy pasta are less likely to buy chicken
Question 47 (2)
Let c1, c2, and c3 be the confidence values of the Association Rules AR1, AR2 and AR3 respectively. AR1,
AR2 and AR3 are defined below:
AR1 : {x} → {y}
AR2 : {x} → {y, z}
AR3 : {x, z} → {y}
If it is known that c1, c2, and c3 have different values, which rule has the lowest confidence?
1. c1
2. c2
3. c3
4. Cannot be Determined
Question 48 (3)
Consider the following set of frequent 3-itemsets, which have been developed out of the
frequent 1 itemset F1 = {a, b, c, d, e, f}:
F3 = {{a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}, {c, d, e}}.
How many candidate 4-itemsets can be obtained by a candidate generation procedure using the F3 × F1
merging strategy?
1. 1
2. 2
3. 5
4. 10

Classification

Question 49 (1)
As the decision tree of a classification model increases its branches it becomes more complex. This
complexity will most likely lead to which of the following?
1. Model Under-fitting
2. Model Over-fitting
3. Decrease in the Training Error
4. Both 2 and 3
Question 50 (1)
Ensemble Methods are now commonly used in classification problems. What is the primary advantage
of using Ensemble Methods?
1. It is used to compare various classifying algorithms.
2. It is used to evaluate various classifying algorithms.
3. It is used to improve the accuracy of the overall classification.
4. It is used to calculate the accuracy of the overall classification.
Question 51 (1)
While constructing decision tree algorithms, attribute selection measures are used to
1. Select the splitting criteria that best separate the data
2. Reduce the dimensionality
3. Reduce the error rate
4. Rank attributes
Question 52 (1)
Which of the following is least likely a characteristic of models that suffer from under-fitting?
1. A model that is not very complex
2. High Training Error
3. High Testing Error
4. An Un-pruned Decision Tree
Question 53 (1)
While making a decision tree for classification, the entropy function has a value of 0. Which of the
following is least likely to be associated with a node that has an entropy value of 0?
1. An Entropy value of 0 will lead to higher Training Error.
2. The node is pure.
3. All Examples that are associated with this node belong to the same classifier class.
4. Both 2 and 3
Additional Information for Question 54 to 55: Refer to the following caselet.
A database of 5000 transactions was partitioned into fraudulent and non-fraudulent transactions. A
machine based learning algorithm was then deployed onto this database. The algorithm on completion
correctly labeled 75% of the actual fraudulent transactions as fraudulent. Using this information,
complete the table below and answer the questions that follow:
Actual class\Predicted class Fraudulent Non-Fraudulent Total
Fraudulent 500
Non-Fraudulent A
Total 4400 5000
Question 54 (2)
What is the value of A?
1. 4275
2. 4350
3. 4400
4. 4500
Question 55 (2)
Labeling a non-fraudulent transaction as fraudulent is less harmful than labeling a fraudulent transaction
as non-fraudulent. Hence in the F-score or F-measure Recall has a weight that is three times precision.
Calculate the F-measure for the above confusion matrix.
1. 0.64
2. 0.74
3. 1.36
4. 1.57
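The weighted F-measure in Question 55 can be sketched as below. Two assumptions are made and should be checked against the original table: that 4400 is the total of the predicted Non-Fraudulent column, and that "a weight that is three times precision" corresponds to beta = 3 in the standard F-beta formula. The function name `f_beta` is illustrative.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score: recall is weighted beta times as much as precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


total_transactions = 5000
actual_fraudulent = 500
predicted_non_fraudulent = 4400  # assumed: total of the predicted Non-Fraudulent column

# 75% of actual fraudulent transactions were correctly labeled as fraudulent.
true_positives = 0.75 * actual_fraudulent
predicted_fraudulent = total_transactions - predicted_non_fraudulent

precision = true_positives / predicted_fraudulent
recall = true_positives / actual_fraudulent
score = f_beta(precision, recall, beta=3)
print(round(score, 2))
```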
Additional Information for Question 56 to 57: Refer to the following caselet.
A survey was taken amongst 15 students in an MBA college, regarding education loans. The results of
the survey are shown in the table below:
Age        Work Exp        Marital Status    Car    Edu Loan
20 - 25    0 - 2 years     Single            No     No
20 - 25    No Work Exp     Single            No     Yes
20 - 25    No Work Exp     Single            No     Yes
20 - 25    0 - 2 years     Single            Yes    Yes
20 - 25    2 - 5 years     Married           Yes    No
20 - 25    No Work Exp     Single            Yes    No
26 - 30    2 - 5 years     Married           Yes    Yes
26 - 30    2 - 5 years     Married           No     No
26 - 30    2 - 5 years     Married           No     No
26 - 30    2 - 5 years     Single            No     Yes
26 - 30    0 - 2 years     Single            Yes    Yes
26 - 30    No Work Exp     Married           Yes    No
26 - 30    5 - 10 years    Single            No     No
30 - 35    5 - 10 years    Married           Yes    No
30 - 35    2 - 5 years     Single            Yes    Yes
As is evident, there are two classes within which a student can be classified:
Class1: Has an Education Loan i.e. Edu Loan = Yes
Class2: Does not Have an Education Loan i.e. Edu Loan = No
A Naïve Bayes Classifier is used to allocate any new students into one of the two classes.
Question 56 (3)
When using the Naïve Bayes Classifier there is a chance that some values of a given variable do not
influence the final classification. In the table above, which of the following values are such that they do
not influence whether a person has an education loan or not?
1. Age: 20 - 25 years
2. Work Exp: 2 - 5 years
3. Car: Yes
4. All of the above
Question 57 (3)
A new student Tasneem has the following attributes:
Age: 20 - 25 years
Work Exp: 2 - 5 years
Marital Status: Single
Car: Yes
Use the Naïve Bayes Classifier to find P(Tasneem | Edu Loan = No), and determine the class in which
Tasneem will be categorized.
1. 4.17%, Classified in the class: Has an Education Loan
2. 8.33%, Classified in the class: Has an Education Loan
3. 4.17%, Classified in the class: Does not Have an Education Loan
4. 8.33%, Classified in the class: Does not Have an Education Loan

Clustering

Question 58 (1)
K-means and K-medoids are clustering methods. Which type of method are they?
1. Agglomerative Methods
2. Partitioning Methods
3. Density - based Methods
4. Grid based Methods
Question 59 (1)
Which of the following statements is least likely to be correct?
1. K means is used to cluster numerical data.
2. K modes is used to cluster categorical data.
3. K medoids is used to cluster categorical data.
4. K means is used to cluster a mix of categorical and numerical data.
Question 60 (1)
Silhouette coefficient is used to evaluate clustering models by determining the natural number of
clusters. Which of the following methods is it useful for?
1. Partitioning Methods
2. Hierarchical Methods
3. Density - based Methods
4. Subspace Clustering Methods
Question 61 (1)
Which of the following clustering methods can form clusters of arbitrary shape?
1. k-means
2. DBSCAN
3. CLARANS
4. None of the above
Question 62 (1)
What would a divisive clustering do given there are N statistical units?
1. Starts with one cluster and proceeds with extractions until N clusters are obtained
2. Starts with N clusters and proceeds with merges until one cluster is obtained
3. Extracts immediately the desired number of clusters
4. None of the above
Question 63 (2)
Which of the following statements is least likely to be accurate?
1. When cluster analysis is used as a general data reduction tool then subsequent
multivariate analysis can be conducted on the clusters rather than on the individual
observations.
2. The average linkage method is more preferred for hierarchical clustering than the single
and complete linkage methods.
3. One method of assessing reliability and validity of clustering is to use different methods
of clustering and compare the results.
4. Clustering should be performed on samples of 300 or more.
Question 64 (2)
SSE (Sum of Squared Errors) is a powerful tool for cluster analysis. SSE for a variable can be used to
improve clustering. Which of the following variables will be helpful to distinguish clusters?
1. Variables that have low SSE for all clusters.
2. Variables that have the same SSE for all clusters.
3. Variables that have high SSE for all clusters.
4. Variables that have extreme values for different clusters.
Question 65 (3)
For a data set with m points, cluster analysis needs to be performed. The data set needs to be
segregated into K clusters. The following information is available:
Half the points are in more dense regions,
Half the points are in less dense regions, and
The two regions are well-separated from each other.
Question 66 (3)
Given the above data, which of the following should occur in order to minimize the squared error when
finding K clusters?
1. Centroids should be equally distributed between more dense and less dense regions.
2. More centroids should be allocated to the less dense region.
3. More centroids should be allocated to the denser region.
4. Data available is insufficient to conclude about centroids.
