Cloudera DS-200 : Practice Test

Question No : 1

Why should you stop an iterative machine learning algorithm as soon as the performance of
the model on a test set stops improving?

A. To avoid the need for cross-validating the model
B. To prevent overfitting
C. To increase the VC (Vapnik-Chervonenkis) dimension for the model
D. To keep the number of terms in the model as small as possible
E. To maintain the highest VC (Vapnik-Chervonenkis) dimension for the model

Answer: B
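
The stopping rule in this question can be sketched in a few lines of Python. This is a minimal illustration, not part of the exam material: the function name `early_stopping_round`, the `patience` parameter, and the loss values are all assumptions made for the example.

```python
def early_stopping_round(val_losses, patience=2):
    """Return the index of the training round whose model should be kept.

    Training is treated as stopped once the held-out loss has failed to
    improve for `patience` consecutive rounds.
    """
    best_round, best_loss, stale = 0, float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best held-out loss: remember it and reset the counter.
            best_round, best_loss, stale = i, loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_round


# Made-up held-out losses: they fall, then rise as the model starts to overfit.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_round(losses))  # 3
```

Rounds after index 3 keep fitting the training data but do worse on held-out data, which is the overfitting that stopping early is meant to avoid.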

Question No : 2

What is the default delimiter for Hive tables?

A. ^A (Control-A)
B. , (comma)
C. \t (tab)
D. : (colon)

Answer: A
Reference: http://blog.spryinc.com/2013/10/four-useful-tricks-for-working-with-hive.html
(change the delimiter when exporting hive table)
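
As a small illustration of what a Control-A delimiter looks like in practice, the sketch below splits one line of a text-format Hive table into fields. The helper name `parse_hive_line` and the sample row are hypothetical.

```python
FIELD_DELIM = "\x01"  # Control-A (^A), the default field delimiter for Hive text tables


def parse_hive_line(line):
    """Split one row of a default-delimited Hive text file into its fields."""
    return line.rstrip("\n").split(FIELD_DELIM)


# Hypothetical row with three columns.
raw = "42\x01alice\x01NY\n"
print(parse_hive_line(raw))  # ['42', 'alice', 'NY']
```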

Question No : 3

Certain individuals are more susceptible to autism if they have particular combinations of
genes expressed in their DNA. Given a sample of DNA from persons who have autism and a
sample of DNA from persons who do not have autism, determine the best technique for
predicting whether or not a given individual is susceptible to developing autism?

A. Naive Bayes
B. Linear Regression
C. Survival analysis

D. Sequence alignment

Answer: A

Question No : 4

You are working with a logistic regression model to predict the probability that a user will
click on an ad. Your model has hundreds of features, and you’re not sure if all of those
features are helping your prediction. Which regularization technique should you use to prune
features that aren’t contributing to the model?

A. Convex
B. Uniform
C. L2
D. L1

Answer: D
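
To see why an L1 penalty prunes features while an L2 penalty only shrinks them, the sketch below compares the two proximal (shrinkage) steps on a made-up weight vector. The function names and the weights are assumptions for illustration; this is not a full logistic regression fit.

```python
def l1_prox(weights, lam):
    """Soft-thresholding: the proximal step for the penalty lam * |w|."""
    out = []
    for w in weights:
        if w > lam:
            out.append(w - lam)
        elif w < -lam:
            out.append(w + lam)
        else:
            out.append(0.0)  # small weights are set exactly to zero
    return out


def l2_prox(weights, lam):
    """Proximal step for the penalty (lam / 2) * w**2: uniform shrinkage."""
    return [w / (1.0 + lam) for w in weights]


weights = [2.0, -0.3, 0.05, -1.5]  # made-up coefficients
print(l1_prox(weights, 0.5))  # [1.5, 0.0, 0.0, -1.0]
print(l2_prox(weights, 0.5))  # every entry shrinks, none reaches zero
```

The L1 step zeroes the two weakly contributing weights, which is the feature-pruning behaviour the question asks about.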

Question No : 5

Refer to the exhibit.

Which point in the figure is the median?

A. A
B. B
C. C

Answer: A

Question No : 6

Refer to the exhibit.

Which point in the figure is the mode?

A. A
B. B
C. C

Answer: C

Question No : 7

Refer to the exhibit.


Which point in the figure is the mean?

A. A
B. B
C. C

Answer: B
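
The exhibit for Questions 5 to 7 is not reproduced here. As a rough aid, the sketch below computes the three statistics on a made-up right-skewed sample (an assumption, not the exhibit's data) to show how they typically order themselves on a skewed distribution.

```python
import statistics

# Made-up right-skewed sample (not the data behind the exhibit).
sample = [1, 2, 2, 2, 3, 3, 4, 5, 7, 11]

mode = statistics.mode(sample)      # most frequent value
median = statistics.median(sample)  # middle of the sorted values
mean = statistics.mean(sample)      # arithmetic average

print(mode, median, mean)
```

For this sample the mode is the smallest of the three and the mean the largest, because the long right tail pulls the mean upward.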

Question No : 8

Under what two conditions does stochastic gradient descent outperform 2nd-order
optimization techniques such as iteratively reweighted least squares?

A. When the volume of input data is so large and diverse that a 2nd-order optimization
technique can be fit to a sample of the data
B. When the model’s estimates must be updated in real-time in order to account for
new observations.
C. When the input data can easily fit into memory on a single machine, but we want to
calculate confidence intervals for all of the parameters in the model.
D. When we are required to find the parameters that return the optimal value of the
objective function.

Answer: A,B
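
The appeal of stochastic gradient descent for incremental updates is that each step touches a single observation. The sketch below is a minimal example under assumed settings: a one-feature linear model, squared error, a hypothetical learning rate of 0.05, and synthetic noise-free data from y = 2x + 1.

```python
def sgd_step(w, b, x, y, lr=0.05):
    """One stochastic gradient step for squared error on a single (x, y) pair."""
    err = (w * x + b) - y
    return w - lr * err * x, b - lr * err


w, b = 0.0, 0.0
# Synthetic observations from y = 2x + 1 (an assumption for the example).
stream = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]

# Repeated passes are used here only to show convergence; in an online
# setting sgd_step would be called once as each new observation arrives.
for _ in range(200):
    for x, y in stream:
        w, b = sgd_step(w, b, x, y)

print(round(w, 3), round(b, 3))
```

Each update costs a constant amount of work and never needs the full data set in memory, in contrast to a 2nd-order method that re-solves a weighted least-squares problem over all observations.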


Question No : 9

What is the result of the following command (the database username is foo and password is
bar)?

$ sqoop list-tables --connect jdbc:mysql://localhost/databasename --table --username foo --password bar

A. sqoop lists only those tables in the specified MySQL database that have not already been
imported into HDFS
B. sqoop returns an error
C. sqoop lists the available tables from the database
D. sqoop imports all the tables from SQL to HDFS

Answer: C
Reference: https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-15/getting-sqoop

Question No : 10

What is the most common reason for a k-means clustering algorithm to return a sub-optimal
clustering of its input?

A. Non-negative values for the distance function
B. Input data set is too large
C. Non-normal distribution of the input data
D. Poor selection of the initial centroids

Answer: D
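
To illustrate how much the starting centers matter, the sketch below runs Lloyd's algorithm on a tiny one-dimensional data set from two different initializations. The function `kmeans_1d`, the data, and the starting centers are all made up for the example.

```python
def kmeans_1d(points, centers, iters=100):
    """Lloyd's algorithm on 1-D data; returns (centers, sum of squared errors)."""
    centers = list(centers)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its old center).
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, sse


points = [0, 1, 2, 10, 11, 12, 20, 21, 22]  # three well-separated groups

print(kmeans_1d(points, [0, 1, 2]))    # all centers start in one group
print(kmeans_1d(points, [1, 11, 21]))  # one center starts in each group
```

With all three centers started inside the first group, the algorithm converges to a local optimum with a sum of squared errors of 154.5, against 6 when each group gets its own starting center.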

Question No : 11

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute
myeloid leukemia (AML), both variants of a blood cancer.

The makeup of the groups is as follows:

Each individual has an expression value for each of 10000 different genes. The expression
value for each gene is a continuous value between -1 and 1.

You’ve built your model for discriminating between AML and ALL patients and you find that it
works quite well on your current data. One month later, a collaborator tells you she has
fresh data from 100 new AML/ALL patients. You run the samples through your model, and it
turns out your model has very poor predictive accuracy on the new samples; specifically,
your model predicts that all males have ALL. What is the most reliable way to fix this
problem?

A. Change the distance metric
B. Reduce the number of dimensions
C. Use a Gibbs sampler on a Bayesian network
D. Perform matched sampling across other provided variables

Answer: D

Question No : 12

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute
myeloid leukemia (AML), both variants of a blood cancer.

The makeup of the groups is as follows:

Each individual has an expression value for each of 10000 different genes. The expression
value for each gene is a continuous value between -1 and 1.

You want to use the data from the 52 patients in the scenario to improve doctors’ ability
to distinguish between ALL and AML. What type of data science problem is this?

A. Classification
B. Regression
C. Clustering
D. Filtering

Answer: A

Question No : 13

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute
myeloid leukemia (AML), both variants of a blood cancer.

The makeup of the groups is as follows:


Each individual has an expression value for each of 10000 different genes. The expression
value for each gene is a continuous value between -1 and 1.

With which type of plot can you encode the largest amount of the data visually?

A. A heat map sorting the individuals by group
B. A histogram of the expression values
C. A scatter plot of two largest principal components

Answer: C

Question No : 14

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute
myeloid leukemia (AML), both variants of a blood cancer.

The makeup of the groups is as follows:


Each individual has an expression value for each of 10000 different genes. The expression
value for each gene is a continuous value between -1 and 1.

Rather than use all 10,000 features to separate AML from ALL, you pick a small subset of
features to separate them optimally. Your feature vectors have 10,000 dimensions while you
only have 52 data points. You use cross-validation to test your chosen set of features. What
three methods will choose the features in an optimal way?

A. Singular value decomposition
B. Bootstrapping
C. Markov chain Monte Carlo
D. Hidden Markov
E. Bayesian Information Criterion
F. Mutual Information

Answer: C,D,F
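
As an illustration of the mutual information option, the sketch below scores two hypothetical binary features against the ALL/AML label. The expression values in the scenario are continuous, so binning each gene into 0/1 (for example low/high) is an assumption made here, as are the function name and the toy data.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())


labels = ["ALL", "ALL", "AML", "AML"]  # toy labels
gene_a = [0, 0, 1, 1]                  # binned gene that tracks the label
gene_b = [0, 1, 0, 1]                  # binned gene unrelated to the label

print(mutual_information(gene_a, labels))  # 1.0
print(mutual_information(gene_b, labels))  # 0.0
```

Ranking genes by this score and keeping the top few is one simple way to pick a small feature subset before cross-validating it.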

Question No : 15

There are 20 patients with acute lymphoblastic leukemia (ALL) and 32 patients with acute
myeloid leukemia (AML), both variants of a blood cancer.

The makeup of the groups is as follows:

Each individual has an expression value for each of 10000 different genes. The expression
value for each gene is a continuous value between -1 and 1.

You choose to perform agglomerative hierarchical clustering on the 10,000 features. How
much RAM do you need to hold the distance matrix, assuming each distance value is a 64-bit
double?

A. ~ 800 MB
B. ~ 400 MB
C. ~ 160 KB
D. ~ 4 MB

Answer: B
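
The arithmetic behind the options can be checked directly. The sketch below assumes the matrix holds pairwise distances between the 10,000 features and that a symmetric matrix can be stored in condensed form (one entry per unordered pair); the function names are hypothetical.

```python
BYTES_PER_DOUBLE = 8  # a 64-bit double


def full_matrix_bytes(n):
    """Memory for a full n x n distance matrix."""
    return n * n * BYTES_PER_DOUBLE


def condensed_matrix_bytes(n):
    """Memory when each unordered pair is stored once (no diagonal)."""
    return n * (n - 1) // 2 * BYTES_PER_DOUBLE


n = 10_000
print(full_matrix_bytes(n))       # 800000000 bytes, about 800 MB
print(condensed_matrix_bytes(n))  # 399960000 bytes, about 400 MB
```

Because distance is symmetric, storing only one triangle halves the requirement, which is where the ~400 MB figure comes from.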

Question No : 16

You have a large m x n data matrix M. You decide you want to perform dimension
reduction/clustering on your data and have decided to use the singular value decomposition
(SVD; also called principal components analysis PCA)
