
KAGGLE COMPETITIONS

MY EXPERIENCE


FOR MACHINE LEARNING AT BERKELEY

DMITRY LARKO

Hi, here is my short bio


About 10 years of working experience in the DB/Data Warehouse field.

4 years ago I learned about Kaggle from my dad, who competes on Kaggle as
well (and does it better than me).
My first competition was the Amazon.com Employee Access Challenge: I placed 10th
out of 1687, learned a ton of new ML techniques and algorithms in a month,
and got addicted to Kaggle.
So far I have participated in 25 competitions, finished 2nd twice, and I am in Kaggle's top 100 Data Scientists.

Currently I'm working as a Lead Data Scientist.

Motivation. Why participate?

To win: this is the best possible motivation for a competition;
To learn: Kaggle is a good place to learn, and the best place to learn on
Kaggle is the forums of past competitions;
Looks good in a resume: well, only if you're constantly winning.

How to start?

Learn Python

Take a Machine Learning MOOC (Coursera, Udacity, Harvard)

Participate in Kaggle Getting started and Playground competitions

Visit finished Kaggle competitions and go through the winners' solutions posted
on the competition forums

Kaggle's Toolset (1 of 2)
Scikit-Learn (scikit-learn.org). Simple and efficient tools for data
mining and data analysis. A lot of tutorials; many ML algorithms have a
scikit-learn implementation.
XGBoost (github.com/dmlc/xgboost). An optimized general-purpose
gradient boosting library. The library is parallelized (OpenMP). It
implements machine learning algorithms under the gradient boosting
framework, including a generalized linear model and gradient boosted
regression trees (GBDT).
Theory behind XGBoost:
https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
Tutorial:
https://www.kaggle.com/tqchen/otto-group-product-classification-challenge/understanding-xgboost-model-on-otto-data
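
For example, a minimal sketch of training XGBoost through its scikit-learn-style wrapper; the toy data and all parameter values are illustrative placeholders, not recommendations:

# Minimal XGBoost sketch using the scikit-learn-style wrapper.
# Toy data and parameter values are illustrative placeholders only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
import xgboost as xgb

X = np.random.rand(1000, 20)                      # 1000 rows, 20 numeric features
y = (X[:, 0] + X[:, 1] > 1).astype(int)           # synthetic binary target
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=200,      # number of boosting rounds
    max_depth=4,           # depth of each tree
    learning_rate=0.1,     # shrinkage
    subsample=0.9,         # row sampling per tree
    colsample_bytree=0.8)  # column sampling per tree
model.fit(X_train, y_train)

print("validation logloss:", log_loss(y_valid, model.predict_proba(X_valid)))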

Kaggle's Toolset (2 of 2)
H2O (h2o.ai). Fast, scalable Machine Learning API. Has state-of-the-art
models like Random Forest and Gradient Boosting Trees. Allows you to
work with really big datasets on a Hadoop cluster. It also works on Spark!
Check out Sparkling Water: http://h2o.ai/product/sparkling-water/
Neural Nets/Deep Learning. Most Python libraries are built on top of
Theano (http://deeplearning.net/software/theano/):

Keras (https://github.com/fchollet/keras)
Lasagne (https://github.com/Lasagne/Lasagne)
Or TensorFlow:

Skflow (https://github.com/tensorflow/skflow)
Keras (https://github.com/fchollet/keras)
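
As an illustration, a minimal Keras sketch of a small feed-forward classifier, written against the current Keras API (older versions used slightly different argument names); layer sizes, epochs, and the random data are placeholders:

# Minimal Keras sketch: a small feed-forward net for multi-class classification.
# Layer sizes, epochs, and the random data are illustrative placeholders.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

X = np.random.rand(1000, 93)                 # e.g. Otto-style 93 numeric features
labels = np.random.randint(0, 9, size=1000)  # 9 classes
y = np.eye(9)[labels]                        # one-hot encode manually

model = Sequential()
model.add(Dense(128, activation='relu', input_dim=93))
model.add(Dropout(0.3))                      # regularization
model.add(Dense(64, activation='relu'))
model.add(Dense(9, activation='softmax'))    # 9 class probabilities

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=128, validation_split=0.2)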

Kaggle's Toolset - Advanced

Vowpal Wabbit (GitHub): a fast, state-of-the-art online learner. A
great tutorial on how to use VW for NLP is available here:
https://github.com/hal3/vwnlp (see the input-format sketch at the end of this list)
LibFM (http://libfm.org/) - Factorization machines (FM) are a generic
approach that allows mimicking most factorization models via feature
engineering. Works great for wide, sparse datasets; has a few
competitors:
FastFM - http://ibayer.github.io/fastFM/index.html
pyFM - https://github.com/coreylynch/pyFM
Regularized Greedy Forest (RGF) - a tree ensemble learning method that can
be better than XGBoost, but you need to know how to cook it.
Four-leaf clover - +100 to Luck, gives outstanding model performance
and a state-of-the-art golden feature search technique... just kidding.
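
Vowpal Wabbit reads its own plain-text format ("label |namespace feature:value ..."). Below is a minimal sketch of writing such a file from Python; the file names are placeholders and the command lines in the comments are only an illustration of the standard vw flags, not a full recipe:

# Sketch of Vowpal Wabbit's input format, written from Python.
# File names are placeholders; the CLI lines in comments use standard vw flags.
rows = [
    # (label, {feature_name: value}) -- labels for logistic loss are -1/+1
    (1,  {"price": 0.23, "sqft": 0.25, "age": 0.05}),
    (-1, {"price": 0.18, "sqft": 0.15, "age": 0.35}),
]

with open("train.vw", "w") as f:
    for label, feats in rows:
        feat_str = " ".join("%s:%g" % (name, val) for name, val in feats.items())
        f.write("%d |f %s\n" % (label, feat_str))   # "|f" opens the feature namespace "f"

# Train and predict from the shell (illustrative):
#   vw train.vw --loss_function logistic -f model.vw
#   vw -t -i model.vw -p preds.txt test.vw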

Which past competition to check?


Well, all of them, of course. But if I had to choose, I'd select
these four:
Greek Media Monitoring Multilabel Classification (WISE 2014). Why?
Interesting problem, a lot of classes.
Otto Group Product Classification Challenge. Why?
Biggest Kaggle competition so far.
A great dataset to learn and polish your ensembling techniques.
Caterpillar Tube Pricing. Why?
A good example of a business problem to solve.
Needs some work with the data before you can build good models.
Has a data leak you can find and exploit.
Driver Telematics Analysis. Why?
Interesting problem to solve.
To get experience working with GPS data.

Common steps

Know your data:
Visualize;
Train a linear regression and check the weights;
Build a decision tree (I prefer to use R's ctree for that);
Cluster the data and look at what clusters you get out;
Or train a simple classifier and see what mistakes it makes.

Know your metric.
Different metrics can give you a clue how to approach the problem; for
example, for logloss you need to add a calibration step for random forest
(see the sketch below).

Know your tools.
Knowing the advantages and disadvantages of the ML algorithms you use
can tell you what to do to solve a problem.
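
One way to read the logloss remark: a random forest's raw probability estimates are often poorly calibrated, so wrapping it in a calibration step tends to improve logloss. A minimal sketch using scikit-learn's CalibratedClassifierCV; the toy data and parameter choices are illustrative only:

# Sketch: calibrating random forest probabilities for a logloss metric.
# Toy data and parameter values are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

X = np.random.rand(2000, 20)
y = (X[:, 0] + np.random.rand(2000) > 1).astype(int)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X_train, y_train)

# Same forest, but with isotonic calibration fitted via internal cross-validation
calibrated = CalibratedClassifierCV(RandomForestClassifier(n_estimators=300, random_state=0),
                                    method='isotonic', cv=3)
calibrated.fit(X_train, y_train)

print("raw RF logloss:       ", log_loss(y_valid, rf.predict_proba(X_valid)))
print("calibrated RF logloss:", log_loss(y_valid, calibrated.predict_proba(X_valid)))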

Common steps (cont'd)

Getting more from data.

Domain knowledge
Feature extraction and feature engineering pipelines. Some ideas (see
the sketch below):
Get the most important features and build 2nd-level interactions
Add new features such as cluster ID or leaf ID
Binarize numerical features to reduce noise
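
A minimal sketch of the cluster-ID / leaf-ID idea using scikit-learn: KMeans cluster assignments and the leaf indices from a small random forest are appended as extra (categorical) columns. Model sizes and the toy data are placeholders:

# Sketch: adding cluster-ID and leaf-ID columns as engineered features.
# Model sizes and the toy data are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(1000, 10)
y = (X[:, 0] > 0.5).astype(int)

# Cluster ID: which of k clusters each row falls into
kmeans = KMeans(n_clusters=8, random_state=0)
cluster_id = kmeans.fit_predict(X)                 # shape (n_samples,)

# Leaf IDs: index of the leaf each row lands in, one column per tree
rf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
rf.fit(X, y)
leaf_ids = rf.apply(X)                             # shape (n_samples, n_trees)

# Append as extra columns; in practice you would one-hot encode these categorical IDs
X_extended = np.hstack([X, cluster_id.reshape(-1, 1), leaf_ids])
print(X_extended.shape)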

Ensembling.

A great Kaggle Ensembling Guide by MLWave

Teaming strategy
Usually we agree to team up at a pretty early phase but keep working
independently to avoid any bias in our solutions.
After teaming up:
Code sharing can help you learn new things, but it is a waste of
time from a competition perspective;
Share data, especially some engineered features;
Combine your models' outcomes using k-fold stacking and build
a second-level meta-learner (stacking; see the sketch below);
Continue iteratively adding new features and building new models.
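
A minimal sketch of k-fold stacking, assuming each teammate's model exposes a scikit-learn-style fit/predict_proba interface: out-of-fold predictions become the training matrix for a second-level meta-learner. The base models, fold count, and toy data are placeholders:

# Sketch: k-fold stacking of two base models with a logistic-regression meta-learner.
# Base models, fold count, and toy data are illustrative placeholders.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X = np.random.rand(1000, 20)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
base_models = [RandomForestClassifier(n_estimators=100, random_state=0),
               GradientBoostingClassifier(random_state=0)]

# Out-of-fold predictions: each row is predicted by a model that never saw it
oof = np.zeros((X.shape[0], len(base_models)))
for j, model in enumerate(base_models):
    for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx, j] = model.predict_proba(X[valid_idx])[:, 1]

# Second-level meta-learner trained on the out-of-fold predictions
meta = LogisticRegression()
meta.fit(oof, y)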
Teaming up right before the competition end:
Black-box ensembling: linear and not-so-linear blending using the
LB as a validation set.
Linear: f(x) = alpha*F1(x) + (1-alpha)*F2(x)
Non-linear: f(x) = F1(x)^alpha * F2(x)^(1-alpha)
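
A minimal sketch of the two blending formulas above, with alpha scanned on a held-out set (on Kaggle, the public LB would play that role). Everything here is a placeholder except the formulas themselves:

# Sketch: linear and "not so linear" blending of two models' predictions,
# scanning alpha on a validation set (stand-in for the public LB).
import numpy as np
from sklearn.metrics import log_loss

# p1, p2: class-1 probabilities from two finished models; y_valid: true labels
p1 = np.clip(np.random.rand(500), 1e-6, 1 - 1e-6)
p2 = np.clip(np.random.rand(500), 1e-6, 1 - 1e-6)
y_valid = np.random.randint(0, 2, size=500)

best = None
for alpha in np.linspace(0, 1, 21):
    linear = alpha * p1 + (1 - alpha) * p2        # f(x) = alpha*F1(x) + (1-alpha)*F2(x)
    geometric = p1 ** alpha * p2 ** (1 - alpha)   # f(x) = F1(x)^alpha * F2(x)^(1-alpha)
    for name, blend in [("linear", linear), ("geometric", geometric)]:
        score = log_loss(y_valid, blend)
        if best is None or score < best[0]:
            best = (score, name, alpha)

print("best blend (logloss, type, alpha):", best)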

Q&A

Thank you!
