
KAGGLE COMPETITIONS

MY EXPERIENCE


FOR MACHINE LEARNING AT BERKELEY

DMITRY LARKO

Hi, here is my short bio


About 10 years of working experience in the DB/Data Warehouse field.

4 years ago I learned about Kaggle from my dad, who competes on Kaggle as
well (and does it better than me).
My first competition was the Amazon.com Employee Access Challenge: I placed 10th
out of 1687, learned a ton of new ML techniques and algorithms in a month,
and got addicted to Kaggle.
So far I have participated in 25 competitions, finished 2nd twice, and I am in Kaggle's top 100 Data Scientists.

Currently I'm working as a Lead Data Scientist.

Motivation. Why participate?

To win: this is the best possible motivation for a competition;
To learn: Kaggle is a good place to learn, and the best place to learn on
Kaggle is the forums of past competitions;
Looks good in a resume: well, only if you're constantly winning.

How to start?

Learn Python

Take a Machine Learning MOOC (Coursera, Udacity, Harvard)

Participate in Kaggle Getting started and Playground competitions

Visit finished Kaggle competitions and go through the winners' solutions posted
on the competition forums

Kaggle's Toolset (1 of 2)
Scikit-Learn (scikit-learn.org). Simple and efficient tools for data
mining and data analysis. A lot of tutorials; many ML algorithms have a
scikit-learn implementation.
XGBoost (github.com/dmlc/xgboost). An optimized general-purpose
gradient boosting library. The library is parallelized (OpenMP). It
implements machine learning algorithms under the gradient boosting
framework, including a generalized linear model and gradient boosted
regression trees (GBDT).
Theory behind XGBoost:
https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
Tutorial:
https://www.kaggle.com/tqchen/otto-group-product-classification-challenge/understanding-xgboost-model-on-otto-data
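
For example, a minimal sketch of training XGBoost through its scikit-learn-style wrapper; the toy data and all parameter values are illustrative placeholders, not recommendations:

# Minimal XGBoost sketch using the scikit-learn-style wrapper.
# Toy data and parameter values are illustrative placeholders only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
import xgboost as xgb

X = np.random.rand(1000, 20)                      # 1000 rows, 20 numeric features
y = (X[:, 0] + X[:, 1] > 1).astype(int)           # synthetic binary target
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=200,      # number of boosting rounds
    max_depth=4,           # depth of each tree
    learning_rate=0.1,     # shrinkage
    subsample=0.9,         # row sampling per tree
    colsample_bytree=0.8)  # column sampling per tree
model.fit(X_train, y_train)

print("validation logloss:", log_loss(y_valid, model.predict_proba(X_valid)))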

Kaggle's Toolset (2 of 2)
H2O (h2o.ai). Fast, scalable Machine Learning API. Has state-of-the-art
models like Random Forest and Gradient Boosting Trees. Allows you to
work with really big datasets on a Hadoop cluster. It also works on Spark!
Check out Sparkling Water: http://h2o.ai/product/sparkling-water/
Neural Nets/Deep Learning. Most Python libraries are built on top of
Theano (http://deeplearning.net/software/theano/):

Keras (https://github.com/fchollet/keras)
Lasagne (https://github.com/Lasagne/Lasagne)
Or TensorFlow:

Skflow (https://github.com/tensorflow/skflow)
Keras (https://github.com/fchollet/keras)
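
As an illustration, a minimal Keras sketch of a small feed-forward classifier, written against the current Keras API (older versions used slightly different argument names); layer sizes, epochs, and the random data are placeholders:

# Minimal Keras sketch: a small feed-forward net for multi-class classification.
# Layer sizes, epochs, and the random data are illustrative placeholders.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

X = np.random.rand(1000, 93)                 # e.g. Otto-style 93 numeric features
labels = np.random.randint(0, 9, size=1000)  # 9 classes
y = np.eye(9)[labels]                        # one-hot encode manually

model = Sequential()
model.add(Dense(128, activation='relu', input_dim=93))
model.add(Dropout(0.3))                      # regularization
model.add(Dense(64, activation='relu'))
model.add(Dense(9, activation='softmax'))    # 9 class probabilities

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=128, validation_split=0.2)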

Kaggle's Toolset - Advanced

Vowpal Wabbit (GitHub): a fast, state-of-the-art online learner. A
great tutorial on how to use VW for NLP is available here:
https://github.com/hal3/vwnlp (see the input-format sketch at the end of this list)
LibFM (http://libfm.org/) - Factorization machines (FM) are a generic
approach that allows mimicking most factorization models via feature
engineering. Works great for wide, sparse datasets; has a few
competitors:
FastFM - http://ibayer.github.io/fastFM/index.html
pyFM - https://github.com/coreylynch/pyFM
Regularized Greedy Forest (RGF) - a tree ensemble learning method that can
be better than XGBoost, but you need to know how to cook it.
Four-leaf clover - +100 to Luck, gives outstanding model performance
and a state-of-the-art golden feature search technique... just kidding.
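
Vowpal Wabbit reads its own plain-text format ("label |namespace feature:value ..."). Below is a minimal sketch of writing such a file from Python; the file names are placeholders and the command lines in the comments are only an illustration of the standard vw flags, not a full recipe:

# Sketch of Vowpal Wabbit's input format, written from Python.
# File names are placeholders; the CLI lines in comments use standard vw flags.
rows = [
    # (label, {feature_name: value}) -- labels for logistic loss are -1/+1
    (1,  {"price": 0.23, "sqft": 0.25, "age": 0.05}),
    (-1, {"price": 0.18, "sqft": 0.15, "age": 0.35}),
]

with open("train.vw", "w") as f:
    for label, feats in rows:
        feat_str = " ".join("%s:%g" % (name, val) for name, val in feats.items())
        f.write("%d |f %s\n" % (label, feat_str))   # "|f" opens the feature namespace "f"

# Train and predict from the shell (illustrative):
#   vw train.vw --loss_function logistic -f model.vw
#   vw -t -i model.vw -p preds.txt test.vw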

Which past competition to check?


Well, all of them, of course. But if I had to choose, I'd select
these four:
Greek Media Monitoring Multilabel Classification (WISE 2014). Why?
Interesting problem, a lot of classes.
Otto Group Product Classification Challenge. Why?
Biggest Kaggle competition so far.
A great dataset to learn and polish your ensembling techniques.
Caterpillar Tube Pricing. Why?
A good example of a business problem to solve.
Needs some work with the data before you can build good models.
Has a data leak you can find and exploit.
Driver Telematics Analysis. Why?
Interesting problem to solve.
To get experience working with GPS data.

Common steps

Know your data:
Visualize;
Train a linear regression and check the weights;
Build a decision tree (I prefer to use R's ctree for that);
Cluster the data and look at what clusters you get out;
Or train a simple classifier and see what mistakes it makes.

Know your metric.
Different metrics can give you a clue how to approach the problem; for
example, for logloss you need to add a calibration step for random forest
(see the sketch below).

Know your tools.
Knowing the advantages and disadvantages of the ML algorithms you use
can tell you what to do to solve a problem.
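
One way to read the logloss remark: a random forest's raw probability estimates are often poorly calibrated, so wrapping it in a calibration step tends to improve logloss. A minimal sketch using scikit-learn's CalibratedClassifierCV; the toy data and parameter choices are illustrative only:

# Sketch: calibrating random forest probabilities for a logloss metric.
# Toy data and parameter values are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

X = np.random.rand(2000, 20)
y = (X[:, 0] + np.random.rand(2000) > 1).astype(int)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X_train, y_train)

# Same forest, but with isotonic calibration fitted via internal cross-validation
calibrated = CalibratedClassifierCV(RandomForestClassifier(n_estimators=300, random_state=0),
                                    method='isotonic', cv=3)
calibrated.fit(X_train, y_train)

print("raw RF logloss:       ", log_loss(y_valid, rf.predict_proba(X_valid)))
print("calibrated RF logloss:", log_loss(y_valid, calibrated.predict_proba(X_valid)))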

Common steps (cont'd)

Getting more from data.

Domain knowledge
Feature extraction and feature engineering pipelines. Some ideas (see
the sketch below):
Get the most important features and build 2nd-level interactions
Add new features such as cluster ID or leaf ID
Binarize numerical features to reduce noise
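
A minimal sketch of the cluster-ID / leaf-ID idea using scikit-learn: KMeans cluster assignments and the leaf indices from a small random forest are appended as extra (categorical) columns. Model sizes and the toy data are placeholders:

# Sketch: adding cluster-ID and leaf-ID columns as engineered features.
# Model sizes and the toy data are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(1000, 10)
y = (X[:, 0] > 0.5).astype(int)

# Cluster ID: which of k clusters each row falls into
kmeans = KMeans(n_clusters=8, random_state=0)
cluster_id = kmeans.fit_predict(X)                 # shape (n_samples,)

# Leaf IDs: index of the leaf each row lands in, one column per tree
rf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
rf.fit(X, y)
leaf_ids = rf.apply(X)                             # shape (n_samples, n_trees)

# Append as extra columns; in practice you would one-hot encode these categorical IDs
X_extended = np.hstack([X, cluster_id.reshape(-1, 1), leaf_ids])
print(X_extended.shape)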

Ensembling.

A great Kaggle Ensembling Guide by MLWave

Teaming strategy
Usually we agree to team up at a pretty early phase but keep working
independently to avoid any bias in our solutions.
After teaming up:
Code sharing can help you learn new things, but it is a waste of
time from a competition perspective;
Share data, especially some engineered features;
Combine your models' outcomes using k-fold stacking and build
a second-level meta-learner (stacking; see the sketch below);
Continue iteratively adding new features and building new models.
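
A minimal sketch of k-fold stacking, assuming each teammate's model exposes a scikit-learn-style fit/predict_proba interface: out-of-fold predictions become the training matrix for a second-level meta-learner. The base models, fold count, and toy data are placeholders:

# Sketch: k-fold stacking of two base models with a logistic-regression meta-learner.
# Base models, fold count, and toy data are illustrative placeholders.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X = np.random.rand(1000, 20)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
base_models = [RandomForestClassifier(n_estimators=100, random_state=0),
               GradientBoostingClassifier(random_state=0)]

# Out-of-fold predictions: each row is predicted by a model that never saw it
oof = np.zeros((X.shape[0], len(base_models)))
for j, model in enumerate(base_models):
    for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx, j] = model.predict_proba(X[valid_idx])[:, 1]

# Second-level meta-learner trained on the out-of-fold predictions
meta = LogisticRegression()
meta.fit(oof, y)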
Teaming up right before the competition end:
Black-box ensembling: linear and not-so-linear blending using the
LB as a validation set.
Linear: f(x) = alpha*F1(x) + (1-alpha)*F2(x)
Non-linear: f(x) = F1(x)^alpha * F2(x)^(1-alpha)
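
A minimal sketch of the two blending formulas above, with alpha scanned on a held-out set (on Kaggle, the public LB would play that role). Everything here is a placeholder except the formulas themselves:

# Sketch: linear and "not so linear" blending of two models' predictions,
# scanning alpha on a validation set (stand-in for the public LB).
import numpy as np
from sklearn.metrics import log_loss

# p1, p2: class-1 probabilities from two finished models; y_valid: true labels
p1 = np.clip(np.random.rand(500), 1e-6, 1 - 1e-6)
p2 = np.clip(np.random.rand(500), 1e-6, 1 - 1e-6)
y_valid = np.random.randint(0, 2, size=500)

best = None
for alpha in np.linspace(0, 1, 21):
    linear = alpha * p1 + (1 - alpha) * p2        # f(x) = alpha*F1(x) + (1-alpha)*F2(x)
    geometric = p1 ** alpha * p2 ** (1 - alpha)   # f(x) = F1(x)^alpha * F2(x)^(1-alpha)
    for name, blend in [("linear", linear), ("geometric", geometric)]:
        score = log_loss(y_valid, blend)
        if best is None or score < best[0]:
            best = (score, name, alpha)

print("best blend (logloss, type, alpha):", best)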

Q&A

Thank you!
