2 minute read

2019-11-02


Kaggle Competition Pipeline

| Problem | Feature Engineering | Modeling |
| --- | --- | --- |
| Image Classification | Scaling, shifting, rotation | CNNs (ResNet, VGG, DenseNet) |
| Sound Classification | Fourier transforms, spectrograms, scaling | CNNs (CRNN), LSTMs |
| Text Classification | Tf-idf, SVD, stemming, spell checking, stop-word removal, n-grams | GBMs, Linear, DL, Naive Bayes, KNNs, LibFM, LibFFM |
| Time Series | Lags, weighted averaging, exponential smoothing | Autoregressive models, ARIMA, Linear, GBMs, DL, LSTMs |
| Categorical Features | Target encoding, frequency encoding, one-hot, ordinal, label encoding | GBMs, Linear, DL, LibFM, LibFFM |
| Numerical Features | Scaling, binning, derivatives, outlier removal, dimension reduction | GBMs, Linear, DL, SVMs |
| Interactions | Multiplications, divisions, group-by features, concatenations | GBMs, Linear, DL |
| Recommenders | Features on transaction history, item popularity, frequency of purchase (e.g. the Acquire Valued Shoppers competition) | CF, DL, LibFM, LibFFM, GBMs |
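
To make the text-classification row concrete, here is a minimal sketch of a tf-idf → SVD → linear-model pipeline in scikit-learn; the toy corpus, labels, and parameter values are my own illustrative choices, not prescriptions from the table.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot and acting",
         "loved the acting", "terrible, boring movie"]
labels = [1, 0, 1, 0]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),  # tf-idf, n-grams, stop-word removal
    TruncatedSVD(n_components=2, random_state=0),               # SVD down to dense features
    LogisticRegression(),                                       # linear model
).fit(texts, labels)
print(clf.predict(["boring plot"]))
```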
   

Ensembling

  1. Throughout the pipeline, predictions on the internal validation set and on the test set are saved. If collaborating with others, this is the point where everyone passes on their predictions.
  2. There are different ways to combine them, from simple averaging to multilayer stacking.
  3. Small data requires simpler ensemble techniques (e.g. averaging).
  4. It helps to average a few low-correlated predictions with good scores.
  5. Bigger data can utilize stacking.
  6. The stacking process repeats the modeling process on top of the saved predictions (see the sketch below).
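
A minimal out-of-fold stacking sketch in scikit-learn; the synthetic data, the two base models, and the logistic-regression meta-model are my own assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [RandomForestClassifier(random_state=0),
               GradientBoostingClassifier(random_state=0)]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros((len(X_train), len(base_models)))        # saved validation predictions
test_meta = np.zeros((len(X_test), len(base_models)))   # saved test predictions

for m, model in enumerate(base_models):
    # out-of-fold predictions on the training data
    for train_idx, valid_idx in kf.split(X_train):
        model.fit(X_train[train_idx], y_train[train_idx])
        oof[valid_idx, m] = model.predict_proba(X_train[valid_idx])[:, 1]
    # refit on all training data to predict the test set
    model.fit(X_train, y_train)
    test_meta[:, m] = model.predict_proba(X_test)[:, 1]

# the meta-model "repeats the modeling process" on the saved predictions
meta = LogisticRegression().fit(oof, y_train)
print("stacked accuracy:", meta.score(test_meta, y_test))
```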

Matrix Factorization

  1. Several MF methods can be found in sklearn (see the sketch after this list)
  2. SVD & PCA: Standard tools for matrix factorization
  3. Truncated SVD: works with sparse matrices
  4. Non-negative Matrix Factorization (NMF): ensures that all latent factors are non-negative; good for counts-like data
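
A small sketch of TruncatedSVD and NMF on a sparse, counts-like matrix; the matrix shape, density, and number of components are arbitrary assumptions.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import NMF, TruncatedSVD

rng = np.random.default_rng(0)
# sparse, non-negative, counts-like toy matrix
counts = sp.random(1000, 500, density=0.02, format="csr", random_state=0,
                   data_rvs=lambda n: rng.integers(1, 10, n)).astype(float)

svd = TruncatedSVD(n_components=20, random_state=0)   # works directly on sparse input
svd_features = svd.fit_transform(counts)

nmf = NMF(n_components=20, init="nndsvd", max_iter=500, random_state=0)  # non-negative factors
nmf_features = nmf.fit_transform(counts)

print(svd_features.shape, nmf_features.shape)   # (1000, 20) (1000, 20)
```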

Practical Notes

  • Matrix factorization is a very general approach for dimensionality reduction and feature extraction
  • Can be applied to transform categorical features into real-valued ones (see the sketch below)
  • Many tricks suitable for linear models can be useful for MF.
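
As a sketch of that categorical-to-real-valued idea: one-hot encoding followed by TruncatedSVD. The toy dataframe, column names, and number of components are assumptions.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "city":   ["NY", "LA", "SF", "NY", "LA", "SF", "NY", "LA"],
    "device": ["ios", "android", "ios", "web", "web", "android", "ios", "web"],
})

encoder = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),         # sparse 0/1 matrix
    TruncatedSVD(n_components=3, random_state=0),   # dense, real-valued factors
)
real_valued = encoder.fit_transform(df[["city", "device"]])
print(real_valued.shape)   # (8, 3)
```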

XGBoost & LightGBM

Inputs

| XGBoost | LightGBM |
| --- | --- |
| max_depth | max_depth / num_leaves |
| subsample | bagging_fraction |
| colsample_bytree, colsample_bylevel | feature_fraction |
| min_child_weight, lambda, alpha | min_data_in_leaf, lambda_l1, lambda_l2 |
| eta, num_round | learning_rate, num_iterations |
| seed | *_seed |
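
A rough sketch of how these corresponding parameters might be set side by side; the numeric values are placeholders, not tuned recommendations, and the commented training calls assume hypothetical `dtrain` / `dtrain_lgb` dataset objects.

```python
import lightgbm as lgb
import xgboost as xgb

xgb_params = {
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "colsample_bylevel": 1.0,
    "min_child_weight": 5,
    "lambda": 1.0,      # L2 regularization
    "alpha": 0.0,       # L1 regularization
    "eta": 0.05,
    "seed": 0,
}

lgb_params = {
    "num_leaves": 127,          # roughly 2**max_depth - 1
    "bagging_fraction": 0.8,
    "bagging_freq": 1,          # bagging_fraction only applies when bagging_freq > 0
    "feature_fraction": 0.8,
    "min_data_in_leaf": 5,
    "lambda_l2": 1.0,
    "lambda_l1": 0.0,
    "learning_rate": 0.05,
    "seed": 0,                  # or set bagging_seed / feature_fraction_seed individually
}

# num_round / num_iterations correspond to the boosting-round argument of the train call:
# booster = xgb.train(xgb_params, dtrain, num_boost_round=1000)
# gbm     = lgb.train(lgb_params, dtrain_lgb, num_boost_round=1000)
```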

Sklearn

Random Forest / Extra Trees

  • n_estimators (the higher the better; see the sketch after this list)
  • max_depth
  • max_features
  • min_samples_leaf
  • criterion
  • random_state
  • n_jobs
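
A quick sketch of these knobs on synthetic data; the dataset and the parameter values are arbitrary, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,        # the higher the better, with diminishing returns
    max_depth=None,
    max_features="sqrt",
    min_samples_leaf=3,
    criterion="gini",
    random_state=0,
    n_jobs=-1,
).fit(X, y)

et = ExtraTreesClassifier(n_estimators=500, random_state=0, n_jobs=-1).fit(X, y)
print(rf.score(X, y), et.score(X, y))
```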

Neural Nets

  • Number of neurons per layer
  • Number of layers
  • Optimizers (SGD + momentum, or adaptive methods like Adam, Adadelta, Adagrad; in practice the adaptive ones tend to lead to more overfitting; see the sketch after this list)
  • Batch size
  • Learning rate
  • Regularization (L1/L2 for weights, Dropout/Dropconnect, Static DropConnect)
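
A minimal Keras sketch touching each hyperparameter above; the architecture, layer sizes, and values are arbitrary assumptions, and `X_train` / `y_train` in the commented fit call are hypothetical.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),   # L2 on weights
    layers.Dropout(0.3),                                      # dropout regularization
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),  # SGD + momentum
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# batch size and learning rate enter through fit() and the optimizer:
# model.fit(X_train, y_train, batch_size=128, epochs=20, validation_split=0.2)
```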

Mean Encoding

Useful on very big datasets, where it can make the classes more separable.

Comparison

  • Label encoding: random order, no correlation with target
  • Mean encoding: helps separate 0s from 1s, reaching a better loss with shorter trees.

XGBoost & LightGBM can’t handle high-cardinality categorical features well.

The more complicated and nonlinear the feature-target dependency, the more effective mean encoding is (a naive version is sketched below).
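
A naive (leaky) mean-encoding sketch on a toy dataframe, before any of the regularization discussed below; the column names are made up.

```python
import pandas as pd

df = pd.DataFrame({"city":   ["NY", "LA", "NY", "SF", "LA", "NY"],
                   "target": [1, 0, 1, 0, 1, 0]})

# each category is replaced by the mean of the target over that category
means = df.groupby("city")["target"].mean()
df["city_mean_enc"] = df["city"].map(means)
print(df)
```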

Notes:

  • Validation should be impeccable.
  • The basics are the same as for stacking.
  • Regularization is needed.

Regularization: battling target variable leakage

  1. Cross-validation loop inside the training data (4-5 folds is usually robust; see the sketch after this list)
  2. Smoothing
  3. Adding random noise
  4. Sorting & Calculating expanding mean
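
A sketch combining the cross-validation loop with smoothing on toy data; the fold count, the smoothing strength `alpha`, and the column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
df = pd.DataFrame({"city":   rng.choice(["NY", "LA", "SF"], size=100),
                   "target": rng.integers(0, 2, size=100)})

global_mean = df["target"].mean()
alpha = 10                      # smoothing strength
df["city_enc"] = np.nan

for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    train = df.iloc[train_idx]
    stats = train.groupby("city")["target"].agg(["mean", "count"])
    # smoothed mean: pulls rare categories toward the global mean
    smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (stats["count"] + alpha)
    df.loc[df.index[valid_idx], "city_enc"] = df.iloc[valid_idx]["city"].map(smoothed)

# categories unseen in a fold fall back to the global mean
df["city_enc"] = df["city_enc"].fillna(global_mean)
print(df.head())
```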

Notes on the expanding mean:

  • Least amount of leakage
  • No hyperparameters
  • Irregular encoding quality
  • Built into CatBoost (see the sketch below)
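
A sketch of the expanding-mean encoding on toy data, similar in spirit to CatBoost's ordered target statistics; the dataframe and column names are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"city":   rng.choice(["NY", "LA", "SF"], size=10),
                   "target": rng.integers(0, 2, size=10)})
# in practice, shuffle rows (or order them by time) before encoding

# for each row, use only the targets of *previous* rows of the same category
prev_sum = df.groupby("city")["target"].cumsum() - df["target"]
prev_count = df.groupby("city").cumcount()
df["city_exp_enc"] = prev_sum / prev_count            # NaN at a category's first occurrence
df["city_exp_enc"] = df["city_exp_enc"].fillna(df["target"].mean())
print(df)
```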

Generalization & Extensions

  1. Using the target variable in other tasks: regression, multiclass
  2. Domains with many-to-many relations
  3. Time Series
  4. Encoding interactions and numerical features

Pros:

  • Compact transformation of categorical variables
  • Powerful basis for feature engineering

Cons:

  • Need careful validation, can overfit
  • Significant improvement only on specific datasets