Data Science Practical Guides
2019-11-02
Kaggle Competition Pipeline
Problem | Feature Engineering | Modeling |
---|---|---|
Image Classification | Scaling, shifting, rotation | CNNs (ResNet, VGG, DenseNet) |
Sound Classification | Fourier transforms, spectrograms, scaling | CNNs (CRNN), LSTMs |
Text Classification | TF-IDF, SVD, stemming, spell checking, stop-word removal, n-grams | GBMs, Linear, DL, Naive Bayes, KNNs, LibFM, LibFFM |
Time Series | Lags, weighted averaging, exponential smoothing | Autoregressive models, ARIMA, Linear, GBMs, DL, LSTMs |
Categorical features | Target encoding, frequency encoding, one-hot, ordinal, label encoding | GBMs, Linear, DL, LibFM, LibFFM |
Numerical features | Scaling, binning, derivatives, outlier removal, dimensionality reduction | GBMs, Linear, DL, SVMs |
Interactions | Multiplications, divisions, group-by features, concatenations | GBMs, Linear, DL |
Recommenders | Features on transaction history, item popularity, frequency of purchase | CF, DL, LibFM, LibFFM, GBMs (e.g., the Acquire Valued Shoppers challenge) |
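As one concrete instance of the text-classification row, here is a minimal sketch (toy texts and labels are made up): TF-IDF over word n-grams feeding a linear model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus; in a competition these would come from train.csv / test.csv
texts = ["great product, works well", "terrible, broke after a day",
         "works as advertised", "awful quality, do not buy"]
labels = [1, 0, 1, 0]

# TF-IDF over word 1-grams and 2-grams feeding a linear model
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["quality product, works great"]))
```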
Ensembling
- All this time, predictions on internal validation & test are saved. If collaborating with others, this is the point where everyone passes on their predictions.
- Different ways to combine, from simple averaging to multilayer stacking (see the sketch after this list).
- Small data requires simpler ensemble techniques (averaging)
- Helps to average a few low-correlated predictions with good scores.
- Bigger data can utilize stacking.
- The stacking process repeats the whole modeling process, using the base models' predictions as features.
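A minimal sketch of the two simplest combining schemes, assuming the out-of-fold validation predictions and test predictions of two base models have already been saved (the arrays here are random stand-ins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for saved predictions: out-of-fold on validation plus test-set
# predictions from two base models (in practice these are loaded from disk).
y_valid = rng.integers(0, 2, size=1000)
oof_a = np.clip(y_valid + rng.normal(0, 0.4, 1000), 0, 1)
oof_b = np.clip(y_valid + rng.normal(0, 0.5, 1000), 0, 1)
test_a, test_b = rng.random(500), rng.random(500)

# 1) Simple (weighted) averaging -- the safer choice for small data
blend_test = 0.5 * test_a + 0.5 * test_b

# 2) Stacking -- fit a meta-model on the out-of-fold predictions
meta = LogisticRegression()
meta.fit(np.column_stack([oof_a, oof_b]), y_valid)
stack_test = meta.predict_proba(np.column_stack([test_a, test_b]))[:, 1]
```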
Matrix Factorization
- Several MF methods can be found in sklearn
- SVD & PCA: Standard tools for matrix factorization
- Truncated SVD: works with sparse matrices
- Non-negative Matrix Factorization (NMF): ensures that all latent factors are non-negative; good for counts-like data
Practical Notes
- Matrix factorization is a very general approach for dimensionality reduction and feature extraction
- Can be applied to transform categorical features into real-valued ones
- Many tricks suitable for linear models can be useful for MF.
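For example, a minimal sketch of the two sklearn factorizers mentioned above, applied to a sparse non-negative matrix standing in for count or TF-IDF features (shapes and component counts are illustrative):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import NMF, TruncatedSVD

# Sparse, non-negative toy matrix standing in for counts / TF-IDF features
X = sparse_random(1000, 2000, density=0.01, random_state=0, format="csr")

# TruncatedSVD works directly on sparse input
svd_features = TruncatedSVD(n_components=30, random_state=0).fit_transform(X)

# NMF keeps all latent factors non-negative, which suits count-like data
nmf_features = NMF(n_components=30, init="nndsvd", max_iter=500,
                   random_state=0).fit_transform(X)
```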
XGBoost & LightGBM
Inputs
XGBoost | LightGBM |
---|---|
max_depth | max_depth / num_leaves |
subsample | bagging_fraction |
colsample_bytree, colsample_bylevel | feature_fraction |
min_child_weight, lambda, alpha | min_data_in_leaf, lambda_l1, lambda_l2 |
eta, num_round | learning_rate, num_iterations |
seed | *_seed |
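A minimal sketch of how the corresponding parameters above are passed to the two libraries' native training APIs (the data is a random toy set and the values are illustrative, not tuned):

```python
import numpy as np
import xgboost as xgb
import lightgbm as lgb

# Toy data just to make the sketch runnable
X = np.random.rand(500, 10)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

xgb_params = {
    "objective": "binary:logistic",
    "max_depth": 6, "subsample": 0.8,
    "colsample_bytree": 0.8, "min_child_weight": 1,
    "lambda": 1.0, "alpha": 0.0,
    "eta": 0.05, "seed": 0,
}
lgb_params = {
    "objective": "binary",
    "num_leaves": 63, "bagging_fraction": 0.8, "bagging_freq": 1,
    "feature_fraction": 0.8, "min_data_in_leaf": 20,
    "lambda_l1": 0.0, "lambda_l2": 1.0,
    "learning_rate": 0.05, "seed": 0,
}

xgb_model = xgb.train(xgb_params, xgb.DMatrix(X, label=y), num_boost_round=100)
lgb_model = lgb.train(lgb_params, lgb.Dataset(X, label=y), num_boost_round=100)
```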
Sklearn
Random Forest / Extra Trees
- n_estimators (the higher the better)
- max_depth
- max_features
- min_samples_leaf
- criterion
- random_state
- n_jobs
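For instance, a minimal sketch wiring these parameters into sklearn (the values are illustrative, not tuned):

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,      # the higher the better, at the cost of training time
    max_depth=None,        # let trees grow fully, or cap it to regularize
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=1,
    criterion="gini",
    random_state=0,
    n_jobs=-1,             # use all cores
)
et = ExtraTreesClassifier(n_estimators=500, random_state=0, n_jobs=-1)
```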
Neural Nets
- Number of neurons per layer
- Number of layers
- Optimizers (SGD + momentum vs. adaptive methods such as Adam, Adadelta, Adagrad; the adaptive ones tend to lead to more overfitting in practice)
- Batch size
- Learning rate
- Regularization (L1/L2 for weights, Dropout/Dropconnect, Static DropConnect)
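A minimal sketch of these knobs in Keras, assuming a binary-classification problem on tabular data (layer sizes, rates, and batch size are illustrative, and data loading is omitted):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input(shape=(100,)),                           # 100 input features (illustrative)
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 on weights
    layers.Dropout(0.3),                                   # dropout regularization
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),  # or Adam(...)
    loss="binary_crossentropy",
)
# model.fit(X_train, y_train, batch_size=128, epochs=20, validation_data=(X_val, y_val))
```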
Mean Encoding
Most useful on very large datasets with high-cardinality categorical features, where it helps make the classes separable.
Comparison
- Label encoding: random order, no correlation with target
- Mean encoding: separates 0s from 1s, reaching a better loss with shallower trees.
XGBoost & LightGBM do not handle high-cardinality categorical features well.
The more complicated and nonlinear the feature-target dependency, the more effective mean encoding is.
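For illustration, a naive, unregularized mean encoding on a toy pandas frame (column names are made up); as the notes and the regularization subsection below stress, this version leaks the target:

```python
import pandas as pd

train = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B", "C"],
    "target": [1, 0, 1, 1, 0, 1],
})

# Naive mean encoding: replace each category by the mean target within it.
# This leaks the target and overfits -- see the regularization options below.
means = train.groupby("city")["target"].mean()
train["city_mean_enc"] = train["city"].map(means)
```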
Notes:
- Validation should be impeccable.
- Basics for stacking
- Need regularization
Regularization: battling target-variable leakage
- Cross-validation loop inside the training data (4-5 folds is usually robust)
- Smoothing
- Adding random noise
- Sorting & Calculating expanding mean
Notes on the expanding mean:
- Least amount of leakage
- No hyperparameters
- Irregular encoding quality
- Built into CatBoost
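A minimal pandas sketch of the expanding-mean encoding (same made-up toy frame as before):

```python
import pandas as pd

train = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B", "C"],
    "target": [1, 0, 1, 1, 0, 1],
})

# Fix (or randomize) the row order once, then encode each row using only the
# target values of earlier rows in the same category.
cumsum = train.groupby("city")["target"].cumsum() - train["target"]
cumcnt = train.groupby("city").cumcount()
train["city_exp_mean_enc"] = cumsum / cumcnt   # first occurrences become NaN
# NaNs (first occurrences) are typically filled with the global target mean.
```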
Generalization & Extensions
- Using the target variable in other task types: regression, multiclass
- Domains with many-to-many relations
- Time Series
- Encoding interactions and numerical features
Pros:
- Compact transformation of categorical variables
- Powerful basis for feature engineering
Cons:
- Need careful validation, can overfit
- Significant improvement only on specific datasets