Kick-start Tree Ensembles without the Titanic Clichés

A hands-on guide to tuning Gradient-Boosted Trees, preventing overfitting, and demystifying ensembles.

Published: Jul 15, 2025

Building predictive models on the famous Kaggle Titanic dataset is a rite of passage for many ML beginners. Most tutorials reach for Random Forests or even logistic regression as "just works" solutions. Gradient-Boosted Trees (GBTs) often get a bad rap in this toy example: out of the box they can easily overfit the small Titanic data. But avoiding boosting entirely means you might miss out on better accuracy. You can use GBTs, as long as you apply some extra care. In this article I'll show how to cross-validate safely, use automated tuning (Optuna's TPE sampler), and even try simple ensembles. Along the way I'll clarify key ensemble concepts and evaluation metrics that many beginners overlook. Let's dive in step-by-step.

Want to code along?

I’ve published the full Jupyter notebook — data prep, Optuna tuning, SHAP plots, and Kaggle-ready submission — right here:

Complete Titanic GBT Notebook

Feel free to fork it as you read.

Random Forest vs. Gradient Boosted Trees on Small Data

Tree ensembles are powerful, but not all are created equal. Random Forests use bagging: they train many deep trees on random subsets of the data and average their votes to reduce variance. In contrast, Gradient Boosted Trees build trees sequentially, each one correcting the errors of the previous ones. Because boosting adds trees greedily, a GBT model can fit noise in a small dataset. In fact, the scikit-learn docs note that bagging methods reduce overfitting and work best with strong base models, while boosting methods usually work best with weak learners (shallow trees).

On the Titanic (891 training rows), this means a default Random Forest often “just works” out of the box, while a default GBT might overfit. More training data can itself act as a form of regularization, so with more data boosting tends to shine. But on small data you must add regularization (like shallow trees, subsampling, shrinkage) manually. In practice, you can use GBT on Titanic if you carefully tune its hyperparameters and maybe use techniques like early stopping.
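To make that concrete, here is a minimal sketch of a regularized scikit-learn GBT with early stopping, assuming X and y hold the prepared Titanic training features and labels; the specific values are illustrative starting points, not tuned results.

from sklearn.ensemble import GradientBoostingClassifier

# Shallow trees, shrinkage, and row subsampling all act as regularizers on small data.
gbt = GradientBoostingClassifier(
    n_estimators=500,         # generous upper bound; early stopping usually halts far sooner
    learning_rate=0.05,       # shrinkage: smaller steps per tree, less overfitting
    max_depth=3,              # weak learners (shallow trees)
    subsample=0.8,            # stochastic boosting: each tree sees 80% of the rows
    validation_fraction=0.1,  # hold out 10% of the training data for early stopping
    n_iter_no_change=20,      # stop when 20 rounds bring no validation improvement
    random_state=42,
)
gbt.fit(X, y)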

Avoiding Data Leakage with Cross-Validation

To truly guard against overfitting, use cross-validation with a proper pipeline. Always fit preprocessing steps on the training fold only, then apply to the validation fold. Never use information from the held-out set when you train. The scikit-learn docs warn that you should “never call fit on the test data” and that pipelines are ideal for cross-validation. In code, you’d wrap your preprocessing and model in a Pipeline, then run cross_val_score or cross_val_predict. For example:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Preprocessing and the model live in one Pipeline, so every CV fold
# fits the scaler on its own training portion only.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('gbt', GradientBoostingClassifier())
])
scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy:", scores.mean())

In each fold, the StandardScaler and the GBT are fit on only the training portion, then used to transform and score the validation portion. This prevents information (such as a global mean used for scaling or imputation) from leaking into your validation set, so the cross-validation score is a realistic rather than overly optimistic estimate of performance.
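On Titanic-style data with a mix of numeric and categorical columns, the same idea extends to imputation and encoding. Here is a minimal sketch (assuming the usual Age, Fare, Sex, Embarked, and Pclass columns and a raw DataFrame X) in which every preprocessing step is fit inside each fold:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

numeric = ['Age', 'Fare']
categorical = ['Sex', 'Embarked', 'Pclass']

preprocess = ColumnTransformer([
    # Median imputation and scaling, fit on the training fold only
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric),
    # Most-frequent imputation plus one-hot encoding for categoricals
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical),
])

pipe = Pipeline([('prep', preprocess), ('gbt', GradientBoostingClassifier())])
scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy:", scores.mean())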

Automated Hyperparameter Tuning with Optuna

Gradient-boosted models have many knobs (number of trees, learning rate/shrinkage, max depth, subsample ratio, regularization weights, etc.). Tuning them by hand or with grid search can be very slow. Instead, you can use an automated search. Optuna is a library that implements efficient hyperparameter tuning using a Bayesian approach (the Tree-structured Parzen Estimator, or TPE). Optuna's TPE sampler builds a probabilistic model of the objective scores and focuses on promising regions of the search space. This is often much faster than exhaustive grid search or blind random search. With a study created with direction="maximize", the default TPESampler steers the search toward parameter values that maximize your validation metric (e.g. accuracy).

For example, you might write something like:

import optuna
from optuna.samplers import TPESampler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Search space: tree count, depth, and a log-scaled learning rate
    n_trees = trial.suggest_int("n_trees", 100, 1000)
    max_depth = trial.suggest_int("max_depth", 3, 10)
    lr = trial.suggest_float("learning_rate", 0.01, 0.2, log=True)
    model = GradientBoostingClassifier(n_estimators=n_trees, max_depth=max_depth, learning_rate=lr)
    # Mean 3-fold CV accuracy is the value TPE tries to maximize
    score = cross_val_score(model, X, y, cv=3).mean()
    return score

study = optuna.create_study(direction="maximize", sampler=TPESampler(seed=42))
study.optimize(objective, n_trials=50)
print("Best params:", study.best_params)

By default, study.optimize simply runs the requested number of trials (50 here); to stop earlier once scores plateau, you can use Optuna's pruners or a custom callback that calls study.stop(). The key point is that automated tuning with a Bayesian sampler finds good hyperparameters more efficiently than naive methods. In my experiments, this helped GBTs converge to a strong model without endless guessing.
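For illustration, here is a minimal sketch of one such custom callback (not part of the notebook); it stops the study after a chosen number of trials without improvement, and the patience value is arbitrary.

class PlateauStopper:
    # Stop the study once `patience` consecutive trials fail to improve the best value.
    def __init__(self, patience=15):
        self.patience = patience
        self.stale = 0
        self.best = None

    def __call__(self, study, trial):
        if self.best is None or study.best_value > self.best:
            self.best = study.best_value
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                study.stop()

study.optimize(objective, n_trials=200, callbacks=[PlateauStopper(patience=15)])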

Building Ensembles: Bagging, Boosting, Voting, and Stacking

All the tree methods I use (Random Forest, YDF’s GBT, XGBoost) are ensembles of decision trees — but they ensemble in different ways. To recap the terminology:

  • Bagging (Random Forest): Train many independent trees on random subsets (samples and/or features) and average their predictions. Bagging mainly reduces variance, which is why it often guards against overfitting.
  • Boosting (GBT / XGBoost): Build trees sequentially, where each new tree tries to fix errors from the previous ones. Boosting can reduce bias by fitting residuals, but without care it can overfit small data.
  • Voting: Combine several models by majority vote or by averaging their predicted probabilities (soft voting). Voting adds no extra learning stage of its own; it simply blends the base models' predictions.
  • Stacking: Train multiple base models, then train a meta-model on their outputs. The idea is that a meta-learner (like a logistic regression) can learn how to weight each base model’s vote. In practice, stacking often ends up performing about as well as the single best base model.

I tried both soft voting and stacking with my XGBoost and YDF-GBT classifiers. For example:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Stacking: a logistic-regression meta-learner weights the base models' predictions
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier()), ('gbt', GradientBoostingClassifier())],
    final_estimator=LogisticRegression()
)
stack.fit(X_train, y_train)

# Soft voting: average the base models' predicted probabilities
voting = VotingClassifier(
    estimators=[('rf', RandomForestClassifier()), ('gbt', GradientBoostingClassifier())],
    voting='soft'
)
voting.fit(X_train, y_train)

In my Titanic runs, stacking/voting did not significantly beat the best GBT model. This aligns with theory: stacking “is as good as the best predictor of the base layer” and can sometimes combine strengths. On this small dataset, the YDF GBT alone was already very strong, so the ensemble couldn’t improve much. But it’s worth understanding these options. Many Kaggle winners do use complex stacking (often with many different algorithms) to eke out small gains.

Beyond Accuracy: Choosing the Right Metric

In the Kaggle Titanic competition, accuracy is the official score (you want to maximize percentage correct). But accuracy is not always the best measure, especially if classes are imbalanced or if different errors have different costs. About 38% of Titanic passengers survived, so the classes aren’t extremely skewed, but still consider other metrics for insight.

Two common AUC metrics are ROC-AUC and Precision-Recall AUC. ROC-AUC (area under the ROC curve) summarizes the trade-off between true positive rate and false positive rate across all thresholds. Precision-Recall (PR) AUC summarizes precision versus recall. Importantly, ROC-AUC can be overly optimistic on imbalanced data, because a large pool of negatives keeps the false positive rate low even when many positive predictions are wrong; PR-AUC, by contrast, focuses on performance for the positive class. In fact, precision-recall curves are recommended for highly skewed domains, where ROC curves can give an "excessively optimistic view" of performance.
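As a rough sketch of how you might compute both on out-of-fold predictions (reusing a pipeline pipe and the training data X, y from the earlier examples):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score, average_precision_score

# Out-of-fold probability of the positive class (survived)
proba = cross_val_predict(pipe, X, y, cv=5, method='predict_proba')[:, 1]

print("ROC-AUC:", roc_auc_score(y, proba))
# average_precision_score is a standard summary of the precision-recall curve
print("PR-AUC :", average_precision_score(y, proba))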

You should also inspect the confusion matrix of predictions. A confusion matrix tabulates true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN). From these we define:

  • Precision = TP / (TP + FP): fraction of predicted positives that are correct.
  • Recall (Sensitivity) = TP / (TP + FN): fraction of actual positives correctly found.
  • Specificity = TN / (TN + FP): fraction of actual negatives correctly found.
  • F1 score = 2⋅(precision⋅recall)/(precision+recall), the harmonic mean of precision and recall.

Depending on the problem, you might care more about recall (catching all survivors) or specificity. The F1 score balances precision and recall. Crucially, F1 is often preferable to raw accuracy on imbalanced data; as Google's ML Crash Course notes, F1 and related metrics are generally better when classes are skewed. I used accuracy during tuning only because the competition metric was accuracy. But in a real application you'd look at ROC-AUC for a balanced-class problem, PR-AUC for skewed cases, and precision/recall trade-offs via the confusion matrix.
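To make these concrete, here is a minimal sketch that derives the same quantities from out-of-fold predictions (again reusing pipe, X, and y from earlier):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

preds = cross_val_predict(pipe, X, y, cv=5)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y, preds).ravel()
print("Specificity:", tn / (tn + fp))

print("Precision:", precision_score(y, preds))
print("Recall   :", recall_score(y, preds))
print("F1 score :", f1_score(y, preds))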

Key Takeaways

  • Use GBT on Titanic if you tune it carefully. Gradient boosting can outperform random forests, but on small data it overfits easily. More data naturally regularizes boosting, but on Titanic I had to rely on parameter tuning and early stopping.
  • Cross-validate with pipelines to avoid leakage. Always fit transformations on the training fold and then apply them to the validation fold. For example, wrap scalers, encoders, etc., in a Pipeline and then call cross_val_score. This keeps test information separate.
  • Automate hyperparameter search with TPE/Optuna. The Tree-structured Parzen Estimator (TPE) in Optuna is a Bayesian search method. It typically finds good parameters faster than brute-force grid or random search. You can also use pruning callbacks to stop early when improvements stall.
  • Ensemble concepts matter. Random Forest (bagging) reduces variance with deep trees. GBT (boosting) sequentially fits residuals. Voting and stacking are meta-ensembles. Stacking can sometimes boost performance, but often only matches the best base model. In this case, neither stacking nor voting surpassed the tuned GBT.
  • Accuracy isn’t everything. For balanced classes, ROC-AUC is a good general metric. For skewed classes, Precision-Recall AUC is more informative. Check the confusion matrix to compute precision and recall. The F1 score (harmonic mean of precision and recall) is often better than accuracy on imbalanced data. Align your metric with the business goal (sensitivity vs specificity).

For more tips on tree ensembles, model tuning, and Kaggle techniques, follow me on X @CoffeyFirst.

© 2025 James Coffey, All rights reserved