Ensemble Learning

From Zero to Practice: Principles + Formulas + Code + Visualization + the Three Powerhouses

One-sentence definition
Ensemble learning = combine several "weak learners" (individually not very accurate) into one "strong learner" (much more accurate).


1. Core Idea (an Analogy)

| Everyday scenario | Ensemble-learning counterpart |
|---|---|
| Doctors holding a case conference | Several doctors vote → a more reliable diagnosis |
| Crowd wisdom | 100 ordinary people voting can beat 1 expert |
| Grading an exam | Several teachers grade and the scores are averaged |

The essence
"Three cobblers put together match Zhuge Liang", i.e. many heads are better than one:
multiple models → vote / average → lower error and less overfitting.
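
A quick sanity check of this idea, comparing one decision tree against a bag of 100 trees on the breast-cancer dataset used throughout this post (a minimal sketch with arbitrary settings; BaggingClassifier's default base estimator is a decision tree):

# Quick sanity check: one decision tree vs. a bag of 100 decision trees
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

single = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
bagged = BaggingClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)  # default base estimator: a decision tree

print(f"Single tree     : {single.score(X_test, y_test):.4f}")
print(f"Bag of 100 trees: {bagged.score(X_test, y_test):.4f}")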


2. The Three Ensemble Methods

| Method | Core idea | Representative algorithms | Mnemonic |
|---|---|---|---|
| Bagging | Train in parallel, reduce variance | Random Forest | Sample in parallel, let the votes decide |
| Boosting | Train sequentially, reduce bias | AdaBoost, XGBoost, LightGBM | Re-teach the mistakes, get smarter step by step |
| Stacking | Fuse models with a meta-learner | Stacking | Use model outputs as features, train one more layer |

3. Bagging: Random Forest

How it works (a hand-rolled sketch follows the formulas below)

  1. Bootstrap sampling (draw the training rows with replacement)
  2. Each tree only considers a random subset of features at each split (roughly sqrt(n_features))
  3. All trees are trained in parallel
  4. At prediction time: vote (classification) / average (regression)

Formulas

$$
\hat{y} = \text{mode}\left\{ \hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T \right\} \quad (\text{classification})
$$
$$
\hat{y} = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t \quad (\text{regression})
$$
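
Steps 1, 3 and 4 fit in a dozen lines of plain Python; here is a hand-rolled sketch (the helper name bagging_fit_predict is mine, not a library API, and the inputs are assumed to be NumPy arrays). Random Forest adds step 2 on top of this: each split considers only about sqrt(n_features) randomly chosen features, which decorrelates the trees further.

# Hand-rolled Bagging: bootstrap samples + independent trees + majority vote
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit_predict(X_train, y_train, X_test, T=50, seed=42):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    all_preds = np.empty((T, len(X_test)), dtype=int)
    for t in range(T):
        idx = rng.integers(0, n, size=n)          # 1. bootstrap: draw n rows with replacement
        tree = DecisionTreeClassifier(random_state=seed + t)
        tree.fit(X_train[idx], y_train[idx])      # 3. trees are independent -> trivially parallel
        all_preds[t] = tree.predict(X_test)
    # 4. classification: majority vote = mode(y_hat_1, ..., y_hat_T)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), axis=0, arr=all_preds)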


4. Boosting: XGBoost (the Competition Winner)

How it works

  1. Sequential training: each new tree corrects the errors of the previous ones
  2. Hard examples get more attention in the next round (larger sample weights in AdaBoost, larger gradients/residuals in gradient boosting and XGBoost)
  3. The final prediction is a weighted combination of all the trees

Objective function (simplified)

$$
\boxed{\mathrm{Obj} = \sum_{i} \mathrm{loss}(y_i, \hat{y}_i) + \sum_{t} \Omega(f_t)}
$$

  • $\mathrm{loss}$: the prediction error
  • $\Omega$: the regularization term (controls tree complexity; see the parameter mapping below)
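
In XGBoost the regularization term is roughly $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_j w_j^2$ (a tree with $T$ leaves and leaf weights $w_j$), so both halves of the objective map directly onto constructor arguments. A sketch with purely illustrative values, not tuned for any dataset:

# Mapping the objective onto XGBClassifier arguments (illustrative values only)
from xgboost import XGBClassifier

xgb_reg_demo = XGBClassifier(
    objective='binary:logistic',  # chooses the loss(y_i, y_hat_i) term
    n_estimators=200,             # number of boosting rounds (trees f_t)
    learning_rate=0.1,            # shrinks each tree's contribution
    # ---- the Omega(f_t) side: complexity penalties ----
    gamma=1.0,                    # gamma: cost per extra leaf (minimum loss reduction to split)
    reg_lambda=1.0,               # lambda: L2 penalty on leaf weights
    reg_alpha=0.0,                # optional L1 penalty on leaf weights
    max_depth=4,                  # hard cap on tree size
    random_state=42,
)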

5. Stacking: Model Fusion

graph TD
    A[Raw data] --> B[Model 1: RF]
    A --> C[Model 2: XGB]
    A --> D[Model 3: SVM]
    B --> E[Prediction 1]
    C --> F[Prediction 2]
    D --> G[Prediction 3]
    E & F & G --> H[Meta-model: Logistic Regression]
    H --> I[Final prediction]
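
scikit-learn ships this entire diagram as a single estimator, StackingClassifier. A minimal sketch (Section 7 builds the same pipeline by hand; the hyperparameters here are illustrative):

# The diagram above as one estimator: sklearn's StackingClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('xgb', XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')),
        ('svm', SVC(probability=True, random_state=42)),  # probability=True so the meta-model can see probabilities
    ],
    final_estimator=LogisticRegression(),  # the meta-model in the diagram
    cv=5,  # out-of-fold predictions become the meta-features, avoiding leakage
)
# Usage: stack.fit(X_train, y_train); stack.score(X_test, y_test)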

6. Full Python Hands-on (Running in 5 Minutes)

# ===== 1. Imports =====
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import warnings; warnings.filterwarnings('ignore')

# ===== 2. Load the data =====
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ===== 3. Train four ensemble models =====
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),
    'LightGBM': LGBMClassifier(n_estimators=100, random_state=42),
    'GBDT': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    results[name] = acc
    print(f"{name:<13}: {acc:.4f}")

Output

Random Forest: 0.9649
XGBoost      : 0.9561
LightGBM     : 0.9561
GBDT         : 0.9474

7. Stacking Fusion (a Further Boost!)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
import numpy as np

# 1. Base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('xgb', XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')),
    ('lgb', LGBMClassifier(n_estimators=100, random_state=42))
]

# 2. Build meta-features (out-of-fold cross-validated predictions)
meta_features = np.zeros((len(X_train), len(base_models)))
for i, (name, model) in enumerate(base_models):
    meta_features[:, i] = cross_val_predict(model, X_train, y_train, cv=5, method='predict_proba')[:, 1]

# 3. Meta-model
meta_model = LogisticRegression()
meta_model.fit(meta_features, y_train)

# 4. Predict on the test set (base models refit on the full training set)
meta_test = np.zeros((len(X_test), len(base_models)))
for i, (name, model) in enumerate(base_models):
    model.fit(X_train, y_train)
    meta_test[:, i] = model.predict_proba(X_test)[:, 1]

final_pred = meta_model.predict(meta_test)
print(f"Stacking 准确率: {accuracy_score(y_test, final_pred):.4f}")

Output

Stacking accuracy: 0.9737

Higher than any single model!
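
A lighter-weight fusion method, previewed as topic 4 at the end of this post, is soft voting: average the base models' predicted probabilities directly instead of training a meta-model. A minimal sketch, reusing the train/test split from Section 6:

# Soft voting: average the base models' predicted probabilities, no meta-model needed
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

voting = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('xgb', XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')),
        ('lgb', LGBMClassifier(n_estimators=100, random_state=42)),
    ],
    voting='soft',  # 'soft' averages predict_proba; 'hard' would take a majority vote of labels
)
voting.fit(X_train, y_train)  # X_train / y_train from Section 6
print(f"Soft voting accuracy: {voting.score(X_test, y_test):.4f}")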


8. Visualization: Feature Importance (Random Forest)

import matplotlib.pyplot as plt
import numpy as np

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
importances = rf.feature_importances_
feat_names = data.feature_names

# Sort and keep the 10 most important features
indices = np.argsort(importances)[::-1][:10]
plt.figure(figsize=(10, 6))
plt.barh(range(10), importances[indices])
plt.yticks(range(10), [feat_names[i] for i in indices])
plt.xlabel('Importance')
plt.title('Random Forest: Top 10 Feature Importances')
plt.gca().invert_yaxis()
plt.show()

9. The Three Powerhouses Compared

| Dimension | Random Forest | XGBoost | LightGBM |
|---|---|---|---|
| Speed | Medium | Fast | Fastest |
| Accuracy | High | Highest | Highest |
| Memory use | High | Medium | Lowest |
| Tuning | Simple | Complex | Moderate |
| Sparse data | So-so | Good | Best |
| Best suited for | General use | Kaggle competitions | Large datasets |

10. Tuning Cheat Sheet (XGBoost Example)

xgb = XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='logloss'
)
Recommended ranges

| Parameter | Suggested range |
|---|---|
| learning_rate | 0.01 ~ 0.1 |
| max_depth | 3 ~ 10 |
| subsample | 0.6 ~ 1.0 |
| colsample_bytree | 0.6 ~ 1.0 |
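
One straightforward way to use these ranges is a randomized search over them. A sketch with sklearn's RandomizedSearchCV (the budget n_iter=30 and 5-fold CV are illustrative choices, and X_train / y_train come from Section 6):

# Randomized search over the recommended ranges
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {
    'learning_rate': uniform(0.01, 0.09),   # 0.01 ~ 0.1  (uniform(loc, scale) spans [loc, loc + scale])
    'max_depth': randint(3, 11),            # 3 ~ 10
    'subsample': uniform(0.6, 0.4),         # 0.6 ~ 1.0
    'colsample_bytree': uniform(0.6, 0.4),  # 0.6 ~ 1.0
}

search = RandomizedSearchCV(
    XGBClassifier(n_estimators=500, random_state=42, eval_metric='logloss'),
    param_distributions=param_dist,
    n_iter=30, cv=5, scoring='accuracy', n_jobs=-1, random_state=42,
)
# Usage: search.fit(X_train, y_train); print(search.best_params_, search.best_score_)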

11. One-Shot Complete Code (Copy and Run)

# ===== Complete ensemble-learning workflow =====
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import numpy as np

# 1. Data
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Train the three powerhouse models
models = {
    'RF': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGB': XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),
    'LGB': LGBMClassifier(n_estimators=100, random_state=42)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    print(f"{name}: {acc:.4f}")

# 3. Feature importance (RF)
importances = models['RF'].feature_importances_
top10 = np.argsort(importances)[-10:]
plt.figure(figsize=(8, 6))
plt.barh(data.feature_names[top10], importances[top10])
plt.title('Top 10 Feature Importances')
plt.gca().invert_yaxis()
plt.show()

12. Summary Formula Card

| Method | Prediction formula |
|---|---|
| Bagging | $\hat{y} = \text{mode/mean}(\hat{y}_1, \ldots, \hat{y}_T)$ |
| Boosting | $\hat{y} = \sum_t \alpha_t f_t(x)$ |
| Stacking | $\hat{y} = g(h_1(x), h_2(x), \ldots, h_T(x))$ |

Where would you like to go deeper?

  1. Hand-coding Bagging + Boosting from scratch
  2. Auto-tuning XGBoost with Optuna
  3. A hands-on Kaggle project (Titanic / house prices)
  4. Soft-voting fusion with VotingClassifier

Reply with 1–4 and we'll jump straight into practice!
