Up and running in 5 minutes + 30 minutes of hands-on practice + copy-and-run code
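1. Environment setup (3 minutes)
# 1. Install Python (3.9+ recommended)
# 2. Install the core libraries (pip shown here; conda works too)
pip install numpy pandas scikit-learn matplotlib jupyter
Recommended tools:
- Jupyter Notebook (write code and view plots in one place)
- VS Code / PyCharm
- Google Colab (no setup, free GPU tier)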
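A quick way to confirm the install worked is to import each library and print its version. This is only a small sanity-check sketch; the one version detail worth knowing is that `DecisionBoundaryDisplay`, used later in this guide, was added in scikit-learn 1.1.

```python
# Sanity check: import the libraries installed above and print their versions.
import sys

import matplotlib
import numpy
import pandas
import sklearn

modules = (numpy, pandas, sklearn, matplotlib)

print(f"{'python':<12}: {sys.version.split()[0]}")
for module in modules:
    print(f"{module.__name__:<12}: {module.__version__}")
```

If any import fails, rerun the `pip install` line in the same environment that runs your notebook or script.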
2. First step in machine learning: get to know the data
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Load the classic Iris dataset (150 samples, 3 species)
iris = load_iris()
X = iris.data # features: sepal length, sepal width, petal length, petal width
y = iris.target # labels: 0=Setosa, 1=Versicolor, 2=Virginica
# Convert to a DataFrame and take a look
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = y
print(df.head())
Output:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
...
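Beyond `head()`, two summaries are worth a look before training: how many samples each class has, and how the features differ between classes. The sketch below is self-contained and replaces the numeric labels with species names for readability; the variable names `counts` and `means` are mine, not part of the original walkthrough.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Replace the numeric labels 0/1/2 with readable species names.
df["species"] = iris.target_names[iris.target]

# Class balance: Iris has 50 samples of each species.
counts = df["species"].value_counts()
print(counts)

# Per-species feature means: a quick hint at which features separate the classes.
means = df.groupby("species").mean()
print(means.round(2))
```

In the means table, the petal columns differ far more between species than the sepal columns, which is why the later plots use petal length as one of the two axes.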
3. The full workflow: from data to prediction (5 steps)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
# 1. Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 2. Choose a model (decision tree)
model = DecisionTreeClassifier(max_depth=3, random_state=42)
# 3. Train the model
model.fit(X_train, y_train)
# 4. Predict
y_pred = model.predict(X_test)
# 5. Evaluate (with only 30 test samples, each mistake costs about 0.033 accuracy)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Illustrative output (your numbers depend on the split and your scikit-learn version):
Accuracy: 0.967
precision recall f1-score support
setosa 1.00 1.00 1.00 10
versicolor 1.00 0.92 0.96 13
virginica 0.88 1.00 0.93 7
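The precision and recall figures in a classification report are all derived from the confusion matrix, so printing it directly shows which classes get mixed up. A minimal, self-contained sketch that repeats the split and model settings from the steps above:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes (order: 0, 1, 2).
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(iris.target_names)
print(cm)

# Correct predictions sit on the diagonal, so accuracy = diagonal sum / total.
acc_from_cm = cm.trace() / cm.sum()
print(f"Accuracy from matrix: {acc_from_cm:.3f}")
```

Any non-zero entry off the diagonal is a specific mistake: row `i`, column `j` counts samples of class `i` that were predicted as class `j`.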
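4. Visualization: see how the model separates the classes
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
# Use two features so the decision boundary can be drawn in 2D
X_vis = X[:, [0, 2]] # sepal length + petal length
model_vis = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_vis, y)
# Draw everything on one Axes; without ax=, from_estimator opens its own figure
fig, ax = plt.subplots(figsize=(8, 6))
DecisionBoundaryDisplay.from_estimator(
    model_vis, X_vis, cmap='Pastel1', response_method="predict", ax=ax
)
ax.scatter(X_vis[:, 0], X_vis[:, 1], c=y, edgecolor='k', cmap='Set1')
ax.set_xlabel('Sepal length (cm)')
ax.set_ylabel('Petal length (cm)')
ax.set_title('Decision Tree Decision Boundary')
plt.show()
You should see three colored regions, one per species: the model has learned to tell the flowers apart using these two lengths.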
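The rectangular regions in the plot come from simple threshold rules on the two features. As a sketch, scikit-learn's `export_text` prints those rules as plain text, and `feature_importances_` shows how much each feature contributes; the `names` list here is just my labeling of the two columns.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_vis = iris.data[:, [0, 2]]  # sepal length + petal length
names = ["sepal length (cm)", "petal length (cm)"]

model_vis = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_vis, iris.target)

# Each line is one threshold test; leaves show the predicted class index.
rules = export_text(model_vis, feature_names=names)
print(rules)

# Importances sum to 1; a larger value means the feature drives more of the splits.
importances = dict(zip(names, model_vis.feature_importances_))
for name, value in importances.items():
    print(f"{name:<20}: {value:.3f}")
```

Every threshold in the printed rules corresponds to one horizontal or vertical edge in the decision-boundary plot.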
5. Going further: compare 6 models with a short loop
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
models = {
    'Logistic Regression': LogisticRegression(max_iter=200),  # extra iterations avoid a convergence warning
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Naive Bayes': GaussianNB()
}
# Use a separate loop variable so the `model` from section 3 is not overwritten
for name, clf in models.items():
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    print(f"{name:<20}: {acc:.3f}")
Illustrative output (your numbers may differ):
Logistic Regression : 1.000
KNN                 : 1.000
SVM                 : 1.000
Decision Tree       : 0.967
Random Forest       : 0.967
Naive Bayes         : 1.000
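With only 30 test samples, a single split can rank models almost arbitrarily, since one mistake shifts accuracy by about 0.033. A common remedy is k-fold cross-validation. The sketch below is my addition rather than part of the original comparison: it scores three of the models with 5-fold cross-validation and, as a further assumption, wraps the scale-sensitive ones in a `StandardScaler` pipeline.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

cv_results = {}
for name, estimator in candidates.items():
    # cv=5 trains and scores each model on 5 different train/validation splits.
    scores = cross_val_score(estimator, X, y, cv=5)
    cv_results[name] = scores
    print(f"{name:<20}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting the scaler inside the pipeline matters: it is refit on each training fold, so no information from the validation fold leaks into preprocessing.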
6. Hands-on project: predicting house prices (regression)
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load the California housing data (downloaded on first use)
data = fetch_california_housing()
X, y = data.data, data.target # target: median house value, in units of $100,000
# Train a linear regression model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print(f"Predicted value for the first test house: {pred[0]:.2f} (x $100,000)")
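MSE is in squared units, which makes it hard to interpret. Two common companions are RMSE (same units as the target) and R², along with a comparison against a trivial baseline. To keep this sketch runnable offline, it uses scikit-learn's bundled diabetes dataset as a stand-in for the California housing data; that substitution is my assumption, and the same metric calls apply unchanged to the housing model.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in dataset (bundled with scikit-learn, no download needed).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)

mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)            # back in the units of the target
r2 = r2_score(y_test, pred)    # 1.0 is perfect, 0.0 matches predicting the mean

# Baseline: always predict the mean of the training targets.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))

print(f"RMSE          : {rmse:.2f}")
print(f"Baseline RMSE : {baseline_rmse:.2f}")
print(f"R^2           : {r2:.3f}")
```

A model is only useful if its RMSE is clearly below the baseline; for the housing data, multiply the RMSE by $100,000 to read it as a dollar error.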
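7. Saving and loading a model (first step toward deployment)
import joblib
# Save the trained regression model from section 6
joblib.dump(model, 'house_price_model.pkl')
# Load it back and use it
loaded_model = joblib.load('house_price_model.pkl')
print(loaded_model.predict(X_test[:1]))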
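It is worth checking that a reloaded model behaves exactly like the one you saved. The sketch below does a save/load round trip in a temporary directory, using a small Iris classifier and a made-up file name (`iris_tree.joblib`) so that it runs without the housing download.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Save to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_tree.joblib")  # hypothetical file name
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(clf.predict(X), restored.predict(X))
print("Identical predictions:", same)
```

Two practical caveats: joblib files are pickle-based, so only load files you trust, and a model saved with one scikit-learn version is not guaranteed to load correctly in another.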
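8. Learning roadmap (30-day introduction)
| Days | Goal | Tasks |
|---|---|---|
| 1–3 | Python basics | Variables, lists, functions, classes |
| 4–7 | NumPy + Pandas | Loading, cleaning, and summarizing data |
| 8–12 | Scikit-learn | Classification, regression, evaluation |
| 13–18 | Visualization | Matplotlib, Seaborn |
| 19–25 | Kaggle practice | Titanic / House Prices |
| 26–30 | Deployment | A small web app with Flask / Streamlit |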
9. Recommended resources (free)
| Type | Resource |
|---|---|
| Course | Andrew Ng – Machine Learning |
| Book | Python Data Science Handbook |
| Platform | Kaggle (datasets + competitions) |
| Tool | Google Colab (free GPU tier) |
What can you do right now?
Copy the complete code below, save it as ml_start.py, and run it directly:
# ===== Machine learning in 5 minutes: complete code =====
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
# 1. Data
iris = load_iris()
X, y = iris.data[:, [0, 2]], iris.target # only two features, to make plotting easy
# 2. Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 3. Model
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
# 4. Predict and evaluate
pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
# 5. Visualize the decision boundary (pass ax= so everything lands on one figure)
fig, ax = plt.subplots(figsize=(8, 6))
DecisionBoundaryDisplay.from_estimator(model, X, cmap='Pastel1', response_method="predict", ax=ax)
ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap='Set1')
ax.set_xlabel('Sepal length (cm)')
ax.set_ylabel('Petal length (cm)')
ax.set_title('Iris Classification - Decision Tree')
plt.show()
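Once the script runs, the natural next experiment is to classify a flower the model has never seen. The sketch below retrains the same two-feature model and feeds it a hypothetical measurement (sepal length 5.0 cm, petal length 1.5 cm, values I made up for illustration). Because every setosa in the dataset has petals shorter than 2 cm, a sample like this should come out as setosa.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, [0, 2]], iris.target  # sepal length + petal length
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Hypothetical new flower: [sepal length (cm), petal length (cm)].
new_flower = [[5.0, 1.5]]

label = model.predict(new_flower)[0]
species = iris.target_names[label]
proba = model.predict_proba(new_flower)[0]

print("Predicted species:", species)
for name, p in zip(iris.target_names, proba):
    print(f"  {name:<12}{p:.2f}")
```

`predict` expects a 2D input (a list of samples), which is why the single flower is wrapped in an extra pair of brackets.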
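What would you like to learn next?
- [ ] Recognize handwritten digits with a neural network (MNIST)
- [ ] Enter a Kaggle competition (Titanic survival prediction)
- [ ] Deploy a model as a web page (Streamlit)
- [ ] Learn deep learning (PyTorch)
Reply with a number from 1 to 4 and we'll start the hands-on walkthrough right away!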