PostgreSQL DISTINCT 关键字
PostgreSQL DISTINCT 关键字完全指南
DISTINCT是去重的“利刃”,用于消除 重复行,支持 单列、多列、表达式、ON 子句、与聚合函数结合 等高级用法。
本文涵盖 语法、性能优化、索引支持、常见陷阱、DISTINCT ON黑魔法、替代方案 等全部内容。
一、DISTINCT 基本语法
SELECT DISTINCT column1, column2, ...
FROM table_name
[WHERE ...]
[ORDER BY ...];
| 变体 | 说明 |
|---|---|
DISTINCT | 去重 整行 |
DISTINCT ON (expr) | 去重 指定表达式,保留第一行(需 ORDER BY) |
二、核心去重规则
| 场景 | 结果 |
|---|---|
DISTINCT col1 | 去重 col1 列 |
DISTINCT col1, col2 | 去重 (col1, col2) 组合 |
DISTINCT * | 去重整行(等价于 DISTINCT) |
DISTINCT ON (expr) | 按 expr 去重,保留 ORDER BY 的第一行 |
三、准备测试数据
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
customer_id BIGINT,
product TEXT,
amount NUMERIC(10,2),
order_date DATE,
status TEXT
);
-- 插入含重复的数据
INSERT INTO orders (customer_id, product, amount, order_date, status) VALUES
(101, 'iPhone', 1200, '2025-01-15', 'completed'),
(101, 'iPhone', 1200, '2025-01-16', 'pending'),
(102, 'MacBook', 2500, '2025-01-17', 'completed'),
(101, 'AirPods', 250, '2025-01-18', 'completed'),
(103, 'iPhone', 1200, '2025-01-19', 'completed'),
(102, 'iPhone', 1200, '2025-01-20', 'pending'),
(101, 'iPhone', 1200, '2025-01-21', 'completed');
四、基础 DISTINCT 用法
| 需求 | SQL |
|---|---|
| 去重客户 ID | SELECT DISTINCT customer_id FROM orders; |
| 去重产品名 | SELECT DISTINCT product FROM orders; |
| 去重 (客户, 产品) 组合 | SELECT DISTINCT customer_id, product FROM orders; |
-- 去重 (客户, 产品) 组合
SELECT DISTINCT customer_id, product
FROM orders
ORDER BY customer_id, product;
输出:
customer_id | product
-------------+---------
101 | AirPods
101 | iPhone
102 | MacBook
102 | iPhone
103 | iPhone
五、DISTINCT ON:PostgreSQL 独有黑魔法
保留每组的“第一行”(按
ORDER BY排序)
SELECT DISTINCT ON (customer_id)
customer_id, product, amount, order_date
FROM orders
ORDER BY customer_id, order_date DESC;
输出(每个客户保留最新订单):
customer_id | product | amount | order_date
-------------+----------+--------+------------
101 | iPhone | 1200 | 2025-01-21
102 | iPhone | 1200 | 2025-01-20
103 | iPhone | 1200 | 2025-01-19
必备条件:
DISTINCT ON的表达式 必须出现在ORDER BY的最前面- 否则报错!
六、DISTINCT 与聚合函数
-- 每个客户有多少独特产品
SELECT customer_id, COUNT(DISTINCT product) AS unique_products
FROM orders
GROUP BY customer_id;
-- 总销售额(去重订单?不,通常不用)
-- 错误用法:DISTINCT 不适用于金额去重
SELECT SUM(DISTINCT amount) FROM orders; -- 错误逻辑!
七、表达式去重
-- 去重邮箱域名
SELECT DISTINCT substring(email FROM '@(.+)$') AS domain
FROM users;
-- 去重年份
SELECT DISTINCT EXTRACT(YEAR FROM order_date) AS year
FROM orders;
八、性能优化:索引是关键
问题:DISTINCT 需要 排序 + 去重 → 慢!
解决方案:为去重列建索引
-- 单列去重
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- 多列去重
CREATE INDEX idx_orders_cust_prod ON orders(customer_id, product);
-- DISTINCT ON 优化
CREATE INDEX idx_orders_cust_date ON orders(customer_id, order_date DESC);
黄金法则:索引顺序 =
DISTINCT/ORDER BY顺序
九、DISTINCT vs GROUP BY
| 方式 | 性能 | 适用场景 |
|---|---|---|
DISTINCT | 稍慢 | 简单去重 |
GROUP BY | 更快 | 去重 + 聚合 |
-- 两者等价,但 GROUP BY 更快
SELECT DISTINCT customer_id FROM orders;
-- 等价于
SELECT customer_id FROM orders GROUP BY customer_id;
推荐:纯去重用
GROUP BY,性能更好!
十、常见陷阱与解决方案
| 陷阱 | 说明 | 解决方案 |
|---|---|---|
DISTINCT * 性能差 | 全行比较 | 用具体列 |
DISTINCT ON 忘记 ORDER BY | 结果随机 | 必须加 ORDER BY |
DISTINCT 与 LIMIT 无序 | 分页跳跃 | 加 ORDER BY |
NULL 被视为相同 | 多个 NULL → 1 行 | 正常行为 |
DISTINCT 列无索引 | 全表排序 | 建索引 |
-- 错误:结果随机
SELECT DISTINCT ON (customer_id) * FROM orders;
-- 正确
SELECT DISTINCT ON (customer_id) *
FROM orders
ORDER BY customer_id, order_date DESC;
十一、最佳实践 Checklist
| 项目 | 建议 |
|---|---|
纯去重用 GROUP BY | 性能更优 |
DISTINCT ON 必加 ORDER BY | 控制保留行 |
| 建去重索引 | 加速排序 |
避免 DISTINCT * | 指定列 |
| 表达式去重 → 表达式索引 | 性能 |
EXPLAIN 验证计划 | 确保走索引 |
十二、一键客户行为分析
-- 每个客户最新订单 + 独特产品数
WITH latest_orders AS (
SELECT DISTINCT ON (customer_id)
customer_id, product, amount, order_date, status
FROM orders
ORDER BY customer_id, order_date DESC
),
product_counts AS (
SELECT customer_id, COUNT(DISTINCT product) AS unique_products
FROM orders
GROUP BY customer_id
)
SELECT
lo.customer_id,
lo.product AS latest_product,
lo.amount,
lo.order_date,
pc.unique_products
FROM latest_orders lo
JOIN product_counts pc ON lo.customer_id = pc.customer_id;
十三、快速实战:5 分钟掌握 DISTINCT
-- 1. 去重产品名
SELECT DISTINCT product FROM orders;
-- 2. 每个客户最新订单
SELECT DISTINCT ON (customer_id)
customer_id, product, order_date
FROM orders
ORDER BY customer_id, order_date DESC;
-- 3. 去重 + 计数
SELECT customer_id, COUNT(DISTINCT product)
FROM orders
GROUP BY customer_id;
-- 4. 建索引优化
CREATE INDEX idx_orders_cust_date ON orders(customer_id, order_date DESC);
EXPLAIN ANALYZE
SELECT DISTINCT ON (customer_id) * FROM orders
ORDER BY customer_id, order_date DESC;
现在就动手:
在 orders 表中:
- 查询 所有独特 (customer_id, product) 组合
- 查询 每个客户最早的订单(用
DISTINCT ON) - 统计 每个 status 有多少独特 customer_id
- 建索引,验证
EXPLAIN是否走 Index Scan
十四、性能对比(EXPLAIN)
-- 慢:无索引
EXPLAIN ANALYZE SELECT DISTINCT customer_id FROM orders;
-- 快:有索引
CREATE INDEX idx_orders_customer ON orders(customer_id);
EXPLAIN ANALYZE SELECT DISTINCT customer_id FROM orders;
需要我生成:
- 10万条订单数据 + 去重压力测试?回复
测试数据 DISTINCT ON替代ROW_NUMBER()性能对比?回复性能对比- Python 动态 DISTINCT 查询?回复
Python DISTINCT - EXPLAIN 可视化分析?回复
EXPLAIN 分析
随时告诉我!