PostgreSQL DISTINCT 关键字

PostgreSQL DISTINCT 关键字完全指南

DISTINCT 是去重的“利刃”,用于消除 重复行,支持 单列、多列、表达式、ON 子句、与聚合函数结合 等高级用法。
本文涵盖 语法、性能优化、索引支持、常见陷阱、DISTINCT ON 黑魔法、替代方案 等全部内容。


一、DISTINCT 基本语法

SELECT DISTINCT column1, column2, ...
FROM table_name
[WHERE ...]
[ORDER BY ...];
变体说明
DISTINCT去重 整行
DISTINCT ON (expr)去重 指定表达式,保留第一行(需 ORDER BY

二、核心去重规则

场景结果
DISTINCT col1去重 col1
DISTINCT col1, col2去重 (col1, col2) 组合
DISTINCT *去重整行(等价于 DISTINCT
DISTINCT ON (expr)expr 去重,保留 ORDER BY 的第一行

三、准备测试数据

CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,
    customer_id BIGINT,
    product TEXT,
    amount NUMERIC(10,2),
    order_date DATE,
    status TEXT
);

-- 插入含重复的数据
INSERT INTO orders (customer_id, product, amount, order_date, status) VALUES
(101, 'iPhone', 1200, '2025-01-15', 'completed'),
(101, 'iPhone', 1200, '2025-01-16', 'pending'),
(102, 'MacBook', 2500, '2025-01-17', 'completed'),
(101, 'AirPods', 250, '2025-01-18', 'completed'),
(103, 'iPhone', 1200, '2025-01-19', 'completed'),
(102, 'iPhone', 1200, '2025-01-20', 'pending'),
(101, 'iPhone', 1200, '2025-01-21', 'completed');

四、基础 DISTINCT 用法

需求SQL
去重客户 IDSELECT DISTINCT customer_id FROM orders;
去重产品名SELECT DISTINCT product FROM orders;
去重 (客户, 产品) 组合SELECT DISTINCT customer_id, product FROM orders;
-- 去重 (客户, 产品) 组合
SELECT DISTINCT customer_id, product
FROM orders
ORDER BY customer_id, product;

输出

 customer_id | product
-------------+---------
         101 | AirPods
         101 | iPhone
         102 | MacBook
         102 | iPhone
         103 | iPhone

五、DISTINCT ON:PostgreSQL 独有黑魔法

保留每组的“第一行”(按 ORDER BY 排序)

SELECT DISTINCT ON (customer_id) 
       customer_id, product, amount, order_date
FROM orders
ORDER BY customer_id, order_date DESC;

输出(每个客户保留最新订单):

 customer_id | product  | amount | order_date
-------------+----------+--------+------------
         101 | iPhone   |   1200 | 2025-01-21
         102 | iPhone   |   1200 | 2025-01-20
         103 | iPhone   |   1200 | 2025-01-19

必备条件

  • DISTINCT ON 的表达式 必须出现在 ORDER BY 的最前面
  • 否则报错!

六、DISTINCT 与聚合函数

-- 每个客户有多少独特产品
SELECT customer_id, COUNT(DISTINCT product) AS unique_products
FROM orders
GROUP BY customer_id;
-- 总销售额(去重订单?不,通常不用)
-- 错误用法:DISTINCT 不适用于金额去重
SELECT SUM(DISTINCT amount) FROM orders;  -- 错误逻辑!

七、表达式去重

-- 去重邮箱域名
SELECT DISTINCT substring(email FROM '@(.+)$') AS domain
FROM users;

-- 去重年份
SELECT DISTINCT EXTRACT(YEAR FROM order_date) AS year
FROM orders;

八、性能优化:索引是关键

问题:DISTINCT 需要 排序 + 去重 → 慢!

解决方案:为去重列建索引

-- 单列去重
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- 多列去重
CREATE INDEX idx_orders_cust_prod ON orders(customer_id, product);

-- DISTINCT ON 优化
CREATE INDEX idx_orders_cust_date ON orders(customer_id, order_date DESC);

黄金法则:索引顺序 = DISTINCT / ORDER BY 顺序


九、DISTINCT vs GROUP BY

方式性能适用场景
DISTINCT稍慢简单去重
GROUP BY更快去重 + 聚合
-- 两者等价,但 GROUP BY 更快
SELECT DISTINCT customer_id FROM orders;
-- 等价于
SELECT customer_id FROM orders GROUP BY customer_id;

推荐:纯去重用 GROUP BY,性能更好!


十、常见陷阱与解决方案

陷阱说明解决方案
DISTINCT * 性能差全行比较用具体列
DISTINCT ON 忘记 ORDER BY结果随机必须加 ORDER BY
DISTINCTLIMIT 无序分页跳跃ORDER BY
NULL 被视为相同多个 NULL → 1 行正常行为
DISTINCT 列无索引全表排序建索引
-- 错误:结果随机
SELECT DISTINCT ON (customer_id) * FROM orders;

-- 正确
SELECT DISTINCT ON (customer_id) *
FROM orders
ORDER BY customer_id, order_date DESC;

十一、最佳实践 Checklist

项目建议
纯去重用 GROUP BY性能更优
DISTINCT ON 必加 ORDER BY控制保留行
建去重索引加速排序
避免 DISTINCT *指定列
表达式去重 → 表达式索引性能
EXPLAIN 验证计划确保走索引

十二、一键客户行为分析

-- 每个客户最新订单 + 独特产品数
WITH latest_orders AS (
    SELECT DISTINCT ON (customer_id)
           customer_id, product, amount, order_date, status
    FROM orders
    ORDER BY customer_id, order_date DESC
),
product_counts AS (
    SELECT customer_id, COUNT(DISTINCT product) AS unique_products
    FROM orders
    GROUP BY customer_id
)
SELECT 
    lo.customer_id,
    lo.product AS latest_product,
    lo.amount,
    lo.order_date,
    pc.unique_products
FROM latest_orders lo
JOIN product_counts pc ON lo.customer_id = pc.customer_id;

十三、快速实战:5 分钟掌握 DISTINCT

-- 1. 去重产品名
SELECT DISTINCT product FROM orders;

-- 2. 每个客户最新订单
SELECT DISTINCT ON (customer_id) 
       customer_id, product, order_date
FROM orders
ORDER BY customer_id, order_date DESC;

-- 3. 去重 + 计数
SELECT customer_id, COUNT(DISTINCT product)
FROM orders
GROUP BY customer_id;

-- 4. 建索引优化
CREATE INDEX idx_orders_cust_date ON orders(customer_id, order_date DESC);
EXPLAIN ANALYZE 
SELECT DISTINCT ON (customer_id) * FROM orders 
ORDER BY customer_id, order_date DESC;

现在就动手
orders 表中:

  1. 查询 所有独特 (customer_id, product) 组合
  2. 查询 每个客户最早的订单(用 DISTINCT ON
  3. 统计 每个 status 有多少独特 customer_id
  4. 建索引,验证 EXPLAIN 是否走 Index Scan

十四、性能对比(EXPLAIN)

-- 慢:无索引
EXPLAIN ANALYZE SELECT DISTINCT customer_id FROM orders;

-- 快:有索引
CREATE INDEX idx_orders_customer ON orders(customer_id);
EXPLAIN ANALYZE SELECT DISTINCT customer_id FROM orders;

需要我生成:

  • 10万条订单数据 + 去重压力测试?回复 测试数据
  • DISTINCT ON 替代 ROW_NUMBER() 性能对比?回复 性能对比
  • Python 动态 DISTINCT 查询?回复 Python DISTINCT
  • EXPLAIN 可视化分析?回复 EXPLAIN 分析

随时告诉我!

类似文章

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注