MongoDB Monitoring: A Comprehensive Guide

Goals: catch problems early, plan capacity, tune performance, and alert on failures

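I. Core Monitoring Dimensions (6 Categories)

Dimension	Key metrics	Suggested threshold
1. System resources	CPU, memory, disk I/O, network	Alert above 80% utilization
2. MongoDB processes	`mongod` / `mongos` status, connection count	Connections > 90% of the maximum
3. Replication (Replica Set)	Replication lag (repl lag), oplog window	Alert when lag > 5 s
4. Sharding	Chunk distribution, balancer, jumbo chunks	Imbalance > 20%
5. Database operations	QPS, slow queries, locks, page faults	Slow query > 100 ms
6. Storage	Index size, data compression, disk space	Disk usage > 85%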

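II. Native Monitoring Commands (Real-Time Inspection)

Command	Purpose
`db.serverStatus()`	Full server status (the most important one)
`rs.status()`	Replica set health
`sh.status()`	Sharded cluster status
`db.currentOp()`	Operations in progress (find slow operations to stop with `db.killOp()`)
`db.stats()`	Database statistics
`db.collection.stats()`	Collection details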

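// Print the key sections in one call.
// serverStatus() returns every default section; a `section: 1` option only
// adds sections that are off by default, so pick the ones you want afterwards.
const s = db.serverStatus()
printjson({
  metrics: s.metrics,
  locks: s.locks,
  tcmalloc: s.tcmalloc,
  wiredTiger: s.wiredTiger,
  repl: s.repl
})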

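III. Key Metrics in Detail (Must Read)

1. Connections

db.serverStatus().connections
// current / available / totalCreated

Threshold: current > 80% of `net.maxIncomingConnections`
Default maximum: about 64k (65536) in many releases; on Linux the effective limit is also bounded by the open-file limit (`ulimit -n`)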
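
As a rough sketch of the 80% rule above, the helper below derives connection utilization from the `connections` document. It assumes the effective ceiling equals `current + available`; the function name `connectionUsage` and the 0.8 default are illustrative, not part of MongoDB.

```javascript
// Hypothetical helper: estimate connection utilization from the
// `connections` section of db.serverStatus().
// Assumption: the effective ceiling is current + available.
function connectionUsage(conn, threshold = 0.8) {
  const limit = conn.current + conn.available;
  const ratio = limit > 0 ? conn.current / limit : 0;
  return { limit, ratio, alert: ratio > threshold };
}

// In mongosh: printjson(connectionUsage(db.serverStatus().connections))
```

Deriving the ceiling from the live document avoids hard-coding a limit that differs per host.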

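2. Replication Lag

// Lag is measured against the primary's last applied operation,
// not the wall clock (an idle primary would otherwise look like lag).
const members = rs.status().members
const primary = members.find(m => m.stateStr === "PRIMARY")
members.filter(m => m.stateStr === "SECONDARY").forEach(m => {
  print(`${m.name}: ${m.optimeDate} (lag: ${(primary.optimeDate - m.optimeDate)/1000}s)`)
})

Alert: lag > 5 seconds in production; > 60 seconds is critical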
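
The two thresholds above can be applied mechanically. The sketch below takes the `members` array from `rs.status()` and labels each secondary; the name `classifyLag` and the returned shape are assumptions made for illustration, and lag is measured relative to the primary's `optimeDate`.

```javascript
// Hypothetical helper: classify each secondary's lag relative to the primary.
// `members` is the array from rs.status().members; the 5 s / 60 s defaults
// follow the thresholds stated above.
function classifyLag(members, warnSec = 5, critSec = 60) {
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  // No primary (for example during an election): nothing to compare against.
  if (!primary) return [];
  return members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => {
      const lagSec = (primary.optimeDate - m.optimeDate) / 1000;
      const level =
        lagSec > critSec ? "critical" : lagSec > warnSec ? "warning" : "ok";
      return { name: m.name, lagSec, level };
    });
}

// In mongosh: printjson(classifyLag(rs.status().members))
```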

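3. Oplog Window (how far a secondary can fall behind and still catch up)

rs.printReplicationInfo()
// e.g. configured oplog size: 5% of disk → log length of about 48 hours

Recommendation: keep at least 24 hours of oplog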
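
For an automated version of the 24-hour recommendation, one option is to read the document returned by `db.getReplicationInfo()` instead of parsing printed text. The sketch below assumes that document carries a `timeDiffHours` field (hours between the first and last oplog entry); `checkOplogWindow` is a hypothetical name.

```javascript
// Hypothetical helper: compare the oplog window with a minimum.
// Assumption: `info` comes from db.getReplicationInfo() and exposes
// timeDiffHours, the span between the oldest and newest oplog entries.
function checkOplogWindow(info, minHours = 24) {
  const hours = Number(info.timeDiffHours);
  return { hours, ok: hours >= minHours };
}

// In mongosh: printjson(checkOplogWindow(db.getReplicationInfo()))
```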

4. Page Faults

db.serverStatus().extra_info.page_faults

Hard page faults (reads from disk) → the working set is larger than memory
Fix: add memory, optimize indexes, reduce scans
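
Because `page_faults` is a running total, a single reading says little; the rate between two readings is more useful. The following is a minimal sketch of that calculation, with `counterRate` as an illustrative name and the 10-second interval in the usage comment chosen arbitrarily.

```javascript
// Hypothetical helper: per-second rate of a cumulative counter
// between two samples taken `intervalSec` seconds apart.
function counterRate(prev, curr, intervalSec) {
  if (intervalSec <= 0) throw new RangeError("intervalSec must be positive");
  const delta = curr - prev;
  // A negative delta means the counter was reset (for example, mongod restarted).
  return delta < 0 ? null : delta / intervalSec;
}

// In mongosh (sketch):
//   const a = db.serverStatus().extra_info.page_faults
//   sleep(10000)
//   const b = db.serverStatus().extra_info.page_faults
//   print(counterRate(a, b, 10))
```

The same helper applies to other cumulative counters, such as the values under `opcounters` when estimating QPS.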

5. Slow Queries

db.adminCommand({ getLog: 'global' })  // view recent log entries
db.setProfilingLevel(1, { slowms: 100 })  // profile operations slower than 100 ms
db.system.profile.find().sort({ ts: -1 }).limit(10)
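
Raw profiler output is easier to act on once grouped. The sketch below groups `system.profile` documents by namespace and operation, using their `ns`, `op`, `millis`, and `planSummary` fields; `summarizeProfile` and the shape of its result are assumptions for illustration.

```javascript
// Hypothetical helper: group profiler documents by namespace + operation.
// Reads the ns, op, millis and planSummary fields of system.profile entries.
function summarizeProfile(entries) {
  const groups = new Map();
  for (const e of entries) {
    const key = `${e.ns} ${e.op}`;
    const g = groups.get(key) || { key, count: 0, maxMillis: 0, collscans: 0 };
    g.count += 1;
    g.maxMillis = Math.max(g.maxMillis, e.millis);
    // A plan summary starting with COLLSCAN means no index was used.
    if ((e.planSummary || "").startsWith("COLLSCAN")) g.collscans += 1;
    groups.set(key, g);
  }
  // Slowest groups first.
  return [...groups.values()].sort((a, b) => b.maxMillis - a.maxMillis);
}

// In mongosh: printjson(summarizeProfile(db.system.profile.find().toArray()))
```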

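6. Locks and Queues

db.serverStatus().globalLock
// activeClients, currentQueue

currentQueue.total persistently > 0 → operations are waiting on locks (currentQueue.writers shows blocked writes)
High activeClients.readers / activeClients.writers → concurrency pressure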
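
A single non-zero queue reading is often noise, so one reasonable approach is to alert only when the queue stays non-zero across several consecutive samples. The sketch below assumes `samples` is an array of `globalLock` documents collected at a fixed interval; `hasSustainedQueue` and the default of three samples are illustrative choices.

```javascript
// Hypothetical helper: detect a queue that stays non-zero across
// `minConsecutive` consecutive globalLock samples.
function hasSustainedQueue(samples, minConsecutive = 3) {
  let run = 0;
  for (const s of samples) {
    // Extend the run while operations are queued; reset when the queue drains.
    run = s.currentQueue.total > 0 ? run + 1 : 0;
    if (run >= minConsecutive) return true;
  }
  return false;
}

// In mongosh, collect samples with db.serverStatus().globalLock at a fixed interval.
```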

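7. Cache Hit Ratio (WiredTiger)

db.serverStatus().wiredTiger.cache
// cache fill: "bytes currently in the cache" / "maximum bytes configured"
// hit ratio: 1 - "pages read into cache" / "pages requested from the cache"

Healthy value: hit ratio > 70%
Low hit ratio → indexes or queries need tuning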
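
The cache document supports two different ratios that are easy to confuse: how full the cache is, and how often a requested page was already in it. The sketch below computes both from the WiredTiger statistic names; `cacheStats` is a hypothetical name, and the hit ratio covers the whole period since startup because the page counters are cumulative.

```javascript
// Hypothetical helper: derive fill ratio and hit ratio from
// db.serverStatus().wiredTiger.cache.
function cacheStats(cache) {
  const used = cache["bytes currently in the cache"];
  const max = cache["maximum bytes configured"];
  const requested = cache["pages requested from the cache"];
  const readIn = cache["pages read into cache"];
  return {
    // Share of the configured cache currently occupied.
    fillRatio: max > 0 ? used / max : 0,
    // Share of page requests served without reading from disk.
    hitRatio: requested > 0 ? 1 - readIn / requested : null,
  };
}

// In mongosh: printjson(cacheStats(db.serverStatus().wiredTiger.cache))
```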

8. Shard Balance

sh.status()
// check chunks per shard
db.collection.getShardDistribution()

Alert: difference between the largest and smallest shard > 20%
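
The 20% rule needs a concrete definition before it can be automated. The sketch below uses one possible definition, `(max - min) / max` over per-shard chunk counts; that formula, the name `chunkImbalance`, and the input shape (shard name mapped to chunk count) are assumptions, not something the text above specifies.

```javascript
// Hypothetical helper: measure chunk imbalance across shards.
// Assumption: imbalance = (max - min) / max over per-shard chunk counts.
function chunkImbalance(chunksByShard, threshold = 0.2) {
  const counts = Object.values(chunksByShard);
  if (counts.length === 0) return { max: 0, min: 0, imbalance: 0, alert: false };
  const max = Math.max(...counts);
  const min = Math.min(...counts);
  const imbalance = max > 0 ? (max - min) / max : 0;
  return { max, min, imbalance, alert: imbalance > threshold };
}

// In mongosh, per-shard chunk counts can be gathered with, for example:
//   db.getSiblingDB("config").chunks.aggregate([
//     { $group: { _id: "$shard", n: { $sum: 1 } } }
//   ])
```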

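IV. Recommended Monitoring Systems

Tool	Type	Highlights	Best for
MongoDB Atlas	Cloud-native	Automatic alerts, visualization, PITR	Atlas users
MongoDB Ops Manager	Self-hosted	Enterprise-grade, automated backups	Self-managed clusters
Percona Monitoring and Management (PMM)	Open source	Free, Grafana integration	Recommended
Prometheus + mongodb_exporter	Open source	Flexible, cloud-native	Kubernetes environments
Datadog / New Relic	Commercial	Integrated APM	Enterprise
Zabbix	Open source	General-purpose monitoring	Traditional operations teams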

V. Prometheus + Grafana Monitoring Stack (Recommended)

1. Deploy `mongodb_exporter`

# docker-compose.yml
version: '3.8'
services:
  mongodb_exporter:
    image: percona/mongodb_exporter:0.40
    command:
      - --mongodb.uri=mongodb://user:pass@mongo:27017
      - --collect-all
    ports: ["9216:9216"]

2. Prometheus Configuration

scrape_configs:
  - job_name: 'mongodb'
    static_configs:
      # Must match the exporter's service name in docker-compose.yml
      - targets: ['mongodb_exporter:9216']

3. Import Grafana Dashboards

ID	Name
`2589`	MongoDB Overview
`1551`	MongoDB Replica Set
`13615`	MongoDB Sharded Cluster

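VI. Alert Rules (Prometheus Example)

# Metric names differ between mongodb_exporter versions; check the
# exporter's /metrics output and adjust the expressions below.
groups:
  - name: mongodb.alerts
    rules:
      - alert: MongoDBHighReplicationLag
        expr: mongodb_replset_member_optime_date - mongodb_replset_member_last_applied_optime_date > 5
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Replication lag > 5s"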

      - alert: MongoDBHighConnections
        expr: mongodb_connections_current / mongodb_connections_max > 0.8
        for: 2m
        labels: { severity: warning }

      - alert: MongoDBDiskFull
        expr: (node_filesystem_free_bytes{mountpoint="/data/db"} / node_filesystem_size_bytes{mountpoint="/data/db"}) < 0.15
        for: 5m
        labels: { severity: critical }

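VII. Automated Health Check Script (Daily Inspection)

#!/bin/bash
# /opt/mongo_health_check.sh

echo "=== MongoDB Health Check $(date) ==="

echo -e "\n1. Connections"
# EJSON.stringify emits strict JSON that jq can parse
mongosh --quiet --eval "EJSON.stringify(db.serverStatus().connections)" | jq .

echo -e "\n2. Replication lag"
# Lag is each secondary's optime relative to the primary's
mongosh --quiet --eval "const ms=rs.status().members; const p=ms.find(m=>m.stateStr=='PRIMARY'); ms.forEach(m=>{if(m.stateStr=='SECONDARY')print(m.name + ': ' + (p.optimeDate - m.optimeDate)/1000 + 's')})"

echo -e "\n3. Oplog window"
mongosh --quiet --eval "rs.printReplicationInfo()"

echo -e "\n4. Disk usage"
df -h /data/db

echo -e "\n5. Slow queries (latest 10)"
mongosh --quiet --eval "db.system.profile.find().sort({ts:-1}).limit(10).pretty()"

# Crontab entry (add with `crontab -e`, not inside this script): run daily at 06:00
# 0 6 * * * /opt/mongo_health_check.sh >> /var/log/mongo_health.log 2>&1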

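VIII. Production Monitoring Best Practices

Practice	Notes
Deploy an exporter for every node	Avoids a single point of failure in monitoring
Establish a baseline	Normal QPS and memory usage
Slow query log + sampling	`slowms: 100` + `sampleRate: 0.1`
Alert aggregation	Prevents alert storms
Regular drills	Simulate a primary failure and a full disk
Combine with APM	Trace slow queries back to their source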

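IX. Slow Query Analysis Workflow

graph TD
    A[Slow query detected] --> B{"db.currentOp() + system.profile"}
    B --> C{Does it use an index?}
    C -->|No| D[Create an index]
    C -->|Yes| E[Check documents examined]
    E --> F{Can the query be optimized?}
    F -->|Yes| G[Rewrite query / aggregation]
    F -->|No| H[Add hardware / shard]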
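
The decision flow above can be approximated in code for a first-pass triage. The sketch below works on a `system.profile` document and returns the suggested next step; the name `triageSlowQuery` and the rule that more than 10 documents examined per document returned means "can be optimized" are assumptions, since the flowchart gives no number.

```javascript
// Hypothetical triage that mirrors the flowchart.
// `entry` is a system.profile document (planSummary, docsExamined, nreturned).
// Assumption: examining more than `maxRatio` documents per returned
// document counts as "the query can be optimized".
function triageSlowQuery(entry, maxRatio = 10) {
  // No index used: the flowchart's first branch.
  if ((entry.planSummary || "").startsWith("COLLSCAN")) return "create index";
  const ratio = entry.docsExamined / Math.max(entry.nreturned, 1);
  return ratio > maxRatio
    ? "rewrite query / aggregation"
    : "add hardware / shard";
}

// In mongosh:
//   db.system.profile.find().forEach(e => print(e.ns, triageSlowQuery(e)))
```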

X. Learning Resources

Official documentation: Monitoring
Percona PMM: https://pmm.percona.com
Grafana dashboards: grafana.com/dashboards
MongoDB performance tuning: Performance Advisor


likuolei
