C++ Standard Library: <regex>

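This article is a systematic walk through the <regex> header introduced in C++11 and its regular expression support: the pattern grammars, the main types, the common functions, the flag options, performance characteristics, and practical advice.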

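I. Overview

<regex> provides a regular expression engine that supports six grammars: ECMAScript (the default), basic POSIX, extended POSIX, awk, grep, and egrep.
The main class templates are:
- std::basic_regex<CharT,Traits> (common alias std::regex): a compiled pattern object
- std::match_results<Iterator> (aliases std::smatch, std::cmatch): a container of match results
- std::regex_iterator<It> / std::regex_token_iterator<It>: iterate over successive matches, over selected sub-matches, or over the pieces between matches
The core functions are:
- std::regex_match: tests whether the entire target sequence matches
- std::regex_search: finds a match anywhere in the target sequence
- std::regex_replace: replaces the subsequences that match the pattern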

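II. Pattern grammars (`std::regex_constants::syntax_option_type`)

Grammar	Description
`ECMAScript` (default)	A modified ECMAScript (JavaScript-like) grammar
`basic` / `extended`	POSIX basic (BRE) and extended (ERE) grammars; their escaping rules differ from ECMAScript
`awk`	The grammar of the POSIX awk utility
`grep` / `egrep`	The grammars of POSIX grep and grep -E: `basic` / `extended` with newline as an additional alternation separator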

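Common metacharacters (ECMAScript):

Metacharacter	Meaning
`.`	Any single character except a line terminator
`^` / `$`	Start / end of the target sequence; whether they also match at line boundaries depends on the `multiline` option (C++17) and on the implementation
`[...]`	Character class, e.g. `[a-z0-9_]`
`[^...]`	Negated character class
`\d`/`\w`/`\s`	Digit / word character / whitespace
`*`/`+`/`?`	Repeat the preceding item: zero or more, one or more, zero or one
`{n}`, `{n,}`, `{n,m}`	Exactly n / at least n / between n and m repetitions
`()`	Capturing group
`(?:...)`	Non-capturing group
`|`	Alternation
`(?=...)` / `(?!...)`	Positive / negative lookahead
`(?<=...)` / `(?<!...)`	Positive / negative lookbehind; not supported by std::regex (a pattern that uses them fails to compile)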
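The difference between lookahead and lookbehind support is easy to check directly. The following is a minimal sketch, assuming one of the mainstream standard library implementations; the helper names `compiles` and `key_before_colon` are hypothetical and exist only for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if `pattern` compiles under the
// default ECMAScript grammar, false if the constructor throws.
bool compiles(const std::string& pattern) {
    try {
        std::regex re(pattern);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}

// Hypothetical helper: uses a positive lookahead to find a word that is
// immediately followed by a colon. The colon is checked but not consumed,
// so it is not part of the returned match.
std::string key_before_colon(const std::string& text) {
    static const std::regex re(R"(\w+(?=:))");
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}
```

Under these assumptions, `compiles` accepts a lookahead pattern and rejects a lookbehind pattern such as `(?<=a)b`, and `key_before_colon("name: value")` yields `name`.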

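III. Main types

1. `std::regex`

std::regex::regex(const std::string& pattern,
                  std::regex_constants::syntax_option_type flags = std::regex_constants::ECMAScript);

Compilation: the constructor parses and compiles the pattern; a syntax error throws std::regex_error.
Flag options (combinable with bitwise OR):
- icase: ignore case
- nosubs: do not store sub-expression matches (less bookkeeping and memory)
- optimize: favor matching speed, at the cost of more construction time
- collate: make character ranges such as [a-z] locale-sensitive
- ECMAScript, basic, extended, awk, grep, egrep: select the grammar (at most one of these)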
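As an illustration of combining these flags, here is a small sketch. The function names `is_log_level` and `is_case_insensitive` and the set of level names are assumptions made for the example, not anything prescribed by the library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: whole-string, case-insensitive check against a
// fixed set of level names. nosubs is safe here because no capture
// groups are ever read; optimize is only a hint to the implementation.
bool is_log_level(const std::string& s) {
    static const std::regex re("error|warn|info",
                               std::regex::ECMAScript | std::regex::icase |
                               std::regex::nosubs | std::regex::optimize);
    return std::regex_match(s, re);
}

// Hypothetical helper: the flags a pattern was compiled with can be
// read back through flags().
bool is_case_insensitive(const std::regex& re) {
    return (re.flags() & std::regex::icase) == std::regex::icase;
}
```

With this sketch, `is_log_level("ERROR")` and `is_log_level("Warn")` succeed, while `is_log_level("errors")` fails because regex_match requires the whole string to match.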

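2. `std::match_results<It>`

Stores all sub-expression results of a single match. A match_results holds iterators into the target sequence, so the target must outlive it.
Key members:
- size(): the number of sub-matches, counting group 0 (the whole match), so the number of capture groups plus one; 0 if the match failed
- operator[](i): the i-th sub-match, of type sub_match<It>
- prefix() / suffix(): the parts of the target sequence before and after the match

Common aliases:

using smatch = std::match_results<std::string::const_iterator>;
using cmatch = std::match_results<const char*>;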
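The members listed above can be seen together in one small sketch. The `KeyValue` struct, the `find_key_value` function, and the `key=number` input format are all hypothetical, chosen only to give each member something to report.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type: copies everything out of the smatch so the
// caller does not depend on the lifetime of the searched string.
struct KeyValue {
    bool found = false;
    std::size_t groups = 0;  // m.size(): whole match plus capture groups
    std::string key, value;  // m[1], m[2]
    std::string before, after;  // m.prefix(), m.suffix()
};

// Hypothetical helper: finds the first "key=number" pair in `text`.
KeyValue find_key_value(const std::string& text) {
    static const std::regex re(R"((\w+)=(\d+))");
    KeyValue kv;
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        kv.found  = true;
        kv.groups = m.size();
        kv.key    = m[1].str();
        kv.value  = m[2].str();
        kv.before = m.prefix().str();
        kv.after  = m.suffix().str();
    }
    return kv;
}
```

For the input `set width=42 now`, this reports three sub-matches, the key `width`, the value `42`, the prefix `set ` and the suffix ` now`.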

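3. `std::regex_iterator<It>` and `std::regex_token_iterator<It>`

regex_iterator: iterates over every successive match of the pattern in the text; each element is a full match_results, including sub-groups.
regex_token_iterator:
- Used to extract specific sub-groups or to split text.
- The constructor takes the index (or a list of indices) of the sub-groups to extract; -1 selects the unmatched stretches between matches, which is how splitting works.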
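Both iterators can be sketched in a few lines. The helper names `find_all` and `split` are hypothetical. Note that the iterator constructors take the regex by reference and reject temporaries, so the regex must be a named object that outlives the iteration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every whole match of `re` in `text`
// using sregex_iterator. A default-constructed iterator marks the end.
std::vector<std::string> find_all(const std::string& text, const std::regex& re) {
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}

// Hypothetical helper: splits `text` on `delim` by requesting sub-match
// -1, i.e. the stretches of text between delimiter matches.
std::vector<std::string> split(const std::string& text, const std::regex& delim) {
    std::sregex_token_iterator first(text.begin(), text.end(), delim, -1), last;
    return std::vector<std::string>(first, last);
}
```

For example, `find_all` with the pattern `\d+` on `a1 b22 c333` collects `1`, `22`, `333`, and `split` with the delimiter `,\s*` on `a, b,c` produces `a`, `b`, `c`.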

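IV. Common functions

1. `std::regex_match`

bool regex_match(BidirIt first, BidirIt last,
                 std::match_results<BidirIt>& m,
                 const std::regex& re,
                 regex_constants::match_flag_type flags = regex_constants::match_default);

Purpose: tests whether the entire range [first,last) matches the pattern.
On success, if m is supplied, m[i] gives the i-th sub-group.

// s is a std::string lvalue; m keeps iterators into it
std::smatch m;
if (std::regex_match(s, m, std::regex(R"((\d{4})-(\d{2})-(\d{2}))"))) {
    // m[0] is the whole match; m[1], m[2], m[3] are year, month, day
}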
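The contrast between matching the whole input and searching inside it is worth seeing side by side. This is a minimal sketch; the names `kDigits`, `whole_is_digits`, and `contains_digits` are hypothetical.

```cpp
#include <regex>
#include <string>

// Shared pattern, compiled once: one or more digits.
static const std::regex kDigits(R"(\d+)");

// True only if the entire string consists of digits.
bool whole_is_digits(const std::string& s) {
    return std::regex_match(s, kDigits);
}

// True if digits occur anywhere in the string.
bool contains_digits(const std::string& s) {
    return std::regex_search(s, kDigits);
}
```

With the same pattern, `year 2025` fails `whole_is_digits` but passes `contains_digits`, which is why input validation usually wants regex_match rather than regex_search.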

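2. `std::regex_search`

bool regex_search(BidirIt first, BidirIt last,
                  std::match_results<BidirIt>& m,
                  const std::regex& re,
                  regex_constants::match_flag_type flags = regex_constants::match_default);

Purpose: finds the first subsequence in the range that matches the pattern.
To find later matches, call it in a loop and continue from m.suffix().first. If the pattern can match the empty string, such a loop does not advance; std::sregex_iterator handles that case.

// s is the std::string to scan
std::regex word_re(R"(\w+)");
auto it = s.cbegin(), end = s.cend();
std::smatch m;
while (std::regex_search(it, end, m, word_re)) {
    std::cout << m.str() << "\n";   // the current word
    it = m.suffix().first;          // resume right after this match
}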

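3. `std::regex_replace`

std::string regex_replace(const std::string& s,
                          const std::regex& re,
                          const std::string& fmt,
                          regex_constants::match_flag_type flags = regex_constants::match_default);

Purpose: returns a new string in which every match (or only the first, with format_first_only) is replaced by the expanded format string.
In the format string, $& stands for the whole match, $n for the n-th capture group, $` and $' for the text before and after the match, and $$ for a literal $.

std::string s = "2025-07-12";
std::string out = std::regex_replace(s,
    std::regex(R"((\d{4})-(\d{2})-(\d{2}))"),
    "$3/$2/$1");   // reorder the three capture groups
// out == "12/07/2025"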

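V. Match and format flags (`match_flag_type`)

Flag	Meaning
`match_default`	Default behavior
`match_not_bol`	The first position of the range is not treated as the beginning of a line (`^` does not match there)
`match_not_eol`	The last position of the range is not treated as the end of a line (`$` does not match there)
`match_not_bow`	`\b` does not match at the first position of the range
`match_not_eow`	`\b` does not match at the last position of the range
`match_any`	If more than one match is possible, any of them is acceptable
`format_default`	Default (ECMAScript) rules for format strings
`format_no_copy`	In regex_replace, do not copy the unmatched text to the output
`format_first_only`	In regex_replace, replace only the first match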
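A few of these flags are easiest to understand from their effect on the result. The sketch below uses hypothetical helper names (`mask_first`, `digits_only`, `starts_with_a`) and sample patterns chosen for illustration.

```cpp
#include <regex>
#include <string>

// format_first_only: only the first run of digits is replaced.
std::string mask_first(const std::string& s) {
    static const std::regex digits(R"(\d+)");
    return std::regex_replace(s, digits, "#",
                              std::regex_constants::format_first_only);
}

// format_no_copy: "$&" re-emits each match, and the text between
// matches is dropped instead of being copied to the output.
std::string digits_only(const std::string& s) {
    static const std::regex digits(R"(\d+)");
    return std::regex_replace(s, digits, "$&",
                              std::regex_constants::format_no_copy);
}

// match_not_bol: when the range is known not to start a line, '^' must
// not match at its first position.
bool starts_with_a(const std::string& s, bool at_line_start) {
    static const std::regex re("^a");
    const auto flags = at_line_start ? std::regex_constants::match_default
                                     : std::regex_constants::match_not_bol;
    return std::regex_search(s, re, flags);
}
```

For the input `a1b22c333`, `mask_first` changes only the first number and `digits_only` keeps nothing but the digits; `starts_with_a("abc", false)` fails because `^` is no longer allowed to match at the start of the range.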

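VI. Performance and implementation

Compilation cost: constructing (compiling) a std::regex is expensive, so store frequently used patterns statically or otherwise reuse them.
Matching cost: the standard does not mandate a matching algorithm. Common implementations rely on backtracking, which is usually fast for simple patterns but can degrade to exponential time on patterns that backtrack heavily (nested or overlapping quantifiers, for example).
Optimizations:
- nosubs reduces the cost of storing sub-groups;
- optimize asks the implementation to favor matching speed; how much it helps depends on the implementation;
- simple patterns (no alternation, no backtracking traps) perform best.
Alternatives: for performance-sensitive code, consider a third-party library (such as RE2, Boost.Regex, or Hyperscan) or a hand-written finite state machine.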

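VII. Practical advice

Reuse patterns
- Declare std::regex objects as static const, or keep them as globals or members, to avoid recompiling them repeatedly.
Limit backtracking
- Avoid "evil" patterns that combine .* with a mandatory match later in the pattern, and nested quantifiers such as (a+)+. Prefer an explicit character class such as [^,]* over .*; a lazy quantifier (.*?) changes which match is found but does not by itself prevent backtracking.
Split the work
- For complex parsing, first split the input coarsely (regex_token_iterator), then refine each piece with a smaller pattern, rather than matching everything at once.
Capturing vs. non-capturing
- If you do not need to extract sub-groups, use (?:…) or pass nosubs to save memory and time.
Boundaries and multiple lines
- Rely on ^ and $ only for the start and end of the whole target sequence unless multiline matching is explicitly enabled. To match at line boundaries, set the multiline syntax option (C++17, ECMAScript grammar only; support varies between implementations) or split the input into lines yourself. match_not_bol / match_not_eol do not enable multiline mode; they stop ^ / $ from matching at the ends of the range.
Replacement efficiency
- regex_replace returns a new string on every call. For repeated replacements over large text, consider std::regex_iterator with manual concatenation (into an ostringstream or a std::string), or process the text in chunks.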
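The manual-concatenation approach from the last item, combined with a reused static pattern, might look like the sketch below. It appends to a std::string rather than an ostringstream, and computes each replacement in code, which a plain format string cannot do. The function name `double_numbers` and the doubling rule are hypothetical, and the sketch assumes the numbers fit in an unsigned long long.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: doubles every non-negative integer in `text`
// in a single pass over the input.
std::string double_numbers(const std::string& text) {
    static const std::regex number(R"(\d+)");  // compiled once, reused on every call
    std::string out;
    out.reserve(text.size());
    auto tail = text.begin();  // start of the text not yet copied to `out`
    for (std::sregex_iterator it(text.begin(), text.end(), number), end; it != end; ++it) {
        const std::smatch& m = *it;
        out.append(tail, m[0].first);                     // unmatched text before this match
        out += std::to_string(std::stoull(m.str()) * 2);  // computed replacement
        tail = m[0].second;                               // continue after this match
    }
    out.append(tail, text.end());  // remainder after the last match
    return out;
}
```

Called on `3 apples, 10 pears`, this returns `6 apples, 20 pears`, and input without any numbers comes back unchanged.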

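With the grammars, main types, core functions, and performance characteristics of <regex> laid out above, together with the practical advice, you should be able to use the standard library's regex facilities efficiently and robustly for text parsing, log processing, and format validation. Happy coding!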

likuolei
