正则表达式

写在前面

写这篇文章的初衷是解决一些简单的字符串模拟题目，对于特定的某些题目，有时候用C++11的正则表达式会方便很多。

本文主要总结一下常见的一些正则表达式写法，以及如何使用C++11的regex库，最后以几个具体题目举例。

总的来说，正则表达式最简单的应用是判断一个字符串中是否包含特定字符串。正则表达式是一种文本模式，由普通字符和元字符组成。

常用的元字符

. 匹配除 \n 之外的任何单个字符。
^ 匹配输入字符串的开始位置，不匹配任何字符，要匹配 ^ 字符本身，需使用 \^ ；同样的，$ 匹配输入字符串的结束位置。
[xyz] 字符集，匹配其中包含的任一字符。
| 两个匹配条件逻辑或。
\w 匹配字母或数字或下划线；\W 匹配任意不是字母、数字、下划线的字符。
\d 匹配任意一个数字；\D 匹配任意非数字字符。
\s 匹配任意的空白符，包括空格、制表符、换页符等空白字符的其中任意一个，与[ \f\n\r\t\v] 等效；\S 匹配任意不是空白符的字符。
\b 匹配一个单词边界；\B 匹配非单词边界。
* 0次或多次匹配前面的字符或子表达式；+ 1次或多次匹配前面的字符或子表达式；? 0次或1次匹配前面的字符或子表达式。
{n} 正好匹配n次，{n,} 至少匹配n次，{n,m} 匹配n到m次。

举一些例子：

him|her 匹配 him 和 her，也可写作 h(im|er)

\bthe\b 匹配 in the war 中的 the，但不匹配 other 中的 the

分组、捕获、反向引用

( ) 可以将 ( 和 ) 之间的表达式定义为组，并且将匹配这个表达式的字符保存到一个临时区域。匹配后的各组按照左括号出现的顺序分别存到 $1,$2,$3... 中。

比如“2018-04-25”，我们用 (\d{4})-(\d\d)-(\d\d) 去匹配，

那么

$1 = “2018”
$2 = “04”
$3 = “25”

如果想让某个括号里的内容不被捕获到，需要用到非捕获性分组

比如 h(im|er) ，就要改为 h(?:im|er)

然后一个很重要的问题来了，如果想要匹配 “看了看”、“研究研究” 这样的重叠结构怎么办呢？

这时候需要用到反向引用，用 \1,\2,\3... 表示

上面的例子可以用 (..?)了?\1去匹配。

语法就先总结到这里，实际还有很多，估计是用不到了。

C++11 regex

http://www.cplusplus.com/reference/regex/

最常用的两个类

regex Regex (class )

smatch match_results for string objects (class )

以及三个函数

regex_match、regex_search、regex_replace

要注意的一点就是在C++ 中 \ 需要转义，即 \d 在C++中要写成\\d，诸如此类。

下面用具体的代码介绍这三个函数

regex_match 判断一个正则表达式是否能匹配整个字符串

#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main ()
{
    if (regex_match ("subject", regex("(sub)(.*)") ))
        cout << "string literal matched\n";
    string s ("subject");
    regex e ("(sub)(.*)");
    if (regex_match (s,e))
        cout << "string object matched\n";
    if (regex_match ( s.begin(), s.end(), e ) )
        cout << "range matched\n";
    smatch sm; // same as std::match_results<string::const_iterator> sm;
    regex_match (s,sm,e);
    cout << "string object with " << sm.size() << " matches\n";
    regex_match ( s.cbegin(), s.cend(), sm, e);
    cout << "range with " << sm.size() << " matches\n";
    cout << "the matches were: ";
    for (unsigned i=0; i<sm.size(); ++i)
        cout << "[" << sm[i] << "] ";
    cout << endl;
    return 0;
} 
/* 
string literal matched 
string object matched
range matched 
string object with 3 matches
range with 3 matches
the matches were: [subject] [sub] [ject] 
*/

regex_search 来查找第一个能匹配正则表达式的子串

#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main ()
{
    string s ("this subject has a submarine as a subsequence");
    smatch m;
    regex e ("\\b(sub)([^ ]*)");   // matches words beginning by "sub"

    cout << "Target sequence: " << s << std::endl;
    cout << "Regular expression: /\\b(sub)([^ ]*)/" << std::endl;
    cout << "The following matches and submatches were found:" << std::endl;

    while (regex_search (s,m,e))
    {
        for (auto x:m) cout << x << " ";
        cout << endl;
        s = m.suffix().str();
    }

    return 0;
}
/*
Target sequence: this subject has a submarine as a subsequence
Regular expression: /\b(sub)([^ ]*)/
The following matches and submatches were found:
subject sub ject
submarine sub marine
subsequence sub sequence
*/

regex_replace 匹配并替换

// regex_replace example
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
using namespace std;

int main ()
{
    string s ("there is a subsequence in the string\n");
    regex e ("\\b(sub)([^ ]*)");   // matches words beginning by "sub"

    // using string/c-string (3) version:
    cout << regex_replace (s,e,"sub-$2");

    // using range/c-string (6) version:
    string result;
    regex_replace (back_inserter(result), s.begin(), s.end(), e, "$2");
    cout << result;

    // with flags:
    cout << regex_replace (s,e,"$1 and $2",regex_constants::format_no_copy);
    cout << endl;

    return 0;
}
/*
there is a sub-sequence in the string
there is a sequence in the string
sub and sequence
*/

题目

基本的工具基本介绍完了，可以来切一些水题了！

Two strings

from 2017 Multi-University Training Contest

给两个字符串A、B，其中A只包含大小写字母，B只包含大小写字母和两个特殊符号 . 、\。

. 可以匹配任意字母，*表示前一个字符可以出现任意次。保证 * 不会在字符串开头，不会有两个连续的 *。问 A 与 B 能否匹配。

首先，这里的 .* 和正则表达式里介绍的概念不一样。因为 * 使用的条件是前一个字符确定。即.* 不能匹配 ab，但是可以匹配 aa 或 bbb 这样。

这里用反向引用就好了，先将.\* 换成 (\w)\1* 即可。

AC代码只有这么短：

#include<bits/stdc++.h>
using namespace std;
string str1,str2;
int main()
{
    int T;
    scanf("%d",&T);
    while(T--)
    {
        cin>>str1>>str2;
        regex reg_rep("\\.\\*");
        str2=regex_replace(str2,reg_rep,"(\\w)\\1*");
        regex reg(str2);
        if(regex_match(str1,reg)) puts("yes");
        else puts("no");
    }
}

Abbreviation

from 2016-2017 ACM-ICPC Northeastern European Regional Contest (NEERC 16)

题意：大概是给一段包含大小写字母和逗号句号的文本，要求缩写连续的大写字母开头的单词。当时这个模拟写的十分难受，结果题解上写了一句有队伍用正则表达式很快就A了这个题目。

样例输入：

This is ACM North Eastern European Regional Contest,
sponsored by International Business Machines.
The. Best. Contest. Ever.
A Great Opportunity for all contestants.

样例输出：

This is ACM NEERC (North Eastern European Regional Contest),
sponsored by IBM (International Business Machines).
The. Best. Contest. Ever.
A GO (Great Opportunity) for all contestants.

AC代码（赛后补的）：

#include<bits/stdc++.h>
using namespace std;
int main()
{
    freopen("abbreviation.in","r",stdin);
    freopen("abbreviation.out","w",stdout);
    regex reg("\\b([A-Z][a-z]+ )+([A-Z][a-z]+)\\b");
    smatch reg_match;
    string str;
    while(getline(cin,str))
    {
        int len=str.length();
        while(regex_search(str,reg_match,reg))
        {
            string sub=reg_match[0];
            int sublen=sub.length();
            int pos=str.find(sub);
            for(int i=0;i<pos;++i) putchar(str[i]);
            for(int i=0;i<sublen;++i)
            if(isupper(sub[i]))
                putchar(sub[i]);
            putchar(' ');
            putchar('(');
            for(int i=0;i<sublen;++i) putchar(sub[i]);
            putchar(')');
            str=reg_match.suffix().str();
        }
        for(auto &ch:str) putchar(ch);
        puts("");
    }
}

赛中写的字符串模拟：

#include<bits/stdc++.h>
using namespace std;
string s;
vector<string> vec;
bool ok(string &s)
{
    int len=s.length();
    if(len<=1) return false;
    if(!isupper(s[0])) return false;
    for(int i=1;i<len;++i)
        if(!islower(s[i]))
            return false;
    return true;
}
bool output(int l,int r)
{
    for(int i=l;i<=r;++i)
        putchar(s[i]);
}
int main()
{
    freopen("abbreviation.in","r",stdin);
    freopen("abbreviation.out","w",stdout);
    while(getline(cin,s))
    {
        int len=s.length();
        int i=0;
        while(i<len)
        {
           // cout<<"len="<<len<<endl;
            if(!isalpha(s[i])) putchar(s[i++]);
            else
            {
                vec.clear();
                while(i<len)
                {
                    string word;
                    while(i<len && isalpha(s[i])) word+=s[i],i++;
                    vec.push_back(word);
                    if(s[i]!=' ')
                        break;
                    else i++;
                    //cout<<word<<endl;
                }
                //cout<<i<<endl;
                int sz=vec.size(),num;
                //cout<<"sz="<<sz<<endl;
                for(int k=0;k<sz;++k)
                {
                    int now=k;
                    while(now<sz && ok(vec[now])) ++now;

                    if(now==k)
                    {
                        cout<<vec[k];
                        if(k!=sz-1) cout<<" ";
                    }
                    else if(k+1==now)
                    {
                        --now;
                        cout<<vec[k];
                        if(k!=sz-1) cout<<" ";
                    }
                    else
                    {
                        for(int j=k;j<now;++j) putchar(vec[j][0]);
                        putchar(' ');
                        putchar('(');
                        for(int j=k;j<now;++j)
                        {
                            cout<<vec[j];
                            if(j<now-1) cout<<" ";
                        }
                        putchar(')');
                        --now;
                        if(now!=sz-1) cout<<" ";

                    }
                    k=now;
                }
                //cout<<endl;
               // cout<<"ok"<<endl;
                while(i<len && !isalpha(s[i])) putchar(s[i++]);
            }
        }
        puts("");
    }
    return 0;
}

命名规范问题

from 第十六届北京师范大学程序设计竞赛现场决赛

题意：给一些变量名，将符合(题中描述的) 驼峰命名法规范的变
量名转换为下划线命名法。不符合的原样输出。

好吧，这个题其实是我出的，标程用正则表达式，20 行左右搞定。

应该并不毒瘤吧，~~看大家模拟的挺开心的~~。

代码：

#include<bits/stdc++.h>
using namespace std;
int main()
{
    regex reg("\\b[A-Za-z][a-z]+([A-Z][a-z]+)+\\b");
    regex cap("[A-Z]");
    int T;
    cin>>T;
    while(T--)
    {
        string now;
        cin>>now;
        if(regex_match(now,reg))
        {
            now=regex_replace(now,cap,"_$0");
            transform(now.begin(),now.end(),now.begin(),::tolower);
        }
        if(now[0]=='_') now.erase(0,1);
        cout<<now<<endl;
    }
}

算法学习

#regex

正则表达式

https://izard.space/2018/04/25/正则表达式/

作者

forever_you

发布于

2018年4月25日

许可协议

莫队算法总结上一篇

树链剖分下一篇