Python regex metacharacters

Regular expressions

~ Mastering Regular Expressions (Douban)

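Terminology

regex: regular expression
matching: testing whether text fits a pattern
metacharacter: a character with special meaning in a pattern
flavor: a particular regex dialect
subexpression: a part of a larger expression
character: a literal character

What gets matched

lines
characters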

Python re module

~ 6.2. re — Regular expression operations — Python 3.6.6 documentation

Q What do the metacharacters mean in Python's re module?
NO flag
- default mode
flag
- DOTALL
- MULTILINE
O The default mode should usually be enough...
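
As a minimal sketch of how the two flags above change matching, the snippet below compares default mode with DOTALL and MULTILINE. The sample text and variable names are made up for illustration.

```python
import re

text = "first line\nsecond line"  # hypothetical sample input

# Default mode: "." stops at a newline, "^" matches only at the start of the string.
dot_default = re.search(r"first.*", text).group()
caret_default = re.findall(r"^\w+", text)

# re.DOTALL: "." also matches the newline character.
dot_all = re.search(r"first.*", text, re.DOTALL).group()

# re.MULTILINE: "^" also matches right after each newline.
caret_multi = re.findall(r"^\w+", text, re.MULTILINE)

print(dot_default)    # first line
print(caret_default)  # ['first']
print(repr(dot_all))  # 'first line\nsecond line'
print(caret_multi)    # ['first', 'second']
```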

Basic metacharacters

.
- matches any character except a newline
- with the DOTALL flag, matches any character including a newline
^
- the start of the string
- in MULTILINE mode, also the start of each line
$
- the end of the string or just before the newline at the end of the string
- in MULTILINE mode, also the end of each line
*
- match 0 or more repetitions of the preceding RE
+
- match 1 or more repetitions of the preceding RE
?
- match 0 or 1 repetitions of the preceding RE
*?, +?, ??
- *, +, ?
  - greedy: match as much text as possible
  - with a trailing ? => non-greedy: as few characters as possible will be matched
- example:
  - <a> b <c>
    - <.*> => <a> b <c>
    - <.*?> => <a>
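
The greedy versus non-greedy example above can be checked directly. This is a small sketch using the same `<a> b <c>` string; the variable names are illustrative only.

```python
import re

text = "<a> b <c>"

# Greedy: ".*" grabs as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.*>", text).group()

# Non-greedy: ".*?" stops at the first ">" that lets the pattern succeed.
lazy = re.search(r"<.*?>", text).group()

# With findall, the non-greedy form picks up each tag separately.
all_tags = re.findall(r"<.*?>", text)

print(greedy)    # <a> b <c>
print(lazy)      # <a>
print(all_tags)  # ['<a>', '<c>']
```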
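{m}
- exactly m copies of the previous RE should be matched
{m,n}
- match from m to n repetitions of the preceding RE
- greedy: as many repetitions as possible
- the bounds define a repetition range
  - omitting m gives a lower bound of 0
  - omitting n gives an infinite upper bound
  - the comma cannot be omitted (otherwise the form is read as {m})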
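
A short sketch of the counted quantifiers described above, including the forms with an omitted bound. The input strings are arbitrary examples.

```python
import re

digits = "1234567"

exact = re.match(r"\d{3}", digits).group()           # exactly three digits
ranged = re.match(r"\d{2,4}", digits).group()        # greedy: takes four
lazy_ranged = re.match(r"\d{2,4}?", digits).group()  # non-greedy: takes two

no_lower = re.match(r"a{,2}", "aaaa").group()  # lower bound defaults to 0
no_upper = re.match(r"a{2,}", "aaaa").group()  # upper bound is unlimited

print(exact, ranged, lazy_ranged)  # 123 1234 12
print(no_lower, no_upper)          # aa aaaa
```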
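[]
- character class (set of characters)
  - has its own rules inside
    - metacharacters lose their special meaning inside [] and match literally
      - [(+*)] => (, +, *, or )
  - alternatives: any one listed character matches
    - [amk] => a or m or k
  - - denotes a range
    - [0-9]
    - [a-z]
  - a leading ^ negates the set
    - [^5] => any character except 5
    - [^^] => any character except ^
  - ]
    - to match a literal ] inside a character class
      - escape it: \]
        
        [()[\]{}]
      - or place it at the beginning
        
        []()[{}]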
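
The character class rules above can be tried out as follows. This is an illustrative sketch with made-up input strings.

```python
import re

# Metacharacters are literal inside a class.
literal_meta = re.findall(r"[(+*)]", "(a+b)*c")

# A leading ^ negates the set; a ^ elsewhere is just a character.
not_five = re.findall(r"[^5]", "1525")
not_caret = re.findall(r"[^^]", "a^b")

# Two ways to include a literal "]": escape it, or put it first in the set.
escaped = re.findall(r"[()[\]{}]", "a[b]c")
leading = re.findall(r"[]()[{}]", "a[b]c")

print(literal_meta)      # ['(', '+', ')', '*']
print(not_five)          # ['1', '2']
print(not_caret)         # ['a', 'b']
print(escaped, leading)  # ['[', ']'] ['[', ']']
```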

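|
- alternation: A|B matches either A or B, where A and B are arbitrary REs

()
- group: matches the RE inside the parentheses and captures the matched text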
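
A brief sketch of alternation and grouping working together. It also shows that `|` has low precedence, so parentheses are needed to limit its scope; the sample strings are hypothetical.

```python
import re

m = re.search(r"(cat|dog)s?", "two dogs")
whole = m.group(0)     # the entire match
captured = m.group(1)  # what the parenthesized group matched

# Without a group, "|" splits the whole pattern: (^ab) or (cd$).
loose = re.search(r"^ab|cd$", "abX")
# With a group, the anchors apply to both alternatives.
strict = re.search(r"^(ab|cd)$", "abX")

print(whole, captured)  # dogs dog
print(loose.group())    # ab
print(strict)           # None
```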

Extension metacharacters

(?...)
- extension notation
(?aiLmsux)
- any combination of the letters a, i, L, m, s, u, x; sets flags for the entire regular expression:
  - re.A (ASCII-only matching),
  - re.I (ignore case),
  - re.L (locale dependent), [w]
  - re.M (multi-line),
  - re.S (dot matches all),
  - re.U (Unicode matching),
  - re.X (verbose), [w]
  - [w]=>6.2. re — Regular expression operations — Python 3.6.6 documentation
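
As a rough illustration of inline flags, the sketch below compares `(?i)` with passing `re.I`, and uses `(?x)` for a commented pattern. The phone-number pattern is a made-up example, not from the documentation.

```python
import re

text = "Python PYTHON python"

# (?i) at the start of the pattern is equivalent to passing re.I.
inline = re.findall(r"(?i)python", text)
with_flag = re.findall(r"python", text, re.I)

# (?x): verbose mode, whitespace and "#" comments in the pattern are ignored.
phone = re.compile(r"""(?x)
    \d{3}   # prefix
    -       # separator
    \d{4}   # line number
""")
found = phone.search("call 555-1234").group()

print(inline)               # ['Python', 'PYTHON', 'python']
print(inline == with_flag)  # True
print(found)                # 555-1234
```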

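(?:...)
- matches whatever RE is inside the parentheses (non-capturing version); the matched substring cannot be retrieved afterwards or referenced later in the pattern
- like (...), but does not create a group
  - (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)
  - useful when parentheses are needed only for grouping, e.g. nested inside another group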
(?imsx-imsx:...)
- new in Python 3.6
- (?imx: re) turns the i, m, or x flags on inside the parentheses only
- (?-imx: re) turns the i, m, or x flags off inside the parentheses only
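
A small sketch of scoped flags, assuming Python 3.6 or later. The input strings are invented for illustration.

```python
import re

# (?i:...) makes only the enclosed part case-insensitive.
scoped_on = re.findall(r"(?i:hello) World", "HELLO World, hello world")

# (?-i:...) switches case-insensitivity off for the enclosed part,
# even though re.I is set for the rest of the pattern.
scoped_off = re.findall(r"(?-i:a)b", "ab Ab aB", re.I)

print(scoped_on)   # ['HELLO World']
print(scoped_off)  # ['ab', 'aB']
```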
(?P<name>...)
- names a group
  - (?P<quote>['"]).*?(?P=quote)
    - (?P<quote>['"])
      - a group named quote whose content is ['"]
    - (?P=quote)
      - refers back to that group
(?P=name)
- backreference to the group named name; matches whatever text that group matched
  - ways to reference a named group
    - inside the same pattern
      - (?P=name)
      - \1
    - on the match object (match object m)
      - m.group('quote')
      - m.end('quote')
      - ...
    - inside the repl argument of re.sub()
      - \g<quote>
      - \g<1>
      - \1
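
The three ways of referring to a named group listed above can be sketched together. The sample sentence and the `...` replacement text are assumptions for illustration.

```python
import re

# The quote pattern from the notes: an opening quote, lazy content, the same quote.
pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = """say "it's fine" now"""

m = pattern.search(text)
print(m.group())          # "it's fine"
print(m.group("quote"))   # "
print(m.start("quote"), m.end("quote"))  # 4 5

# In the repl argument of sub(), the group can be referenced by name or number.
by_name = pattern.sub(r"\g<quote>...\g<quote>", text)
by_number = pattern.sub(r"\1...\1", text)
print(by_name)            # say "..." now
```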
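(?#...)
- a comment; the contents are ignored
(?=...)
- lookahead assertion
  - Isaac (?=Asimov)
    - matches 'Isaac ' only if it is followed by 'Asimov'
(?!...)
- negative lookahead assertion, the negation of (?=...)
  - Isaac (?!Asimov)
    - matches 'Isaac ' only if it is not followed by 'Asimov'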
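
A quick check of both lookahead forms using the Isaac/Asimov example. Note that the text inside the lookahead is tested but not consumed, so it is not part of the match.

```python
import re

pos_hit = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")
pos_miss = re.search(r"Isaac (?=Asimov)", "Isaac Newton")
neg_hit = re.search(r"Isaac (?!Asimov)", "Isaac Newton")
neg_miss = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")

print(repr(pos_hit.group()))  # 'Isaac '
print(pos_miss)               # None
print(repr(neg_hit.group()))  # 'Isaac '
print(neg_miss)               # None
```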
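(?<=...)
- a positive lookbehind assertion
  - matches if the current position is preceded by a match for ...
    - O the preceding text must match ..., but the returned match contains only what follows it
- ... may only match strings of some fixed length
  - group references of fixed length are supported since Python 3.5
  - O abc or a|b
  - X a* , a{3,4}
- (?<=abc)def finds a match in 'abcdef'; the match object contains only 'def'
  >>> import re
  >>> m = re.search('(?<=abc)def', 'abcdef')
  >>> m.group(0)
  'def'
(?<!...)
- a negative lookbehind assertion
- the negation of (?<=...)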
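
A sketch of the negative lookbehind and of the fixed-length restriction. The price/quantity string is a hypothetical example.

```python
import re

# Numbers that are NOT preceded by a dollar sign.
quantities = re.findall(r"(?<!\$)\b\d+", "cost $40, qty 12")
print(quantities)  # ['12']

# A variable-length pattern inside a lookbehind is rejected at compile time.
try:
    re.compile(r"(?<=a*)b")
    variable_width_rejected = False
except re.error as exc:
    variable_width_rejected = True
    print("rejected:", exc)
```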
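(?(id/name)yes-pattern|no-pattern)
- tries yes-pattern if the group with the given id or name matched, and no-pattern otherwise
e.g.
- email matching: (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)
  - (?(1)>|$)
    - if group 1 matched, a > is required; otherwise the end of the string $ is required
    - group 1 is the first parenthesized group, i.e. (<)
  - O <user@host.com> as well as user@host.com
  - X neither <user@host.com nor user@host.com>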
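
The conditional email pattern can be checked against the four cases listed above. This sketch uses `match()` so that the pattern is anchored at the start of each sample; the `EMAIL` and `results` names are illustrative.

```python
import re

EMAIL = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

samples = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
results = {s: bool(EMAIL.match(s)) for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {'match' if ok else 'no match'}")
```

The two balanced forms match, and the two forms with only one angle bracket do not.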

Special \ metacharacters

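\Number
- matches the contents of the group with that number
  - (.+) \1
    - O 'the the' or '55 55'
    - X but not 'thethe'
    - note the space after the group
  - groups are numbered from 1; \1...\9 match the contents of the n-th group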
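
A short sketch of numbered backreferences, using the pattern from the notes plus a hypothetical doubled-word search.

```python
import re

hit_words = re.search(r"(.+) \1", "the the")
hit_digits = re.search(r"(.+) \1", "55 55")
miss = re.search(r"(.+) \1", "thethe")  # no space between the copies

# Illustrative use: find words that are accidentally repeated.
doubled = re.findall(r"\b(\w+) \1\b", "it is is a test test")

print(hit_words.group(), "|", hit_digits.group())  # the the | 55 55
print(miss)     # None
print(doubled)  # ['is', 'test']
```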
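\A
- matches only at the start of the string
\b
- matches the empty string at a word boundary, i.e. the position between a word character and a non-word character (or the start or end of the string)
- 'er\b' matches the 'er' in "never" but not the 'er' in "verb"
\B
- matches the empty string where there is no word boundary
- 'er\B' matches the 'er' in "verb" but not the 'er' in "never"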
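
The never/verb examples for `\b` and `\B` can be verified as follows. Raw strings are used so that `\b` is not read as a backspace character.

```python
import re

boundary_never = re.search(r"er\b", "never")
boundary_verb = re.search(r"er\b", "verb")
inner_verb = re.search(r"er\B", "verb")
inner_never = re.search(r"er\B", "never")

print(boundary_never.span())  # (3, 5)
print(boundary_verb)          # None
print(inner_verb.span())      # (1, 3)
print(inner_never)            # None
```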
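\d
- matches any decimal digit; equivalent to [0-9] when the ASCII flag is set (for str patterns it otherwise matches any Unicode decimal digit)
\D
- matches any character that is not a decimal digit
\w
- matches letters, digits, and the underscore
\W
- matches any character that is not a letter, digit, or underscore
\s
- matches any whitespace character; equivalent to [ \t\n\r\f\v] when the ASCII flag is set
\S
- matches any non-whitespace character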
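
A sketch of the shorthand classes in use, including the difference the ASCII flag makes for `\d`. The sample strings are invented; U+0663 is the Arabic-Indic digit three.

```python
import re

numbers = re.findall(r"\d+", "tel: 555 1234")
words = re.findall(r"\w+", "snake_case x2!")
fields = re.split(r"\s+", "a \t b\nc")
digits_only = re.sub(r"\D", "", "555-1234")

# For str patterns, \d matches Unicode decimal digits unless re.ASCII is given.
unicode_digits = re.findall(r"\d", "3\u0663")
ascii_digits = re.findall(r"\d", "3\u0663", re.ASCII)

print(numbers)      # ['555', '1234']
print(words)        # ['snake_case', 'x2']
print(fields)       # ['a', 'b', 'c']
print(digits_only)  # 5551234
print(len(unicode_digits), len(ascii_digits))  # 2 1
```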

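\Z
- matches only at the end of the string; unlike $, it does not match before a trailing newline (in Perl-style flavors \Z does allow a trailing newline)
\z
- end of string in Perl-style flavors; not listed in the Python 3.6 re documentation, where \Z plays this role
\G
- position where the previous match ended, in flavors that support it; not listed in the Python 3.6 re documentation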
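
A small sketch comparing the string anchors `\A` and `\Z` with `^` and `$` in Python's re module. The two-line sample text is an assumption for illustration.

```python
import re

text = "line one\nline two\n"

# In MULTILINE mode ^ matches at every line start, \A only at the string start.
caret_hits = re.findall(r"^line", text, re.M)
string_start_hits = re.findall(r"\Aline", text, re.M)

# $ tolerates one trailing newline; \Z requires the very end of the string.
dollar = re.search(r"two$", text)
strict_end = re.search(r"two\Z", text)
strict_end_with_newline = re.search(r"two\n\Z", text)

print(len(caret_hits), len(string_start_hits))  # 2 1
print(dollar is not None)                       # True
print(strict_end)                               # None
print(strict_end_with_newline is not None)      # True
```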

Other

search() vs. match()

match() checks for a match only at the beginning of the string
search() scans the string for a match at any position

>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef")   # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>

>>> re.match("c", "abcdef")    # No match
>>> re.search("^c", "abcdef")  # No match
>>> re.search("^a", "abcdef")  # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>

span(): start and end positions of the match

import re
print(re.match('www', 'www.runoob.com').span())  # pattern is at the start: prints (0, 3)
print(re.match('com', 'www.runoob.com'))         # pattern is not at the start: prints None