^ (caret) and $ (dollar) represent the start and the end of a line, respectively.
Matching any one of several characters: use square brackets [...]. This is called a character class.
Dash - indicates a range of characters, e.g. [0-9] matches digits, [a-z] matches lowercase letters.
Dash at the beginning of a character class means a literal dash, e.g. [-123] matches a dash and 1, 2, 3.
To match any character except those listed, negate the class with a caret at the beginning: [^...]
. (dot) matches any character (in most flavors, any character except newline unless dot-matches-all mode is on).
| (bar) means or within parentheses. e.g., (first|1st) matches first or 1st.
Match word boundary: use the metasequences \< and \>, e.g., \<cat\> matches the word cat.
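The basics above can be tried quickly in Python. This is only a sketch: the sample strings are made up, and because Python's re module has no \< and \>, \b is used as the word-boundary stand-in.

```python
import re

# [0-9] is a range, [-123] contains a literal dash, [^...] negates the class.
digits = re.findall(r"[0-9]", "a1b22")
dash_or_digit = re.findall(r"[-123]", "4-1=3")
not_vowels = re.findall(r"[^aeiou]", "regex")

# Alternation inside parentheses: (first|1st).
ordinals = [m.group(0) for m in re.finditer(r"(first|1st)", "first, 1st, second")]

# \b stands in for \< and \> here (an assumption about the flavor in use).
whole_word = re.findall(r"\bcat\b", "cat concat cats cat.")

print(digits, dash_or_digit, not_vowels, ordinals, whole_word)
```

Only the standalone occurrences of cat are reported; concat and cats are skipped because no word boundary sits on both sides.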
Quantifiers
Optional item: use ? (question mark), e.g., colou?r matches color or colour. It can also apply to a longer expression in parentheses, e.g. 4(th)? matches 4 or 4th.
+ means one or more of the preceding item.
* means any number (including zero) of the preceding item. So hell*o +world matches hello world, hellllo world and similar sentences.
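A small Python sketch of the three quantifiers, using made-up test strings. The non-capturing group (?:th) is used only so that findall returns the whole match.

```python
import re

# ? makes the preceding item optional.
optional = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")]

# ? applied to a parenthesized expression.
group_optional = re.findall(r"4(?:th)?", "4 and 4th")

# + needs at least one space, * accepts zero or more.
one_or_more = re.fullmatch(r"hello +world", "hello   world") is not None
plus_needs_one = re.fullmatch(r"hello +world", "helloworld") is not None
zero_or_more = re.fullmatch(r"hello *world", "helloworld") is not None

print(optional, group_optional, one_or_more, plus_needs_one, zero_or_more)
```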
Lazy quantifiers
Some tools allow lazy quantifiers: *?, +?, ??, {num,num}?
These try to match as little as possible.
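The difference between greedy and lazy matching is easiest to see side by side. A minimal Python sketch with an invented sample string:

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy .* runs to the last </b> it can reach.
greedy = re.findall(r"<b>.*</b>", text)

# Lazy .*? stops at the first </b> that lets the match succeed.
lazy = re.findall(r"<b>.*?</b>", text)

print(greedy)
print(lazy)
```

The greedy version returns one long match spanning both tags, while the lazy version returns each tagged piece separately.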
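Possessive quantifiers
Written ?+, *+, ++, {num,num}+. Originally found mainly in Java; some other flavors (e.g. PCRE) support them as well.
They match as much as possible and never give anything back, even if that makes the overall match fail.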
Backreference
Backreferencing is the regex feature that allows you to match new text that is the same as some text matched earlier in the expression.
\1, \2, \3 refer to the text matched by the first, second, third sets of parentheses. For instance:
# to detect a repeated word we can use
\<([A-Za-z]+) +\1\>
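The repeated-word pattern can be sketched in Python as follows. Python's re module does not support \< and \>, so \b is substituted here; the sample sentences are made up.

```python
import re

# \1 must match exactly the text captured by the first parentheses.
repeated_word = re.compile(r"\b([A-Za-z]+) +\1\b")

doubled = repeated_word.findall("this is is a test test of the the regex")

# "the theory" is not a repeat: \b fails after the second "the".
no_repeat = repeated_word.search("the theory")

print(doubled, no_repeat)
```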
Character shorthands
\a alert character
\b backspace (only inside a character class; elsewhere \b is usually a word boundary)
\e escape character
\f form feed, ASCII FF
\n newline
\r return
\t tab
\v vertical tab
Octal escape
From \000 to \377
\033 for ESC
\015\012 for CR/LF
Hex and unicode sequence
Some flavors allow \x to escape hexadecimal values, e.g. \x0D\x0A matches CR/LF.
Some flavors use \u followed by four hex digits for a Unicode code point, e.g. \u00E9.
Control character
\cchar may be used to match a control character in some flavors.
\cH matches Control-H
Unicode combining character sequence
\X is a shorthand for \P{M}\p{M}*, which matches a base character possibly followed by several combining characters.
Unicode properties, scripts and blocks
\p{quality} matches a character with quality while \P{quality} matches a character that does not.
\p{L}, \p{Letter} - Things considered letters.
\p{M}, \p{Mark} - Various characters that are not meant to appear by themselves, but combine with a base character (accents and the like).
\p{Z}, \p{Separator} - Characters that separate things, but have no visual representation (various kinds of spaces ...).
\p{S}, \p{Symbol} - Various types of dingbats and symbols.
\p{N}, \p{Number} - Any kind of numeric character.
\p{P}, \p{Punctuation} - Punctuation characters.
\p{C}, \p{Other} - Catch-all for everything else (rarely used for normal characters).
\p{Script} matches characters from specific writing system. For instance, \p{Hebrew}
Class shorthands
\s is the generic whitespace shorthand, which matches space, tab, newline, carriage return.
\S is anything not matched by \s
\w matches [a-zA-Z0-9_]
\W is anything not matched by \w, i.e., [^a-zA-Z0-9_]
\d matches [0-9], i.e., a digit
\D is anything not matched by \d, i.e., [^0-9]
Class operation
.NET and Java offer class subtraction with different syntax:
[a-z-[aeiou]] in .NET
[[a-z]&&[^aeiou]] in Java
Class set operations can be mimicked with lookaround:
(?!\p{Cn})\p{InThai} is the same as [\p{InThai}&&[^\p{Cn}]]
POSIX character class
[:alnum:] alphabetic and numeric characters
[:alpha:] alphabetic characters
[:blank:] space and tab
[:cntrl:] control characters
[:digit:] digits
[:graph:] non-blank characters
and so on
The advantage of POSIX character classes is that they are locale dependent.
There are also POSIX "collating sequences", which let a locale treat a sequence of characters as a single unit for sorting.
There are also POSIX "character equivalents", which indicate that certain characters should be considered identical for sorting and such.
Popular modifier
i modifier comes after the closing delimiter, as in m/pattern/i, to match case-insensitively.
g modifier comes after a substitution, as in s/pattern1/pattern2/g, to replace globally.
s/pattern1/pattern2/ is the substitution operator: it replaces text matching pattern1 with pattern2.
Example: get 3 digits after the decimal point if the third is not a zero, otherwise just take 2
$fraction =~ s/(\.\d\d[1-9]?)\d*/$1/
m modifier: enhanced line-anchor match mode
x modifier: free-format (free-spacing) mode
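The decimal-trimming substitution in the example above can be checked outside Perl. The following is a rough Python equivalent; the function name trim_fraction is hypothetical, and count=1 mirrors the fact that the Perl version has no g modifier.

```python
import re

def trim_fraction(number_text: str) -> str:
    """Keep 3 decimals if the third is non-zero, otherwise keep 2."""
    # Group 1 holds the part to keep; the trailing \d* is discarded.
    return re.sub(r"(\.\d\d[1-9]?)\d*", r"\1", number_text, count=1)

print(trim_fraction("3.14159"))
print(trim_fraction("3.14059"))
print(trim_fraction("2.5"))
```

A value with fewer than two decimals, such as 2.5, is left untouched because the pattern requires at least two digits after the point.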
Look around: positive look ahead and look behind
Look ahead is (?=...) and look behind is (?<=...). Lookaround does not consume text; it only marks positions.
For instance, to match Jeff only if it is part of Jeffrey:
(?=Jeffrey)Jeff matches the Jeff in 'by Jeffrey Friedl' but nothing in 'by Jefferson Friedl'
# insert ' into Jeffs to get Jeff's
s/(?<=\bJeff)(?=s\b)/'/g
# insert commas between groups of 3 digits, e.g. 123456789 becomes 123,456,789
# (no leading \b here: a word boundary can never sit between two digits)
$pop =~ s/(?<=\d)(?=(\d\d\d)+\b)/,/g;
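The positive lookaround examples translate fairly directly to Python, whose re module supports fixed-width lookbehind. This is a sketch with made-up inputs; note that the comma pattern is written without a leading \b, since a word boundary cannot occur between two digits.

```python
import re

# Match "Jeff" only where "Jeffrey" starts at the same position.
in_jeffrey = re.search(r"(?=Jeffrey)Jeff", "by Jeffrey Friedl")
in_jefferson = re.search(r"(?=Jeffrey)Jeff", "by Jefferson Friedl")

# Insert an apostrophe at the position between "Jeff" and "s".
possessive = re.sub(r"(?<=\bJeff)(?=s\b)", "'", "Jeffs book")

# Insert commas at positions with a digit on the left and
# whole groups of three digits on the right.
def add_commas(text: str) -> str:
    return re.sub(r"(?<=\d)(?=(\d\d\d)+\b)", ",", text)

print(possessive)
print(add_commas("population 281421906 grows"))
```

Both substitutions replace an empty match, so nothing is removed from the text; only a character is inserted at the marked position.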
Look around: negative look ahead and look behind
Negative look ahead (?!...): succeeds if the subexpression cannot match to the right.
Negative look behind (?<!...): succeeds if the subexpression cannot match to the left.
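A short Python sketch of both negative forms. The sample strings and the dollar-amount scenario are invented for illustration.

```python
import re

# Negative lookahead: "Jeff" not followed by "rey".
not_jeffrey = re.findall(r"Jeff(?!rey)\w*", "Jeffrey Jefferson Jeff")

# Negative lookbehind: a run of digits not preceded by "$".
# \d is included in the class so the match cannot start mid-number.
not_price = re.findall(r"(?<![$\d])\d+", "cost $45 for 12 items")

print(not_jeffrey, not_price)
```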
Enhanced line anchor
Text made of several logical lines can be matched line by line in enhanced line-anchor mode, where ^ and $ also match at embedded newlines. In Perl this is the m modifier, like this: s/pattern1/pattern2/mg
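In Python the same mode is selected with the re.MULTILINE flag. A minimal sketch on a made-up three-line string:

```python
import re

text = "first line\nsecond line\nthird"

# Without the flag, ^ and $ anchor only at the ends of the string.
first_words_default = re.findall(r"^\w+", text)
last_words_default = re.findall(r"\w+$", text)

# With re.MULTILINE they also anchor at embedded newlines.
first_words_multiline = re.findall(r"^\w+", text, re.MULTILINE)
last_words_multiline = re.findall(r"\w+$", text, re.MULTILINE)

print(first_words_default, first_words_multiline)
print(last_words_default, last_words_multiline)
```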
Anchors and other zero-width assertions
Start of line/string: ^, \A
End of line/string: $, \z, \Z
Start of match or end of previous match: \G. If a match is not successful, the location at which \G matches is reset back to the beginning of the string.
Word boundary
\b, \<, \>, ... are word boundaries. If \b is supported, \B (not a word boundary) may be supported too.
Note that \w and \b may not agree when Unicode text is involved. In that case, use \p{L} to detect words.
Mode modifier
In the form (?modifier), such as (?i) or (?-i)
(?i) turns on case insensitivity and (?-i) turns it off.
x free-spacing and comments regex mode
s dot-matches-all match mode
m enhanced line-anchor match mode
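Python accepts these modes either inline or as flags. The sketch below assumes Python's re module, where an inline (?i) must appear at the start of the pattern and a scoped form such as (?i:...) limits a mode to one group; the phone-number format is made up.

```python
import re

# Inline (?i) makes the whole pattern case-insensitive.
case_insensitive = re.search(r"(?i)hello", "Say HeLLo")

# Scoped form: only "abc" is case-insensitive, "def" is not.
scoped_ok = re.fullmatch(r"(?i:abc)def", "ABCdef")
scoped_fail = re.fullmatch(r"(?i:abc)def", "ABCDEF")

# (?s) lets dot match a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"(?s)a.b", "a\nb")

# Free-spacing mode: whitespace is ignored and # starts a comment.
phone = re.compile(r"""
    (\d{3})   # first group of digits
    -
    (\d{4})   # second group of digits
""", re.VERBOSE)
phone_parts = phone.fullmatch("555-1234").groups()

print(case_insensitive.group(0), phone_parts)
```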
Literal text span
\Q...\E turns off all regex metacharacters between them, except for \E itself.
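Python's re module has no \Q...\E, so the sketch below uses re.escape as the closest stand-in; that substitution is an assumption about which tool is in use, not part of the notes above.

```python
import re

literal = "1+1=2"

# Used directly as a pattern, "+" is a quantifier and the text is not found.
as_regex = re.search(literal, "so 1+1=2 holds")

# re.escape quotes the metacharacters so the text matches literally.
as_literal = re.search(re.escape(literal), "so 1+1=2 holds")

print(as_regex, as_literal.group(0))
```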
Grouping
Grouping and capturing by (...)
(?:...) groups but does not capture. For instance,
(?:[0-9]*)
# groups but does not capture a run of digits
Some flavors support named capture, such as
(?P<name>...) in Python and PHP, which can be referred to as (?P=name)
(?<name>...) in .NET, which can be referred to with \k<name>.
Atomic grouping (?>...) means that once the subexpression matches, what it matched becomes fixed and is not given back by backtracking.
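Capturing, non-capturing, and named groups can be compared in a few lines of Python. The group names area, number, and word are hypothetical, as are the sample strings.

```python
import re

# Non-capturing group: only (c) shows up in groups().
noncap = re.match(r"(?:ab)+(c)", "ababc")

# Named capture with (?P<name>...).
m = re.match(r"(?P<area>\d{3})-(?P<number>\d{4})", "555-1234")

# Named backreference with (?P=name).
repeated = re.search(r"\b(?P<word>[A-Za-z]+) +(?P=word)\b", "it was the the best")

print(noncap.groups(), m.groupdict(), repeated.group("word"))
```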
Conditional matching
(? if then | else), where the if part is a test
Match a word optionally wrapped in <>
(<)?\w+(?(1)>)
A lookaround can also be used as the test
# if preceded by NUM:, match digits, otherwise match a word
(?(?<=NUM:)\d+|\w+)
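The group-number form of the conditional works in Python, so the optional angle-bracket example can be tested directly. Python's re module does not accept a lookaround as the test, so only the (?(1)...) form is sketched here; the helper name is hypothetical.

```python
import re

# If group 1 (the opening "<") took part in the match, require a closing ">".
bracketed_word = re.compile(r"(<)?\w+(?(1)>)")

def is_word_or_bracketed(text: str) -> bool:
    return bracketed_word.fullmatch(text) is not None

for sample in ("word", "<word>", "<word", "word>"):
    print(sample, is_word_or_bracketed(sample))
```

The unbalanced forms are rejected because the closing bracket is demanded exactly when the opening one was matched.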
Mechanics of regular expressions processing
Types of engine: NFA, POSIX NFA, DFA, Hybrid NFA/DFA
How they work? TODO
egrep
egrep: search a list of files for lines matching a regular expression, like this
egrep 'regex' file1 file2 ...
sed
sed is perhaps faster than awk but not as powerful; awk can do what sed can and more.
Awk
Good for one-liner tasks, but a full awk program may not be worth it since you already have Perl or Python.
Perl
See [[programming:perl]] for the Perl programming language
Do what sed does
perl -p -e "s/pattern1/pattern2/g" file
The qr operator defines a regex object
# match a host name
$HostnameRegex = qr/[-a-z0-9]+(\.[-a-z0-9]+)+\.(com|edu|info)/i;
ANSI escape sequences \e[7m and \e[m can be used to highlight text
\e[7m$1\e[m then $1 will be highlighted
\N{name} accesses a Unicode character by its name, e.g. \N{INVERTED EXCLAMATION MARK}
$/ is a variable defining chunk mode ending. Undefine it with undef $/.
$ARGV is the file name
HOWTOS
Removing the leading path from filename
# for unix file name
s{^.*/}{}
# for Windows filenames; the backslash may need additional escaping in Java or PHP
s{^.*\\}{}
Get the file name from a path
# Unix: everything at the end that is not a forward slash
([^/]*$)
# Windows: everything at the end that is not a backslash
([^\\]*$)
Get both filename and path
^(.*)/(.*)$
# or a more verbose version
^(.*)/([^/]*)$
# note: if these don't match, there is only a filename and no leading path; the code should handle that case
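The path-splitting patterns, including the no-match case mentioned in the note, can be sketched in Python like this. The function names are hypothetical and only Unix-style forward slashes are handled.

```python
import re

def strip_leading_path(path: str) -> str:
    """Remove everything up to and including the last forward slash."""
    return re.sub(r"^.*/", "", path)

def split_path(path: str) -> tuple:
    """Return (directory, filename); directory is empty if there is no slash."""
    m = re.match(r"^(.*)/([^/]*)$", path)
    if m:
        return m.group(1), m.group(2)
    # No match means the path holds only a filename.
    return "", path

print(strip_leading_path("/usr/local/bin/perl"))
print(split_path("/usr/local/bin/perl"))
print(split_path("perl"))
```

The greedy (.*) takes as much as it can, so the split always happens at the last slash.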