
Matching an IP address is another good example of a trade-off between regex
complexity and exactness. \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b will match any IP
address just fine, but will also match 999.999.999.999 as if it were a valid IP
address. Whether this is a problem depends on the files or data you intend to apply
the regex to. To restrict all 4 numbers in the IP address to 0..255, you can use
the following regex. It stores each of the 4 numbers of the IP address into a
capturing group. You can use these groups to further process the IP number. Free-
spacing mode allows this to fit the width of the page.

\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b

If you don't need access to the individual numbers, you can shorten the regex with
a quantifier to:

\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b

Similarly, you can shorten the quick regex to \b(?:\d{1,3}\.){3}\d{1,3}\b
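
To see the trade-off in practice, here is a minimal Python sketch that compiles both the strict and the quick regex. The names OCTET, STRICT_IP, QUICK_IP and find_ips are made up for this illustration, and Python's re module is just one possible engine.

```python
import re

# The 0..255 alternation from the text, reused for all four numbers.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
STRICT_IP = re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b")
# The quick regex: any four groups of one to three digits.
QUICK_IP = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_ips(text, strict=True):
    """Return every substring of text that the chosen regex accepts."""
    pattern = STRICT_IP if strict else QUICK_IP
    return pattern.findall(text)
```

With strict=True the string 999.999.999.999 is skipped; with strict=False it is returned like any other address.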

Trimming Whitespace

You can easily trim unnecessary whitespace from the start and the end of a string
or the lines in a text file by doing a regex search-and-replace. Search for ^[ \t]+
and replace with nothing to delete leading whitespace (spaces and tabs). Search for
[ \t]+$ to trim trailing whitespace. Do both by combining the regular expressions
into ^[ \t]+|[ \t]+$. Instead of [ \t] which matches a space or a tab, you can
expand the character class into [ \t\r\n] if you also want to strip line breaks. Or
you can use the shorthand \s instead.
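
As a rough illustration, the combined regex can be applied to every line of a string in Python like this. The helper name trim_lines is hypothetical; the MULTILINE flag is what makes the anchors match at each line break.

```python
import re

# ^ and $ must match at every line break, hence re.MULTILINE.
TRIM_LINES = re.compile(r"^[ \t]+|[ \t]+$", re.MULTILINE)

def trim_lines(text):
    """Delete leading and trailing spaces and tabs on each line of text."""
    return TRIM_LINES.sub("", text)
```

Because the character class is [ \t] rather than \s, the line breaks themselves are left alone.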

Grabbing HTML Tags

<TAG\b[^>]*>(.*?)</TAG> matches the opening and closing pair of a specific HTML
tag. Anything between the tags is captured into the first backreference. The
question mark in the regex makes the star lazy, to make sure it stops before the
first closing tag rather than before the last, like a greedy star would do. This
regex will not properly match tags nested inside themselves, like in
<TAG>one<TAG>two</TAG>one</TAG>.

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> will match the opening and closing pair of any
HTML tag. Be sure to turn off case sensitivity. The key in this solution is the use
of the backreference \1 in the regex. Anything between the tags is captured into
the second backreference. This solution will also not match tags nested in
themselves.
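
A small Python sketch of the second regex, assuming the re module as the engine; ANY_TAG and tag_contents are names invented for this example. It also shows what happens with a tag nested inside itself.

```python
import re

# Case sensitivity is turned off, as the text requires.
ANY_TAG = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", re.IGNORECASE)

def tag_contents(html):
    """Return (tag name, text between the tags) for each matched pair."""
    return [(m.group(1), m.group(2)) for m in ANY_TAG.finditer(html)]
```

For the nested input <TAG>one<TAG>two</TAG>one</TAG> the lazy star stops at the first closing tag, so the captured content is one<TAG>two rather than the full outer element.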

Deleting Duplicate Lines From a File

If you have a file in which all lines are sorted (alphabetically or otherwise), you
can easily delete (consecutive) duplicate lines. Simply open the file in your
favorite text editor, and do a search-and-replace searching for ^(.*)(\r?\n\1)+$
and replacing with \1. For this to work, the anchors need to match before and after
line breaks (and not just at the start and the end of the file or string), and the
dot must not match newlines.

Here is how this works. The caret will match only at the start of a line. So the
regex engine will only attempt to match the remainder of the regex there. The dot
and star combination simply matches an entire line, whatever its contents, if any.
The parentheses store the matched line into the first backreference.

Next we will match the line separator. I put the question mark into \r?\n to make
this regex work with both Windows (\r\n) and UNIX (\n) text files. So up to this
point we matched a line and the following line break.

Now we need to check if this combination is followed by a duplicate of that same
line. We do this simply with \1. This is the first backreference which holds the
line we matched. The backreference will match that very same text.

If the backreference fails to match, the regex match and the backreference are
discarded, and the regex engine tries again at the start of the next line. If the
backreference succeeds, the plus symbol in the regular expression will try to match
additional copies of the line. Finally, the dollar symbol forces the regex engine
to check if the text matched by the backreference is a complete line. We already
know the text matched by the backreference is preceded by a line break (matched
by \r?\n). Therefore, we now check if it is also followed by a line break or if it
is at the end of the file using the dollar sign.

The entire match becomes line\nline (or line\nline\nline etc.). Because we are
doing a search and replace, the line, its duplicates, and the line breaks in
between them, are all deleted from the file. Since we want to keep the original
line, but not the duplicates, we use \1 as the replacement text to put the original
line back in.
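
Outside a text editor, the same search-and-replace can be sketched in Python. The function name drop_consecutive_duplicates is hypothetical; re.MULTILINE provides the required anchor behavior, and Python's dot already refuses to match \n by default.

```python
import re

# Anchors match at line breaks (MULTILINE); the dot does not match \n.
DUP_LINES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)

def drop_consecutive_duplicates(text):
    """Replace each run of identical consecutive lines with a single copy."""
    return DUP_LINES.sub(r"\1", text)
```

The dollar sign is what keeps a line such as ab from being treated as a duplicate of the start of a following line abc.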

Matching Whole Lines of Text

Often, you want to match complete lines in a text file rather than just the part of
the line that satisfies a certain requirement. This is useful if you want to delete
entire lines in a search-and-replace in a text editor, or collect entire lines in
an information retrieval tool.

To keep this example simple, let's say we want to match lines containing the word
"John". The regex John makes it easy enough to locate those lines. But the software
will only indicate John as the match, not the entire line containing the word.

The solution is fairly simple. To specify that we need an entire line, we will use
the caret and dollar sign and turn on the option to make them match at embedded
newlines. In software aimed at working with text files like EditPad Pro and
PowerGREP, the anchors always match at embedded newlines. To match the parts of the
line before and after the match of our original regular expression John, we simply
use the dot and the star. Be sure to turn off the option for the dot to match
newlines.

The resulting regex is: ^.*John.*$. You can use the same method to expand the match
of any regular expression to an entire line, or a block of complete lines. In some
cases, such as when using alternation, you will need to group the original regex
together using parentheses.
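
The expansion to whole lines can be wrapped in a small helper. This is only a sketch in Python; the name whole_lines is made up, and the inner regex is always grouped so that alternation inside it cannot leak out.

```python
import re

def whole_lines(pattern, text):
    """Return every complete line of text that contains a match of pattern."""
    # The non-capturing group keeps any alternation in pattern contained.
    line_regex = re.compile(r"^.*(?:" + pattern + r").*$", re.MULTILINE)
    return [m.group(0) for m in line_regex.finditer(text)]
```

For example, whole_lines("cat|dog", text) returns the lines containing either word, which would not work if cat|dog were pasted between the two .* without the group.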

Finding Lines Containing or Not Containing Certain Words

If a line can meet any one of a series of requirements, simply use alternation in the
regular expression. ^.*\b(one|two|three)\b.*$ matches a complete line of text that
contains any of the words "one", "two" or "three". The first backreference will
contain the word the line actually contains. If it contains more than one of the
words, then the last (rightmost) word will be captured into the first
backreference. This is because the star is greedy. If we make the first star lazy,
like in ^.*?\b(one|two|three)\b.*$, then the backreference will contain the first
(leftmost) word.

If a line must satisfy all of multiple requirements, we need to use lookahead.
^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).*$ matches a complete line of text that
contains all of the words "one", "two" and "three". Again, the anchors must match
at the start and end of a line and the dot must not match line breaks. Because of
the caret, and the fact that lookahead is zero-length, all three lookaheads
are attempted at the start of each line. Each lookahead will match any piece of
text on a single line (.*?) followed by one of the words. All three must match
successfully for the entire regex to match. Note that instead of words like
\bword\b, you can put any regular expression, no matter how complex, inside the
lookahead. Finally, .*$ causes the regex to actually match the line, after the
lookaheads have determined it meets the requirements.

If your condition is that a line should not contain something, use negative
lookahead. ^((?!regexp).)*$ matches a complete line that does not match regexp.
Notice that unlike before, when using positive lookahead, I repeated both the
negative lookahead and the dot together. For the positive lookahead, we only need
to find one location where it can match. But the negative lookahead must be tested
at each and every character position in the line. We must test that regexp fails
everywhere, not just somewhere.

Finally, you can combine multiple positive and negative requirements as follows:
^(?=.*?\bmust-have\b)(?=.*?\bmandatory\b)((?!avoid|illegal).)*$. When checking
multiple positive requirements, the .* at the end of the regular expression full of
zero-length assertions made sure that we actually matched something. Since the
negative requirement must match the entire line, it is easy to replace the .* with
the negative test.
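
The three kinds of requirement can be tried side by side in Python. This is a sketch only: the constant names and matching_lines are invented here, "foo" stands in for regexp, and the capturing group around the negative test has been made non-capturing since the captured character is not needed.

```python
import re

# All of the words must be present somewhere on the line.
ALL_THREE = re.compile(
    r"^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).*$", re.MULTILINE)
# The negative lookahead is tested before every single character.
WITHOUT_FOO = re.compile(r"^(?:(?!foo).)*$", re.MULTILINE)
# Positive and negative requirements combined, as in the text.
COMBINED = re.compile(
    r"^(?=.*?\bmust-have\b)(?=.*?\bmandatory\b)(?:(?!avoid|illegal).)*$",
    re.MULTILINE)

def matching_lines(regex, text):
    """Return the complete lines of text matched by regex."""
    return [m.group(0) for m in regex.finditer(text)]
```

Note that the negative pattern also matches empty lines, since an empty line trivially contains no match of regexp.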

How to Find or Validate an Email Address

The regular expression I receive the most feedback on, not to mention "bug" reports,
is the one you'll find right on this site's home page:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b. This regular expression, I claim, matches
any email address.
Most of the feedback I get refutes that claim by showing one email address that
this regex doesn't match. Usually, the "bug" report also includes a suggestion to
make the regex "perfect".

As I explain below, my claim only holds true when one accepts my definition of what
a valid email address really is, and what it's not. If you want to use a different
definition, you'll have to adapt the regex. Matching a valid email address is a
perfect example showing that (1) before writing a regex, you have to know exactly
what you're trying to match, and what not; and (2) there's often a trade-off
between what's exact, and what's practical.

The virtue of my regular expression above is that it matches 99% of the email
addresses in use today. All the email addresses it matches can be handled by 99% of
all email software out there. If you're looking for a quick solution, you only need
to read the next paragraph. If you want to know all the trade-offs and get plenty
of alternatives to choose from, read on.

If you want to use the regular expression above, there are two things you need to
understand. First, long regexes make it difficult to nicely format paragraphs. So I
didn't include a-z in any of the three character classes. This regex is intended to
be used with your regex engine's "case insensitive" option turned on. (You'd be
surprised how many "bug" reports I get about that.) Second, the above regex is
delimited with word boundaries, which makes it suitable for extracting email
addresses from files or larger blocks of text. If you want to check whether the
user typed in a valid email address, replace the word boundaries with
start-of-string and end-of-string anchors, like this:
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$.

The previous paragraph also applies to all following examples. You may need to
change word boundaries into start/end-of-string anchors, or vice versa. And you
will need to turn on the case insensitive matching option.

Trade-Offs in Validating Email Addresses

Yes, there are a whole bunch of email addresses that my pet regex doesn't match.
The most frequently quoted examples are addresses on the .museum top level domain,
which is longer than the 4 letters my regex allows for the top level domain. I
accept this trade-off because the number of people using .museum email addresses is
extremely low. I've never had a complaint that the order forms or newsletter
subscription forms on the JGsoft websites refused a .museum address (which they
would, since they use the above regex to validate the email address).

To include .museum, you could use ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$.

However, then there's another trade-off. This regex will match john@mail.office.
It's far more likely that John forgot to type in the .com top level domain rather
than having just created a new .office top level domain without ICANN's permission.

This shows another trade-off: do you want the regex to check if the top level
domain exists? My regex doesn't. Any combination of two to four letters will do,
which covers all existing and planned top level domains except .museum. But it will
match addresses with invalid top-level domains like asdf@asdf.asdf. By not being
overly strict about the top-level domain, I don't have to update the regex each
time a new top-level domain is created, whether it's a country code or generic
domain.

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$
could be used to allow any two-letter
country code top level domain, and only specific generic top level domains. By the
time you read this, the list might already be out of date. If you use this regular
expression, I recommend you store it in a global constant in your application, so
you only have to update it in one place. You could list all country codes in the
same manner, even though there are almost 200 of them.

Email addresses can be on servers on a subdomain, e.g.
john@server.department.company.com. All of the above regexes will match this email
address, because I included a dot in the character class after the @ symbol.
However, the above regexes will also match john@aol...com which is not valid due to
the consecutive dots. You can exclude such matches by replacing [A-Z0-9.-]+\. with
(?:[A-Z0-9-]+\.)+ in any of the above regexes. I removed the dot from the character
class and instead repeated the character class and the following literal dot.
E.g. \b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b will match
john@server.department.company.com but not john@aol...com.
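
Both uses of the regex, extracting addresses from text and validating a single input, can be sketched in Python as follows. The names EMAIL_IN_TEXT, EMAIL_WHOLE, extract_emails and is_valid_email are assumptions made for this example; the patterns are the subdomain-aware variants from the paragraph above, with the case insensitive option turned on.

```python
import re

# Word boundaries: suitable for pulling addresses out of larger text.
EMAIL_IN_TEXT = re.compile(
    r"\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b", re.IGNORECASE)
# Start and end anchors: suitable for checking one complete input.
EMAIL_WHOLE = re.compile(
    r"^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$", re.IGNORECASE)

def extract_emails(text):
    """Return all email-like substrings found in text."""
    return EMAIL_IN_TEXT.findall(text)

def is_valid_email(candidate):
    """Check that the whole string is one address by this definition."""
    # fullmatch also rejects a trailing newline, which $ alone would allow.
    return EMAIL_WHOLE.fullmatch(candidate) is not None
```

As the text warns, this definition rejects a .museum address because of the {2,4} limit on the top level domain.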

Another trade-off is that my regex only allows English letters, digits and a few
special symbols. The main reason is that I don't trust all my email software to be
able to handle much else. Even though John.O'Hara@theoharas.com is a syntactically
valid email address, there's a risk that some software will misinterpret the
apostrophe as a delimiting quote. E.g. blindly inserting this email address into a
SQL statement will cause it to fail if strings are delimited with single quotes. And of
course, it's been many years already that domain names can include non-English
characters. Most software and even domain name registrars, however, still stick to
the 37 characters they're used to.

The conclusion is that to decide which regular expression to use, whether you're
trying to match an email address or something else that's vaguely defined, you need
to start with considering all the trade-offs. How bad is it to match something
that's not valid? How bad is it not to match something that is valid? How complex
can your regular expression be? How expensive would it be if you had to change the
regular expression later? Different answers to these questions will require a
different regular expression as the solution. My email regex does what I want, but
it may not do what you want.

Regexes Don't Send Email

Don't go overboard in trying to eliminate invalid email addresses with your regular
expression. If you have to accept .museum domains, allowing any 6-letter top level
domain is often better than spelling out a list of all current domains. The reason
is that you don't really know whether an address is valid until you try to send an
email to it. And even that might not be enough. Even if the email arrives in a
mailbox, that doesn't mean somebody still reads that mailbox.

The same principle applies in many situations. When trying to match a valid date,
it's often easier to use a bit of arithmetic to check for leap years, rather than
trying to do it in a regex. Use a regular expression to find potential matches or
check if the input uses the proper syntax, and do the actual validation on the
potential matches returned by the regular expression. Regular expressions are a
powerful tool, but they're far from a panacea.
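
For the date example, the split between syntax check and real validation might look like this in Python. The YYYY-MM-DD format and the name parse_date are assumptions chosen for the illustration; the calendar arithmetic, including leap years, is left to the standard datetime module.

```python
import re
from datetime import date

# The regex only checks the shape of the input: four digits, two, two.
DATE_SYNTAX = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def parse_date(text):
    """Return a date for a valid YYYY-MM-DD string, or None otherwise."""
    m = DATE_SYNTAX.fullmatch(text)
    if m is None:
        return None
    year, month, day = map(int, m.groups())
    try:
        # date() rejects impossible values such as February 29 in 2023.
        return date(year, month, day)
    except ValueError:
        return None
```

The regex stays short and readable because it never has to know how many days a month has.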

The Official Standard: RFC 5322

Maybe you're wondering why there's no "official" fool-proof regex to match email
addresses. Well, there is an official definition, but it's hardly fool-proof.

The official standard is known as RFC 5322. It describes the syntax that valid
email addresses must adhere to. You can (but you shouldn't--read on) implement it
with this regular expression:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
| "(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
| \\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
@ (?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
| \[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x5e-\x7f]
| \\[\x01-\x09\x0b\x0c\x0e-\x7f])+)
\])

This regex has two parts: the part before the @, and the part after the @. There
are two alternatives for the part before the @: it can either consist of a series
of letters, digits and certain symbols, including one or more dots. However, dots
may not appear consecutively or at the start or end of the email address. The other
alternative requires the part before the @ to be enclosed in double quotes,
allowing any string of ASCII characters between the quotes. Whitespace characters,
double quotes and backslashes must be escaped with backslashes.

The part after the @ also has two alternatives. It can either be a fully qualified
domain name (e.g. regular-expressions.info), or it can be a literal Internet
address between square brackets. The literal Internet address can either be an IP
address, or a domain-specific routing address.

The reason you shouldn't use this regex is that it only checks the basic syntax of
email addresses. john@aol.com.nospam would be considered a valid email address
according to RFC 5322. Obviously, this email address won't work, since there's no
"nospam" top-level domain. It also doesn't guarantee your email software will be
able to handle it. Not all applications support the syntax using double quotes or
square brackets. In fact, RFC 5322 itself marks the notation using square brackets
as obsolete.

We get a more practical implementation of RFC 5322 if we omit the syntax using
double quotes and square brackets. It will still match 99.99% of all email
addresses in actual use today.

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

A further change you could make is to allow any two-letter country code top level
domain, and only specific generic top level domains. This regex filters dummy email
addresses like asdf@adsf.adsf. You will need to update it as new top-level domains
are added.

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|
biz|info|mobi|name|aero|asia|jobs|museum)\b

So even when following official standards, there are still trade-offs to be made.
Don't blindly copy regular expressions from online libraries or discussion forums.
Always test them on your own data and with your own applications.

Domain names

(First shot)
^[a-zA-Z]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.[a-zA-Z]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)?$
This regex does NOT attempt, obviously, to ensure that the last label it
matches is a known top-level domain...

XML-like data

To match something similar to XML-tags you can use regular-expressions, too. Let's
assume we have this text:
% set text {<bo>s</bo><it><bo>M</bo></it>}
We can match the body of bo with this regexp:
% regexp "<(bo)>(.*?)</bo>" $text dummy tag body
Now we extend our XML-text with some attributes for the tags, say:
% set text2 {<bo h="m">s</bo><it><bo>M</bo></it>}
If we try to match this with:
% regexp "<(bo)\\s+(.+?)>(.*?)</bo>" $text2 dummy tag attributes body
it won't work anymore. This is because \\s+ is greedy (in contrast to the
non-greedy (.+?) and (.*?)), and in Tcl the first quantifier in the expression
decides whether the whole expression is matched greedily.
See Henry Spencer's reply in tcl 8.2 regexp not doing non-greedy matching
correctly, comp.lang.tcl, 1999-09-20.
The correct way is:
% regexp "<(bo)\\s+?(.+?)>(.*?)</bo>" $text2 dummy tag attributes body
Now we can write a more general XML-to-whatever translator like this:
1. Substitute [ and ] with their corresponding &#91; and &#93; to avoid confusion
   with "subst" in 3.
2. Substitute the tags and attributes with commands
3. Do a "subst" on the whole text, thereby calling the inserted commands
proc xml2whatever {text userCallback} {
    # protect literal brackets from the final subst
    set text [string map {[ &#91; ] &#93;} $text]
    # replace all tags with a call to userCallback
    # this has to be done multiple times, because of nested tags
    # match each tag (everything not space after <)
    # and all the attributes (everything behind the tag until >)
    # then match body and the end-tag (which should be the same as the
    # first matched one (\1))
    while {[regsub -all {<(\S+?)(.*?)>(.*?)</\1>} $text "\[[list $userCallback \\1 \\2 \\3]\]" text]} {
        # do nothing
    }
    return [subst -novariables -nobackslashes $text]
}

# is called from xml2whatever with
# element: the xml-element
# attributes: the attributes of xml-element
# body: body of xml-element
proc myTranslate {element attributes body} {
    # map bo - b; it - i (leave rest alone)
    # do a subst for the body, because of possible nested tags
    switch -- $element {
        bo { return "<b>[subst -novariables -nobackslashes $body]</b>" }
        it { return "<i>[subst -novariables -nobackslashes $body]</i>" }
        default { return "<$element$attributes>[subst -novariables -nobackslashes $body]</$element>" }
    }
}
Call the parser with:
% xml2whatever $text2 myTranslate
You have to be careful, though. Don't do this for large texts or texts with many
nested XML tags, because a regular expression engine is not the right tool to
parse large, nested files efficiently. (Stefan Vogel)
DKF - I agree with that last point. If you are really dealing with XML, it is
better to use a proper tool like TclDOM or tDOM.

You can't negate a regexp, but you CAN negate a regexp that is only a simple
string. Logically, it's the following:
match any single char except first letter in the string.
match the first char in string if followed by any letter except the 2nd
match the first two if followed by any but the third, et cetera
Then the only thing left is to allow a partial match of the string at end of line.
So for a regexp that matches any line that DOES NOT have the word "foo":

set exp {^([^f]|f[^o]|fo[^o])*.{0,2}$}


The following proc will build the expression for any given string
proc reg_negate {str} {
    set partial ""
    set branches [list]
    foreach c [split $str ""] {
        lappend branches [format {%s[^%s]} $partial $c]
        append partial $c
    }
    # braced expr; the result of set is the proc's return value
    set exp [format {^(%s)*.{0,%d}$} [join $branches "|"] \
        [expr {[string length $str] - 1}]]
}
Donal Fellows followed up with:
That's just set me thinking; you can do this by specifying that the whole string
must be either not the character of the antimatch*, or the first character of the
antimatch so long as it is not followed by the rest of the antimatch. This leads to
a fairly simply expressed pattern.
set exp {^(?:[^f]|f(?!oo))*$}
In fact, this allows us to strengthen what you say above to allow the matching of
any negated regexp directly so long as the first component of the antimatch is a
literal, and the rest of the antimatch is expressible in an ERE lookahead
constraint (which imposes a number of restrictions, but still allows for some
fairly sophisticated patterns.)
* Anything's better than overloading 'string' here!
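
The two approaches can be compared directly. The following is a sketch in Python rather than Tcl; negate_by_branches is a port of the reg_negate idea, negate_by_lookahead builds the lookahead form, and both names are invented here. One thing the comparison brings out: the branch form accepts a line such as "fofoo", because the branch fo[^o] consumes the f that begins the forbidden word, while the lookahead form rejects it.

```python
import re

def negate_by_branches(word):
    """One branch per prefix of word, plus a partial tail at end of line."""
    branches = []
    partial = ""
    for c in word:
        branches.append(re.escape(partial) + "[^" + re.escape(c) + "]")
        partial += c
    tail = len(word) - 1
    return "^(?:" + "|".join(branches) + ")*.{0," + str(tail) + "}$"

def negate_by_lookahead(word):
    """Any char but the first, or the first when not followed by the rest."""
    first = re.escape(word[0])
    rest = re.escape(word[1:])
    return "^(?:[^" + first + "]|" + first + "(?!" + rest + "))*$"

branch_re = re.compile(negate_by_branches("foo"))
lookahead_re = re.compile(negate_by_lookahead("foo"))
```

Both patterns accept "bar" and reject "a foo b"; they differ only on inputs where a partial prefix of the word overlaps the start of a real occurrence.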

Matching Optional Parts

elfring 2004-07-05: Does anybody know problems and solutions to match optional
parts with regular expressions?
See Matching of optional parts in regular expressions, comp.lang.tcl, 2004-07-01.
MG 2004-07-17: The problem with the regexp there seems to be that one of the parts
to match optional white space is in the wrong place, and is matching too much. If
you use this regexp instead, it works for me, on Win XP with Tcl 8.4.6. (The change
is that, after </S_URI> and before <P_URI>, .*? has been moved inside (?: ... ).)
set pattern {<name>(.+)</name>(?:.*?<scope>(SYSTEM|PUBLIC)</scope>.*?<S_URI>(.+)</S_URI>(?:.*?<P_URI>(.+)</P_URI>)?)?(?:.*?<definition>(.*?)</definition>)?(?:.*?<attributes>(.*?)</attributes>)?.*?<content>(.*)</content>\s*$}

set string {<name>gruss</name>
<scope>SYSTEM</scope>
<S_URI>http://XXX/Hallo.dtd</S_URI>
<P_URI>http://YYY/Leute.dtd</P_URI>
<definition><!ELEMENT gruss (#PCDATA)></definition>
<attributes>Versuch="1"</attributes>
<content><h1>Guten Tag!</h1></content>}

regexp $pattern $string z name scope system public definition attributes content
