\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
If you don't need access to the individual numbers, you can shorten the regex with
a quantifier to:
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
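As a minimal sketch of how the shortened regex might be used, here it is in Python. The names OCTET, IP_RE and find_ips are made up for this illustration, and the sample text is invented.

```python
import re

# One octet, 0-255, as a non-capturing group (same alternation as above).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Three "octet." repetitions plus a final octet, delimited by word boundaries.
IP_RE = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")


def find_ips(text):
    """Return every dotted-quad IPv4 address found in text."""
    return IP_RE.findall(text)


print(find_ips("ping 192.168.0.1 then 10.0.0.255, not 256.1.1.1"))
# → ['192.168.0.1', '10.0.0.255']
```

Because no group captures, findall returns the whole match. With the four capturing groups of the longer regex it would return tuples of the individual numbers instead.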
Trimming Whitespace
You can easily trim unnecessary whitespace from the start and the end of a string
or the lines in a text file by doing a regex search-and-replace. Search for ^[ \t]+
and replace with nothing to delete leading whitespace (spaces and tabs). Search for
[ \t]+$ to trim trailing whitespace. Do both by combining the regular expressions
into ^[ \t]+|[ \t]+$. Instead of [ \t] which matches a space or a tab, you can
expand the character class into [ \t\r\n] if you also want to strip line breaks. Or
you can use the shorthand \s instead.
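The search-and-replace described above could look like this in Python. This is only a sketch: trim_lines and trim_string are hypothetical helper names, and re.MULTILINE is what makes the anchors match at every line break rather than only at the ends of the string.

```python
import re


def trim_lines(text):
    """Strip leading and trailing spaces and tabs from every line."""
    # re.MULTILINE lets ^ and $ match at each line break.
    return re.sub(r"^[ \t]+|[ \t]+$", "", text, flags=re.MULTILINE)


def trim_string(text):
    """Strip all leading and trailing whitespace, line breaks included."""
    return re.sub(r"^\s+|\s+$", "", text)


print(repr(trim_lines("  a \n\tb\t\n")))         # → 'a\nb\n'
print(repr(trim_string("\n  hi there \r\n")))    # → 'hi there'
```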
If you have a file in which all lines are sorted (alphabetically or otherwise), you
can easily delete (consecutive) duplicate lines. Simply open the file in your
favorite text editor, and do a search-and-replace searching for ^(.*)(\r?\n\1)+$
and replacing with \1. For this to work, the anchors need to match before and after
line breaks (and not just at the start and the end of the file or string), and the
dot must not match newlines.
Here is how this works. The caret will match only at the start of a line. So the
regex engine will only attempt to match the remainder of the regex there. The dot
and star combination simply matches an entire line, whatever its contents, if any.
The parentheses store the matched line into the first backreference.
Next we will match the line separator. I put the question mark into \r?\n to make
this regex work with both Windows (\r\n) and UNIX (\n) text files. So up to this
point we matched a line and the following line break.
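Now we check for a duplicate with \1, the first backreference, which must match
exactly the same text as the line we just stored.
If the backreference fails to match, the regex match and the backreference are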
discarded, and the regex engine tries again at the start of the next line. If the
backreference succeeds, the plus symbol in the regular expression will try to match
additional copies of the line. Finally, the dollar symbol forces the regex engine
to check if the text matched by the backreference is a complete line. We already
know the text matched by the backreference is preceded by a line break (matched
by \r?\n). Therefore, we now check if it is also followed by a line break or if it
is at the end of the file using the dollar sign.
The entire match becomes line\nline (or line\nline\nline etc.). Because we are
doing a search and replace, the line, its duplicates, and the line breaks in
between them, are all deleted from the file. Since we want to keep the original
line, but not the duplicates, we use \1 as the replacement text to put the original
line back in.
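The same search-and-replace can be tried outside a text editor. The following Python sketch assumes the input is already sorted; DUP_RE and remove_consecutive_duplicates are names invented for the example. In Python the dot does not match newlines by default, and re.MULTILINE makes the anchors match at line breaks, which are the two conditions stated above.

```python
import re

# ^(.*) stores a line, (\r?\n\1)+ matches one or more repeats of it,
# and $ makes sure the last repeat is a complete line.
DUP_RE = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)


def remove_consecutive_duplicates(text):
    """Collapse runs of identical consecutive lines into a single line."""
    return DUP_RE.sub(r"\1", text)


print(repr(remove_consecutive_duplicates("a\na\nb\nc\nc\nc\n")))
# → 'a\nb\nc\n'
```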
Often, you want to match complete lines in a text file rather than just the part of
the line that satisfies a certain requirement. This is useful if you want to delete
entire lines in a search-and-replace in a text editor, or collect entire lines in
an information retrieval tool.
To keep this example simple, let's say we want to match lines containing the word
"John". The regex John makes it easy enough to locate those lines. But the software
will only indicate John as the match, not the entire line containing the word.
The solution is fairly simple. To specify that we need an entire line, we will use
the caret and dollar sign and turn on the option to make them match at embedded
newlines. In software aimed at working with text files like EditPad Pro and
PowerGREP, the anchors always match at embedded newlines. To match the parts of the
line before and after the match of our original regular expression John, we simply
use the dot and the star. Be sure to turn off the option for the dot to match
newlines.
The resulting regex is: ^.*John.*$. You can use the same method to expand the match
of any regular expression to an entire line, or a block of complete lines. In some
cases, such as when using alternation, you will need to group the original regex
together using parentheses.
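A small Python sketch of expanding a match to the whole line. The helper name lines_containing and the sample text are assumptions for illustration; the inner pattern is wrapped in a non-capturing group so that alternation keeps working.

```python
import re


def lines_containing(pattern, text):
    """Return each complete line in which pattern matches somewhere."""
    # Group the original regex so that alternation inside it stays intact.
    line_re = re.compile(rf"^.*(?:{pattern}).*$", re.MULTILINE)
    return [m.group(0) for m in line_re.finditer(text)]


sample = "John went home\nMary stayed\nI saw John"
print(lines_containing("John", sample))
# → ['John went home', 'I saw John']
```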
If a line can meet any one of a series of requirements, simply use alternation in the
regular expression. ^.*\b(one|two|three)\b.*$ matches a complete line of text that
contains any of the words "one", "two" or "three". The first backreference will
contain the word the line actually contains. If it contains more than one of the
words, then the last (rightmost) word will be captured into the first
backreference. This is because the star is greedy. If we make the first star lazy,
like in ^.*?\b(one|two|three)\b.*$, then the backreference will contain the first
(leftmost) word.
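The difference between the greedy and the lazy star is easy to see in a quick experiment. This Python sketch uses an invented sample line; GREEDY and LAZY are just illustrative names.

```python
import re

GREEDY = re.compile(r"^.*\b(one|two|three)\b.*$", re.MULTILINE)
LAZY = re.compile(r"^.*?\b(one|two|three)\b.*$", re.MULTILINE)

line = "count one then two then three"

# The greedy star backtracks from the end of the line, so the group
# captures the rightmost word; the lazy star captures the leftmost.
print(GREEDY.search(line).group(1))  # → three
print(LAZY.search(line).group(1))    # → one
```

Both regexes match the same complete line; only the contents of the first backreference differ.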
If a line must satisfy all of multiple requirements, we need to use lookahead.
^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).*$ matches a complete line of text that
contains all of the words "one", "two" and "three". Again, the anchors must match
at the start and end of a line and the dot must not match line breaks. Because of
the caret, and the fact that lookahead is zero-length, all of the three lookaheads
are attempted at the start of each line. Each lookahead will match any piece of
text on a single line (.*?) followed by one of the words. All three must match
successfully for the entire regex to match. Note that instead of words like
\bword\b, you can put any regular expression, no matter how complex, inside the
lookahead. Finally, .*$ causes the regex to actually match the line, after the
lookaheads have determined it meets the requirements.
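To see the three lookaheads at work, here is a short Python sketch. The sample text is invented and ALL_THREE is a hypothetical name.

```python
import re

# Each lookahead scans the line independently from its start;
# the final .*$ is what actually consumes the line.
ALL_THREE = re.compile(
    r"^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).*$",
    re.MULTILINE,
)

text = "three two one\none and two only\nnone twos threes\none, two, three!"
print([m.group(0) for m in ALL_THREE.finditer(text)])
# → ['three two one', 'one, two, three!']
```

The order of the words on the line does not matter, and the third line is rejected because the word boundaries keep "none", "twos" and "threes" from counting.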
If your condition is that a line should not contain something, use negative
lookahead. ^((?!regexp).)*$ matches a complete line that does not match regexp.
Notice that unlike before, when using positive lookahead, I repeated both the
negative lookahead and the dot together. For the positive lookahead, we only need
to find one location where it can match. But the negative lookahead must be tested
at each and every character position in the line. We must test that regexp fails
everywhere, not just somewhere.
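A minimal Python sketch of the negative test follows. The function name lines_without is an assumption, and the group is written as non-capturing, (?:...), since the captured character is not needed here.

```python
import re


def lines_without(pattern, text):
    """Return each complete line in which pattern matches nowhere."""
    # The negative lookahead is re-tested before every single character.
    line_re = re.compile(rf"^(?:(?!{pattern}).)*$", re.MULTILINE)
    return [m.group(0) for m in line_re.finditer(text)]


sample = "foo bar\nbar baz\nsome food\nqux"
print(lines_without("foo", sample))
# → ['bar baz', 'qux']
```

Note that "some food" is rejected too, because foo occurs inside "food". Add word boundaries to the pattern if only the whole word should disqualify a line.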
Finally, you can combine multiple positive and negative requirements as follows:
^(?=.*?\bmust-have\b)(?=.*?\bmandatory\b)((?!avoid|illegal).)*$. When checking
multiple positive requirements, the .* at the end of the regular expression full of
zero-length assertions made sure that we actually matched something. Since the
negative requirement must match the entire line, it is easy to replace the .* with
the negative test.
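The combined regex can be checked the same way. This Python sketch uses three invented sample lines; COMBINED is a hypothetical name.

```python
import re

COMBINED = re.compile(
    r"^(?=.*?\bmust-have\b)(?=.*?\bmandatory\b)((?!avoid|illegal).)*$",
    re.MULTILINE,
)

text = (
    "a must-have and mandatory item\n"
    "must-have mandatory but illegal\n"
    "mandatory only"
)
print([m.group(0) for m in COMBINED.finditer(text)])
# → ['a must-have and mandatory item']
```

The second line passes both positive lookaheads but is stopped by the negative test, and the third line fails the first lookahead.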
The regular expression I receive the most feedback, not to mention "bug" reports
on, is the one you'll find right on this site's home page:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b. This regular expression, I claim, matches any email address.
Most of the feedback I get refutes that claim by showing one email address that
this regex doesn't match. Usually, the "bug" report also includes a suggestion to
make the regex "perfect".
As I explain below, my claim only holds true when one accepts my definition of what
a valid email address really is, and what it's not. If you want to use a different
definition, you'll have to adapt the regex. Matching a valid email address is a
perfect example showing that (1) before writing a regex, you have to know exactly
what you're trying to match, and what not; and (2) there's often a trade-off
between what's exact, and what's practical.
The virtue of my regular expression above is that it matches 99% of the email
addresses in use today. All the email addresses it matches can be handled by 99% of
all email software out there. If you're looking for a quick solution, you only need
to read the next paragraph. If you want to know all the trade-offs and get plenty
of alternatives to choose from, read on.
If you want to use the regular expression above, there are two things you need to
understand. First, long regexes make it difficult to nicely format paragraphs. So I
didn't include a-z in any of the three character classes. This regex is intended to
be used with your regex engine's "case insensitive" option turned on. (You'd be
surprised how many "bug" reports I get about that.) Second, the above regex is
delimited with word boundaries, which makes it suitable for extracting email
addresses from files or larger blocks of text. If you want to check whether the
user typed in a valid email address, replace the word boundaries with
start-of-string and end-of-string anchors, like this:
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$.
The previous paragraph also applies to all following examples. You may need to
change word boundaries into start/end-of-string anchors, or vice versa. And you
will need to turn on the case insensitive matching option.
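Both uses, extracting and validating, might look like this in Python. This is a sketch under the assumptions just described: re.IGNORECASE stands in for the "case insensitive" option, and the names EMAIL_IN_TEXT, EMAIL_WHOLE, extract_emails and looks_like_email are invented for the example.

```python
import re

# Word boundaries: suitable for pulling addresses out of larger text.
EMAIL_IN_TEXT = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE
)

# String anchors: suitable for checking one complete input value.
EMAIL_WHOLE = re.compile(
    r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE
)


def extract_emails(text):
    """Return all substrings of text that look like email addresses."""
    return EMAIL_IN_TEXT.findall(text)


def looks_like_email(value):
    """Return True if the whole value looks like an email address."""
    # fullmatch also rejects a trailing newline, which $ alone would allow.
    return EMAIL_WHOLE.fullmatch(value) is not None


print(extract_emails("Mail John.Doe@Example.com or info@example.org today"))
# → ['John.Doe@Example.com', 'info@example.org']
print(looks_like_email("john@example.com"))        # → True
print(looks_like_email("curator@gallery.museum"))  # → False
```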
Yes, there are a whole bunch of email addresses that my pet regex doesn't match.
The most frequently quoted examples are addresses on the .museum top level domain,
which is longer than the 4 letters my regex allows for the top level domain. I
accept this trade-off because the number of people using .museum email addresses is
extremely low. I've never had a complaint that the order forms or newsletter
subscription forms on the JGsoft websites refused a .museum address (which they
would, since they use the above regex to validate the email address).
This shows another trade-off: do you want the regex to check if the top level
domain exists? My regex doesn't. Any combination of two to four letters will do,
which covers all existing and planned top level domains except .museum. But it will
match addresses with invalid top-level domains like asdf@asdf.asdf. By not being
overly strict about the top-level domain, I don't have to update the regex each
time a new top-level domain is created, whether it's a country code or generic
domain.
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$
could be used to allow any two-letter
country code top level domain, and only specific generic top level domains. By the
time you read this, the list might already be out of date. If you use this regular
expression, I recommend you store it in a global constant in your application, so
you only have to update it in one place. You could list all country codes in the
same manner, even though there are almost 200 of them.
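One way to keep the list in a single place is sketched below in Python. GENERIC_TLDS, EMAIL_TLD_RE and has_known_tld are hypothetical names, and the tuple simply repeats the domains from the regex above, so it is just as likely to be out of date.

```python
import re

# Single point of maintenance for the generic top level domains.
GENERIC_TLDS = (
    "com", "org", "net", "edu", "gov", "mil", "biz", "info",
    "mobi", "name", "aero", "asia", "jobs", "museum",
)

# Any two-letter country code, or one of the listed generic domains.
EMAIL_TLD_RE = re.compile(
    r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|"
    + "|".join(GENERIC_TLDS)
    + r")$",
    re.IGNORECASE,
)


def has_known_tld(address):
    """Return True if address matches and ends in an accepted domain."""
    return EMAIL_TLD_RE.fullmatch(address) is not None


print(has_known_tld("curator@gallery.museum"))  # → True
print(has_known_tld("asdf@asdf.asdf"))          # → False
```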
Another trade-off is that my regex only allows English letters, digits and a few
special symbols. The main reason is that I don't trust all my email software to be
able to handle much else. Even though John.O'Hara@theoharas.com is a syntactically
valid email address, there's a risk that some software will misinterpret the
apostrophe as a delimiting quote. E.g. blindly inserting this email address into a
SQL statement will cause it to fail if strings are delimited with single quotes. And of
course, it's been many years already that domain names can include non-English
characters. Most software and even domain name registrars, however, still stick to
the 37 characters they're used to.
The conclusion is that to decide which regular expression to use, whether you're
trying to match an email address or something else that's vaguely defined, you need
to start with considering all the trade-offs. How bad is it to match something
that's not valid? How bad is it not to match something that is valid? How complex
can your regular expression be? How expensive would it be if you had to change the
regular expression later? Different answers to these questions will require a
different regular expression as the solution. My email regex does what I want, but
it may not do what you want.
Don't go overboard in trying to eliminate invalid email addresses with your regular
expression. If you have to accept .museum domains, allowing any 6-letter top level
domain is often better than spelling out a list of all current domains. The reason
is that you don't really know whether an address is valid until you try to send an
email to it. And even that might not be enough. Even if the email arrives in a
mailbox, that doesn't mean somebody still reads that mailbox.
The same principle applies in many situations. When trying to match a valid date,
it's often easier to use a bit of arithmetic to check for leap years, rather than
trying to do it in a regex. Use a regular expression to find potential matches or
check if the input uses the proper syntax, and do the actual validation on the
potential matches returned by the regular expression. Regular expressions are a
powerful tool, but they're far from a panacea.
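The date case can be sketched in Python as follows. The yyyy-mm-dd format, the regex DATE_SYNTAX and the function find_valid_dates are all assumptions made for the illustration. The regex only finds candidates with the right syntax, and the standard library's date type does the real validation, leap years included.

```python
import re
from datetime import date

# Syntax check only: four digits, two digits, two digits.
DATE_SYNTAX = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def find_valid_dates(text):
    """Return the candidate dates in text that exist on the calendar."""
    valid = []
    for match in DATE_SYNTAX.finditer(text):
        year, month, day = (int(part) for part in match.groups())
        try:
            valid.append(date(year, month, day))
        except ValueError:
            # Right syntax, impossible date (e.g. Feb 29 in a non-leap year).
            pass
    return valid


print(find_valid_dates("2024-02-29, 2023-02-29, 2023-13-01"))
# → [datetime.date(2024, 2, 29)]
```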
Maybe you're wondering why there's no "official" fool-proof regex to match email
addresses. Well, there is an official definition, but it's hardly fool-proof.
The official standard is known as RFC 5322. It describes the syntax that valid
email addresses must adhere to. You can (but you shouldn't--read on) implement it
with this regular expression:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
| "(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
| \\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
@ (?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
| \[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]
| \\[\x01-\x09\x0b\x0c\x0e-\x7f])+)
\])
This regex has two parts: the part before the @, and the part after the @. There
are two alternatives for the part before the @: it can either consist of a series
of letters, digits and certain symbols, including one or more dots. However, dots
may not appear consecutively or at the start or end of the email address. The other
alternative requires the part before the @ to be enclosed in double quotes,
allowing any string of ASCII characters between the quotes. Whitespace characters,
double quotes and backslashes must be escaped with backslashes.
The part after the @ also has two alternatives. It can either be a fully qualified
domain name (e.g. regular-expressions.info), or it can be a literal Internet
address between square brackets. The literal Internet address can either be an IP
address, or a domain-specific routing address.
The reason you shouldn't use this regex is that it only checks the basic syntax of
email addresses. john@aol.com.nospam would be considered a valid email address
according to RFC 5322. Obviously, this email address won't work, since there's no
"nospam" top-level domain. It also doesn't guarantee your email software will be
able to handle it. Not all applications support the syntax using double quotes or
square brackets. In fact, RFC 5322 itself marks the notation using square brackets
as obsolete.
We get a more practical implementation of RFC 5322 if we omit the syntax using
double quotes and square brackets. It will still match 99.99% of all email
addresses in actual use today.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
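Following the advice to test on your own data, here is a small Python sketch that tries this more practical regex on a few invented addresses. PRACTICAL_RFC5322 and is_plausible are hypothetical names, and case insensitive matching is assumed as before.

```python
import re

PRACTICAL_RFC5322 = re.compile(
    r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@"
    r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
    re.IGNORECASE,
)


def is_plausible(address):
    """Return True if the whole address fits the simplified syntax."""
    return PRACTICAL_RFC5322.fullmatch(address) is not None


print(is_plausible("John.O'Hara@theoharas.com"))  # → True
print(is_plausible("john..doe@example.com"))      # → False
print(is_plausible(".john@example.com"))          # → False
print(is_plausible("john@aol.com.nospam"))        # → True
```

The last result shows the limitation discussed above: the syntax is fine even though the top-level domain does not exist.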
A further change you could make is to allow any two-letter country code top level
domain, and only specific generic top level domains. This regex filters dummy email
addresses like asdf@adsf.adsf. You will need to update it as new top-level domains
are added.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b
So even when following official standards, there are still trade-offs to be made.
Don't blindly copy regular expressions from online libraries or discussion forums.
Always test them on your own data and with your own applications.
(First shot)
^[a-zA-Z]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.[a-zA-Z]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)?$
This code does NOT attempt, obviously, to ensure that the last level of the regular
expression matches a known domain...
To match something similar to XML tags, you can use regular expressions too. Let's
assume we have this text:
% set text {<bo>s</bo><it><bo>M</bo></it>}
We can match the body of bo with this regexp:
% regexp "<(bo)>(.*?)</bo>" $text dummy tag body
Now we extend our XML-text with some attributes for the tags, say:
% set text2 {<bo h="m">s</bo><it><bo>M</bo></it>}
If we try to match this with:
% regexp "<(bo)\\s+(.+?)>(.*?)</bo>" $text2 dummy tag attributes body
it won't work anymore. This is because \\s+ is greedy (in contrast to the
non-greedy (.+?) and (.*?)), and that one greedy operator makes the whole
expression greedy.
See Henry Spencer's reply in tcl 8.2 regexp not doing non-greedy matching
correctly, comp.lang.tcl, 1999-09-20.
The correct way is:
% regexp "<(bo)\\s+?(.+?)>(.*?)</bo>" $text2 dummy tag attributes body
Now we can write a more general XML-to-whatever translator like this:
1. Substitute [ and ] with their corresponding &#91; and &#93; to avoid confusion
   with "subst" in 3.
2. Substitute the tags and attributes with commands.
3. Do a "subst" on the whole text, thereby calling the inserted commands.
proc xml2whatever {text userCallback} {
    set text [string map {[ &#91; ] &#93;} $text]
    # replace all tags with a call to userCallback
    # this has to be done multiple times, because of nested tags
    # match each tag (everything not space after <)
    # and all the attributes (everything behind the tag until >)
    # then match body and the end-tag (which should be the same as the
    # first matched one (\1))
    while {[regsub -all {<(\S+?)(.*?)>(.*?)</\1>} $text "\[[list $userCallback \\1 \\2 \\3]\]" text]} {
        # do nothing
    }
    return [subst -novariables -nobackslashes $text]
}
You can't negate a regexp, but you CAN negate a regexp that is only a simple
string. Logically, it's the following:
match any single char except first letter in the string.
match the first char in string if followed by any letter except the 2nd
match the first two if followed by any but the third, et cetera
Then the only thing more is to allow a partial match of the string at end of line.
So for a regexp that matches any line that DOES NOT have the word "foo":
elfring 2004-07-05: Does anybody know problems and solutions to match optional
parts with regular expressions?
See Matching of optional parts in regular expressions, comp.lang.tcl, 2004-07-01.
MG 2004-07-17: The problem with the regexp there seems to be that one of the parts
to match optional white space is in the wrong place, and is matching too much. If
you use this regexp instead, it works for me, on Win XP with Tcl 8.4.6. (The change
is that, after </S_URI> and before <P_URI>, .*? has been moved inside (?: ... ).)
set pattern {<name>(.+)</name>(?:.*?<scope>(SYSTEM|PUBLIC)</scope>.*?<S_URI>(.+)</S_URI>(?:.*?<P_URI>(.+)</P_URI>)?)?(?:.*?<definition>(.*?)</definition>)?(?:.*?<attributes>(.*?)</attributes>)?.*?<content>(.*)</content>\s*$}
regexp $pattern $str z name scope system public definition attributes content