In this blog post, we will learn about regular expression operations in Python.
Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this language, we can specify the rules for the set of possible strings that we want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like.
A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
.
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
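As a minimal sketch of the difference, the snippet below runs the same pattern with and without re.DOTALL. The sample string and variable names are made up for illustration.

```python
import re

text = "ab\ncd"

# By default '.' stops at a newline, so '.*' matches nothing after 'b'.
default_match = re.search(r"b.*", text)
print(repr(default_match.group(0)))  # 'b'

# With re.DOTALL, '.' also matches the newline, so the match runs to the end.
dotall_match = re.search(r"b.*", text, flags=re.DOTALL)
print(repr(dotall_match.group(0)))  # 'b\ncd'
```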
^
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
$
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
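The effect of MULTILINE on ^ and $ can be sketched with a two-line string. The text and variable names here are illustrative only.

```python
import re

text = "first line\nsecond row\n"

# Without MULTILINE, ^ matches only at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# With MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Without MULTILINE, $ matches only at the end (or just before a final newline).
ends_default = re.findall(r"\w+$", text)
# With MULTILINE, $ also matches before every newline.
ends_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(ends_default)      # ['row']
print(ends_multiline)    # ['line', 'row']
```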
*
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.
+
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.
?
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.
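To compare *, + and ? side by side, here is a small sketch that checks which sample strings each pattern matches in full. It uses re.fullmatch(), which is covered later in this post; the sample strings are assumptions chosen for illustration.

```python
import re

samples = ["a", "ab", "abbb"]

# For each pattern, collect the sample strings it matches completely.
results = {
    pattern: [s for s in samples if re.fullmatch(pattern, s)]
    for pattern in (r"ab*", r"ab+", r"ab?")
}

for pattern, matched in results.items():
    print(pattern, "->", matched)
# ab* -> ['a', 'ab', 'abbb']
# ab+ -> ['ab', 'abbb']
# ab? -> ['a', 'ab']
```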
*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.
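The greedy and non-greedy behaviour described above can be checked directly with the same '<a> b <c>' string. The variable names are just for this sketch.

```python
import re

text = "<a> b <c>"

# Greedy: '.*' grabs as much as possible, up to the last '>'.
greedy = re.search(r"<.*>", text).group(0)
# Non-greedy: '.*?' stops at the first '>' it can.
lazy = re.search(r"<.*?>", text).group(0)
# With findall(), the non-greedy form picks up each tag separately.
all_tags = re.findall(r"<.*?>", text)

print(greedy)    # <a> b <c>
print(lazy)      # <a>
print(all_tags)  # ['<a>', '<c>']
```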
{m}
Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.
{m,n}
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match 'aaaab' or a thousand 'a' characters followed by a 'b', but not 'aaab'. The comma may not be omitted or the modifier would be confused with the previously described form.
{m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
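A short sketch of the counted forms {m}, {m,n} and {m,n}?, using the same examples as the descriptions above. The variable names are illustrative.

```python
import re

text = "aaaaaa"  # six 'a' characters

exact = re.match(r"a{6}", text).group(0)       # exactly six
too_few = re.fullmatch(r"a{6}", "aaaaa")       # five is not enough
greedy = re.match(r"a{3,5}", text).group(0)    # takes as many as allowed
lazy = re.match(r"a{3,5}?", text).group(0)     # takes as few as allowed
open_ended = re.fullmatch(r"a{4,}b", "aaaab")  # four or more 'a', then 'b'
not_enough = re.fullmatch(r"a{4,}b", "aaab")   # only three 'a'

print(exact)                   # aaaaaa
print(too_few)                 # None
print(greedy, lazy)            # aaaaa aaa
print(open_ended is not None)  # True
print(not_enough)              # None
```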
\
Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.
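Here is a small sketch of escaping in practice. Writing patterns as raw strings (r"...") keeps Python from interpreting the backslash before the re module sees it. The call to re.escape() is an addition not covered in the text above; it is a standard helper that escapes special characters in literal text.

```python
import re

# A backslash turns the special character '*' into a literal one.
star = re.search(r"\*", "2*3")
print(star.group(0))  # *

# Both literal question marks are found.
questions = re.findall(r"\?", "why? how?")
print(questions)  # ['?', '?']

# re.escape() builds a pattern that matches the given text literally.
literal = "1+1=2?"
escaped_match = re.fullmatch(re.escape(literal), literal)
print(escaped_match is not None)  # True
```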
[]
Used to indicate a set of characters. In a set:
1. Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
2. Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [-a] or [a-]), it will match a literal '-'.
3. Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
4. Character classes such as \w or \S are also accepted inside a set, although the characters they match depends on whether ASCII or LOCALE mode is in force.
5. Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
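The set rules listed above can be tried out one by one. This is only a sketch; the input strings are made up for illustration.

```python
import re

# 1. Individual characters.
listed = re.findall(r"[amk]", "make")
# 2. Ranges: two digits from 00 to 59, and runs of hexadecimal digits.
minute_ok = re.fullmatch(r"[0-5][0-9]", "59")
minute_bad = re.fullmatch(r"[0-5][0-9]", "60")
hex_runs = re.findall(r"[0-9A-Fa-f]+", "ff-00-7G")
# An escaped '-' is matched literally.
with_dash = re.findall(r"[a\-z]", "a-z b")
# 3. Special characters are literal inside a set.
specials = re.findall(r"[(+*)]", "(1+2)*3")
# 5. A leading '^' complements the set.
not_five = re.findall(r"[^5]", "1525")

print(listed)                 # ['m', 'a', 'k']
print(minute_ok is not None)  # True
print(minute_bad)             # None
print(hex_runs)               # ['ff', '00', '7']
print(with_dash)              # ['a', '-', 'z']
print(specials)               # ['(', '+', ')', '*']
print(not_five)               # ['1', '2']
```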
|
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way.
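As a quick sketch of alternation, the first pattern below chooses between whole words, and the second limits the choice to a single letter by wrapping it in a non-capturing group (?:...). The sample strings are assumptions for illustration.

```python
import re

# Any of the three alternatives may match.
animals = re.findall(r"cat|dog|bird", "a cat, a dog and a fish")
# The group restricts '|' to the middle letter only.
greys = re.findall(r"gr(?:a|e)y", "gray grey groy")

print(animals)  # ['cat', 'dog']
print(greys)    # ['gray', 'grey']
```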
\d
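Matches any Unicode decimal digit. This includes [0-9] and also many other digit characters. If the ASCII flag is used, only [0-9] is matched.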
\D
Matches any character which is not a decimal digit. This is the opposite of \d. If the ASCII flag is used this becomes the equivalent of [^0-9].
\s
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], that is, space, tab, newline and so on, and also many other characters such as non-breaking spaces). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.
\S
Matches any character which is not a whitespace character. This is the opposite of \s.
If the ASCII flag is used this becomes the equivalent of [^ \t\n\r\f\v].
\w
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
\W
Matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_].
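The character classes \d, \D, \s, \S, \w and \W can be compared on a few short strings. This is a sketch with made-up inputs; "\u0665" is the Arabic-Indic digit five, used here only to show the effect of the ASCII flag.

```python
import re

text = "Order 66: 3 items_total\t$9"

digits = re.findall(r"\d+", text)
non_digits = re.findall(r"\D+", "a1b22c")
words = re.findall(r"\w+", text)
chunks = re.findall(r"\S+", "a b\tc\nd")
non_word = re.findall(r"\W", "a-b c!")

# In the default (Unicode) mode \d also matches non-ASCII digits.
unicode_digits = re.findall(r"\d", "5\u0665")
ascii_digits = re.findall(r"\d", "5\u0665", flags=re.ASCII)

print(digits)               # ['66', '3', '9']
print(non_digits)           # ['a', 'b', 'c']
print(words)                # ['Order', '66', '3', 'items_total', '9']
print(chunks)               # ['a', 'b', 'c', 'd']
print(non_word)             # ['-', ' ', '!']
print(len(unicode_digits))  # 2
print(ascii_digits)         # ['5']
```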
^
Matches the start of the string, or the start of each line in MULTILINE mode.
\A
Matches only at the start of the string.
\Z
Matches only at the end of the string.
\b
Matches the empty string, but only at the beginning or end of a word.
\B
Matches the empty string, but only when it is not at the beginning or end of a word. \B is just the opposite of \b.
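To see how \b, \B, \A and \Z behave, the sketch below records where each pattern matches. The sample strings and variable names are assumptions for illustration.

```python
import re

text = "cat concat cats"

# 'cat' as a whole word: boundaries on both sides.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]
# 'cat' at the start of a word (also matches inside 'cats').
word_prefix = [m.start() for m in re.finditer(r"\bcat", text)]
# 'cat' that does NOT start a word (the one inside 'concat').
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

print(whole_word)   # [0]
print(word_prefix)  # [0, 11]
print(inside_word)  # [7]

lines = "one\ntwo"

# ^ and $ follow MULTILINE, but \A and \Z always mean the whole string.
line_starts = re.findall(r"^\w+", lines, flags=re.MULTILINE)
string_start = re.findall(r"\A\w+", lines, flags=re.MULTILINE)
string_end = re.findall(r"\w+\Z", lines, flags=re.MULTILINE)

print(line_starts)   # ['one', 'two']
print(string_start)  # ['one']
print(string_end)    # ['two']
```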
import re
re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
pattern = 'Machine Learning'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning.'
match = re.search(pattern, text)
if match:
    print('found a match!')
    # group(0) returns the full text matched by the pattern
    print(match.group(0))
else:
    print('no match')
The start() and end() methods give the indexes into the string showing where the text matched by the pattern occurs.
start = match.start()
end = match.end()
print('Found "{}" \n in "{}"\nfrom {} to {} ("{}")'.format(match.re.pattern, match.string, start, end, text[start:end]))
The re.match() function checks for a match only at the beginning of the string. If zero or more characters at the start of the string match the pattern, it returns a match object. Otherwise it returns None, even if the pattern occurs later in the string or, in MULTILINE mode, at the start of a later line.
m = re.match(pattern, text)
print(m)
Because the literal text "Machine Learning" does not appear at the start of the input text, match() does not find it and returns None.
pattern = 'Data Science'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning.'
m = re.match(pattern, text)
print(m)
This time the pattern matches, because "Data Science" is at the beginning of the string.
The fullmatch() method requires that the entire input string match the pattern.
pattern = 'Machine Learning'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning.'
s = re.fullmatch(pattern, text)
print('Full match :', s)
The pattern does not match the entire string, so fullmatch() returns None.
pattern = "This"
text = "This"
s = re.fullmatch(pattern, text)
print('Full match :', s)
pattern = "This"
text = "This is a flower."
s = re.fullmatch(pattern, text)
print('Full match :', s)
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.
The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
pattern = 'Machine Learning'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning.'
matches = re.findall(pattern, text)
for match in matches:
    print('Found {!r}'.format(match))
# \b    - matches the empty string, but only at the beginning or end of a word
# M     - matches the literal character 'M'
# [a-z] - matches any one lowercase letter from a to z
# *     - repeats the preceding set zero or more times
pattern = r'\bM[a-z]*'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. '
print(re.findall(pattern, text))
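The way capturing groups change the shape of the findall() result, as described earlier, can be sketched with a small key=value string. The text and variable names are made up for illustration.

```python
import re

text = "x=1, y=22, z=333"

# No groups: a list of whole matches.
no_groups = re.findall(r"\w=\d+", text)
# One group: a list of what that group captured.
one_group = re.findall(r"\w=(\d+)", text)
# Several groups: a list of tuples, one item per group.
two_groups = re.findall(r"(\w)=(\d+)", text)
# Non-capturing groups do not change the result shape.
non_capturing = re.findall(r"(?:\w)=\d+", text)

print(no_groups)      # ['x=1', 'y=22', 'z=333']
print(one_group)      # ['1', '22', '333']
print(two_groups)     # [('x', '1'), ('y', '22'), ('z', '333')]
print(non_capturing)  # ['x=1', 'y=22', 'z=333']
```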
The finditer() function returns an iterator that produces Match instances instead of the strings returned by findall().
pattern = 'Machine Learning'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. '
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "{}" \n from {} to {} ("{}")'.format(pattern, s, e, text[s:e]))
regexes = [
re.compile(p)
for p in ['Data', 'Machine', 'test']
]
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. '
print('Text: {}\n'.format(text))
for regex in regexes:
    print('Searching "{}" ->'.format(regex.pattern),
          end=' ')
    if regex.search(text):
        print('match!')
    else:
        print('no match')
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
pattern = 'Machine Learning'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. '
# replacing Machine Learning with ML
result = re.sub(pattern, 'ML', text)
print(result)
text = "john xxx john yyy"
#check either xxx or yyy
pattern = "xxx|yyy"
replacing_string = "john"
result = re.sub(pattern, replacing_string, text)
print(result)
pattern = r'\bM[a-z]*\sL[a-z]*'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. '
replacing_string = "ML"
result = re.sub(pattern, replacing_string, text)
print(result)
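The description of re.sub() also mentions backreferences, function replacements and the count argument. Here is a minimal sketch of all three; the sample strings and the helper function double() are hypothetical and only for illustration.

```python
import re

# Backreferences: \1, \2, \3 refer to the three captured groups.
dates = "2024-01-31 and 2023-12-25"
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", dates)
print(swapped)  # 31/01/2024 and 25/12/2023


def double(match):
    """Hypothetical helper: return the matched number multiplied by two."""
    return str(int(match.group(0)) * 2)


# repl can be a function that receives each match object.
doubled = re.sub(r"\d+", double, "3 apples and 10 pears")
print(doubled)  # 6 apples and 20 pears

# count limits how many occurrences are replaced.
first_only = re.sub(r"a", "A", "banana", count=1)
print(first_only)  # bAnana
```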
pattern = r'\bM[a-z]*\sL[a-z]*'
text = 'Data Science is forcing every business to act differently. The decision making today is far more \
complex and driven by AI and Machine Learning Models. The Business Intelligence tools of yesterday are being rewritten to incorporate Data Science and Machine Learning. '
replacing_string = "ML"
result = re.subn(pattern, replacing_string, text)
print(result)
We can see that subn() returns the result as a tuple. The first item in the tuple is the new text with the replacements applied, and the second item is the number of replacements made.
print(type(result))
print(len(result))
print(result[0])
print(result[1])
That's it for this blog post. Thanks for reading.