Regular expressions in Python

Coding with Python

🕑 This lesson will take about 40 minutes

A regular expression (also called an RE, regex, or regex pattern) is a special sequence of characters, written in a specialised syntax, that describes a set of strings to match or find. Regular expressions are often used to check whether input such as an email address or a phone number is in a valid format. They can also be used to check that URLs or web requests are formatted correctly in web applications.

To use regular expressions, import the re module, which is part of Python's standard library. The re module raises the exception re.error if an error occurs while compiling or using a regular expression, for example when a pattern is not valid.
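As a minimal sketch of how re.error can be handled, the helper below (is_valid_pattern is a hypothetical name, not part of the re module) tries to compile a pattern and reports whether the module accepted it:

```python
import re


def is_valid_pattern(pattern):
    """Return True if `pattern` compiles, False if the re module rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised when the pattern itself is malformed.
        return False


print(is_valid_pattern(r"\d+"))   # a well-formed pattern
print(is_valid_pattern(r"(abc"))  # unbalanced parenthesis
```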

Expressions

The following are different types of expressions that can be used to check for a match. These expressions can be combined to check if a string meets certain conditions (for example, if a string contains numbers only, or if a password contains a mix of letters, numbers, and symbols):

  • a, X, 9, < - ordinary characters will match themselves exactly.

  • . (period) - will match any single character except the newline '\n' character

  • \w - will match a "word" character: a letter, digit, or underscore [a-zA-Z0-9_].

  • \W - will match any non-word character.

  • \b - will match the boundary between a word character and a non-word character

  • \s - will match a single whitespace character (space, newline, return, tab)

  • \S - will match any non-whitespace character.

  • \t, \n, \r - tab, newline, return

  • \d - matches any decimal digit [0-9]

  • \D - matches any non-digit character
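The short sketch below tries a few of these expressions on a made-up sample string. It uses re.search() and re.sub(), which are described later in this lesson. The variable names are chosen for illustration only.

```python
import re

text = "Room 42, floor 3"

# \d+ finds the first run of digits; \w+ finds the first run of word characters.
first_digits = re.search(r"\d+", text).group()
first_word = re.search(r"\w+", text).group()

# Replacing every \d or every \s with an empty string removes those characters.
digits_removed = re.sub(r"\d", "", text)
whitespace_removed = re.sub(r"\s", "", text)

# \b only matches at the edge of a word, so "loor" inside "floor" is not found.
whole_word = re.search(r"\bfloor\b", text) is not None
part_of_word = re.search(r"\bloor\b", text) is not None

print(first_digits, first_word)
print(repr(digits_removed), repr(whitespace_removed))
print(whole_word, part_of_word)
```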

Some basic modifiers

  • + - match 1 or more

  • ? - match 0 or 1 repetitions

  • * - match 0 or more repetitions

  • $ - matches at the end of string

  • ^ - matches start of a string

  • | - matches either/or, e.g. x|y will match either x or y

  • [] - matches any one character from a set or range of characters (see Brackets below)
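To illustrate how these modifiers behave, here is a small sketch using made-up test words. The ^ and $ anchors force the whole string to fit the pattern. The variable names are illustrative only.

```python
import re

# ? makes the preceding character optional, so both spellings match,
# but a doubled "u" does not.
colour_results = [
    bool(re.search(r"^colou?r$", word)) for word in ["color", "colour", "colouur"]
]

# + needs at least one "b"; * is satisfied with none.
plus_result = bool(re.search(r"^ab+$", "a"))
star_result = bool(re.search(r"^ab*$", "a"))

# | chooses between alternatives; the parentheses group them.
either_results = [
    bool(re.search(r"^(cat|dog)$", word)) for word in ["cat", "dog", "cow"]
]

print(colour_results, plus_result, star_result, either_results)
```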

White space

  • \n - new line

  • \s - any whitespace character (including a space)

  • \t - tab

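  • \e - escape (supported by some regex engines, but not by Python's re module, which rejects it as a bad escape; use \x1b instead)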

  • \f - form feed

  • \r - carriage return

Remember to escape these characters with a backslash (for example, \. or \$) if you want to match them literally:

. + * ? [ ] $ ^ ( ) { } | \
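The sketch below shows why escaping matters, using a made-up price string. It also uses re.escape(), a standard re function that adds the backslashes for you. It is not otherwise covered in this lesson.

```python
import re

price = "Total: $5.00 (approx.)"

# Unescaped, "." matches any character, so "5.00" also matches "5x00".
loose = re.search(r"5.00", "5x00") is not None
strict = re.search(r"5\.00", "5x00") is not None

# Escaping "$" and "." by hand matches the literal text "$5.00".
by_hand = re.search(r"\$5\.00", price).group()

# re.escape() builds an escaped pattern from a plain string.
automatic = re.search(re.escape("$5.00"), price).group()

print(loose, strict, by_hand, automatic)
```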

Brackets

  • [] = matches any one of the characters listed inside the brackets. For example, co[rl]d will match either cord or cold.

  • [a-z] = will match any single lowercase letter a-z

  • [1-5a-qA-Z] = will match any single character that is a digit 1-5, a lowercase letter a-q, or an uppercase letter A-Z
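A brief sketch of these bracket expressions, tried on a few made-up strings (the variable names are illustrative):

```python
import re

# co[rl]d accepts "r" or "l" in the third position, and nothing else.
words = ["cord", "cold", "cod", "coad"]
bracket_matches = [word for word in words if re.search(r"^co[rl]d$", word)]

# [a-z]+ between ^ and $ requires every character to be a lowercase letter.
all_lowercase = bool(re.search(r"^[a-z]+$", "hello"))
has_uppercase = bool(re.search(r"^[a-z]+$", "Hello"))

# Several ranges can share one pair of brackets.
inside_ranges = bool(re.search(r"^[1-5a-qA-Z]+$", "Ab3q"))
outside_ranges = bool(re.search(r"^[1-5a-qA-Z]+$", "z9"))

print(bracket_matches, all_lowercase, has_uppercase, inside_ranges, outside_ranges)
```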

The match() function

The match() function will attempt to match regular expression patterns to a string, with an optional flag (that allows you to modify some aspects of how regular expressions function). The syntax is as follows:

re.match(pattern, string, flags = 0)

The pattern is the regular expression that is to be matched. The string is checked for a match only at its beginning. Flags are optional; you can combine several flags using the | operator (bitwise OR, a vertical bar character), for example re.IGNORECASE | re.MULTILINE.
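As a rough illustration of flags, the sketch below uses two standard flags, re.IGNORECASE and re.MULTILINE, on a made-up two-line string. The combined example uses re.search() (covered later in this lesson) because match() only looks at the very start of the string.

```python
import re

text = "Peter Piper\npicked a peck"

# Without flags the match is case-sensitive, so "peter" does not match "Peter".
no_flags = re.match(r"peter", text)

# re.IGNORECASE makes the comparison case-insensitive.
ignore_case = re.match(r"peter", text, flags=re.IGNORECASE)

# Flags are combined with |. re.MULTILINE lets ^ match at the start of each line.
combined = re.search(r"^PICKED", text, flags=re.IGNORECASE | re.MULTILINE)

print(no_flags)
print(ignore_case.group())
print(combined.group())
```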

Example - matching a specific word at the start of a string

In this example, the re module is imported and a variable called sentence is initialised with a string value. re.match() checks for a match using the specified regular expression pattern (here, the word peter at the beginning of the string) and stores the result in matchPattern: a match object if the pattern matches, or None if it does not. Because a match object is truthy and None is falsy, the result can be used directly in an if statement. match() only checks whether the string starts with the pattern.

If we change the pattern to pattern = r'peck' then there will be no match (because ‘peck’ does exist in the sentence, but it is not at the start of the sentence).
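The original listing is not reproduced here, so the following is a minimal sketch of what such code might look like. The names sentence, pattern, and matchPattern follow the description above, the sentence is the one used later in this lesson, and noMatch is an illustrative addition.

```python
import re

sentence = "peter piper picked a peck of pickled peppers"

# "peter" is at the start of the sentence, so match() returns a match object.
pattern = r"peter"
matchPattern = re.match(pattern, sentence)

if matchPattern:
    print("Match found:", matchPattern.group())
else:
    print("No match")

# "peck" appears later in the sentence, so match() returns None.
pattern = r"peck"
noMatch = re.match(pattern, sentence)
print(noMatch)
```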

The search() function

The search() function is similar to the match() function, except that search() doesn't restrict us to only finding matches at the beginning of the string. The search() function will check for a match anywhere in the given string and return the first one it finds.

Example - searching for a specific word

Searching for the word "peck" in the sentence "peter piper picked a peck of pickled peppers" using the search() function will give a match.
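A minimal sketch of this search, assuming the same sentence as above (the variable name searchPattern is illustrative):

```python
import re

sentence = "peter piper picked a peck of pickled peppers"

# search() scans the whole string and returns the first match it finds.
searchPattern = re.search(r"peck", sentence)

if searchPattern:
    # start() gives the index where the match begins.
    print("Match found:", searchPattern.group(), "at index", searchPattern.start())
else:
    print("No match")
```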

The sub() function (search and replace)

You can also use the sub() function to search for a matching pattern and replace the match with another string.

Example - find and replace

In this example, we will remove everything that is not a digit from a phone number (by replacing every non-digit character with an empty string).
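A minimal sketch of this find and replace, using a made-up phone number. The call follows the form re.sub(pattern, replacement, string).

```python
import re

phone = "(02) 9876-5432"  # hypothetical phone number for illustration

# \D matches any non-digit character; each one is replaced with an empty string.
digits_only = re.sub(r"\D", "", phone)

print(digits_only)
```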

Regular expression examples

Here are some more examples of regular expressions for different scenarios in a program.

Example 1 - check if a string contains numbers only

In this example, a regular expression is used to check if a string contains only digits, that is, whether it is written as a non-negative whole number. The regular expression ("^\d+$") matches strings that contain one or more digits (\d+) from the beginning (^) to the end ($) of the string. Such a check returns True if the string consists only of digits and False if it is empty or contains any other character (including a minus sign or a decimal point).
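One possible way to wrap this pattern in a function is sketched below. The function name is_digits_only is an assumption made for illustration.

```python
import re


def is_digits_only(text):
    """Return True if `text` consists of one or more digits and nothing else."""
    return re.match(r"^\d+$", text) is not None


for value in ["12345", "12a45", "-5", ""]:
    print(repr(value), is_digits_only(value))
```

Note that $ also matches just before a single trailing newline, so "12345\n" would pass this check. If that matters, re.fullmatch(r"\d+", text) is a stricter alternative.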

Example 2 - check if an email address is valid

In this example, a regular expression is used to check if a provided email address is valid. For the purposes of this example, an email address is treated as valid if it contains a username, followed by the @ symbol, followed by a domain, e.g. gmail.com (or a domain with a sub-domain, e.g. vip.gmail.com).

The regular expression ("^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$") checks if a string matches the following pattern:

  • ^[a-zA-Z0-9_.+-]+: checks if the beginning of the string has one or more characters that can be letters (both uppercase and lowercase), digits, underscores, dots, plus signs, or hyphens before the @ symbol.

  • @: Checks for the "@" symbol.

  • [a-zA-Z0-9-]+: checks for one or more characters (that can be uppercase or lowercase letters, digits, or hyphens) for the domain name after the @ symbol.

  • \.: The "." character (escaped with a backslash) is used to check for a . after the domain name (eg. gmail) and before the domain name suffix (eg. .com)

  • [a-zA-Z0-9-.]+$: checks for one or more characters that can be uppercase or lowercase letters, digits, dots, or hyphens for the domain name suffix (eg. .com) at the end of the string.

This regular expression can be used to cover most standard email address formats.
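A minimal sketch of how the pattern above might be used. The names EMAIL_PATTERN and is_valid_email, and the sample addresses, are assumptions made for illustration.

```python
import re

# The pattern described above, stored as a raw string.
EMAIL_PATTERN = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"


def is_valid_email(address):
    """Return True if `address` fits the simple email pattern above."""
    return re.match(EMAIL_PATTERN, address) is not None


for address in ["user@gmail.com", "user@vip.gmail.com", "user.gmail.com", "user@gmail"]:
    print(address, is_valid_email(address))
```

This only checks the shape of the address; it does not confirm that the domain or mailbox actually exists.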

Example 3 - check if a password is valid

In this example, a regular expression is used to check if a provided password meets the following conditions:

  • must contain at least one uppercase letter

  • must contain at least one lowercase letter

  • must contain at least one number

  • must not contain any other characters

Note: Passwords are much stronger when they contain other characters too (such as !, @, #, $, %, etc.). However, for the sake of this example, we will keep things simple and only allow passwords that have uppercase letters, lowercase letters, and numbers.

The regular expression used in this example relies on lookahead assertions, which check a condition without consuming any characters, to confirm that the password contains at least one lowercase letter ((?=.*[a-z])), one uppercase letter ((?=.*[A-Z])), and one digit ((?=.*\d)). The rest of the pattern, [A-Za-z\d]+, matches the actual allowable characters for the password.
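The full pattern is not shown above, so the sketch below assembles one from the pieces described, adding ^ and $ anchors so that no other characters are allowed. The exact pattern and the names PASSWORD_PATTERN and is_valid_password are assumptions.

```python
import re

# Three lookaheads (lowercase, uppercase, digit) followed by the allowed characters.
# Assembled from the pieces described above; the anchors are an assumption.
PASSWORD_PATTERN = r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[A-Za-z\d]+$"


def is_valid_password(password):
    """Return True if `password` meets all four conditions listed above."""
    return re.match(PASSWORD_PATTERN, password) is not None


for candidate in ["Passw0rd", "password1", "PASSWORD1", "Password", "Passw0rd!"]:
    print(candidate, is_valid_password(candidate))
```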

These are just some examples of how you can use regular expressions in Python. If you’d like to know more about Python regular expressions, you can check out the official documentation on the Python website or check out examples on the W3Schools website.

