regex hygiene

If you’re new to regular expressions there’s some things to keep in mind when authoring a new regex.


give an example

    zipcode_re = re.compile(r"\d{5}")  # e.g. "02108" in Boston

The comment will prove invaluable to maintainers.

Also, naming a compiled regex gives a test suite an opportunity to try a few matches. That’s a great place to offer additional instructive examples.

Compile the regex just once, outside of a loop that might scan each line in a text file.

Always use raw strings when writing a regex, even if it’s not strictly needed. Just do it habitually. Then later maintenance will be easier.


anchor

Part of “being conservative”, so we don’t get spurious matches, is to explicitly anchor your regex.

    re.compile(r"\d{5}") 

    re.compile(r"^\d{5}$") 

Now a call to zipcode_re.search() will only match a string if it is exactly five characters long and is entirely numeric.

The ^ caret will only match start-of-string, and $ dollar will only match end-of-string (assuming you don’t wander off into re.MULTILINE territory.)


be specific

Suppose each address line starts with city + state that we should skip over. We could write

    re.compile(r"(\d{5})$")

but a more conservative approach would describe both the part we skip and the part we capture:

    re.compile(r"^[\w -]+, +[A-Z]{2} +(\d{5})$")  # "New York, NY 10101"

That way if unexpected data shows up it will error and we will notice so we fix the data stream.


name each capture group

Once you have four or more groups it’s really time to start naming them.

    re.compile(r"^(?P<city>[\w -]+), +(?P<state>[A-Z]{2}) +(?P<zip>\d{5})$")
prev

Copyright 2023 John Hanley. MIT licensed.