If you’re new to regular expressions there’s some things to keep in mind when authoring a new regex.
zipcode_re = re.compile(r"\d{5}") # e.g. "02108" in Boston
The comment will prove invaluable to maintainers.
Also, naming a compiled regex gives a test suite an opportunity to try a few matches. That’s a great place to offer additional instructive examples.
Compile the regex just once, outside of a loop that might scan each line in a text file.
Always use raw strings when writing a regex, even if it’s not strictly needed. Just do it habitually. Then later maintenance will be easier.
Part of “being conservative”, so we don’t get spurious matches, is to explicitly anchor your regex.
re.compile(r"\d{5}")
re.compile(r"^\d{5}$")
Now a call to
zipcode_re.search()
will only match a
string if it is exactly five characters long and is entirely
numeric.
The
^
caret will only match start-of-string, and
$
dollar will only match end-of-string (assuming you don’t
wander off into
re.MULTILINE
territory.)
Suppose each
address
line starts with city + state that
we should skip over. We
could
write
re.compile(r"(\d{5})$")
but a more conservative approach would describe both the part we skip and the part we capture:
re.compile(r"^[\w -]+, +[A-Z]{2} +(\d{5})$") # "New York, NY 10101"
That way if unexpected data shows up it will error and we will notice so we fix the data stream.
Once you have four or more groups it’s really time to start naming them.
re.compile(r"^(?P<city>[\w -]+), +(?P<state>[A-Z]{2}) +(?P<zip>\d{5})$")
prev
Copyright 2023 John Hanley. MIT licensed.