Regular expressions

Latest revision as of 02:13, 9 May 2022

Regular expressions, or "regex" for short, are strings of characters that describe a pattern to search for in text. Usually on a wiki, you use them to search wiki pages' source and make changes to a page. This is most frequently done using the tool AutoWikiBrowser, or AWB. This page will talk about regular expressions within the context of AWB specifically, but keep in mind that other programs, such as Notepad++ and Sublime Text, use different regex engines and might treat some things differently. Also, if you're writing modules, Lua's string patterns are significantly different from regex. This page won't go into tons of detail, so all of the main concepts should apply regardless of what program you're using, but some things might have a slightly different syntax elsewhere.

Resources for practicing or testing regex

Examples

Copy-pasting and formatting a list of files from Special:UncategorizedFiles into a page list that AutoWikiBrowser can use: https://regex101.com/r/BfRS2f/1

How to use this guide

This guide is written for people who feel very uncomfortable reading technical write-ups. If you don't mind reading technical docs, the site regular-expressions.info has a much more complete and concise explanation of everything you could possibly want to know about regex - however it comes at the cost of being technical from the beginning, and it may feel overwhelming until you have at least a bit of experience. You also may find the Regular expressions reference on the Wikipedia AWB guide useful as a "cheat sheet" reference.

The goal of this page is to prepare you for using regular-expressions.info as a resource, not to teach you everything you need to know about regex.

AWB-specific regular expressions

One quick note before we start....

What do "multiline" and "singleline" in AWB mean? You might not understand this yet but this is a FAQ about regex in AWB and so I wanted to include it near the start. Feel free to skip this section for now and come back later.

If you are using AutoWikiBrowser, there are some options associated with regular expressions, "multiline" and "singleline." At first it might seem like "multiline" means that you're inputting multiple lines, and "singleline" means you're inputting only one line, but actually these two terms have very specific meanings. If you click "Test" and hover over them inside of the AWB Regex Tester, a tooltip will pop up in case you forget which is which.

Multiline - By default, the characters ^ and $ match the beginning and end of the entire wiki page. If you check this, they change meaning to match the beginning and end of each line - or, in other words, it treats your wikitext as having multiple lines you want to match on.
Singleline - By default, the character . matches everything except for the new line character. The newline character is given by \n. If you check this box, though, . changes its meaning so that it DOES match new lines. Or, in other words, it will treat your wikipage as if it's a single line for the purposes of matching .* or another quantifier.
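If you'd like to see these two options in action outside of AWB, here is a minimal sketch in Python. It assumes Python's re module, where the "multiline" option is called re.MULTILINE and the "singleline" option is called re.DOTALL; the sample text and variable names are made up for illustration.

```python
import re

# A made-up three-line "wiki page".
page = "first line\nsecond line\nthird line"

# Default: ^ only matches at the very start of the whole text.
line_starts_default = re.findall(r"^\w+", page)

# "Multiline": ^ also matches at the start of every line.
line_starts_multiline = re.findall(r"^\w+", page, flags=re.MULTILINE)

# Default: . does not match a newline, so .* stops at the end of line one.
dot_default = re.match(r".*", page).group()

# "Singleline" (re.DOTALL in Python): . matches newlines too,
# so .* runs all the way to the end of the text.
dot_singleline = re.match(r".*", page, flags=re.DOTALL).group()

print(line_starts_default)
print(line_starts_multiline)
print(repr(dot_default))
print(repr(dot_singleline))
```

The option names differ between programs, but the behavior being switched on and off should be the same idea as the two AWB checkboxes.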

Exploring the concept of a capture group

Imagine the following scenario:

You're five years old, and you're playing Minesweeper on your parents' computer. After about 10 minutes, you get bored and decide to start looking at open windows. One of them is a Notepad++ file with a shopping list! You know that Mommy always writes shopping lists and then Daddy does the shopping, and he always buys what's on the list. A brilliant idea begins to form. What if you changed every single item on the shopping list to say "Candy"—think of how much candy you'd get to eat!

Right now, the shopping list looks like this:

GS - flour
GS - evaporated milk
GS - sugar
GS - pumpkin
GS - all-spice
OSS - toner
AS - surprise

You move the cursor to the first line and change "flour" to "candy." But you're only five years old—typing is so hard! "Gosh," you say. "There must be a better way to make all of these replacements!" You know about find-replace, but you don't see how that could POSSIBLY help here. In despair, you turn to your robot friend. "Robot friend, can you help?" you say. But your robot friend is a plastic wind-up toy and doesn't say anything.

"Ah, I know the problem!" you think. "Robot Friend needs specific instructions because she's a robot!" You think for a second. "Robot Friend, change every item to say candy!" Robot Friend still does nothing. Again you wonder why. "Hmm," you say. "Maybe it's because Robot Friend doesn't know what an item is!"

That's a pretty good guess! (It clearly has nothing to do with the fact that you are talking to a plastic toy.) In the grocery list, each item is an abbreviation of a store type, and then what to buy from it. GS is Grocery Store, OSS is Office Supply Store…what is AS? Well, whatever they probably have candy there too.

"Robot Friend," you say. "In each line of this text file, replace all the content after the dash with the word candy." Wow, that's really specific! Unfortunately, Robot Friend is still a plastic toy and does nothing.

"Okay," you say. "I'm gonna google this!" So you go to Google and you search "How to change everything in my parents' shopping list to say candy???" The first result says, "Learn regular expressions to get candy"

"That looks promising," you say, and you open the link and start reading. The first thing you need to do, is open the find-replace window in Notepad++. "Wow, so convenient that my parents use such a nice text editor for their shopping lists," you think. You press Ctrl+H to pull up the Replace dialog directly, and you click the option that says "Regular expression."

The first thing you need to do, the website says, is to figure out what to find. "But that's obvious, it's the thing after the dash!" you think to yourself. But then you remember Robot Friend and her precise instruction requirement. "Ohhh," you think. "Find-replace is like a virtual Robot Friend!" Pleased that you have TWO Robot Friends now, you think about how to describe where you want to make the change more precisely.

"So, each line starts with a location. I want to keep that the same from the find to the replace. Then there's a dash, and I want to keep that the same too. And I guess I'll keep the space after the dash as well. But, then everything after that, I need to delete all of it! It needs to say candy!"

Ok well, that's a good explanation to a human, but not to a Robot Friend. How to precisely say, "keep everything up to and including the space after the dash the same, and change the rest of the line to the string candy?"

You read more of the website. "Using Capture Groups To Keep Parts Of Your String The Same" says the next heading. If you enclose part of your search/find string in () then it will be captured! Ah, perfect!

In normal find-replace, you can keep something the same by just retyping it out in the replace part. If you wanted to change "cat" to "cat" you'd just write "cat" twice and it would make all the replacements, but nothing would change. The difficulty here isn't that you need to keep something the same from find to replace, it's that you don't know what it is that you're keeping the same.

Ah, there is a solution for that! If you can somehow represent Everything up to and including the space after the - formally, regular expressions allows you to CAPTURE the characters that your formal definition matches, and then access them in the replacement string with the code $1! And then after that, you just want to print the word "candy"…that shouldn't be too hard!

So let's see…the find string will look something like this:

([Formal way to specify everything through dash+space])[formal way to specify everything else]

(In the above, the parentheses () are meaningful characters that are actually part of regular expression syntax. The brackets [] are just there to make it easier to read and have no meaning.)

And then the replace string will be exactly this:

$1candy

While it's not relevant to this example, you also notice that you could put MORE THAN ONE CAPTURE GROUP if you needed to, and then you could access them with $1, $2, etc., in the replacement string. You could even change the order of your captures by accessing $2 before $1! Wow, this is super exciting.

But how to finish this problem?? You look at the example on the website. It says that .* will match ANYTHING. Oh, great! You think. I'll write (.*).* and replace it with $1candy!

But when you try that, you get the following:

GS - flourcandy
GS - evaporated milkcandy
GS - sugarcandy
GS - pumpkincandy
GS - all-spicecandy
OSS - tonercandy
AS - surprisecandy

Yikes, not that! You quickly realize your mistake. Robot Friend always does EXACTLY what you tell it to do…and you never told it to stop at the dash! You try another time:

Find: (.*) - .*
Replace: $1candy

GScandy
GScandy
GScandy
GScandy
GScandy
OSScandy
AScandy

Uh oh, still not quite…Let's apply what we learned about having two capture groups!

Find: (.*)( - ).*
Replace: $1$2candy

GS - candy
GS - candy
GS - candy
GS - candy
GS - candy
OSS - candy
AS - candy

Ahhh, perfect! You think for a second, though. Do you really need two capture groups? What if instead you tried this:

Find: (.*) - .*
Replace: $1 - candy

GS - candy
GS - candy
GS - candy
GS - candy
GS - candy
OSS - candy
AS - candy

Yay, that works too! Pleased with yourself, and now exhausted, you save the new shopping list and take Robot Friend to sleep.

The next day, Mommy says, "I'm so sorry, we were going to make a pumpkin pie today but then Daddy just got a bunch of Halloween candy instead! So we'll have a lot of candy to give out to trick-or-treaters, but we can't make pie." Oh no, what have you done? You used your Robot Friend powers for evil and now you don't get pie??? "But there's something to make up for it," Mommy continues. "Meet your new kitten friend that we adopted and Daddy picked up from the Animal Shelter! We are calling her Candy since that's what the shopping list said!"

Ohh, so that's what AS stood for! You forget about the pie completely and play with your new kitten. The end.

So what have we learned so far?

Regular expressions are a way to precisely express relatively complicated find-replace instructions
.* matches any run of characters, up to the end of the line (by default, . does not match a new line)
Enclosing part of a regular expression in parentheses "captures" it for the replace string
Capture groups are sequentially assigned the numbers 1, 2, 3, 4, etc
The replace string can reference capture groups with $1, $2, etc. The order is determined solely by the find string, and the replace string can reference capture groups in any order.
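Here is a rough sketch of the shopping list replacement in Python, in case you want to experiment outside of a text editor. One assumption to keep in mind: Python's re module writes capture group references as \1 and \2 instead of $1 and $2. The shortened list and the "flour from GS" reordering are made up for illustration.

```python
import re

# A shortened copy of the shopping list from the story.
shopping_list = """GS - flour
GS - evaporated milk
OSS - toner
AS - surprise"""

# Find: (.*) - .*      Replace: $1 - candy   (written \1 in Python)
candy_list = re.sub(r"(.*) - .*", r"\1 - candy", shopping_list)

# Two capture groups, used in reverse order in the replacement.
# Find: (.*) - (.*)    Replace: $2 from $1   (a made-up example)
swapped = re.sub(r"(.*) - (.*)", r"\2 from \1", shopping_list)

print(candy_list)
print(swapped)
```

The second replacement shows that the numbering comes from the find string only: group 1 is still the store, even though it is printed last.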

Character classes, escaping, and quantifiers

See Repetition with Star and Plus on regular-expressions.info for quantifiers.
See Character Classes or Character Sets on regular-expressions.info for character classes.

Say you keep a diary of how many kittens you see every day:

Sunday - saw 5 kittens
Monday - saw 3 kittens
Tuesday - saw 7 kittens
Wednesday - saw 2 kittens

You decide one day that you want to start writing in complete sentences. So you'd rather this say, "On Sunday, I saw 5 kittens." So you'll want to capture a word (the day of the week) as well as a number (the number of kittens you saw that day), and put these inside of your replacement string. In our first example we just used .* to select the entire portion of text. But let's get more precise about what we're capturing!

A "character class" is a precise way of telling the program what kind of characters are allowed to be in a particular group of characters. For example, "one digit" can be written as \d. "One word character" (a letter, a digit, or an underscore) can be written as \w, and we'll use it here to match letters. "One character" can be written as . - yes, a lot of characters have special meaning in regular expression syntax! Fortunately, the program knows you might need to match an actual period in the lookup text, and it gives you the ability to do that - if you type \., then the backslash "escapes" the period, so it matches only a literal period.

Internationalization note - whether \w matches non-ASCII letters depends on the regex engine and its settings, so test it before relying on it with non-ASCII text!
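To make the difference between . and \. concrete, here is a small sketch using Python's re module (an assumption; AWB itself is not involved). The sample strings are made up. It also shows that \w is a little broader than "one letter": it accepts digits and underscores as well.

```python
import re

text = "3.5 or 325"

# An unescaped . matches ANY single character, so "325" matches too.
unescaped = re.findall(r"3.5", text)

# An escaped \. only matches a literal period.
escaped = re.findall(r"3\.5", text)

# \d matches one digit; \w matches one letter, digit, or underscore.
digits = re.findall(r"\d", "a1_b2!")
word_chars = re.findall(r"\w", "a1_b2!")

print(unescaped)   # ['3.5', '325']
print(escaped)     # ['3.5']
print(digits)      # ['1', '2']
print(word_chars)  # ['a', '1', '_', 'b', '2']
```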

Notice that I said "one digit" and "one letter" - but in the statement of what we want to do, it says "one number" and "one word"! All of the days of the week are a lot of letters long. And what if I saw 10 kittens on a single day? Then the number is 2 digits long!

Well, there's an answer for that too. If you write \d+, that matches "one or more digits" - or in other words, a number. If you write \w+, that means "one or more letters" (strictly speaking, one or more word characters: letters, digits, or underscores). And, if we write .+, that means "one or more characters of any type."

Types of numbers note - \d+ is not smart enough to know that sometimes there are commas or periods in the middle of numbers. So it will only match positive integers without any place-value separators!

(\w+) - saw (\d+) kittens becomes:
On $1, I saw $2 kittens.

Notice that in the replace string, we do NOT need to escape the period. That's because our replacement string needs to be 100% exactly defined. Telling the program, "replace the word 'Sunday' with any set of characters" makes no sense at all, whereas "find any set of characters between the word 'kittens' and 'cute'" does.
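The kitten diary replacement can be sketched in Python like this. As an assumption of this sketch, Python's re module is used, so $1 and $2 are written as \1 and \2; the "Tuesday - saw 10 kittens" line is added here to show that \d+ handles a two-digit number.

```python
import re

diary = """Sunday - saw 5 kittens
Monday - saw 3 kittens
Tuesday - saw 10 kittens"""

# Find: (\w+) - saw (\d+) kittens
# Replace: On $1, I saw $2 kittens.   (written with \1 and \2 in Python)
# The period in the replacement is NOT escaped: it is just a period there.
sentences = re.sub(
    r"(\w+) - saw (\d+) kittens",
    r"On \1, I saw \2 kittens.",
    diary,
)

print(sentences)
```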

What if the data might be incomplete? Maybe on Thursday I forgot to write the number of kittens I saw, so my string is just Thursday - saw kittens. I want to convert this to the sentence On Thursday, I saw kittens. in the same action as I replace everything else. But, \d+ matches 1 or more digits, and...there's 0 digits here. Turns out there's a way to do that too!

(\w+) - saw (\d* *)kittens becomes:
On $1, I saw $2kittens.

Notice! We also included the space along with the \d*, because if the digit isn't there, the following space won't be there either!

What we've done here is change our + into a *. The * character is similar to the + character, except it can match 0 or more characters. So this find-replace will properly convert both Thursday - saw kittens and Wednesday - saw 10 kittens at the same time.
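Here is the same idea as a small Python sketch (assuming Python's re module, where $1 and $2 become \1 and \2). It runs one find-replace over a line with a number and a line without one.

```python
import re

diary = "Wednesday - saw 10 kittens\nThursday - saw kittens"

# (\d* *) captures zero or more digits followed by zero or more spaces,
# so it can also capture nothing at all on the Thursday line.
result = re.sub(
    r"(\w+) - saw (\d* *)kittens",
    r"On \1, I saw \2kittens.",
    diary,
)

print(result)
```

On the Thursday line the second group matches an empty string, so the replacement simply puts nothing between "saw " and "kittens".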

The characters * and + are called "quantifiers" because they are ways to "quantify" the number of characters you're matching. There's one more thing to mention about quantifiers.

Suppose our starting text looked like this instead:

Sunday - saw 2 kittens and a bunch of bunnies and 3 puppies
Monday - saw a couple llamas and a few kittens and 2 elephants
Tuesday - saw 5 bears and a unicorn

We want to reformat this as follows:

On Sunday, I saw 2 kittens. I also saw a bunch of bunnies and 3 puppies.
On Monday, I saw a couple llamas. I also saw a few kittens and 2 elephants.
On Tuesday, I saw 5 bears. I also saw a unicorn.

Before, the number of animals we saw was always a collection of digits, or \d+. But now we have some English phrases mixed in too! So we can't just use \d. Fortunately, we discovered the character class . as well. So let's use that:
(\w+) - saw (.+) and (.*) (https://regex101.com/r/Fe033j/1/) becomes
On $1, I saw $2. I also saw $3.

But wait! There's a problem! This is the result we actually get:

On Sunday, I saw 2 kittens and a bunch of bunnies. I also saw 3 puppies.
On Monday, I saw a couple llamas and a few kittens. I also saw 2 elephants.
On Tuesday, I saw 5 bears. I also saw a unicorn.

See the problem? Instead of $2 going from the - to the first and, it went all the way to the second! Oh no, this is not what we want at all. The quantifiers + and * are "greedy" - they take as many characters for themselves as possible before letting you move onto the next part of your "find" string. We want the opposite - to take as few as possible.

The opposite of "greedy" might be "generous" but we're all about vices. So we're going to call the version that takes as few characters as possible, "lazy." So there's "greedy" quantifiers that take as much as possible, and "lazy" quantifiers that take as little as possible. The notation for lazy quantifiers is as follows:

+? matches 1 or more characters, as few as possible
*? matches 0 or more characters, as few as possible

So instead let's match the following:
(\w+) - saw (.+?) and (.*) becomes
On $1, I saw $2. I also saw $3.

And now we get the result we want!
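If you want to compare greedy and lazy side by side, here is a minimal sketch using Python's re module (an assumption; in Python the replacement uses \1, \2, \3 instead of $1, $2, $3). The only difference between the two find strings is the ? after the first .+

```python
import re

line = "Sunday - saw 2 kittens and a bunch of bunnies and 3 puppies"
replacement = r"On \1, I saw \2. I also saw \3."

# Greedy: (.+) grabs as much as it can, so it stops at the LAST " and ".
greedy = re.sub(r"(\w+) - saw (.+) and (.*)", replacement, line)

# Lazy: (.+?) grabs as little as it can, so it stops at the FIRST " and ".
lazy = re.sub(r"(\w+) - saw (.+?) and (.*)", replacement, line)

print(greedy)
print(lazy)
```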

To review:

"Character classes" are groups of types of characters, for example "digits" or "letters." There's a LOT of character classes we didn't cover here.
"Quantifiers" let you match more than one character at a time. We learned about 4 quantifiers - +, *, +?, and *?.
Some special characters, like . need to be escaped with a \ if you want to search for them inside a string, because they otherwise have special meanings.

Again, remember that we aren't going to come close to exhaustively covering every aspect of regex within these topics. Check out the resources linked at the start of the section for more information!

"Alternation," aka "Or"

Suppose you happened to have a particularly exciting week, where some days you saw kittens, some days you saw bunnies, and some days you saw puppies! Since you have good taste in cute baby animals, you think kittens and bunnies are way cuter than puppies. Your data looks like this:

Sunday - saw 5 kittens
Monday - saw 3 bunnies
Tuesday - saw 7 bunnies
Wednesday - saw 2 puppies

You want to discard the information about kittens vs bunnies and replace both of these words with "cute animals." Without regex, you could do this with 2 separate searches: first kittens -> cute animals and then bunnies -> cute animals. But with regex, we can do it in just one!

kittens|bunnies becomes
cute animals

The character | is called an "alternator" which is a really fancy way of saying "or." It lets you match the text on the left OR the text on the right.

Let's slightly modify our goal. Instead of just replacing "kittens" and "bunnies" with "cute animals," we want to add some more excitement! Let's try and rephrase this so that Sunday - saw 5 kittens becomes Sunday - hooray, 5 cute animals!. But since puppies aren't as cute, we'll leave that one alone. We can easily use the "alternation" we just learned:

- saw (\d) kittens|- saw (\d) bunnies becomes......

Well, wait, how do we replace this to get what we want? Remember that capture groups are numbered left to right? So our number of animals could be either $1 or $2 and we don't know which. But actually, one of these will be empty, so we could just put them next to each other like so:

- hooray, $1$2 cute animals!

You can try this example yourself if you want to see proof that it works and play further.
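Both alternation examples can be sketched in Python as follows. This assumes Python's re module (version 3.5 or later, where a capture group that did not take part in the match is replaced with an empty string), and it writes $1$2 as \1\2. The three-line diary is a shortened copy of the data above.

```python
import re

diary = """Sunday - saw 5 kittens
Monday - saw 3 bunnies
Wednesday - saw 2 puppies"""

# kittens|bunnies matches either word; puppies are left alone.
simple = re.sub(r"kittens|bunnies", "cute animals", diary)

# Each side of the | has its own capture group. Only one side matches
# on any given line, so the other group is empty in the replacement.
excited = re.sub(
    r"- saw (\d) kittens|- saw (\d) bunnies",
    r"- hooray, \1\2 cute animals!",
    diary,
)

print(simple)
print(excited)
```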

If that thing with $1$2 and one of them being empty seemed kinda confusing & convoluted, well, that's because it is. Because what we really want to do, is have an "or" apply only to the last word of the line, leaving the first part alone. How can we do that?

Non-capturing groups

Wow, what a cliffhanger the last section left us on. First let's finish the example from there. Turns out, if you put an alternator (remember that's a fancy way of saying "or," and it's written like this: |) inside of a capture group, you can apply the "or" to only part of the match! Let's try it:

- saw (\d) (kittens|bunnies) becomes
- hooray, $1 cute animals!

NICE! Parentheses can do two things, bundle an operation together, and ALSO "capture" the contents to use in the output! But, do we always want to do both of these things? Actually, we aren't using the contents of that second capturing group, there's no $2 in the replacement text. So let's invent some syntax to say, "I want the grouping part of this capturing group but I don't actually want to capture it." In other words, we want a group that doesn't capture, or....a non-capturing group!

- saw (\d) (?:kittens|bunnies) becomes
- hooray, $1 cute animals!

See that ?: at the start of the group? That means "I need a group but I don't want it to capture." This can be pretty useful in particular when you want to do a lot of alternation, and also if you're ever editing a regex and need to add a group but don't want to have to change your replace term just because you're adding more grouping.

If it's really confusing to you to use the ?: syntax, no problem - non-capturing groups don't actually add any new functionality to regex that you don't have without them, they just make it easier to work with your capturing groups.
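To see what "doesn't capture" means in practice, here is a small Python sketch (assuming Python's re module, with $1 written as \1). It compares what each version of the group stores, then runs a replacement aimed at the "hooray, 5 cute animals!" goal stated earlier.

```python
import re

line = "Monday - saw 3 bunnies"

# Capturing group around the alternation: two groups are stored.
capturing = re.search(r"- saw (\d) (kittens|bunnies)", line).groups()

# Non-capturing group (?:...): same match, but only one group is stored.
non_capturing = re.search(r"- saw (\d) (?:kittens|bunnies)", line).groups()

# The replacement only ever needed group 1.
result = re.sub(
    r"- saw (\d) (?:kittens|bunnies)",
    r"- hooray, \1 cute animals!",
    line,
)

print(capturing)      # ('3', 'bunnies')
print(non_capturing)  # ('3',)
print(result)
```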

Zero-width assertions

Note - this is NOT the same topic as "zero-length matches" on regular-expressions.info!

Like "alternator," this is a scary-seeming phrase that has a simple meaning. But before defining the term let's observe a few things about regular expressions. If you want to match the string caaaaat you might write ca+t. Or you could write caa*t (this is equivalent) (take a moment to figure out why those regex are the same if it's not obvious to you). The c matches the c, then the a+ or aa* matches all the a's, and then the t matches the t. The important thing to notice here is that each part of the text is matched by only one thing in the regex. So like, once you've matched the c, that's just it. The c is done. There's no more c-matching to be allowed to be done in the regex.
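If the claim that ca+t and caa*t are the same isn't obvious, here is a quick check in Python (assuming Python's re module; the word list is made up). re.fullmatch requires the whole word to match the pattern.

```python
import re

words = ["ct", "cat", "caaaaat", "cart"]

# a+ means "one or more a"; aa* means "one a, then zero or more a".
# Both say the same thing, so both lists should come out identical.
plus_matches = [w for w in words if re.fullmatch(r"ca+t", w)]
star_matches = [w for w in words if re.fullmatch(r"caa*t", w)]

print(plus_matches)  # ['cat', 'caaaaat']
print(star_matches)  # ['cat', 'caaaaat']
```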

(This isn't an accident by the way. If you want to go down a giant rabbit hole, check out regular languages or deterministic finite automata on Wikipedia. I'm serious about it being a rabbit hole. But it's pretty cool.)

Sooooo what if we wanted to match some stuff more than once? Like for example, say we want to find the word "cat" but NOT the word "catastrophe" or "cats." No problem, just check for cat right? Wrong! What if the string said cat. or "cat" or even cat!? Uhhhhhh ok maybe this isn't so good. So what we want to do is say like, look for cat and then look for a non-letter character. (If you read supplemental information about character classes you'll know how to do this - if not check it out!) But what if cat is the very last thing in the document?

So here's where "zero-width assertions" come in. This phrase means "something to require be true about our match without actually advancing our search by any amount." Or another way to put it is, "a way to check some property of the string that doesn't actually contain any characters in the string."

The simplest zero-width assertions are ^ which means "beginning of line or file" and $ which means "end of line or file." The choice between line and file depends on the settings you've chosen, see AWB-specific regular expressions. The one we'd use in the cat example is cat\b. \b means "boundary" and requires that a word be beginning or ending at that location.
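Here is a small sketch of the cat example in Python (assuming Python's re module, which supports \b; the sample text is made up). It counts how many times each pattern matches.

```python
import re

text = 'cat. cats "cat" catastrophe cat'

# Plain "cat" also matches inside "cats" and "catastrophe".
plain = re.findall(r"cat", text)

# cat\b requires a word boundary right after the t, so the match must
# be followed by a non-word character or by the end of the text.
bounded = re.findall(r"cat\b", text)

print(len(plain))    # 5
print(len(bounded))  # 3
```

Note that \b did not consume the period or the quote mark: it only checked that the word ended there, which is why it also works when cat is the very last thing in the text. If you also want to rule out words that merely end in cat, a \b in front (\bcat\b) does the same check at the start.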

Lookaround

For a full treatment of this topic, see regular-expressions.info.

A special kind of zero-width assertion is called "lookaround." This section is included here because people ask about it a lot, but the truth is this is needlessly complicated and nearly every regex implementation has an easier way of accomplishing the same thing. Lookaround lets you add a zero-width assertion to your regex that checks behind or ahead for some condition to be true or false, and if your specification is violated it cancels the regex's ability to match. Sound confusing? Well, it's a complicated syntax and a complicated concept that can be replaced by using the "If..." tab in AWB's advanced search, which is extremely straightforward and makes sense so........yeah. But, if you like the idea of not having to click one button to use the "If" tab in exchange for learning a convoluted syntax that's even more difficult to use, read on!

Let's say you want to match Kittens are cute and replace it with Yes, kittens are cute. But, oh no!!! Someone really mean and bad vandalized your beautiful sentence to make it sometimes say Kittens are cute (not!!!). Who would do such a thing omg. We CERTAINLY don't want to write Yes, kittens are cute (not!!!), that would be terrible! So what we want to do is find places where it says Kittens are cute<AND THEN DOESNT SAY (not!!!) HERE>. Now, we don't know what comes after the sentence in nice places - it could be a period, it could be a !, it could be an end of line, or something else entirely. So we can't make any guarantee about what IS there, just about what ISN'T. Sounds like a task for a zero-width assertion!

The one we're about to use is called NEGATIVE LOOKAHEAD. Negative because we want it to NOT be here, and Lookahead because we're looking....ahead. The other types are POSITIVE LOOKAHEAD, NEGATIVE LOOKBEHIND, and POSITIVE LOOKBEHIND. All 4 of these are similar in both concept and use, and so they're lumped into the umbrella term "lookaround."

So, let's first look at the syntax. As mentioned earlier, you should use regular-expressions.info for full information, we'll only go through negative lookahead here.

Kittens are cute(?! \(not!!!\))

The first part we understand perfectly, Kittens are cute matches, well, Kittens are cute. But what about that second part?? First of all, let's notice that \( and \) are just the literal ( and ) characters, which need to be escaped, and not!!! is just the literal characters not!!!. (The space after ?! is a literal space too.) That leaves us with the wrapper, (?!SOMETHING GOES HERE). Well like we learned earlier, parentheses can mean either "a group of stuff together" or "a capture group." In this case it means "a group of stuff together." The characters ?! mean "check out in front and make sure that the contents of this is NOT the next thing in the string!!!!" So this regex will do what we want it to.

The stuff inside of the negative lookahead can be itself a regular expression. For example, if our nefarious anti-cat vandal tried to thwart our efforts by using a different number of exclamation points in different places, we could use this instead:

Kittens are cute(?! \(not!+\))

In this case, we're matching 1 or more !s after the not.
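Here is a minimal sketch of the negative lookahead in Python (assuming Python's re module, which supports the same (?!...) syntax; the sample text is made up). The literal parentheses are escaped as \( and \).

```python
import re

text = "Kittens are cute. Kittens are cute (not!!!). Kittens are cute (not!)"

# Match "Kittens are cute" only when it is NOT followed by " (not",
# one or more exclamation points, and a closing parenthesis.
fixed = re.sub(
    r"Kittens are cute(?! \(not!+\))",
    "Yes, kittens are cute",
    text,
)

print(fixed)
```

Only the first sentence is changed. The two vandalized copies are left alone, whatever the number of exclamation points.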

If this seems super easy to you, go ahead and use it! But if not, really don't worry about it. We also could have used the "If" panel in AWB to make sure our selection DIDN'T include the literal phrase Kittens are cute (not!!!), which would have saved a lot of work and syntax memorization.

Go read regular-expressions.info!

Just one word of warning - this site is crazy complete, and in particular they include a LOT of things that aren't actually going to be relevant to you. For example, ignore balancing groups, recursion, and subroutines. Also, depending on where you are writing your regex, the syntax might be slightly different. If you're doing a regex search in the wiki, some things like \b won't work. Python uses a different syntax for referencing capture groups. Lua's string pattern matching isn't regular expressions at all (it's very limited in scope), and you should read the Lua docs for what they do. If you understood most stuff on this page, though, you're in a great position to learn stuff on your own! Feel free to use common sense if something seems too complicated to be worth learning, and skip stuff as you like. Remember, these are a tool to make your life easier - so if it's making your life harder, maybe you should try another approach. But regex are still super powerful - so go have fun!

Resources for practicing or testing regex:
* https://regex101.com
* https://regex-vis.com