1664348402

Regular Expressions (RegEx): A Beginner's Guide

I used to know only one regular expression: (.*) :) A few friends urged me to learn the rest, but since I didn't know where regular expressions could be used, I kept putting it off.

That all changed when I had to work more closely with Google Analytics and Google Tag Manager at Netpeak. Without understanding regular expressions, it's hard to properly set up filters and custom segments in GA or rules in GTM.

And now let's figure out where a beginner should start learning regular expressions.

Table of contents:

  1. What is RegEx (regular expressions)
  2. Syntax and symbols of RegEx
    2.1. Starting and Ending Patterns
    2.2. Special Characters
    2.3. Alternate Characters
    2.4. Groups in regular expressions
    2.5. Character Sets and Ranges in RegEx
    2.6. Repeating Characters in regular expressions
    2.7. Metacharacters
  3. Five ways to test your knowledge of regular expressions
    3.1. Learn regular expressions in a text editor
    3.2. Testing your knowledge of regular expressions in regex101
    3.3. Testing different types of regular expressions with JSFiddle
    3.4. Checking errors in regular expressions with Google Analytics
    3.5. Non-standard methods of mastering regular expressions
  4. Greedy and lazy quantifiers
  5. Where to use regular expressions
  6. What else to read about regular expressions
  7. Conclusions
  8. FAQ

What is RegEx (regular expressions)

Regular expressions (RegEx) are patterns made up of characters that are used to search for text strings matching given conditions. The result of applying a regular expression is a subset of the data, selected according to the logic of the expression. Regular expressions are useful in any task where you need to extract part of a data set according to certain rules.

Syntax and symbols of RegEx

Most characters in regular expressions represent themselves, except for the group of special symbols "[ ] \ / ^ $ . | ? * + ( ) { }". If these symbols are to be matched as literal text characters, they must be escaped with a backslash "\".

When these special characters occur without a backslash, they have special meanings in regular expressions:

Starting and Ending Patterns

  • "^" is a caret (circumflex). It matches the beginning of the string;
  • "$" is a dollar sign. It matches the end of the string;
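As a minimal sketch of how the two anchors behave, the Python snippet below filters a short list of URLs with the standard `re` module. The sample URLs and variable names are chosen for illustration only.

```python
import re

# Illustrative sample URLs (not a real site).
urls = [
    "https://www.site.com/index.php",
    "www.site.com/search?q=widget",
    "https://www.site.com/index/yy.jpg",
]

# "^https" matches only if the string starts with "https".
starts_with_https = [u for u in urls if re.search(r"^https", u)]

# "\.php$" matches only if the string ends with ".php"
# (the dot is escaped so that it is treated as a literal dot).
ends_with_php = [u for u in urls if re.search(r"\.php$", u)]

print(starts_with_https)
print(ends_with_php)
```

Without the anchors, "https" or ".php" would be found anywhere inside the string, not only at its start or end.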

Special Characters

  • "." is a dot. It matches any single character (including a literal dot; in most engines it does not match a line break by default);
  • "*" is an asterisk (multiplication sign). It matches any number of repetitions of the previous character, including zero, so the character may be absent altogether and the expression will still match;
  • "+" is a plus. It matches 1 or more of the previous character. For example, ".+" matches one or more characters of any kind ("b", "bb" and "bgs5.62#d" will all match);
  • "?" is a question mark. It matches 0 or 1 of the previous character;
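The following Python sketch checks each of these four symbols against a few strings. The helper `whole_match` and the "colou?r" example are hypothetical and introduced only for this illustration; `re.fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

results = {
    "b.g / bag": whole_match(r"b.g", "bag"),                # "." is any one character
    "ab* / a": whole_match(r"ab*", "a"),                    # "*" allows zero "b"
    "ab* / abbb": whole_match(r"ab*", "abbb"),              # ...or several
    "ab+ / a": whole_match(r"ab+", "a"),                    # "+" needs at least one "b"
    ".+ / bgs5.62#d": whole_match(r".+", "bgs5.62#d"),      # one or more of anything
    "colou?r / color": whole_match(r"colou?r", "color"),    # "?" makes "u" optional
    "colou?r / colour": whole_match(r"colou?r", "colour"),
}

for label, matched in results.items():
    print(f"{label}: {matched}")
```

Only the "ab+ / a" check is expected to fail, because "+" requires at least one occurrence of the previous character.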

Alternate Characters

  • "|" is a vertical line. OR operator. For example:
    • The pattern “cat|mouse” will match both “cat” and “mouse” strings.

Groups in regular expressions

  • "( )" — parentheses. Grouping constructions. For example:
    • The pattern “123|45” corresponds to either "123" or "45", while "12(3|4)5" corresponds to either "1235" or "1245".
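To see how parentheses change the scope of "|", here is a small Python sketch using the patterns from the two lists above. The sample strings and variable names are illustrative.

```python
import re

# "|" alone: either alternative may appear in the text.
animals = re.findall(r"cat|mouse", "the cat chased the mouse")

# Without parentheses, "|" splits the whole pattern: "123" OR "45".
ungrouped = re.compile(r"123|45")
# With parentheses, "|" applies only inside the group:
# "12", then "3" or "4", then "5".
grouped = re.compile(r"12(3|4)5")

samples = ["123", "45", "1235", "1245"]
ungrouped_hits = [s for s in samples if ungrouped.fullmatch(s)]
grouped_hits = [s for s in samples if grouped.fullmatch(s)]

print(animals)         # ['cat', 'mouse']
print(ungrouped_hits)  # ['123', '45']
print(grouped_hits)    # ['1235', '1245']
```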

Character Sets and Ranges in RegEx

  • "[ ]" are square brackets. They match any one of the listed characters or ranges. If the first character inside the brackets is "^", the set is negated: the character being checked must not be any of those listed in the brackets. For example:
    • The pattern “[a-z]” corresponds to any single lowercase letter from a to z;
    • The pattern "[A-Z]" corresponds to any single uppercase letter from A to Z;
    • The pattern “[a-z][a-z][a-z]” matches any three consecutive lowercase letters, for example a three-letter word. For four- or six-letter words, it is better to combine "[ ]" with "{ }".
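A short Python sketch of character sets, ranges and negation. The sample string "aB3" and the word list are made up for the illustration.

```python
import re

sample = "aB3"
lowercase = re.findall(r"[a-z]", sample)       # characters in the range a-z
uppercase = re.findall(r"[A-Z]", sample)       # characters in the range A-Z
not_lowercase = re.findall(r"[^a-z]", sample)  # a leading "^" negates the set

# Three sets in a row require exactly three lowercase letters
# when the whole word has to match.
words = ["cat", "dogs", "Sun"]
three_lowercase = [w for w in words if re.fullmatch(r"[a-z][a-z][a-z]", w)]

print(lowercase)        # ['a']
print(uppercase)        # ['B']
print(not_lowercase)    # ['B', '3']
print(three_lowercase)  # ['cat']
```

"Sun" is rejected because its first letter is uppercase and falls outside the a-z range.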

Repeating Characters in regular expressions

  • "{ }" are curly braces. They repeat the previous character or expression a given number of times. For example:
    • The pattern “b{3}” will correspond to "bbb";
    • The pattern “B{3}” will correspond to "BBB";
    • The pattern “{3,5}” looks for at least 3 and at most 5 occurrences of the previous expression. For example, “[a-c]{2,4}” corresponds to a run of 2 to 4 letters from the range a to c. As a whole-string match, this expression corresponds to “ab” and “abc”, but not to “aabbc”, which has five letters (an unanchored search would still find “aabb” inside “aabbc”);
    • The pattern “[a-z]{6}” will correspond to any six lowercase letters in a row, for example a six-letter word.
  • "\" is a backslash. It escapes special characters.
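The sketch below, in Python, tries the repetition examples from this list and then shows what escaping with a backslash changes. The "1.5 and 1x5" text is an invented example; everything else comes from the bullets above.

```python
import re

# Exact and ranged repetition, checked against the whole string.
exactly_three = re.fullmatch(r"b{3}", "bbb") is not None
two_to_four = [s for s in ["ab", "abc", "aabbc"]
               if re.fullmatch(r"[a-c]{2,4}", s)]

# An unanchored search still finds a run of up to four letters inside "aabbc".
partial = re.search(r"[a-c]{2,4}", "aabbc").group()

# An unescaped dot matches any character; an escaped dot matches only ".".
unescaped_dot = re.findall(r"1.5", "1.5 and 1x5")
escaped_dot = re.findall(r"1\.5", "1.5 and 1x5")

print(exactly_three)  # True
print(two_to_four)    # ['ab', 'abc']
print(partial)        # aabb
print(unescaped_dot)  # ['1.5', '1x5']
print(escaped_dot)    # ['1.5']
```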

Metacharacters

There are also special metacharacters that can replace some of the constructions above:

  • \b does not represent a character, but a boundary between a word character and a non-word character;
  • \d is a digit;
  • \D is a non-digit character;
  • \s is a whitespace character;
  • \S is a non-whitespace character;
  • \w is a letter, digit, or underscore;
  • \W is any character other than a letter, digit, or underscore.
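Here is a rough Python illustration of the backslash-prefixed metacharacters (\d, \w, \s, \b and the negated \D). The sample sentence and phone number are invented. Note that in Python 3 these classes also match non-ASCII digits and letters by default.

```python
import re

text = "Order 66 shipped to user_42 on 2022-09-28"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of letters, digits and underscores
parts = re.split(r"\s+", "a  b\tc")  # split on any run of whitespace

# \b keeps "cat" from matching inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# \D is the opposite of \d: remove everything that is not a digit.
digits_only = re.sub(r"\D", "", "+38 (044) 123-45-67")

print(digits)       # ['66', '42', '2022', '09', '28']
print(words)
print(parts)        # ['a', 'b', 'c']
print(whole_words)  # ['cat', 'cat']
print(digits_only)  # 380441234567
```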

Five ways to test your knowledge of regular expressions

When learning regular expressions, practice is very important. The more you practice, the faster you begin to build the right constructs and solve the problems with different types of regex.

1. Learn regular expressions in the text editor

I recommend that almost all beginners install the Notepad++ text editor right away and start practicing with it. Why this text editor?

  • in most cases, special characters do not need to be escaped;
  • Notepad++ saves the patterns of previous queries;
  • the "Mark" function clearly shows the result of a search for a given pattern and allows you to quickly make corrections.

2. Test your regular expression skills in regex101

The regex101.com online service allows you to enter a data set and a regular expression. The strings matching the expression are then highlighted in the data set. A special Explanation window breaks the regular expression down piece by piece.

Let's practice: the task is to highlight whole lines. To do this, you have to enable the "multi line" flag.

Test data:

https://www.site.ua/
www.site.com/search?q=widget+thinger
https://www.site.com/page1/page2/page3/
https://www.site.com/index.php
https://www.site.com/products/100.php
https://www.site.us/products/101.php
https://www.site.com/products/102.php
https://www.site.ua/duals/index.html
https://www.site.com/ourteam/index.php
https://www.site.com/
https://site.com/profile
https://www.site.ru/ua/index.php
https://www.site.com/ua/producty/100.php
https://www.site.com/ua/producty/101.php
https://www.site.com/ua/producty/102.php
https://1.site.com/search?q=widget
https://www.site.com/search?q=widget+thinger
https://www.site.com/search?q=smidges
https://www.site.com/index/yy.jpg

Patterns of different types to check your knowledge:

  • select all pages;
    • (.*) matches any number of any characters;
  • select all pages with https;
    • ^https.* matches all URLs that begin with https;
  • select all Ukrainian-language pages;
    • .*/ua/.* matches all pages with /ua/ in their URL. A bare ua would also pick up unwanted URLs such as https://www.site.ua/duals/index.html, where "ua" appears in both the domain and the word "duals";
  • select all index pages;
    • .*index\.(php|html) works like the previous expression: you cannot search for just index, because that would also pick up https://www.site.com/index/yy.jpg;
  • select all product cards (for the Russian and Ukrainian versions);
    • .*product(s|y).* or .*product[sy].* (both options are fine).
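The same exercises can be reproduced outside regex101. Below is a minimal Python sketch that runs several of the patterns over a handful of lines from the test data. The helper `select` is hypothetical, and `re.MULTILINE` plays the role of the "multi line" flag so that "^" matches at the start of every line.

```python
import re

# A few lines taken from the test data above.
test_data = """\
https://www.site.com/index.php
https://www.site.com/products/100.php
https://www.site.ua/duals/index.html
https://www.site.com/ua/producty/100.php
https://www.site.com/index/yy.jpg
"""

def select(pattern: str) -> list:
    """Return the full text of every match found in the test data."""
    return [m.group(0) for m in re.finditer(pattern, test_data, re.MULTILINE)]

with_https = select(r"^https.*")
ukrainian = select(r".*/ua/.*")
bare_ua = select(r".*ua.*")                    # too broad: also catches "duals"
index_pages = select(r".*index\.(php|html)")
products = select(r".*product[sy].*")

for name, lines in [("https", with_https), ("/ua/", ukrainian),
                    ("ua", bare_ua), ("index", index_pages),
                    ("products", products)]:
    print(name, lines)
```

Because "." does not match a line break, each match stays within a single line, which is why every result is a complete URL.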

3. Testing different types of regular expressions with JSFiddle

JSFiddle is a tool for experimenting with JavaScript. In it, you can check the conditions for running a function or displaying the desired results.

This example shows how regular expressions are used to first determine whether the clicked element is a link to a .pdf or .jpg file. Then, for elements that are not file links, the name and price of the item are determined. All of this is worked out from the text content of the elements.

4. Checking errors in regular expressions with Google Analytics

The fastest way to test your knowledge of regular expressions in Google Analytics is to use filters in standard reports. Go to your account and, in any report where filters are available, try filtering the data with a regular expression.

5. Non-standard methods of mastering regular expressions

For those who like interactivity:

Greedy and lazy quantifiers

Quantifiers in regular expressions define a part of a pattern that must be repeated several times in a row. A "greedy" quantifier tries to grab the biggest chunk of text it can. The "lazy" version (formed by adding the modifier character "?") looks for the smallest possible match.

The greedy quantifier (*) captures everything from the first quotation mark to the last.

The lazy version of the quantifier (*?) looks for the smallest match, so it will find each quoted substring individually.
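The difference is easy to reproduce. In this small Python sketch the sample sentence is invented, and the two patterns differ only by the "?" after the asterisk.

```python
import re

text = 'He said "hello" and then "goodbye".'

# Greedy: ".*" runs from the first quotation mark to the last one.
greedy = re.findall(r'".*"', text)

# Lazy: ".*?" stops at the nearest closing quotation mark.
lazy = re.findall(r'".*?"', text)

print(greedy)  # ['"hello" and then "goodbye"']
print(lazy)    # ['"hello"', '"goodbye"']
```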

Where to use regular expressions

SEO professionals turn to regular expressions when working with Google Analytics, with RewriteRule in .htaccess, in text editors, and with crawlers (such as Netpeak Spider).

I will tell you about a few regular expressions that often help me.

  1. Highlight everything except the domain:
.*://|/.*

I use it when I have a large list of URLs (e.g. external links) and need to isolate only the domain for analysis. In Notepad++, I use the replace function to replace the matches with an empty string and get a clean list of domains.
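The same replace-with-nothing trick can be sketched in Python with `re.sub`. The pattern is the one shown above; the URL list and the name `NON_DOMAIN` are illustrative assumptions.

```python
import re

# Everything up to and including "://", OR everything from the first
# remaining "/" to the end of the string.
NON_DOMAIN = re.compile(r".*://|/.*")

urls = [
    "https://www.site.com/products/100.php",
    "https://www.site.ua/duals/index.html",
    "https://1.site.com/search?q=widget",
]

# Replacing every match with an empty string leaves only the domain.
domains = [NON_DOMAIN.sub("", url) for url in urls]

print(domains)  # ['www.site.com', 'www.site.ua', '1.site.com']
```

Both alternatives are needed: the first strips the protocol, the second strips the path and query string.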

  2. Select the URLs of a given nesting level:
.*://site.com/.*?/.*?/.*?/

Here (/.*?/) means one level of nesting.

I use this expression when I need to set the maximum allowed URL nesting when crawling a site in Netpeak Spider.
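To get a feel for what this pattern selects, here is a rough Python sketch. The catalog paths are hypothetical, and using the pattern to flag URLs that are "too deep" is only an assumption about how such a rule might be applied in a crawler.

```python
import re

# The pattern from above: the domain followed by at least three
# "/"-terminated path segments.
THREE_LEVELS = re.compile(r".*://site.com/.*?/.*?/.*?/")

# Hypothetical URLs with an increasing nesting depth.
urls = [
    "https://site.com/",
    "https://site.com/catalog/",
    "https://site.com/catalog/phones/",
    "https://site.com/catalog/phones/apple/",
    "https://site.com/catalog/phones/apple/iphone.html",
]

# URLs nested three or more levels deep match the pattern.
too_deep = [u for u in urls if THREE_LEVELS.match(u)]

print(too_deep)
```

Adding or removing one "/.*?" block in the pattern raises or lowers the depth threshold by one level.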

To scan all URLs on the first nesting level only, you should specify the following settings in the service:

What else to read about regular expressions

Conclusions

Regular expressions are a useful, powerful and completely free tool for processing string data and simplifying work in various services.

They are quite difficult to master, and it is even more difficult to learn to use them correctly. In return, they will make your work much easier and much more efficient.

Let's share our favorite regular expressions in the comments, shall we?

FAQ

How do I learn regular expressions?

A regular expression is a pattern that is matched against a subject string from left to right. Regular expressions are used to replace text in a string, validate forms, extract a substring from a string based on a pattern match, and much more. You can get started with our tutorial. Try running some of our examples in a text editor and, once you understand the basic syntax, move on to testing what you've learned.

Are there different versions of regex?

The syntax and semantics of regular expressions are standardized (for example, by POSIX). However, many programming languages and tools use their own non-standard flavors. Although the differences are usually small, anyone writing regular expressions must know which flavor their engine uses for the patterns to work correctly.

Where are regular expressions used?

Regular expressions are used when working with large amounts of data: tables and databases, texts, search engines, lexical analysis and much more. At the moment, most general-purpose programming languages support them, for example Python, C++, Java, JavaScript and Rust.

What's the difference between () and [] in regular expression patterns?

“[ ]” matches any one of the listed characters or ranges. “( )” groups constructions. For example:

  • The pattern “[a-z]” matches one character in the range a to z;
  • The pattern “(a-z)” is an explicit capture of the literal text "a-z", with no range.
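A quick Python check of the two patterns against the same three-character string makes the difference visible. The variable names are illustrative.

```python
import re

text = "a-z"

# Square brackets: a range, so "a" and "z" match but the hyphen does not.
as_set = re.findall(r"[a-z]", text)

# Parentheses: a group that captures the literal sequence "a-z".
as_group = re.findall(r"(a-z)", text)

# A single letter inside the range matches the set but not the group.
set_matches_m = re.fullmatch(r"[a-z]", "m") is not None
group_matches_m = re.fullmatch(r"(a-z)", "m") is not None

print(as_set)    # ['a', 'z']
print(as_group)  # ['a-z']
print(set_matches_m, group_matches_m)  # True False
```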
