Regular expressions for JS developers

JS developers need to know regular expressions

I worked as a developer for a long time before I learned regular expressions. They were never on my list of must-know topics, yet without them I always felt something was missing. Whenever I came across a hellishly long regular expression in a codebase, my instinct was to skip it. But a good FE developer should know this tool, so in this blog post you and I will walk through it and master it together.

Here are some of the main uses of regular expressions:

  • Manipulating strings of HTML nodes

  • Locating partial selectors within a CSS selector expression

  • Determining whether an element has a specific class name

  • Input validation

  • And more

Why

Let's say we want to validate that a string, perhaps entered into a form by users, follows the format of a nine-digit US postal code. The rule is:

99999-9999

Each 9 represents a decimal digit, and the format is five decimal digits, followed by a hyphen, followed by four decimal digits.

function isThatAZipCode(candidate) {
  // Must be a string of exactly 10 characters: 5 digits, a hyphen, 4 digits
  if (typeof candidate !== "string" || candidate.length !== 10) {
    return false;
  }
  for (let n = 0; n < candidate.length; n++) {
    const c = candidate[n];
    switch (n) {
      // Positions 0-4 and 6-9 must be digits
      case 0: case 1: case 2: case 3: case 4:
      case 6: case 7: case 8: case 9:
        if (c < "0" || c > "9") return false;
        break;
      // Position 5 must be the hyphen
      case 5:
        if (c !== "-") return false;
        break;
    }
  }
  return true;
}
💡
What if we could simply tell the computer that we want five digits, followed by a hyphen, followed by four more digits?
function isThisAZipCode(candidate) {
   return /^\d{5}-\d{4}$/.test(candidate);
}

Regular expressions in JS

A regular expression is a type of object. It can be either constructed with the RegExp constructor or written as a literal value.

  • Via a regular expression literal, /test/ (forward slash)

    💡
    /something/ is a syntax to create a regular expression
  • By constructing an instance of RegExp, new RegExp("test");
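
To make the two forms concrete, here is a small sketch. The variable names and the className value are made up for illustration; the main thing to notice is that a pattern passed to RegExp as a string needs its backslashes doubled.

```javascript
// Two equivalent ways to build the same pattern.
const literal = /\d+/;
const constructed = new RegExp("\\d+"); // backslash doubled inside a string

// The constructor is handy when part of the pattern is only known at runtime.
const className = "active"; // hypothetical value for illustration
const hasClass = new RegExp("(^|\\s)" + className + "(\\s|$)");

console.log(literal.source === constructed.source); // → true
console.log(hasClass.test("btn active large"));     // → true
console.log(hasClass.test("btn inactive"));         // → false
```

A literal is usually easier to read; the constructor is mostly worth it when the pattern has to be assembled from variables.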

💡
The two methods you will use most on a regular expression object are test and exec.

Regular expression's methods

Exec

exec and match are similar, but match is a method of strings while exec is a method of regular expression objects.

Exec in simple form

let match = /\d+/.exec("one two 100 200");
console.log(match);
// ["100"]
console.log(match.index);
// 8

Exec with group parentheses

let quotedText = /'([^']*)'/;
console.log(quotedText.exec("she said 'Hello'"));
// → ["'Hello'", "Hello"]

// A group that did not take part in the match is undefined
let badly = /bad(ly)?/;
console.log(badly.exec("bad"));
// → ["bad", undefined]

Groups can be useful for extracting parts of a string

let input = "A string with 3 numbers in it... 42 and 88.";
let number = /\b\d+\b/g;
let match;
while ( match = number.exec(input)) {
    console.log("Found", match[0] , "at" , match.index);
}
// → Found 3 at 14
//   Found 42 at 33
//   Found 88 at 40

Terms and operators

Exact matching

Any character that's not a special character or operator must appear literally in the string being matched. For example, our /test/ regex matches exactly the characters t, e, s, t, in that order.

A single match from a class of characters

💡
This is very common. In real-world patterns we often want one character from the set a, b, c followed by one character from the set c, d, f: [abc][cdf]

A finite set of characters (one of)

[abc] matches a single character that is either a, b, or c.

A finite set of multiple characters (one of) plus quantities

Adding the + quantifier, [ab]+ matches one or more characters from the set. Examples of valid matches:

  • a

  • b

  • ab

  • ba

  • aaa

  • bbb

  • abababab, etc.

💡
An unbounded number of strings can match this pattern.

Let's limit the length to exactly two characters:

[ab]{2}

So, there are 4 possible strings that match [ab]{2}:

  1. aa

  2. ab

  3. ba

  4. bb
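
A quick way to check this list is to run a few candidate strings through the pattern. This is only a sketch: the candidates array is made up, and the ^ and $ anchors (covered later in this post) are added so the whole string has to match.

```javascript
// Only strings made of exactly two characters from the set {a, b} should pass.
const exactlyTwo = /^[ab]{2}$/;
const candidates = ["aa", "ab", "ba", "bb", "a", "abc", "aab"];

const matches = candidates.filter((s) => exactlyTwo.test(s));
console.log(matches); // → ["aa", "ab", "ba", "bb"]
```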

Anything but a finite set of characters

[^abc] matches any single character except a, b, or c.

From a range [a-m]

Instead of writing a long [abcdefghijklm], we can write [a-m]
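
The three kinds of character class described so far can be compared side by side. The sketch below uses made-up variable names and anchors each pattern with ^ and $ (covered later in this post) so that the input must be exactly one character.

```javascript
// Character classes match exactly one character from (or outside) a set.
const oneOf = /^[abc]$/;    // a, b, or c
const noneOf = /^[^abc]$/;  // any single character except a, b, c
const inRange = /^[a-m]$/;  // any lowercase letter from a through m

console.log(oneOf.test("b"));   // → true
console.log(oneOf.test("d"));   // → false
console.log(noneOf.test("d"));  // → true
console.log(noneOf.test("a"));  // → false
console.log(inRange.test("k")); // → true
console.log(inRange.test("z")); // → false
```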

💡
[a-z] and [a-zA-Z] are fine, but [a-Z] is not: the range is out of order and throws a SyntaxError. An explicit range is often better than \w+, because being more specific is good.

💡
A range is especially helpful

But what if we want to match a whole word rather than a single character?

💡
const str = "cat or dog";

A single-character class doesn't work well here. How do we express "match either cat or dog"?

Parentheses for grouping
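
One way to answer the cat-or-dog question is to combine parentheses with the alternation operator |. The following is a minimal sketch; the variable names are made up, and \b (word boundary) is used so that words like "concatenate" are not picked up.

```javascript
const str = "cat or dog";

// (cat|dog) groups the two alternatives; g finds every occurrence.
const pets = /\b(cat|dog)\b/g;
console.log(str.match(pets)); // → ["cat", "dog"]

// Grouping also lets a quantifier apply to a whole sequence, not one character.
const repeated = /^(ab)+$/;
console.log(repeated.test("ababab")); // → true
console.log(repeated.test("aba"));    // → false
```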

Escaping (backslash)

Sometimes we need to search for special characters such as $, \, ^, [ and ]. These characters have special meanings in a regex, so how do we say "Hey, I want to match exactly this character"? A backslash in front of the character makes it a literal match. A double backslash (\\) matches a single backslash.

Some of the special characters we come across are: asterisk (*), ampersand (&), braces ({}), comma (,), brackets ([]), hyphen (-), equals sign (=), parentheses (()), semicolon (;), slash (/), etc.
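
Here is a short sketch of escaping in practice. The escapeForRegExp helper is hypothetical (it is not a built-in); it follows a commonly used pattern for escaping text before handing it to the RegExp constructor.

```javascript
// Escaped characters are matched literally.
console.log(/\$\d+\.\d{2}/.test("Total: $19.99")); // → true
console.log(/a.c/.test("abc"));  // → true: an unescaped . matches any character
console.log(/a\.c/.test("abc")); // → false: an escaped . only matches a literal dot
console.log(/\\/.test("C:\\temp")); // → true: \\ matches the single backslash

// Hypothetical helper: prefix every regex-special character with a backslash.
function escapeForRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const literalPrice = new RegExp(escapeForRegExp("$5.00 (approx)"));
console.log(literalPrice.test("costs $5.00 (approx)")); // → true
```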

Begins and Ends

By default, /test/ matches anywhere in the string, so in the string i'matest the match is found at the end of the string. What if we want the pattern to match the whole string, not just a substring?

/^test$/

Using both ^ and $ indicates that the specified pattern must encompass the entire candidate string

Applying strict matching:

const number = "1234-5678-123456";
/\d{4}-\d{4}-\d{5}/.test(number); // true: the pattern is found inside the string
// Apply strict matching with ^ and $
/^\d{4}-\d{4}-\d{5}$/.test(number); // false: the last group has six digits

console.log(/cat/.test("concatenate"));
// → true
console.log(/\bcat\b/.test("concatenate"));
// → false

Quantifiers

💡
Any of the repetition operators (?, *, +, {n,m}) can be greedy or non-greedy. By default they're greedy: they consume as many characters as possible to make up a match.

Appending ? to a quantifier makes it non-greedy: + is greedy, while +? matches as few characters as possible.
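
The difference is easiest to see on a string with more than one possible end point. A minimal sketch, with a made-up sample string:

```javascript
const html = "<b>Hello</b> <i>world!</i>";

// Greedy: .+ runs to the last > it can find.
console.log(html.match(/<.+>/)[0]);  // → "<b>Hello</b> <i>world!</i>"

// Non-greedy: .+? stops at the first > that completes a match.
console.log(html.match(/<.+?>/)[0]); // → "<b>"
```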

Predefined character classes

Regular expressions provide predefined classes for sets of characters that we often want to match.

💡
Instead of writing [0-9], we can just use \d. Regular expressions have a lot of utility classes for us to use :)

Predefined   Matches
\t           Horizontal tab
[\b]         Backspace (only inside a character class; elsewhere \b is a word boundary)
\n           Newline
.            Any character except line terminators such as \n (broader than \w: it also matches *, &, ^, %, spaces, and so on)
\d           Any decimal digit; equivalent to [0-9]
\D           Any character but a decimal digit; equivalent to [^0-9]
\w           Any alphanumeric character, including underscores; equivalent to [A-Za-z0-9_]. Note: whitespace is not an alphanumeric character
\W           Any character but alphanumeric and underscore characters; equivalent to [^A-Za-z0-9_]
\s           Any whitespace character (space, tab, form feed, and so on), including [ \t\n\r\f\v]
\S           Any character but a whitespace character
[\s\S]       Any character at all, including newlines (broader than .)
\b           A word boundary
\B           A non-word boundary

💡
The difference between \w and . is that \w doesn't match special characters such as $, -, ^, or whitespace.
💡
The difference between . and [\s\S] is that . doesn't match newlines.
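
To see how these classes differ in practice, the sketch below runs several of them over one made-up sample string that contains a space, a tab, and a newline.

```javascript
const sample = "id_42: $9.50\tok\nnext";

console.log(sample.match(/\d+/g)); // → ["42", "9", "50"]
console.log(sample.match(/\w+/g)); // → ["id_42", "9", "50", "ok", "next"]

// \s matches the space, the tab, and the newline.
console.log(sample.match(/\s/g).length); // → 3

// . stops at the newline, [\s\S] does not.
console.log(sample.match(/.+/)[0] === "id_42: $9.50\tok"); // → true
console.log(sample.match(/[\s\S]+/)[0] === sample);         // → true

// \b requires a word boundary on both sides of "ok".
console.log(/\bok\b/.test(sample));  // → true
console.log(/\bnex\b/.test(sample)); // → false
```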

Visual Studio Code

In VS Code's search, \s doesn't match \n. Details are here:

https://github.com/microsoft/vscode/issues/108368

The catch is that \s in VS Code's search matches whitespace but not newlines (\n). If you need a pattern to span several lines, you have to tell VS Code so explicitly, for example with [\s\S]:

<Form[\s\S]*

Capturing matching segments

Performing a single capture

Say we want to extract a value that's embedded in a complex string. A good example of such a string is the value of the CSS transform property, through which we can modify the visual position of an HTML element

💡
Unfortunately, browsers don't provide an API to easily extract the amount by which an element is translated, so we write our own code:
<html>
  <body>
    <div style="transform: translateX(15px)">
      This is a simple HTML + CSS template!
    </div>
    <div id="test1"></div>
    <div id="test2"></div>
    <div id="test3"></div>
  </body>
  <script>
    // Assumes the first div on the page is the translated element
    const styleElement = document.querySelector("div");
    const transformValue = styleElement.style.transform;
    if (transformValue) {
      // Capture everything between the parentheses of translateX(...)
      const match = transformValue.match(/translateX\(([^\)]+)\)/);
      // match[0] === "translateX(15px)"
      // match[1] === "15px"
    }
  </script>
</html>
💡
For a regex without the g flag, the string's match method behaves like exec: it returns an array with the full match followed by the captures.

Match segment

Using a local regular expression (one without the g flag), the String object's match method returns an array containing

  • the entire matched string

  • the matches captured in the regular expression, but for the first match only

const html = "<div class='test'><b>Hello</b> <i>world!</i></div>";
const result = html.match(/<(\/?)(\w+)([^>]+?)>/);
// result: ["<div class='test'>", "", "div", " class='test'"]

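Captures are not returned with a global regex: with the g flag, match returns only the list of full matches.

💡
With a global regex, the result of match is the same with or without the parentheses.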
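
The sketch below contrasts a local match, a global match, and String.prototype.matchAll, which is one way to get the captures of every match. The sample string and variable names are made up for illustration.

```javascript
const pairs = "a1 b2 c3";

// Local: the full first match plus its captures.
const first = pairs.match(/([a-z])(\d)/);
console.log(first[0], first[1], first[2]); // → a1 a 1

// Global: every full match, but no captures.
const fullMatches = pairs.match(/([a-z])(\d)/g);
console.log(fullMatches); // → ["a1", "b2", "c3"]

// matchAll (requires the g flag): captures for every match.
const captures = [...pairs.matchAll(/([a-z])(\d)/g)].map((m) => [m[1], m[2]]);
console.log(captures); // → [["a", "1"], ["b", "2"], ["c", "3"]]
```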

Using exec with array destructuring

JS has a standard class for representing dates

console.log(new Date());
// → Wed Dec 27 2023 11:09:17 GMT+0700 (Indochina Time)

console.log(new Date(2009, 11, 9));
// → Wed Dec 09 2009 00:00:00 GMT+0100 (CET)

console.log(new Date(2009, 11, 9, 12, 59, 59, 999));
// → Wed Dec 09 2009 12:59:59 GMT+0100 (CET)

Referencing captures
💡
JS uses a convention where month numbers start at 0 (so 11 is December), yet day numbers start at 1. It's really confusing.

We can do it better

function getDate(string) {
    const [_,month,day,year]= /(\d{1,2})-(\d{1,2})-(\d{4})/.exec(string);
    return new Date(year,month-1,day);
}
console.log(getDate("1-30-2003"));
// → Thu Jan 30 2003 00:00:00 GMT+0100 (CET)

Capture reference within the replace string

In this code, the value of the first capture (in this case, the capital letter F) is referenced in the replace string (via $1). This allows us to specify a replacement string without even knowing what its value will be until matching time. That’s a powerful ninja-esque weapon to wield.

"fontFamily".replace(/([A-Z])/g,"-$1").toLowerCase();
//font-family
💡
$1 is the first capture group

Replace string

The replace method of the String object is a powerful and versatile method. When a regular expression is provided as the first parameter to replace, it replaces a match (or all matches, if the regex is global) of the pattern rather than a fixed string.

Simple form

// Replace with a simple string: only the first occurrence changes
console.log("papa".replace("p", "m"));
// → mapa
"ABCDEF".replace(/[A-Z]/, "X");
// → XBCDEF
"ABCDEF".replace(/[A-Z]/g, "X");
// → XXXXXX

Swap string

"Liskov, Barbara\nMcCarthy, John\nWadler, Philip".replace(/(\w+), (\w+)/g,"$2 $1")
// Barbara Liskov
// John McCarthy
// Philip Wadler

Replacing with a function

"border-bottom-width".replace(/-(\w)/g,(all,letter)=>{
   return letter.toUpperCase();
});
// borderBottomWidth
// The function is called twice: first with all = "-b", letter = "b",
// then with all = "-w", letter = "w"

We can use replace function to iterate over a string as well

💡
If we use a local match with this exact regular expression, we only get the first pair (foo=1) with its captures foo and 1. If we use a global match, we get foo=1, foo=2, blah=a, blah=b, foo=3, but without the captures.
// data: foo=1&foo=2&blah=a&blah=b&foo=3

// expected: "foo=1,2,3&blah=a,b",

function compress(source) {
  const keys = {};
  // replace is used only to visit every key=value pair
  source.replace(/([^=&]+)=([^&]*)/g, function (full, key, value) {
    // First call: full = "foo=1", key = "foo", value = "1"
    keys[key] = (keys[key] ? keys[key] + "," : "") + value;
    return "";
  });

  const result = [];
  for (const key in keys) {
    result.push(key + "=" + keys[key]);
  }
  return result.join("&");
}

The most interesting aspect of this example is its use of the string replace method as a means of traversing a string for values rather than as a search-and-replace mechanism. The trick is twofold: passing in a function as the replacement value argument, and instead of returning a value, using it as a means of searching

With a global match alone, we would only get a list of full matches, without the captured keys and values.

The search method

The indexOf method on strings cannot be called with a regular expression. But there is another method, search, that does expect a regular expression. Like indexOf, it returns the first index on which the expression was found, or -1 when it wasn’t found.

console.log("  word".search(/\S/));
// → 2
console.log("    ".search(/\S/));
// → -1

However, unlike indexOf, search offers no way to indicate where the match should start.
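
If a start position is needed, one possible workaround is to use exec on a global copy of the pattern and set its lastIndex first. The searchFrom helper below is hypothetical, a sketch rather than a standard API, and it assumes the pattern does not use the sticky (y) flag.

```javascript
// Hypothetical helper: like search, but starts looking at index `start`.
function searchFrom(text, pattern, start) {
  // lastIndex is only honoured by global (or sticky) regexes, so add g if needed.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const re = new RegExp(pattern.source, flags);
  re.lastIndex = start;
  const match = re.exec(text);
  return match ? match.index : -1;
}

console.log(searchFrom("one 1 two 22", /\d+/, 0));  // → 4
console.log(searchFrom("one 1 two 22", /\d+/, 5));  // → 10
console.log(searchFrom("one 1 two 22", /\d+/, 12)); // → -1
```

Working on a copy keeps the caller's regex untouched, since exec would otherwise update its lastIndex.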

Solving common problem with regular expressions

Matching newlines

When performing a search, it’s sometimes desirable for the period (.) term, which matches any character except for newline, to also include newline characters. Regular expression implementations in other languages frequently include a flag for making this possible. JavaScript gained one only in ES2018 (the s, or dotAll, flag), so older code relies on the workarounds below.

const html = "<b>Hello</b>\n<i>world!</i>";
/.*/.exec(html)[0] === "<b>Hello</b>"; // true

/[\S\s]*/.exec(html)[0] === "<b>Hello</b>\n<i>world!</i>"; // true
💡
There's an interesting combination here: we define a character class that matches anything that's not whitespace and anything that is whitespace. This union is the set of all characters.

Another approach is alternation ( | )

/(?:.|\s)*/.exec(html)[0] === "<b>Hello</b>\n<i>world!</i>";
💡
We match everything that . matches, plus whitespace, which includes newlines. The result is the set of all characters. Note the use of a passive subexpression (?:...) to prevent any unintended captures. Because of its simplicity (and implicit speed benefits), the [\S\s] approach is generally the better choice.
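
In environments that support ES2018, the s (dotAll) flag gives the same result without a workaround. A minimal sketch comparing the three options on the same string:

```javascript
const html = "<b>Hello</b>\n<i>world!</i>";

// Without the s flag, . stops at the newline.
console.log(/.*/.exec(html)[0]); // → "<b>Hello</b>"

// With the s flag, . matches newlines as well.
console.log(/.*/s.exec(html)[0] === html); // → true

// The character-class workaround gives the same result.
console.log(/[\S\s]*/.exec(html)[0] === html); // → true
```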

Exercises

  1. In JavaScript, regular expressions can be created with which of the following?

    a   Regular expression literals

    b   The built-in RegExp constructor

    c The built-in RegularExpression constructor

  2. Which of the following is a regular expression literal?

    a /test/
    b \test\
    c new RegExp("test");

  3. Choose the correct regular expression flags:

    a /test/g
    b g/test/
    c new RegExp("test", "gi");

  4. The regular expression /def/ matches which of the following strings?

    a One of the strings d, e, or f

    b def

    c de

  5. The regular expression /[^abc]/ matches which of the following?

    a One of strings a, b, c

    b   One of strings d, e, f

    c Matches the string ab

  6. Which of the following regular expressions matches the string hello?

    a /hello/
    b /hell?o/

    c /helo/
    d /[hello]/

  7. The regular expression /(cd)+(de)*/ matches which of the following strings?

    a cd
    b de
    c cdde
    d cdcd
    e ce
    f cdcddedede

  8. In regular expressions, we can express alternatives with which of the following?

    a #

    b &

    c |

  9. In the regular expression /([0-9])2/, we can reference the first matched digit with which of the following?

    a /0

    b /1

    c \0

    d \1

  10. The regular expression /([0-5])6\1/ will match which of the following?

    a 060

    b 16

    c 261

    d 565

  11. The regular expression /(?:ninja)-(trick)?-\1/ will match which of the following?

    a ninja-
    b ninja-trick-ninja

    c ninja-trick-trick

  12. What is the result of executing "012675".replace(/[0-5]/g, "a")?

    a aaa67a

    b a12675

    c a1267a

Practical use-case with Regular expression

Fixing self-closing tags

Not every HTML element can be written as a self-closing tag, so we can use a JS regular expression to correct the ones that can't. Some HTML elements are self-closing and never have a closing tag:

  • area

  • base

  • img

  • link

  • menuitem

  • meta

  • source

  • etc

💡
The idea is that if users pass in a self-closing form of a tag that isn't really self-closing, we should be able to correct it. For example, <a/> should be corrected into <a></a>
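
One possible way to implement this idea is a replace with a callback. This is a sketch under assumptions: the function name convertSelfClosing is made up, and the list of self-closing tags is taken from the bullets above plus a few common ones (br, hr, input), so it is not exhaustive.

```javascript
// Tags that are allowed to stay self-closing (assumed, not exhaustive).
const selfClosingTags = /^(area|base|br|hr|img|input|link|menuitem|meta|source)$/i;

// Hypothetical helper: expands <tag .../> into <tag ...></tag> where needed.
function convertSelfClosing(html) {
  // front = everything before "/>", tag = the element name
  return html.replace(/(<(\w+)[^>]*?)\/>/g, (all, front, tag) => {
    return selfClosingTags.test(tag) ? all : front + "></" + tag + ">";
  });
}

console.log(convertSelfClosing("<a/>"));                    // → "<a></a>"
console.log(convertSelfClosing('<img src="x.png"/>'));      // → '<img src="x.png"/>'
console.log(convertSelfClosing('<div class="box"/><br/>')); // → '<div class="box"></div><br/>'
```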

HTML wrapping

According to the semantics of HTML, some HTML elements must be within certain container elements before they can be injected.

Element name                                       Ancestor element
<option>, <optgroup>                               <select multiple>...</select>
<legend>                                           <fieldset>...</fieldset>
<thead>, <tbody>, <tfoot>, <colgroup>, <caption>   <table>...</table>
<tr>                                               <table><thead>...</thead></table>, <table><tbody>...</tbody></table>, ...
<td>, <th>                                         <table><tbody><tr>...</tr></tbody></table>
<col>                                              <table><tbody></tbody><colgroup>...</colgroup></table>

<option>Yoshi</option>
<option>Kuma</option>
<!-- We want this to turn into this -->
<select multiple>
  <option>Yoshi</option>
  <option>Kuma</option>
</select>
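
A rough sketch of how this wrapping could be driven by a regular expression: find the name of the first tag, then look up its wrapper. The wrappers map and the wrapForInjection name are hypothetical, and only one possible ancestor per element is listed.

```javascript
// Hypothetical lookup: element name → [opening wrapper, closing wrapper].
const wrappers = new Map([
  ["option", ["<select multiple>", "</select>"]],
  ["optgroup", ["<select multiple>", "</select>"]],
  ["legend", ["<fieldset>", "</fieldset>"]],
  ["thead", ["<table>", "</table>"]],
  ["tbody", ["<table>", "</table>"]],
  ["tfoot", ["<table>", "</table>"]],
  ["colgroup", ["<table>", "</table>"]],
  ["caption", ["<table>", "</table>"]],
  ["tr", ["<table><tbody>", "</tbody></table>"]],
  ["td", ["<table><tbody><tr>", "</tr></tbody></table>"]],
  ["th", ["<table><tbody><tr>", "</tr></tbody></table>"]],
  ["col", ["<table><tbody></tbody><colgroup>", "</colgroup></table>"]],
]);

function wrapForInjection(html) {
  // Capture the name of the first tag in the fragment.
  const match = /^\s*<(\w+)/.exec(html);
  const wrapper = match ? wrappers.get(match[1].toLowerCase()) : undefined;
  return wrapper ? wrapper[0] + html + wrapper[1] : html;
}

console.log(wrapForInjection("<option>Yoshi</option><option>Kuma</option>"));
// → "<select multiple><option>Yoshi</option><option>Kuma</option></select>"
console.log(wrapForInjection("<p>No wrapper needed</p>"));
// → "<p>No wrapper needed</p>"
```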

A simple method for accessing style

Accessing a CSS property by its CSS name is different from accessing it by its JS name.

Consider this example

const fontSize = element.style['font-size'];

The preceding works. But the following doesn't, because JS parses the hyphen as a minus sign:

const fontSize = element.style.font-size;

Rather than forcing every developer to remember the JS naming rules, why don't we write a simple abstraction and let everyone use the form they prefer?

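function style(element, name, value) {
    // Convert a hyphenated name (font-size) to camelCase (fontSize)
    name = name.replace(/-([a-z])/ig, (all, letter) => {
      return letter.toUpperCase();
    });
    // Act as a setter only when a value is passed in
    if (typeof value !== "undefined") {
          element.style[name] = value;
    }
    return element.style[name];
}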

Fetching computed style

Next, let's normalize property names for the computed-style API. getPropertyValue expects the CSS property name, but we'd also like to support passing the JS-style name, for example fontSize.

function fetchComputedStyle(element,property) {
    const computedStyle = getComputedStyle(element);
    if(computedStyle) {
        property = property.replace(/([A-Z])/g,'-$1').toLowerCase();
        return computedStyle.getPropertyValue(property);
    }
}
fetchComputedStyle(document.querySelector("div"), "fontSize");
💡
There are many use cases like these that regular expressions solve elegantly. Having an understanding of regex can level up our skills as FE developers.