Deep dive into String in JS

String, string and string

Just like the blog on collections I have written here, this time we'll go deep into the foundations, and some more advanced topics, of primitive values in JS. For me, writing a piece of code is easy; fully understanding it is the hard part

Avoid Object wrapper types

In addition to objects, JS has seven types of primitive values: string, number, boolean, null, undefined, symbol, and bigint

Primitives are distinguished from objects by being immutable and not having methods. You might object that strings do have methods

"primitive".charAt(3)
// m

But things are not quite as they seem. There's actually something surprising and subtle going on here. While a string primitive does not have methods, JS also defines a String object type that does. JS freely converts between these types without our permission

💡
When you call a method on a string primitive, JS wraps it in a String object, calls the method, and then throws the object away
const originalCharAt = String.prototype.charAt;
String.prototype.charAt = function (pos) {
    console.log(this, pos); // in non-strict code, `this` is the temporary String wrapper
    return originalCharAt.call(this, pos); // return the result so charAt keeps working
};
"Test".charAt(3);
// logs: String {'Test'} 3
💡
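This can cause surprises, because the wrapper object created for a property access is discarded right after that access
const x = "hello";
x.language = "English"; // set on a temporary String object (strict mode throws a TypeError instead)
// English (the value of the assignment expression)
x.language;
// undefined (the temporary wrapper that held `language` was already thrown away)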

Wrapper types

These wrapper types exist as a convenience: they provide the methods available on primitive values and host static methods (such as String.fromCharCode). But there is no reason to instantiate them directly

  • string and String

  • number and Number

  • boolean and Boolean

  • symbol and Symbol

  • bigint and BigInt

💡
Avoid instantiating wrapper types directly. They exist only for convenience
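To make the advice above concrete, here is a small sketch of how a primitive and a wrapper object differ. The variable names are only illustrative, and `new String(...)` appears here purely to show the pitfall, not as something to use in real code.

```javascript
const primitive = "hello";
const wrapped = new String("hello"); // shown only to illustrate the pitfall

const typeOfPrimitive = typeof primitive; // "string"
const typeOfWrapped = typeof wrapped;     // "object"

// A wrapper object is only ever strictly equal to itself
const strictEqual = primitive === wrapped; // false
const looseEqual = primitive == wrapped;   // true (the wrapper is converted to a primitive)
const twoWrappers = new String("hello") == new String("hello"); // false (two different objects)

// Calling String without `new` converts to a primitive instead
const converted = String(123); // "123"

console.log(typeOfPrimitive, typeOfWrapped, strictEqual, looseEqual, twoWrappers, converted);
```

The comparison results are the main reason to stay away from wrapper objects: two strings with the same content stop being equal as soon as they are wrapped.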

String

In JavaScript, the textual data is stored as strings. There is no separate type for a single character.

Quotes

A string can be enclosed within either single quotes, double quotes, or backticks.

const single = 'single-quoted';
const double = "double-quoted";

let backticks = `backticks`;

// single and double quotes are essentially the same
backticks =`${backticks}`; // backticks allows expressions

let guestList = `Multiple line:
I can have a line-break`;
console.log(guestList);

// Single and double quotes cannot span multiple lines like this
💡
Single and double quotes come from ancient times of language creation, when the need for multiline strings was not taken into account. Backticks appeared much later and are thus more versatile.

Special characters

The set of characters we can type directly is limited (I'm speaking from the point of view of Western languages), so some characters are written as an escape sequence: a backslash followed by one or more characters

Character      Description
\n             New line
\r             Carriage return (Windows text files use \r\n to represent a line break)
\', \", \`     Quotes
\\             Backslash
\t             Tab
\b, \f, \v     Backspace, form feed, and vertical tab (not used nowadays)
💡
The escaping rules are very similar to those in regular expressions
alert( `The backslash: \\` ); // The backslash: \

// We have to use \' here; otherwise the quote would be read as the closing '
alert( 'I\'m the Walrus!' ); // I'm the Walrus!

String length

The length property holds the string length. It is a property, not a method, and a primitive gets access to it through the String wrapper object

alert( `My\n`.length ); // 3

Accessing character

let str = `Hello`;

// the first character
alert( str[0] ); // H
alert( str.at(0) ); // H

// the last character
alert( str[str.length - 1] ); // o
alert( str.at(-1) ); // o
💡
The at method has the benefit of allowing negative positions. If pos is negative, it's counted from the end of the string
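A quick sketch of how the access styles behave with negative or out-of-range positions. The variable names are only for illustration, and charAt is included for comparison.

```javascript
const word = "Hello";

// Square brackets do not understand negative positions:
// word[-1] looks up a property literally named "-1"
const bracketNegative = word[-1];  // undefined
const atNegative = word.at(-1);    // "o"
const atSecondLast = word.at(-2);  // "l"

// Out of range: [] and at() give undefined, charAt() gives an empty string
const bracketOutOfRange = word[10];       // undefined
const atOutOfRange = word.at(10);         // undefined
const charAtOutOfRange = word.charAt(10); // ""

console.log(bracketNegative, atNegative, atSecondLast, charAtOutOfRange === "");
```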

Iterator methods

Strings implement the iterable protocol, so we can loop over every character of a string

for( let char of "Hello" ) {
    console.log(char);
}

Strings are immutable

let str = 'Hi';

str[0] = 'h'; // silently ignored (throws a TypeError in strict mode)
alert( str[0] ); // H, the string was not changed

// workaround
str = 'h' + str[1]; // replace the whole string with a new value
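The workaround above can be generalized. Below is a minimal sketch of a helper, here called `replaceAt` (a made-up name, not a built-in), that builds a new string with one position changed. It works on UTF-16 code units, so it assumes the replaced character is not part of a surrogate pair.

```javascript
// Hypothetical helper: returns a NEW string, the original is never modified
function replaceAt(text, index, replacement) {
  return text.slice(0, index) + replacement + text.slice(index + 1);
}

const original = "Hi";
const changed = replaceAt(original, 0, "h");

console.log(original); // still "Hi"
console.log(changed);  // "hi"
```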

Changing the case

alert('Interface'.toUpperCase() ); // INTERFACE
alert('Interface'.toLowerCase() ); // interface

String API

Searching for a substring

There are multiple ways to search for a sub-string

str.indexOf(substr, pos)

💡
Each character of a string sits at a numeric index, so a match will of course have a position :p
let str = 'Widget with id';

alert( str.indexOf('Widget') ); // 0, because 'Widget' is found at the beginning
alert( str.indexOf('widget') ); // -1, not found, the search is case-sensitive

alert( str.indexOf("id") ); // 1, "id" is found at the position 1 (..idget with id)

If we're interested in all occurrences, much like a regular expression with the /g flag, we can use this trick

let str = 'As sly as a fox, as strong as an ox';
let target = "as";
let position = 0;

while (true) {
   const idx = str.indexOf(target, position);
   if (idx === -1) break;
   console.log(idx); // 7, then 17, then 27
   position = idx + 1; // continue the search from the next position
}

// Another way: assign and test inside the loop condition
let pos = -1;
while ((pos = str.indexOf(target, pos + 1)) !== -1) {
    console.log(pos);
}
💡
It's slightly inconvenient that the test has to compare the result with -1
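The loop above can be wrapped into a reusable function. This is only a sketch; `findAllIndexes` is a made-up name, and the choice to step forward by one character (so overlapping matches are also found) is an assumption about what you want.

```javascript
// Hypothetical helper: collects the position of every occurrence of target
function findAllIndexes(text, target) {
  const indexes = [];
  if (target === "") return indexes; // an empty target would match at every position

  let from = 0;
  while (true) {
    const idx = text.indexOf(target, from);
    if (idx === -1) break;
    indexes.push(idx);
    from = idx + 1; // step by one so overlapping matches are found too
  }
  return indexes;
}

console.log(findAllIndexes("As sly as a fox, as strong as an ox", "as")); // [ 7, 17, 27 ]
console.log(findAllIndexes("aaa", "aa")); // [ 0, 1 ]
```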

Sometimes we just want to know whether the substring we're looking for exists or not

if(str.indexOf(target,position)!== -1) { }

includes, startsWith, endsWith (yes or no)

💡
Strings and arrays both have an includes method

The more modern method str.includes(substr, pos) returns true/false depending on whether str contains substr within.

💡
if("I'm Vince".includes("Vince")) { console.log("We found Vince") }

startsWith is like /^text/ in regex

endsWith is like /text$/ in regex

column.name.startsWith('attribute'); // true when column.name begins with "attribute" (column is a hypothetical object)
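All three methods also accept an optional position argument. Here is a short sketch using the sample string from the indexOf section; the variable names are only illustrative.

```javascript
const sample = "Widget with id";

const hasId = sample.includes("id");             // true
const hasIdAfter3 = sample.includes("id", 3);    // true, the second "id" starts at index 12
const widgetOnly = "Widget".includes("id", 3);   // false, nothing matches from index 3 on

const startsWithWid = sample.startsWith("Wid");      // true
const withAt7 = sample.startsWith("with", 7);        // true, checks starting at index 7
const endsWithId = sample.endsWith("id");            // true
const endsAsWidget = sample.endsWith("Widget", 6);   // true, treats the string as if it were 6 long

console.log(hasId, hasIdAfter3, widgetOnly, startsWithWid, withAt7, endsWithId, endsAsWidget);
```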

Chopping a string

There are 3 methods in JavaScript to get a substring: substring, substr and slice.

Which one should I use?

All three are very similar and can do the job. But in practice, slice is a little more flexible: it allows negative arguments and is shorter to write. substr is also considered a legacy feature, so slice and substring are the safer choices.
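To see where the three methods differ, here is a small comparison on one sample string. The variable names are only for illustration.

```javascript
const text = "stringify";

// slice(start, end): negative values count from the end
const sliceBasic = text.slice(0, 5);       // "strin"
const sliceNegative = text.slice(-4, -1);  // "gif"
const sliceReversed = text.slice(6, 2);    // "" (start after end gives an empty string)

// substring(start, end): swaps the arguments if start > end, treats negatives as 0
const substringBasic = text.substring(2, 6);     // "ring"
const substringSwapped = text.substring(6, 2);   // "ring"
const substringNegative = text.substring(-3, 2); // "st"

// substr(start, length): the second argument is a length, not an end position (legacy)
const substrBasic = text.substr(2, 4); // "ring"

console.log(sliceBasic, sliceNegative, substringBasic, substringSwapped, substringNegative, substrBasic);
```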

Unicode is a friendly term

Let's talk about how data is stored.

Data is stored as bits: a sequence of 0s and 1s

0111 0110 1000 1001
// Whether it's a number or a character, in the end it is translated
// to a binary number to be stored on disk or in RAM (memory)

// so we need a rule that maps character <-> number <-> binary

character   number    binary code
  d         100       0110 0100

ASCII (American Standard Code for Information Interchange)

💡
ASCII uses 7 bits per character. Why not 8? For historical reasons: ASCII was designed as a 7-bit code before 8-bit bytes became the norm, so it does not use the full range of one byte (8 bits)

A = 65
1 0 0 0 0 0 1 (7 BITS)
a = 97
1 1 0 0 0 0 1 (7 BITS)
💡
With ASCII, one character is stored in 1 byte (8 bits), but only 7 of those bits are used, which limits it to 2^7 = 128 characters. That is enough mainly for English text

To convert a string to binary code to store on the computer

String       H        e        l        l        o
ASCII value  72       101      108      108      111
Binary code  01001000 01100101 01101100 01101100 01101111
Each binary code is 8 bits (1 byte) of data

For example, for the string "Vince" we get string.length = 5
Here 5 is the length of the string, and in ASCII it is also the 5 bytes of data it occupies
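The table above can be reproduced in code. This is a minimal sketch; `toAsciiBinary` is a made-up helper name, and it deliberately rejects anything outside the 7-bit ASCII range.

```javascript
// Hypothetical helper: converts each ASCII character to its 8-bit binary form
function toAsciiBinary(text) {
  return Array.from(text, (ch) => {
    const code = ch.charCodeAt(0);
    if (code > 127) {
      throw new RangeError(`Not an ASCII character: ${ch}`);
    }
    // toString(2) drops leading zeros, so pad back to a full byte
    return code.toString(2).padStart(8, "0");
  });
}

console.log(toAsciiBinary("Hello").join(" "));
// 01001000 01100101 01101100 01101100 01101111
```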
💡
English characters are not enough for the entire world

Unicode for rest of the world

Non-western characters in the world:

  • Chinese, Japanese , Arabic and so much more

  • Icons and symbols

  • Emoji

Every character can be mapped to a code point (a numeric value)

Keep in mind that Unicode takes a different approach from ASCII: it does not prescribe the binary digits. All it says is that a given icon or Vietnamese character has a certain number (its code point). Turning that number into bytes is the job of an encoding such as UTF-8, UTF-16, or UTF-32

💡
The web has settled on UTF-8

UTF

UTF (Unicode Transformation Format) is a family of encodings. They agree with ASCII on the basics: the character A is still 65. With up to 32 bits (4 bytes) per character there is room for 2^32 = 4,294,967,296 values (about 4.3 billion), far more than the 1,114,112 code points Unicode allows, so hopefully it will be enough for the languages of the entire world ^^

I can see one obvious problem that we might face

  • To represent the character A in 32 bits we need 25 leading zeros followed by 1 0 0 0 0 0 1. That is a waste of memory, especially since most English text only needs 7 bits per character

UTF-32 (32 bits = 4 byte)

UTF-32 takes each code point and stores it directly as a 32-bit binary number

2^32 equals approximately 4.3 billion possible values.

Character    Code point    UTF-32 encoding (hex)
H            72            00 00 00 48
e            101           00 00 00 65
💡
Pros: Every character has the same fixed size, and the numeric values of ASCII characters stay the same (A is still 65)
💡
Cons: It wastes memory

UTF-8 ( from 1 to 4 byte)

UTF-8 is more flexible: a character takes from 1 to 4 bytes, and ASCII characters (which cover most English text) take only the 1 byte they need

💡
Because the size is variable, characters do not start at predictable byte offsets, so a program has to scan the bytes to find the n-th character. This can affect performance, for example when a search engine parses words

Is it unfair to the other languages that they have to store more bytes?

💡
Yes. English is the dominant language, and that holds in computer science too
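The variable size of UTF-8 is easy to observe with the standard TextEncoder, which always encodes to UTF-8. This is just a sketch; `utf8ByteLength` is a made-up helper name, and the non-ASCII characters are written as escapes so the source file encoding does not matter.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Hypothetical helper: number of bytes a string needs in UTF-8
function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

console.log(utf8ByteLength("A"));         // 1 byte  (ASCII)
console.log(utf8ByteLength("\u00E9"));    // 2 bytes (é, a Latin letter with an accent)
console.log(utf8ByteLength("\u4E2D"));    // 3 bytes (中, a common Chinese character)
console.log(utf8ByteLength("\u{1F602}")); // 4 bytes (😂, an emoji)
console.log(utf8ByteLength("Hello"));     // 5 bytes, same as ASCII
```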

Unicode and String internals

As we already know, JavaScript strings are based on Unicode: internally each character is stored as one or two 16-bit code units (2 or 4 bytes)

JavaScript allows us to insert a character into a string by specifying its hexadecimal Unicode code with one of these three notations:

Represent 2 bytes in JS

// \xXX: XX is exactly two hex digits (00-FF), so it covers at most 1 byte.
// It can be used for the first 256 Unicode characters (the first 128 of them are ASCII)
console.log("\x7A"); // z
console.log("\xA9"); // ©

// \uXXXX: XXXX is exactly four hex digits, i.e. one 16-bit (2-byte) UTF-16 code unit.
// It supports a much wider range of characters
console.log("\u00A9"); // ©

// \u{X…XXXXXX}: 1 to 6 hex digits, for code points that need more than 2 bytes
// (these are stored as surrogate pairs)
alert( "\u{20331}" ); // 佫, a rare Chinese character (long Unicode)
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
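The three notations are just different spellings of the same string values. Here is a small sketch that checks this and goes back from a character to its code point with codePointAt; the variable names are only illustrative.

```javascript
// Different escapes, same resulting string
const sameZ = "\x7A" === "z";                      // true
const sameCopyright = "\xA9" === "\u00A9";         // true
const samePair = "\u{1F60D}" === "\uD83D\uDE0D";   // true, the long form equals its surrogate pair

// From a character to its code point and back
const codePointHex = "\u{1F60D}".codePointAt(0).toString(16); // "1f60d"
const rebuilt = String.fromCodePoint(0x1F60D);                // "😍"

// charCodeAt only sees one 16-bit code unit (the first half of the pair)
const firstUnitHex = "\u{1F60D}".charCodeAt(0).toString(16);  // "d83d"

console.log(sameZ, sameCopyright, samePair, codePointHex, rebuilt, firstUnitHex);
```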

UTF-16 uses 16-bit (2-byte) code units

Initially, JavaScript was based on UTF-16 encoding, which only allowed 2 bytes per character. But 2 bytes only allow 65536 combinations, and that’s not enough for every possible symbol of Unicode.

So rare symbols that require more than 2 bytes are encoded with a pair of 2-byte characters called a “surrogate pair”.

console.log( '𝒳'.length ); // 2, MATHEMATICAL SCRIPT CAPITAL X
console.log( '😂'.length ); // 2, FACE WITH TEARS OF JOY
console.log( '𩷶'.length ); // 2, a rare Chinese character
  1. MATHEMATICAL SCRIPT CAPITAL X ('𝒳')

    • The character '𝒳' is a mathematical script capital X and is represented by a Unicode character outside the Basic Multilingual Plane (BMP). It has a Unicode code point U+1D4B3.

    • Characters outside the BMP are represented using surrogate pairs in JavaScript. In this case, '𝒳' is represented by two code units (surrogate pair): '\uD835\uDCB3'. (4 bytes)

    • When you use the length property on a string containing this character, it counts the number of code units, which is 2. Hence, alert( '𝒳'.length ); results in 2.

  2. FACE WITH TEARS OF JOY ('😂')

    • The emoji '😂' is a character outside the BMP as well, represented by a surrogate pair: '\uD83D\uDE02'.

    • The length property counts the number of code units, so alert( '😂'.length ); also results in 2.

  3. A RARE CHINESE CHARACTER ('𩷶')

    • The Chinese character '𩷶' is another example of a character outside the BMP, represented by a surrogate pair: '\uD869\uDE36'.

    • The length property counts the number of code units, so alert( '𩷶'.length ); also results in 2.

In summary, characters outside the Basic Multilingual Plane (BMP) are represented using surrogate pairs in JavaScript, where each character is represented by two code units. When you use the length property on a string containing such characters, it counts the number of code units, not the number of visible characters or glyphs.
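To count code points instead of code units, we can rely on the string iterator, which walks a string code point by code point. The sketch below also shows the arithmetic behind a surrogate pair. Both `countCodePoints` and `toSurrogatePair` are made-up helper names.

```javascript
// Hypothetical helper: spreading uses the string iterator, which yields whole code points
function countCodePoints(text) {
  return [...text].length;
}

// Hypothetical helper: splits a code point above U+FFFF into its two UTF-16 code units
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("Only code points from U+10000 to U+10FFFF need a surrogate pair");
  }
  const offset = codePoint - 0x10000;     // a 20-bit number
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const mixed = "a\u{1F602}b"; // "a😂b"
console.log(mixed.length);           // 4 code units
console.log(countCodePoints(mixed)); // 3 code points

const pairHex = toSurrogatePair(0x1D4B3).map((unit) => unit.toString(16));
console.log(pairHex); // [ 'd835', 'dcb3' ]
```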

If I have 1 MB, how many characters can I have?

If every character in your text can be represented with a single 16-bit code unit (which is true for most common characters), then each character will take 2 bytes. In this case, you can calculate the number of characters you can store in 1 megabyte (1 MB) as follows:

1 MB = 1,048,576 bytes

Number of characters = (1,048,576 bytes) / (2 bytes/character)

Number of characters = 524,288 characters

However, if your text includes characters that require surrogate pairs, then those characters take 4 bytes each (two 16-bit code units). In the extreme case where every character needs a surrogate pair, the number of characters you can store in 1 MB is halved, as each character takes up twice as much space:

Number of characters = (1,048,576 bytes) / (4 bytes/character)

Number of characters = 262,144 characters

So, depending on the nature of the characters in your text, 1 MB holds between 262,144 characters (if every character requires a surrogate pair) and 524,288 characters (if every character fits in a single 16-bit code unit).
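The arithmetic above can be written down as a short sketch. It assumes the simple model used in this section, where every code unit costs 2 bytes; real engines are free to store strings differently internally. `utf16ByteLength` is a made-up helper name.

```javascript
const BYTES_PER_CODE_UNIT = 2;
const ONE_MB = 1024 * 1024; // 1,048,576 bytes

// Hypothetical helper: size of a string under the "2 bytes per code unit" model
function utf16ByteLength(text) {
  return text.length * BYTES_PER_CODE_UNIT;
}

const maxSingleUnitChars = ONE_MB / BYTES_PER_CODE_UNIT;           // 524288
const maxSurrogatePairChars = ONE_MB / (2 * BYTES_PER_CODE_UNIT);  // 262144

console.log(utf16ByteLength("Vince"));     // 10 bytes
console.log(utf16ByteLength("\u{1F602}")); // 4 bytes (one emoji, two code units)
console.log(maxSingleUnitChars, maxSurrogatePairChars);
```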

Compare strings

Strings are compared character-by-character in alphabetical order

Although there are some corner cases:

  1. A lowercase letter is always greater than an uppercase one
alert( 'a' > 'Z' ); // true
  2. Letters with diacritical marks are "out of order."
alert( 'Österreich' > 'Zealand' ); // true
// Ö should come before Z alphabetically, but its character code is larger

To understand what happens, we should be aware that strings in JavaScript are encoded using UTF-16. That is, each character has a corresponding numeric code, and comparison operators compare those codes.

"A".charCodeAt(0); //65
"a".charCodeAt(0); //97

String.fromCodePoint(code) goes the other way: it creates a character from its numeric code

let str = '';

for (let i = 65; i <= 220; i++) {
  str += String.fromCodePoint(i);
}
alert( str );
// Output:
// ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
// (codes 127-160 are control characters and a non-breaking space, which do not display)
// ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜ
💡
It should now be obvious why a > Z: the code of a (97) is greater than the code of Z (90)
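The same code-based ordering is what Array.prototype.sort uses by default, which is a common source of surprises. A short sketch, with illustrative variable names and Ö written as an escape:

```javascript
// Default sort compares UTF-16 code units, not alphabetical position
const letters = ["b", "a", "Z", "\u00D6"]; // "\u00D6" is Ö
const sortedByCode = [...letters].sort();

const codes = sortedByCode.map((ch) => ch.charCodeAt(0));

console.log(sortedByCode); // [ 'Z', 'a', 'b', 'Ö' ]
console.log(codes);        // [ 90, 97, 98, 214 ]
```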

Correct comparisons

The “right” algorithm to do string comparisons is more complex than it may seem because alphabets are different for different languages.

So, the browser needs to know the language to compare.

Luckily, modern browsers support the internationalization standard ECMA-402.

It provides a special method to compare strings in different languages, following their rules.

💡
"str1".localeCompare("str2", locales). If the locales value isn't set, the default locale of your browser is used
"a".localeCompare("b"); // negative (typically -1): a comes before b
"b".localeCompare("a"); // positive (typically 1): b comes after a

"a".localeCompare("A"); // typically negative: the result follows the locale's rules, not the difference of codes (97 - 65)

"a".charCodeAt(); // 97
"A".charCodeAt(); // 65

alert( 'Österreich'.localeCompare('Zealand') ); // -1
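Here is a sketch of sorting with localeCompare compared with the default sort. The "de" locale and the sample list are only assumptions for illustration, and the exact numbers returned by localeCompare can vary between engines, so the code only relies on their sign.

```javascript
const countries = ["Zealand", "\u00D6sterreich", "Andorra"]; // "\u00D6" is Ö

// Default sort: by code units, so Ö (214) lands after Z (90)
const byCodeUnits = [...countries].sort();

// Locale-aware sort: Ö is ordered next to O, before Z
const byLocale = [...countries].sort((a, b) => a.localeCompare(b, "de"));

// Intl.Collator exposes the same comparison and is convenient for repeated use
const collator = new Intl.Collator("de", { sensitivity: "base" });
const sameBaseLetter = collator.compare("a", "A") === 0; // true: "base" ignores case

console.log(byCodeUnits); // [ 'Andorra', 'Zealand', 'Österreich' ]
console.log(byLocale);    // [ 'Andorra', 'Österreich', 'Zealand' ]
console.log(sameBaseLetter);
```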

Regular expression with a string?

I've covered this in depth in this blog

There are several other helpful methods for strings:

  • str.trim() – removes (“trims”) spaces from the beginning and end of the string.

  • str.repeat(n) – repeats the string n times.
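A quick sketch of both methods. trimStart and trimEnd are included as the one-sided variants of trim; the variable names are only illustrative.

```javascript
const padded = "  hello  ";

const trimmed = padded.trim();        // "hello"
const leftOnly = padded.trimStart();  // "hello  "
const rightOnly = padded.trimEnd();   // "  hello"

const repeated = "ab".repeat(3);      // "ababab"
const divider = "-".repeat(10);       // "----------"

console.log(`[${trimmed}]`, `[${leftOnly}]`, `[${rightOnly}]`, repeated, divider);
```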

Summary

  • There are three types of quotes. Backticks allow a string to span multiple lines and embed expressions with ${…}.

  • We can use special characters, such as a line break \n.

  • To get a character, use: [] or the at method.

  • To get a substring, use: slice or substring.

  • To lowercase/uppercase a string, use: toLowerCase/toUpperCase.

  • To look for a substring, use: indexOf, or includes/startsWith/endsWith for simple checks.

  • To compare strings according to the language, use: localeCompare, otherwise they are compared by character codes.

Exercise with strings