Like my earlier blog post on collections, this one goes deep into the foundations, and some more advanced topics, of primitive values in JS. For me, writing a piece of code is easy; actually understanding it fully is the hard part.
Avoid Object wrapper types
In addition to objects, JS has seven types of primitive values: string, number, boolean, null, undefined, symbol, and bigint.
Primitives are distinguished from objects by being immutable and not having methods. You might object that strings do seem to have methods:
"primitive".charAt(3)
// m
But things are not quite as they seem. There's actually something surprising and subtle going on here. While a string primitive does not have methods, JS also defines a String object type that does. JS converts between these types automatically: when you access a method on a string primitive, it wraps the value in a String object, calls the method, and then throws the object away.
const originalCharAt = String.prototype.charAt;
String.prototype.charAt = function (pos) {
  console.log(this, pos);
  // return the result, otherwise every charAt call would give undefined
  return originalCharAt.call(this, pos);
};
"Test".charAt(3);
// String {'Test'} 3
// In non-strict code, `this` is a String wrapper object, not the primitive "Test"
const x = "hello";
x.language = "English";
// "English" (ignored silently in non-strict code; a TypeError in strict mode)
x.language;
// undefined (the property was set on a temporary String wrapper, which was thrown away)
Wrapper types
These wrapper types exist as a convenience: they provide the methods on primitive values and hold static methods (such as String.fromCharCode). But there is usually no reason to instantiate them directly. Every primitive type except null and undefined has one:
string and String
number and Number
boolean and Boolean
symbol and Symbol
bigint and BigInt
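Here is a small sketch of why instantiating a wrapper directly tends to cause trouble. The variable names are only illustrative: the point is that a wrapper is an object, so it has a different type and is only ever equal to itself.

```javascript
const primitive = "hello";
const wrapped = new String("hello"); // avoid this in real code; shown for contrast

console.log(typeof primitive); // "string"
console.log(typeof wrapped);   // "object"

// A wrapper object is never strictly equal to a primitive or to another wrapper.
console.log(primitive === wrapped);                        // false
console.log(new String("hello") === new String("hello")); // false

// Calling String() without `new` converts to a primitive instead.
const converted = String(42);
console.log(typeof converted); // "string"
```

Calling the wrapper functions without `new` (String(value), Number(value), Boolean(value)) is fine, because that form returns a primitive.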
String
In JavaScript, textual data is stored as strings. There is no separate type for a single character.
Quotes
A string can be enclosed within either single quotes, double quotes, or backticks.
const single = 'single-quoted';
const double = "double-quoted";
let backticks = `backticks`;
// single and double quotes are essentially the same
backticks = `${backticks}`; // backticks allow embedded expressions via ${...}
let guestList = `Multiple line:
I can have a line-break`;
console.log(guestList);
// Single and double quotes cannot span multiple lines like this
Special characters
Some characters cannot be typed directly inside a string literal (a line break inside single quotes, for example), so we write them as escape sequences: a backslash followed by one or more characters.
Character | Description |
\n | New line |
\r | Carriage return (Windows text files use \r\n as the line break) |
\', \", \` | Quotes |
\\ | Backslash |
\t | Tab |
\b, \f, \v | Backspace, form feed, and vertical tab (rarely used nowadays) |
alert( `The backslash: \\` ); // The backslash: \
// We have to use \' here, otherwise the ' would be read as the closing quote
alert( 'I\'m the Walrus!' ); // I'm the Walrus!
String length
The length property holds the length of the string. Like the methods, it is reached through the String object wrapper.
alert( `My\n`.length ); // 3, because \n is a single character
Accessing character
let str = `Hello`;
// the first character
alert( str[0] ); // H
alert( str.at(0) ); // H
// the last character
alert( str[str.length - 1] ); // o
alert( str.at(-1) ); // o
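The two access styles behave differently at the edges. A small sketch (the variable name is just for illustration, and at() assumes a reasonably recent browser or Node version):

```javascript
const word = "Hello";

// Square brackets do not understand negative indexes; at() counts from the end.
console.log(word[-1]);    // undefined
console.log(word.at(-1)); // "o"

// Out of range: [] and at() give undefined, while charAt() gives an empty string.
console.log(word[10]);        // undefined
console.log(word.at(10));     // undefined
console.log(word.charAt(10)); // ""
```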
Iterator methods
String implements the iterable protocol, so we can loop over every single character of a string with for...of
for( let char of "Hello" ) {
console.log(char);
}
Strings are immutable
let str = 'Hi';
str[0] = 'h'; // silently ignored (a TypeError in strict mode)
alert( str[0] ); // H, the string did not change
// workaround
str = 'h' + str[1]; // replace the variable with a whole new string
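The workaround generalizes to any position. This is only a sketch: replaceAt is a hypothetical helper name, not a built-in method. It builds a new string from the parts before and after the index and leaves the original alone.

```javascript
// Hypothetical helper: returns a new string with the character at `index` replaced.
function replaceAt(str, index, replacement) {
  if (index < 0 || index >= str.length) return str; // nothing to replace
  return str.slice(0, index) + replacement + str.slice(index + 1);
}

const greeting = "Hi";
const lowered = replaceAt(greeting, 0, "h");

console.log(lowered);  // "hi"
console.log(greeting); // "Hi" (the original string is untouched)
```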
Changing the case
alert('Interface'.toUpperCase() ); // INTERFACE
alert('Interface'.toLowerCase() ); // interface
String API
Searching for a substring
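There are multiple ways to search for a substring.
str.indexOf(substr, pos) looks for substr in str, starting from position pos, and returns the position of the match or -1 if nothing is found.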
let str = 'Widget with id';
alert( str.indexOf('Widget') ); // 0, because 'Widget' is found at the beginning
alert( str.indexOf('widget') ); // -1, not found, the search is case-sensitive
alert( str.indexOf("id") ); // 1, "id" is found at the position 1 (..idget with id)
If we're interested in all occurrences, like a regular expression with the /g flag, we can run indexOf in a loop and start each search right after the previous match:
let str = 'As sly as a fox, as strong as an ox';
let target = "as";
let position = 0;
while (true) {
  const idx = str.indexOf(target, position);
  if (idx === -1) break;
  console.log(idx); // 7, 17, 27
  position = idx + 1; // continue the search from the next position
}
// Another, shorter way
let pos = -1;
while ((pos = str.indexOf(target, pos + 1)) !== -1) {
  console.log(pos); // 7, 17, 27
}
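The same loop can be wrapped in a reusable function. This is a sketch only: findAllIndexes is a made-up helper name, and it returns overlapping matches because each search restarts one position after the previous hit.

```javascript
// Hypothetical helper: collects the start index of every occurrence of `target`.
function findAllIndexes(str, target) {
  const indexes = [];
  if (target === "") return indexes; // avoid matching the empty string everywhere
  let pos = str.indexOf(target);
  while (pos !== -1) {
    indexes.push(pos);
    pos = str.indexOf(target, pos + 1); // restart right after the last match
  }
  return indexes;
}

console.log(findAllIndexes("As sly as a fox, as strong as an ox", "as")); // [7, 17, 27]
console.log(findAllIndexes("aaa", "aa")); // [0, 1]
```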
Sometimes we just want to know whether the substring exists at all, not where it is
if (str.indexOf(target) !== -1) { /* found */ }
includes, startsWith, endsWith (yes or no)
The more modern method str.includes(substr, pos) returns true/false depending on whether str contains substr within.
startsWith is like /^text/ in regex
endsWith is like /text$/ in regex
column.name.startsWith('attribute'); // assuming column is an object with a string name
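A quick sketch of the three yes/no methods, including the optional position argument. The sample sentence is reused from the indexOf example; nothing else is assumed.

```javascript
const sentence = "Widget with id";

console.log(sentence.includes("Widget")); // true
console.log(sentence.includes("widget")); // false (the search is case-sensitive)
console.log(sentence.includes("id", 12)); // true  (search starts at position 12)
console.log(sentence.includes("id", 13)); // false (only "d" is left)

console.log(sentence.startsWith("Wid"));     // true
console.log(sentence.startsWith("with", 7)); // true (check begins at position 7)
console.log(sentence.endsWith("id"));        // true
```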
Chopping a string
There are 3 methods in JavaScript to get a substring: substring, substr and slice.
Which one should I use?
All three are very similar and can do the job. But in practice, slice is a little bit more flexible: it allows negative arguments and is shorter to write. substr is a legacy feature, so it is best avoided in new code.
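The difference between slice and substring shows up with negative or swapped arguments. A short sketch with an arbitrary sample word:

```javascript
const text = "stringify";

// slice: negative arguments count from the end of the string.
console.log(text.slice(0, 5));   // "strin"
console.log(text.slice(-4, -1)); // "gif"
console.log(text.slice(2));      // "ringify"

// substring: negative arguments are treated as 0, and start > end is swapped.
console.log(text.substring(-4, 5)); // "strin"
console.log(text.substring(6, 2));  // "ring"
console.log(text.slice(6, 2));      // "" (slice never swaps its arguments)
```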
Unicode is a friendly term
Let's talk about how data is stored.
Data is stored as bits: sequences of 0s and 1s
0111 0110 1000 1001
// Whether it's a number or a character, in the end it is translated
// to a binary number to be stored on disk or in RAM (memory),
// so we need a rule that maps character <-> number <-> binary
Character | Number | Binary code |
d | 100 | 0110 0100 |
ASCII (American Standard Code for Information Interchange)
A = 65
1 0 0 0 0 0 1 (7 BITS)
a = 97
1 1 0 0 0 0 1 (7 BITS)
To convert a string to binary code to store on the computer
String | H | e | l | l | o |
ASCII value | 72 | 101 | 108 | 108 | 111 |
Binary code | 01001000 | 01100101 | 01101100 | 01101100 | 01101111 |
Each binary code is 8 bits (1 byte) of data: the 7 ASCII bits plus a leading 0
For example, if we have the string "Vince", then string.length = 5
Here 5 is the length of the string, and in ASCII it is also the number of bytes of data
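The table above can be reproduced in a few lines. This is a sketch under one assumption, stated in the code as well: the input is plain ASCII (codes 0-127). toAsciiBinary is a hypothetical helper name.

```javascript
// Hypothetical helper: maps each character to its code and an 8-bit binary string.
// Assumes the input contains only ASCII characters (codes 0-127).
function toAsciiBinary(str) {
  return Array.from(str, (char) => {
    const code = char.charCodeAt(0);
    return { char, code, binary: code.toString(2).padStart(8, "0") };
  });
}

for (const { char, code, binary } of toAsciiBinary("Hello")) {
  console.log(char, code, binary);
}
// H 72 01001000
// e 101 01100101
// l 108 01101100
// l 108 01101100
// o 111 01101111
```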
Unicode for rest of the world
Non-Western characters in the world:
Chinese, Japanese, Arabic, and so much more
Icons
Emoji
Unicode maps every one of these characters to a code point (a numeric value)
Keep in mind that Unicode takes a different approach from ASCII: it does not choose the binary digits. All it says is that a given icon or Vietnamese character has the number XXX. How that number is stored as bytes is decided by an encoding such as UTF-8, UTF-16, or UTF-32
UTF
UTF encodings agree with ASCII on the basic characters: the character A is still 65. With up to 32 bits (4 bytes) per character there is room for 4,294,967,296 values (about 4.3 billion character codes), which should be enough for the languages of the entire world ^^ (Unicode itself only defines code points up to U+10FFFF.)
I can see one obvious problem that we might face
- If every character takes 32 bits, the character A is stored as 25 leading zeros followed by 1 0 0 0 0 0 1. That is a waste of memory, and remember that most English text only needs 7 bits per character
UTF-32 (32 bits = 4 bytes)
UTF-32 takes the code point of each character and stores it directly as a 32-bit binary number
2^32 equals approximately 4.3 billion possible values.
Character | Code point | UTF-32 encoding |
H | 72 | 00 00 00 48 (Hex) |
e | 101 | 00 00 00 65 (Hex) |
UTF-8 (from 1 to 4 bytes)
UTF-8 is more flexible: a character takes from 1 to 4 bytes, and ASCII characters (most English text) take only what they need, 1 byte
Is it unfair to the other languages that they have to store more bytes?
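To see the variable width in practice, we can count bytes with TextEncoder, which is available in browsers and Node and always encodes to UTF-8. This is a small sketch: utf8ByteLength is a hypothetical helper name, and the characters are written as escapes so the file's own encoding cannot change the result.

```javascript
// TextEncoder always encodes to UTF-8 and returns a Uint8Array of bytes.
const encoder = new TextEncoder();

// Hypothetical helper: number of bytes the string needs in UTF-8.
function utf8ByteLength(str) {
  return encoder.encode(str).length;
}

console.log(utf8ByteLength("A"));         // 1 (ASCII)
console.log(utf8ByteLength("\u00E9"));    // 2 (é)
console.log(utf8ByteLength("\u20AC"));    // 3 (€)
console.log(utf8ByteLength("\u{1F602}")); // 4 (😂)

// The single byte for "A" has the same value as in ASCII.
console.log(Array.from(encoder.encode("A"))); // [65]
```

In UTF-32 each of these four characters would take 4 bytes, so UTF-8 saves the most space for text that is mostly ASCII.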
Unicode and String internals
As we already know, JavaScript strings are based on Unicode. Internally they use UTF-16: each character is stored as one or two 16-bit code units (2 or 4 bytes)
JavaScript allows us to insert a character into a string by specifying its hexadecimal Unicode code with one of these three notations:
\xXX
XX must be exactly two hexadecimal digits (00-FF), that is, at most 1 byte.
It can be used for the first 256 Unicode characters (ASCII plus Latin-1)
console.log("\x7A"); // z
console.log("\xA9"); // ©
\uXXXX
XXXX must be exactly four hexadecimal digits, that is, one 2-byte UTF-16 code unit, so it supports a much wider range of characters
console.log("\u00A9"); // ©
\u{X…XXXXXX}
Takes 1 to 6 hexadecimal digits. This form reaches code points above U+FFFF, which are stored as surrogate pairs
alert( "\u{20331}" ); // a rare Chinese character (long Unicode)
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
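The three notations are just different spellings of the same characters, which a few comparisons can confirm. A minimal sketch using only the codes already shown above:

```javascript
console.log("\x7A" === "z");        // true
console.log("\xA9" === "\u00A9");   // true, both are ©
console.log("\u{A9}" === "\u00A9"); // true, \u{...} accepts short codes too

// A code point above U+FFFF is the same string as its two surrogate code units.
console.log("\u{1F60D}" === "\uD83D\uDE0D"); // true
console.log("\u{1F60D}".length);             // 2
```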
UTF-16 uses 16-bit (2-byte) code units
Initially, JavaScript strings allowed only 2 bytes per character. But 2 bytes only allow 65,536 combinations, and that's not enough for every possible symbol of Unicode.
So rare symbols that require more than 2 bytes are encoded with a pair of 2-byte code units called "a surrogate pair".
console.log( '𝒳'.length ); // 2, MATHEMATICAL SCRIPT CAPITAL X
console.log( '😂'.length ); // 2, FACE WITH TEARS OF JOY
console.log( '𩷶'.length ); // 2, a rare Chinese character
MATHEMATICAL SCRIPT CAPITAL X ('𝒳')
The character '𝒳' is a mathematical script capital X and is represented by a Unicode character outside the Basic Multilingual Plane (BMP). It has a Unicode code point U+1D4B3.
Characters outside the BMP are represented using surrogate pairs in JavaScript. In this case, '𝒳' is represented by two code units (surrogate pair): '\uD835\uDCB3'. (4 bytes)
When you use the length property on a string containing this character, it counts the number of code units, which is 2. Hence, alert( '𝒳'.length ); results in 2.
FACE WITH TEARS OF JOY ('😂')
The emoji '😂' is a character outside the BMP as well, represented by a surrogate pair: '\uD83D\uDE02'.
The length property counts the number of code units, so alert( '😂'.length ); also results in 2.
A RARE CHINESE CHARACTER ('𩷶')
The Chinese character '𩷶' is another example of a character outside the BMP, so it is also stored as a surrogate pair.
The length property counts the number of code units, so alert( '𩷶'.length ); also results in 2.
In summary, characters outside the Basic Multilingual Plane (BMP) are represented using surrogate pairs in JavaScript, where each character is represented by two code units. When you use the length property on a string containing such characters, it counts the number of code units, not the number of visible characters or glyphs.
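The pair itself can be computed by hand, which makes the "two code units" idea concrete. This is a sketch: toSurrogatePair is a hypothetical helper that follows the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two halves).

```javascript
// Hypothetical helper: splits a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("expected a code point between U+10000 and U+10FFFF");
  }
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const joy = "\u{1F602}"; // 😂
const [high, low] = toSurrogatePair(joy.codePointAt(0));
console.log(high.toString(16), low.toString(16)); // d83d de02

console.log(joy.length);      // 2 (code units)
console.log([...joy].length); // 1 (string iteration goes by code point)
```

Spreading or iterating the string with for...of is therefore a simple way to count code points instead of code units.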
If I have 1 MB, how many characters can I have?
If every character in your text can be represented with a single 16-bit code unit (which is true for most common characters), then each character will take 2 bytes. In this case, you can calculate the number of characters you can store in 1 megabyte (1 MB) as follows:
1 MB = 1,048,576 bytes
Number of characters = (1,048,576 bytes) / (2 bytes/character)
Number of characters = 524,288 characters
However, if every character in your text requires a surrogate pair, then each character will take 4 bytes (two 16-bit code units). In that case, the number of characters you can store in 1 MB will be half, as each character will take up twice as much space:
Number of characters = (1,048,576 bytes) / (4 bytes/character)
Number of characters = 262,144 characters
So, depending on the nature of the characters in your text, you can store between 262,144 characters (if every character requires a surrogate pair) and 524,288 characters (if every character fits in a single 16-bit code unit).
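The same arithmetic as a sketch in code. It assumes strings are stored as plain UTF-16, which is how the language describes them; real engines are free to use a more compact internal representation, so treat the numbers as an upper bound on size. utf16ByteLength is a hypothetical helper name.

```javascript
const BYTES_PER_MB = 1024 * 1024; // 1,048,576

// Hypothetical helper: bytes needed if every code unit takes 2 bytes (plain UTF-16).
function utf16ByteLength(str) {
  return str.length * 2;
}

console.log(BYTES_PER_MB / 2); // 524288 characters at 2 bytes each
console.log(BYTES_PER_MB / 4); // 262144 characters at 4 bytes each

console.log(utf16ByteLength("Vince"));     // 10
console.log(utf16ByteLength("\u{1F602}")); // 4 (one surrogate pair)
```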
Compare strings
Strings are compared character-by-character in "alphabetical" order, where the order is given by the character codes
That leads to some corner cases:
- A lowercase letter is always greater than an uppercase one
alert( 'a' > 'Z' ); // true
- Letters with diacritical marks are "out of order."
alert( 'Österreich' > 'Zealand' ); // true
// Ö should come before Z alphabetically, but its character code is greater
To understand what happens, we should be aware that strings in JavaScript are encoded using UTF-16. That is, each character has a corresponding numeric code.
"A".charCodeAt(0); //65
"a".charCodeAt(0); //97
String.fromCodePoint(code) creates a character from its numeric code
let str = '';
for (let i = 65; i <= 220; i++) {
str += String.fromCodePoint(i);
}
alert( str );
// Output:
// ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
// ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜ
Correct comparisons
The “right” algorithm to do string comparisons is more complex than it may seem because alphabets are different for different languages.
So, the browser needs to know the language to compare.
Luckily, modern browsers support the internationalization standard ECMA-402.
It provides a special method to compare strings in different languages, following their rules.
"a".localeCompare("b"); // negative (usually -1): "a" comes before "b"
"b".localeCompare("a"); // positive (usually 1): "b" comes after "a"
"a".localeCompare("A"); // depends on the locale rules; it is not computed as 97 - 65
"a".charCodeAt(); // 97
"A".charCodeAt(); // 65
alert( 'Österreich'.localeCompare('Zealand') ); // -1
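Sorting is where the difference matters most. Here is a sketch comparing the default sort with a locale-aware one. The word list is made up, the "de" locale is an arbitrary choice for the example, and the second result assumes an environment with ECMA-402 (Intl) support, which modern browsers and Node have.

```javascript
// "\u00D6" is Ö, written as an escape so the result does not depend on file encoding.
const countries = ["Zealand", "\u00D6sterreich", "apple"];

// Default sort compares UTF-16 code units: "Z" (90) < "a" (97) < "Ö" (214).
const byCodeUnit = [...countries].sort();
console.log(byCodeUnit); // ["Zealand", "apple", "Österreich"]

// Locale-aware sort: Ö is ordered next to O, and case is not the primary key.
const byLocale = [...countries].sort((a, b) => a.localeCompare(b, "de"));
console.log(byLocale); // ["apple", "Österreich", "Zealand"]
```

For large arrays, creating one Intl.Collator and passing its compare function to sort is usually preferred over calling localeCompare for every pair.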
Regular expressions with strings?
I've covered them in depth in this blog
There are several other helpful methods for strings:
- str.trim() removes ("trims") spaces from the beginning and end of the string.
- str.repeat(n) repeats the string n times.
Summary
- There are three types of quotes. Backticks allow a string to span multiple lines and embed expressions ${…}.
- We can use special characters, such as a line break \n.
- To get a character, use [] or the at method.
- To get a substring, use slice or substring.
- To lowercase/uppercase a string, use toLowerCase/toUpperCase.
- To look for a substring, use indexOf, or includes/startsWith/endsWith for simple checks.
- To compare strings according to the language, use localeCompare; otherwise they are compared by character codes.