What Exactly are JavaScript Strings? Sequences of 8-Bit Units or 16-Bit Units?

Codeguage · Published in Level Up Coding · 7 min read · Jul 5, 2022


We all work with strings almost every single time we write a piece of JavaScript code. Knowing how to work with strings in JavaScript is of immense importance for every developer; in fact, this holds for almost every programming language. No question about that.

But while we may know how to work with strings in JavaScript, are we really familiar with the way JavaScript treats them? Can we explain why '🙂'.length returns 2 instead of 1?

All this will be discussed thoroughly in this article. Let’s begin…

What really is a string in JavaScript?

A string in any programming language is defined in a very typical and simple way. That is,

A string is a sequence of textual characters.

This definition isn’t wrong — after all, a string really is a sequence of characters.

But if we zoom into how strings are actually stored in memory, we see that this definition is somewhat vague. What exactly does a ‘character’ mean here?

It can have a different interpretation in every programming language.

Right now we are concerned with JavaScript, right?

So, from the perspective of JavaScript, a string is defined in the following way, as per the ECMAScript spec:

Primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integer values.

To start with, one thing is absolutely clear from this definition — that whatever the case be, a string is a sequence type.

Moving on, each element in this sequence is a 16-bit number, generally referred to as a 16-bit code unit, or simply as a code unit. The most important thing is that each element is NOT strictly a character.

Let’s elaborate this further…

Most characters in JavaScript do indeed consume one single code unit (i.e. a 16-bit number); however, this doesn’t hold for all characters. Some, such as 🙂, consume 2 code units.

This right away clarifies that a string is, technically, NOT a sequence of characters; instead, it’s a sequence of code units that, when decoded, form a sequence of (visual) characters.

It might sound complicated, but at the core, it’s a very simple idea.

Following from this idea, the length string property does NOT return the number of characters in a given string, but rather the number of these 16-bit code units.

This explains the following code:
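A quick check in the console confirms this (assuming a modern JavaScript engine):

```javascript
// .length counts 16-bit code units, not visual characters.
console.log('hello'.length); // → 5 (each character fits in one code unit)
console.log('🙂'.length);    // → 2 (🙂 occupies two code units)
console.log('a🙂b'.length);  // → 4 (1 + 2 + 1 code units)
```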

In the latter part of this article, we’ll see exactly why 🙂 spans 2 code units.

Now the question is: if a string in JavaScript is a sequence of 16-bit integers, how are these mere numbers converted to characters?

Moreover, how does JavaScript figure out when a code unit doesn’t represent a character alone, but instead has to be amalgamated with the next unit to altogether represent a character?

All this will be answered in the following section.

Unicode and UTF-16

JavaScript strings are sequences of 16-bit code units. That’s clear.

But this doesn’t really tell us how JavaScript goes from a number like 97 to the character a.

Well, that’s where Unicode and its UTF-16 encoding scheme come into play.

Unicode is an international encoding standard for working with textual information. It can be used to portably represent a huge variety of characters, 144,697 as of version 14.0, in use throughout the globe.

In layman terms, Unicode could be thought of as a huge table where a unique number is associated with a particular character. This unique number is called the character’s code point.

For instance, the code point of a is 97, the code point of b is 98, the code point of A is 65, the code point of B is 66, the code point of a space character is 32, and so on. The code point of 🙂 is also there — 128578.
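In JavaScript, we can look these numbers up ourselves with codePointAt(), and go the other way with String.fromCodePoint():

```javascript
// Read the code point of the character starting at a given position.
console.log('a'.codePointAt(0));  // → 97
console.log('A'.codePointAt(0));  // → 65
console.log(' '.codePointAt(0));  // → 32
console.log('🙂'.codePointAt(0)); // → 128578

// And map a code point back to its character.
console.log(String.fromCodePoint(128578)); // → '🙂'
```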

You can see parts, or the full version, of the table at Unicode Character Table.

Now, Unicode is just a mapping of integers to characters, or, better said, a mapping of integers to glyphs (the visual representations of characters).

What helps us convert back and forth between these integers (code points) and visual representations (glyphs) in a programming environment is an encoding scheme.

There are a handful of encoding schemes used by Unicode:

  1. UTF-8, for Unicode Transformation Format — 8 bits
  2. UTF-16, for Unicode Transformation Format — 16 bits
  3. UTF-32, for Unicode Transformation Format — 32 bits

JavaScript uses the second of these, i.e. UTF-16.

Let’s dig deeper into UTF-16…

The most elementary unit in UTF-16, as its name suggests, is a 16-bit integer. Every character in Unicode is converted to a sequence of 16-bit integers under UTF-16.

Note that we used the term ‘sequence’ here instead of saying that every character in Unicode gets converted to a single 16-bit integer under UTF-16. That’s because there are characters that get converted to not just one, but two 16-bit units, such as 🙂.

This is where the notion of planes and surrogates steps in.

Starting with planes: a plane is just a fancy term for a collection of characters in Unicode. There are a total of 17 planes, which can be divided into two categories:

  1. Basic Multilingual Plane — contains most common characters. There is only one such plane, and that’s the first one. The range of code points it covers is as follows (mentioned in hexadecimal): 0x0000 — 0xFFFF.
  2. Supplementary Plane — contains less common characters, such as mathematical symbols, emojis, musical notation, etc. All the remaining 16 planes are supplementary planes, covering the rest of the range of code points.
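As a minimal sketch, we can tell which of the two categories a character falls into by checking whether its code point fits in the BMP range (the helper name isSupplementary is ours, not a built-in):

```javascript
// A code point above 0xFFFF lies in one of the 16 supplementary planes.
const isSupplementary = (char) => char.codePointAt(0) > 0xFFFF;

console.log(isSupplementary('a'));  // → false (Basic Multilingual Plane)
console.log(isSupplementary('€'));  // → false (still within the BMP)
console.log(isSupplementary('🙂')); // → true  (supplementary plane)
```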

Next up, let’s talk about surrogates.

Recall that the simplest unit in UTF-16 spans 16 bits of memory. This means that we can effectively only represent the numbers in the range 0x0000 — 0xFFFF in it.

Isn’t that so?

But then how are characters like 🙂 represented in UTF-16, whose code point exceeds the maximum limit of a 16-bit integer?

The answer is surrogates. Unicode reserves a range of code points, specifically from 0xD800 to 0xDFFF, known as surrogates. These code points don’t represent any character on their own, simply because they are meant for a special purpose.

Code points from 0xD800 to 0xDBFF are high (leading) surrogates, and those from 0xDC00 to 0xDFFF are low (trailing) surrogates. A high surrogate code unit indicates that another 16-bit code unit, a low surrogate, follows along that, together with it, represents a single character.

UTF-16 is responsible for converting a given Unicode code point into one single 16-bit code unit, or a pair of 16-bit code units (i.e. a high surrogate followed by a low surrogate), entirely based on the code point.

To be more precise, if a character’s code point can be represented in 16 bits, UTF-16 converts it into one single code unit. Otherwise, UTF-16 converts it into a pair of code units (spanning a total of 32 bits).
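We can actually observe both halves of such a pair with charCodeAt(), which reads raw code units, while codePointAt() reassembles the full code point:

```javascript
const smiley = '🙂'; // code point 0x1F642 (128578)

// The two raw code units: a high surrogate followed by a low surrogate.
console.log(smiley.charCodeAt(0).toString(16)); // → 'd83d'
console.log(smiley.charCodeAt(1).toString(16)); // → 'de42'

// codePointAt() recombines the surrogate pair into the original code point.
console.log(smiley.codePointAt(0).toString(16)); // → '1f642'
```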

Exactly how this UTF-16 conversion works is beyond the scope of this article. You can read more details at UTF-8, UTF-16, UTF-32 & BOM.

Let’s now talk about a JavaScript feature that helps us treat a string literally as a sequence of characters, rather than a sequence of code units.

The string @iterator method

ECMAScript 6 introduced symbols into JavaScript. We’ll not go into the details of symbols right now; just know that in ECMAScript 6, the value of the well-known Symbol.iterator property became the key of a string method that helps us iterate over a string’s characters, not its code units.

Conventionally, symbolic properties of a given object are prefixed with @. Hence, the Symbol.iterator property of String, which refers to a function, is compactly referred to as the @iterator() string method.

Now, there are two ways to work with this method in JavaScript:

  1. Use the for...of loop
  2. Use the array spread operator (...)

(Actually, there are three ways to work with this method, but the third one relies on manually working with the iterator object returned by the string @iterator() method, which in turn requires us to know a little, if not a lot, about iterators. If you want to take a deep dive into iterators in JavaScript, you can refer to the chapter Advanced JavaScript — Introduction to Iterators from our very own Advanced JavaScript course.)
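For the curious, here’s a brief sketch of that third, manual way (it’s not needed for the rest of the article):

```javascript
// Obtain the iterator object behind the string's @iterator() method.
const iterator = 'a🙂'[Symbol.iterator]();

// Each next() call yields one character, not one code unit.
console.log(iterator.next()); // → { value: 'a', done: false }
console.log(iterator.next()); // → { value: '🙂', done: false }
console.log(iterator.next()); // → { value: undefined, done: true }
```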

Let’s see each one of these one-by-one…

The for...of loop

In the code below, we iterate over the string str and log each of its characters, one-by-one:
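A minimal version of that code, assuming a sample value for str:

```javascript
const str = 'ab🙂';

// for...of uses the string's @iterator() method under the hood,
// so each iteration yields a full character.
for (const char of str) {
  console.log(char);
}
// → a
// → b
// → 🙂
```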

Notice how the loop treats 🙂 as a single entity, even though, internally in the string str, it consumes two code units.

The array spread operator (...)

The second way to leverage this character-based, rather than code-unit-based, string @iterator() method is to use the array spread operator (...).

In the code below, we use ... to determine the total number of characters in a given string (NOT the total number of code units):
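Again assuming a sample value for str, the contrast with length looks like this:

```javascript
const str = 'ab🙂';

console.log(str.length);      // → 4 (code units: 1 + 1 + 2)
console.log([...str].length); // → 3 (characters: a, b, 🙂)
```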

As stated before, str.length doesn’t return the number of characters in str. Rather, it returns the number of code units.

In contrast, [...str].length returns the total number of characters in str.

[...str] first converts str into an array of characters using the string’s @iterator() method. Then, the length property (called on the array) returns the total number of items in this array.

Simple, isn’t it?

In the end

To boil it all down, JavaScript treats a string as a sequence of 16-bit integers (known as code units). Almost all string facilities in the language process strings in this way — e.g. the string length property, lexicographic comparison of strings, and so on.

If we want to work with strings in JavaScript in an error-free way, we must be aware of this behavior of the language.

If you want to learn even more about Unicode in JavaScript strings, you can refer to JavaScript Strings — Unicode.

Have a good time learning JavaScript and developing amazing programs with it! 🙂

Learn JavaScript From Scratch — JavaScript Course at Codeguage
