VBA String


General

Imagine a set that contains all the characters of all known languages, of all traditions of all times, past and present.

The UCS and the Unicode set

Two organizations have made (and are making) the effort to build such a set.

The first organization is the International Organization for Standardization (ISO). Its set is called the Universal Coded Character Set (UCS). It is defined in the standard ISO/IEC 10646. At the time of writing, the latest version is ISO/IEC 10646:2014, which defines 128,585 characters. Each character in the UCS has a unique name and a unique number.

The second organization is the Unicode Consortium, which represents computer manufacturers. Its set is defined in the Unicode Standard and we will call it the Unicode set. Version 9.0 of the Unicode set contains 128,172 characters. Again, each character in the Unicode set has a unique name and a unique number. The Unicode Consortium publishes an overview of the supported languages, with PDF charts of their characters, on its website. Complete lists of the characters in the Unicode set are also available online.

Luckily, the two organizations cooperate closely. They each have their own update frequency and focus, but essentially their sets are the same. They contain the same characters, apart from differences due to update frequency.

A code point

In the UCS and the Unicode set, each character has a code point. A code point is a numerical representation of the character and acts as an id for it. Code points are non-negative integers. They can be expressed in decimal or in hexadecimal format. In hexadecimal format, the number is usually preceded by "U+".

For example,

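| Character | Name | Code point (hexadecimal) | Code point (decimal) |
|---|---|---|---|
| a | Latin Small Letter A | U+61 | 97 |
| Д | Cyrillic Capital Letter De | U+414 | 1044 |
| ☢ | Radioactive Sign | U+2622 | 9762 |
| ⻅ | Cjk Radical C-Simplified See | U+2EC5 | 11973 |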
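
The two notations in the table can be converted into each other with a few lines of VBA. This is only a minimal sketch: `FormatCodePoint` is a hypothetical helper name introduced for illustration, not a built-in function.

```vba
Option Explicit

' Hypothetical helper: formats a decimal code point in the "U+" notation,
' for example 1044 -> "U+414".
Public Function FormatCodePoint(ByVal CodePoint As Long) As String
    FormatCodePoint = "U+" & Hex$(CodePoint)
End Function

Public Sub ShowCodePoints()
    Debug.Print FormatCodePoint(97)       ' U+61
    Debug.Print FormatCodePoint(1044)     ' U+414
    Debug.Print FormatCodePoint(9762)     ' U+2622
    Debug.Print FormatCodePoint(11973)    ' U+2EC5

    ' The other direction: a hexadecimal literal is just a number.
    ' The trailing "&" makes the literal a Long.
    Debug.Print &H2622&                   ' 9762
End Sub
```

Run `ShowCodePoints` and look at the immediate window to compare the output with the table.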

In both UCS and the Unicode set, there is space for 1,114,112 code points.

| | Code point (hexadecimal) | Code point (decimal) |
|---|---|---|
| First available code point | U+0 | 0 |
| Last available code point | U+10FFFF | 1,114,111 |

At the time of writing there are around 128,000 characters in UCS and the Unicode set. This means that not all available code points are taken, leaving space for many more characters.

The Basic Multilingual Plane

The most important part of both UCS and the Unicode set is the first 65,536 code points. This subset is called the Basic Multilingual Plane (BMP).

| | Code point (hexadecimal) | Code point (decimal) |
|---|---|---|
| First available code point in BMP | U+0 | 0 |
| Last available code point in BMP | U+FFFF | 65,535 |

The BMP contains the characters of (almost) all modern languages (English, Spanish, Norwegian, Russian, Chinese, Japanese, Korean... you name it). It also contains many of the modern symbols. Even in the BMP, not all the available code points are assigned to characters. There is still open space, but not much.

P(BMP is all you need) ~ 1

The UCS-2 encoding

How can the code points in the Basic Multilingual Plane be represented in bits? The obvious choice would be to use two bytes.

[Image: two empty bytes]

Two bytes can have 65,536 values, which is exactly the number of available code points in the BMP.

| | Code point (hexadecimal) | Code point (decimal) | Two bytes |
|---|---|---|---|
| First available code point in BMP | U+0 | 0 | 00000000 00000000 |
| Second available code point in BMP | U+1 | 1 | 00000000 00000001 |
| Third available code point in BMP | U+2 | 2 | 00000000 00000010 |
| ... | ... | ... | ... |
| Last available code point in BMP | U+FFFF | 65,535 | 11111111 11111111 |

The endianness

What about the order of the two bytes? There are two ways to order them: high endianness (usually called big-endian) and low endianness (usually called little-endian). In high endianness, the most significant byte comes first, as when writing a number. In low endianness, the least significant byte comes first. The choice of endianness is guided by architectural considerations. In Windows, you can assume low endianness. In VBA too.

| | Code point (hexadecimal) | Code point (decimal) | Two bytes high endianness | Two bytes low endianness |
|---|---|---|---|---|
| First available code point in BMP | U+0 | 0 | 00000000 00000000 | 00000000 00000000 |
| Second available code point in BMP | U+1 | 1 | 00000000 00000001 | 00000001 00000000 |
| Third available code point in BMP | U+2 | 2 | 00000000 00000010 | 00000010 00000000 |
| ... | ... | ... | ... | ... |
| Last available code point in BMP | U+FFFF | 65,535 | 11111111 11111111 | 11111111 11111111 |
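
The byte order that VBA uses can be checked directly. The sketch below relies on the fact that assigning a String to a dynamic Byte array copies the raw bytes of the string. The procedure name `ShowByteOrder` is only illustrative.

```vba
Option Explicit

Public Sub ShowByteOrder()
    Dim Bytes() As Byte

    ' A one-character string holding U+3007 (decimal 12,295).
    Bytes = ChrW$(&H3007)

    ' The string occupies two bytes, at indexes 0 and 1.
    Debug.Print UBound(Bytes) - LBound(Bytes) + 1   ' 2

    ' Low endianness: the least significant byte (07) is stored first,
    ' the most significant byte (30) second.
    Debug.Print Hex$(Bytes(0))                      ' 7
    Debug.Print Hex$(Bytes(1))                      ' 30
End Sub
```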

Example: BMP characters in VBA

Say we have this text.

Hell〇 ￦ørld!

All the characters in the text are in the BMP.

The binary representation in UCS-2 looks like this.

| Character | Code point (hexadecimal) | Code point (decimal) | Two bytes high endianness | Two bytes low endianness |
|---|---|---|---|---|
| H | U+48 | 72 | 00000000 01001000 | 01001000 00000000 |
| e | U+65 | 101 | 00000000 01100101 | 01100101 00000000 |
| l | U+6C | 108 | 00000000 01101100 | 01101100 00000000 |
| l | U+6C | 108 | 00000000 01101100 | 01101100 00000000 |
| 〇 | U+3007 | 12295 | 00110000 00000111 | 00000111 00110000 |
| (space) | U+20 | 32 | 00000000 00100000 | 00100000 00000000 |
| ￦ | U+FFE6 | 65510 | 11111111 11100110 | 11100110 11111111 |
| ø | U+F8 | 248 | 00000000 11111000 | 11111000 00000000 |
| r | U+72 | 114 | 00000000 01110010 | 01110010 00000000 |
| l | U+6C | 108 | 00000000 01101100 | 01101100 00000000 |
| d | U+64 | 100 | 00000000 01100100 | 01100100 00000000 |
| ! | U+21 | 33 | 00000000 00100001 | 00100001 00000000 |

This text, in UCS-2 encoding, requires 24 bytes. Note that 10 of the 24 bytes contain only zeroes. This is a bit of a waste, which is one of the motivations for developing other encodings.

The VBA code below will "show" you the binary representation in the memory. You can see that it matches the two bytes low endianness representation. If you try the code, you need to copy the help functions too.

[Image: the bits in the memory of a VBA string]

Here are the help functions.
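
What follows is a minimal sketch of such a routine together with its helper, not a definitive implementation. The names `ByteToBits`, `StringToBits` and `ShowBits` are hypothetical. The example text is built with `ChrW$` because the VBA editor cannot hold every BMP character in a string literal.

```vba
Option Explicit

' Hypothetical helper: returns the eight bits of a byte as text,
' for example 72 -> "01001000".
Public Function ByteToBits(ByVal Value As Byte) As String
    Dim Mask As Long
    Mask = 128                      ' start with the most significant bit
    Do While Mask > 0
        If (Value And Mask) <> 0 Then
            ByteToBits = ByteToBits & "1"
        Else
            ByteToBits = ByteToBits & "0"
        End If
        Mask = Mask \ 2             ' move to the next lower bit
    Loop
End Function

' Hypothetical helper: returns one line of two bytes per 16-bit unit,
' in the order in which the bytes sit in memory.
Public Function StringToBits(ByVal Text As String) As String
    Dim Bytes() As Byte
    Dim Index As Long

    If LenB(Text) = 0 Then Exit Function
    Bytes = Text                    ' copies the raw bytes of the string
    For Index = LBound(Bytes) To UBound(Bytes) Step 2
        StringToBits = StringToBits & ByteToBits(Bytes(Index)) & " " _
            & ByteToBits(Bytes(Index + 1)) & vbCrLf
    Next Index
End Function

Public Sub ShowBits()
    Dim S As String
    ' "Hell" + U+3007 + space + U+FFE6 + U+F8 + "rld!"
    S = "Hell" & ChrW$(&H3007) & " " & ChrW$(&HFFE6) & ChrW$(&HF8) & "rld!"

    Debug.Print LenB(S)             ' 24
    Debug.Print StringToBits(S)     ' first line: 01001000 00000000
End Sub
```

The first line of the output is the letter H. Its bytes appear as 01001000 00000000, which is the low endianness column of the table above.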

The supplementary planes

All the available code points that are not in the BMP are in the so-called supplementary planes. The supplementary planes cover ancient languages and many more symbols, such as the music symbols.

| | Code point (hexadecimal) | Code point (decimal) |
|---|---|---|
| First available code point in BMP | U+0 | 0 |
| ... | ... | ... |
| Last available code point in BMP | U+FFFF | 65,535 |
| First available code point in supplementary planes | U+10000 | 65,536 |
| ... | ... | ... |
| Last available code point in supplementary planes | U+10FFFF | 1,114,111 |

The UTF-16 encoding

The code points in the supplementary planes are too big to be represented in two bytes. The obvious alternative would be to use four bytes (UCS-4). However, texts would then double in size, and many of the extra bytes would be 00000000, which is a waste. It would also mean that manufacturers and programmers would need to adjust to four bytes per character instead of two. It proved to be more practical to extend UCS-2. The extension is called the 16-bit Unicode Transformation Format (UTF-16).

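JSON escape sequences assume UTF-16: a non-BMP character is escaped as a surrogate pair, written as two \uXXXX sequences.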

UTF-16 uses two bytes or four bytes to represent the code point of a character. Never one byte, never three bytes.

[Image: two empty bytes]

UTF-16 exploits the fact that there is still open space in the BMP. A large part of that open space is reserved for UTF-16.

| Code points in BMP (hexadecimal) | Code points in BMP (decimal) | Use |
|---|---|---|
| From U+0 to U+D7FF | From 0 to 55,295 | Assigned to characters |
| From U+D800 to U+DBFF | From 55,296 to 56,319 | The high surrogates. Not assigned to characters. 1024 code points reserved for UTF-16. |
| From U+DC00 to U+DFFF | From 56,320 to 57,343 | The low surrogates. Not assigned to characters. 1024 code points reserved for UTF-16. |
| From U+E000 to U+FFFF | From 57,344 to 65,535 | Assigned to characters |

In UTF-16, a character that is in the BMP is represented by two bytes, exactly as in UCS-2. Both high and low endianness exist in UTF-16, so the endianness used in UCS-2 can be preserved too. Characters that are not in the BMP are represented by four bytes: a high surrogate followed by a low surrogate. This representation of a non-BMP character is called a surrogate pair. There are 1,024 x 1,024 = 1,048,576 possible surrogate pairs, which is exactly enough to cover all 1,114,112 - 65,536 = 1,048,576 non-BMP code points.

There is a simple algorithm that maps a non-BMP code point to its surrogate pair. Subtract 65,536 (hexadecimal 10000) from the code point, which leaves a 20-bit number. The upper 10 bits are added to U+D800 to give the high surrogate, and the lower 10 bits are added to U+DC00 to give the low surrogate. It is a great, practical idea! Texts that are encoded in UCS-2 need not be changed, and many more characters can be represented too.
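
The mapping from a non-BMP code point to its surrogate pair can be sketched in VBA as follows. `CodePointToString` is a hypothetical helper name, and the musical symbol G clef (U+1D11E) is used only as an illustration. The trailing `&` on the hexadecimal literals matters: without it, VBA reads a literal such as `&HD800` as a negative 16-bit Integer.

```vba
Option Explicit

' Hypothetical helper: returns the VBA string (UTF-16) for any code point
' from U+0 to U+10FFFF.
Public Function CodePointToString(ByVal CodePoint As Long) As String
    Dim Offset As Long

    If CodePoint < 0 Or CodePoint > &H10FFFF Then
        Err.Raise 5                 ' invalid procedure call or argument
    ElseIf CodePoint < &H10000 Then
        ' BMP: a single 16-bit unit, as in UCS-2.
        CodePointToString = ChrW$(CodePoint)
    Else
        ' Non-BMP: 20-bit offset, split into two groups of 10 bits.
        Offset = CodePoint - &H10000
        CodePointToString = _
            ChrW$(&HD800& + Offset \ &H400&) & _
            ChrW$(&HDC00& + (Offset And &H3FF&))
    End If
End Function

Public Sub ShowSurrogatePair()
    Dim Clef As String
    Clef = CodePointToString(&H1D11E)

    ' VBA counts 16-bit units, so one non-BMP character has length 2.
    Debug.Print Len(Clef)                                    ' 2
    Debug.Print LenB(Clef)                                   ' 4
    ' AscW returns a signed Integer; And &HFFFF& makes it non-negative.
    Debug.Print Hex$(AscW(Mid$(Clef, 1, 1)) And &HFFFF&)     ' D834
    Debug.Print Hex$(AscW(Mid$(Clef, 2, 1)) And &HFFFF&)     ' DD1E
End Sub
```

For U+1D11E the offset is hexadecimal D11E. Its upper 10 bits are hexadecimal 34 and its lower 10 bits are hexadecimal 11E, which gives the pair D834, DD1E.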

Example: The function ChrW()

The VBA function ChrW() turns a number in the range -32,768 to 65,535 into a one-character string. The number can be written in decimal or hexadecimal format. So, for every character in the BMP, if you know the code point, you can put the character into a VBA string.

Code points are never negative. If the argument of ChrW() is negative, the corresponding code point is obtained by adding 65,536.

Although the character is stored correctly in the VBA string itself, this unfortunately does not mean that the editor will display it correctly. The immediate window and the message box display a question mark if they consider the character too exotic.
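
The negative arguments are easiest to understand with a small experiment. The sketch below uses U+FFE6 (decimal 65,510), whose negative counterpart is 65,510 - 65,536 = -26. The procedure name is only illustrative.

```vba
Option Explicit

Public Sub ShowChrWArguments()
    ' -26 and 65,510 denote the same code point, U+FFE6.
    Debug.Print StrComp(ChrW$(-26), ChrW$(65510), vbBinaryCompare) = 0   ' True

    ' A four-digit hexadecimal literal without a trailing "&" is a
    ' 16-bit Integer, so &HFFE6 is read as -26.
    Debug.Print &HFFE6                              ' -26
    Debug.Print &HFFE6&                             ' 65510

    ' AscW goes the other way and returns a signed 16-bit Integer.
    Debug.Print AscW(ChrW$(65510))                  ' -26

    ' Masking with &HFFFF& has the same effect as adding 65,536
    ' to a negative result.
    Debug.Print AscW(ChrW$(65510)) And &HFFFF&      ' 65510
End Sub
```

This is why negative numbers show up in practice: they come from hexadecimal literals above `&H7FFF` and from the return value of `AscW`.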

Example: ChrW does not always yield the same result as Chr

Run the following program, after changing the path of the file to which the results will be written.

In the range 1 to 255, ChrW and Chr map the same number to the same character except for some numbers between 128 and 159. Have a look at the output file.

Image showing the differences between ChrW and Chr

In Unicode, the numbers from 128 to 159 represent control characters. Chr interprets its argument in the ANSI code page of the system (Windows-1252 on most Western systems), which maps most of these numbers to characters that are thought to be more useful. By doing so, however, the function Chr does not follow the standard set by UCS and Unicode. Be careful if you use these non-standard characters.
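
The differences can also be listed in the immediate window instead of a file. This is a minimal sketch, not the original program. It prints code points rather than the characters themselves, because the immediate window cannot display all of them. The output depends on the ANSI code page of the system; the comment for 128 assumes Windows-1252.

```vba
Option Explicit

Public Sub CompareChrAndChrW()
    Dim Number As Long
    Dim FromChr As Long
    Dim FromChrW As Long

    For Number = 128 To 159
        ' Code point of the character that each function produces.
        FromChr = AscW(Chr$(Number)) And &HFFFF&
        FromChrW = AscW(ChrW$(Number)) And &HFFFF&

        If FromChr <> FromChrW Then
            ' On Windows-1252 the first line reads: 128  U+20AC  U+80
            Debug.Print Number, "U+" & Hex$(FromChr), "U+" & Hex$(FromChrW)
        End If
    Next Number
End Sub
```

ChrW always returns the character whose code point equals the argument, so the third column simply repeats the number in hexadecimal.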

Example: The VBA editor does not support visualization of all BMP characters

Let's consider this text that only consists of characters in the BMP.

Hell〇 ￦ørld!

Here is some VBA code.

Unfortunately, the VBA editor does not support visualization of all BMP characters. Characters that have a code point larger than U+FF (or 255) cannot be visualized. For example, the first call to MsgBox S will not show the text correctly. In the immediate window and the watch window too, you will see "?" for every BMP character that cannot be visualized. In memory, however, the text is fine. You can see this by replacing 〇 and ￦ by something else. It works fine.
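
The replacement experiment can be sketched as follows. The procedure name and the replacement letters "o" and "W" are only illustrative choices. The text is built with `ChrW$` because the two characters cannot be typed into the editor.

```vba
Option Explicit

Public Sub ReplaceExoticCharacters()
    Dim S As String

    ' "Hell" + U+3007 + space + U+FFE6 + U+F8 + "rld!"
    S = "Hell" & ChrW$(&H3007) & " " & ChrW$(&HFFE6) & ChrW$(&HF8) & "rld!"
    Debug.Print Len(S)              ' 12

    ' Replace the two characters that the editor cannot display.
    S = Replace(S, ChrW$(&H3007), "o", 1, -1, vbBinaryCompare)
    S = Replace(S, ChrW$(&HFFE6), "W", 1, -1, vbBinaryCompare)

    ' Both characters were found and replaced, so they were stored
    ' correctly in memory all along.
    Debug.Print Len(S)              ' 12
    Debug.Print InStr(1, S, "o W", vbBinaryCompare)   ' 5
    Debug.Print S
End Sub
```

After the replacement, the last line prints a text in which every character has a code point of at most U+FF.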