General
Imagine a set that contains all the characters of all known languages, of all traditions of all times, past and present.
The UCS and the Unicode set
Two organizations have made (and are making) the effort to build such a set.
The first organization is the International Organization for Standardization (ISO). Their set is called the Universal Coded Character Set (UCS). It is defined in the standard ISO/IEC 10646; the latest version at the time of writing, ISO/IEC 10646:2014, defines 128,585 characters. Each character in the UCS set has a unique name and a unique number.
The second organization is the Unicode Consortium, which represents computer manufacturers. Their set is defined in the Unicode Standard and we will call it the Unicode set. Version 9.0 of the Unicode set contains 128,172 characters. Again, each character in the Unicode set has a unique name and a unique number. An overview of the languages and PDFs containing the characters in the Unicode set can be found here. A list of all the characters in the Unicode set can be found on this great site.
Luckily, the two organizations cooperate closely. They each have their own update frequency and focus, but essentially their sets are the same: they contain the same characters, apart from differences due to update frequency.
A code point
In the UCS and the Unicode set, each character has a code point: a number that acts as an id for the character. Code points are non-negative integers. They can be written in decimal or in hexadecimal format. In hexadecimal format, the number is preceded by "U+". The standards pad the hexadecimal form to at least four digits (U+0061); this text omits the leading zeroes.
For example,
Character | Name | Code point (hexadecimal) | Code point (decimal) |
---|---|---|---|
a | Latin Small Letter A | U+61 | 97 |
Д | Cyrillic Capital Letter De | U+414 | 1044 |
☢ | Radioactive Sign | U+2622 | 9762 |
⻅ | Cjk Radical C-Simplified See | U+2EC5 | 11973 |
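The two notations in the table can be checked from VBA itself. The following is a minimal sketch that uses the built-in functions `ChrW`, `AscW` and `Hex`; the procedure name `ShowCodePoint` is only illustrative. The character is built with `ChrW` because a VBA string literal cannot reliably hold non-Latin characters.

```vba
Public Sub ShowCodePoint()

    Dim C As String
    C = ChrW(&H414) 'Cyrillic Capital Letter De

    Debug.Print AscW(C)             '1044 (decimal notation)
    Debug.Print "U+" & Hex(AscW(C)) 'U+414 (hexadecimal notation)

End Sub
```

Note that `AscW` returns a negative number for code points above U+7FFF, so this direct approach is only suitable for the lower half of the range shown here.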
In both UCS and the Unicode set, there is space for 1,114,112 code points.
Description | Code point (hexadecimal) | Code point (decimal) |
---|---|---|
First available code point | U+0 | 0 |
Last available code point | U+10FFFF | 1,114,111 |
At the time of writing there are around 128,000 characters in UCS and the Unicode set. This means that not all available code points are taken, leaving space for many more characters.
The Basic Multilingual Plane
The most important part of both UCS and the Unicode set is the first 65,536 code points. This subset is called the Basic Multilingual Plane (BMP).
Description | Code point (hexadecimal) | Code point (decimal) |
---|---|---|
First available code point in BMP | U+0 | 0 |
Last available code point in BMP | U+FFFF | 65,535 |
The BMP contains the characters of (almost) all modern languages (English, Spanish, Norwegian, Russian, Chinese, Japanese, Korean... you name it). It also contains many of the modern symbols. Even in the BMP, not all the available code points are assigned to characters. There is still open space, but not much.
The UCS-2 encoding
How can the code points in the Basic Multilingual Plane be represented in bits? The obvious choice would be to use two bytes.
Two bytes can have 65,536 values, which is exactly the number of available code points in the BMP.
Description | Code point (hexadecimal) | Code point (decimal) | Two bytes |
---|---|---|---|
First available code point in BMP | U+0 | 0 | 00000000 00000000 |
Second available code point in BMP | U+1 | 1 | 00000000 00000001 |
Third available code point in BMP | U+2 | 2 | 00000000 00000010 |
... | ... | ... | ... |
Last available code point in BMP | U+FFFF | 65,535 | 11111111 11111111 |
The endianness
What about the order of the two bytes? There are two ways of ordering them: high endianness (usually called big-endian) and low endianness (usually called little-endian). In high endianness, the most significant byte comes first (as when writing a number). In low endianness, the least significant byte comes first. The choice of the endianness is guided by architectural considerations. In Windows, and therefore in VBA, you can assume low endianness.
Description | Code point (hexadecimal) | Code point (decimal) | Two bytes high endianness | Two bytes low endianness |
---|---|---|---|---|
First available code point in BMP | U+0 | 0 | 00000000 00000000 | 00000000 00000000 |
Second available code point in BMP | U+1 | 1 | 00000000 00000001 | 00000001 00000000 |
Third available code point in BMP | U+2 | 2 | 00000000 00000010 | 00000010 00000000 |
... | ... | ... | ... | ... |
Last available code point in BMP | U+FFFF | 65,535 | 11111111 11111111 | 11111111 11111111 |
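The byte order that VBA uses can be observed directly. The sketch below relies on the VBA feature that assigning a String to a dynamic Byte array copies the raw bytes of the string; the procedure name `ShowByteOrder` is illustrative.

```vba
Public Sub ShowByteOrder()

    Dim S As String
    S = ChrW(&H3007) 'a single BMP character, code point U+3007

    Dim B() As Byte
    B = S 'copies the two bytes that represent the character

    Debug.Print UBound(B) - LBound(B) + 1 '2: the number of bytes
    Debug.Print Hex(B(0))                 '7: the least significant byte comes first
    Debug.Print Hex(B(1))                 '30: the most significant byte comes last

End Sub
```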
Example: BMP characters in VBA
Say we have this text.
Hell〇 ￦ørld!
All the characters in the text are in the BMP.
The binary representation in UCS-2 looks like this.
Character | Code point (hexadecimal) | Code point (decimal) | Two bytes high endianness | Two bytes low endianness |
---|---|---|---|---|
H | U+48 | 72 | 00000000 01001000 | 01001000 00000000 |
e | U+65 | 101 | 00000000 01100101 | 01100101 00000000 |
l | U+6C | 108 | 00000000 01101100 | 01101100 00000000 |
l | U+6C | 108 | 00000000 01101100 | 01101100 00000000 |
〇 | U+3007 | 12295 | 00110000 00000111 | 00000111 00110000 |
(space) | U+20 | 32 | 00000000 00100000 | 00100000 00000000 |
￦ | U+FFE6 | 65510 | 11111111 11100110 | 11100110 11111111 |
ø | U+F8 | 248 | 00000000 11111000 | 11111000 00000000 |
r | U+72 | 114 | 00000000 01110010 | 01110010 00000000 |
l | U+6C | 108 | 00000000 01101100 | 01101100 00000000 |
d | U+64 | 100 | 00000000 01100100 | 01100100 00000000 |
! | U+21 | 33 | 00000000 00100001 | 00100001 00000000 |
This text, in UCS-2 encoding, requires 24 bytes (12 characters of two bytes each). Note that 10 of the 24 bytes contain only zeroes. This is a bit of a waste, which is one of the motivations for developing other encodings.
The VBA code below will "show" you the binary representation in memory. You can see that it matches the two bytes low endianness representation. If you try the code, you need to copy the helper functions too.
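```vba
Public Sub Example()

    Dim S As String
    'ChrW takes a number, so the hexadecimal literals are passed directly
    S = "Hell" & ChrW(&H3007) & " " & ChrW(&HFFE6) & ChrW(&HF8) & "rld!"

    'StrPtr returns the address of the first character of the string
    Dim MemoryAdress As Long
    MemoryAdress = StrPtr(S)

    'two bytes per character, shown as bits, two bytes per line
    MsgBox GetMemoryDump(MemoryAdress, 2 * Len(S), False, eHex, 2, eBit, " ")

End Sub
```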
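Here are the helper functions. The Declare statement is written for 32-bit VBA; 64-bit Office requires the PtrSafe keyword and LongPtr for the pointer values.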
```vba
Option Explicit

Private Declare Sub MoveMemory Lib "kernel32" Alias "RtlMoveMemory" (Destination As Long, Source As Long, ByVal Length As Long)

Public Enum eFormatToDumpTo
    eAscii
    eBit
    eByte
    eHex
End Enum

Public Enum eBitValue
    eOne = 1
    eZero = 0
End Enum

Public Function GetMemoryDump(ByVal MemoryAdressToStartAt As Long, ByVal NumberOfBytesToDump As Long, Optional ByVal AddMemoryAdressToLine As Boolean = True, Optional ByVal FormatToDumpMemoryAdressTo As eFormatToDumpTo = eFormatToDumpTo.eAscii, Optional NumberOfBytesToPutOnOneLine As Long = 16, Optional ByVal FormatToDumpByteTo As eFormatToDumpTo = eFormatToDumpTo.eAscii, Optional ByVal DelimitorBetweenBytes As String = " ") As String

    On Error GoTo GetMemoryDumpError

    'ready?
    Dim NumberOfBytesDumped As Long
    NumberOfBytesDumped = 0
    Dim NumberOfBytesToDumpNow As Long
    NumberOfBytesToDumpNow = 0
    Dim MemoryAdress As Long
    MemoryAdress = 0
    Dim Bytes() As Byte
    ReDim Bytes(0 To NumberOfBytesToPutOnOneLine)
    Dim IndexOfBytes As Long
    IndexOfBytes = 0
    Dim Factor As Integer
    Factor = 0
    '-
    Dim Line As String
    Line = ""
    Dim Line_Adress As String
    Line_Adress = ""
    Dim Line_Ascii As String
    Line_Ascii = ""
    Dim Line_Bits As String
    Line_Bits = ""
    Dim Line_Bytes As String
    Line_Bytes = ""
    Dim Line_Hexadecimal As String
    Line_Hexadecimal = ""

    'set?
    If (NumberOfBytesToDump < 0) Then
        Factor = -1
    Else
        Factor = 1
    End If
    NumberOfBytesToDump = Factor * NumberOfBytesToDump

    'go!
    Do While (NumberOfBytesDumped < NumberOfBytesToDump)

        NumberOfBytesToDumpNow = NumberOfBytesToDump - NumberOfBytesDumped
        If (NumberOfBytesToDumpNow > NumberOfBytesToPutOnOneLine) Then NumberOfBytesToDumpNow = NumberOfBytesToPutOnOneLine
        'so dump {1,2,...,NumberOfBytesToPutOnOneLine} bytes

        'step1: get {1,2,...,NumberOfBytesToPutOnOneLine} bytes
        MemoryAdress = MemoryAdressToStartAt + Factor * NumberOfBytesDumped
        MoveMemory ByVal VarPtr(Bytes(0)), ByVal MemoryAdress, NumberOfBytesToDumpNow

        'step2: print {1,2,...,NumberOfBytesToPutOnOneLine} bytes
        Line = ""

        If (FormatToDumpMemoryAdressTo = eFormatToDumpTo.eAscii) Then
            Line_Adress = MemoryAdress
        ElseIf (FormatToDumpMemoryAdressTo = eFormatToDumpTo.eBit) Then
            Line_Adress = MemoryAdress
        ElseIf (FormatToDumpMemoryAdressTo = eFormatToDumpTo.eByte) Then
            Line_Adress = MemoryAdress
        ElseIf (FormatToDumpMemoryAdressTo = eFormatToDumpTo.eHex) Then
            Line_Adress = Hex(MemoryAdress)
        End If

        Line_Ascii = ""
        Line_Bits = ""
        Line_Bytes = ""
        Line_Hexadecimal = ""

        For IndexOfBytes = 0 To NumberOfBytesToDumpNow - 1

            If (FormatToDumpByteTo = eFormatToDumpTo.eAscii) Then
                Line_Ascii = Line_Ascii & CStr(ChrW(Bytes(IndexOfBytes)))
            End If

            If (FormatToDumpByteTo = eFormatToDumpTo.eBit) Then
                Line_Bits = Line_Bits & ToBits(Bytes(IndexOfBytes), , 8)
                If (IndexOfBytes < (NumberOfBytesToDumpNow - 1)) Then Line_Bits = Line_Bits & DelimitorBetweenBytes
            End If

            If (FormatToDumpByteTo = eFormatToDumpTo.eByte) Then
                If (Bytes(IndexOfBytes) < 10) Then
                    Line_Bytes = Line_Bytes & ("00" & CStr(Bytes(IndexOfBytes)))
                ElseIf (Bytes(IndexOfBytes) < 100) Then
                    Line_Bytes = Line_Bytes & ("0" & CStr(Bytes(IndexOfBytes)))
                Else
                    Line_Bytes = Line_Bytes & CStr(Bytes(IndexOfBytes))
                End If
                If (IndexOfBytes < (NumberOfBytesToDumpNow - 1)) Then Line_Bytes = Line_Bytes & DelimitorBetweenBytes
            End If

            If (FormatToDumpByteTo = eFormatToDumpTo.eHex) Then
                If (Bytes(IndexOfBytes) < 16) Then
                    Line_Hexadecimal = Line_Hexadecimal & ("0" & CStr(Hex(Bytes(IndexOfBytes))))
                Else
                    Line_Hexadecimal = Line_Hexadecimal & CStr(Hex(Bytes(IndexOfBytes)))
                End If
                If (IndexOfBytes < (NumberOfBytesToDumpNow - 1)) Then Line_Hexadecimal = Line_Hexadecimal & DelimitorBetweenBytes
            End If

        Next IndexOfBytes

        If (AddMemoryAdressToLine) Then
            Line = Line_Adress & vbTab
        End If

        If (FormatToDumpByteTo = eFormatToDumpTo.eAscii) Then
            Line = Line & Line_Ascii
        ElseIf (FormatToDumpByteTo = eFormatToDumpTo.eBit) Then
            Line = Line & Line_Bits
        ElseIf (FormatToDumpByteTo = eFormatToDumpTo.eByte) Then
            Line = Line & Line_Bytes
        ElseIf (FormatToDumpByteTo = eFormatToDumpTo.eHex) Then
            Line = Line & Line_Hexadecimal
        End If

        NumberOfBytesDumped = NumberOfBytesDumped + NumberOfBytesToDumpNow

        If (NumberOfBytesDumped < NumberOfBytesToDump) Then
            GetMemoryDump = GetMemoryDump & Line & ChrW(10)
        Else
            GetMemoryDump = GetMemoryDump & Line
        End If

    Loop

    'finish...
    Exit Function

GetMemoryDumpError:
    GetMemoryDump = ""

End Function

Public Function ToBits(ByVal L As Long, Optional ByVal IsSigned As Boolean = True, Optional NumberOfBits As Long = 32) As String

    On Error GoTo ToBitsError

    Dim S As String
    S = ""
    Dim Index As Long
    Index = 0

    If (IsSigned) Then
        For Index = (NumberOfBits - 1) To 0 Step -1
            If (L >= (2 ^ Index)) Then
                S = S & "1"
                L = L - (2 ^ Index)
            Else
                S = S & "0"
            End If
        Next Index
    Else
    End If

    Dim SFormatted As String
    SFormatted = ""
    For Index = 1 To Len(S)
        If (Index Mod 8 = 0) Then
            SFormatted = " " & Mid(S, Len(S) - Index + 1, 1) & SFormatted
        Else
            SFormatted = Mid(S, Len(S) - Index + 1, 1) & SFormatted
        End If
    Next Index

    ToBits = Trim(SFormatted)
    Exit Function

ToBitsError:
    ToBits = "Error"

End Function
```
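The byte counts quoted for the example text can also be checked without any API call. The sketch below is an alternative to the memory dump, not part of it: it assumes only that assigning a String to a dynamic Byte array copies the raw bytes of the string. The procedure name `CountZeroBytes` is illustrative.

```vba
Public Sub CountZeroBytes()

    Dim S As String
    S = "Hell" & ChrW(&H3007) & " " & ChrW(&HFFE6) & ChrW(&HF8) & "rld!"

    Dim B() As Byte
    B = S 'copies the two-bytes-per-character representation

    Dim Index As Long
    Dim NumberOfZeroBytes As Long
    For Index = LBound(B) To UBound(B)
        If B(Index) = 0 Then NumberOfZeroBytes = NumberOfZeroBytes + 1
    Next Index

    Debug.Print LenB(S)           '24: the total number of bytes
    Debug.Print NumberOfZeroBytes '10: the bytes that contain only zeroes

End Sub
```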
The supplementary planes
All the available code points that are not in the BMP are in the so-called supplementary planes. The supplementary planes cover ancient languages and many more symbols, such as the music symbols.
Description | Code point (hexadecimal) | Code point (decimal) |
---|---|---|
First available code point in BMP | U+0 | 0 |
... | ... | ... |
Last available code point in BMP | U+FFFF | 65,535 |
First available code point in supplementary planes | U+10000 | 65,536 |
... | ... | ... |
Last available code point in supplementary planes | U+10FFFF | 1,114,111 |
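Each plane holds 65,536 code points, so the 1,114,112 available code points form 17 planes: the BMP (plane 0) and 16 supplementary planes. A minimal sketch of that arithmetic follows; the function name `PlaneOf` is hypothetical and the plane numbering follows the usual Unicode convention.

```vba
Public Function PlaneOf(ByVal CodePoint As Long) As Long

    If CodePoint < 0 Or CodePoint > &H10FFFF Then
        Err.Raise 5 'not an available code point
    End If

    'integer division by 65,536 (the size of one plane)
    PlaneOf = CodePoint \ &H10000

End Function

Public Sub ExamplePlaneOf()

    'Hexadecimal literals from &H8000 to &HFFFF need a trailing "&"
    '(for example &HFFFF&), otherwise VBA reads them as negative Integers
    Debug.Print PlaneOf(&H2622)   '0: Radioactive Sign is in the BMP
    Debug.Print PlaneOf(&H1D11E)  '1: Musical Symbol G Clef is in a supplementary plane
    Debug.Print PlaneOf(&H10FFFF) '16: the last available code point

End Sub
```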
The UTF-16 encoding
The code points in the supplementary planes are too big to be represented in two bytes. The obvious alternative would be to use four bytes (UCS-4). However, texts would then double in size and, as many of the bytes would be `00000000`, much of that space would be wasted. It would also mean that manufacturers and programmers need to adjust to four bytes per character instead of two. It proved more practical to extend UCS-2. The extension is called the 16-bit Unicode Transformation Format (UTF-16).
UTF-16 uses two bytes or four bytes to represent the code point of a character. Never one byte, never three bytes.
UTF-16 exploits the fact that there is still open space in the BMP. A large part of that open space is reserved for UTF-16.
Code points in BMP (hexadecimal) | Code points in BMP (decimal) | Use |
---|---|---|
From U+0 to U+D7FF | From 0 to 55,295 | Available for characters |
From U+D800 to U+DBFF | From 55,296 to 56,319 | The high surrogates. Not assigned to characters. 1024 code points reserved for UTF-16. |
From U+DC00 to U+DFFF | From 56,320 to 57,343 | The low surrogates. Not assigned to characters. 1024 code points reserved for UTF-16. |
From U+E000 to U+FFFF | From 57,344 to 65,535 | Available for characters |
In UTF-16, a character in the BMP is represented by two bytes, exactly as in UCS-2. UTF-16 exists in both high and low endianness, so the byte order used in UCS-2 is preserved as well. Characters outside the BMP are represented by four bytes: a high surrogate followed by a low surrogate. This representation is called a surrogate pair. There are 1,024 × 1,024 = 1,048,576 possible surrogate pairs, which is exactly enough to cover all non-BMP code points (U+10000 to U+10FFFF). A simple algorithm turns the code point of a non-BMP character into the right high and low surrogate: subtract 65,536 (U+10000), which leaves a 20-bit number, then add the upper 10 bits to U+D800 to get the high surrogate and the lower 10 bits to U+DC00 to get the low surrogate. It is a great, practical idea! Texts encoded in UCS-2 need not be changed, and many more characters can be represented too.
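The surrogate algorithm can be sketched in VBA as follows. This is an illustration under stated assumptions rather than a library routine: the function name `ChrU` is hypothetical, and it builds the string from two calls to the built-in `ChrW`, since `ChrW` itself does not accept numbers above 65,535.

```vba
Public Function ChrU(ByVal CodePoint As Long) As String

    If CodePoint < 0 Or CodePoint > &H10FFFF Then
        Err.Raise 5 'outside the range of available code points
    ElseIf CodePoint >= &HD800& And CodePoint <= &HDFFF& Then
        Err.Raise 5 'surrogates are not characters
    ElseIf CodePoint < &H10000 Then
        ChrU = ChrW(CodePoint) 'BMP: one two-byte unit, as in UCS-2
    Else
        Dim Offset As Long
        Offset = CodePoint - &H10000 'a 20-bit number, 0 to 1,048,575

        Dim HighSurrogate As Long
        Dim LowSurrogate As Long
        HighSurrogate = &HD800& + (Offset \ &H400)  'upper 10 bits
        LowSurrogate = &HDC00& + (Offset And &H3FF) 'lower 10 bits

        ChrU = ChrW(HighSurrogate) & ChrW(LowSurrogate)
    End If

End Function

Public Sub ExampleChrU()

    Dim S As String
    S = ChrU(&H1D11E) 'Musical Symbol G Clef, a non-BMP character

    Debug.Print Len(S)                   '2: VBA counts two-byte units, not characters
    Debug.Print Hex(AscW(Mid$(S, 1, 1))) 'D834: the high surrogate
    Debug.Print Hex(AscW(Mid$(S, 2, 1))) 'DD1E: the low surrogate

End Sub
```

The trailing `&` on `&HD800&` and `&HDC00&` matters: without it, VBA reads four-digit hexadecimal literals from `&H8000` upwards as negative Integers.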
Example: The function ChrW()
```vba
Public Sub ExampleUnicode1()

    'ChrW()
    Dim S As String
    S = ""

    S = S & ChrW(-32768) '-32769 would give run time error
    S = S & ChrW(7)
    S = S & ChrW(65)
    S = S & ChrW(&H42)
    S = S & ChrW(65535) '65536 would give run time error

    MsgBox S 'shows ?·AB?

    If (ChrW(-32768) = ChrW(32768)) Then
        MsgBox "The same!" 'shows The same!
    End If

End Sub
```
The VBA function `ChrW()` turns a number in the range -32,768 to 65,535 into a character in a string. The number can be in decimal or hexadecimal format. For every character in the BMP, if you know the code point, you can put the character into a VBA string. Code points are never negative. If the argument of `ChrW()` is negative, the corresponding code point is obtained by adding 65,536. Although the representation of the character in the VBA string itself is correct, this unfortunately does not mean that the editor will display it correctly. The immediate window and the message box will display a question mark if they consider the character too exotic.
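The rule of adding 65,536 can be written down as a small helper. This is a sketch; the function name `ToCodePoint` is hypothetical. It is also useful with the built-in `AscW`, the counterpart of `ChrW`, which returns negative numbers for code points above U+7FFF.

```vba
Public Function ToCodePoint(ByVal Number As Long) As Long

    If Number < 0 Then
        ToCodePoint = Number + 65536 'map -32,768..-1 onto 32,768..65,535
    Else
        ToCodePoint = Number
    End If

End Function

Public Sub ExampleToCodePoint()

    Debug.Print ToCodePoint(-32768)             '32768
    Debug.Print ToCodePoint(AscW(ChrW(&HFFE6))) '65510, that is U+FFE6
    Debug.Print ToCodePoint(65)                 '65

End Sub
```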
Example: ChrW does not always yield the same result as Chr
Run the following program, after changing the path of the file to which the results will be written.
```vba
Public Sub Example()

    Dim Number As Long
    Number = 0

    Dim S As String
    S = ""

    For Number = 1 To 255
        S = S & CStr(Number) & vbTab & ChrW(Number) & vbTab & Chr(Number) & vbTab & CStr(ChrW(Number) = Chr(Number)) & vbNewLine
    Next Number

    Dim IndexOfFile As Integer
    IndexOfFile = FreeFile()

    Open "H:\Temp\Test.txt" For Append As #IndexOfFile
    Print #IndexOfFile, S
    Close #IndexOfFile

End Sub
```
In the range 1 to 255, `ChrW` and `Chr` map the same number to the same character, except for some numbers between 128 and 159. Have a look at the output file.
In Unicode, the numbers from 128 to 159 represent control characters. `Chr` maps (most of) them to characters that are thought to be more useful; it follows the system's ANSI code page (Windows-1252 on Western European and US systems, where 128 is the euro sign). By doing so, however, `Chr` does not follow the standard set by UCS and Unicode. Be careful if you use these non-standard characters.
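The difference can be made visible for a single number by converting the result back with `AscW`. The sketch below is illustrative; the result of the `Chr` line depends on the code page of the system, and the value in the comment assumes Windows-1252.

```vba
Public Sub CompareChrAndChrW()

    'ChrW follows Unicode: 128 stays the control character U+80
    Debug.Print AscW(ChrW(128)) '128

    'Chr follows the system code page, so the result varies per system;
    'under Windows-1252 (an assumption) it is 8364, the euro sign U+20AC
    Debug.Print AscW(Chr(128))

End Sub
```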
Example: The VBA editor does not support visualization of all BMP characters
Let's consider this text that only consists of characters in the BMP.
Hell〇 ￦ørld!
Here is some VBA code.
```vba
Public Sub ExampleUnicode2()

    Dim S As String

    S = "Hell" & ChrW(&H3007) & " " & ChrW(&HFFE6) & ChrW(&H00F8) & "rld!"

    MsgBox S 'shows Hell? ?ørld!

    S = Replace(S, ChrW(&H3007), "OO")
    S = Replace(S, ChrW(&HFFE6), "WW")

    MsgBox S 'shows HellOO WWørld! (the ø is not replaced)

End Sub
```
Unfortunately, the VBA editor does not support visualization of all BMP characters. As a rule of thumb, characters with a code point larger than U+FF (or 255) cannot be visualized; the exact limit depends on the code page of the system. For example, the first call to `MsgBox S` will not show the text correctly. Also, in the immediate window or the watch window, you will see "?" for all BMP characters that cannot be visualized. However, in memory, the text is fine. You can see this by replacing 〇 and ￦ by something else. It works fine.
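Since the editor cannot display such characters, one way to inspect a string is to print the code point of each two-byte unit instead of the character itself. A minimal sketch follows; the function name `CodePointsOf` is hypothetical, and it assumes the string contains BMP characters only.

```vba
Public Function CodePointsOf(ByVal S As String) As String

    Dim Index As Long
    Dim Number As Long
    Dim Result As String

    For Index = 1 To Len(S)
        Number = AscW(Mid$(S, Index, 1))
        If Number < 0 Then Number = Number + 65536 'AscW is negative above U+7FFF
        If Index > 1 Then Result = Result & " "
        Result = Result & "U+" & Hex(Number)
    Next Index

    CodePointsOf = Result

End Function

Public Sub ExampleCodePointsOf()

    Dim S As String
    S = "Hell" & ChrW(&H3007) & " " & ChrW(&HFFE6) & ChrW(&HF8) & "rld!"

    Debug.Print CodePointsOf(S)
    'U+48 U+65 U+6C U+6C U+3007 U+20 U+FFE6 U+F8 U+72 U+6C U+64 U+21

End Sub
```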