Universal Character Encoding Standard
Unicode is a character encoding standard that assigns a unique number (called a code point) to every character, symbol, or emoji across the world's writing systems.
Key Points:
Unicode was created by the Unicode Consortium, a group of multilingual software manufacturers. It simplifies software localization and improves multilingual text processing.
Unicode overcomes the difficulty inherent in ASCII and extended ASCII by standardizing script behavior, allowing any combination of characters from different scripts and languages to co-exist in a single document.
In Computer Organization and Architecture (COA), Unicode comes under the topic Representation of Information (Character Codes).
The CPU and memory only understand binary. Unicode assigns each symbol a number (code point), then this number is stored in binary form in memory.
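As a small illustration of the step from symbol to number to binary, here is a minimal Python sketch. Python is chosen only for brevity, and the helper name `code_point_bits` is hypothetical. It shows the code point itself in base 2; the exact bytes placed in memory depend on the encoding scheme in use.

```python
def code_point_bits(ch: str) -> str:
    """Return the code point of a single character as a binary string."""
    cp = ord(ch)            # the number Unicode assigns to the character
    return format(cp, "b")  # the same number written in base 2

# "A" is U+0041 and the euro sign is U+20AC.
for ch in ["A", "\u20ac"]:
    print(f"U+{ord(ch):04X} -> {code_point_bits(ch)}")
```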
Unicode has different encoding schemes, chiefly UTF-8, UTF-16, and UTF-32.
COA cares because encoding affects storage size and memory organization.
When a program processes text, the CPU loads the binary Unicode values from memory. The ALU (Arithmetic Logic Unit) may compare or manipulate these codes, but the actual rendering (showing a letter such as "अ" on screen) is done by the OS and application software.
Earlier CPUs and memory systems were designed mainly for ASCII (1 byte per character). With Unicode, systems must handle multi-byte characters, which affects storage size and memory organization.
Just like numbers are stored in binary, text must also be represented in a standard way. Unicode provides a universal standard to represent characters from all languages.
The chosen Unicode encoding determines how many bytes each character occupies, which helps in organizing memory correctly.
Unicode bridges old 8-bit systems (like ASCII and EBCDIC) with modern systems that use 16-bit or 32-bit encodings (like UTF-16 and UTF-32).
In traditional C programming, a char is 1 byte, which works well for ASCII characters. But Unicode characters may require multiple bytes, so languages introduced types like wchar_t or string classes that support multi-byte characters.
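To make the gap between bytes and characters concrete, here is a minimal Python sketch of how byte-oriented code can count characters in UTF-8 data. It assumes the input is valid UTF-8, and the function name `count_code_points` is hypothetical. It relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 data without decoding it."""
    # Continuation bytes look like 10xxxxxx; every other byte starts a
    # new character, so counting those bytes gives the character count.
    return sum(1 for b in data if (b & 0xC0) != 0x80)

text = "na\u00efve"                # "naive" with a diaeresis on the i
encoded = text.encode("utf-8")
print(len(encoded))                # number of bytes (1-byte units)
print(count_code_points(encoded))  # number of code points
```

The byte count is what a 1-byte-per-`char` view of the string reports, which is why it exceeds the character count as soon as a non-ASCII character appears.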
Example in action:
Suppose the Arabic word "مرحبا" is stored as a UTF-8 string in a C program.
Each Arabic letter (م, ر, ح, ب, ا) is stored in memory using 2 bytes, because the Arabic block (U+0600 to U+06FF) lies in the range that UTF-8 encodes with 2 bytes per character.
The CPU loads these multi-byte sequences as ordinary binary data. The OS text rendering engine decodes the bytes and correctly displays the Arabic word "مرحبا" (which means "Hello").
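The byte layout of this word can be inspected with a short Python sketch (Python is used here only for illustration; the variable names are hypothetical). It encodes each letter to UTF-8 and prints the resulting bytes in hexadecimal.

```python
# The five letters of the Arabic word for "Hello", written as escapes.
word = "\u0645\u0631\u062d\u0628\u0627"

for letter in word:
    encoded = letter.encode("utf-8")
    # Show the code point, its UTF-8 bytes, and how many bytes it needs.
    print(f"U+{ord(letter):04X} -> {encoded.hex(' ').upper()} ({len(encoded)} bytes)")

total_bytes = len(word.encode("utf-8"))
print(total_bytes)  # total bytes for the whole word
```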
Unicode defines multiple encodings of its single character set: UTF-8, UTF-16, and UTF-32. (UTF-7 also exists, but it is an older mail-oriented encoding specified outside the Unicode Standard.) Conversion of data among these encodings is lossless.
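The lossless-conversion claim can be checked with a minimal Python round trip. This is a sketch using Python's built-in codecs; the sample string is an arbitrary choice, and the little-endian codec variants are used so that no byte-order mark is added.

```python
# Latin "A", Devanagari letter A (U+0905), and an emoji (U+1F600).
original = "A\u0905\U0001F600"

utf8 = original.encode("utf-8")
# Convert UTF-8 -> UTF-16 -> UTF-32 -> UTF-8.
utf16 = utf8.decode("utf-8").encode("utf-16-le")
utf32 = utf16.decode("utf-16-le").encode("utf-32-le")
back_to_utf8 = utf32.decode("utf-32-le").encode("utf-8")

print(back_to_utf8 == utf8)  # True when nothing was lost on the way
```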
| Encoding | Bytes per Character | Description |
|---|---|---|
| UTF-8 | 1 to 4 bytes | Variable length depending on the character. ASCII characters take only 1 byte, other characters in the Basic Multilingual Plane (BMP) take 2 or 3 bytes, and characters outside the BMP take 4 bytes. |
| UTF-16 | 2 or 4 bytes | Uses 2 bytes for BMP characters, while characters outside the BMP take 4 bytes. |
| UTF-32 | 4 bytes | Fixed 4 bytes per character, so the number of code points in a UTF-32 string is its byte count divided by 4. |
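The sizes in the table can be reproduced with a short Python sketch. The function name `encoded_sizes` is hypothetical and the three sample characters are arbitrary picks, one from each size class.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the byte length of one character in each encoding."""
    # The "-le" codecs fix the byte order, so no byte-order mark is added.
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-le")),
        "UTF-32": len(ch.encode("utf-32-le")),
    }

# ASCII letter, Devanagari letter (U+0905), emoji (U+1F600).
for ch in ["A", "\u0905", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# With UTF-32, the character count follows directly from the byte count.
utf32_count = len("abc".encode("utf-32-le")) // 4
```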
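In its 32-bit form, a code point is written with eight hexadecimal digits as U-XXXXXXXX.
With 32 bits the numbering could run from U-00000000 to U-FFFFFFFF, but Unicode restricts valid code points to the range U+0000 to U+10FFFF.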
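Unicode divides the available code space into planes. A plane is a contiguous group of 65,536 code points.
The most significant 16 bits define the plane, and the least significant 16 bits select one of up to 65,536 characters or symbols within it. A 16-bit plane field could number 65,536 planes, but Unicode uses only 17 of them (planes 0000 through 0010 in hexadecimal).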
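The split into plane and position can be sketched with two bit operations. This is a minimal Python illustration; the function name `split_code_point` is hypothetical.

```python
def split_code_point(cp: int) -> tuple:
    """Split a code point into (plane, offset within the plane)."""
    plane = cp >> 16      # most significant 16 bits
    offset = cp & 0xFFFF  # least significant 16 bits
    return plane, offset

for cp in [0x0905, 0x10140, 0x20000]:
    plane, offset = split_code_point(cp)
    print(f"U+{cp:04X}: plane {plane}, offset 0x{offset:04X}")
```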
| Plane Type | Plane Number | Description |
|---|---|---|
| Basic Multilingual Plane (BMP) | 0000 | Designed to be compatible with the previous 16-bit Unicode. The most significant 16 bits in this plane are all zeroes. Mostly defines character sets in different languages. Represented as U+XXXX, where XXXX is the least significant 16 bits. Examples: U+0900 to U+097F are reserved for Devanagari, and U+2200 to U+22FF are reserved for mathematical operators. |
| Supplementary Multilingual Plane (SMP) | 0001 | Designed to provide more codes for those multilingual characters that are excluded from the BMP. Example: U+10140 to U+1018F are reserved for Ancient Greek Numbers. |
| Supplementary Ideographic Plane (SIP) | 0002 | Designed to provide codes for ideographic symbols, which convey an idea rather than a sound. Example: U+20000 to U+2A6DF are reserved for CJK Unified Ideographs Extension B. |
| Supplementary Special-purpose Plane (SSP) | 000E | Used for special characters. Example: U+E0000 to U+E007F are reserved for tags. |
| Private Use Planes (PUPs) | 000F and 0010 | For private use. They are used by fonts internally to refer to auxiliary glyphs. |
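The table can be turned into a small lookup. The sketch below is an illustration in Python, with the hypothetical names `PLANE_ABBREVIATIONS` and `plane_abbreviation`. It covers only the planes listed in the table and reports anything else as unlisted.

```python
# Plane number -> abbreviation, taken from the table above.
PLANE_ABBREVIATIONS = {
    0x00: "BMP",
    0x01: "SMP",
    0x02: "SIP",
    0x0E: "SSP",
    0x0F: "PUP",
    0x10: "PUP",
}

def plane_abbreviation(cp: int) -> str:
    """Return the table's abbreviation for the plane containing cp."""
    return PLANE_ABBREVIATIONS.get(cp >> 16, "unlisted")

for cp in [0x2200, 0x10140, 0x2A6DF, 0xE0001, 0x10FFFD]:
    print(f"U+{cp:04X}: {plane_abbreviation(cp)}")
```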
Universal character set: Unicode supports almost all the characters and symbols used in the world's writing systems, making it a universal character set that can be used to represent text in any language.
Interoperability: Unicode provides interoperability between different computing systems, platforms, and software applications. This means that text encoded in Unicode can be exchanged and displayed correctly across different systems, regardless of the language or script used.
Compatibility: Unicode is compatible with all the major computing platforms, including Windows, macOS, Linux, and mobile devices. This makes it easy to share and display text across different devices and platforms.
Efficient storage: Unicode's most widely used encoding, UTF-8, is variable-length and stores ASCII characters in a single byte, so text that is mostly ASCII takes no more space than it would in ASCII itself.
Complexity: Unicode is a complex encoding standard that can be difficult to implement and use correctly. It requires a significant amount of knowledge and expertise to correctly encode, store, and display text in Unicode.
Compatibility issues with legacy systems: Some legacy systems and software applications may not support Unicode or may not display Unicode characters correctly. This can cause compatibility issues when exchanging text across different systems.
Large character set: Unicode's large character set can be a disadvantage in some applications, where only a small subset of characters is needed. This can result in larger file sizes and increased memory usage.
Localization: While Unicode supports most of the world's writing systems, it may not be sufficient for some localization requirements, such as the need for specialized symbols or characters that are unique to a particular language or culture.
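The legacy-compatibility problem mentioned above can be reproduced in a few lines. This Python sketch assumes, purely for illustration, a legacy program that reads bytes as Latin-1 and another that accepts only ASCII. The variable names are hypothetical.

```python
text = "caf\u00e9"  # four characters; the last is e with an acute accent
utf8_bytes = text.encode("utf-8")

# A legacy reader that assumes Latin-1 treats every byte as one character,
# so the 2-byte UTF-8 sequence C3 A9 turns into two unrelated characters.
misread = utf8_bytes.decode("latin-1")
print(len(text), len(misread))  # the misread string is one character longer

# A pure ASCII system cannot represent the accented character at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```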
Key Takeaways:
Unicode ensures that text in any language or script, including Arabic, Chinese, and emoji, can be stored, transmitted, and displayed correctly in modern computer systems. This is crucial for internationalization and proper handling of global data.