๐ŸŒ What is Unicode?

Universal Character Encoding Standard

Unicode is a character encoding standard that assigns a unique number (called a code point) to every character, symbol, and emoji across all the world's writing systems.

✅ Key Points:

  • Created to replace older limited encodings like ASCII (7-bit, 128 characters) and EBCDIC (8-bit, 256 characters)
  • Provides a code space of over 1.1 million code points (1,114,112)
  • Supports English letters, Chinese, Arabic, Hindi, emojis, math symbols, and more

Unicode Examples

In ASCII:
'A' = 65 (01000001 in binary)

In Unicode:
'A' = U+0041
'अ' (Devanagari A) = U+0905
'🙂' (emoji) = U+1F642

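As a quick check (a minimal sketch, not part of the original example), the code points above can be printed from a short C11 program, assuming a compiler with char32_t support:

    #include <stdio.h>
    #include <uchar.h>                  /* char32_t (C11) */

    int main(void) {
        char32_t a      = U'A';         /* U+0041 */
        char32_t dev_a  = 0x0905;       /* अ, Devanagari A */
        char32_t smiley = 0x1F642;      /* 🙂 */

        printf("U+%04X\n", (unsigned)a);
        printf("U+%04X\n", (unsigned)dev_a);
        printf("U+%04X\n", (unsigned)smiley);
        return 0;
    }
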
Unicode Consortium

Unicode was created by the Unicode Consortium, a non-profit group of major software and technology companies that build multilingual products. It simplifies software localization and improves multilingual text processing.

Unicode overcomes the difficulty inherent in ASCII and extended ASCII by standardizing script behavior, allowing any combination of characters from different scripts and languages to co-exist in a single document.

Unicode in COA (Computer Organization & Architecture)

In COA, Unicode comes under the topic of Representation of Information → Character Codes.

Binary Representation

The CPU and memory only understand binary. Unicode assigns each symbol a number (its code point), and this number is stored in binary form in memory.
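
As a small sketch of this idea (the variable names are illustrative, not from the text), the code point U+0905 can be placed in a 32-bit word and its stored bit pattern printed:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t code_point = 0x0905;   /* अ: the number Unicode assigns */

        /* Print the 32-bit pattern the CPU actually stores for this code point. */
        for (int bit = 31; bit >= 0; bit--)
            putchar(((code_point >> bit) & 1) ? '1' : '0');
        putchar('\n');                  /* 00000000000000000000100100000101 */
        return 0;
    }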

Encoding Forms (UTF)

Unicode has different encoding schemes:

  • UTF-8: Variable length (1 to 4 bytes). Most common on the web.
  • UTF-16: 2 or 4 bytes. Common in Windows/Java.
  • UTF-32: Fixed 4 bytes per character.

COA cares about this because the choice of encoding affects storage size and memory organization.
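
A minimal sketch of that storage difference, assuming a C11 compiler and UTF-8 source encoding (sizeof includes the string terminator, which is subtracted below):

    #include <stdio.h>
    #include <uchar.h>                  /* char16_t, char32_t */

    int main(void) {
        /* The same character, 'अ' (U+0905), stored in the three encoding forms. */
        printf("UTF-8 : %zu bytes\n", sizeof(u8"अ") - sizeof(char));      /* 3 */
        printf("UTF-16: %zu bytes\n", sizeof(u"अ")  - sizeof(char16_t));  /* 2 */
        printf("UTF-32: %zu bytes\n", sizeof(U"अ")  - sizeof(char32_t));  /* 4 */
        return 0;
    }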

Instruction Execution

When a program processes text, the CPU loads the binary Unicode values from memory. The ALU (Arithmetic Logic Unit) may compare or manipulate these codes, but the actual rendering (drawing the letter "अ" on screen) is done by the OS and application software.
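
For example (a hedged sketch, assuming UTF-8 source encoding), a plain byte-wise string comparison is just a sequence of integer comparisons in the ALU, and for UTF-8 it orders text by code point rather than by any language's alphabet:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* strcmp compares raw byte values (as unsigned char); for UTF-8 this
           matches code point order: U+005A < U+0061 < U+0905. */
        printf("%d\n", strcmp("Z", "a") < 0);    /* 1 */
        printf("%d\n", strcmp("a", "अ") < 0);    /* 1 */
        return 0;
    }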

Internationalization (I18N) in Architecture

Earlier CPUs and memory systems were designed mainly for ASCII (1 byte per char). With Unicode, systems must handle multi-byte characters, which impacts:

  • Memory addressing
  • Bus transfers (more bytes per character)
  • Instruction set design (string handling)

Why Unicode Matters in COA

Data Representation

Just like numbers are stored in binary, text must also be represented in a standard way. Unicode provides a universal standard to represent characters from all languages.

Memory Organization

The chosen encoding form (UTF-8, UTF-16, or UTF-32) determines how many bytes each character occupies, which helps in organizing memory correctly.
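
A small sketch of that idea (utf8_length is a hypothetical helper, assuming a well-formed UTF-8 string and UTF-8 source encoding): continuation bytes have the form 10xxxxxx, so counting the bytes that are not continuation bytes gives the number of characters, while strlen gives the number of bytes they occupy.

    #include <stdio.h>
    #include <string.h>

    /* Count code points in a UTF-8 string by skipping 10xxxxxx continuation bytes. */
    static size_t utf8_length(const char *s) {
        size_t count = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                count++;
        return count;
    }

    int main(void) {
        const char *s = "Hi🙂";                        /* 'H', 'i', U+1F642 */
        printf("code points: %zu\n", utf8_length(s));  /* 3 */
        printf("bytes      : %zu\n", strlen(s));       /* 6 = 1 + 1 + 4 */
        return 0;
    }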

Compatibility

Unicode bridges older 7- and 8-bit encodings (like ASCII and EBCDIC) with modern systems that use 16-bit or 32-bit encodings (like UTF-16 and UTF-32).

Real Systems

In traditional C programming, a char is 1 byte, which works well for ASCII characters. But Unicode characters may require multiple bytes, so languages introduced types like wchar_t or string classes that support multi-byte characters.
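
A short sketch of that contrast (assuming UTF-8 source and execution encoding; note that the size of wchar_t is platform dependent, typically 2 bytes on Windows and 4 bytes on Linux):

    #include <stdio.h>
    #include <wchar.h>

    int main(void) {
        char    narrow[] = "अ";    /* UTF-8: three 1-byte char elements + terminator */
        wchar_t wide[]   = L"अ";   /* one wchar_t element + terminator */

        printf("UTF-8 in char[]: %zu bytes\n", sizeof(narrow) - 1);            /* 3 */
        printf("wchar_t size   : %zu bytes (platform dependent)\n", sizeof(wchar_t));
        printf("wide elements  : %zu\n", sizeof(wide) / sizeof(wchar_t) - 1);  /* 1 */
        return 0;
    }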

👉 Example in action:

If you type this in C with UTF-8 encoding:

printf("مرحبا");

Each Arabic letter (م, ر, ح, ب, ا) is stored in memory using 2 bytes in UTF-8, because characters outside the ASCII range need more than 1 byte.

The CPU loads these multi-byte sequences and sends them to the display hardware. The OS font rendering engine processes the bytes and correctly displays the Arabic word "مرحبا" (which means "Hello").
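
A sketch of what actually sits in memory for this example (assuming the source file and execution character set are UTF-8):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *hello = "مرحبا";                    /* 5 Arabic letters */
        size_t n = strlen(hello);

        printf("bytes in memory: %zu\n", n);            /* 10: 2 bytes per letter */
        for (size_t i = 0; i < n; i++)
            printf("%02X ", (unsigned char)hello[i]);   /* D9 85 D8 B1 D8 AD D8 A8 D8 A7 */
        printf("\n");
        return 0;
    }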

Unicode Encodings

Unicode defines multiple encoding forms of its single character set: UTF-8, UTF-16, and UTF-32 (the older UTF-7 is a legacy mail-safe encoding that is not part of the current standard). Conversion of data among these encoding forms is lossless.

  • UTF-8 (1 to 4 bytes): Variable length. ASCII characters take only 1 byte, while characters outside the BMP take 4.
  • UTF-16 (2 or 4 bytes): Uses 2 bytes for BMP characters, while characters outside the BMP take 4 bytes as a surrogate pair.
  • UTF-32 (4 bytes): Fixed length. The number of characters in a UTF-32 string is simply its byte count divided by 4.
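
To make this concrete, here is a hedged sketch that encodes one code point, U+1F642 (🙂), by hand into each form; the bit manipulation follows the standard UTF-8 and UTF-16 rules for code points above U+FFFF:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t cp = 0x1F642;                          /* 🙂 */

        /* UTF-8: code points in U+10000..U+10FFFF take 4 bytes. */
        unsigned b1 = 0xF0 | (cp >> 18);
        unsigned b2 = 0x80 | ((cp >> 12) & 0x3F);
        unsigned b3 = 0x80 | ((cp >> 6)  & 0x3F);
        unsigned b4 = 0x80 | (cp & 0x3F);
        printf("UTF-8 : %02X %02X %02X %02X\n", b1, b2, b3, b4);  /* F0 9F 99 82 */

        /* UTF-16: the same code point becomes a surrogate pair (2 x 2 bytes). */
        uint32_t v  = cp - 0x10000;
        unsigned hi = 0xD800 | (v >> 10);
        unsigned lo = 0xDC00 | (v & 0x3FF);
        printf("UTF-16: %04X %04X\n", hi, lo);                    /* D83D DE42 */

        /* UTF-32: simply the code point itself stored in 4 bytes. */
        printf("UTF-32: %08X\n", (unsigned)cp);                   /* 0001F642 */
        return 0;
    }
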
Unicode Notation

The standard notation writes a code point as U+ followed by four to six hexadecimal digits, for example U+0041 or U+1F642.

Code points range from U+0000 to U+10FFFF.

Unicode Planes

Unicode divides the available code space into planes. A plane is a contiguous group of 65,536 code points.

Plane Structure

The bits above the least significant 16 bits of a code point select its plane, and each plane can define up to 65,536 characters or symbols. Unicode defines 17 planes, numbered 0 through 16 (hex 0x10); a short sketch after the table below shows how the plane is computed from a code point.

  • Basic Multilingual Plane (BMP), plane 0: Designed to be compatible with the original 16-bit version of Unicode; the most significant 16 bits of every code point in this plane are zero. It holds the character sets of most modern languages, and its code points are written U+XXXX using the least significant 16 bits. Examples: U+0900 to U+097F is reserved for Devanagari, and U+2200 to U+22FF for mathematical operators.
  • Supplementary Multilingual Plane (SMP), plane 1: Provides codes for scripts and symbols that did not fit in the BMP. Example: U+10140 to U+1018F is reserved for Ancient Greek Numbers.
  • Supplementary Ideographic Plane (SIP), plane 2: Provides codes for ideographic symbols, which convey an idea rather than a sound. Example: U+20000 to U+2A6DF is reserved for CJK Unified Ideographs Extension B.
  • Supplementary Special-purpose Plane (SSP), plane 14 (hex E): Used for special characters. Example: U+E0000 to U+E007F is reserved for tags.
  • Private Use Planes (PUPs), planes 15 and 16 (hex F and 10): Reserved for private use; fonts may use them internally to refer to auxiliary glyphs.
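
A minimal sketch of the plane calculation described above (the code point values are just examples): the plane number is the part of the code point above the low 16 bits, and the offset is the low 16 bits.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t examples[] = { 0x0041, 0x0905, 0x1F642, 0x20001 };

        for (int i = 0; i < 4; i++) {
            uint32_t cp = examples[i];
            printf("U+%04X -> plane %u, offset %04X\n",
                   (unsigned)cp, (unsigned)(cp >> 16), (unsigned)(cp & 0xFFFF));
        }
        return 0;
    }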

Advantages and Disadvantages

Advantages

๐ŸŒ Universal character set: Unicode supports almost all the characters and symbols used in the world's writing systems, making it a universal character set that can be used to represent text in any language.

🔄 Interoperability: Unicode provides interoperability between different computing systems, platforms, and software applications. This means that text encoded in Unicode can be exchanged and displayed correctly across different systems, regardless of the language or script used.

✅ Compatibility: Unicode is supported by all the major computing platforms, including Windows, macOS, Linux, and mobile devices. This makes it easy to share and display text across different devices and platforms.

💾 Efficient storage: With UTF-8, common ASCII characters need only 1 byte while rarer characters use up to 4, so typical text is stored compactly compared with always using a fixed 2 or 4 bytes per character.

Disadvantages

🧩 Complexity: Unicode is a complex encoding standard that can be difficult to implement and use correctly. It requires a significant amount of knowledge and expertise to correctly encode, store, and display text in Unicode.

⚠️ Compatibility issues with legacy systems: Some legacy systems and software applications may not support Unicode or may not display Unicode characters correctly. This can cause compatibility issues when exchanging text across different systems.

📦 Large character set: Unicode's large character set can be a disadvantage in applications where only a small subset of characters is needed; fixed-width encodings such as UTF-16 or UTF-32 in particular can result in larger file sizes and increased memory usage.

🌍 Localization: While Unicode supports most of the world's writing systems, it may not be sufficient for some localization requirements, such as the need for specialized symbols or characters that are unique to a particular language or culture.

Summary

📌 Key Takeaways:

  • Unicode = universal standard for text representation
  • In COA, it's studied under Character Codes (with ASCII, EBCDIC, BCD)
  • Important because it affects how text is stored in memory, fetched by CPU, and processed in instructions
  • Encodings (UTF-8/16/32) decide memory size and performance in handling text

Unicode ensures that text from any language, whether Arabic, Chinese, or emoji, can be stored, transmitted, and displayed correctly in modern computer systems. This is crucial for internationalization and the proper handling of global data.