Published on 25th October 2003
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English, no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
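Both kinds of conflict are easy to reproduce. The following is a minimal Python sketch, not taken from the original article; the choice of legacy encodings (Latin-1, Windows-1251, ISO 8859-7, code page 437) and the helper name `decode_with` are illustrative assumptions.

```python
# The same byte value read under three legacy encodings.
raw = bytes([0xE9])

def decode_with(encodings, data):
    """Return {encoding name: decoded text} for the same raw bytes."""
    return {name: data.decode(name) for name in encodings}

# Same number, different characters:
#   latin-1   -> U+00E9 (Latin small letter e with acute)
#   cp1251    -> U+0439 (Cyrillic small letter short i)
#   iso8859_7 -> U+03B9 (Greek small letter iota)
same_number = decode_with(["latin-1", "cp1251", "iso8859_7"], raw)

# Same character, different numbers: U+00E9 stored in two legacy encodings.
e_acute = "\u00e9"
same_character = {name: e_acute.encode(name) for name in ["latin-1", "cp437"]}

for name, text in same_number.items():
    print(f"byte 0xE9 read as {name}: U+{ord(text):04X}")
for name, data in same_character.items():
    print(f"U+00E9 written as {name}: 0x{data[0]:02X}")
```

A program that receives the byte 0xE9 without knowing which encoding produced it cannot tell which of the three characters was meant, which is the corruption risk described above.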
The Unicode Standard is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard and the availability of tools supporting it are among the most significant recent global software technology trends.
Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets.
Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.
Unicode has been hailed by many in the computing community as an ideal solution to the problems of multiplatform internationalization. A majority of software developers the world over have declared conformance to Unicode. They include IBM, Microsoft, Oracle, Sybase, Unisys, Apple, Bell Labs, Compaq, GNU/Linux, Sun, SCO, Hewlett Packard, Netscape, Ericsson and Novell. More and more applications are becoming Unicode compliant. It is expected that Unicode will become the de facto standard in the multilingual world, especially with the spread of the Internet.
The design of Unicode is based on the simplicity and consistency of ASCII, but goes far beyond ASCII's limited ability to encode only the Latin alphabet. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world. Its basic encoding uses 16 bits, which provides code points for more than 65,000 characters. To keep character coding simple and efficient, the Unicode Standard assigns each character a unique number (its code point), with the most commonly used characters fitting in a single 16-bit value, and does not use complex modes or escape codes.
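The idea of one number per character can be inspected directly. This is a small illustrative sketch using Python's built-in `ord`, `chr`, and the standard `unicodedata` module; the helper name `describe` and the sample characters are assumptions made for the example.

```python
import unicodedata

def describe(ch):
    """Return (code point in U+XXXX notation, official character name)."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

# A Latin letter, a Devanagari letter, and a Malayalam letter.
samples = ["A", "\u0905", "\u0d05"]
for ch in samples:
    print(describe(ch))

# ASCII characters keep their ASCII numbers in Unicode.
assert ord("A") == 0x41
# chr() is the inverse of ord(): a number maps straight back to its
# character, with no mode switching or escape sequence involved.
assert chr(ord("\u0d05")) == "\u0d05"
```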
While 65,000 characters are sufficient for encoding most of the many thousands of characters used in major languages of the world, the Unicode standard and ISO 10646 provide an extension mechanism called UTF-16 that allows for encoding as many as a million more characters, without use of escape codes. This is sufficient for all known character encoding requirements, including full coverage of all historic scripts of the world.
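The extension mechanism represents each character beyond the first 65,536 code points as a pair of 16-bit values, called surrogates. The sketch below shows the arithmetic under that assumption; the function name `to_surrogate_pair` is hypothetical, and the constants are those of the surrogate ranges (U+D800 to U+DBFF and U+DC00 to U+DFFF).

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

# 1,024 high surrogates x 1,024 low surrogates: about a million extra characters.
SUPPLEMENTARY_COUNT = 0x10FFFF - 0x10000 + 1

high, low = to_surrogate_pair(0x1D11E)   # U+1D11E MUSICAL SYMBOL G CLEF
print(f"U+1D11E -> {high:04X} {low:04X}")
print(f"supplementary code points: {SUPPLEMENTARY_COUNT}")
```

Because surrogate values are reserved and never stand for characters on their own, a decoder can recognize a pair without any escape code or mode switch.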
Unicode also reserves some code values for private use, which software and hardware developers can assign internally for their own characters and symbols.
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
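The ASCII compatibility and variable length of UTF-8 can be checked with a few lines of Python. This is an illustrative sketch only; the helper name `utf8_hex` and the sample characters are assumptions.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

# An ASCII letter keeps its single ASCII byte value.
assert "A".encode("utf-8") == "A".encode("ascii")

# Characters outside ASCII take two or more bytes.
for label, ch in [("LATIN CAPITAL LETTER A", "A"),
                  ("LATIN SMALL LETTER E WITH ACUTE", "\u00e9"),
                  ("MALAYALAM LETTER A", "\u0d05")]:
    print(f"{label}: {utf8_hex(ch)}")
```

Software that only ever handled ASCII text sees the same bytes it always did, which is why existing byte-oriented tools often continue to work on UTF-8 data.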
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
All three encoding forms need at most 4 bytes (32 bits) of data for each character.
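The trade-off between the three forms shows up when the same characters are encoded in each. The sketch below is a rough illustration; the function name `encoded_sizes` is hypothetical, and the little-endian codec variants are used only so that Python does not prepend a byte order mark to the output.

```python
FORMS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(ch):
    """Return {encoding form: number of bytes needed for one character}."""
    return {form: len(ch.encode(form)) for form in FORMS}

samples = {
    "LATIN CAPITAL LETTER A (U+0041)": "A",
    "MALAYALAM LETTER A (U+0D05)": "\u0d05",
    "MUSICAL SYMBOL G CLEF (U+1D11E)": "\U0001D11E",
}
for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

In this sample the Latin letter is smallest in UTF-8, the Malayalam letter is smallest in UTF-16, and no character exceeds 4 bytes in any form.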
The code space from U+0900 to U+0D7F is allotted to Indian scripts in Unicode. The allotment of code space for Indian scripts in Unicode is as follows:
| Script | Code space |
|---|---|
| Devanagari | U+0900 to U+097F |
| Bengali | U+0980 to U+09FF |
| Gurmukhi | U+0A00 to U+0A7F |
| Gujarati | U+0A80 to U+0AFF |
| Oriya | U+0B00 to U+0B7F |
| Tamil | U+0B80 to U+0BFF |
| Telugu | U+0C00 to U+0C7F |
| Kannada | U+0C80 to U+0CFF |
| Malayalam | U+0D00 to U+0D7F |
Malayalam is allotted 128 character positions, code space from U+0D00 to U+0D7F.
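Because each script occupies one contiguous range, finding the script of a character is a simple range check. The following sketch uses the ranges listed above; the names `INDIC_BLOCKS` and `indic_block` are hypothetical and chosen only for this example.

```python
# (script name, first code point, last code point), from the allotment above.
INDIC_BLOCKS = [
    ("Devanagari", 0x0900, 0x097F),
    ("Bengali",    0x0980, 0x09FF),
    ("Gurmukhi",   0x0A00, 0x0A7F),
    ("Gujarati",   0x0A80, 0x0AFF),
    ("Oriya",      0x0B00, 0x0B7F),
    ("Tamil",      0x0B80, 0x0BFF),
    ("Telugu",     0x0C00, 0x0C7F),
    ("Kannada",    0x0C80, 0x0CFF),
    ("Malayalam",  0x0D00, 0x0D7F),
]

def indic_block(ch):
    """Return the script whose code space contains ch, or None if there is none."""
    code_point = ord(ch)
    for name, first, last in INDIC_BLOCKS:
        if first <= code_point <= last:
            return name
    return None

# Every block in the list spans 128 code positions.
assert all(last - first + 1 == 128 for _, first, last in INDIC_BLOCKS)

print(indic_block("\u0d05"))   # a Malayalam letter
print(indic_block("\u0905"))   # a Devanagari letter
print(indic_block("A"))        # outside the Indic ranges
```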
Read more on: "Tamil Unicode issues"
Unicode and its importance to worldwide products
Unicode has facilitated the development process of Windows in a number of ways:
- Core international functionality for all user locales ships with every version of a release;
- Due to less time spent on development fixes, the international team has more time to devote to localization and localization processes (resulting in quicker releases of translated versions);
- Interoperability is guaranteed, following a recognized international standard.
While there certainly are growing pains when developing standards and software for an emerging market like India, the advantage of using Unicode—an international and interoperable standard—far outweighs those issues.
Indic support in the single worldwide binary of Windows is possible only because of Unicode. Unicode makes development for the worldwide market not more complicated, but considerably easier; it is the future of worldwide software, including software for the Indic language market.