Unicode Standard
By Raveesh Gupta - Localization, Microsoft Corporation (India) Pvt. Ltd.
The Unicode Standard is the universal character-encoding standard used to represent text for computer processing. It is fully compatible with the International Standard ISO/IEC 10646-1:1993, and contains all the same characters and code points as ISO/IEC 10646. Unicode provides a consistent way of encoding multilingual plain text and brings order to a chaotic state of affairs that has made it difficult to exchange text files internationally.
The Unicode Standard defines codes for characters used in the major languages written today.
Scripts include Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan, Japanese Kana, the complete set of modern Korean Hangul, and a unified set of Chinese/Japanese/Korean (CJK) ideographs. Many more scripts and characters are to be added shortly, including Ethiopic, Canadian Syllabics, Cherokee, additional rare ideographs, Sinhala, Syriac, Burmese, Khmer, and Braille.
The Unicode Standard directly addresses only the encoding and decoding of text elements; that is, it defines how characters are interpreted. The character identified by a Unicode code value is an abstract entity, such as "MALAYALAM CONSONANT PA". The mark made on the screen or paper, called a glyph, is the visual representation of the character. The Unicode Standard does not define glyph images, that is, how the character should appear on the screen or paper; that is the responsibility of the hardware or software rendering engine.
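The distinction between an abstract character and its glyph can be seen with a minimal Python sketch. The helper name `describe` is hypothetical, and the name reported by Python's bundled character database may be worded differently from the example above, depending on the Unicode version it ships with.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and character name for a single character.

    Nothing here says how the character is drawn; the glyph is chosen
    by whatever font and rendering engine eventually displays it.
    """
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# U+0D2A is the Malayalam letter "pa" (assumed here to be the
# character the text refers to).
print(describe("\u0D2A"))
print(describe("A"))
```

The output identifies the character only by number and name; two systems with different fonts would still agree on both.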
The Unicode Standard defines three encoding forms, known as UTF-8, UTF-16 and UTF-32, that allow the same data to be transmitted in a byte-, word- or double-word-oriented format (i.e. 8, 16 or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data.
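As a rough illustration, the following Python sketch encodes one string in each of the three forms, counts the code units, and converts between forms. The sample string and the function names `code_unit_counts` and `transcode` are assumptions made for this example only.

```python
# Sample text: Latin "A", Malayalam "pa", and a character outside the
# Basic Multilingual Plane (U+10348), chosen only for illustration.
SAMPLE = "A\u0D2A\U00010348"

def code_unit_counts(s: str) -> dict:
    """Count the code units needed for s in each encoding form."""
    return {
        "utf-8": len(s.encode("utf-8")),            # 8-bit units
        "utf-16": len(s.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(s.encode("utf-32-le")) // 4,  # 32-bit units
    }

def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert encoded bytes from one encoding form to another."""
    return data.decode(source).encode(target)

print(code_unit_counts(SAMPLE))

# Round trip through all three forms; the text comes back unchanged.
utf8 = SAMPLE.encode("utf-8")
utf16 = transcode(utf8, "utf-8", "utf-16-le")
utf32 = transcode(utf16, "utf-16-le", "utf-32-le")
assert utf32.decode("utf-32-le") == SAMPLE
```

The number of code units differs from form to form, but the sequence of characters recovered is identical in each case.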
- The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystem, Sybase, Unisys and many others, as well as Microsoft. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0 and WML, and is the official way to implement ISO/IEC 10646.
- It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.
- Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets.
- Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering.
- It allows data to be transported through many different systems without corruption.
- Once text is normalized, each string has a single representation in Unicode. This makes Unicode data easier to sort and search.
Implementing or developing a website with the Unicode Standard involves tasks that include, but are not limited to:
- normalizing Unicode text for comparison and storage
- compressing Unicode text, for storage comparable to that of legacy encodings
- collating (sorting) strings
- linebreaking text
- performing uppercase, lowercase, titlecase, and case folding operations
- handling CRLF
- designing regular expressions
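A few of the tasks listed above can be sketched with Python's standard library. This is a simplified illustration, not a complete treatment: the helper names `comparison_key` and `split_lines` are hypothetical, and production-grade collation would normally use a locale-aware collation library rather than a plain sort on the key shown here.

```python
import unicodedata

def comparison_key(s: str) -> str:
    """Normalize to NFC, then case-fold, so that equivalent spellings
    of the same text compare equal."""
    return unicodedata.normalize("NFC", s).casefold()

def split_lines(text: str) -> list:
    """Split on CRLF, LF, CR and the Unicode line separators alike."""
    return text.splitlines()

# "é" as one precomposed code point versus "E" plus a combining accent.
precomposed = "Caf\u00E9"
decomposed = "CAFE\u0301"
print(precomposed == decomposed)                                   # False
print(comparison_key(precomposed) == comparison_key(decomposed))   # True

# Case folding goes further than lowercasing: "ß" folds to "ss".
print(comparison_key("Stra\u00DFe") == comparison_key("STRASSE"))  # True

# Lines separated by CRLF and by U+2028 LINE SEPARATOR.
print(split_lines("one\r\ntwo\u2028three"))  # ['one', 'two', 'three']
```

Storing or indexing text under such a key is one simple way to make searches insensitive to case and to composed versus decomposed forms.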
Hence, using the Unicode Standard for your web-targeted information is a sound decision.