
Life of an Indic character: Input, Rendering, Display

By Cathy Wissink - Windows Globalization, Microsoft Corporation

 Process by which Indic data is input, rendered and displayed

Getting Indic text from the keyboard to the screen requires a considerable amount of processing from the system, but only minimal effort from the user.

Text processing begins when a user inputs the data. On Windows, Indic data is input with keyboards. Some implementers believe that an Input Method Editor (IME), with thousands of potential code point combinations, is needed to input Indic text; however, this is not necessary on Windows. A user simply needs to pick the appropriate keyboard (via Regional and Language Options in the Control Panel), and type.

The output from the keyboard consists of Unicode code points. For example, assume a user wanted to type the Hindi word for "student", which is विद्यार्थी (vidyārthī). First, he'd choose the Hindi keyboard from Regional and Language Options. Using this keyboard, he'd hit the unshifted "B" key (US keyboard; VK_B), and the resulting output would consist of U+0935: व (Devanagari Letter VA).

The user would follow that with Devanagari Vowel Sign I (U+093F; ि ), at the unshifted "F" key, or VK_F. So at this point, the user has a string of U+0935 + U+093F, corresponding to the first syllable of the above word. What happens now?
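Before following the string any further, the keyboard step so far can be sketched in a few lines of Python. This is only an illustration of the idea that a keyboard maps key presses to code points: the `HINDI_LAYOUT` table and the `type_keys` helper are hypothetical names, and the table holds just the two keys mentioned above, not a real Windows keyboard layout.

```python
import unicodedata

# Hypothetical two-key excerpt of a Hindi keyboard layout (assumption:
# only the unshifted "B" and "F" keys described in the text are modeled).
HINDI_LAYOUT = {
    "B": "\u0935",  # DEVANAGARI LETTER VA
    "F": "\u093F",  # DEVANAGARI VOWEL SIGN I
}

def type_keys(keys, layout=HINDI_LAYOUT):
    """Return the string of code points produced by a sequence of key presses."""
    return "".join(layout[key] for key in keys)

first_syllable = type_keys(["B", "F"])

# The keyboard's output is code points only; no glyph choice is made here.
for ch in first_syllable:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Note that the table contains one entry per key, not one entry per visual form; that is the whole of the keyboard's job.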

The work of the keyboard for the first syllable of the word is now done. It passes the code points U+0935 + U+093F to the shaping engine, called Uniscribe (usp10.dll) on Windows. Uniscribe is responsible for properly displaying complex script text in Windows and its respective applications. (A complex script is any writing system that needs additional processing in order to properly display. For example, Arabic needs contextual shaping as well as bidirectional behavior. Vietnamese needs diacritic positioning, and Indic scripts sometimes need rearrangement of vowel marks.)

Uniscribe renders text based on the writing system of the appropriate languages. In the case of Indic, Uniscribe examines the runs of text to determine syllable clusters and set syllable boundaries.
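To make the idea of syllable clusters concrete, here is a deliberately simplified Python sketch that splits a run of Devanagari code points into clusters. It is not Uniscribe's algorithm: the character classes below are rough assumptions that ignore the nukta, joiners, and many other cases, and the names `SYLLABLE` and `clusters` are invented for this example.

```python
import re

# Simplified character classes (assumptions for illustration only).
CONSONANT = "[\u0915-\u0939]"    # KA .. HA
VIRAMA = "\u094D"                # suppresses the inherent vowel
VOWEL_SIGN = "[\u093E-\u094C]"   # dependent vowel signs
MODIFIER = "[\u0901-\u0903]"     # candrabindu, anusvara, visarga
INDEPENDENT_VOWEL = "[\u0904-\u0914]"

# A cluster is: zero or more (consonant + virama) pairs, a base consonant,
# an optional vowel sign and an optional modifier; or an independent vowel;
# or, as a fallback, any single character.
SYLLABLE = re.compile(
    f"(?:{CONSONANT}{VIRAMA})*{CONSONANT}{VOWEL_SIGN}?{MODIFIER}?"
    f"|{INDEPENDENT_VOWEL}{MODIFIER}?"
    "|.",
    re.DOTALL,
)

def clusters(text):
    """Split text into approximate syllable clusters."""
    return SYLLABLE.findall(text)

# The Hindi word for "student", written as explicit code points.
word = "\u0935\u093F\u0926\u094D\u092F\u093E\u0930\u094D\u0925\u0940"
for cluster in clusters(word):
    print(" ".join(f"U+{ord(c):04X}" for c in cluster))
```

Under these assumptions the ten code points of the word fall into three clusters, the first of which is the U+0935 + U+093F pair typed earlier.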

The shaping engine then analyzes the syllable to determine if any glyph swapping is needed, in cases like Tamil Vowel Sign E, EE, or AI. In these cases, the visual presentation of the vowel (to the left of the consonant) differs from the pronunciation of the vowel (after the consonant) and the logical ordering of the code points within the text stream. For example, Tamil Letter Ka (க) and Tamil Vowel Sign EE (ே) display as கே, even though the vowel is pronounced after the consonant.
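The difference between logical and visual order can be sketched as follows. This is a toy model under stated assumptions: a real shaping engine reorders glyphs rather than code points, and the set `PRE_BASE_VOWEL_SIGNS` and the function `visual_order` are hypothetical names covering only the three Tamil signs named above plus Devanagari Vowel Sign I.

```python
# Vowel signs that are drawn to the left of the consonant they follow.
# Illustrative subset only, not a complete list for any script.
PRE_BASE_VOWEL_SIGNS = {
    "\u0BC6",  # TAMIL VOWEL SIGN E
    "\u0BC7",  # TAMIL VOWEL SIGN EE
    "\u0BC8",  # TAMIL VOWEL SIGN AI
    "\u093F",  # DEVANAGARI VOWEL SIGN I
}

def visual_order(cluster):
    """Return the cluster's code points in left-to-right display order.

    Pre-base vowel signs move to the front; everything else keeps its
    logical order. The stored text itself is never modified.
    """
    pre = [c for c in cluster if c in PRE_BASE_VOWEL_SIGNS]
    rest = [c for c in cluster if c not in PRE_BASE_VOWEL_SIGNS]
    return pre + rest

logical = "\u0B95\u0BC7"  # TAMIL LETTER KA + TAMIL VOWEL SIGN EE
print("logical:", [f"U+{ord(c):04X}" for c in logical])
print("visual: ", [f"U+{ord(c):04X}" for c in visual_order(logical)])
```

The same rule applies to the first syllable of the Hindi example: U+093F is stored after U+0935 but drawn to its left.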

Finally, OpenType Layout Services is called to provide glyph positioning and glyph substitution. These formatted glyphs, based on the underlying code points, are sent on for display in whatever application is being used at this time.

This process continues for the entire string. The following figure is a simple example of this process, following the whole string for विद्यार्थी from keyboard to output for display.

NOTE: For a far more detailed and technical explanation of this process using a Sanskrit example, please see John Hudson's article on the Microsoft Typography website.

Figure: The process from keyboard input to display, using the Hindi word विद्यार्थी. Because Indic languages need additional processing, Uniscribe and OpenType Layout Services must be called prior to text display.

Understanding the character-glyph model is crucial to understanding this process. Note the boundary between code points and glyphs as shown in the figure: text is stored as code points until it needs to be displayed. When it is displayed, the shaping engine is called, and the corresponding glyphs are assigned.
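The storage side of that boundary is easy to inspect. The short Python sketch below lists what is actually kept for the example word: a flat sequence of code points in logical order, with no record of conjuncts, reordering, or any other visual form. The UTF-16 encoding step is included on the assumption that the text is held in the encoding Windows uses internally.

```python
# The Hindi word for "student", written as explicit code points.
word = "\u0935\u093F\u0926\u094D\u092F\u093E\u0930\u094D\u0925\u0940"

# What is stored: one entry per code point, in typing (logical) order.
code_points = [f"U+{ord(c):04X}" for c in word]
print(len(code_points), "code points:", " ".join(code_points))

# Encoded as UTF-16 (little-endian), each of these code points
# occupies a single 16-bit unit.
stored = word.encode("utf-16-le")
print(len(stored), "bytes")
```

Nothing in `stored` says how the word should look; that decision is deferred until display, when the shaping engine and font are consulted.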

At the point of display, technologies such as fonts and rendering engines map between code points and glyphs. While code points are passed from the keyboard onto the shaping engine (Uniscribe), and code points are the basis of linguistic analysis to determine syllable structure, glyphs are ultimately used for display. This goes back to the definition of code points vs. glyphs: while code points deal with semantic information, glyphs deal with visual shape.

The burden to determine the shape is not on the user (such that she would have to pick the correct contextual shape relative to the rest of the text), or the encoding, but rather, the shaping engine in conjunction with the font technology. There is an important technical boundary between code points and glyphs, and this exists in order to maintain at least a modicum of simplicity within the system. (Imagine if every single visual variant of a code point had to be maintained for text processing!) For this reason, keyboards focus exclusively on code points, and leave the work of linking code points to the appropriate visual display to fonts and shaping engines.

Looking at the above technology, it may now be clear to the reader why an IME is not necessary for Indic input. It has been suggested that an IME could contain all the necessary code point combinations (and their respective glyphs), thus circumventing the shaping engine altogether. This is generally heard from customers working with complex script languages who feel that they need to have all visual variants of a code point on an input method. Input Method Editors really make sense with ideographic languages such as Chinese or Korean, where there are literally thousands of characters needed for the language. Each of these ideographic characters is semantically distinct.

Compare this with complex scripts, where the number of semantically distinct characters is generally fewer than 100, but the number of visually distinct forms is considerable (into the hundreds). Again, keyboards work with code points, not with glyphs. Since code points are semantically distinct and not visually distinct, a complex script language can easily be handled via a keyboard; as noted earlier, the code points are linked to the appropriate visual display by other non-keyboard technologies.

©2003-2007 Microsoft Corporation. All rights reserved.