
The character-glyph model and its application to Indic

By Cathy Wissink - Windows Globalization, Microsoft Corporation

The character-glyph model (as defined by Unicode and the W3C), when applied to Indic scripts, appears to be one of the most difficult concepts for implementers to grasp; it is the source of much confusion concerning what is needed in a character encoding for Indic. However, understanding the basic idea behind the character-glyph model is crucial in order to understand the relationship between the different globalization technologies used to enable Indic on Windows.

Depending on the person's perspective, 'character' could mean a phoneme, a grapheme, a collation unit, a code point, a code unit, or a number of other things. From the perspective of multilingual processing, a code point has semantic content, while a glyph deals exclusively with visual representation. Code points are, as a rule, semantically distinct, as the example of the Cyrillic, Latin, and Greek A below demonstrates.

The most important point to understand is that the relationship between code points and glyphs is not one-to-one. In other words, it cannot be assumed that one code point (in Unicode or any other encoding) corresponds to one displayed (or output) glyph. Frequently, there is a one-to-one correspondence, but not always. The type of writing system used determines the relationship between code points and glyphs.

Alphabets are perhaps the least complicated of the writing systems in their relationship to code points; there is generally a one-to-one relationship between code points and glyphs. In each of the following examples, there is one code point, one "character" and one representative glyph.

U+03BB = λ (Greek Small Letter Lambda)
U+0041 = A (Latin Capital Letter A)
U+042E = Ю (Cyrillic Capital Letter Yu)

However, not even alphabets always have the one-to-one correspondence between code points and glyphs, as the following examples demonstrate:

  • Many Latin-script languages require diacritics, encoded separately as distinct code points, to complete their character repertoire. (Some of these base letter + diacritic combinations have been encoded in Unicode as one code point, but these were only added for interoperability with legacy encodings.)
  • Some alphabetic languages require multiple letters to be treated as a single unit, often in collation or in typography. There are a number of recent initiatives in Indian governmental and educational sectors that require local language support.
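
The first point above can be illustrated with a minimal Python sketch using the standard `unicodedata` module. The choice of "é" as the sample letter and the variable names are assumptions made for illustration; they do not come from the article.

```python
import unicodedata

# One precomposed code point: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
precomposed = "\u00e9"

# Canonical decomposition (NFD) splits it into base letter + combining diacritic.
decomposed = unicodedata.normalize("NFD", precomposed)

# List the code points that make up the decomposed form.
code_points = [f"U+{ord(c):04X}" for c in decomposed]

print(len(precomposed), len(decomposed), code_points)
```

Both strings would normally be displayed with the same single glyph, although one holds one code point and the other holds two.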

One will often see the same glyph used for multiple alphabets, e.g., the Latin A, the Cyrillic A, and the Greek Alpha all use "A".
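
As a small sketch of this point, the following Python snippet (variable names are illustrative assumptions) shows that the three look-alike letters are distinct code points with distinct names, and that they behave differently under a text operation such as lowercasing.

```python
import unicodedata

# Three letters that typically share the glyph shape "A".
look_alikes = ["\u0041", "\u0410", "\u0391"]

# Each one has its own Unicode character name ...
names = [unicodedata.name(c) for c in look_alikes]

# ... and its own lowercase counterpart, which shows the semantic difference.
lowered = [f"U+{ord(c.lower()):04X}" for c in look_alikes]

for name, low in zip(names, lowered):
    print(name, "->", low)
```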

Indic scripts, which are alphasyllabaries, are somewhat more complicated than alphabets with regard to the character-glyph model in a number of ways. One source of this complexity is the inherent vowel found in consonant characters (a hallmark of alphasyllabaries).

In Unicode, the consonant characters are encoded as single code points and include the inherent vowel: Devanagari KA (क, U+0915) includes not only the [k], but also the inherent vowel [ə] and is pronounced [kə].

Since the consonant character has an inherent vowel by default, overriding this vowel requires an additional code point representing the vowel mark. As a result, a single glyph corresponds to multiple code points when a non-inherent vowel is used, as seen with Devanagari KA and various vowel marks:

क (U+0915) [kə] (inherent vowel used, no vowel mark: one code point)
के (U+0915 + U+0947) [ke] (Vowel Sign E added: two code points)
की (U+0915 + U+0940) [ki:] (Vowel Sign II added: two code points)
कू (U+0915 + U+0942) [ku:] (Vowel Sign UU added: two code points)
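
The code point counts in these examples can be checked with a short Python sketch. The dictionary keys and the `describe` helper are hypothetical names introduced here for illustration only.

```python
import unicodedata

KA = "\u0915"  # DEVANAGARI LETTER KA, carries the inherent vowel

# The syllables from the examples above, built as code point sequences.
syllables = {
    "ka": KA,
    "ke": KA + "\u0947",   # KA + VOWEL SIGN E
    "kii": KA + "\u0940",  # KA + VOWEL SIGN II
    "kuu": KA + "\u0942",  # KA + VOWEL SIGN UU
}

def describe(text):
    """Return (code point, character name) pairs for every code point in text."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in text]

for label, text in syllables.items():
    print(label, len(text), describe(text))
```

Each syllable is normally drawn as one visual unit, while `len()` reports one or two code points.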

An analogous phenomenon occurs when applying various diacritics to change some quality in the syllable. This happens for vowel nasalization (e.g., anusvara, Gurmukhi tippi), for creating fricatives (aytham in Tamil), and for using the nukta to extend the script. In each case, a second code point is used to show the change in the syllable:

क़ (U+0915 + U+093C) [qə] (Devanagari KA + nukta = Devanagari QA: two code points)
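
A brief Python sketch of the nukta case follows. It relies on a detail the article does not mention: Unicode also encodes a single code point U+0958 DEVANAGARI LETTER QA, which canonically decomposes to KA + nukta and is excluded from recomposition. The variable names are assumptions.

```python
import unicodedata

qa_sequence = "\u0915\u093c"  # KA + NUKTA, the two code point form shown above
qa_precomposed = "\u0958"     # DEVANAGARI LETTER QA as a single code point

# The single code point decomposes canonically to KA + NUKTA.
mapping = unicodedata.decomposition(qa_precomposed)

# Normalization to NFC yields the two code point sequence, not U+0958.
normalized = unicodedata.normalize("NFC", qa_precomposed)

print(mapping, [f"U+{ord(c):04X}" for c in normalized])
```

Under these rules, normalized text carries QA as two code points even though it is displayed as one glyph.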

Viramas also create a non-one-to-one correspondence between code points and glyphs in Unicode. Since the inherent vowel is included by default in a consonant code point, 'killing' the vowel requires an additional code point. In other words, a half form will require two code points.

క (U+0C15) [kə] (Telugu KA; inherent vowel used: one code point)

క్ (U+0C15 + U+0C4D) [k] (Telugu KA + virama; no inherent vowel: two code points)

There are syllables in Indic which have a many-to-one relationship between code points and glyphs. Since the syllable has multiple consonants in the onset, a virama is needed to ensure there are no intervening vowels between the consonants. This results in multiple code points for a single glyph, as seen in Devanagari Ksha and Tamil Shri:

क्ष (U+0915 + U+094D + U+0937) [kʃə] (Devanagari KA + virama + SSA: three code points)

ஷ்ர (U+0BB7 + U+0BCD + U+0BB0) [ʃri:] (Tamil SSA + virama + RA: three code points)
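
The grouping of several code points into one displayed unit can be sketched in Python as follows. This is a deliberately simplified heuristic, not the segmentation algorithm of any standard or shaping engine: it assumes well-formed input, treats any character with canonical combining class 9 as a virama, and uses the hypothetical helper names `is_virama` and `clusters`.

```python
import unicodedata

def is_virama(ch):
    # Viramas have canonical combining class 9 in the Unicode Character Database.
    return unicodedata.combining(ch) == 9

def clusters(text):
    """Group code points into rough display clusters (simplified heuristic)."""
    groups = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category.startswith("M")
        # A letter that directly follows a virama continues the same cluster.
        follows_virama = (
            bool(groups) and is_virama(groups[-1][-1]) and category.startswith("L")
        )
        if groups and (is_mark or follows_virama):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

for sample in ["\u0915\u094d\u0937", "\u0c15\u0c4d", "\u0915\u0947\u0915"]:
    print([[f"U+{ord(c):04X}" for c in g] for g in clusters(sample)])
```

With this sketch, the three code points of Devanagari Ksha fall into one cluster, while KA + Vowel Sign E followed by a second KA yields two clusters. Whether a cluster is actually drawn as a single ligature still depends on the script, the font, and the shaping engine.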

Finally, there are situations where vowel marks will 'wrap' around a consonant, resulting in a need for multiple glyphs for a single code point.

For example, Tamil Vowel Sign O (ொ, U+0BCA) has two glyphs: one for the left of the consonant, the other for the right (the consonant would go where the dotted circle is represented, as is seen in மொ).
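
A code point level analogue of this two-part shape can be seen in Python. The sketch below relies on the canonical decomposition of U+0BCA in the Unicode Character Database; the actual splitting into left and right glyphs is done by the font and shaping engine, which this snippet does not model. Variable names are assumptions.

```python
import unicodedata

VOWEL_SIGN_O = "\u0bca"  # TAMIL VOWEL SIGN O
MA = "\u0bae"            # TAMIL LETTER MA

# The single code point decomposes canonically into two vowel signs.
parts = unicodedata.normalize("NFD", VOWEL_SIGN_O)
part_names = [unicodedata.name(c) for c in parts]

# The syllable from the example above, in composed and decomposed form.
composed = MA + VOWEL_SIGN_O
decomposed = unicodedata.normalize("NFD", composed)

print(part_names, len(composed), len(decomposed))
```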

As the above examples have shown, the relationship between code points and glyphs is not always simple, even in alphabets, and it cannot be assumed that there is a one-to-one relationship between the two. Unfortunately, the legacy characters from old encoding standards that were encoded as one precomposed code point rather than two (e.g., Latin letters with diacritics) have not helped to clarify the need to follow the character-glyph model.

The lack of a one-to-one correlation between code points and glyphs poses an interesting implementation challenge. It also creates an additional political challenge when the reasons for making a particular distinction between code points and glyphs are not obvious to the stakeholders of the language and culture. It appears that misunderstandings about the character-glyph model (as well as other issues) have led some implementers of Indic to suggest the use of encodings other than Unicode.

The Indic repertoires in the Unicode standard are still maturing, and will need more work. The Editorial Committee of the Unicode Standard version 4.0 has been working closely with many specialists in India to ensure the correctness of the text concerning Indic. In addition, the Unicode Technical Committee has worked with a number of Indic experts over the last year to improve the repertoire of the various scripts.

However, it is apparent that implementers who already work with the Indic repertoire as outlined successfully provide culturally appropriate text processing for Indic languages on a worldwide platform.

©2003-2007 Microsoft Corporation. All rights reserved.