
The Indic script development community and Unicode

As developers of Indic-language software begin to consider Unicode as an encoding option, they see a need to refine the repertoire so that it best represents the scripts. Over the past year, it has become increasingly obvious that some Indic-script software developers are not fully satisfied with the encoding solution that Unicode provides for Indic scripts. One concern in this developer community is that the Unicode script repertoires for Indic languages are too Devanagari-based, having initially been defined from ISCII 1988; developers for non-Devanagari languages have felt that ISCII (and, by extension, Unicode) does not satisfactorily support their languages.
Accordingly, changes to the Indic character repertoire to better represent non-Devanagari languages have been proposed to the Unicode Technical Committee (UTC), including changes to the Tamil block description (which is now being updated for a future version of the standard). As with other script repertoires in Unicode, it has taken some time to refine the set of characters, character properties and block descriptions to the full satisfaction of the linguistic community, and this refinement could continue for some time.
However, beyond the updates in character semantics and repertoires, a stumbling block to Unicode acceptance in the Indian development community remains: the perception that character encoding order should be equivalent to linguistic collation, and that incorrect ordering of code points in Unicode will result in incorrect collation. (This belief is reflected in changes implemented in ISCII 1991; some of the changes from ISCII 1988 involved rearranging code points within the encoding, resulting in a more linguistically correct order.) In the last year, several proposals brought before the UTC have involved rearranging code points in the Indic repertoires.
Rearranging characters, however, runs counter to Unicode Character Encoding Stability Policy #1: once a character is encoded, it will not be moved or removed. Any proposal to rearrange characters is therefore rejected by the UTC. As a result, the perception persists in some of the Indic language communities that, since code points for a particular language are out of order within the Unicode repertoire and will not be rearranged, correct linguistic collation (and by extension, properly globalized software) is not possible using Unicode.
Encoding order is not satisfactory collation for almost any language, and the Indic languages are no exception to this rule. The two major reasons for this are:
  1. Character encodings are generally script- (or subscript-) based, whereas collation must be applied at the language (or language variant) level. This means that any chosen encoding order could be incorrect for a number of languages supported by the script, despite being correct for other languages;
  2. Encodings do not always take language-specific sorting elements into account (again, in part because encodings are script based), and language-based sorting elements (which often consist of multiple code points) are needed for correct collation.
(A third and less important reason is that there is a long-standing precedent for 'disorder' in encodings, if extant code pages or character sets are any indication of implementer expectation concerning order of code points. It is common knowledge in most user communities that some function outside of the character encoding will be needed to perform linguistic collation; there is no expectation that code point order within an encoding will be sufficient. Many scripts in Unicode already fall into this category.)
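The reasons above can be illustrated with a minimal Python sketch of a collation function that lives outside the encoding. It is not the Unicode Collation Algorithm and not any product's implementation; the helper names (tokenize, make_sort_key) and both order tables are hypothetical and heavily abridged. The Tamil table assumes the traditional alphabet, which places the letter NNNA (U+0BA9) after PA and MA although its code point precedes theirs. The Devanagari table assumes a tailoring in which the conjunct KA + VIRAMA + SSA is a single sorting element placed after HA, which is only one possible language-level choice.

```python
# Illustrative sketch only: the order tables are abridged assumptions,
# not the collation rules of any real locale or library.

def tokenize(text, elements):
    """Split text into sorting elements, preferring the longest known match."""
    longest = max((len(e) for e in elements), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            # A single code point is always accepted as a fallback element.
            if size == 1 or chunk in elements:
                tokens.append(chunk)
                i += size
                break
    return tokens

def make_sort_key(order):
    """Build a sort key function from a language-specific order table."""
    rank = {element: position for position, element in enumerate(order)}

    def sort_key(text):
        key = []
        for element in tokenize(text, rank):
            if element in rank:
                key.append((0, rank[element]))   # tailored position
            else:
                key.append((1, ord(element)))    # fall back to code point
        return key

    return sort_key

# Tamil letters NA, PA, MA, NNNA (assumed traditional order, abridged).
NA, PA, MA, NNNA = "\u0BA8", "\u0BAA", "\u0BAE", "\u0BA9"
TAMIL_ORDER = [NA, PA, MA, NNNA]

# Devanagari KA, SSA, HA and the three-code-point conjunct KSSA.
KA, SSA, HA, VIRAMA = "\u0915", "\u0937", "\u0939", "\u094D"
KSSA = KA + VIRAMA + SSA
DEVANAGARI_ORDER = [KA, SSA, HA, KSSA]  # hypothetical: conjunct sorts last

tamil_by_code_point = sorted([PA, NNNA, MA])
tamil_tailored = sorted([PA, NNNA, MA], key=make_sort_key(TAMIL_ORDER))

deva_by_code_point = sorted([HA, KSSA, KA])
deva_tailored = sorted([HA, KSSA, KA], key=make_sort_key(DEVANAGARI_ORDER))
```

In this sketch the code points never move; only the order table changes. A different language written in the same script would supply a different table to the same function, which is why collation can be correct per language regardless of how the script is laid out in the encoding.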
©2003-2007 Microsoft Corporation. All rights reserved.