Understanding Encoding

Encoding is the process by which text written in a natural language is converted to another form. To a computer, all input takes the form of numbers, which at a still lower level are nothing but electrical pulses. For computers to work with text, the text must therefore also be converted to numbers. But how does a computer differentiate between a number and text? When text is encoded, digits are treated as characters too, each with its own code, which makes it possible to handle all text uniformly.
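As a small illustration of digits being handled as text, the following Python sketch compares the digit character '7' with the number 7. The variable names are hypothetical, and Python's built-in ord is used only as a convenient way to look at a character code.

```python
# The digit character '7' is stored as a character code, not as the number 7.
digit_as_text = "7"
digit_as_number = 7

code_of_digit = ord(digit_as_text)        # character code of the character '7'
print(code_of_digit)                      # 55
print(code_of_digit == digit_as_number)   # False

# Turning the text into a number is a separate, explicit step.
print(int(digit_as_text) == digit_as_number)  # True
```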

During the early stages of computer and software development, every vendor used its own proprietary encoding mechanism. As the market became increasingly competitive, however, standards became imperative.

Data can be encoded using various schemes, the most popular being the American Standard Code for Information Interchange (ASCII). ASCII is at once a character repertoire, a character code, and an encoding scheme. Other popular schemes include Extended Binary Coded Decimal Interchange Code (EBCDIC) and the encodings defined by the International Organization for Standardization (ISO). ASCII was initially a 7-bit code; 8-bit extensions of it were introduced later. The terms 7-bit and 8-bit refer to the number of binary digits used to encode a character in that scheme. A 7-bit encoding scheme can therefore encode 128 characters, numbered 0 to 127, and an 8-bit encoding scheme can encode 256 characters, numbered 0 to 255.
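The arithmetic behind these figures is that n bits can represent 2 to the power n distinct values. A minimal Python sketch follows; the helper name code_range is hypothetical.

```python
def code_range(bits: int) -> range:
    """Return the range of code values representable with the given number of bits."""
    return range(0, 2 ** bits)

seven_bit = code_range(7)
eight_bit = code_range(8)

# Number of codes, first code, last code
print(len(seven_bit), seven_bit[0], seven_bit[-1])  # 128 0 127
print(len(eight_bit), eight_bit[0], eight_bit[-1])  # 256 0 255
```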

A vendor or system provides support for a collection of abstract characters called a character repertoire. A character set is a subset of this collection that is used to represent text, and it defines the number assigned to each character. This means that two different encoding schemes may use two different numbers to represent the same character in the character repertoire. For example, the code for the character 'A' is 65 in ASCII, while in EBCDIC 'A' is represented by 193.
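The 'A' example can be checked with Python's built-in codecs. This sketch assumes the codec named cp037, one of several EBCDIC code pages shipped with Python, as a stand-in for EBCDIC in general; other EBCDIC code pages may assign some characters differently.

```python
# Encode the same character under two schemes and compare the resulting codes.
ascii_code = "A".encode("ascii")[0]    # ASCII
ebcdic_code = "A".encode("cp037")[0]   # cp037: one EBCDIC code page (an assumption here)

print(ascii_code)   # 65
print(ebcdic_code)  # 193
```

The same abstract character thus ends up as two different numbers, which is why data must be converted when it moves between systems that use different schemes.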

Developing applications that support multiple encoding schemes, and converting data between those schemes, is laborious. Most of the encoding schemes devised by vendors are 7-bit or 8-bit schemes that have been adopted for use with different languages worldwide. As the number of languages that need to be represented on computers has increased, so has the need for a common, standard code for interchanging data in the world's languages. Unicode fulfills this need.

Understanding Unicode

The ISO 10646 standard (ISO/IEC 10646) defines the Universal Character Set (UCS). Presently many thousands of characters have been included in the UCS repertoire and coded. This list is growing to accommodate many more languages.

Unicode is a character repertoire and character code that maintains compatibility with the ISO 10646 standard. It is a standard proposed by the Unicode Consortium and used by many computer manufacturers. Unicode places some additional constraints on the ISO definitions to ensure compatibility and portability of characters across platforms.

Unicode was originally a 16-bit code, but it has since been expanded to cover the range 0 – 10FFFF (hexadecimal). This range is divided into units called planes, each containing 65,536 code points, the number addressable with 16 bits. Most characters in common use are in the first plane, the Basic Multilingual Plane (BMP), whose range is 0 – FFFF. Unicode version 4.0, the latest version defined at present, can represent 96,248 characters.
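To make the division into planes concrete, here is a minimal Python sketch. It assumes a code space that ends at 10FFFF (hexadecimal) and planes of 65,536 code points each, so the plane number is simply the code point with its low 16 bits shifted away. The function name plane_of is hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF  # last code point of the Unicode code space

def plane_of(code_point: int) -> int:
    """Return the plane number (0 is the BMP) that contains the code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point lies outside the Unicode code space")
    return code_point >> 16  # each plane spans 0x10000 code points

print(plane_of(0x0041))    # 0  (LATIN CAPITAL LETTER A, in the BMP)
print(plane_of(0x091C))    # 0  (DEVANAGARI LETTER JA, in the BMP)
print(plane_of(0x1F600))   # 1  (a code point beyond the BMP)
print(plane_of(MAX_CODE_POINT) + 1)  # 17 planes in total
```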

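Unicode characters are predominantly encoded using the UTF-32, UTF-16, or UTF-8 encoding formats. The number in each name denotes the size in bits of one code unit, not necessarily of a whole character: UTF-32 always uses one 32-bit unit per character, UTF-16 uses one or two 16-bit units, and UTF-8 uses one to four 8-bit units.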
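The difference between the three formats shows up in how many bytes one character occupies. The following Python sketch uses the standard utf-8, utf-16-le, and utf-32-le codecs (the little-endian variants are chosen only so that no byte order mark is added); the helper name encoded_sizes and the sample characters are illustrative choices.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return the size in bytes of one character in UTF-8, UTF-16, and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

print(encoded_sizes("\u0041"))      # (1, 2, 4)  LATIN CAPITAL LETTER A
print(encoded_sizes("\u091C"))      # (3, 2, 4)  DEVANAGARI LETTER JA
print(encoded_sizes("\U0001F600"))  # (4, 4, 4)  a character outside the BMP
```

Only UTF-32 uses the same number of bytes for every character; UTF-8 and UTF-16 use more code units for characters with higher code points.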

Unicode distinguishes between an abstract character and a character. An abstract character is a member of the repertoire, whereas a character is an abstract character together with the code value assigned to it. The visual representation of a character is known as a glyph: the shape in which a character is displayed or rendered on screen. Fonts are collections of glyphs, and each font maps character codes to the glyphs that represent them.

Understanding Uniscribe

Uniscribe is a mechanism by which glyphs are rendered on screen. Initially the OpenType Layout Services (OTLS) API was used to render glyphs on screen. But complex scripts, such as Arabic, the Indic scripts, and Far Eastern languages, require special processing because their glyphs are not laid out in a simple way.

Rendering the scripts mentioned above involves several requirements. Text must be rendered bidirectionally, as in Arabic. Glyphs must be shaped according to context. For example, the consonant ऱ् changes its shape depending on the character it combines with. The combining character may be a vowel or a consonant, and it may come before or after ऱ्. Many characters in the various Indic scripts require contextual shaping. Characters must also be able to combine. For example:

ज + ् + ञ = ज्ञ

Here, combining ja and nya has given rise to a new letter, jnya. In turn, jnya may combine with other consonants and vowels.
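At the level of character codes, the conjunct is still three separate characters; merging them into a single shape is the job of the rendering engine and the font. The Python sketch below uses the standard unicodedata module to list the code points, assuming the sequence is JA (U+091C), VIRAMA (U+094D), and NYA (U+091E). The variable names are hypothetical.

```python
import unicodedata

# The three coded characters that make up the conjunct.
ja, virama, nya = "\u091C", "\u094D", "\u091E"
conjunct = ja + virama + nya

# List each code point with its Unicode character name.
names = [unicodedata.name(ch) for ch in conjunct]
for ch, name in zip(conjunct, names):
    print(f"U+{ord(ch):04X} {name}")

# The string holds three characters even though one shape is displayed.
print(len(conjunct))  # 3
```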

When you want to render complex scripts, you can use the controls provided with the Windows API, such as Edit controls and Rich Edit controls. From Windows 2000 onwards, these controls use Uniscribe internally to render content.

The text processing functions of the Windows API, such as TextOut, ExtTextOut, TabbedTextOut, DrawText, and GetTextExtentExPoint, also support complex scripts from Windows 2000 onwards.

For developing simple applications, these mechanisms are sufficient. But when you require fine control over how complex scripts are processed, Uniscribe is the way to go.

In the context of internationalization (I18N), Uniscribe is a common API that gives font creators and application developers a shared approach to rendering complex scripts. Uniscribe implements the Unicode standard's glyph processing for complex scripts. Font developers can be assured that the script rules defined by Unicode will be followed by application developers, irrespective of whether text is rendered directly or using Uniscribe. And application developers can rely on fonts whose glyphs and layouts conform to Unicode rules.