Understanding Encoding
Encoding is the process of converting text written in a natural language into
another form. To a computer, all input takes the form of numbers, which at a
still lower level are nothing but electrical pulses. For computers to work with
text, it is only natural that text also be converted to numbers. But how does a
computer differentiate between a number and text? When text is encoded, digits
are treated as characters too and receive their own codes, so digits and
letters can be handled uniformly.
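As a small illustration of digits being handled as text, the Python sketch
below compares the character '7' with the number 7. Python is used here only
for illustration (the text itself names no language), and the variable names
are hypothetical.

```python
# The digit character '7' is text: it has its own character code,
# which is distinct from the numeric value 7.
digit_char = "7"
digit_code = ord(digit_char)   # code assigned to the character '7'
number = int(digit_char)       # numeric value after an explicit conversion

print(digit_code)  # 55
print(number)      # 7
```

The character code (55) and the numeric value (7) differ, which is why a
program must convert explicitly between text and numbers.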
During the early stages of computer and software development, every vendor used
its own proprietary encoding mechanism. As the market became increasingly
competitive, however, standards became imperative.
Data can be encoded using various schemes. The most popular is the American
Standard Code for Information Interchange (ASCII). ASCII is a repertoire, a
code, and an encoding scheme. Other popular schemes include Extended Binary
Coded Decimal Interchange Code (EBCDIC) and the encodings standardized by the
International Organization for Standardization (ISO). ASCII is a 7-bit code;
8-bit extensions of it were defined later. The terms 7-bit and 8-bit refer to
the number of binary digits used to encode a character in that scheme. A 7-bit
encoding scheme can therefore encode 128 characters, numbered 0 to 127, and an
8-bit encoding scheme can encode 256 characters, numbered 0 to 255.
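The arithmetic behind these ranges can be sketched in a few lines of Python.
The helper name code_range is hypothetical and exists only for this example.

```python
def code_range(bits):
    """Return (count, lowest, highest) for a fixed-width code of `bits` bits."""
    count = 2 ** bits          # each extra bit doubles the number of codes
    return count, 0, count - 1

print(code_range(7))  # (128, 0, 127)
print(code_range(8))  # (256, 0, 255)
```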
A vendor or system supports a collection of abstract characters called a
character repertoire. A character set is a subset of this collection that is
used to represent text, together with the number assigned to each character.
This means that two different encoding schemes may use two different numbers to
represent the same character. For example, the code for the character 'A' is 65
in ASCII, while in EBCDIC 'A' is represented by 193.
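The two codes for 'A' can be checked with Python's built-in codecs. This is a
minimal sketch that assumes cp037, one of several EBCDIC code pages shipped
with Python, as a stand-in for EBCDIC in general.

```python
# Encode the same character under two different schemes.
ascii_code = "A".encode("ascii")[0]
ebcdic_code = "A".encode("cp037")[0]   # assumption: cp037 as the EBCDIC variant

print(ascii_code)   # 65
print(ebcdic_code)  # 193

# Decoding a byte with the wrong scheme yields a different character,
# which is why data must be converted between encodings.
as_ebcdic = bytes([193]).decode("cp037")
as_latin1 = bytes([193]).decode("latin-1")
print(as_ebcdic == "A", as_latin1 == "A")  # True False
```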
Developing applications that support multiple encoding schemes and converting
data between them is a laborious task. Most of the encoding schemes devised by
vendors are 7-bit or 8-bit schemes that have been adapted for use with
different languages worldwide. As the number of languages that need to be
represented in computers has increased, so has the need for a common, standard
code for interchanging data in all of them. Unicode fulfills this need.
Understanding Unicode
The ISO/IEC 10646 standard defines the Universal Character Set (UCS). Many
thousands of characters have already been included in the UCS repertoire and
coded, and the list keeps growing to accommodate more languages.
Unicode is a character repertoire and character code that maintains
compatibility with ISO/IEC 10646. It is a standard developed by the Unicode
Consortium and used by many computer manufacturers. Unicode places some
additional constraints on top of the ISO definitions to ensure compatibility
and portability of characters across platforms.
Unicode was originally a 16-bit code but was later expanded to the range 0 to
10FFFF (hexadecimal). This range is divided into 17 planes, each holding 65,536
(2^16) code points. The first plane, with the range 0 to FFFF, is the Basic
Multilingual Plane (BMP), which holds the characters of most scripts in common
use. At the time of writing, Unicode version 4.0 has been defined, with 96,248
characters.
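The division into planes can be illustrated with a short Python sketch. It
relies on sys.maxunicode, which reports the highest code point Python supports;
the helper plane_of and the sample characters are chosen only for this example.

```python
import sys

PLANE_SIZE = 0x10000  # 65,536 code points per plane

def plane_of(ch):
    """Return the plane number that the character's code point falls in."""
    return ord(ch) // PLANE_SIZE

print(hex(sys.maxunicode))                 # 0x10ffff
print((sys.maxunicode + 1) // PLANE_SIZE)  # 17 planes in total
print(plane_of("A"))                       # 0 (BMP)
print(plane_of("\u091c"))                  # 0 (Devanagari ja, also in the BMP)
print(plane_of("\U0001F600"))              # 1 (outside the BMP)
```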
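Unicode characters are predominantly encoded using the UTF-32, UTF-16 or UTF-8
encoding formats. The digits in each name denote the size in bits of one code
unit, not necessarily of a whole character: UTF-32 always uses one 32-bit unit
per character, UTF-16 uses one or two 16-bit units, and UTF-8 uses one to four
8-bit units.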
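To see how many bytes each format actually needs, the following Python sketch
encodes three sample characters. The helper encoded_lengths is hypothetical,
and the little-endian codec variants are used so that no byte order mark is
counted.

```python
def encoded_lengths(ch):
    """Return the number of bytes `ch` occupies in each encoding format."""
    formats = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(ch.encode(name)) for name in formats}

print(encoded_lengths("A"))           # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_lengths("\u091c"))      # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_lengths("\U0001F600"))  # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

Only UTF-32 uses the same number of bytes for every character; UTF-8 and
UTF-16 vary with the code point.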
Unicode uses the terms abstract character and character. An abstract character
is a character as a member of the repertoire, whereas a character is a coded
value for an abstract character. The visual representation of a character is
known as a glyph: the way a character is displayed or rendered on screen. Fonts
are collections of glyphs, and a font maps character codes to the glyphs that
represent them.
Understanding Uniscribe
Uniscribe is a mechanism for rendering glyphs on screen. Initially the OpenType
Layout Services (OTLS) API was used to render glyphs on screen. But complex
scripts such as Arabic, Indic and Far Eastern scripts require special
processing because their glyphs are not laid out in a simple way.
Rendering the scripts mentioned above involves several requirements. Text may
have to be rendered bidirectionally, as in Arabic. Glyphs have to be shaped
based on context. For example, the consonant ऱ् changes its shape depending on
the character it combines with. That character may be a vowel or a consonant,
and it may come before or after ऱ्. Many characters in different Indic scripts
require contextual shaping. Combining characters also have to be supported. For
example:
ज + ् + ञ = ज्ञ
Here, combining ja and nya has given rise to a new letter, jnya. In turn, jnya
may combine with other consonants and vowels.
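The character properties that drive this kind of processing can be inspected
with Python's unicodedata module. The sketch below only looks up properties of
the stored characters. It does not perform shaping or rendering, and it is not
Uniscribe code.

```python
import unicodedata

# The conjunct is stored as three separate characters: ja + virama + nya.
conjunct = "\u091c\u094d\u091e"
names = [unicodedata.name(ch) for ch in conjunct]

print(len(conjunct))  # 3
for name in names:
    print(name)
# DEVANAGARI LETTER JA
# DEVANAGARI SIGN VIRAMA
# DEVANAGARI LETTER NYA

# The virama is a nonspacing mark (category Mn) that joins the consonants.
virama_category = unicodedata.category("\u094d")
print(virama_category)  # Mn

# Bidirectional class: Arabic letters (AL) run right to left, Latin (L) left to right.
arabic_bidi = unicodedata.bidirectional("\u0627")  # ARABIC LETTER ALEF
latin_bidi = unicodedata.bidirectional("A")
print(arabic_bidi, latin_bidi)  # AL L
```

The text holds three characters, yet a reader sees a single letter on screen.
Producing that single shape is the job of the rendering engine and the font.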
When you want to render complex scripts, you can use the controls provided with
the Windows API, such as Edit controls and Rich Edit controls. These controls
use Uniscribe internally to render content from Windows 2000 onwards. The text
processing functions of the Windows API, such as TextOut, ExtTextOut,
TabbedTextOut, DrawText, and GetTextExtentExPoint, also support complex scripts
from Windows 2000 onwards.
For simple applications, these mechanisms are sufficient. But when you require
fine control over how complex scripts are processed, Uniscribe is the way to
go.
In the context of internationalization (I18N), Uniscribe is a common API that
gives font creators and application developers a shared approach to rendering
complex scripts. Uniscribe implements the Unicode standard's glyph processing
for complex scripts. Font developers can be assured that the script rules
defined by Unicode will be followed by application developers, irrespective of
whether text is rendered directly or through Uniscribe. Application developers,
in turn, can rely on fonts whose glyphs and layouts conform to Unicode rules.