Glyphs, Graphemes and Phonemes

Graphemes are the basic building blocks of a written script. A font represents the graphical form of that script. Grapheme and glyph are two related but different concepts: grapheme is a synonym for character, and a font is a collection of glyphs.

As an example, µ and × are basic characters; that is, they are graphemes. However, f is not a grapheme, because it is the combination of µ and ×. Therefore, f is just a glyph. We still need to represent f as a separate symbol that is graphically different from µ and ×. This symbol belongs in a font, which is a collection of glyphs, and not in the character set. In this example, {µ, ×} constitutes the character set and {µ, ×, f} constitutes the font.

The real hindrance in understanding this concept is that we compare Asian scripts with Latin-script languages. In English, there is for the most part a one-to-one correspondence between a character and its glyph. In Asian scripts, however, a character can exist in different graphical forms. For example, the Malayalam character ù has different graphical forms in the following words: Éù, Õdµ¢, ÕV×¢. The grapheme ù produces three different glyphs in three different contexts.

Lastly, about phonemes. They are the basic building blocks of the phonetics of a language. The set of phonemes is not the character set. For example, Malayalam has two È's in pronunciation but only one grapheme to represent them, whereas Tamil has two different graphemes (characters) for them. But phonemes are not of much interest to us, since we are concerned with writing Malayalam, not speaking Malayalam :)

Now you can see that graphemes form an abstract conceptual layer in between the physically perceivable glyphs and phonemes.

Unicode and Graphemes

The Unicode Consortium is standardizing the character sets of the world's languages. Including Malayalam, the character sets of 30+ languages are currently standardized under Unicode. The basic characters of a language are identified and each of them is given a unique number. One important thing to remember is that Unicode standardizes graphemes, not glyphs.

Moreover, encoding the 900+ glyphs of Malayalam would be a mammoth effort. If glyphs were encoded, Mandarin/Cantonese would need nearly 10,000 positions. Since the set of graphemes in Malayalam is around 50-60 in size, the 128 slots allotted are more than enough. The chart of the Malayalam Unicode encoding is published by the Unicode Consortium.
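To make "a unique number for each character" concrete, here is a small Python sketch using the standard unicodedata module. The 128 slots correspond to the Malayalam block, U+0D00 to U+0D7F. The two sample letters (KA and SSA) are my own choice for illustration and are not tied to the examples above.

```python
import unicodedata

# The Malayalam block in Unicode spans U+0D00..U+0D7F.
BLOCK_START, BLOCK_END = 0x0D00, 0x0D7F

def describe(ch):
    """Return (code point written as 'U+XXXX', official character name)."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

# Number of positions reserved for the script, used or not.
slots = BLOCK_END - BLOCK_START + 1
print("slots in block:", slots)

# Sample letters chosen for illustration only.
for ch in ("\u0D15", "\u0D37"):
    print(describe(ch))
```

Note that the number says nothing about how the letter is drawn; that is left entirely to the font.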

Representing 'chill~' letters

I will show the rendering rules through examples:

È + virama + Ï => Èc
È + virama + ZWJ + Ï =>
È + virama + ZWNJ + Ï => ÈíÏ

This agrees with the current interpretation in the FAQ on Indic scripts on the Unicode site.
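The three sequences above can also be written out as code points. This is only a sketch: the glyphs in the examples come from a legacy font encoding, so the two consonants used here (NA, U+0D28, and YA, U+0D2F) are assumptions made for illustration. The virama (U+0D4D), ZWJ (U+200D) and ZWNJ (U+200C) code points are the standard ones.

```python
VIRAMA = "\u0D4D"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

# Assumed consonants, for illustration only (not confirmed by the examples).
C1 = "\u0D28"  # MALAYALAM LETTER NA
C2 = "\u0D2F"  # MALAYALAM LETTER YA

# The three stored forms discussed above.
sequences = {
    "plain": C1 + VIRAMA + C2,
    "zwj": C1 + VIRAMA + ZWJ + C2,
    "zwnj": C1 + VIRAMA + ZWNJ + C2,
}

def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

for label, seq in sequences.items():
    print(f"{label:5} {code_points(seq)}")
```

The stored text differs only in the joiner character; how each sequence is actually drawn is decided by the rendering software and the font.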

Representing repeated vowels (my suggestion - not a standard)

Consider somebody singing ³ÞÞ.... Think about how to represent this in Unicode. Here is my suggestion: if more than one vowel symbol or virama follows a consonant, then use only the right part of the symbol from the second symbol onwards. (Remember, the symbol for ³ has 2 parts: ç and Þ.)

Standards are for information exchange

A character set is used for the internal representation of the data that glyphs display. This internal representation should facilitate the operations available on the text, such as searching for a word or viewing the document in different fonts. Most importantly, it should allow the document to be readable and editable across different software applications.

Now let us look at how this happens in an electronic document. Assume we have the basic character set {µ, ×} and the glyph set {µ, ×, f}. A sample document contains the word f. Internally, the file will hold only µ and ×, the basic characters, stored adjacent to each other. When we open it in an editor like MS Word, the editor sees the basic characters µ and × and asks the font to render them. The font in turn looks into its rule table and finds that when the characters µ and × are adjacent, the glyph f should be rendered. The font then gives that glyph to MS Word to display.

Moreover, any font can decide to display characters in its own way. For example, let us say there is a Thanima font that does not like the conjunct f. It looks into its table and sees that there is no rule for joining the basic characters µ and ×, but that when × comes after a consonant, the symbol glyph of × (á) should be used. Therefore, Thanima outputs the glyphs µ and á. Thus two fonts display the same document and convey the same information in two different ways.
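The lookup described here can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how a real font engine works: the two rule tables are hypothetical, the names RACHANA_LIKE and THANIMA_LIKE are invented, and the placeholder symbols µ, ×, f and á from the example stand in for real characters and glyphs. The Thanima rule ("× after a consonant") is simplified to a rule for this one pair.

```python
# Stored document: only basic characters, no font information.
stored_text = "µ×"

# Hypothetical rule tables: a pair of adjacent characters -> glyphs to draw.
RACHANA_LIKE = {("µ", "×"): ["f"]}        # one conjunct glyph for the pair
THANIMA_LIKE = {("µ", "×"): ["µ", "á"]}   # base glyph plus symbol form of ×

def render(text, rules):
    """Turn a character string into a glyph list using a font's rule table."""
    glyphs = []
    i = 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if len(pair) == 2 and pair in rules:
            glyphs.extend(rules[pair])   # the font has a rule for this pair
            i += 2
        else:
            glyphs.append(text[i])       # no rule: draw the character as-is
            i += 1
    return glyphs

print(render(stored_text, RACHANA_LIKE))
print(render(stored_text, THANIMA_LIKE))

# Searching works on the stored characters, whatever font is used to view them.
print("×" in stored_text)
```

The point of the design is that the file never changes; only the table consulted at display time does.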

We can see that standardization at this minimum level is enough for information exchange. Therefore, a Malayalam text will look like any other plain ASCII English text that we open in Notepad: font information will not be included in it. Just as whether to view a plain English text in Helvetica or Times Roman is the user's choice, viewing a Malayalam text in the Rachana font or the Thanima font will again be at the user's discretion.

What should not be standardized?

Standardization is required for information exchange. Standardization for the heck of it is a waste of effort. As with any other idea, the need for standardization should come from necessity, and standardizing glyphs is not a necessity. Rather, it would make the system too rigid. For instance, if in the future somebody like Rachana comes up with a more comprehensive glyph set, it cannot be incorporated. Conversely, if somebody does not want to make a font with so many glyphs, then he will not be following the standard. That is why people go for standardizing deeper, underlying language structures like graphemes. This offers enough flexibility and at the same time serves the basic purpose of information exchange.

Enforcing standards

It is not the government that pushes the standards, but the companies in the related business. The publishing industry in Kerala has a big role to play here. Ultimately, they are going to decide how the common man will read and write, and that depends on the available publishing software. Giants like Microsoft, Apple and Adobe are backing Unicode, so ultimately that is how future software is going to work. That is why we should participate in Unicode discussions, point out the errors, and suggest improvements. Again, that is why we should write software conforming to standards.

Fonts to supplement Unicode

The TTF/Type1 font formats may not be flexible or powerful enough to support Unicode. The new OpenType standard, introduced jointly by Adobe and Microsoft, provides enough power to render Asian-language text from its Unicode character set. Microsoft has come up with a Unicode-compliant Indian font set called Arial Unicode in Windows XP. It includes Malayalam too. All the above-mentioned companies have identified the big market opportunities available in India for publishing software. Around the world, Unicode-compliant publishing software for Indic scripts is in the implementation stages.

Searchable Malayalam websites or documents

This involves two huge steps:

  1. Create a Unicode-compatible font (OpenType) for Malayalam. Allen Wood has compiled a list of Malayalam Unicode-compatible fonts. Swathanthra Malayalam Computing and Pango are involved in this ongoing research.
  2. Create an Input Method Editor for the Malayalam language. This is essentially transliteration software. It is in its early stage of development. Linux and Solaris follow the IIIMF standard. Microsoft deals with this in their own way (as usual ;-)

I don't know of anybody seriously working on a Malayalam Input Method Editor.

Outstanding issues on Malayalam Unicode

The issues pertaining to the current Malayalam Unicode encoding are listed by Jeroen Hellingman, who made the first freely available Malayalam font.

Unicode and transliteration

When you compare the efforts in Unicode with those in transliteration, many striking similarities can be found. Both are about representing a text using graphemes. In transliteration, the graphemes are represented using English characters; in Unicode, they are represented using numbers. So finding the set of graphemes is a common effort.
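The parallel can be shown with a tiny Python sketch: the same grapheme sequence is mapped once to Latin letters and once to numbers. The transliteration table is hypothetical; its spellings are invented for illustration and do not follow any standardized scheme. The three sample letters are my own choice.

```python
# Hypothetical transliteration table (spellings invented for illustration).
TRANSLIT = {
    "\u0D15": "ka",  # MALAYALAM LETTER KA
    "\u0D2E": "ma",  # MALAYALAM LETTER MA
    "\u0D32": "la",  # MALAYALAM LETTER LA
}

def to_latin(text):
    """Represent each grapheme with English characters."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text)

def to_numbers(text):
    """Represent each grapheme with its Unicode number."""
    return [ord(ch) for ch in text]

word = "\u0D15\u0D2E\u0D32"  # three consonant letters: KA, MA, LA
print(to_latin(word))
print([f"U+{n:04X}" for n in to_numbers(word)])
```

Both functions walk over the same list of graphemes; only the target alphabet differs, which is why agreeing on the grapheme set serves both efforts.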

Unicode and transliteration both have far-reaching impacts. Unicode will decide how Malayalam text is internally represented in an electronic document, and transliteration will be the common man's way of typing Malayalam. So transliteration demands immediate standardization efforts.

Glossary

Graphemes, characters -- basic building blocks of WRITTEN script

Phonemes -- basic building blocks of PHONETICS

Glyphs -- basic graphical units of a script

Font -- collection of glyphs

Character set -- complete collection of graphemes

Unicode Consortium -- international committee for standardization of character sets

Transliteration -- representing graphemes of a script in graphemes of another script