6. Formalisation

In the model the author uses, a sequence of one or more characters represents each Malayalam letter. In the following sections, with the help of a few definitions, we present algorithms for generating Malayalam script from the character stream representing Malayalam text. The implementation is described in the last section.

In this section, Malayalam text appears within single quotes, written using the Mozhi scheme described in the chapter "Mozhi 7 bit transliteration scheme". Plus (+) is used as the string concatenation operator, and a forward slash (/) indicates that the string before it, the string after it, or both are under consideration.

6.1 Deviations from Malayalam Scripting Conventions

We assume the following deviations from the traditional Malayalam scripting conventions, for ease of generalising the transliteration properties.

6.2 Definitions

Letter:

Letters are the basic phonetic building blocks of a language. Along with its sound, each letter has one or more graphical forms.

Symbol:

In contrast to a letter, a symbol does not have its own phonetic identity (eg: comma, colon or visarga).

Literal:

A literal is a sequence of letters that has a single fused graphical form when written either independently or combined with other letters (eg: `n', `nn' and `yu'). Usually the set of literals includes all letters and all conjunct forms consisting of two or more letters. Literals representing letters are called base literals, and those representing conjunct forms are called derived literals. The size of a literal is the number of characters used to represent it in the given transliteration scheme; the |x| operator denotes the size of the literal x with respect to that scheme. A literal can also be classified as vocal or consonantal, depending on whether or not it can be pronounced independently.

The concept of a literal is closely linked to fonts or, more generally, to some code for information interchange. Whether a character sequence has a single fused graphical form can be decided only by analysing that code.

Glyph:

The graphical form of a literal or symbol.

Text:

It refers to the sequence of characters representing transliterated Malayalam. It can also be interpreted as a sequence of literals and symbols represented by that character stream.

Script:

A sequence of glyphs denoting a text.

Transliteration:

The representation of the glyphs of a source script by the glyphs of a target script. In our description, the source script is Malayalam and the target script is English.

Reverse transliteration:

The process whereby the glyphs of a target script are transliterated into those of the source script.

Conjunct:

A sequence of literals such that all the literals other than the last are consonantal literals. The last literal can be either vocal or consonantal. (eg: In the conjunct `sva', `s' is the first literal, `v' is the second and `a' is the third. Moreover, `svar' is not a conjunct, because the vocal literal `a' is not the last literal.)
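
The conjunct definition can be checked mechanically once a text has been split into literals. The following is a minimal sketch, not part of the original implementation; the set VOCAL is a small hypothetical subset of vocal literals chosen only for illustration.

```python
# Hypothetical subset of vocal literals, for illustration only.
VOCAL = {"a", "aa", "i", "u", "e", "o"}


def is_conjunct(literals):
    """Return True if every literal except the last is consonantal.

    An empty sequence is accepted as the null conjunct.
    """
    return all(lit not in VOCAL for lit in literals[:-1])


print(is_conjunct(["s", "v", "a"]))       # `sva'
print(is_conjunct(["s", "v", "a", "r"]))  # `svar'
```

With this table, `sva' is accepted and `svar' is rejected, because `a' appears before the last position.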

Context:

The context of a literal refers to its neighbourhood in a text. A literal has two different contexts: the lexical context and the phonetic context. By default, context means the phonetic context. A literal can also have a null neighbourhood, that is, no other sound is pronounced along with it.

A detailed list of related definitions can be found on Antony P. Stone's transliteration page.

Different Glyphs of a Literal

A literal assumes different glyphs depending on its context. They are classified as follows:

Independent glyph

A vocal literal assumes its independent glyph in the null context (eg: the graphical form representing a single `o' sound). Any consonantal literal X assumes its independent glyph in the conjunct X + `a' (eg: the graphical form of `ka').

Sign glyph or Partial glyph

The graphical form of a literal when it appears last in a two-literal conjunct (eg: the graphical form of `o' in `ko', the form of `ya' in `kya' and the form of `na' in `sna'). It can have two parts, which come on the left and right of the independent glyph of the first literal. They are called the "left sign glyph" and the "right sign glyph" respectively (eg: in `ko', `o' has both sign glyphs; in `kya', `y' has only a right sign glyph; and in `kra', `r' has only a left sign glyph). Moreover, all the vocal literals except `a' have sign glyphs.

The literal `v' has two right sign glyphs. The one that appears when the first literal is not `y' or `zh' is called the primary right sign glyph of `v'. The one that appears with `y' and `zh' is called the secondary right sign glyph of `v'. For example, see the words `svapnam', `vaazhv~' and `meyvazhakkam'.
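
The choice between the two right sign glyphs of `v' depends only on the preceding literal, so it reduces to a lookup. The function name and the returned labels below are hypothetical and stand in for actual glyph codes.

```python
def right_sign_glyph_of_v(first_literal):
    """Pick which right sign glyph of `v' follows the given first literal."""
    # The secondary form is used only after `y' and `zh'.
    if first_literal in {"y", "zh"}:
        return "secondary"
    return "primary"


print(right_sign_glyph_of_v("s"))   # as in `svapnam'
print(right_sign_glyph_of_v("zh"))  # as in `vaazhv~'
print(right_sign_glyph_of_v("y"))   # as in `meyvazhakkam'
```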

Chillu glyph

Only `N', `n', `m', `r', `l', `L' and `rr' have a single fused glyph in the null context. These glyphs are called chillu glyphs. (In the old orthography, `k' and `y' also had chillu glyphs, as described in Gundert's dictionary.)

The chillu glyphs of `r' and `rr' are the same, and there are two different chillu glyphs for `r/rr'. The graphical form that `r' assumes when it is the last literal of a word is called the primary chillu glyph of `r/rr' (eg: the graphical form of `r' in `avar'). The chillu glyph of `r' written as a dot over the next consonant in the word is called the secondary chillu glyph of `r/rr' (eg: the graphical form of `r' in the word `charkka' written in the old orthography). In the new orthography, the primary chillu glyph is used in place of the secondary.

6.3 Rules for Reverse Transliteration

Reverse transliteration has two steps. The first parses the character stream of transliterated Malayalam text into a sequence of literals. The second generates the glyph sequence corresponding to that sequence of literals by looking at the context of each literal.

Parsing Stream of Characters into Literals

In the model we use, a sequence of one or more characters represents each Malayalam letter. Hence the problem arises of splitting the character stream representing Malayalam text into the corresponding literals. For example, if the stream is `thn', we can view it as `t' + `hn', `th' + `n', `t' + `h' + `n' or `thn'. We have to choose the correct one from these options. The generalised rule is as follows:

Let the character stream S be a1 + a2 + ... + an, where each ai (1 <= i <= n) is a character. Consider the non-trivial case where S can be split in two ways, as S1 = x1 + x2 + ... + xp and S2 = y1 + y2 + ... + yq, such that x1 != y1 and xp != yq, where each xi (1 <= i <= p) and each yi (1 <= i <= q) is a character sequence representing a single literal. Without loss of generality, S1 is chosen if either of the following conditions is true:

  1. p = 1 and q > 1
  2. p > 1 and q > 1 and xp and yq are base literals and | xp | > | yq |
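
The rule above compares two candidate splits of the same stream. The sketch below enumerates candidate splits against a literal table and applies the two conditions exactly as stated. It is an illustration under assumptions: the table LITERALS and the set BASE are hypothetical toy data, not the Mozhi scheme, and the helper names are invented here. When the precondition (x1 != y1 and xp != yq) does not hold, or neither condition applies, the function returns None rather than guessing.

```python
def segmentations(stream, literals):
    """Return every way to split `stream` into pieces found in `literals`."""
    if not stream:
        return [[]]
    results = []
    for end in range(1, len(stream) + 1):
        head = stream[:end]
        if head in literals:
            for tail in segmentations(stream[end:], literals):
                results.append([head] + tail)
    return results


def preferred(s1, s2, base_literals):
    """Apply the two stated conditions to a pair of non-empty splits.

    Returns the chosen split, or None when the rule as stated does not decide.
    """
    # The rule only covers splits whose first and last literals both differ.
    if s1[0] == s2[0] or s1[-1] == s2[-1]:
        return None
    # "Without loss of generality": try each split in the role of S1.
    for a, b in ((s1, s2), (s2, s1)):
        p, q = len(a), len(b)
        if p == 1 and q > 1:  # condition 1
            return a
        if (p > 1 and q > 1
                and a[-1] in base_literals and b[-1] in base_literals
                and len(a[-1]) > len(b[-1])):  # condition 2
            return a
    return None


# Hypothetical toy table; every entry is treated as a base literal.
LITERALS = {"t", "h", "n", "th", "hn"}
BASE = set(LITERALS)

for split in segmentations("thn", LITERALS):
    print(split)
print(preferred(["th"], ["t", "h"], BASE))
```

A complete parser would also need to handle the trivial cases in which two splits share their first or last literal, which the rule as stated leaves out.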

Reverse Transliteration Function

The next step is to generate the glyph sequence from the sequence of literals obtained. A rule-based algorithm is described below.

Let X, X1 and X2 denote conjuncts, and let z and z1 denote literals. Let G(X) be a function mapping the given conjunct X to its script. G(X) is defined below by a number of clauses, each of which has a gate condition and an expression returning a value. In a call to G(X), only the clause whose gate condition is satisfied is selected, and it returns the value of the corresponding expression. If more than one clause satisfies its condition, the first one in the order listed is selected.

  1. X is null. G(X) = null
  2. X = X1 + X2 where X1 and X2 are non-null conjuncts and first literal of X2 has no sign glyph. G(X) = G(X1) + G(X2)
  3. X = X1 + z + X2 where X2 is not a single vocal literal (X1 and X2 can be null) and z has chillu glyph. G(X) = G(X1) + <(primary)chillu glyph of z> + G(X2)
  4. Last literal of X is not a vocal literal. G(X) = G(X + `~')
  5. X = z + `a' where z is a literal or X is a single vocal literal. G(X) = <independent glyph of z>
  6. X = X1 + z where z is a vocal literal, or X = X1 + z + `a'. In either case, X1 is non-null. G(X) = <left sign glyph of z> + G(X1 + `a') + <(primary)right sign glyph of z>
  7. Special cases (deviations from above clauses)
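
Clauses 1 to 6 can be sketched as a recursive function over a list of literals. This is an illustrative reading of the clauses, not the author's implementation, and it rests on several assumptions that are not stated in the text: the tables VOCAL, SIGN and CHILLU are small hypothetical subsets; `~' is treated as a vocal literal; `a' is excluded from clause 2 and from the first form of clause 6 (otherwise the recursion for `ka' would not terminate); the leftmost match is taken in clauses 2 and 3; and clause 7 is omitted. Glyphs are represented by placeholder strings.

```python
# Hypothetical toy tables; placeholders only, not the Mozhi scheme.
VOCAL = {"a", "aa", "i", "o", "~"}
SIGN = {  # literal -> (has left sign glyph, has right sign glyph)
    "aa": (False, True), "i": (False, True), "o": (True, True),
    "~": (False, True), "y": (False, True), "r": (True, False),
    "v": (False, True),
}
CHILLU = {"N", "n", "m", "r", "l", "L", "rr"}


def G(X):
    """Map a list of literals to a list of placeholder glyph names."""
    X = list(X)
    # Clause 1: the null conjunct has no glyphs.
    if not X:
        return []
    # Clause 2: split before a literal that has no sign glyph
    # (assumption: `a' is excluded so that `ka' reaches clause 5).
    for i in range(1, len(X)):
        if X[i] not in SIGN and X[i] != "a":
            return G(X[:i]) + G(X[i:])
    # Clause 3: a chillu literal not followed by exactly one vocal literal.
    for i, z in enumerate(X):
        rest = X[i + 1:]
        single_vocal = len(rest) == 1 and rest[0] in VOCAL
        if z in CHILLU and not single_vocal:
            return G(X[:i]) + ["chillu(" + z + ")"] + G(rest)
    # Clause 4: a conjunct ending in a consonantal literal gets `~' appended.
    if X[-1] not in VOCAL:
        return G(X + ["~"])
    # Clause 5: a single vocal literal, or z + `a'.
    if len(X) == 1 or (len(X) == 2 and X[1] == "a"):
        return ["ind(" + X[0] + ")"]
    # Clause 6: peel off the last sign-bearing literal z.
    if X[-1] == "a":
        z, X1 = X[-2], X[:-2]
    else:
        z, X1 = X[-1], X[:-1]
    has_left, has_right = SIGN.get(z, (False, False))
    left = ["left(" + z + ")"] if has_left else []
    right = ["right(" + z + ")"] if has_right else []
    return left + G(X1 + ["a"]) + right


print(G(["k", "o"]))       # `ko'
print(G(["k", "y", "a"]))  # `kya'
print(G(["k", "r", "a"]))  # `kra'
print(G(["r"]))            # final `r' as in `avar'
```

Under these assumptions the sketch reproduces the examples given earlier: `o' in `ko' contributes both sign glyphs, `y' in `kya' only a right one, `r' in `kra' only a left one, and a final `r' yields its chillu glyph.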

Classes of Literals

From the above algorithm, we can see that many literals have exactly the same joining properties with other literals. So we classify literals into classes such that any two literals in a class have the same joining properties with:

  1. any literal in any other class
  2. other literals in the same class
  3. the literal itself

The classes are as follows:

  1. `a'
  2. Base vocal literals other than `a'
  3. Derived vocal literals
  4. Base consonantal literals (other than those described below)
  5. Derived consonantal literals
  6. `g', `S' and `s'
  7. `T', `Th' and `Dh'
  8. `D'
  9. `N'
  10. `th'
  11. `thh', `d' and `dh'
  12. `n'
  13. `m'
  14. `y' and `zh'
  15. `r' and `rr'
  16. `l'
  17. `v'
  18. `L'

We know that the set of literals depends on the code for information interchange. The concept of classes of literals allows the number of literals to differ between schemes. This makes a generalised implementation of the rules easier, since the classes alone are fixed.
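
The class list can be encoded as a lookup in which only classes 6 to 18 name their members explicitly, while classes 1 to 5 follow from the properties of the literal. The sketch below is illustrative; the function name and its boolean parameters are hypothetical, and a real scheme would supply the vocal/base properties from its own tables.

```python
# Classes 6 to 18, copied from the list above.
EXPLICIT_CLASSES = {
    6: {"g", "S", "s"}, 7: {"T", "Th", "Dh"}, 8: {"D"}, 9: {"N"},
    10: {"th"}, 11: {"thh", "d", "dh"}, 12: {"n"}, 13: {"m"},
    14: {"y", "zh"}, 15: {"r", "rr"}, 16: {"l"}, 17: {"v"}, 18: {"L"},
}


def class_of(literal, is_vocal, is_base):
    """Return the class number (1 to 18) of a literal."""
    if literal == "a":
        return 1
    if is_vocal:
        return 2 if is_base else 3
    if not is_base:
        return 5
    for number, members in EXPLICIT_CLASSES.items():
        if literal in members:
            return number
    return 4  # remaining base consonantal literals


print(class_of("th", is_vocal=False, is_base=True))
print(class_of("k", is_vocal=False, is_base=True))
```

Because the rules are keyed on the class number, a scheme with a different set of literals only needs to change the data passed to such a lookup.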

6.4 Parser Directives

The following symbols and literals are required in the model being considered for transliteration. These additional entries can also be viewed as parser directives.

Zero Width Symbol (ZWS)

Semantically, ZWS acts as a symbol having zero width. It is useful for writing many Arabic words in which the independent glyph of a vocal literal comes in the middle of a word (eg: `va_aL~'; if it were written as `vaaL~', it would mean "sword"). Another use is to avoid conjunct glyphs or sign glyphs where they would be the default reverse transliteration output. A few conflicting example pairs arise if this symbol is omitted [Keralapaniniyam].

Zero Width Literal (ZWL)

ZWL is used to obtain the vowel signs alone. The semantics of ZWL are the same as those of ZWS, except that ZWL has the properties of a literal whereas ZWS has the properties of a symbol.
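
The effect of ZWS on parsing can be illustrated with the `va_aL~' example. The sketch below assumes that the underscore in that example is the ZWS character, uses a hypothetical four-literal table, and uses a plain longest-match tokenizer as a stand-in for the parsing rule of section 6.3. It only shows how the directive keeps `a' + `a' from being read as `aa'; it does not model ZWL.

```python
LITERALS = {"v", "a", "aa", "L", "~"}  # hypothetical subset
ZWS = "_"  # assumption: the underscore in `va_aL~' is the ZWS


def tokenize(stream):
    """Split a stream into literals, treating ZWS as a separate token."""
    tokens = []
    i = 0
    while i < len(stream):
        if stream[i] == ZWS:
            tokens.append("ZWS")
            i += 1
            continue
        # Longest match first (a stand-in for the rule in section 6.3).
        for size in range(len(stream) - i, 0, -1):
            piece = stream[i:i + size]
            if piece in LITERALS:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError("no literal matches at position %d" % i)
    return tokens


print(tokenize("vaaL~"))
print(tokenize("va_aL~"))
```

Without the directive the two `a' characters merge into the single literal `aa'. With it, the second `a' starts a new unit and can take its independent glyph.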

6.5 Existing Implementation

The author's implementation of the algorithm is available from the Varamozhi site. It is distributed under the GNU Public License. The input to the generator consists of the scheme describing the character sequence adopted for each literal, the class to which the literal belongs and the different glyphs of that literal. The output is C code that generates the glyph sequence from the character stream representing Malayalam text, according to the rules of reverse transliteration described in the previous sections. The rules are implemented for each class of literals.
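
The three pieces of information supplied per literal can be pictured as one record per scheme entry. The sketch below is not the Varamozhi input format, which is not shown here; the class name, field names and placeholder glyph strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SchemeEntry:
    """One literal of a scheme (hypothetical layout)."""
    chars: str          # character sequence adopted for the literal
    literal_class: int  # class number from section 6.3
    glyphs: dict = field(default_factory=dict)  # glyph kind -> glyph code


# Placeholder glyph codes; a real scheme would use font-specific values.
scheme = [
    SchemeEntry("k", 4, {"independent": "<ka>"}),
    SchemeEntry("n", 12, {"independent": "<na>", "chillu": "<chillu-n>"}),
]

# Group entries by class, since the rules are implemented per class.
by_class = {}
for entry in scheme:
    by_class.setdefault(entry.literal_class, []).append(entry.chars)
print(by_class)
```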

Salient features of the implementation are:

A few third-party editors, such as Madhuri and Font Converter, have been successfully tried with the parser generated by Varamozhi.

6.6 Conclusion

The author believes that a theory of reverse and forward transliterator generators can be developed for all languages having phonetic scripts. Moreover, everything could be put under a single generic framework. The work presented in this paper is a very specific instance of this much larger framework, and a starting point for further research in this area.

6.7 References

  1. A.R. Rajarajavarma, Keralapaniniyam, Second Ed.
  2. Suranad Kunjan Pillai, Malayalam Lexicon, 1965
  3. Gundert, Malayalam-English-Malayalam Dictionary
  4. Antony P. Stone, ISO's draft CD15919, http://ourworld.compuserve.com/homepages/stone_catend/trdcd1a.htm
  5. ISCII (IS 13194:1991) Standard, http://www.cdac.org.in/html/gist/articles.htm
  6. GNU, http://www.gnu.org/
  7. A.C.K, http://www.kerala.org/culture/music/mal/scripts/processQuery.cgi?song_name=alphabet
  8. Frans Velthuis, Velthuis, http://www.rug.nl/~velthuis/velthuis.html
  9. Soji Joseph, Madhuri and Achayan, http://members.tripod.com/~k_achayan
  10. Rajeev K. R., Font Converter, http://members.tripod.com/~rajk/mal
