In the model the author uses, a sequence of one or more characters represents each Malayalam letter. In the following sections, with the help of a few definitions, we present algorithms for generating Malayalam script from the character stream representing Malayalam text. The implementation is described in the last section.
In this section, Malayalam text appears within single quotes (') and is written in the Mozhi scheme described in the chapter on the Mozhi 7-bit transliteration scheme. Plus (+) is used as the string concatenation operator, and a forward slash (/) indicates that the string before the slash, the string after it, or both are under consideration.
We assume the following deviations from the traditional Malayalam scripting conventions. This is for ease of generalising the transliteration properties.
Letters are the basic phonetic building blocks of a language. Along with its sound, a letter also has one or more graphical forms.
In contrast to a letter, a symbol does not have its own phonetic identity (eg: comma, colon or visarga).
A literal is a sequence of letters which has a single fused graphical form when written either independently or combined with any other letters (eg: `n', `nn' and `yu'). Usually the set of literals includes all letters and all conjunct forms consisting of two or more letters. Literals representing letters are called base literals and those representing conjunct forms are called derived literals. The size of a literal is the number of characters used to represent that literal in the given transliteration scheme. The |x| operator denotes the size of the literal x with respect to that scheme. A literal can also be classified as vocal or consonantal depending on whether or not it can be pronounced independently.
The concept of literals is closely linked to fonts or, more generally, to some code for information interchange. Whether or not a character sequence has a single fused graphical form can be decided only by analysing that code.
A glyph is the graphical form of a literal or symbol.
The character stream is the sequence of characters representing transliterated Malayalam. It can also be interpreted as the sequence of literals and symbols represented by those characters.
A script is a sequence of glyphs denoting a text.
Transliteration is the representation of the glyphs of a source script by the glyphs of a target script. In our description, the source script is Malayalam and the target script is English.
Reverse transliteration is the process whereby the glyphs of a target script are transliterated back into those of the source script.
A conjunct is a sequence of literals such that all the literals other than the last are consonantal literals. The last literal can be either vocal or consonantal. (eg: In the conjunct `sva', `s' is the first literal, `v' is the second and `a' is the third. Moreover, `svar' is not a conjunct because the vocal literal `a' does not come as the last literal.)
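The definition of a conjunct can be checked mechanically. The following is a minimal Python sketch; the set `VOCAL` and the function name `is_conjunct` are assumptions made for illustration, and the set covers only a handful of vocal literals rather than a full scheme.

```python
# Hypothetical, deliberately tiny set of vocal literals; a real scheme
# would list every vocal literal it defines.
VOCAL = {"a", "aa", "i", "u", "o"}


def is_conjunct(literals):
    """Return True if every literal except the last one is consonantal."""
    if not literals:
        return False
    return all(lit not in VOCAL for lit in literals[:-1])


print(is_conjunct(["s", "v", "a"]))       # `sva'
print(is_conjunct(["s", "v", "a", "r"]))  # `svar'
```

The first call accepts `sva' and the second rejects `svar', because in `svar' the vocal literal `a' is not the last literal.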
The context of a literal refers to its neighbourhood in a text. There are two different contexts for a literal: the lexical context and the phonetical context. By default, context means phonetical context. A literal can also have a null neighbourhood, that is, no other sound is pronounced along with it.
A detailed list of related definitions can be found on Antony P. Stone's transliteration page.
A literal assumes different glyphs depending on its context. They are classified as follows:
A vocal literal assumes its independent glyph in the null context (eg: the graphical form representing a single `o' sound). Any consonantal literal X assumes its independent glyph in the conjunct X + `a' (eg: the graphical form of `ka').
A sign glyph is the graphical form of a literal when it appears last in a two-literal conjunct (eg: the graphical form of `o' in `ko', the form of `y' in `kya' and the form of `n' in `sna'). It can have two parts, which come to the left and right of the independent glyph of the first literal. They are called the "left sign glyph" and the "right sign glyph" respectively (eg: in `ko', `o' has both sign glyphs; in `kya', `y' has only a right sign glyph; and in `kra', `r' has only a left sign glyph). Moreover, all the vocal literals except `a' have sign glyphs.
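The placement of sign glyphs around the independent glyph of the first literal can be sketched as follows. This is an illustrative Python sketch only: the table `SIGN_GLYPHS`, the function `compose` and the angle-bracket placeholder strings are assumptions, and only the three literals used in the examples above are listed.

```python
# (left sign glyph, right sign glyph) for a few literals; None means the
# literal has no sign glyph on that side. Placeholder strings stand in
# for real glyphs.
SIGN_GLYPHS = {
    "o": ("<left sign glyph of o>", "<right sign glyph of o>"),
    "y": (None, "<right sign glyph of y>"),
    "r": ("<left sign glyph of r>", None),
}


def compose(first, last):
    """Glyph sequence for a two-literal conjunct first + last."""
    left, right = SIGN_GLYPHS[last]
    parts = [left, f"<independent glyph of {first}>", right]
    return [p for p in parts if p is not None]


for first, last in [("k", "o"), ("k", "y"), ("k", "r")]:
    print(" + ".join(compose(first, last)))
```

For `k' + `o' the independent glyph of `k' is enclosed by both sign glyphs of `o', while `y' contributes only on the right and `r' only on the left.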
The literal `v' has two right sign glyphs. The sign glyph which appears when the first literal is neither `y' nor `zh' is called the primary right sign glyph of `v'. The one which appears with `y' and `zh' is called the secondary right sign glyph of `v'. For example, see the words `svapnam', `vaazhv~' and `meyvazhakkam'.
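The choice between the two right sign glyphs of `v' depends only on the first literal, which a short Python sketch makes explicit. The function name and the placeholder strings are assumptions for illustration.

```python
def right_sign_glyph_of_v(first_literal):
    """Pick the right sign glyph of `v' that follows the given first literal."""
    if first_literal in ("y", "zh"):
        return "<secondary right sign glyph of v>"
    return "<primary right sign glyph of v>"


print(right_sign_glyph_of_v("s"))   # as in `svapnam'
print(right_sign_glyph_of_v("zh"))  # as in `vaazhv~'
print(right_sign_glyph_of_v("y"))   # as in `meyvazhakkam'
```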
Only `N', `n', `m', `r', `l', `L' and `rr' have a single fused glyph in the null context. Those glyphs are called chillu glyphs. (In old orthography `k' and `y' also had chillu glyphs, as described in Gundert's Dictionary.)
The chillu glyphs of `r' and `rr' are the same. There are two different chillu glyphs for `r/rr'. The graphical form that `r' assumes when it is the last literal of a word is called the primary chillu glyph of `r/rr' (eg: the graphical form of `r' in `avar'). The chillu glyph of `r' written as a dot over the next consonant in the word is called the secondary chillu glyph of `r/rr' (eg: the graphical form of `r' in the word `charkka' written in old orthography). In new orthography, the primary chillu glyph itself is used in place of the secondary.
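The selection between the two chillu glyphs of `r/rr' can be sketched in Python as below. The function and parameter names are assumptions, and the sketch presumes the caller has already established that `r' or `rr' is in a chillu position (word-final, or directly before a consonant).

```python
def chillu_glyph_of_r(word_final, old_orthography):
    """Choose the chillu glyph of `r/rr' for a literal in chillu position."""
    if word_final or not old_orthography:
        # New orthography uses the primary glyph everywhere.
        return "<primary chillu glyph of r/rr>"
    # Old orthography: a dot over the next consonant.
    return "<secondary chillu glyph of r/rr>"


print(chillu_glyph_of_r(word_final=True, old_orthography=True))    # `avar'
print(chillu_glyph_of_r(word_final=False, old_orthography=True))   # `charkka', old
print(chillu_glyph_of_r(word_final=False, old_orthography=False))  # `charkka', new
```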
Reverse transliteration has two steps. First, the character stream of transliterated Malayalam text is parsed into a sequence of literals. Then the glyph sequence corresponding to that sequence of literals is generated by looking at the context of each literal.
In the model we use, a sequence of one or more characters represents each Malayalam letter. Hence there arises the problem of splitting the character stream representing Malayalam text into the corresponding literals. For example, if the stream is `thn', we can view it as `t' + `hn' or `th' + `n' or `t' + `h' + `n' or `thn'. We have to choose the correct one from these options. The generalised rule is as follows:
Let the character stream S be a1 + a2 + ... + an, where ai (1 <= i <= n) is a character. We consider the non-trivial case where S can be split in two ways, as S1 = x1 + x2 + ... + xp and S2 = y1 + y2 + ... + yq, such that x1 != y1 and xp != yq, where xi (1 <= i <= p) and yj (1 <= j <= q) are character sequences representing single literals. Without loss of generality, S1 will be chosen if either of the following conditions is true:
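The candidate splits among which the rule has to choose, such as the four readings of `thn' mentioned earlier, can be enumerated mechanically. The Python sketch below only lists the candidates and does not implement the selection rule itself; the function `splits` and the set `LITERALS` are assumptions, and the set is a hypothetical scheme chosen so that all four readings of `thn' are possible.

```python
def splits(stream, literals):
    """Return every way of cutting `stream` into pieces drawn from `literals`."""
    if stream == "":
        return [[]]
    result = []
    for end in range(1, len(stream) + 1):
        head = stream[:end]
        if head in literals:
            # Extend this head with every split of the remaining characters.
            for rest in splits(stream[end:], literals):
                result.append([head] + rest)
    return result


# Hypothetical literal set, not the actual Mozhi table.
LITERALS = {"t", "h", "n", "th", "hn", "thn"}

for candidate in splits("thn", LITERALS):
    print(" + ".join(candidate))
```

With this literal set the sketch lists exactly four candidates: t + h + n, t + hn, th + n and thn.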
The next step is to generate the glyph sequence from the sequence of literals obtained. A rule-based algorithm is described below.
Let X, X1 and X2 denote conjuncts, and let z and z1 denote literals. Let G(X) be a function mapping the given conjunct X to its script. G(X) is defined below as a number of clauses, each of which has a gate condition and an expression returning a value. In a call to G(X), only the clause whose gate condition is satisfied is selected, and it returns the value of the corresponding expression. If more than one clause satisfies its condition, the lexically first one is selected.
<(primary)chillu glyph of z>
+ G(X2)
<independent glyph of z>
<left sign glyph of z>
+ G(X1 + `a')
+ <(primary)right sign glyph of z>
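The selection mechanism itself, where the lexically first clause with a satisfied gate condition supplies the value, can be sketched in Python as follows. The toy clauses work on plain strings and are invented purely to show the dispatch; they are not the clauses of G(X), and the names `select_clause` and `TOY_CLAUSES` are assumptions.

```python
def select_clause(clauses, x):
    """Evaluate the expression of the first clause whose gate accepts x."""
    for gate, expression in clauses:
        if gate(x):
            return expression(x)
    raise ValueError(f"no clause matches {x!r}")


# Toy (gate, expression) pairs in lexical order; NOT the clauses of G(X).
TOY_CLAUSES = [
    (lambda x: len(x) == 1, lambda x: f"<independent glyph of {x}>"),
    (lambda x: len(x) >= 1, lambda x: f"<glyphs for {x}>"),
]

print(select_clause(TOY_CLAUSES, "k"))    # both gates hold, the first wins
print(select_clause(TOY_CLAUSES, "kya"))  # only the second gate holds
```

Keeping the clauses in an ordered list makes the "first one is selected lexically" rule a property of the data rather than of the control flow.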
From the above algorithm, we can see that many literals have exactly the same joining properties with other literals. So we classify literals into classes such that any two literals in a class have the same joining property with:
The classes are as follows:
We know that the set of literals depends on the code for information interchange. The concept of classes of literals allows the number of literals to differ between schemes. This makes a generalised implementation of the rules easy, since only the classes are fixed.
The following symbols and literals are required in the model being considered for transliteration. These additional entries can also be viewed as parser directives.
Semantically, ZWS acts as a symbol having zero width. This symbol is useful for writing many Arabic words in which the independent glyph of a vocal literal comes in the middle of a word (eg: `va_aL~'; if it were written as `vaaL~' it would mean "sword"). Another use is to avoid the conjunct glyphs or sign glyphs when they are the default output of reverse transliteration. A few example pairs that conflict if this symbol is omitted are (Keralapaniniyam):
This symbol is used to get the vowel signs alone. The semantics of ZWL are the same as those of ZWS, except that ZWL has the properties of a literal whereas ZWS has the properties of a symbol.
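The separating effect of ZWS can be sketched by cutting the character stream at each ZWS before literals are parsed. This Python sketch assumes that the underscore is the character standing for ZWS, as the example `va_aL~' suggests; the function name is hypothetical.

```python
ZWS = "_"  # assumption: underscore denotes ZWS, as in `va_aL~'


def split_at_zws(stream):
    """Cut the stream at each ZWS so that no literal is matched across it."""
    return [part for part in stream.split(ZWS) if part]


print(split_at_zws("va_aL~"))  # ['va', 'aL~']
print(split_at_zws("vaaL~"))   # ['vaaL~']
```

In the first case the second `a' starts a new segment and so can take its independent glyph, whereas in the second case the two characters remain together and can be read as the single literal `aa'.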
The author's implementation of the algorithm is available from the Varamozhi site. It is distributed under the GNU General Public License. The input to the generator consists of the scheme describing the character sequence adopted for each literal, the class to which the literal belongs and the different glyphs of that literal. The output is C code for generating the glyph sequence from the character stream representing Malayalam text, according to the rules of reverse transliteration described in the previous section. The rules are implemented for each class of literals.
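The kind of information the generator takes as input can be pictured with a small record type. The following Python sketch is not the Varamozhi input format; the class `LiteralEntry`, its field names and the class names "consonant" and "vowel" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LiteralEntry:
    chars: str          # character sequence adopted for the literal
    literal_class: str  # class of the literal (names here are made up)
    glyphs: dict = field(default_factory=dict)  # glyph kind -> glyph

    def size(self):
        """|x|: number of characters representing the literal."""
        return len(self.chars)


entries = [
    LiteralEntry("k", "consonant", {"independent": "<independent glyph of k>"}),
    LiteralEntry("o", "vowel", {
        "independent": "<independent glyph of o>",
        "left sign": "<left sign glyph of o>",
        "right sign": "<right sign glyph of o>",
    }),
]

# Group literals by class, since the rules are written per class.
by_class = {}
for entry in entries:
    by_class.setdefault(entry.literal_class, []).append(entry.chars)
print(by_class)
```

Because the rules refer only to classes, adding a literal to a scheme amounts to adding one more entry of an existing class.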
Salient features of the implementation are:
A few third-party editors, such as Madhuri and Font Converter, have been successfully tried on the parser generated by Varamozhi.
The author believes that a theory of reverse and forward transliterator generators can be developed for all languages having phonetic scripts. Moreover, everything could be put under a single generic framework. The work presented in this paper is a very specific instance of this much larger framework and a starting point for further research in this area.