
Module wconv

Conversions between scripts and dialects in Serbian.

Serbian standard literary language can be written in two dialects, Ekavian and Ijekavian, and two scripts, Cyrillic and Latin. Dialects and scripts can be freely combined, resulting in four official writing standards: Ekavian Cyrillic, Ekavian Latin, Ijekavian Cyrillic, and Ijekavian Latin. Some automatic and semi-automatic conversions between them are possible.

Script Transliteration

For plain text containing only Serbian words (including well-adapted loans), transliterating from Cyrillic to Latin script is trivial. The only care needed is when converting Cyrillic Љ, Њ, Џ into the Latin digraphs Lj, Nj, Dž, because sometimes they should be fully upper-case (e.g. Љубљана→Ljubljana, ЉУБЉАНА→LJUBLJANA). This is easily resolved algorithmically, by checking whether the previous or the next letter is upper-case too.
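The neighbour check described above can be sketched as follows. This is an illustrative stand-alone example, not the code behind ctol(); the names CYR_TO_LAT and cyr_to_lat are hypothetical.

```python
# Illustrative sketch only; not the implementation used by ctol().
CYR_TO_LAT = dict(zip(
    "абвгдђежзијклљмнњопрстћуфхцчџш",
    ["a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k", "l", "lj",
     "m", "n", "nj", "o", "p", "r", "s", "t", "ć", "u", "f", "h", "c", "č",
     "dž", "š"],
))

def cyr_to_lat(text):
    out = []
    for i, ch in enumerate(text):
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)           # not a Serbian Cyrillic letter: keep as-is
        elif ch.islower():
            out.append(lat)
        elif len(lat) == 1:
            out.append(lat.upper())
        else:
            # Upper-case Љ, Њ, Џ: use the full upper-case digraph only if a
            # neighbouring letter is upper-case too, otherwise Lj, Nj, Dž.
            prev_up = i > 0 and text[i - 1].isupper()
            next_up = i + 1 < len(text) and text[i + 1].isupper()
            out.append(lat.upper() if prev_up or next_up else lat.capitalize())
    return "".join(out)

print(cyr_to_lat("Љубљана"))   # Ljubljana
print(cyr_to_lat("ЉУБЉАНА"))   # LJUBLJANA
```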

Transliterating from Latin to Cyrillic is somewhat harder, because in rare cases the digraphs lj, nj, dž are not single letters but sequences of two standalone letters; i.e. they do not map to Cyrillic љ, њ, џ, but to лј, нј, дж (dablju→даблју, konjunkcija→конјункција, nadživeti→надживети). The only way to handle this is by having a dictionary of special cases.
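A minimal sketch of the dictionary approach, assuming the special cases are stored as word stems together with the position of the pair that must not be read as a digraph. The names SPLIT_STEMS and lat_to_cyr and the three sample entries are hypothetical; this module does not expose a Latin-to-Cyrillic function, and the sketch ignores foreign Latin content altogether.

```python
import re

LAT_TO_CYR = dict(zip("abcčćdđefghijklmnoprsštuvzž",
                      "абцчћдђефгхијклмнопрсштувзж"))
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

# Hypothetical special-case dictionary: word stem -> index of the first
# letter of a pair that looks like a digraph but is really two letters.
SPLIT_STEMS = {"dablju": 3, "konjunkcij": 2, "nadživ": 2}

def lat_to_cyr(text):
    def convert_word(match):
        word = match.group(0)
        lower = word.lower()
        no_digraph = {pos for stem, pos in SPLIT_STEMS.items()
                      if lower.startswith(stem)}
        out, i = [], 0
        while i < len(word):
            pair = lower[i:i + 2]
            if pair in DIGRAPHS and i not in no_digraph:
                cyr, step = DIGRAPHS[pair], 2
            else:
                # Letters outside the Serbian alphabet (q, w, x, y) are kept.
                cyr, step = LAT_TO_CYR.get(lower[i], word[i]), 1
            out.append(cyr.upper() if word[i].isupper() else cyr)
            i += step
        return "".join(out)

    # Convert each run of letters separately, leaving everything else alone.
    return re.sub(r"[^\W\d_]+", convert_word, text)

print(lat_to_cyr("konj"))         # коњ
print(lat_to_cyr("konjunkcija"))  # конјункција
```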

Furthermore, in today's practice texts are rarely as clean as assumed above. They are frequently riddled with foreign Latin phrases (such as proper names), quasiphrases (such as electronic addresses), and constructive elements (such as markup tags). On the other hand, foreign Cyrillic phrases are quite infrequent (they may be found e.g. in texts on linguistic topics). This means that in practice transliteration from Cyrillic to Latin remains straightforward, but from Latin to Cyrillic decidedly not so.

Script Hybridization

Sometimes the result of direct transliteration from Cyrillic to Latin is against the established Latin practice in a certain field, even if valid according to official orthography. Then it becomes necessary to specially handle some parts of the text (e.g. transliterations or lack thereof of foreign proper names).

Alternatives directives are a way to compose "hybrid" Cyrillic-Latin text, out of which both ordinary Cyrillic and non-directly transliterated Latin texts can be automatically derived. For example, this hybrid text:

   Различите ~@/линукс/Linux/ дистрибуције...

can be automatically resolved into:

   Различите линукс дистрибуције...
   Različite Linux distribucije...

The string ~@ is the head of an alternatives directive. It is followed by a single character, which is then used to delimit the Cyrillic and Latin parts, in that order, from the surrounding text. (For all details on the format of alternatives directives, see resolve_alternatives().) Transliteration from Cyrillic to Latin is performed only on text outside of alternatives directives.
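The resolution of script alternatives can be sketched as below. This is a simplified illustration, not the behavior of hctocl() or resolve_alternatives(): the regular expression covers only the basic head-delimiter-Cyrillic-delimiter-Latin-delimiter form, and the transliteration table TABLE is a reduced stand-in that does not handle fully upper-case digraphs. All names are hypothetical.

```python
import re

_LOWER = dict(zip(
    "абвгдђежзијклљмнњопрстћуфхцчџш",
    ["a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k", "l", "lj",
     "m", "n", "nj", "o", "p", "r", "s", "t", "ć", "u", "f", "h", "c", "č",
     "dž", "š"],
))
# Simplified table: capitals map to capitalized forms (Љ -> Lj, never LJ).
TABLE = str.maketrans(
    {**_LOWER, **{c.upper(): l.capitalize() for c, l in _LOWER.items()}})

# Head ~@, one delimiter character, then two parts closed by that delimiter.
DIRECTIVE = re.compile(r"~@(.)(.*?)\1(.*?)\1", re.DOTALL)

def resolve_script_hybrid(htext):
    cyr, lat, pos = [], [], 0
    for m in DIRECTIVE.finditer(htext):
        plain = htext[pos:m.start()]
        cyr += [plain, m.group(2)]
        # Only text outside directives is transliterated.
        lat += [plain.translate(TABLE), m.group(3)]
        pos = m.end()
    tail = htext[pos:]
    return "".join(cyr + [tail]), "".join(lat + [tail.translate(TABLE)])

c, l = resolve_script_hybrid("Различите ~@/линукс/Linux/ дистрибуције...")
print(c)  # Различите линукс дистрибуције...
print(l)  # Različite Linux distribucije...
```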

Dialect Hybridization

Both the Ekavian and the Ijekavian dialect may be represented within a single text. Such hybrid text is basically Ijekavian, but jat-reflexes are marked by inserting one of the jat-reflex ticks ›, ‹, ▹, ◃:

   Д‹ио б‹иљежака о В›јештичјој р›ијеци.

Clean Ijekavian text is then obtained by just removing jat-reflex ticks preceding valid jat-reflexes, and Ekavian by applying the jat-reflex map:

   Дио биљежака о Вјештичјој ријеци.
   Део бележака о Вештичјој реци.

The jat-reflex mapping rules are as follows, grouped by tick:

›ије→е, ›је→е
‹иј→еј, ‹иљ→ел, ‹ио→ео, ‹ље→ле, ‹ње→не
▹ије→и, ▹је→и
◃ијел→ео, ◃ијен→ењ, ◃ит→ет, ◃ил→ел, ◃јел→ео, ◃тн→тњ, ◃шње→сне
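The mapping rules above translate directly into a lookup table. The following is a minimal sketch, assuming lower-case reflexes only and no validation or warnings; it is not the code behind hitoe() or hitoi(), and the names JAT_MAP, to_ekavian and to_ijekavian are hypothetical.

```python
import re

# Tick -> {Ijekavian reflex: Ekavian form}, copied from the rules above.
JAT_MAP = {
    "›": {"ије": "е", "је": "е"},
    "‹": {"иј": "еј", "иљ": "ел", "ио": "ео", "ље": "ле", "ње": "не"},
    "▹": {"ије": "и", "је": "и"},
    "◃": {"ијел": "ео", "ијен": "ењ", "ит": "ет", "ил": "ел",
          "јел": "ео", "тн": "тњ", "шње": "сне"},
}

def _pattern(tick):
    # Longer reflexes first, so that e.g. ијел wins over ил.
    forms = sorted(JAT_MAP[tick], key=len, reverse=True)
    return re.compile(re.escape(tick) + "(" + "|".join(forms) + ")")

def to_ekavian(htext):
    for tick, reflexes in JAT_MAP.items():
        htext = _pattern(tick).sub(lambda m, r=reflexes: r[m.group(1)], htext)
    return htext

def to_ijekavian(htext):
    # Drop a tick only when a valid reflex for that tick follows it.
    for tick in JAT_MAP:
        htext = _pattern(tick).sub(lambda m: m.group(1), htext)
    return htext

hybrid = "Д‹ио б‹иљежака о В›јештичјој р›ијеци."
print(to_ijekavian(hybrid))  # Дио биљежака о Вјештичјој ријеци.
print(to_ekavian(hybrid))    # Део бележака о Вештичјој реци.
```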

For very rare special cases, it is possible to directly provide different forms for Ekavian and Ijekavian, in that order, by using an alternatives directive:

   Гд›је с' ~#/то/ба/ пошо̑?

Compared to alternatives directives for scripts, the only difference is that here the directive head is ~#. Alternatives directives for script and dialect can thus be mixed without conflicts in a single text, and even interwoven (when interweaving, different delimiters must be used).
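Because the two kinds of directive differ only in their head, each can be resolved without disturbing the other. The sketch below illustrates this with a hypothetical helper, pick_alternative, which is not part of this module and handles only the basic two-part form.

```python
import re

def pick_alternative(text, head, index):
    """Keep part `index` (0 or 1) of every directive starting with `head`."""
    pattern = re.compile(re.escape(head) + r"(.)(.*?)\1(.*?)\1", re.DOTALL)
    return pattern.sub(lambda m: m.group(2 + index), text)

mixed = "~#/то/ба/ ~@|линукс|Linux|"
step = pick_alternative(mixed, "~#", 0)   # first part of ~# is Ekavian
print(step)                               # то ~@|линукс|Linux|
print(pick_alternative(step, "~@", 1))    # то Linux

# Interwoven directives work as long as the delimiters differ.
print(pick_alternative("~@/~#|део|дио|/deo/", "~#", 1))  # ~@/дио/deo/
```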

Author: Chusslove Illich (Часлав Илић) <caslav.ilic@gmx.net>

License: GPLv3

Functions

ctol(text)
Transliterate text from Cyrillic to proper Latin [type F1A hook].

cltoa(text)
Transliterate text from Cyrillic or Latin to stripped ASCII [type F1A hook].

ectol(text)
Transliterate text from English in Cyrillic by keyboard layout to proper English [type F1A hook].

hctoc(text)
Resolve hybrid Cyrillic text with script alternatives into plain Cyrillic text [type F1A hook].

hctol(text)
Resolve hybrid Cyrillic text with script alternatives into plain Latin text [type F1A hook].

hctocl(htext)
Resolve hybrid Cyrillic-Latin text into clean Cyrillic and clean Latin. Returns: (string, string).

cltoh(textc, textl, delims=u'/|¦', full=False)
Construct hybrid Cyrillic text out of clean Cyrillic and Latin texts. Returns: string.

hitoe(text)
Resolve hybrid Ijekavian text with jat-reflex ticks and dialect alternatives into plain Ekavian text [type F1A hook].

hitoeq(text)
Like hitoe, but does not output warnings on problems [type F1A hook].

hitoi(text)
Resolve hybrid Ijekavian text with jat-reflex ticks and dialect alternatives into plain Ijekavian text [type F1A hook].

hitoiq(text)
Like hitoi, but does not output warnings on problems [type F1A hook].

validate_dhyb(text)
Check whether dialect-hybrid text is valid [type V1A hook].

hitoei(htext)
Resolve hybrid Ijekavian-Ekavian text into clean Ekavian and Ijekavian. Returns: (string, string).

tohi(text1, text2, ekord=None, delims=u'/|¦', parthyb=False)
Construct hybrid Ijekavian text out of Ekavian and Ijekavian texts. Returns: string.

hictoec(text)
Resolve hybrid Ijekavian-Ekavian Cyrillic-Latin text into clean Ekavian Cyrillic text [type F1A hook].

hictoecq(text)
Like hictoec, but does not output warnings on problems [type F1A hook].

hictoel(text)
Resolve hybrid Ijekavian-Ekavian Cyrillic-Latin text into clean Ekavian Latin text [type F1A hook].

hictoic(text)
Resolve hybrid Ijekavian-Ekavian Cyrillic-Latin text into clean Ijekavian Cyrillic text [type F1A hook].

hictoicq(text)
Like hictoic, but does not output warnings on problems [type F1A hook].

hictoil(text)
Resolve hybrid Ijekavian-Ekavian Cyrillic-Latin text into clean Ijekavian Latin text [type F1A hook].

hictoall(htext)
Resolve hybrid Ijekavian-Ekavian Cyrillic-Latin text into all four clean variants. Returns: (string, string, string, string).

Variables
	__package__ = 'pology.lang.sr'

Function Details

hctocl(htext)

Resolve hybrid Cyrillic-Latin text into clean Cyrillic and clean Latin.

Parameters:

htext (string) - hybrid text

Returns: (string, string)

Cyrillic and Latin texts

cltoh(textc, textl, delims=u'/|¦', full=False)

Construct hybrid Cyrillic text out of clean Cyrillic and Latin texts.

Hybridization is performed by inserting alternatives directives for parts which cannot be resolved by direct transliteration. If full is set to True, the complete texts are unconditionally wrapped into a single alternatives directive.

Parameters:

textc (string) - Cyrillic text
textl (string) - Latin text
delims (string) - possible delimiter characters
full (bool) - whether to wrap full texts as a single alternatives directive

Returns: string

hybrid Cyrillic text

hitoei(htext)

Resolve hybrid Ijekavian-Ekavian text into clean Ekavian and Ijekavian.

Parameters:

htext (string) - hybrid text

Returns: (string, string)

Ekavian and Ijekavian text

tohi(text1, text2, ekord=None, delims=u'/|¦', parthyb=False)

Construct hybrid Ijekavian text out of Ekavian and Ijekavian texts.

Hybridization is performed by merging Ekavian and Ijekavian forms into Ijekavian forms with inserted jat-reflex ticks. Input texts can be in both Cyrillic and Latin, and piecewise so. The texts also do not have to be clean Ekavian and Ijekavian, as hybridization is performed only at difference segments. The order of the text arguments is not important as long as all difference segments can be merged (i.e. the function is commutative in that case).

If a difference segment cannot be merged by jat-reflex ticks, then the resolution depends on the ekord parameter. If it is None, then the segment of text2 is taken into the result. If it is 1 or 2, then the segments of text1 and text2 are combined in a dialect alternatives directive (~#/.../.../); the number determines which segment is put first in the directive (i.e. considered Ekavian), that of text1 or of text2. Any other value of ekord leads to undefined behavior.
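The effect of ekord on a single unmergeable segment can be sketched as below. This is an assumption-laden illustration, not the body of tohi(): the helper name combine_unmergeable is hypothetical, the rule of taking the first delimiter absent from both segments is a guess at how delims might be used, and raising ValueError for other ekord values is a choice of this sketch, since the behavior is documented as undefined.

```python
def combine_unmergeable(seg1, seg2, ekord=None, delims="/|¦"):
    """Resolve one difference segment that jat-reflex ticks cannot express."""
    if ekord is None:
        return seg2  # documented default: keep the segment of the second text
    if ekord not in (1, 2):
        # Undefined in the documentation; this sketch simply rejects it.
        raise ValueError("ekord must be None, 1 or 2")
    # ekord names the text whose segment counts as Ekavian and goes first.
    ekavian, ijekavian = (seg1, seg2) if ekord == 1 else (seg2, seg1)
    for delim in delims:
        # Assumed rule: use the first delimiter found in neither segment.
        if delim not in seg1 and delim not in seg2:
            return "~#" + delim + ekavian + delim + ijekavian + delim
    raise ValueError("no usable delimiter among %r" % delims)

print(combine_unmergeable("то", "ба"))            # ба
print(combine_unmergeable("то", "ба", ekord=1))   # ~#/то/ба/
print(combine_unmergeable("то", "ба", ekord=2))   # ~#/ба/то/
```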

It is possible that input texts are already partially hybridized, and only some parts of them need to be additionally hybridized. Setting parthyb to True will tell the function to detect and skip already hybridized segments, and hybridize only the rest.

Parameters:

text1 (string) - first text
text2 (string) - second text
ekord (None, 1, 2) - enumerates the text to be considered Ekavian when adding alternatives directives
delims (string) - possible delimiter characters for alternatives directives
parthyb (bool) - whether input texts are already partially hybridized

Returns: string

hybrid Ijekavian text

hictoall(htext)

Resolve hybrid Ijekavian-Ekavian Cyrillic-Latin text into all four clean variants.

Parameters:

htext (string) - hybrid text

Returns: (string, string, string, string)

Ekavian Cyrillic, Ekavian Latin, Ijekavian Cyrillic, and Ijekavian Latin text
