Module wconv
Conversions between scripts and dialects in Serbian.
The Serbian standard literary language can be written in two dialects,
Ekavian and Ijekavian, and two scripts, Cyrillic and Latin. Dialects and
scripts can be freely combined, resulting in four official writing
standards: Ekavian Cyrillic, Ekavian Latin, Ijekavian Cyrillic, and
Ijekavian Latin. Some automatic and semi-automatic conversions between
them are possible.
Script Transliteration
For plain text containing only Serbian words (including well adapted
loans), it is trivial to transliterate from Cyrillic to Latin script.
It is only necessary to take care when converting Cyrillic Љ, Њ, Џ into
Latin digraphs Lj, Nj, Dž, because sometimes they should be fully
upper-case (e.g. Љубљана→Ljubljana, ЉУБЉАНА→LJUBLJANA). This is easily
resolved algorithmically, by checking whether the previous or the next
letter is upper-case too.
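The digraph case rule described above can be sketched as follows. This is a minimal illustration, not the implementation of ctol(); the names CYR_TO_LAT and cyr_to_lat are hypothetical and exist only for this example.

```python
# Hypothetical sketch of Cyrillic-to-Latin transliteration; not ctol() itself.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def cyr_to_lat(text):
    out = []
    for i, ch in enumerate(text):
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)  # not a Serbian Cyrillic letter: copy unchanged
            continue
        if ch.isupper():
            # A digraph is fully upper-cased only if a neighbour is upper-case.
            prev_up = i > 0 and text[i - 1].isupper()
            next_up = i + 1 < len(text) and text[i + 1].isupper()
            if len(lat) == 2 and not (prev_up or next_up):
                lat = lat.capitalize()  # Љ -> Lj
            else:
                lat = lat.upper()  # Љ -> LJ, or a single upper-case letter
        out.append(lat)
    return "".join(out)


print(cyr_to_lat("Љубљана"))
print(cyr_to_lat("ЉУБЉАНА"))
```

Characters outside the table (Latin letters, digits, punctuation) pass through unchanged, which is why this direction stays simple even on mixed text.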
Transliterating from Latin to Cyrillic is somewhat harder, because
in rare cases the letter pairs lj, nj, dž are not digraphs but two
standalone letters each; i.e. they do not map to Cyrillic љ, њ, џ, but to
лј, нј, дж (dablju→даблју, konjunkcija→конјункција, nadživeti→надживети).
The only way to handle this is by having a dictionary of special cases.
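A dictionary of special cases could look roughly like the sketch below. It is an assumption-laden illustration: the EXCEPTIONS table holds only the three examples from the text, it is keyed by word stems (a choice made here, not one documented for this module), and only lower-case input is handled.

```python
import re

# Hypothetical sketch; lower-case input only.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}
LETTERS = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}
# Stems in which the letter pair is NOT a digraph (examples from the text).
EXCEPTIONS = {"dablju": "даблју", "konjunk": "конјунк", "nadživ": "наджив"}


def _plain(s):
    # Ordinary transliteration: try a digraph first, then a single letter.
    out, i = [], 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(LETTERS.get(s[i], s[i]))
            i += 1
    return "".join(out)


def lat_to_cyr_word(word):
    for stem, cyr in EXCEPTIONS.items():
        if word.startswith(stem):
            return cyr + _plain(word[len(stem):])
    return _plain(word)


def lat_to_cyr(text):
    return re.sub(r"\w+", lambda m: lat_to_cyr_word(m.group(0)), text)


print(lat_to_cyr("konj, konjunkcija"))
```

A real exception list would have to be much larger and would also need to cope with capitalization and inflected forms.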
Furthermore, in today's practice texts are rarely as clean as assumed
above. They are frequently riddled with foreign Latin phrases (such as
proper names), quasiphrases (such as electronic addresses), and
constructive elements (such as markup tags). On the other hand, foreign
Cyrillic phrases are quite infrequent (they may be found e.g. in texts
on linguistic topics). This means that in practice transliteration from
Cyrillic to Latin remains straightforward, but from Latin to Cyrillic
decidedly not so.
Script Hybridization
Sometimes the result of direct transliteration from Cyrillic to
Latin goes against the established Latin practice in a certain field,
even if it is valid according to official orthography. Then it becomes
necessary to specially handle some parts of the text (e.g. whether or
not foreign proper names are transliterated).
Alternatives directives are a way to compose "hybrid"
Cyrillic-Latin text, out of which both ordinary Cyrillic and
non-directly transliterated Latin texts can be automatically derived.
For example, this hybrid text:
Различите ~@/линукс/Linux/ дистрибуције...
can be automatically resolved into:
Различите линукс дистрибуције...
Različite Linux distribucije...
String ~@ is the head of an alternatives directive. It is followed by a
single character, which is then used to delimit the Cyrillic and Latin
parts, in that order, from the surrounding text. (For all details on the
format of alternatives directives, see resolve_alternatives().)
Transliteration from Cyrillic to Latin is performed only on text outside
of alternatives directives.
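How both plain texts fall out of one hybrid text can be sketched with a regular expression. This is a simplified illustration, not the behavior of hctoc(), hctol() or resolve_alternatives(): it ignores escaping and malformed directives, and its transliteration table always title-cases upper-case digraphs. All names in it are hypothetical.

```python
import re

# ~@ head, one delimiter character, Cyrillic part, delimiter, Latin part, delimiter.
_ALT = re.compile(r"~@(.)(.*?)\1(.*?)\1", re.S)

_CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
_LAT = ["a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k", "l", "lj",
        "m", "n", "nj", "o", "p", "r", "s", "t", "ć", "u", "f", "h", "c", "č",
        "dž", "š"]
# Simplification (assumption): upper-case digraphs always become Lj, Nj, Dž.
_TABLE = {ord(c): l for c, l in zip(_CYR, _LAT)}
_TABLE.update({ord(c.upper()): l.capitalize() for c, l in zip(_CYR, _LAT)})


def hybrid_to_cyrillic(text):
    # Keep the Cyrillic part of every directive; the rest is already Cyrillic.
    return _ALT.sub(lambda m: m.group(2), text)


def hybrid_to_latin(text):
    # Transliterate only the text outside directives; take Latin parts verbatim.
    out, pos = [], 0
    for m in _ALT.finditer(text):
        out.append(text[pos:m.start()].translate(_TABLE))
        out.append(m.group(3))
        pos = m.end()
    out.append(text[pos:].translate(_TABLE))
    return "".join(out)


hybrid = "Различите ~@/линукс/Linux/ дистрибуције..."
print(hybrid_to_cyrillic(hybrid))
print(hybrid_to_latin(hybrid))
```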
Dialect Hybridization
Both Ekavian and Ijekavian dialects may be represented within a single
text. Such hybrid text is basically Ijekavian, but jat-reflexes are
marked by inserting one of the jat-reflex ticks ›, ‹, ▹, ◃:
Д‹ио б‹иљежака о В›јештичјој р›ијеци.
Clean Ijekavian text is then obtained by just removing jat-reflex
ticks preceding valid jat-reflexes, and Ekavian by applying the
jat-reflex map:
Дио биљежака о Вјештичјој ријеци.
Део бележака о Вештичјој реци.
The jat-reflex mapping rules are as follows, grouped by tick:
- ›ије→е, ›је→е
- ‹иј→еј, ‹иљ→ел, ‹ио→ео, ‹ље→ле, ‹ње→не
- ▹ије→и, ▹је→и
- ◃ијел→ео, ◃ијен→ењ, ◃ит→ет, ◃ил→ел, ◃јел→ео, ◃тн→тњ, ◃шње→сне
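The mapping rules can be applied with a single pass over the text, as in the following sketch. It is not the implementation of hitoe() or hitoi(): it handles only lower-case Cyrillic reflexes, issues no warnings, and leaves a tick in place when no listed reflex follows it. The names JAT_MAP and resolve_ticks are hypothetical.

```python
# Hypothetical sketch of jat-reflex resolution; rules copied from the list above.
JAT_MAP = {
    "›": [("ије", "е"), ("је", "е")],
    "‹": [("иј", "еј"), ("иљ", "ел"), ("ио", "ео"), ("ље", "ле"), ("ње", "не")],
    "▹": [("ије", "и"), ("је", "и")],
    "◃": [("ијел", "ео"), ("ијен", "ењ"), ("ит", "ет"), ("ил", "ел"),
          ("јел", "ео"), ("тн", "тњ"), ("шње", "сне")],
}


def resolve_ticks(text, ekavian):
    out, i = [], 0
    while i < len(text):
        # Non-tick characters have no rules, so the else branch copies them.
        for ije, e in JAT_MAP.get(text[i], ()):
            if text.startswith(ije, i + 1):
                out.append(e if ekavian else ije)  # map, or just drop the tick
                i += 1 + len(ije)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


hybrid = "Д‹ио б‹иљежака о В›јештичјој р›ијеци."
print(resolve_ticks(hybrid, ekavian=False))
print(resolve_ticks(hybrid, ekavian=True))
```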
For very rare special cases, it is possible to directly provide
different forms for Ekavian and Ijekavian, in that order, by using an
alternatives directive:
Гд›је с' ~#/то/ба/ пошо̑?
Compared to alternatives directives for scripts, the only difference
is that here the directive head is ~#. Alternatives directives for
script and dialect can thus be mixed without conflicts, in a single text
and even interwoven (when interweaving, different delimiters must be
used).
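The sketch below illustrates why distinct heads and distinct delimiters keep interwoven directives unambiguous. It is a simplified stand-in, not resolve_alternatives(); the function name is hypothetical and the interwoven sample string is made up for this example.

```python
import re


def resolve_sketch(text, head, select):
    """Keep alternative number `select` (1 or 2) of every directive with `head`."""
    pattern = re.compile(re.escape(head) + r"(.)(.*?)\1(.*?)\1", re.S)
    return pattern.sub(lambda m: m.group(1 + select), text)


# Dialect directive from the text above: Ekavian first, Ijekavian second.
text = "Гд›је с' ~#/то/ба/ пошо̑?"
print(resolve_sketch(text, "~#", 1))
print(resolve_sketch(text, "~#", 2))

# Made-up interwoven sample: a script directive delimited by "/" whose parts
# hold dialect directives delimited by "|". Resolution order does not matter.
mixed = "~@/~#|то|ба|/~#|to|ba|/"
print(resolve_sketch(resolve_sketch(mixed, "~#", 1), "~@", 2))
print(resolve_sketch(resolve_sketch(mixed, "~@", 2), "~#", 1))
```

Had both directives used "/" as delimiter, the outer pattern would have closed at the first inner "/" and split the text in the wrong places.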
Author:
Chusslove Illich (Часлав Илић) <caslav.ilic@gmx.net>
License:
GPLv3
ctol(text)
Transliterate text from Cyrillic to proper Latin [type F1A hook].

cltoa(text)
Transliterate text from Cyrillic or Latin to stripped ASCII [type F1A
hook].

ectol(text)
Transliterate text from English in Cyrillic by keyboard layout to
proper English [type F1A hook].

hctoc(text)
Resolve hybrid Cyrillic text with script alternatives into plain
Cyrillic text [type F1A hook].

hctol(text)
Resolve hybrid Cyrillic text with script alternatives into plain
Latin text [type F1A hook].

hctocl(htext)
Resolve hybrid Cyrillic-Latin text into clean Cyrillic and clean
Latin.
- Returns: (string, string)

cltoh(textc, textl, delims=u'/|¦', full=False)
Construct hybrid Cyrillic text out of clean Cyrillic and Latin texts.
- Returns: string

hitoe(text)
Resolve hybrid Ijekavian text with jat-reflex ticks and dialect
alternatives into plain Ekavian text [type F1A hook].

hitoeq(text)
Like hitoe, but does not output warnings on problems
[type F1A hook].

hitoi(text)
Resolve hybrid Ijekavian text with jat-reflex ticks and dialect
alternatives into plain Ijekavian text [type F1A hook].

hitoiq(text)
Like hitoi, but does not output warnings on problems
[type F1A hook].

validate_dhyb(text)
Check whether dialect-hybrid text is valid [type V1A hook].

hitoei(htext)
Resolve hybrid Ijekavian-Ekavian text into clean Ekavian and
Ijekavian.
- Returns: (string, string)

tohi(text1, text2, ekord=None, delims=u'/|¦', parthyb=False)
Construct hybrid Ijekavian text out of Ekavian and Ijekavian texts.
- Returns: string

hictoec(text)
Resolve hybrid Ijekavian-Ekavian Cyrillic-Latin text into clean
Ekavian Cyrillic text [type F1A hook].

hictoecq(text)
Like hictoec, but does not output warnings on problems
[type F1A hook].

hictoel(text)
Resolve hybrid Ijekavian-Ekavian Cyrillic-Latin text into clean
Ekavian Latin text [type F1A hook].

hictoic(text)
Resolve hybrid Ijekavian-Ekavian Cyrillic-Latin text into clean
Ijekavian Cyrillic text [type F1A hook].

hictoicq(text)
Like hictoic, but does not output warnings on problems
[type F1A hook].

hictoil(text)
Resolve hybrid Ijekavian-Ekavian Cyrillic-Latin text into clean
Ijekavian Latin text [type F1A hook].

hictoall(htext)
Resolve hybrid Ijekavian-Ekavian Cyrillic-Latin text into all four
clean variants.
- Returns: (string, string, string, string)

__package__ = 'pology.lang.sr'

hctocl(htext)
Resolve hybrid Cyrillic-Latin text into clean Cyrillic and clean
Latin.
- Parameters:
  htext (string) - hybrid text
- Returns: (string, string)
  Cyrillic and Latin texts

cltoh(textc, textl, delims=u'/|¦', full=False)
Construct hybrid Cyrillic text out of clean Cyrillic and Latin
texts.
Hybridization is performed by inserting alternatives directives for
parts which cannot be resolved by direct transliteration. If full is set
to True, complete texts are unconditionally wrapped into a single
alternatives directive.
- Parameters:
  textc (string) - Cyrillic text
  textl (string) - Latin text
  delims (string) - possible delimiter characters
  full (bool) - whether to wrap full texts as a single alternatives directive
- Returns: string
  hybrid Cyrillic text

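The idea behind this kind of hybridization can be sketched word by word, as below. This is not the algorithm used by cltoh(). The sketch assumes that both texts split into the same number of space-separated words (otherwise it wraps everything), uses a simplified transliteration table, and all of its names are hypothetical.

```python
_CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
_LAT = ["a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k", "l", "lj",
        "m", "n", "nj", "o", "p", "r", "s", "t", "ć", "u", "f", "h", "c", "č",
        "dž", "š"]
# Simplification (assumption): upper-case digraphs always become Lj, Nj, Dž.
_TABLE = {ord(c): l for c, l in zip(_CYR, _LAT)}
_TABLE.update({ord(c.upper()): l.capitalize() for c, l in zip(_CYR, _LAT)})


def _directive(cyr, lat, delims):
    # Use the first candidate delimiter that occurs in neither part.
    for d in delims:
        if d not in cyr and d not in lat:
            return "~@" + d + cyr + d + lat + d
    raise ValueError("no usable delimiter among %r" % delims)


def hybridize(textc, textl, delims="/|¦", full=False):
    wordsc, wordsl = textc.split(" "), textl.split(" ")
    if full or len(wordsc) != len(wordsl):
        return _directive(textc, textl, delims)
    out = []
    for wc, wl in zip(wordsc, wordsl):
        if wc.translate(_TABLE) == wl:
            out.append(wc)  # direct transliteration suffices
        else:
            out.append(_directive(wc, wl, delims))
    return " ".join(out)


print(hybridize("Различите линукс дистрибуције...",
                "Različite Linux distribucije..."))
```

The delimiter search shows one plausible role for a parameter like delims: a part that itself contains "/" forces a fallback to the next candidate character.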
hitoei(htext)
Resolve hybrid Ijekavian-Ekavian text into clean Ekavian and
Ijekavian.
- Parameters:
  htext (string) - hybrid text
- Returns: (string, string)
  Ekavian and Ijekavian text

tohi(text1, text2, ekord=None, delims=u'/|¦', parthyb=False)
Construct hybrid Ijekavian text out of Ekavian and Ijekavian
texts.
Hybridization is performed by merging Ekavian and Ijekavian forms into
Ijekavian forms with inserted jat-reflex ticks. Input texts can be in
both Cyrillic and Latin, even piecewise. Texts also do not have to be
clean Ekavian and Ijekavian, as hybridization is performed only at
difference segments. The order of text arguments is not important as
long as all difference segments can be merged (i.e. the function is
commutative in that case).
If a difference segment cannot be merged by jat-reflex ticks, then the
resolution depends on the ekord parameter. If it is None, then the
segment of text2 is taken into the result. If it is 1 or 2, then the
segments of text1 and text2 are combined in a dialect alternatives
directive (~#/.../.../); the number determines which segment is put
first in the directive (i.e. considered Ekavian), that of text1 or of
text2. Any other value of ekord leads to undefined behavior.
It is possible that input texts are already partially hybridized, and
only some parts of them need to be additionally hybridized. Setting
parthyb to True will tell the function to detect and skip already
hybridized segments, and hybridize only the rest.
- Parameters:
  text1 (string) - first text
  text2 (string) - second text
  ekord (None, 1, 2) - enumerates the text to be considered Ekavian when
  adding alternatives directives
  delims (string) - possible delimiter characters for alternatives directives
  parthyb (bool) - whether input texts are already partially hybridized
- Returns: string
  hybrid Ijekavian text

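The merging step and the ekord fallback can be illustrated with the word-level sketch below. It is not the algorithm of tohi(): it assumes both texts have the same number of space-separated words, handles at most one lower-case Cyrillic jat-reflex per word, uses a fixed delimiter, and ignores parthyb. All names are hypothetical.

```python
JAT_MAP = {
    "›": [("ије", "е"), ("је", "е")],
    "‹": [("иј", "еј"), ("иљ", "ел"), ("ио", "ео"), ("ље", "ле"), ("ње", "не")],
    "▹": [("ије", "и"), ("је", "и")],
    "◃": [("ијел", "ео"), ("ијен", "ењ"), ("ит", "ет"), ("ил", "ел"),
          ("јел", "ео"), ("тн", "тњ"), ("шње", "сне")],
}


def merge_word(ekav, ijek):
    """Return ijek with one tick inserted if a single rule yields ekav, else None."""
    for i in range(len(ijek)):
        for tick, rules in JAT_MAP.items():
            for ije, e in rules:
                if (ijek.startswith(ije, i)
                        and ijek[:i] + e + ijek[i + len(ije):] == ekav):
                    return ijek[:i] + tick + ijek[i:]
    return None


def tohi_sketch(text1, text2, ekord=None, delim="/"):
    out = []
    for w1, w2 in zip(text1.split(" "), text2.split(" ")):
        if w1 == w2:
            out.append(w1)
            continue
        # Try both orders, so the argument order does not matter when merging.
        merged = merge_word(w1, w2) or merge_word(w2, w1)
        if merged is None:
            if ekord is None:
                merged = w2  # keep the segment of the second text
            else:
                ek, ij = (w1, w2) if ekord == 1 else (w2, w1)
                merged = "~#" + delim + ek + delim + ij + delim
        out.append(merged)
    return " ".join(out)


print(tohi_sketch("Део бележака о Вештичјој реци.",
                  "Дио биљежака о Вјештичјој ријеци."))
```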
hictoall(htext)
Resolve hybrid Ijekavian-Ekavian Cyrillic-Latin text into all four
clean variants.
- Parameters:
  htext (string) - hybrid text
- Returns: (string, string, string, string)
  Ekavian Cyrillic, Ekavian Latin, Ijekavian Cyrillic, and
  Ijekavian Latin text