Alphabets in the grammatical dictionary

An alphabet is a named set of symbols (characters). The grammatical dictionary contains several alphabets: the basic Latin alphabet for English, Cyrillic for Russian, and extended Latin alphabets for French and Spanish.

Some common symbols, such as digits and punctuation marks, are declared in the 'neutral' alphabet, which is valid for all languages.

Alphabet declarations serve two main purposes.

First, during dictionary compilation every word entry declaration is checked for invalid characters. This check detects typos caused by look-alike characters from different alphabets. For example, Latin 'C' (U+0043) and Cyrillic 'С' (U+0421) are different letters. A typical Russian keyboard layout places these two letters on the same key, which makes it easy to insert a Russian letter into an English word, or vice versa.
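The kind of check described above can be sketched in a few lines. This is only an illustration, not the dictionary compiler's actual code: the function names `script_of` and `foreign_letters` are hypothetical, and the script is guessed from the Unicode character name rather than from the dictionary's own alphabet declarations.

```python
import unicodedata

def script_of(ch):
    """Return a coarse script label derived from the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "latin"
    if name.startswith("CYRILLIC"):
        return "cyrillic"
    # Digits, punctuation and everything else are treated as neutral here.
    return "neutral"

def foreign_letters(word, expected):
    """List (position, character, code point) for letters outside the expected script."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(word)
        if script_of(ch) not in (expected, "neutral")
    ]

# "\u0421" is the Cyrillic capital letter that looks like Latin "C".
print(foreign_letters("\u0421at", "latin"))
```

A word typed entirely on the Latin layout yields an empty list, while the look-alike Cyrillic letter is reported together with its position and code point.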

Second, each language declaration lists the alphabets that are allowed for words belonging to that language.
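One possible way to model this relationship is a mapping from language to alphabet names. The sketch below is an assumption made for illustration: the alphabet names, the abbreviated 'neutral' symbol set, and the function `is_valid_word` are hypothetical and do not reproduce the dictionary's real declaration syntax.

```python
# Hypothetical alphabet declarations (lowercase only, neutral set abbreviated).
ALPHABETS = {
    "neutral": set("0123456789.,;:!?-"),
    "latin": set("abcdefghijklmnopqrstuvwxyz"),
    "cyrillic": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
}

# Each language lists the alphabets its words may use.
LANGUAGES = {
    "English": ["latin", "neutral"],
    "Russian": ["cyrillic", "neutral"],
}

def is_valid_word(word, language):
    """Check that every character belongs to an alphabet allowed for the language."""
    allowed = set().union(*(ALPHABETS[name] for name in LANGUAGES[language]))
    return all(ch in allowed for ch in word.lower())

print(is_valid_word("cat", "English"))
print(is_valid_word("\u0421at", "English"))  # starts with Cyrillic "С"
```

Because the 'neutral' alphabet is listed for both languages, digits and punctuation pass the check regardless of the language.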

Alphabet storage

Letter declarations store symbol codes as 32-bit integers. Thus the entire UCS-4 character set is available for word entries and phrases in the grammatical dictionary.

You can explore the alphabets using the SQL version of the grammatical dictionary. Tables with the prefix ABC_ store the letter entries, their forms, and their attributes. For example, consider the following SQL query:
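To illustrate what storing symbol codes as 32-bit integers means, the following sketch packs each code point of a string into four bytes and reads it back. The helper names and the little-endian byte order are assumptions chosen for the example; the dictionary's real binary layout is not described here.

```python
import struct

def pack_codes(text):
    """Pack each code point as an unsigned 32-bit little-endian integer."""
    return struct.pack(f"<{len(text)}I", *(ord(ch) for ch in text))

def unpack_codes(data):
    """Rebuild the string from a sequence of 32-bit code points."""
    count = len(data) // 4
    return "".join(chr(code) for code in struct.unpack(f"<{count}I", data))

packed = pack_codes("Я")
print(packed.hex())           # four bytes for one letter
print(unpack_codes(packed))
```

With this representation every character occupies exactly four bytes, including characters outside the Basic Multilingual Plane, which is the same fixed-width property that UTF-32 has.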

SELECT E.id, E.name AS "letter", E.code AS "unicode", COUNT(F.id) AS "number of variants"
 FROM abc_alphabet A, abc_entry E, abc_form F
 WHERE A.name='Russian' AND E.id_alphabet=A.id AND F.id_entry=E.id
 GROUP BY E.id, E.name, E.code
 ORDER BY E.id;

The query lists the entries of the Russian alphabet with their Unicode values (decimal code points) and the number of variants. Note the entry for the letter Е: it has 4 variants (Е, е, Ё, ё), in contrast to 2 for every other letter.

id  letter  unicode  number of variants
54  А       1040     2
55  Б       1041     2
56  В       1042     2
57  Г       1043     2
58  Д       1044     2
59  Е       1045     4
60  Ж       1046     2
61  З       1047     2
62  И       1048     2
63  Й       1049     2
64  К       1050     2
65  Л       1051     2
66  М       1052     2
67  Н       1053     2
68  О       1054     2
69  П       1055     2
70  Р       1056     2
71  С       1057     2
72  Т       1058     2
73  У       1059     2
74  Ф       1060     2
75  Х       1061     2
76  Ц       1062     2
77  Ч       1063     2
78  Ш       1064     2
79  Щ       1065     2
80  Ъ       1066     2
81  Ы       1067     2
82  Ь       1068     2
83  Э       1069     2
84  Ю       1070     2
85  Я       1071     2
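The query can be tried without the full dictionary by building a tiny stand-in database. The sketch below uses Python's built-in sqlite3 module with two sample entries only. The table and column names used in the query come from the text above; the `name` column of `abc_form`, the alphabet id, the sample rows, and the `ORDER BY` clause are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE abc_alphabet (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE abc_entry (id INTEGER PRIMARY KEY, name TEXT, code INTEGER, id_alphabet INTEGER);
-- The 'name' column of abc_form is a guess made for this toy schema.
CREATE TABLE abc_form (id INTEGER PRIMARY KEY, id_entry INTEGER, name TEXT);
INSERT INTO abc_alphabet VALUES (1, 'Russian');
INSERT INTO abc_entry VALUES (54, 'А', 1040, 1), (59, 'Е', 1045, 1);
INSERT INTO abc_form (id_entry, name) VALUES
  (54, 'А'), (54, 'а'), (59, 'Е'), (59, 'е'), (59, 'Ё'), (59, 'ё');
""")

QUERY = """
SELECT E.id, E.name AS "letter", E.code AS "unicode", COUNT(F.id) AS "number of variants"
 FROM abc_alphabet A, abc_entry E, abc_form F
 WHERE A.name = ? AND E.id_alphabet = A.id AND F.id_entry = E.id
 GROUP BY E.id, E.name, E.code
 ORDER BY E.id
"""

def letter_variants(connection, alphabet):
    """Return (id, letter, code, variant count) rows for the named alphabet."""
    return connection.execute(QUERY, (alphabet,)).fetchall()

for row in letter_variants(conn, "Russian"):
    print(row)
```

The alphabet name is passed as a bound parameter instead of being embedded in the SQL string, which makes the same query reusable for other alphabets.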

Russian alphabet

The Russian alphabet contains 32 entries with 66 letters, counting lowercase and capital variants. The letters Ё and ё are treated as variants of Е.

The letters of the Russian alphabet can be divided into four groups: consonants, vowels, signs, and a semivowel.

Consonants are the largest group, containing 20 entries and 40 letters: Б В Г Д Ж З К Л М Н П Р С Т Ф Х Ц Ч Ш Щ

Vowels are a set of 10 letters in 5 pairs: Э-Е, А-Я, О-Ё, У-Ю, Ы-И

Soft and hard signs: Ь Ъ

Semivowel: Й
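The grouping above can be checked with a short script that also reproduces the totals of 32 entries and 66 letters. The names `classify`, `entries`, and `letters` are illustrative and are not part of the dictionary API.

```python
CONSONANTS = "БВГДЖЗКЛМНПРСТФХЦЧШЩ"
VOWEL_PAIRS = [("Э", "Е"), ("А", "Я"), ("О", "Ё"), ("У", "Ю"), ("Ы", "И")]
SIGNS = "ЬЪ"
SEMIVOWEL = "Й"

VOWELS = {letter for pair in VOWEL_PAIRS for letter in pair}

def classify(letter):
    """Return the group name of a Russian letter, ignoring case."""
    upper = letter.upper()
    if upper in CONSONANTS:
        return "consonant"
    if upper in VOWELS:
        return "vowel"
    if upper in SIGNS:
        return "sign"
    if upper == SEMIVOWEL:
        return "semivowel"
    raise ValueError(f"not a Russian letter: {letter!r}")

# Ё is stored as a variant of Е, so it does not get an entry of its own.
entries = set(CONSONANTS) | (VOWELS - {"Ё"}) | set(SIGNS) | {SEMIVOWEL}
# Every entry has a capital and a lowercase form; Е additionally has Ё and ё.
letters = {form for e in entries for form in (e, e.lower())} | {"Ё", "ё"}

print(len(entries), len(letters))  # 32 66
```

The count works out as 20 consonant entries, 9 vowel entries (Ё is folded into Е), 2 signs, and 1 semivowel.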

The acute accent is used as a stress mark to disambiguate different words or grammatical forms, for example по́ра-пора́. Accented letters are not included in the Russian alphabet, so the accent must be removed before calling the grammatical dictionary API.
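A minimal sketch of this preprocessing step follows, assuming the stress mark is the combining acute accent U+0301. The function name `strip_stress` is hypothetical and is not part of the dictionary API.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent used as the stress mark

def strip_stress(text):
    """Remove acute stress marks while keeping letters such as й and ё intact."""
    # NFD splits precomposed characters so that every accent is a separate code point.
    decomposed = unicodedata.normalize("NFD", text)
    # Only the acute accent is dropped; the breve of й and the diaeresis of ё survive
    # and are recomposed by the final NFC step.
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))

print(strip_stress("по\u0301ра"))
print(strip_stress("пора\u0301"))
```

Normalizing first means the function also handles text in which an accented letter arrives as a single precomposed character rather than as a letter followed by a combining mark.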

Additional information

Alphabets (Алфавиты)

  © Козиев Илья 2019
changed 05-Feb-12