An alphabet is a named set of symbols (characters). The grammatical dictionary contains several alphabets: the basic Latin alphabet for English, Cyrillic for Russian, and extended Latin alphabets for French and Spanish.
Some common symbols, such as digits and punctuation marks, are declared in the 'neutral' alphabet, which is valid for all languages.
Alphabet declarations serve two main purposes.
First, word entry declarations are checked for invalid characters during dictionary compilation. This check detects typos that involve look-alike characters from different languages. For example, 'C' and 'С' are different letters with the codes U+0043 and U+0421. The typical keyboard layout for the Russian locale places these two letters on the same key, which makes it easy to insert a Russian letter into an English word, or vice versa.
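The check described above can be sketched in a few lines of Python. This is only an illustration, not the dictionary compiler's actual code: the `ALPHABETS` table and the `find_foreign_chars` helper are hypothetical names, and the letter sets are simplified assumptions.

```python
import string

# Hypothetical, simplified alphabets (assumption); the real dictionary
# declares its own alphabets and letter entries.
ALPHABETS = {
    "English": set(string.ascii_letters),
    # U+0410..U+044F covers the basic Russian letters; Ё and ё are added separately.
    "Russian": {chr(c) for c in range(0x0410, 0x0450)} | {"\u0401", "\u0451"},
}


def find_foreign_chars(word, alphabet_name):
    """Return (position, character, code point) for each character
    of `word` that does not belong to the named alphabet."""
    allowed = ALPHABETS[alphabet_name]
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(word)
        if ch not in allowed
    ]


latin_word = "CAT"        # all three letters are Latin
mixed_word = "\u0421AT"   # first letter is Cyrillic С (U+0421), a look-alike of Latin C

print(find_foreign_chars(latin_word, "English"))
print(find_foreign_chars(mixed_word, "English"))
```

The second call reports the Cyrillic letter at position 0, which is exactly the kind of typo that is invisible to the eye but easy to catch by comparing codes.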
Second, each language declaration includes the set of alphabets that are allowed for words belonging to that language.
Letter declarations store symbol codes as 32-bit integers, so the whole set of UCS-4 symbols is available for word entries and phrases in the grammatical dictionary.
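To illustrate what a 32-bit symbol code means in practice, here is a minimal Python sketch. The helper names `to_ucs4_codes` and `pack_ucs4` are hypothetical, and the little-endian layout is an assumption made for the example; the dictionary's real storage format is not described here.

```python
import struct


def to_ucs4_codes(text):
    """Return the code point of every character as a plain integer."""
    return [ord(ch) for ch in text]


def pack_ucs4(text):
    """Pack the code points as unsigned 32-bit little-endian integers
    (the byte order is an assumption for this sketch)."""
    codes = to_ucs4_codes(text)
    return struct.pack(f"<{len(codes)}I", *codes)


# Latin C, Cyrillic С, and a symbol outside the 16-bit range (U+10348).
sample = "C\u0421\U00010348"
print([f"U+{code:04X}" for code in to_ucs4_codes(sample)])
print(len(pack_ucs4(sample)))  # four bytes per symbol
```

Every symbol takes exactly four bytes, including those beyond U+FFFF that would need a surrogate pair in a 16-bit encoding.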
You can explore the alphabets using the SQL version of the grammatical dictionary. The tables with the ABC_ prefix store the letter entries, their forms and their attributes. For example, consider the following SQL query:
SELECT E.id, E.name AS "letter", E.code AS "unicode", COUNT(F.id) AS "number of variants"
FROM abc_alphabet A, abc_entry E, abc_form F
WHERE A.name='Russian' AND E.id_alphabet=A.id AND F.id_entry=E.id
GROUP BY E.id, E.name, E.code
The query lists the entries of the Russian alphabet with their Unicode values and the number of variants for each entry. Note the letter Ё: its entry has 4 variants, in contrast to 2 for the other letters.
| id | letter | unicode | number of variants |
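The query can be tried without the real dictionary database by using a toy SQLite schema in Python. The table and column names below are taken from the query itself; the column types, the `name` column of `abc_form` and the two sample entries are assumptions made only for this sketch. An `ORDER BY` clause is added to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy schema (assumption): only the columns the query refers to,
# plus a hypothetical `name` column holding each letter form.
conn.executescript("""
CREATE TABLE abc_alphabet (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE abc_entry (id INTEGER PRIMARY KEY, name TEXT, code INTEGER, id_alphabet INTEGER);
CREATE TABLE abc_form (id INTEGER PRIMARY KEY, id_entry INTEGER, name TEXT);
""")

conn.execute("INSERT INTO abc_alphabet VALUES (1, 'Russian')")

# Two sample entries: А (U+0410) and Е (U+0415).
conn.executemany(
    "INSERT INTO abc_entry VALUES (?, ?, ?, 1)",
    [(1, "\u0410", 0x0410), (2, "\u0415", 0x0415)],
)

# Forms: А а for the first entry; Е е Ё ё for the second.
forms = [(1, "\u0410"), (1, "\u0430"),
         (2, "\u0415"), (2, "\u0435"), (2, "\u0401"), (2, "\u0451")]
conn.executemany("INSERT INTO abc_form (id_entry, name) VALUES (?, ?)", forms)

rows = conn.execute("""
SELECT E.id, E.name AS "letter", E.code AS "unicode", COUNT(F.id) AS "number of variants"
FROM abc_alphabet A, abc_entry E, abc_form F
WHERE A.name='Russian' AND E.id_alphabet=A.id AND F.id_entry=E.id
GROUP BY E.id, E.name, E.code
ORDER BY E.id
""").fetchall()

for row in rows:
    print(row)
```

With this sample data the entry for Е is reported with 4 variants and the entry for А with 2, mirroring the pattern described for the full alphabet.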
The Russian alphabet contains 32 entries with 66 letters, counting both lowercase and capital variants. The letters Ё and ё are treated as variants of Е.
The letters of the Russian alphabet can be divided into four groups: consonants, vowels, signs and a semivowel (Й).
Consonants form the largest group, with 20 entries and 40 letters: Б В Г Д Ж З К Л М Н П Р С Т Ф Х Ц Ч Ш Щ
Vowels form a set of 10 letters in 5 pairs: Э-Е, А-Я, О-Ё, У-Ю, Ы-И
Soft and hard signs: Ь Ъ
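The numbers above can be cross-checked with a short Python sketch. It assumes that the semivowel group consists of the single letter Й and that Ё is folded into the entry for Е, as stated earlier; the variable names are hypothetical.

```python
# Letter groups as listed above (capital letters only).
CONSONANTS = "БВГДЖЗКЛМНПРСТФХЦЧШЩ"
VOWELS = "ЭЕАЯОЁУЮЫИ"
SIGNS = "ЬЪ"
SEMIVOWEL = "Й"  # assumption: the semivowel group is the single letter Й

capitals = CONSONANTS + VOWELS + SIGNS + SEMIVOWEL

# Entries: Ё is treated as a variant of Е, so it does not get its own entry.
entries = {"Е" if ch == "Ё" else ch for ch in capitals}

# Letters: every capital letter plus its lowercase counterpart.
letters = set(capitals) | {ch.lower() for ch in capitals}

print(len(CONSONANTS), len(VOWELS), len(SIGNS), len(SEMIVOWEL))
print(len(entries), len(letters))
```

The 33 capital letters collapse into 32 entries once Ё is merged with Е, while the total number of letter variants stays at 66.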
The acute accent is used as a stress mark to disambiguate different words or grammatical forms, for example по́ра-пора́. Accented letters are not included in the Russian alphabet, so the accent must be removed before calling the grammatical dictionary API.
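One possible way to remove the stress mark before passing text to the dictionary is sketched below in Python. The function name `strip_stress` is hypothetical and is not part of the grammatical dictionary API; the sketch assumes the stress is encoded as the combining acute accent U+0301.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"


def strip_stress(text):
    """Remove combining acute accents while keeping letters such as Ё and Й intact."""
    # Decompose so that any precomposed accented letter exposes its accent.
    decomposed = unicodedata.normalize("NFD", text)
    without_accent = decomposed.replace(COMBINING_ACUTE, "")
    # Recompose so that Ё and Й return to their single-code-point forms.
    return unicodedata.normalize("NFC", without_accent)


stressed_first = "по\u0301ра"   # stress on the first syllable
stressed_last = "пора\u0301"    # stress on the last syllable

print(strip_stress(stressed_first) == strip_stress(stressed_last))
```

Both stressed spellings reduce to the same unaccented string. Note that this sketch also strips the acute accent from Latin letters such as é, so it should be applied only to Russian text.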
© Козиев Илья 2019