18 September, 2012

A MySQL Unicode collation for Maltese

If you’ve tried to store and retrieve Maltese text in a MySQL database before, you may have noticed that there is no way to sort it correctly according to the Maltese alphabet.

The utf8_unicode_ci collation treats pairs such as g and ġ as interchangeable, which is of course not right. You can try utf8_bin, but since this sorts according to Unicode codepoints, ċ, ġ, ħ and ż get sorted after the letter z, which is even worse (although it does at least mean you can search for ħ without getting h back too).

What you really need is a custom collation for the Maltese alphabet. There isn’t one built in, but luckily MySQL makes adding custom collations relatively painless. So I went ahead and implemented such a collation for Maltese, and called it utf8_maltese_ci. You can find the code, along with detailed installation and usage instructions, at the GitHub repository for the Maltese MySQL collation.
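To illustrate the ordering such a collation has to produce, here is a minimal Python sketch that sorts words by the Maltese alphabet. It is not the utf8_maltese_ci implementation, which lives in MySQL's collation configuration. The names MALTESE_ALPHABET and maltese_key are hypothetical, and treating għ and ie as single letters is an assumption of this sketch.

```python
import re

# Maltese alphabet in its conventional order. Treating the digraphs "għ" and
# "ie" as single letters is an assumption of this sketch.
MALTESE_ALPHABET = [
    "a", "b", "ċ", "d", "e", "f", "ġ", "g", "għ", "h", "ħ", "i", "ie", "j",
    "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
    "ż", "z",
]
RANK = {letter: i for i, letter in enumerate(MALTESE_ALPHABET)}

# Try the digraphs first, then fall back to any single character.
LETTER = re.compile(r"għ|ie|.", re.DOTALL)


def maltese_key(word):
    """Return a list of alphabet ranks usable as a case-insensitive sort key."""
    letters = LETTER.findall(word.lower())
    # Characters outside the alphabet sort after every Maltese letter.
    return [RANK.get(l, len(RANK) + ord(l[0])) for l in letters]


words = ["zokk", "żarbun", "gawda", "ġar", "hawn", "ħobż"]
by_codepoint = sorted(words)                   # roughly what utf8_bin does
by_alphabet = sorted(words, key=maltese_key)   # ġ before g, ż before z
```

Sorting by codepoint pushes ġar, ħobż and żarbun to the end of the list, whereas the key function places each of them next to its neighbours in the alphabet.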

24 August, 2012

Vowel length and negation

Continuing the previous post about vowel lengths, here are some remarks about the handling of the long vowel ie under negation (which is after all the suffixation of the letter x).

Consider the verbs waqaf, kiel, and ħa. Note that the latter two are irregular; however, I think they are still valid for the point I want to make. Their imperfect forms all consist of a stem which begins with the long vowel ie: nieqaf, tiekol, jieħu. Does this vowel get shortened under negation? Let’s see what the Maltese corpus has to say about this:


Prefix | -ieqaf | -ieqafx | -iqafx | -ieqfu | -ieqfux | -iqfux
n- | 1070 | 21 | 26 | 850 | 24 | 12
t- | 2124 | 116 | 53 | 23 | 3 | 2
j- | 2828 | 90 | 102 | 1390 | 51 | 58
Totals | 6022 | 227 | 281 | 2263 | 78 | 72


Prefix | -iekol | -iekolx | -ikolx | -ieklu | -ieklux | -iklux
n- | 292 | 3 | 8 | 752 | 7 | 8
t- | 935 | 18 | 18 | 75 | 1 | 7
j- | 1339 | 16 | 21 | 1747 | 30 | 18
Totals | 2566 | 37 | 47 | 2574 | 38 | 33


Prefix | -ieħu | -ieħux | -iħux | -ieħdu | -ieħdux | -iħdux
n- | 7191 | 23 | 58 | 11215 | 24 | 38
t- | 17643 | 101 | 163 | 631 | 6 | 5
j- | 33070 | 155 | 204 | 22682 | 113 | 103
Totals | 57904 | 279 | 425 | 34528 | 143 | 146

These are the totals of the negative forms, as percentages of the total occurrences of the corresponding positive form:

Verb | Singular IE | Singular I | Plural IE | Plural I
waqaf | 3.76% | 4.66% | 3.45% | 3.18%
kiel | 1.44% | 1.83% | 1.48% | 1.28%
ħa | 0.48% | 0.73% | 0.41% | 0.42%
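As a worked example of how these percentages are derived, the following Python sketch recomputes the kiel row from the totals in its count table. The dictionary layout and the function name negative_share are hypothetical, chosen only for illustration.

```python
# Totals for kiel, copied from the "Totals" row of its count table.
kiel = {
    "singular": {"positive": 2566, "neg_ie": 37, "neg_i": 47},
    "plural": {"positive": 2574, "neg_ie": 38, "neg_i": 33},
}


def negative_share(negative, positive):
    """Negative-form count as a percentage of the positive-form count."""
    return round(100 * negative / positive, 2)


# One percentage per (number, negative spelling) combination.
shares = {
    (number, form): negative_share(counts[form], counts["positive"])
    for number, counts in kiel.items()
    for form in ("neg_ie", "neg_i")
}
```

Here "neg_ie" stands for the negative spelled with the long vowel (e.g. -iekolx) and "neg_i" for the shortened spelling (e.g. -ikolx).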

So what do all these numbers mean?
When considering the singular negative, the version without the long ie vowel is more common in all cases. As an example, ma nikolx is more common than ma niekolx, which would indicate that the former is really the correct form.

In the plural though, it’s almost the complete opposite. To continue our example, this means that ma nieklux is slightly more frequent than ma niklux. However, the difference in frequency is less pronounced: 7% in plural compared to 12% in singular for the given example.
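The 7% and 12% figures are consistent with taking the difference between the two negative totals for kiel as a share of their combined count. That reading is an inference from the numbers rather than something stated explicitly, and the function name relative_gap is hypothetical.

```python
def relative_gap(a, b):
    """Difference between two counts as a whole-number percentage of their sum."""
    return round(100 * abs(a - b) / (a + b))


# Totals for kiel: plural -ieklux (38) vs -iklux (33),
# singular -iekolx (37) vs -ikolx (47).
plural_gap = relative_gap(38, 33)
singular_gap = relative_gap(37, 47)
```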

So here we have another indication of the correct spelling, but not exactly hard evidence. The more I try to rely on the corpus for these things, the more apparent it becomes that it is not really a good settler of questions of minor orthographic differences.

21 August, 2012

Vowel length and pronominal suffixes in Maltese

Vowel length in Maltese seems to be one of those tricky things. The combination of pronominal suffixes with verbs ending in ‘a’ is a good example.

Direct Object suffixes

Think of the single verb form for “we saw you”: rajniek. Or should that be rajnik? Based on how it sounds as a native speaker, the latter shorter-vowel version seems more likely.

The Maltese corpus is not much help in deciding this. Just look at these frequency counts for tokens ending in jniek and jnik:

Rank | Token | Count
1 | tajniek | 4
2 | rrispondejniek | 3
3 | smajniek | 3
4 | qtajniek | 2
5 | rajniek | 2
6 | drajniek | 1
7 | ħabbejniek | 1
8 | obdejniek | 1

Rank | Token | Count
1 | tajnik | 6
2 | għabbejnik | 2
3 | mejnik | 2
4 | rajnik | 2
5 | staqsejnik | 2
6 | avviċinajnik | 1
7 | għaddejnik | 1
8 | kkritikajnik | 1
9 | ħallejnik | 1

In total, jniek occurs 17 times and jnik occurs 18 times. Note also the fairly even split of the words which appear in both lists: tajniek (4) vs. tajnik (6), and rajniek (2) vs. rajnik (2).
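The totals and the overlap between the two lists can be checked with a short Python sketch. The counts are copied from the two tables above; the variable names and the stem helper are hypothetical.

```python
# Token counts copied from the two frequency tables.
jniek = {
    "tajniek": 4, "rrispondejniek": 3, "smajniek": 3, "qtajniek": 2,
    "rajniek": 2, "drajniek": 1, "ħabbejniek": 1, "obdejniek": 1,
}
jnik = {
    "tajnik": 6, "għabbejnik": 2, "mejnik": 2, "rajnik": 2, "staqsejnik": 2,
    "avviċinajnik": 1, "għaddejnik": 1, "kkritikajnik": 1, "ħallejnik": 1,
}

total_jniek = sum(jniek.values())
total_jnik = sum(jnik.values())


def stem(token, ending):
    """Strip a known ending so the two spellings can be compared."""
    return token[: -len(ending)]


# Stems that are attested with both the long and the short vowel.
shared_stems = {stem(t, "iek") for t in jniek} & {stem(t, "ik") for t in jnik}
```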

But it turns out there is an explicit rule for this. According to “Grammatika Maltija” pg 166, whenever a verb ending in ‘a’ is going to have a pronominal suffix attached to it, the joining vowel becomes an ‘ie’. So tajna + ek = tajniek, even though when you say it it sounds a lot more like tajnik. The results from the corpus seem to confirm that I’m not the only one confused by this, although admittedly the numbers are probably too low to be statistically significant. While counter-intuitive, this rule seems pretty established, so we just accept it.
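The rule as quoted can be sketched in a few lines of Python. This is only an illustration of the tajna + ek = tajniek example, not a full treatment of Maltese suffixation; the function name attach_suffix and the way the suffix's own vowel is dropped are assumptions of the sketch.

```python
VOWELS = "aeiou"


def attach_suffix(verb, suffix):
    """Join a verb form ending in 'a' to a pronominal suffix via 'ie'."""
    if not verb.endswith("a"):
        raise ValueError("this sketch only covers verb forms ending in 'a'")
    # If the suffix starts with a vowel, the joining 'ie' takes its place.
    tail = suffix[1:] if suffix and suffix[0] in VOWELS else suffix
    return verb[:-1] + "ie" + tail


we_gave_you = attach_suffix("tajna", "ek")
we_saw_you = attach_suffix("rajna", "ek")
```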

Indirect Object suffixes

What about indirect pronominal suffixes? Think of “we sang for you”, kantajnielek. Or is that kantajnilek? Again, the latter sounds like a more accurate transcription of the spoken form. The corpus reports 11 occurrences of tokens ending in jnielek, and 10 for jnilek: another even split that is too small to be statistically significant. “Maltese” by Borg and Azzopardi-Alexander claims the former is correct, with an ‘ie’.

Direct and Indirect Object suffixes

And what happens when you have both a direct and an indirect pronominal suffix? The information is much more polarised. Using the rule above, as in “Maltese”, the ‘ie’ remains. So you have the forms kantajniehulek and ftaħniehulek.

But the corpus contains exactly zero tokens which end with iehulek, and a whopping 92 which finish with ihulek. In this case the two sources directly contradict each other. Some personal communication on the Kelmet il-Malti Facebook group confirms that the above rule no longer applies, and the more natural principle of vowel length comes into play again. So kantajnihulek and ftaħnihulek are the correct forms, and the book is wrong.

15 August, 2012

Liquid-medial strong verbs beginning with għ

Liquid-medial verbs are a subclass of the strong Maltese Semitic verbs, which have a liquid consonant (għ, l, m, n, r) as their second radical. Their paradigm is slightly different in that they sometimes require an extra vowel in conjugation. Whether this vowel is morphological or euphonic, I don’t know. Not all sources identify them as a subclass; some simply claim the vowel is inserted euphonically as needed. However, when the first radical is GĦ, this extra vowel is dropped again:

Class | Root | Mamma (Perf P3 Sg Masc) | Imperfect P1 Sg | Imperfect P1 Pl | Template (prev column)
Strong Regular | K-T-B | kiteb | nikteb | niktbu | nvCCCv
Strong Liquid-Medial | S-R-Q | seraq | nisraq | nisirqu | nvCvCCv
Strong Liquid-Medial | GĦ-M-L | għamel | nagħmel | nagħmlu | nvCCCv

This also crops up when adding some indirect object suffixes (P3 Sg Fem, and all Pl) in imperative/imperfect:

Class | Root | Imperfect P2 Sg | Imperfect P2 Sg + I.O. P1 Sg | Imperfect P2 Sg + I.O. P1 Pl | Template (prev column)
Strong Regular | K-T-B | tikteb | tiktibli | tiktbilna | tvCCCilna
Strong Liquid-Medial | S-R-Q | tisraq | tisraqli | tisraqilna | tvCCvCilna
Strong Liquid-Medial | GĦ-M-L | tagħmel | tagħmilli | tagħmlilna | tvCCCilna
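The templates in both tables can be derived mechanically from the surface forms. The Python sketch below abstracts everything between a literal prefix and an optional literal suffix into v (vowel) and C (consonant) slots, counting għ as a single consonant. The function name template and its parameters are hypothetical, and the sketch does not handle the long vowel ie.

```python
import re

VOWELS = set("aeiou")
SEGMENT = re.compile(r"għ|.")  # "għ" counts as one consonant


def template(form, prefix_len=1, suffix=""):
    """Abstract a form to v/C slots, keeping its prefix and suffix literal."""
    if not form.endswith(suffix):
        raise ValueError(f"{form!r} does not end in {suffix!r}")
    core = form[prefix_len:len(form) - len(suffix)]
    slots = "".join(
        "v" if segment in VOWELS else "C" for segment in SEGMENT.findall(core)
    )
    return form[:prefix_len] + slots + suffix


plural_templates = {form: template(form) for form in ("niktbu", "nisirqu", "nagħmlu")}
suffixed_templates = {
    form: template(form, suffix="ilna")
    for form in ("tiktbilna", "tisraqilna", "tagħmlilna")
}
```

The extra v in the S-R-Q templates is the inserted vowel; it is absent from both the regular K-T-B forms and the GĦ-M-L forms.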

8 August, 2012

Vowel-change patterns in the Maltese “strong” verb (sħiħ)

As with the post on vowel change patterns in hollow verbs, here is another study of vowel changes, this time in the strong verb. Specifically, I am looking at verbs with the vowel pattern “a-a” in the mamma, i.e. rabat, talab, lagħab etc. Again, there are many more strong “a-a” verbs than the ones listed here, but I chose the ones which to me are most common.

English | Mamma (Perf P3 Sg Masc) | Root | Imperative P2 Sg | Vowel Pattern in Imperative
to scratch | barax | B-R-X | obrox | O-O
to forecast | basar | B-S-R | obsor | O-O
to deny | ċaħad | Ċ-Ħ-D | iċħad | I-A
to enter | daħal | D-Ħ-L | idħol | I-O
to gather | ġabar | Ġ-B-R | iġbor | I-O
to hit | ħabat | Ħ-B-T | aħbat | A-A
to escape | ħarab | Ħ-R-B | aħrab | A-A
to grab | ħataf | Ħ-T-F | aħtaf | A-A
to reach | laħaq | L-Ħ-Q | ilħaq | I-A
to play | lagħab | L-GĦ-B | ilgħab | I-A
to hit | laqat | L-Q-T | olqot | O-O
to ensnare | nasab | N-S-B | onsob | O-O
to catch | qabad | Q-B-D | aqbad | A-A
to cross | qasam | Q-S-M | aqsam | A-A
to tie | rabat | R-B-T | orbot | O-O
to sleep | raqad | R-Q-D | orqod | O-O
to warm | saħan | S-Ħ-N | isħon | I-O
to mill | taħan | T-Ħ-N | itħan | I-A
to pray | talab | T-L-B | itlob | I-O
to prune | żabar | Ż-B-R | iżbor | I-O

The following vowel-change patterns emerge:

Vowel pattern in Imperative | Applicable verbs
A-A | ħabat, ħarab, ħataf, qabad, qasam
I-A | ċaħad, laħaq, lagħab, taħan
I-O | daħal, ġabar, saħan, talab, żabar
O-O | barax, basar, laqat, nasab, rabat, raqad

But unfortunately, I cannot intuitively find any syntactic pattern in the examples above; you just need to know.
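The grouping by imperative vowel pattern can be reproduced directly from the table of verbs. This Python sketch is only a bookkeeping aid; the names VERBS and vowel_pattern are hypothetical.

```python
from collections import defaultdict

# (mamma, imperative P2 Sg) pairs copied from the table of verbs.
VERBS = [
    ("barax", "obrox"), ("basar", "obsor"), ("ċaħad", "iċħad"),
    ("daħal", "idħol"), ("ġabar", "iġbor"), ("ħabat", "aħbat"),
    ("ħarab", "aħrab"), ("ħataf", "aħtaf"), ("laħaq", "ilħaq"),
    ("lagħab", "ilgħab"), ("laqat", "olqot"), ("nasab", "onsob"),
    ("qabad", "aqbad"), ("qasam", "aqsam"), ("rabat", "orbot"),
    ("raqad", "orqod"), ("saħan", "isħon"), ("taħan", "itħan"),
    ("talab", "itlob"), ("żabar", "iżbor"),
]


def vowel_pattern(form):
    """Return the vowels of a form in order, e.g. 'obrox' -> 'O-O'."""
    return "-".join(ch.upper() for ch in form if ch in "aeiou")


# Group each mamma under the vowel pattern of its imperative.
groups = defaultdict(list)
for mamma, imperative in VERBS:
    groups[vowel_pattern(imperative)].append(mamma)
```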

Hundreds of forms, but nowhere to check them

A previous post showed just how many inflectional forms there are for a single verb in Maltese. But while writing the algorithms for producing such tables, I repeatedly find that for many of these forms there is no real way of checking them for correctness, because no other resource lists them.

There is the Korpus Malti, but despite containing nearly 100 million tokens, there are numerous grammatically-correct verb forms which do not occur anywhere in the corpus. No traditional dictionary would contain every possible inflected form for each verb, for reasons of size, so in many cases I must simply resort to “best guesses” and intuition. There are so-called verb models which are used in Maltese verb conjugations, e.g. the verb lagħab (he played) should be conjugated as seraq (he stole), but that only covers radical-placement, not vowel changes. For example, which is correct: naqtgħak or naqtgħek? The former does not appear at all in the corpus, and the latter appears just once, from a public blog entry. Not exactly hard evidence, is it?

26 July, 2012

Full inflection table of a Maltese verb

The inflection table of the Maltese verb is formidable. Apart from tense/aspect and person, number & gender, a Maltese verb can also take suffixes for a direct object, for an indirect object, or for both a direct and indirect object. Add to this the “-x” suffix when the verb is negated, and you end up with no fewer than 952 unique forms for a single verb (the total number of combinations is 1152, but some combinations are non-existent).

Here’s the full table for the verb fetaħ (he opened).


19 July, 2012

Strongly-integrated loan verbs and weak-final quadriconsonantal roots

Splitting quadriliteral verbs into strong and weak is not universal in the literature. Borg and Azzopardi-Alexander, at least, make no mention of it, though their treatment of quad verbs feels a little lacking to me. But they do make the following distinctions:

  1. Repeated bi-radical base, e.g. GEMGEM (G-M-G-M)
  2. Repeated third radical (C3), e.g. GERBEB (G-R-B-B)
  3. Repeated first radical (C1) after the second (C2), e.g. ŻERŻAQ (Ż-R-Ż-Q)
  4. Addition of a fourth radical to a triradical base, e.g. ĦARBAT (Ħ-R-B-T)

They make no reference to weak radicals in quad verbs. They then go on to discuss “strongly-integrated loan verbs”, i.e. verbs of Romance or even possibly English origin which have taken on completely regular Semitic-style morphology. The examples given are KANTA, VINĊA, and SERVA, which correspond to the 3 different verb endings in Italian (cantare, vincere, and servire respectively).

Spagnol agrees with this, but goes further and actually classifies these verbs as quadriliteral verbs with the weak consonant J as the fourth radical. Here’s a table of some of the most common ones, including ones for which I could find no Romance origin word.

English | Romance origin | Għerq (Root) | Mamma (Perf P3 Sg Masc) | Imperative P2 Sg | Perfect P1 Sg | Perfect P3 Sg Fem
to sing | cantare | K-N-T-J | kanta | kanta | kantajt | kantat
to serve | servire | S-R-V-J | serva | servi | servejt | serviet
to win | vincere | V-N-Ċ-J | vinċa | vinċi | vinċejt | vinċiet
to ask | - | S-Q-S-J | saqsa | saqsi | saqsejt | saqsiet
to draw | - | P-N-Ġ-J | pinġa | pinġi | pinġejt | pinġiet
to enjoy | godere | G-W-D-J | gawda | gawdi | gawdejt | gawdiet
to talk | parlare | P-R-L-J | parla | parla | parlajt | parlat
to complete | - | L-S-T-J | lesta | lesti | lestejt | lestiet
to vary | variare | V-R-J-J | varja | varja | varjajt | varjat

Looking at the vowel patterns, we end up with a very neat division:

Romance ending | Mamma (Perf P3 Sg Masc) | Imperative P2 Sg | Perfect P1 Sg | Perfect P3 Sg Fem
-are | a | a | a | a
-ire/-ere/- | a | i | e | ie

In other words, the vowel patterns are always the same, except for when the verb derives from a Romance -are verb.
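The two-way division can be expressed as a small lookup. The following Python sketch generates the four tabulated forms from a stem; the names ENDINGS and loan_verb_forms are hypothetical, and the full endings (-ajt, -at, -ejt, -iet) are read off the forms in the table rather than taken from any grammar.

```python
# Endings for (mamma, imperative P2 Sg, perfect P1 Sg, perfect P3 Sg Fem),
# read off the tabulated forms.
ENDINGS = {
    True: ("a", "a", "ajt", "at"),    # verbs derived from Romance -are
    False: ("a", "i", "ejt", "iet"),  # all other verbs in the table
}


def loan_verb_forms(stem, from_are):
    """Build the four tabulated forms from a stem such as 'kant' or 'serv'."""
    return tuple(stem + ending for ending in ENDINGS[from_are])


kanta = loan_verb_forms("kant", from_are=True)
serva = loan_verb_forms("serv", from_are=False)
saqsa = loan_verb_forms("saqs", from_are=False)
```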

16 July, 2012

Vowel-change patterns in the Maltese “hollow” verb (moħfi)

The behaviour of consonant radicals in Maltese morphology is always predictable, but vowel changes are a lot less so. Consider this list of Maltese “hollow verbs”: that is, verbs where the middle radical is the weak consonant w or j (of course there are many more hollow verbs than the ones listed here, but I chose the ones which to me are most “common”).

English | Mamma (Perf P3 Sg Masc) | Għerq (Root) | Perfect P1 Sg | Imperative P2 Sg
to urinate | biel | B-W-L | bilt | bul
to kiss | bies | B-W-S | bist | bus
to take long | dam | D-W-M | domt | dum
to turn | dar | D-W-R | dort | dur
to taste | daq | D-W-Q | doqt | duq
to melt | dab | D-W-B | dobt | dub
to heal | fieq | F-J-Q | fiqt | fiq
to overflow | far | F-W-R | fort | fur
to bring | ġab | Ġ-J-B | ġibt | ġib
to sew | ħiet | Ħ-J-T | ħitt | ħit
to die | miet | M-W-T | mitt | mut
to wake up | qam | Q-W-M | qomt | qum
to want | ried | R-J-D | ridt | rid
to find | sab | S-J-B | sibt | sib
to become ready | sar | S-J-R | sirt | sir
to fast | sam | S-W-M | somt | sum
to drive | saq | S-W-Q | soqt | suq
to fly | tar | T-J-R | tirt | tir
to increase | żied | Ż-J-D | żidt | żid

The following vowel-change patterns emerge:

Long vowel in base form | Middle radical | Vowel in Perfect | Vowel in Imperative/Imperfect | Applicable verbs
a | W | o | u | dab, dam, dar, daq, far, qam, sam, saq
a | J | i | i | ġab, sab, sar, tar
ie | J | i | i | fieq, ħiet, ried, żied
ie | J | i | u | biel, bies
ie | W | i | u | miet

Conclusions from this minor study:

  1. The long vowel in the base form (mamma) does not necessarily determine the middle radical.
  2. Even the base form combined with the root is not enough to determine the vowel changes in the perfect and imperative forms, as in the cases of fieq and biel. In such cases the imperative must be specified explicitly.

13 July, 2012

More detail than required

I consider it a design principle, in the Maltese resource grammar for GF, to choose linguistic correctness over representational efficiency. I think at many points I will be confronted with the possibility of combining certain linguistic subdivisions, or of leaving out bits of linguistic information completely, because they are not required by the GF RGL API or simply make things more complicated internally. But if I want this project to be a valuable contribution to the body of computational resources for Maltese (and not just to GF), then I need to include more than just “what is necessary”. This also goes well with the vision of being able to extract linguistic information out of the resource grammar and use it elsewhere.

