“the j stands for Jewry”

15 February, 2013

Determinate and indeterminate noun plural forms

Maltese nouns have two potential plural forms, determinate and indeterminate. The distinction is exhibited in examples such as:

English | Singulative | Determinate plural | Indeterminate plural
road    | triq        | triqat             | toroq
tooth   | sinna       | sinniet            | snien

However, it seems that in reality there are very few nouns which actually have both forms. An analysis of the 184 nouns in the GF Resource Grammar Library mini-lexicon shows that:

  • 14 (~8%) have both forms, though I would argue that many of these sound kind of arcane, e.g. ġbiel (ġebliet), xgħur (xagħariet), għejun (għajnejn).
  • 158 (~86%) have just a determinate plural
  • 3 (~2%) have just an indeterminate plural
  • 9 (~5%) have neither plural form. This is usually compensated for by a collective form (e.g. baqar), a dual (e.g. riġlejn) or simply a singulative (e.g. plastik).
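
The shares above can be recomputed from the raw counts. This is only a small Python sketch of the arithmetic; the dictionary labels are made up for illustration and are not names from the GF library.

```python
# Category counts as reported for the 184-noun mini-lexicon.
counts = {
    "both plural forms": 14,
    "determinate plural only": 158,
    "indeterminate plural only": 3,
    "neither plural form": 9,
}

total = sum(counts.values())


def share(n, total):
    """Return n as a percentage of total, rounded to the nearest integer."""
    return round(100 * n / total)


shares = {label: share(n, total) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {counts[label]} (~{pct}%)")
```

Because each share is rounded separately, the rounded percentages need not add up to exactly 100.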

While this distinction can have some linguistic importance, for the purposes of the GF implementation it will be simplified slightly by storing only one plural form. This change will be made internally in the noun representation, so that the paradigm constructors are not affected and as such we still have this information available (although it is just being ignored for our purposes).

Another solution is to have indeterminate plural forms stored simply as variants of the determinate plural. I think that in most cases one could get away with this, though for now I am steering clear of all variants just to keep testing simple.
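
To make the first option concrete, here is a rough Python sketch (not GF code) of a constructor that still accepts both plurals while the internal record keeps only one. The names Noun and mk_noun are hypothetical, and the choice to prefer the determinate plural and fall back to the indeterminate one is an assumption, not something decided above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Noun:
    """Internal representation: only one plural form is kept."""
    singulative: str
    plural: Optional[str]


def mk_noun(singulative, det_plural=None, indet_plural=None):
    """Hypothetical paradigm constructor.

    Callers may still pass both plurals, so the lexicon keeps the
    information, but only one of them reaches the internal record.
    Assumption: prefer the determinate plural, else the indeterminate.
    """
    plural = det_plural if det_plural is not None else indet_plural
    return Noun(singulative, plural)


triq = mk_noun("triq", det_plural="triqat", indet_plural="toroq")
print(triq)
```

The constructor signature never changes, so switching later to the variants approach would only touch the body of mk_noun.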

20 January, 2013

Removing inferred roots from verb smart paradigms

In the Maltese resource grammar implementation I had some code which tried to extract the radicals from a so-called mamma verb form. So for example:

classifyVerb "ħareġ"

would give (amongst other information) the radicals Ħ-R-Ġ in record form. This works well most of the time, except for cases where it is completely impossible to guess the missing radicals from weak-root verbs. For example, dar is actually the mamma of two distinct verbs, one with root D-W-R and another with D-J-R.
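
The ambiguity is easy to reproduce with a deliberately naive extractor. The Python sketch below is not the GF code; letters and guess_roots are hypothetical helpers, and the rule "two surviving consonants means a missing middle W or J" is a simplifying assumption.

```python
VOWELS = {"a", "e", "i", "o", "u", "ie"}


def letters(word):
    """Split a word into letters, keeping the digraphs għ and ie together."""
    out, i = [], 0
    while i < len(word):
        if word[i:i + 2] in ("għ", "ie"):
            out.append(word[i:i + 2])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def guess_roots(mamma):
    """Naively guess candidate roots by dropping the vowels of the mamma.

    If only two consonants survive, assume a weak middle radical (W or J)
    is missing and return both candidates.
    """
    consonants = [l.upper() for l in letters(mamma) if l not in VOWELS]
    if len(consonants) == 3:
        return ["-".join(consonants)]
    if len(consonants) == 2:
        first, last = consonants
        return [f"{first}-{weak}-{last}" for weak in ("W", "J")]
    return []  # anything else is out of scope for this sketch


print(guess_roots("ħareġ"))
print(guess_roots("dar"))
```

For dar the function can only return both candidates; nothing in the string itself says which verb is meant.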

The usual way of dealing with this is to have a less-smart fallback in your paradigm, which takes an explicit root in such ambiguous cases. But the reality is that in this case we don’t even need the smarter version of the paradigm. The set of root-and-pattern verbs in Maltese is a closed set, so there are no new such verbs being added to the language (all new verbs are today added as loan verbs). Furthermore, this list has already been compiled by Michael Spagnol in his PhD thesis, and we even now have it in database form here. I am using this to directly build a monolingual Maltese verb database in GF, and since I already have the radicals for all these verbs, there really is no need at all to try and determine them automatically in a smart paradigm. As my professor Aarne Ranta likes to say, “don’t guess what you know.”
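
In other words, the root becomes an explicit argument rather than something inferred. A minimal Python sketch of that idea follows; the Verb record, mk_verb and the three entries are illustrative only and do not reflect the layout of the actual database.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Verb:
    mamma: str
    root: Tuple[str, str, str]


def mk_verb(mamma, root):
    """Hypothetical constructor: the root is always supplied, never guessed."""
    radicals = tuple(root.split("-"))
    if len(radicals) != 3:
        raise ValueError(f"expected three radicals, got {root!r}")
    return Verb(mamma, radicals)


# Homographs are simply two separate entries with different roots.
lexicon = [
    mk_verb("ħareġ", "Ħ-R-Ġ"),
    mk_verb("dar", "D-W-R"),
    mk_verb("dar", "D-J-R"),
]

roots_of_dar = [v.root for v in lexicon if v.mamma == "dar"]
print(roots_of_dar)
```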

Changing the verb implementation

Perhaps I saw the signs earlier than I would like to admit, but it has become clear now that my current implementation of Maltese verb morphology in GF has taken the wrong direction and needs to be significantly re-written. Having an inflection table with close to 1000 forms is not just an implementation headache, but arguably not linguistically accurate either.

So the new plan, which follows what is done in the implementations for Italian and Finnish, is to remove pronominal suffixes from the verb’s inflection table, and instead use binding on the syntax level to produce these forms. Reducing the inflection table is the easy part, but getting the rest to produce correct results might be tricky since the stem sometimes changes depending on the pronoun being suffixed.
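
As a rough illustration of the idea (in Python rather than GF), the verb could store a couple of stems and leave the suffix to be bound on later. The forms used are those of wiżen from the older post on pronominal suffixes; the VerbForm record, the bind function and the rule that the stem depends only on whether the suffix starts with a vowel are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbForm:
    """One cell of the reduced table, plus the stems needed for binding."""
    bare: str           # form used when nothing is suffixed
    stem_before_v: str  # stem used before a vowel-initial suffix
    stem_before_c: str  # stem used before a consonant-initial suffix


def bind(form, suffix=""):
    """Glue a pronominal suffix onto the verb, picking the matching stem."""
    if not suffix:
        return form.bare
    stem = form.stem_before_v if suffix[0] in "aeiou" else form.stem_before_c
    return stem + suffix


wizen = VerbForm(bare="wiżen", stem_before_v="wiżn", stem_before_c="wiżin")

print(bind(wizen))         # no suffix
print(bind(wizen, "ek"))   # direct object suffix
print(bind(wizen, "lek"))  # indirect object suffix
```

The table then needs one entry per tense, person, number and gender, with a few stems each, instead of one entry per suffix combination.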

So anyway I have created a new branch to work on this, so that at any point I can switch back to the original implementation if I want to compare something or if I end up wanting to use that approach again.

23 October, 2012

Pronominal suffixes and transitivity

The verb morphology I am currently working on for Maltese definitely suffers from over-generation, in particular when it comes to derived verbs and pronominal suffixes. Derived verbs are often intransitive and interpreted as reflexive or passive, which makes the addition of direct object suffixes to them very awkward.

For example, take the root W-Ż-N in the first (underived) form: wiżen “he weighed”.
Adding some pronominal suffixes we get wiżnek “he weighed you”, wiżinlek … “he weighed … for you”, and wiżinhomlok “he weighed them for you”.

So far so good, but let’s now look at the seventh derived form of this root: ntiżen “he was weighed”.
Appending an indirect object pronoun is fine: ntiżinlek “he was weighed for you”. But when we try with a direct object it ceases to make sense, e.g. ntiżnek and ntiżinhomlok. The reflexive meaning taken on by this derived verb means that direct object pronouns can no longer be attached to it, even in combination with an indirect object pronoun.

The problem is that I currently don’t know if these cases are detectable on a morphological level. In other words, if seventh form verbs never have any direct object pronouns attached then it is very simple to fix the over-generation, but it’s still a little early for me to tell whether such a general exclusion can be made.
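
If such a general exclusion does turn out to hold, the fix could be as small as the following Python sketch. The set NO_DIRECT_OBJECT and the function name are hypothetical, and putting form VII in that set is precisely the unverified assumption discussed above.

```python
# Assumption for illustration only: derived forms listed here never take
# direct object suffixes. Whether this really holds for every seventh
# form verb is the open question.
NO_DIRECT_OBJECT = {7}


def allowed_suffix_slots(derived_form):
    """Return the pronominal suffix combinations to generate for a verb.

    "do" stands for a direct object suffix, "io" for an indirect one;
    the empty tuple is the bare verb.
    """
    slots = [(), ("io",)]
    if derived_form not in NO_DIRECT_OBJECT:
        slots += [("do",), ("do", "io")]
    return slots


print(allowed_suffix_slots(1))  # wiżen: all four combinations
print(allowed_suffix_slots(7))  # ntiżen: bare and indirect object only
```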

18 September, 2012

A MySQL Unicode collation for Maltese

If you’ve tried to store and retrieve Maltese text in a MySQL database before, you may have noticed that there is no way to sort it correctly according to the Maltese alphabet.

The utf8_unicode_ci collation treats g and ġ etc. as interchangeable, but that is of course not right. You can try utf8_bin, but since this sorts according to Unicode code points, ċ, ġ, ħ, ż get sorted after the letter z, which is even worse (although it does at least mean you can search for ħ without getting h back too).

What you really need is a custom collation for the Maltese alphabet. There isn’t one built-in, but luckily MySQL makes adding custom collations relatively painless. So I went ahead and implemented such a collation for Maltese, and called it utf8_maltese_ci. You can find the code, along with detailed installation and usage instructions, at the GitHub repository for the Maltese MySQL collation.
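
To see what ordering such a collation is after, here is a small Python sketch comparing code point order with Maltese alphabetical order. It is not the MySQL collation itself; maltese_key is a hypothetical helper, and treating għ and ie as single letters follows the 30-letter alphabet, which may or may not match what utf8_maltese_ci does.

```python
MALTESE_ALPHABET = [
    "a", "b", "ċ", "d", "e", "f", "ġ", "g", "għ", "h", "ħ", "i", "ie",
    "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v",
    "w", "x", "ż", "z",
]
RANK = {letter: i for i, letter in enumerate(MALTESE_ALPHABET)}


def maltese_key(word):
    """Sort key: map each letter (or digraph) to its alphabet position."""
    w = word.lower()
    key, i = [], 0
    while i < len(w):
        pair = w[i:i + 2]
        if len(pair) == 2 and pair in RANK:  # the digraphs għ and ie
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the alphabet sort after everything else.
            key.append(RANK.get(w[i], len(MALTESE_ALPHABET)))
            i += 1
    return key


words = ["zalza", "żarbun", "ħobż", "hena", "ġar", "gawwi"]
by_codepoint = sorted(words)
by_alphabet = sorted(words, key=maltese_key)
print(by_codepoint)  # ['gawwi', 'hena', 'zalza', 'ġar', 'ħobż', 'żarbun']
print(by_alphabet)   # ['ġar', 'gawwi', 'hena', 'ħobż', 'żarbun', 'zalza']
```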

24 August, 2012

Vowel length and negation

Continuing the previous post about vowel lengths, here are some remarks about the handling of the long vowel ie under negation (which is after all the suffixation of the letter x).

Consider the verbs waqaf, kiel, and ħa. Note that the latter two are irregular, but I think they are still valid for the point I want to make. Their imperfect forms all consist of a stem which begins with the long vowel ie: nieqaf, tiekol, jieħu. Does this vowel get shortened under negation? Let’s see what the Maltese corpus has to say about this:

waqaf

Prefix | -ieqaf | -ieqafx | -iqafx | -ieqfu | -ieqfux | -iqfux
n-     | 1070   | 21      | 26     | 850    | 24      | 12
t-     | 2124   | 116     | 53     | 23     | 3       | 2
j-     | 2828   | 90      | 102    | 1390   | 51      | 58
Totals | 6022   | 227     | 281    | 2263   | 78      | 72

kiel

Prefix | -iekol | -iekolx | -ikolx | -ieklu | -ieklux | -iklux
n-     | 292    | 3       | 8      | 752    | 7       | 8
t-     | 935    | 18      | 18     | 75     | 1       | 7
j-     | 1339   | 16      | 21     | 1747   | 30      | 18
Totals | 2566   | 37      | 47     | 2574   | 38      | 33

ħa

Prefix | -ieħu | -ieħux | -iħux | -ieħdu | -ieħdux | -iħdux
n-     | 7191  | 23     | 58    | 11215  | 24      | 38
t-     | 17643 | 101    | 163   | 631    | 6       | 5
j-     | 33070 | 155    | 204   | 22682  | 113     | 103
Totals | 57904 | 279    | 425   | 34528  | 143     | 146

These are the totals of the negative forms, as percentages of the total occurrences of the corresponding positive form:

Verb  | Singular IE | Singular I | Plural IE | Plural I
waqaf | 3.76%       | 4.66%      | 3.45%     | 3.18%
kiel  | 1.44%       | 1.83%      | 1.48%     | 1.28%
ħa    | 0.48%       | 0.73%      | 0.41%     | 0.42%
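
For reference, each percentage is just a negative total divided by the matching positive total. The Python sketch below redoes that arithmetic from the Totals rows of the three tables; the key names (sg_pos, sg_ie and so on) are invented for this example, and values are rounded to two decimal places, so the last digit can differ slightly from a truncated figure.

```python
# Totals rows from the three corpus tables.
totals = {
    "waqaf": {"sg_pos": 6022, "sg_ie": 227, "sg_i": 281,
              "pl_pos": 2263, "pl_ie": 78, "pl_i": 72},
    "kiel": {"sg_pos": 2566, "sg_ie": 37, "sg_i": 47,
             "pl_pos": 2574, "pl_ie": 38, "pl_i": 33},
    "ħa": {"sg_pos": 57904, "sg_ie": 279, "sg_i": 425,
           "pl_pos": 34528, "pl_ie": 143, "pl_i": 146},
}


def pct(negative, positive):
    """Negative occurrences as a percentage of the positive occurrences."""
    return round(100 * negative / positive, 2)


def negative_shares(t):
    """Return (singular IE, singular I, plural IE, plural I) percentages."""
    return (
        pct(t["sg_ie"], t["sg_pos"]),
        pct(t["sg_i"], t["sg_pos"]),
        pct(t["pl_ie"], t["pl_pos"]),
        pct(t["pl_i"], t["pl_pos"]),
    )


for verb, t in totals.items():
    print(verb, negative_shares(t))
```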

So what do all these numbers mean?
When considering the singular negative, the version without the long ie vowel is more common in all cases. As an example, ma nikolx is more common than ma niekolx, which would indicate that the former is really the correct form.

In the plural though, it’s almost the complete opposite. To continue our example, the totals for kiel show forms in -ieklux (38) to be slightly more frequent than forms in -iklux (33). However, the difference in frequency is less pronounced: the two spellings are about 7 percentage points apart in the plural, compared to 12 in the singular.

So here we have another indication of the correct spelling, but not exactly hard evidence. The more I try to rely on the corpus for these things, the more apparent it becomes that it is not really a good settler of questions of minor orthographic differences.

21 August, 2012

Vowel length and pronominal suffixes in Maltese

Vowel length in Maltese seems to be one of those tricky things. The combination of pronominal suffixes with verbs ending in ‘a’ is a good example.

Direct Object suffixes

Think of the single verb form for “we saw you”: rajniek. Or should that be rajnik? Based on how it sounds to me as a native speaker, the latter, shorter-vowel version seems more likely.

The Maltese corpus is not much help in deciding this. Just look at these frequency counts for tokens ending in jniek and jnik:

Tokens ending in jniek:

Rank | Token          | Count
1    | tajniek        | 4
2    | rrispondejniek | 3
3    | smajniek       | 3
4    | qtajniek       | 2
5    | rajniek        | 2
6    | drajniek       | 1
7    | ħabbejniek     | 1
8    | obdejniek      | 1

Tokens ending in jnik:

Rank | Token         | Count
1    | tajnik        | 6
2    | għabbejnik    | 2
3    | mejnik        | 2
4    | rajnik        | 2
5    | staqsejnik    | 2
6    | avviċinajnik  | 1
7    | għaddejnik    | 1
8    | kkritikajnik  | 1
9    | ħallejnik     | 1

In total, jniek occurs 17 times and jnik occurs 18 times. Note also the fairly even split for the words which appear in both lists: tajniek (4) vs. tajnik (6), and rajniek (2) vs. rajnik (2).

But it turns out there is an explicit rule for this. According to “Grammatika Maltija” (p. 166), whenever a verb ending in ‘a’ is going to have a pronominal suffix attached to it, the joining vowel becomes an ‘ie’. So tajna + ek = tajniek, even though when you say it, it sounds a lot more like tajnik. The results from the corpus seem to confirm that I’m not the only one confused by this, although admittedly the numbers are probably too low to be statistically significant. While counter-intuitive, this rule seems pretty established, so we just have to accept it.

Indirect Object suffixes

What about indirect pronominal suffixes? Think of “we sang for you”, kantajnielek. Or is that kantajnilek? Again, the latter sounds like a more accurate transcription of the spoken form. The corpus reports 11 occurrences of tokens ending in jnielek, and 10 for jnilek. Another even split, and again not a statistically significant one. “Maltese” by Borg and Azzopardi-Alexander claims the former is correct, with an ‘ie’.

Direct and Indirect Object suffixes

And what happens when you have both a direct and an indirect pronominal suffix? The information is much more polarised. Using the rule above, as in “Maltese”, the ‘ie’ remains. So you have the forms kantajniehulek and ftaħniehulek.

But the corpus contains exactly zero tokens which end with iehulek, and a whopping 92 which finish with ihulek. In this case the two sources directly contradict each other. Some personal communication on the Kelmet il-Malti Facebook group confirms that the above rule no longer applies, and the more natural principle of vowel length comes into play again. So kantajnihulek and ftaħnihulek are the correct forms, and the book is wrong.
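
Putting the three cases together, the joining vowel could be chosen as in the Python sketch below. The function attach is hypothetical and only covers verb forms ending in ‘a’ with the suffix shapes quoted in this post; it does not model suffix allomorphy in general, and dropping the initial vowel of a suffix such as ek is an assumption read off the tajna + ek = tajniek example.

```python
def attach(verb, suffix, combined=False):
    """Attach a pronominal suffix to a verb form ending in 'a'.

    combined=True means the suffix holds both a direct and an indirect
    object pronoun, in which case the joining vowel is 'i' rather than 'ie'.
    """
    if not verb.endswith("a"):
        raise ValueError("this sketch only covers verb forms ending in 'a'")
    joiner = "i" if combined else "ie"
    # A vowel-initial suffix such as "ek" loses its own vowel after the joiner.
    tail = suffix[1:] if suffix[0] in "aeiou" else suffix
    return verb[:-1] + joiner + tail


print(attach("tajna", "ek"))                       # tajniek
print(attach("kantajna", "lek"))                   # kantajnielek
print(attach("kantajna", "hulek", combined=True))  # kantajnihulek
print(attach("ftaħna", "hulek", combined=True))    # ftaħnihulek
```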

15 August, 2012

Liquid-medial strong verbs beginning with għ

Liquid-medial verbs are a subclass of the strong Maltese Semitic verbs which have a liquid consonant (għ, l, m, n, r) as their second radical. Their paradigm is slightly different in that they sometimes require an extra vowel in conjugation. Whether this vowel is morphological or euphonic, I don’t know. Not all sources identify them as a subclass; some simply claim the vowel is inserted euphonically as needed. However, when the first radical is GĦ, this extra vowel is dropped again:

Class                | Root   | Mamma (Perf P3 Sg Masc) | Imperfect P1 Sg | Imperfect P1 Pl | Template (prev column)
Strong Regular       | K-T-B  | kiteb                   | nikteb          | niktbu          | nvCCCv
Strong Liquid-Medial | S-R-Q  | seraq                   | nisraq          | nisirqu         | nvCvCCv
Strong Liquid-Medial | GĦ-M-L | għamel                  | nagħmel         | nagħmlu         | nvCCCv
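
The template choice in the last column can be sketched in a few lines of Python. Both function names are hypothetical, the vowels are passed in by hand rather than derived, and only the three rows of the table are covered.

```python
LIQUIDS = {"għ", "l", "m", "n", "r"}


def imperfect_p1_pl_template(root):
    """Pick the Imperfect P1 Pl template for a strong triliteral root.

    Liquid-medial roots get an extra vowel after the first radical,
    unless the first radical is għ.
    """
    c1, c2, _c3 = root
    if c2 in LIQUIDS and c1 != "għ":
        return "nvCvCCv"
    return "nvCCCv"


def fill(template, root, vowels):
    """Replace each C with the next radical and each v with the next vowel."""
    radicals, vs = iter(root), iter(vowels)
    out = []
    for ch in template:
        if ch == "C":
            out.append(next(radicals))
        elif ch == "v":
            out.append(next(vs))
        else:
            out.append(ch)  # literal material such as the n- prefix
    return "".join(out)


for root, vowels in [(("k", "t", "b"), "iu"),
                     (("s", "r", "q"), "iiu"),
                     (("għ", "m", "l"), "au")]:
    print(fill(imperfect_p1_pl_template(root), root, vowels))
```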

This also crops up when adding some indirect object suffixes (P3 Sg Fem, and all Pl) in imperative/imperfect:

Class                | Root   | Imperfect P2 Sg | Imperfect P2 Sg + I.O. P1 Sg | Imperfect P2 Sg + I.O. P1 Pl | Template (prev column)
Strong Regular       | K-T-B  | tikteb          | tiktibli                     | tiktbilna                    | tvCCCilna
Strong Liquid-Medial | S-R-Q  | tisraq          | tisraqli                     | tisraqilna                   | tvCCvCilna
Strong Liquid-Medial | GĦ-M-L | tagħmel         | tagħmilli                    | tagħmlilna                   | tvCCCilna

8 August, 2012

Hundreds of forms, but nowhere to check them

A previous post showed just how many inflectional forms there are for a single verb in Maltese. But while writing the algorithms for producing such tables, I repeatedly find that for many of these forms there is no real way of checking them for correctness, because no other such resource exists.

There is the Korpus Malti, but despite containing nearly 100 million tokens, there are numerous grammatically-correct verb forms which do not occur anywhere in the corpus. No traditional dictionary would contain every possible inflected form for each verb, for reasons of size, so in many cases I must simply resort to “best guesses” and intuition. There are so-called verb models which are used in Maltese verb conjugations, e.g. the verb lagħab (he played) should be conjugated as seraq (he stole), but that only covers radical-placement, not vowel changes. For example, which is correct: naqtgħak or naqtgħek? The former does not appear at all in the corpus, and the latter appears just once, from a public blog entry. Not exactly hard evidence, is it?
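
The kind of check available in practice looks roughly like the Python sketch below: look up each candidate form in a table of corpus frequencies and see what comes back. The attestation function is hypothetical, and corpus_counts holds only the two counts quoted above rather than real corpus data.

```python
from collections import Counter


def attestation(candidates, corpus_counts):
    """Map each candidate form to how often it occurs in the corpus."""
    return {form: corpus_counts.get(form, 0) for form in candidates}


# Only the counts quoted in this post; a real run would load the full
# token frequency list of the corpus.
corpus_counts = Counter({"naqtgħek": 1})

report = attestation(["naqtgħak", "naqtgħek"], corpus_counts)
unattested = [form for form, n in report.items() if n == 0]

print(report)
print(unattested)
```

A zero count only says the form is unattested, not that it is wrong, which is exactly the problem.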

26 July, 2012

Full inflection table of a Maltese verb

The inflection table of the Maltese verb is formidable. Apart from tense/aspect and person, number & gender, a Maltese verb can also take suffixes for a direct object, for an indirect object, or for both a direct and indirect object. Add to this the “-x” suffix when the verb is negated, and you end up with no fewer than 952 unique forms for a single verb (the total number of combinations is 1152, but some combinations are non-existent).

Here’s the full table for the verb fetaħ (he opened).

