“the j stands for Jards”

5 April, 2013

The sad unreliability of Ubuntu One

I started using Ubuntu One more or less when it was first released. Admittedly it was pretty slow in the beginning, but they seemed to improve their speeds a lot and eventually I began to pay for extra space and use Ubuntu One exclusively for all my cloud syncing – some 3000 files from my entire Documents folder, and around 1,500 pictures. I used two Ubuntu machines and I thought U1 worked pretty well in making sure that I always had the most recent versions of everything on both machines.

I first noticed a problem with the syncing when, by accident, I saw that a folder which showed up on the U1 web interface was not on my computer. I tried various things to get this to work: all the command line options to u1sdtool, restarting, stopping/starting syncing etc. Eventually I wrote about it on Ask Ubuntu, and ended up getting in touch with U1 support. Their solution was essentially to clear all the cached syncing info on my machine and start again. Admittedly, this worked (although it did require that U1 scan and compare every single file again). I got the missing folder to sync, and everything seemed OK.

Things seemed OK for a few months. Then a few weeks ago I got a new machine, a MacBook Pro. I still use Ubuntu at work, and since there is a U1 client for OSX I thought there should be no problem in continuing to use Ubuntu One for my syncing. This is when things really started going downhill. The initial sync on my Mac worked fine – I mean essentially it’s just downloading everything, pretty straightforward. But then I began to notice that some changes I made would not get picked up by U1. Say I would delete a file from my Mac, but it would still appear in the web interface even though the U1 client would tell me that everything was up-to-date. This is really when I began to stop trusting it. Again I would try all the command line options for refreshing the sync folders, nothing. When I contacted U1 support again, they just had exactly the same solution – delete the caching data and re-sync. I did, it took its time to re-check every single file, and again things seemed OK. But then I would add/delete some other file and notice that, once again, Ubuntu One would fail to notice the change. There are things you can do to force it to notice the changes, like restarting the computer or un-checking and re-checking the “Sync locally” checkbox inside the client. But that defeats the whole purpose.

To make things worse, I’m also starting to notice this same erratic syncing behaviour on my Ubuntu machine too. And now I have absolutely no idea if there even exists a single complete version of all my files, anywhere. It feels like every computer I used U1 on has some copy of my files, but none of them is ever 100% complete or up to date. It’s a mess. There are just too many files to check manually. I have backups, and I hope that when I look for a file and find that Ubuntu One has lost it, I can find it by digging into these backups. But that’s hardly a solution. I absolutely cannot trust Ubuntu One anymore.

But I still want a cross-platform syncing solution. iCloud doesn’t have an Ubuntu client (and I haven’t heard good things about it anyway). Neither does Google Drive, although they keep promising one “soon”. Dropbox has clients for both and is starting to look like a real viable alternative now. I guess its popularity compared to U1 will mean it’s more reliable. But it’s going to take some work to move everything over, and I would really have preferred to avoid switching.

15 February, 2013

Determinate and indeterminate noun plural forms

Maltese nouns have two potential plural forms, determinate and indeterminate. The distinction is exhibited in examples such as:

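| English | Singulative | Determinate plural | Indeterminate plural |
| ------- | ----------- | ------------------ | -------------------- |
| road    | triq        | triqat             | toroq                |
| tooth   | sinna       | sinniet            | snien                |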

However it seems that in reality there are very few nouns which actually have both forms. An analysis of the 184 nouns in the GF Resource Grammar Library mini-lexicon shows that:

  • 14 (~8%) have both forms, though I would argue that many of these sound kind of arcane, e.g. ġbiel (ġebliet), xgħur (xagħariet), għejun (għajnejn).
  • 158 (~86%) have just a determinate plural.
  • 3 (~2%) have just an indeterminate plural.
  • 9 (~5%) have neither plural form. This is usually compensated by a collective form (e.g. baqar), a dual (e.g. riġlejn) or simply a singulative (e.g. plastik).
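As a quick check on the breakdown above, here is a minimal Python sketch that recomputes each share from the raw counts. The counts are the ones quoted in the list; the dictionary labels and variable names are just illustrative.

```python
# Counts quoted above for the 184 nouns in the mini-lexicon.
counts = {
    "both plural forms": 14,
    "determinate only": 158,
    "indeterminate only": 3,
    "neither": 9,
}

total = sum(counts.values())

# Share of each category as a percentage of all nouns.
shares = {label: 100.0 * n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} of {total} ({share:.1f}%)")
```

The four categories add up to exactly 184, so every noun in the mini-lexicon falls into one of them.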

While this distinction can have some linguistic importance, for the purposes of the GF implementation it will be simplified slightly, by storing only one plural form. This change will be made internally in the noun representation, so that the paradigm constructors are not affected and as such we still have this information available (although it is just being ignored for our purposes).

Another solution is to have indeterminate plural forms stored simply as variants of the determinate plural. I think that in most cases one could get away with this, though for now I am steering clear of all variants just to keep testing simple.

20 January, 2013

Removing inferred roots from verb smart paradigms

In the Maltese resource grammar implementation I had some code which tried to extract the radicals from a so-called mamma verb form. So for example:

classifyVerb "ħareġ"

would give (amongst other information) the radicals Ħ-R-Ġ in record form. This works well most of the time, except for cases where it is completely impossible to guess the missing radicals from weak-root verbs. For example, dar is actually the mamma of two distinct verbs, one with root D-W-R and another with D-J-R.
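To make the ambiguity concrete, here is a rough Python sketch of the kind of guessing involved. It is not the GF code behind classifyVerb: the function names are hypothetical, and the approach (drop the vowels, accept the result only if exactly three consonants remain) is a simplifying assumption.

```python
def consonants(mamma):
    """Split a mamma into its consonants, treating 'għ' as a single letter."""
    result = []
    i = 0
    while i < len(mamma):
        if mamma.startswith("għ", i):
            result.append("għ")
            i += 2
        elif mamma.startswith("ie", i):
            i += 2  # long vowel, not a radical
        elif mamma[i] in "aeiou":
            i += 1  # short vowel, not a radical
        else:
            result.append(mamma[i])
            i += 1
    return result


def guess_radicals(mamma):
    """Return a root like 'Ħ-R-Ġ', or None when the consonants alone are not enough."""
    found = consonants(mamma)
    if len(found) != 3:
        return None  # weak root: the missing radical cannot be read off the form
    return "-".join(c.upper() for c in found)


print(guess_radicals("ħareġ"))  # three consonants survive
print(guess_radicals("dar"))    # only d and r survive, so no guess is possible
```

For dar the sketch gives up, which is exactly the case where an explicit root (D-W-R or D-J-R) has to be supplied.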

The usual way of dealing with this is to have a less-smart fallback in your paradigm, which takes an explicit root in such ambiguous cases. But the reality is that in this case we don’t even need the smarter version of the paradigm. The set of root-and-pattern verbs in Maltese is a closed set, so there are no new such verbs being added to the language (all new verbs today are added as loan verbs). Furthermore, this list has already been compiled by Michael Spagnol in his PhD thesis, and we even have it in database form now here. I am using this to directly build a monolingual Maltese verb database in GF, and since I already have the radicals for all these verbs, there really is no need at all to try and determine them automatically in a smart paradigm. As my professor Aarne Ranta likes to say, “don’t guess what you know.”

Changing the verb implementation

Perhaps I saw the signs earlier than I would like to admit, but it has become clear now that my current implementation of Maltese verb morphology in GF has taken the wrong direction and needs to be significantly re-written. Having an inflection table with close to 1000 forms is not just a headache implementationally, but also arguably not linguistically accurate either.

So the new plan, which is what is done in the implementations for Italian and Finnish, is to remove pronominal suffixes from the verb’s inflection table, and instead use binding on the syntax level to produce these forms. Reducing the inflection table is the easy part, but getting the rest to produce correct results might be tricky since the stem sometimes changes depending on the pronoun being suffixed.

So anyway I have created a new branch to work on this, so that at any point I can switch back to the original implementation if I want to compare something or if I end up wanting to use that approach again.

11 January, 2013

Markdown, Pandoc and GitHub

I love writing in Markdown, and in general I try to always write in Markdown and then convert into HTML/TeX. Pandoc is a fantastic tool for converting from Markdown to other formats, and since it is so versatile I would like to use it for everything. I also use GitHub a lot, which has an automatic renderer for Markdown documents.

Unfortunately, Pandoc’s Markdown (PM) and GitHub Flavored Markdown (GFM) are not identical, and I find myself constantly torn between the two, trying to satisfy both. I typically have some code repository hosted on GitHub, with at least one main readme file written in Markdown format. When browsing the repository through the GitHub website, this readme file is automatically converted to HTML. Since this is often the first and only documentation for my code, it is important to me that it renders correctly.

I often want to also convert my Markdown document locally into a self-contained HTML file, and sometimes TeX too, and for this Pandoc is just the best. But herein begin the differences in syntax support:

Tables

GFM likes “pipe-tables”, as defined in Markdown Extra:

| Item      | Value |
| --------- | -----:|
| Computer  | $1600 |
| Phone     |   $12 |
| Pipe      |    $1 |

However, the latest stable releases of Pandoc (1.9.x) support a bunch of other table types, but not pipe-tables. The development version (1.10.x) does thankfully support them, so my current solution is to use the development version of Pandoc and compile it from source. This means my Makefile might not be portable, but at least I know it works for me (though arguably maybe I shouldn’t depend on Pandoc in the first place).

Definition lists

Quite simply, GFM does not support definition lists. However, they are defined in Markdown Extra, and Pandoc handles them like a champ. At least GFM tends to degrade gracefully in this case, so definition lists don’t bother me too much.

Pre/post code

When building a standalone HTML or TeX file, you will definitely need to include some before and after code around your actual content. You could have completely separate files for this, and then glue them together in a Makefile. But sometimes this seems like overkill for a simple </body></html>, and I just want to stick them at the bottom of my Markdown file and be done with it. In fact GFM will happily ignore HTML tags, but will still display the content of something like <title>Hello!</title>. And if you try to include some TeX code it only gets worse.

 

Maybe it’s my fault for expecting too many different things from a simple language. But with a Master’s thesis looming, I’m currently thinking through my writing options. While I love the idea of writing in Markdown and using Pandoc to convert to TeX, this lack of a standard really bothers me and I can’t help wondering if I might be safer with something like txt2tags, which my professor swears by.

23 October, 2012

Pronominal suffixes and transitivity

The verb morphology I am currently working on for Maltese definitely suffers from over-generation, in particular when it comes to derived verbs and pronominal suffixes. Derived verbs are often intransitive and interpreted as reflexive or passive, which makes the addition of direct object suffixes to them very awkward.

For example, take the root W-Ż-N in the first (underived) form: wiżen “he weighed”.
Adding some pronominal suffixes we get wiżnek “he weighed you”, wiżinlek … “he weighed … for you”, and wiżinhomlok “he weighed them for you”.

So far so good, but let’s now look at the seventh derived form of this root: ntiżen “he was weighed”.
Appending an indirect object pronoun is fine: ntiżinlek “he was weighed for you”. But when we try with a direct object it ceases to make sense, e.g. ntiżnek and ntiżinhomlok. The reflexive meaning taken on by this derived verb means direct object pronouns no longer make any sense when attached to it (even in combination with an indirect object pronoun).

The problem is that I currently don’t know if these cases are detectable on a morphological level. In other words, if seventh form verbs never have any direct object pronouns attached then it is very simple to fix the over-generation, but it’s still a little early for me to tell whether such a general exclusion can be made.
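If such a general exclusion did turn out to hold, the fix could be as small as the following Python sketch. The rule itself is only the hypothesis discussed above, not an established fact, and the names (INTRANSITIVE_FORMS, suffix_combinations) are made up for illustration.

```python
# ASSUMPTION: derived form VII never takes a direct object suffix.
# This is the untested hypothesis from the text, not a confirmed rule.
INTRANSITIVE_FORMS = {7}


def suffix_combinations(derived_form):
    """List the (has_direct_object, has_indirect_object) slots to generate."""
    combos = [(d, i) for d in (False, True) for i in (False, True)]
    if derived_form in INTRANSITIVE_FORMS:
        # Drop every combination that includes a direct object suffix,
        # including direct + indirect together.
        combos = [(d, i) for d, i in combos if not d]
    return combos


print(suffix_combinations(1))  # first form: all four combinations
print(suffix_combinations(7))  # seventh form: bare verb and indirect object only
```

Under this assumption wiżnek and wiżinhomlok would still be generated for the first form, while ntiżnek and ntiżinhomlok would not be generated for the seventh.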

18 September, 2012

A MySQL Unicode collation for Maltese

If you’ve tried to store and retrieve Maltese text in a MySQL database before, you may have noticed that there is no way to sort it correctly according to the Maltese alphabet.

The utf8_unicode_ci collation treats g and ġ etc. as interchangeable, but that is of course not right. You can try utf8_bin, but since this sorts according to Unicode codepoints, ċ, ġ, ħ, ż get sorted after the letter z, which is even worse (although it does at least mean you can search for ħ without getting h back too).

What you really need is a custom collation for the Maltese alphabet. There isn’t one built-in, but luckily MySQL makes adding custom collations relatively painless. So I went ahead and implemented such a collation for Maltese, and called it utf8_maltese_ci. You can find the code, along with detailed installation and usage instructions at the GitHub repository for the Maltese MySQL collation.
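To illustrate the ordering such a collation is after, here is a small Python sketch that sorts words by the Maltese alphabet instead of by codepoint. It is not the code of utf8_maltese_ci; the function name is hypothetical, and treating għ and ie as single letters is an assumption of this sketch.

```python
# Maltese alphabet in dictionary order; 'għ' and 'ie' are treated as
# single letters here (an assumption of this sketch).
MALTESE_ALPHABET = [
    "a", "b", "ċ", "d", "e", "f", "ġ", "g", "għ", "h", "ħ", "i", "ie",
    "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v",
    "w", "x", "ż", "z",
]
RANK = {letter: position for position, letter in enumerate(MALTESE_ALPHABET)}


def maltese_key(word):
    """Build a sort key: the alphabet position of each letter in the word."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in RANK:  # digraphs 'għ' and 'ie'
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the alphabet sort after all Maltese letters.
            key.append(RANK.get(word[i], len(MALTESE_ALPHABET) + ord(word[i])))
            i += 1
    return key


words = ["zokk", "żejt", "ġbin", "gabra", "ħobż", "hemm", "ċavetta", "dar"]
print(sorted(words))                   # codepoint order: ċ, ġ, ħ, ż land after z
print(sorted(words, key=maltese_key))  # Maltese alphabet order
```

Case is folded before comparison, which mirrors the case-insensitive behaviour implied by the _ci suffix.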

24 August, 2012

Vowel length and negation

Continuing the previous post about vowel lengths, here are some remarks about the handling of the long vowel ie under negation (which is after all the suffixation of the letter x).

Consider the verbs waqaf, kiel, and ħa. Note that the latter two are irregular; however, I think they are still valid for the point I want to make. Their imperfect forms all consist of a stem which begins with the long vowel ie: nieqaf, tiekol, jieħu. Does this vowel get shortened under negation? Let’s see what the Maltese corpus has to say about this:

waqaf

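| Prefix | -ieqaf | -ieqafx | -iqafx | -ieqfu | -ieqfux | -iqfux |
| ------ | ------:| -------:| ------:| ------:| -------:| ------:|
| n-     |   1070 |      21 |     26 |    850 |      24 |     12 |
| t-     |   2124 |     116 |     53 |     23 |       3 |      2 |
| j-     |   2828 |      90 |    102 |   1390 |      51 |     58 |
| Totals |   6022 |     227 |    281 |   2263 |      78 |     72 |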

kiel

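| Prefix | -iekol | -iekolx | -ikolx | -ieklu | -ieklux | -iklux |
| ------ | ------:| -------:| ------:| ------:| -------:| ------:|
| n-     |    292 |       3 |      8 |    752 |       7 |      8 |
| t-     |    935 |      18 |     18 |     75 |       1 |      7 |
| j-     |   1339 |      16 |     21 |   1747 |      30 |     18 |
| Totals |   2566 |      37 |     47 |   2574 |      38 |     33 |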

ħa

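| Prefix | -ieħu | -ieħux | -iħux | -ieħdu | -ieħdux | -iħdux |
| ------ | -----:| ------:| -----:| ------:| -------:| ------:|
| n-     |  7191 |     23 |    58 |  11215 |      24 |     38 |
| t-     | 17643 |    101 |   163 |    631 |       6 |      5 |
| j-     | 33070 |    155 |   204 |  22682 |     113 |    103 |
| Totals | 57904 |    279 |   425 |  34528 |     143 |    146 |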

These are the totals of the negative forms, as percentages of the total occurrences of the corresponding positive form:

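| Verb  | Singular IE | Singular I | Plural IE | Plural I |
| ----- | -----------:| ----------:| ---------:| --------:|
| waqaf |       3.76% |      4.66% |     3.45% |    3.18% |
| kiel  |       1.44% |      1.83% |     1.48% |    1.28% |
| ħa    |       0.48% |      0.73% |     0.41% |    0.42% |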

So what do all these numbers mean?
When considering the singular negative, the version without the long ie vowel is more common in all cases. As an example, ma nikolx is more common than ma niekolx, which would indicate that the former is really the correct form.

In the plural though, it’s almost the complete opposite. To continue our example, this means that ma nieklux is slightly more frequent than ma niklux. However, the difference in frequency is less pronounced: 7% in the plural compared to 12% in the singular for the given example.
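For anyone wanting to retrace the arithmetic, here is a small Python sketch using the Totals row for kiel. The percentages in the summary table are negative counts over the positive total; the 7% and 12% figures are reproduced here by taking the gap between the two negative spellings over their combined count, which is my assumption about how they were derived. The function names are illustrative.

```python
# Totals row for kiel, copied from the table above.
kiel = {
    "iekol": 2566, "iekolx": 37, "ikolx": 47,   # singular
    "ieklu": 2574, "ieklux": 38, "iklux": 33,   # plural
}


def negation_rate(negative_count, positive_count):
    """Negative occurrences as a percentage of the positive form's occurrences."""
    return 100.0 * negative_count / positive_count


def relative_gap(count_a, count_b):
    """Gap between two competing spellings, as a percentage of their combined count."""
    return 100.0 * abs(count_a - count_b) / (count_a + count_b)


print(f"singular IE: {negation_rate(kiel['iekolx'], kiel['iekol']):.2f}%")
print(f"singular I:  {negation_rate(kiel['ikolx'], kiel['iekol']):.2f}%")
print(f"plural IE:   {negation_rate(kiel['ieklux'], kiel['ieklu']):.2f}%")
print(f"plural I:    {negation_rate(kiel['iklux'], kiel['ieklu']):.2f}%")

print(f"singular gap: {relative_gap(kiel['ikolx'], kiel['iekolx']):.0f}%")
print(f"plural gap:   {relative_gap(kiel['ieklux'], kiel['iklux']):.0f}%")
```

With counts this small, a handful of extra tokens either way would noticeably shift both gaps.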

So here we have another indication of the correct spelling, but not exactly hard evidence. The more I try to rely on the corpus for these things, the more apparent it becomes that it is not really a good settler of questions of minor orthographic differences.

21 August, 2012

Vowel length and pronominal suffixes in Maltese

Vowel length in Maltese seems to be one of those tricky things. The combination of pronominal suffixes with verbs ending in ‘a’ is a good example.

Direct Object suffixes

Think of the single verb form for “we saw you”: rajniek. Or should that be rajnik? Based on how it sounds as a native speaker, the latter shorter-vowel version seems more likely.

The Maltese corpus is not much help in deciding this. Just look at these frequency counts for tokens ending in jniek and jnik:

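| Rank | Token (-jniek) | Count |
| ---- | -------------- | -----:|
| 1    | tajniek        |     4 |
| 2    | rrispondejniek |     3 |
| 3    | smajniek       |     3 |
| 4    | qtajniek       |     2 |
| 5    | rajniek        |     2 |
| 6    | drajniek       |     1 |
| 7    | ħabbejniek     |     1 |
| 8    | obdejniek      |     1 |

| Rank | Token (-jnik) | Count |
| ---- | ------------- | -----:|
| 1    | tajnik        |     6 |
| 2    | għabbejnik    |     2 |
| 3    | mejnik        |     2 |
| 4    | rajnik        |     2 |
| 5    | staqsejnik    |     2 |
| 6    | avviċinajnik  |     1 |
| 7    | għaddejnik    |     1 |
| 8    | kkritikajnik  |     1 |
| 9    | ħallejnik     |     1 |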

In total, jniek occurs 17 times and jnik occurs 18 times. Note also the roughly even split of the words which appear in both lists: tajniek (4) vs. tajnik (6), and rajniek (2) vs. rajnik (2).

But it turns out there is an explicit rule for this. According to “Grammatika Maltija” p. 166, whenever a verb ending in ‘a’ is going to have a pronominal suffix attached to it, the joining vowel becomes an ‘ie’. So tajna + ek = tajniek, even though when you say it, it sounds a lot more like tajnik. The results from the corpus seem to confirm that I’m not the only one confused by this, although admittedly the numbers are probably too low to be statistically significant. While counter-intuitive, this rule seems pretty established, so we just accept it.

Indirect Object suffixes

What about indirect pronominal suffixes? Think of “we sang for you”, kantajnielek. Or is that kantajnilek? Again, the latter sounds like a more accurate transcription of the spoken form. The corpus reports 11 occurrences of tokens ending in jnielek, and 10 for jnilek. Another even, non-statistically-significant split. “Maltese” by Borg and Azzopardi-Alexander claims the former is correct, with an ‘ie’.

Direct and Indirect Object suffixes

And what happens when you have both direct and indirect pronominal suffixes? Here the information is much more polarised. Using the rule above, as in “Maltese”, the ‘ie’ remains. So you have the forms kantajniehulek and ftaħniehulek.

But the corpus contains exactly zero tokens which end with iehulek, and a whopping 92 which finish with ihulek. In this case the two sources directly contradict each other. Some personal communication on the Kelmet il-Malti Facebook group confirms that the above rule no longer applies, and the more natural principle of vowel length comes into play again. So kantajnihulek and ftaħnihulek are the correct forms, and the book is wrong.
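The forms discussed in this post can be generated with a very small Python sketch. It only covers verbs ending in ‘a’, the function name and the long_vowel switch are hypothetical, and dropping the suffix’s own leading vowel (ek becomes k) is an assumption that happens to fit the examples above.

```python
def attach_suffix(verb, suffix, long_vowel=True):
    """Attach a pronominal suffix to a verb form ending in 'a'.

    long_vowel=True follows the grammar-book rule (joining vowel 'ie');
    long_vowel=False gives the shorter spelling with 'i'.
    """
    if not verb.endswith("a"):
        raise ValueError("this sketch only handles verb forms ending in 'a'")
    stem = verb[:-1]
    joining_vowel = "ie" if long_vowel else "i"
    # The suffix's own leading vowel, if any, is replaced by the joining vowel.
    return stem + joining_vowel + suffix.lstrip("aeiou")


print(attach_suffix("tajna", "ek"))                          # direct object
print(attach_suffix("kantajna", "lek"))                      # indirect object
print(attach_suffix("kantajna", "hulek"))                    # both, book rule
print(attach_suffix("kantajna", "hulek", long_vowel=False))  # both, corpus spelling
```

Following the conclusion above, long_vowel would be True for a single suffix and False when a direct and an indirect suffix are combined.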

15 August, 2012

Liquid-medial strong verbs beginning with għ

Liquid-medial verbs are a subclass of the strong Maltese Semitic verbs, which have a liquid consonant (għ, l, m, n, r) as their second radical. Their paradigm is slightly different in that they sometimes require an extra vowel in conjugation. Whether this vowel is morphological or euphonic, I don’t know. Not all sources identify them as a subclass; some simply claim the vowel is inserted euphonically as needed. However, when the first radical is GĦ, this extra vowel is dropped again:

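| Class                | Root   | Mamma (Perf P3 Sg Masc) | Imperfect P1 Sg | Imperfect P1 Pl | Template (prev column) |
| -------------------- | ------ | ----------------------- | --------------- | --------------- | ---------------------- |
| Strong Regular       | K-T-B  | kiteb                   | nikteb          | niktbu          | nvCCCv                 |
| Strong Liquid-Medial | S-R-Q  | seraq                   | nisraq          | nisirqu         | nvCvCCv                |
| Strong Liquid-Medial | GĦ-M-L | għamel                  | nagħmel         | nagħmlu         | nvCCCv                 |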

This also crops up when adding some indirect object suffixes (P3 Sg Fem, and all Pl) in imperative/imperfect:

| Class                | Root   | Imperfect P2 Sg | Imperfect P2 Sg + I.O. P1 Sg | Imperfect P2 Sg + I.O. P1 Pl | Template (prev column) |
| -------------------- | ------ | --------------- | ---------------------------- | ---------------------------- | ---------------------- |
| Strong Regular       | K-T-B  | tikteb          | tiktibli                     | tiktbilna                    | tvCCCilna              |
| Strong Liquid-Medial | S-R-Q  | tisraq          | tisraqli                     | tisraqilna                   | tvCCvCilna             |
| Strong Liquid-Medial | GĦ-M-L | tagħmel         | tagħmilli                    | tagħmlilna                   | tvCCCilna              |
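The templates in the last column of both tables can be derived mechanically from the forms, which makes the difference easy to check. Below is a rough Python sketch; the function name skeleton is hypothetical, and it assumes that għ counts as a single consonant and that the prefix and suffix are kept as literal text.

```python
def skeleton(form, prefix, suffix=""):
    """Turn a verb form into a C/v template, keeping prefix and suffix literal."""
    if not (form.startswith(prefix) and form.endswith(suffix)):
        raise ValueError("form does not carry the given prefix/suffix")
    core = form[len(prefix):len(form) - len(suffix)]
    pattern = []
    i = 0
    while i < len(core):
        if core.startswith("għ", i):
            pattern.append("C")  # digraph counts as one consonant
            i += 2
        elif core.startswith("ie", i):
            pattern.append("v")  # long vowel counts as one vowel slot
            i += 2
        elif core[i] in "aeiou":
            pattern.append("v")
            i += 1
        else:
            pattern.append("C")
            i += 1
    return prefix + "".join(pattern) + suffix


for form in ("niktbu", "nisirqu", "nagħmlu"):
    print(form, skeleton(form, "n"))

for form in ("tiktbilna", "tisraqilna", "tagħmlilna"):
    print(form, skeleton(form, "t", "ilna"))
```

In both sets the GĦ-initial verb comes out with the same template as the regular verb, while the other liquid-medial verb shows the extra vowel.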