“the j stands for Jewfish”

18 September, 2012

A MySQL Unicode collation for Maltese

If you’ve tried to store and retrieve Maltese text in a MySQL database before, you may have noticed that there is no way to sort it correctly according to the Maltese alphabet.

The utf8_unicode_ci collation treats pairs such as g and ġ as interchangeable, which is of course not right. You can try utf8_bin, but since this sorts by Unicode code point, ċ, ġ, ħ and ż all end up after the letter z, which is even worse (although it does at least mean you can search for ħ without getting h back too).
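To make the ordering problem concrete, here is a minimal Python sketch (not MySQL, and not the collation's actual implementation) that compares a plain code point sort with a sort keyed on the Maltese alphabet. The names `MALTESE_ORDER` and `maltese_key` and the sample word list are made up for illustration, and the sketch deliberately ignores the digraphs għ and ie and assumes precomposed characters.

```python
# Single-letter Maltese alphabet order. The digraphs "għ" and "ie" are
# ignored here to keep the sketch short (an assumption of this example).
MALTESE_ORDER = "abċdefġghħijklmnopqrstuvwxżz"

# Map each letter to its position in the alphabet.
RANK = {ch: i for i, ch in enumerate(MALTESE_ORDER)}


def maltese_key(word):
    """Return a sort key that follows the Maltese alphabet.

    Characters outside the alphabet are placed after all Maltese
    letters, ordered among themselves by code point.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]


words = ["zokk", "żarbun", "ħobż", "hemm", "gżira", "ġebla", "dar", "ċavetta"]

# Roughly what a binary collation does: ċ, ġ, ħ, ż land after z.
by_codepoint = sorted(words)

# What a Maltese-aware collation should do instead.
by_alphabet = sorted(words, key=maltese_key)

print(by_codepoint)
print(by_alphabet)
```

The first list puts every word starting with a dotted or barred letter after zokk, while the second keeps ċ after b, ġ before g, ħ after h and ż before z.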

What you really need is a custom collation for the Maltese alphabet. There isn’t one built-in, but luckily MySQL makes adding custom collations relatively painless. So I went ahead and implemented such a collation for Maltese, and called it utf8_maltese_ci. You can find the code, along with detailed installation and usage instructions, at the GitHub repository for the Maltese MySQL collation.

6 November, 2010

Fuzzy string matching in MySQL using Levenshtein Distance stored function

Searching for fuzzy string matching methods will return various algorithms and various implementations of them.

I found this MySQL implementation of Levenshtein Distance to be adequate for my needs. Wrapped in this handy MySQL stored function, it is super easy to use in queries without having to create temporary search-optimised tables or perform post-processing in another language (e.g. PHP).

Just install the functions in your database, and use like so:

SELECT * FROM users ORDER BY levenshtein_distance(users.name, 'john');
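For anyone curious what such a function computes, below is a rough Python sketch of the standard dynamic-programming Levenshtein distance. It is not the linked MySQL implementation, just the textbook algorithm, and the sample `names` list is invented for illustration. It also shows the kind of post-processing in another language that the stored function lets you skip.

```python
def levenshtein_distance(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    # previous[j] is the distance between the prefix of a processed so
    # far (initially empty) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i characters of a into "" takes i deletions.
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


# Mirror the query above: order names by their distance to 'john'.
names = ["joan", "jonathan", "john", "jon"]
ranked = sorted(names, key=lambda name: levenshtein_distance(name, "john"))
print(ranked)
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.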