Multilingual sorting - I hope it helps

Multilingual sorting - I hope it helps

hellsinghellsing Posts: 22Questions: 8Answers: 0
edited June 2015 in Free community support

Hey Guys,

I am working on a multilingual website that has to handle non-ASCII letters. I was digging the forums here and searched the internet and saw that unicode ordering is a problem in JS. It's hardly manageable or makes heavy processor load. At least for me who's not a JS expert :)

My tables contain 10k+ rows so I had to find out something. This is where I got (and I hope it's not a duplicate):

1: MySQL

Fortunately the root of this problem is quite easily manageable in MySQL. If I use the proper collation then MySQL sorts strings as expected. My query looks like this (w/o the slashes):

SELECT *, /@rownum := /@rownum+1 AS posSK
FROM manufacturers
CROSS JOIN (SELECT /@rownum := 0) r
ORDER BY manNameSK COLLATE utf8_slovak_ci;

The /@rownum part is an incrementing counter and represents the position of the row based on the collation I use in sorting. Please note that ASC / DESC sorting are not available when ORDER BY … COLLATE … The collation makes almost no change in running time: on a 5.600 lines table this query is ready in average 0.0063s vs. 0.0054s w/o collation.

2: php + HTML5

I add to every td element an HTML5 attribute: data-order, and the values are the corresponding value of my MySQL counter (posSK), so it looks like this (totally simplified):

<tr><td data-order="1">bar</td></tr>
<tr><td data-order="2">blaf</td></tr>
<tr><td data-order="3">foo</td></tr>

3: dataTables

The easiest part: we don't have to do anything. If dt finds a data-order attribute it automatically uses it for ordering on that column, so the special non-ASCII letters will be on their right position.

Notes

I tried this method on Slovakian, Slovenian, Hungarian and Romanian languages and they were all fine. There are some special rules in (almost?) every language for sorting in ABC order, eg. some accented vowels in Hungarian are handled equally with their plain pair (E+É, I+Í, O+Ó, accents mean longer or slightly different sound), while others are not ( O vs. Ö, U vs. Ü, accents mean a strongly different sound). So if you see a word beginning with an É within the others beginning with E, thats fine :)

I hope it helps :)

Replies

  • allanallan Posts: 63,689Questions: 1Answers: 10,500 Site admin

    Very interesting - thanks for posting this.

    In theory, in Javascript the localeCompare method should address this, but it appears to be really poorly implemented in browsers. The new i18n API in "HTML5" (for lack of a better term) will hopefully help, although it isn't widely supported in browsers yet.

    Allan

This discussion has been closed.