How To: MySQL FullText searching
How To: MySQL FullText searching
In my project I've been retrofitting a search engine app with a slicker DataTables front. The main feature of the old application was this keyword search feature, using FullText indexes in MySQL. This may be of interest or use to someone else, so I'll mention how I integrate datatables with a fulltext field. Some of you may have suggestions or better solutions, and I'd love to hear those too.
Perhaps in the future DataTables might directly support FullText fields?
...Some background ...
FullText indexes and searching are supported in MySQL 3.23.23 and above and allow programmers and users to implement keyword based filtering and sorting, and provides several modes like Natural Language mode (default) and Boolean mode. I'll talk a little about these modes below. FullText statements in your SQL can go into your SELECT, or WHERE, but use an aliased value in the ORDER BY portions. Here is an example of a FullText query (note you must have defined your FullText indexes in the database before calling against the index, in the example the index is defined on headline and story):
[code]
SELECT headline, story, MATCH(headline,story) AGAINST ('Hurricane damage recent') as score
FROM news
WHERE MATCH (headline,story) AGAINST ('Hurricane damage recent')
ORDER BY score DESC;
[/code]
In the default Natural Language mode, MATCH .. AGAINST returns a floating point numeric rank (higher is more relevant). This allows you to sort results if you wish, and I've aliased the result to score and used it in my ORDER BY clause. Note that although I've used MATCH .. AGAINST twice in the query, MySQL is smart enough to have cached the first call so there is no performance penalty for the second call.
Some basic things to know about FullText is that white space and punctuation are ignored, and each word (separated by white space) is tokenized and used as a separate term - this makes this search different than DataTables' filters. There is both an upper and lower length limit for valid terms (default is 4 on the low side, but these are configurable) and words matching the Stop Words list are ignored (common words like "the", "and"; there's a default list but this can be configured as well).
... Performance ...
Of course features don't come without some cost, but FullText runs smoothly on datasets up to several million rows large. The performance cost is primarily in the UPDATE phase, in which the index must also be updated. For datasets larger than several million, you might want to look into alternative programs like Sphinx (http://sphinxsearch.com/) or Apache's Solr/Lucene (http://lucene.apache.org/solr/) or others.
... Natural Language mode vs Boolean mode ...
As mentioned above, Natural Language mode generates a relevance score for indexes matching your search terms so you can rank rows that have more than one hit against the index higher in the results list. This mode, however, doesn't have more powerful features like specifying that a term must be included, or should not be included. Boolean mode allows some of these features, but does not provide a relevance score (I may be wrong about this, check the documentation), so there's a trade-off. Operators in Boolean mode include the '+' (boolean AND; must include), '-' (boolean NOT; must not include) and '( )' for grouping terms, i.e. "+Reagan -Bush (Iran Contra)". There are other operators that let you indicate an increased or decreased weighting, they are like + and - but not as strict; these operators are '>', '<', '~', etc. See more at http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html.
Perhaps in the future DataTables might directly support FullText fields?
...Some background ...
FullText indexes and searching are supported in MySQL 3.23.23 and above and allow programmers and users to implement keyword based filtering and sorting, and provides several modes like Natural Language mode (default) and Boolean mode. I'll talk a little about these modes below. FullText statements in your SQL can go into your SELECT, or WHERE, but use an aliased value in the ORDER BY portions. Here is an example of a FullText query (note you must have defined your FullText indexes in the database before calling against the index, in the example the index is defined on headline and story):
[code]
SELECT headline, story, MATCH(headline,story) AGAINST ('Hurricane damage recent') as score
FROM news
WHERE MATCH (headline,story) AGAINST ('Hurricane damage recent')
ORDER BY score DESC;
[/code]
In the default Natural Language mode, MATCH .. AGAINST returns a floating point numeric rank (higher is more relevant). This allows you to sort results if you wish, and I've aliased the result to score and used it in my ORDER BY clause. Note that although I've used MATCH .. AGAINST twice in the query, MySQL is smart enough to have cached the first call so there is no performance penalty for the second call.
Some basic things to know about FullText is that white space and punctuation are ignored, and each word (separated by white space) is tokenized and used as a separate term - this makes this search different than DataTables' filters. There is both an upper and lower length limit for valid terms (default is 4 on the low side, but these are configurable) and words matching the Stop Words list are ignored (common words like "the", "and"; there's a default list but this can be configured as well).
... Performance ...
Of course features don't come without some cost, but FullText runs smoothly on datasets up to several million rows large. The performance cost is primarily in the UPDATE phase, in which the index must also be updated. For datasets larger than several million, you might want to look into alternative programs like Sphinx (http://sphinxsearch.com/) or Apache's Solr/Lucene (http://lucene.apache.org/solr/) or others.
... Natural Language mode vs Boolean mode ...
As mentioned above, Natural Language mode generates a relevance score for indexes matching your search terms so you can rank rows that have more than one hit against the index higher in the results list. This mode, however, doesn't have more powerful features like specifying that a term must be included, or should not be included. Boolean mode allows some of these features, but does not provide a relevance score (I may be wrong about this, check the documentation), so there's a trade-off. Operators in Boolean mode include the '+' (boolean AND; must include), '-' (boolean NOT; must not include) and '( )' for grouping terms, i.e. "+Reagan -Bush (Iran Contra)". There are other operators that let you indicate an increased or decreased weighting, they are like + and - but not as strict; these operators are '>', '<', '~', etc. See more at http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html.
This discussion has been closed.
Replies
Naturally if you're using FullText, you're using some server back-end to generate your table, so this does not apply to static tables. You might also want to use server-side processing for pagination, etc. This was my goal in integrated DataTables with FullText.
In my code, I am using the default Natural Language mode, with a relevance score (to allow user to sort by this column if the data becomes unsorted). You can change your code as appropriate by specifying Boolean Mode in your query
[code]
SELECT headline, story
FROM news
WHERE MATCH (headline,story) AGAINST ('+hurricane -"New Orleans"' IN BOOLEAN MODE);
[/code]
There are two main parts of code to consider. The client side (javascript) and server side (I'm using PHP based on Allan's sample code here: http://datatables.net/examples/data_sources/server_side.html).
... CLIENT SIDE ...
On the client side, I need to:
1. create a default column for cases where I have no keywords
2. alter the column when there are keywords
In my client-side column definitions when intializing DataTables, I set up a column for this relevance score:
[code]
oTable = $('#example').dataTable( {
"bProcessing": true,
"bServerSide": true,
"sAjaxSource": "back-end-server-side-code.php",
"aoColumns": [
{ "sName": "id", "sTitle" : 'ID' },
// ... more columns here
{ "sName": "0 as relevance", sTitle: "relevance" }
});
[/code]
So for any queries to the back-end, DataTables sends column "0 as relevance". This represents the case where there are no search words in the search textbox. Since there are no words, all rows will have 0 relevance score, but should still be returned. They are not filtered. When search terms are typed into the search bar, I need some client-side code to change this column's sName definition to a MATCH .. AGAINST clause:
[code]
// set up filter for mediaquery (main search)
$("#search-textbox").keyup( function () {
var col = set_relevance_column( this.value ); // set relevance column is a convenience function to find the col number of "relevance" and update sName for that column
if (col > -1)
oTable.fnSort([ [col,'desc'], [1,'asc'] ]); // you could add a function here to delay/collate calls to avoid excessive AJAX calls. I did, but took it out for this example
} );
function set_relevance_column(inputstring) {
if (inputstring.length < 3) relevance = "0 as relevance";
// note: "|" will be converted to comma in back-end-server-side-code.php
// this is because I explode the db fields input parameter passed to my PHP code by commas but don't want these split
// and of course my column names are not "field2", "field3", etc. replace these as needed
else relevance = "MATCH(id| field2| field3| field4| field5| field6| field7) AGAINST('" + inputstring+"') as relevance";
// get the datatable column and modify the sTitle
oSettings = oTable.fnSettings();
var col = -1;
for (i=0; i= 0) {
oSettings.aoColumns[col].sName = relevance;
}
if (inputstring.length < 3) return 0; // for lack of better idea, sort by index in col 0 if we have insufficient keywords. you might return -1 instead to avoid any AJAX call or sorting
return col;
}
[/code]
I am using the example server-side php as the base for my server code. You can find it at http://datatables.net/examples/data_sources/server_side.html
So now on the server side, I need to do a few things:
1. I used '|' instead of ',' in the MATCH .. AGAINST clause in set_relevance_column() above, so I'll need to remember to convert those to commas.
2. I can use the MATCH .. AGAINST code fine in SELECT and WHERE clauses, but not in the ORDER BY so make sure to modify the value before adding it to ORDER BY
3. When looping through the MySQL result rows, remember that we changed the column name in SELECT, so retrieve the value with the changed name.
Here I find the MATCH .. AGAINST column
[code]
//check if using MATCH query
for ($i = 0; $i < count($aColumns); $i++) {
if (strpos($aColumns[$i], "MATCH(id") !== false) {
$aColumns[$i] = str_replace('|', ',', $aColumns[$i]);
$fullText = true;
$fullText_order = " relevance desc "; // default to descending since higher scores are better
$fullText_where = str_replace(" as relevance", "", $aColumns[$i]); // for our WHERE clause, strip off the alias
$aColumns[$i] = "round(".$fullText_where.", 3) as relevance"; // since I'm displaying the score to user, round it to 3 decimal places
}
}
[/code]
Set up using the code in ORDER BY
[code]
/*
* Ordering
*/
$sOrder = ""; // default value
if ( isset( $_GET['iSortCol_0'] ) )
{
$sOrder = "ORDER BY ";
for ( $i=0 ; $i -1) {
$sOrder .= " relevance desc, ";
} else {
$sOrder .= $aColumns[ intval( $_GET['iSortCol_'.$i] ) ]."
".mysql_real_escape_string( $_GET['sSortDir_'.$i] ) .", ";
}
}
}
$sOrder = substr_replace( $sOrder, "", -2 );
if ( $sOrder == "ORDER BY" )
{
$sOrder = "";
}
}
[/code]
Some alteration to the WHERE clause as well. Most of the fields are using OR; the FullText search should be ANDed to them.
[code]
$sWhere = "";
if ( isset($_GET['sSearch']) && $_GET['sSearch'] != "" )
{
$sWhere = "WHERE (";
for ( $i=0 ; $i