What does _fnHtmlDecode do? / Bug

randomuser · September 2012

Hello there,

I have the following string: "Earth Wind & Fire".
During the TableTools Excel export it is passed through the _fnHtmlDecode function.
What comes out is "Earth Wind ".
That's because this function drops everything after an ampersand if that ampersand is within the last 8 chars of a string (in reality, the last 8 chars of any 2048 char chunk).
The issue can be fixed if you don't split the string with _fnChunkData() but with sData.split('&');
Then take the first 8 chars of each split element (the ampersand plus the next 7), set them as innerHTML and read the result back with nodeValue. The rest of the element can simply be appended again. Simplified code:
[code]
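var string=string.split('&');
while(i<string.length){
  n.innerHTML='&'+string[i].substr(0,7);
  sReturn+=n.childNodes[0].nodeValue+(string[i].length>7?string[i].substr(7):'');
  i++; // assumes sReturn=string[0] and i=1 were set up beforehand
}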
[/code]

allan · September 2012

Hi,

Thanks for finding and letting me know about that! Unfortunately, I don't think we can just split on an &. The chunking is required to keep the data size down, as nodeValue has a limited size, and it is possible that an HTML entity might not occur in the string until after the limit.

However, this is certainly a bug - the bug was that I wasn't doing anything with the `sInner` value of the HTML entity that might have been found. It is now appended to the next item in the array of chunked data (or inserts an item into the array if needed).
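
For anyone following along, here is a rough sketch of that carry-over idea. It is not the actual TableTools code: `chunkData`, `htmlDecodeChunked` and the injected `decodeChunk` callback are names made up for illustration, and the callback stands in for the innerHTML/nodeValue step so the logic can be tried outside a browser.

```javascript
// Hypothetical helper: cut sData into pieces of at most iSize characters.
function chunkData(sData, iSize) {
  var aChunks = [];
  for (var i = 0; i < sData.length; i += iSize) {
    aChunks.push(sData.substr(i, iSize));
  }
  return aChunks;
}

// decodeChunk is an assumption: any function that decodes the entities in a
// short string (in a browser, set innerHTML on a div and read nodeValue).
function htmlDecodeChunked(sData, iSize, decodeChunk) {
  var aData = chunkData(sData, iSize);
  var sReturn = '';

  for (var i = 0; i < aData.length; i++) {
    var iIndex = aData[i].lastIndexOf('&');

    // An '&' in the last 8 chars might be the start of an entity that was
    // cut in half by the chunking, so hold that tail back...
    if (iIndex !== -1 && aData[i].length >= 8 && iIndex > aData[i].length - 8) {
      var sInner = aData[i].substr(iIndex);
      aData[i] = aData[i].substr(0, iIndex);

      // ...and give it to the next chunk instead of throwing it away.
      if (i + 1 < aData.length) {
        aData[i + 1] = sInner + aData[i + 1];
      } else {
        aData.push(sInner);
      }
    }

    sReturn += decodeChunk(aData[i]);
  }
  return sReturn;
}
```

With a stand-in decoder that only knows `&amp;`, the string "Earth Wind & Fire" comes back intact, because the held-back tail "& Fire" is processed as its own final chunk.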

This fix is now in the 2.1.4 development version, which is available as the nightly on the downloads page: http://datatables.net/download/

Regards,
Allan

randomuser · September 2012

Hi Allan.

Great how fast you respond to such things!
Did you consider that .split('&') would keep each of the actual entities right at the beginning of the split sub-strings?
e.g. string="this is &alpha; very long string with special chars like &amp; and [imagine amount of chars that exceeds nodeValue's limit here] blahblah &amp; even more text"

would result in an array where:
string[0]="this is "
string[1]="alpha; very long string with special chars like "
string[2]="amp; and [OVER 9000 CHARS] blahblah "
string[3]="amp; even more text"

and with the code I posted above, once string[2] is being processed the resulting code would be:
[code]
n.innerHTML='&'+string[i].substr(0,7); // '&amp; an'
sReturn+=n.childNodes[0].nodeValue+(string[i].length>7?string[i].substr(7):''); // '& and [OVER ... '
[/code]
That way no more than 8 chars are ever loaded into the Element and therefore nodeValue.

Or am I the one getting things wrong here?

Kind regards
Eric

allan · September 2012

> Great how fast you respond to such things!

Heh - don't like the idea of a data corruption bug there. The worrying thing is how long that bug has been there!

Ah! I see - thanks for clearing that up. I didn't get it the first time :-). Yes, that does look like a nice idea. I wonder if it would be possible to do a regex split on the string to split on the entity, rather than using a magic 7? Something like split(/(&.*?;)/), then use the result to do the decoding. I'm not sure how to get the matching elements though...! Worth looking into, and if that doesn't work out, your solution looks good and is much less code than mine :-)
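
On getting at the matching elements: in current JavaScript engines, split() with a capturing group keeps the captured text in the result array, so the entities end up at the odd indexes (old versions of IE reportedly dropped them). A small sketch, where `splitOnEntities`, `decodeBySplit` and the `decodeEntity` callback are hypothetical names, and the pattern is deliberately tighter than `&.*?;` so that a bare "&" followed by text is left alone:

```javascript
// Split so that entities such as &amp; or &#39; become their own array items.
// The capturing group is what makes split() keep the matched separators.
function splitOnEntities(sData) {
  return sData.split(/(&[#\w]+;)/);
}

// decodeEntity is an assumption: any function that turns one entity into
// its character (for example via innerHTML/nodeValue in a browser).
function decodeBySplit(sData, decodeEntity) {
  var aParts = splitOnEntities(sData);
  // Even indexes hold plain text, odd indexes hold the captured entities.
  for (var i = 1; i < aParts.length; i += 2) {
    aParts[i] = decodeEntity(aParts[i]);
  }
  return aParts.join('');
}

var aDemo = splitOnEntities('Tom &amp; Jerry &#39;99');
// aDemo is ['Tom ', '&amp;', ' Jerry ', '&#39;', '99']
```

Only the short entity strings ever need to go through the decoder this way, so the nodeValue size limit should not come into play.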

Allan

randomuser · September 2012

Hey Allan,

I'm not too much of a fan of RegEx. Different syntaxes freak me out.
BUT: Out of curiosity I took the time to create a jsperf.com test case.
RegEx is 20x faster. See it here:
http://jsperf.com/html-numeric-entity-decode-regex-vs-split-nodevalue
Please note the test is slightly flawed: I couldn't find a regular expression that did exactly what we need (only one that replaces numeric entities such as &#39;) and can't be bothered writing it myself. And once you have it, it will probably run a lot slower than the example in this test case. But still faster than split+nodeValue...

allan · September 2012

Fantastic! I think I'll be implementing your short regex solution into TableTools this evening :-)

Allan

randomuser · September 2012

Did you notice that this won't replace regular HTML entities such as &amp;?
The example only covers numeric, decimal (unicode) entities.
http://en.wikipedia.org/wiki/Numeric_character_reference

Also, I realized that a completely simple RegEx solution won't be possible: RegEx itself doesn't know what to do with &amp; -- we'll have to map that to the actual ampersand ourselves. Considering most of the performance problems with the split+nodeValue example come from split(), we can try using nodeValue to complete what RegEx alone can't do.

LONG STORY SHORT:
You can find the complete hybrid solution, which is only 15-20% slower than the numeric RegEx replace alone, at
http://jsperf.com/html-numeric-entity-decode-regex-vs-split-nodevalue
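
The real code is on the jsperf page; what follows is only a sketch of the general idea under my own assumptions, not a copy of it. Numeric entities are handled directly in a single regex pass, and named ones are passed to a `decodeNamed` callback. `htmlDecodeHybrid`, `domDecode` and `tableDecode` are made-up names, and `tableDecode` is just a tiny lookup table so the example can run without a DOM.

```javascript
// One pass over the string: group 1 is a decimal code, group 2 a name.
function htmlDecodeHybrid(sData, decodeNamed) {
  return sData.replace(/&(?:#(\d+)|([a-z][a-z0-9]*));/gi,
    function (sMatch, sCode, sName) {
      if (sCode) {
        // Numeric entity: no DOM needed. fromCharCode covers code
        // points up to 0xFFFF only.
        return String.fromCharCode(parseInt(sCode, 10));
      }
      // Named entity: RegEx alone can't know the mapping, so delegate.
      return decodeNamed(sMatch);
    });
}

// Browser-only decoder (assumption): let the DOM resolve the entity.
// Only a few characters ever reach nodeValue this way.
function domDecode(sEntity) {
  var n = document.createElement('div');
  n.innerHTML = sEntity;
  return n.childNodes[0].nodeValue;
}

// Stand-in decoder for use outside a browser; unknown names pass through.
function tableDecode(sEntity) {
  var oMap = { '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"' };
  return oMap.hasOwnProperty(sEntity) ? oMap[sEntity] : sEntity;
}

var sDemo = htmlDecodeHybrid('Earth Wind &amp; Fire &#39;75', tableDecode);
// sDemo is "Earth Wind & Fire '75"
```

Doing both kinds in one replace() also avoids decoding twice: `&#38;amp;` becomes `&amp;` and is then left alone.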

randomuser · September 2012

I added the current code (from the 2.1.4 nightly build, with sInner being appended to the next node) to the jsperf.com test suite:
http://jsperf.com/html-numeric-entity-decode-regex-vs-split-nodevalue/2
It performs around 40% slower than the hybrid solution in FF, but only 7% slower in Chrome.

allan · September 2012

Absolutely fantastic! Thank you for looking into this in such detail. I've just implemented your hybrid solution into TableTools. Smaller code and faster - the perfect patch. Thank you :-)

Allan
