#610 — Web harness breaks test cases with supplementary characters


The following test case contains a Unicode supplementary character as the first character of the first string:

var strings = ["


It looks like those characters break bugs.ecmascript.org too, as this bug appears to be truncated :-/

Can you articulate the issue without characters this tool can't handle (or just email me directly)? Thanks.


Supplementary characters are those in the Unicode supplementary planes, i.e. at code points U+10000 or higher.
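For readers unfamiliar with how JavaScript strings hold such characters, here is a minimal sketch of the UTF-16 surrogate pair arithmetic. The character U+1D306 and the helper names `toSurrogatePair` and `fromSurrogatePair` are illustrative assumptions; they are not taken from the original test case or from the harness.

```javascript
// U+1D306 is used only as an example supplementary character;
// it is not the character from the original (truncated) test case.
var supplementary = "\uD834\uDF06";

// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // top 10 bits of the offset
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits of the offset
  return [high, low];
}

// Hypothetical helper: combine a surrogate pair back into a code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

// A single supplementary character occupies two UTF-16 code units.
console.log(supplementary.length); // 2
```

Any code in the harness that treats one string element as one character would therefore see two separate surrogate values for such a character.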

The issue with Bugzilla itself is likely due to what is discussed here:
http://mathiasbynens.be/notes/mysql-utf8mb4


Actually I've reproduced the issue. If I put a character outside the BMP (Unicode code point > U+FFFF) inside a test case, running the test fails with "Uncaught SyntaxError: Unexpected token ILLEGAL". It looks like we're not encoding surrogate pairs properly in the JSON files. I'll dig into it.


Looks like a bug in jquery.base64.js: it can't handle UTF-8 encodings longer than 3 bytes (required for code points above U+FFFF). I've made a fix and sent the patch to Norbert to verify.
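To illustrate where the 3-byte limit comes from, here is a rough sketch of UTF-8 encoding for a single code point, including the 4-byte form. This is not the code from jquery.base64.js or from the patch; `encodeCodePointUtf8` is a hypothetical name chosen for the example.

```javascript
// Hypothetical helper: encode one Unicode code point as an array of
// UTF-8 byte values. Illustrative only, not the patched library code.
function encodeCodePointUtf8(cp) {
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError("Not a Unicode scalar value: " + cp);
  }
  if (cp < 0x80) {
    return [cp];                                   // 1 byte: U+0000..U+007F
  }
  if (cp < 0x800) {
    return [0xC0 | (cp >> 6),                      // 2 bytes: U+0080..U+07FF
            0x80 | (cp & 0x3F)];
  }
  if (cp < 0x10000) {
    return [0xE0 | (cp >> 12),                     // 3 bytes: U+0800..U+FFFF
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)];
  }
  return [0xF0 | (cp >> 18),                       // 4 bytes: U+10000..U+10FFFF
          0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F),
          0x80 | (cp & 0x3F)];
}
```

With this sketch, U+FFFF still fits in three bytes (EF BF BF), while U+10000 is the first code point that needs four (F0 90 80 80). An encoder or decoder that stops at the 3-byte branch cannot represent anything outside the BMP.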


Created attachment 24
test case including supplementary character


I've added the test file for posterity.

The data loss in Bugzilla is apparently caused by this bug:
https://bugzilla.mozilla.org/show_bug.cgi?id=405011

Bill's fix solves the immediate problem, but isn't fully compliant with the UTF-8 spec (Unicode 6.1, section 3.9, pages 94-97). Not all byte sequences are allowed in UTF-8: code points end at U+10FFFF, so sequences representing higher values are illegal, and "overlong" sequences, which represent code points that could be represented using shorter sequences, are also illegal. Accepting illegal sequences can be a security issue, so a parser should throw exceptions for them, or at least replace them with U+FFFD.

The one thing we definitely need is a limitation of code points to U+10FFFF.
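As a sketch of what the stricter checks could look like, the decoder below rejects overlong sequences and values above U+10FFFF by throwing. It is not the committed fix; `decodeUtf8Strict` is a hypothetical name, and the additional rejection of surrogate code points (U+D800..U+DFFF) follows the Unicode definition of well-formed UTF-8 rather than anything stated in this thread.

```javascript
// Hypothetical helper: decode an array of UTF-8 byte values into an
// array of code points, throwing on any ill-formed sequence.
function decodeUtf8Strict(bytes) {
  var codePoints = [];
  var i = 0;
  while (i < bytes.length) {
    var b0 = bytes[i];
    var needed, cp, min;
    if (b0 < 0x80) { needed = 0; cp = b0; min = 0; }
    else if (b0 >= 0xC0 && b0 < 0xE0) { needed = 1; cp = b0 & 0x1F; min = 0x80; }
    else if (b0 >= 0xE0 && b0 < 0xF0) { needed = 2; cp = b0 & 0x0F; min = 0x800; }
    else if (b0 >= 0xF0 && b0 < 0xF8) { needed = 3; cp = b0 & 0x07; min = 0x10000; }
    else { throw new Error("Invalid lead byte at index " + i); }

    if (i + needed >= bytes.length) {
      throw new Error("Truncated sequence at index " + i);
    }
    for (var k = 1; k <= needed; k++) {
      var b = bytes[i + k];
      if ((b & 0xC0) !== 0x80) {
        throw new Error("Invalid continuation byte at index " + (i + k));
      }
      cp = (cp << 6) | (b & 0x3F);
    }

    // Overlong: the value would have fit in a shorter sequence.
    if (cp < min) { throw new Error("Overlong sequence at index " + i); }
    // Unicode code points end at U+10FFFF.
    if (cp > 0x10FFFF) { throw new Error("Code point above U+10FFFF at index " + i); }
    // Surrogates are not valid in UTF-8 (assumption beyond this thread).
    if (cp >= 0xD800 && cp <= 0xDFFF) { throw new Error("Surrogate at index " + i); }

    codePoints.push(cp);
    i += needed + 1;
  }
  return codePoints;
}
```

A more lenient variant could push U+FFFD instead of throwing, as suggested above; throwing is used here only because it makes the failure cases easy to see.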


Committed Bill's fix with additional check for valid code points.