#610 — Web harness breaks test cases with supplementary characters

bug_id: 610
creation_ts: 2012-08-08 17:52:00 -0700
short_desc: Web harness breaks test cases with supplementary characters
delta_ts: 2012-10-09 22:24:13 -0700
product: Test262
component: Test Harness
version: unspecified
rep_platform: All
op_sys: All
bug_status: RESOLVED
resolution: FIXED
priority: Normal
bug_severity: major
everconfirmed: true
reporter: Norbert
assigned_to: Bill Ticehurst
cc: gphemsley

commentid: 1412
comment_count: 0
who: Norbert
bug_when: 2012-08-08 17:52:54 -0700

The following test case contains a Unicode supplementary character as the first character of the first string:

var strings = ["

commentid: 1417
comment_count: 1
who: Bill Ticehurst
bug_when: 2012-08-09 12:52:36 -0700

It looks like those characters break bugs.ecmascript.org too, as this bug appears to be truncated :-/

Can you articulate the issue without characters this tool can't handle (or just email me directly)? Thanks.

commentid: 1418
comment_count: 2
who: Gordon P. Hemsley
bug_when: 2012-08-09 13:28:38 -0700

The characters in the Unicode supplementary planes are those at code points U+10000 or higher.
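
For illustration, here is a minimal sketch of how such a code point maps onto the UTF-16 surrogate pair that JavaScript strings use. The helper name `toSurrogatePair` is hypothetical and is not part of the harness.

```javascript
// Hypothetical helper, not taken from the Test262 harness.
// Splits a supplementary code point (U+10000..U+10FFFF) into the
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point: " + codePoint);
  }
  var offset = codePoint - 0x10000;      // 20-bit value
  var high = 0xD800 + (offset >> 10);    // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}
```

A string containing one such character therefore has a `length` of 2, which is why code that assumes one code unit per character can mishandle it.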

The issue with Bugzilla itself is likely due to what is discussed here:
http://mathiasbynens.be/notes/mysql-utf8mb4

commentid: 1419
comment_count: 3
who: Bill Ticehurst
bug_when: 2012-08-09 13:30:07 -0700

Actually I've reproduced the issue. If I put a character outside the BMP (Unicode code point > 0xFFFF) inside a test case, running the test fails with "Uncaught SyntaxError: Unexpected token ILLEGAL". It looks like we're not encoding surrogate pairs properly in the JSON files. I'll dig into it.

commentid: 1420
comment_count: 4
who: Bill Ticehurst
bug_when: 2012-08-09 18:06:19 -0700

Looks like a bug in jquery.base64.js: it can't handle UTF-8 sequences longer than 3 bytes (required for code points above U+FFFF). I've made a fix and sent the patch to Norbert to verify.
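
As a rough sketch of the 4-byte case (not the actual patch, which is not reproduced in this thread), a 4-byte UTF-8 sequence carries 3 + 6 + 6 + 6 payload bits. The function name `decodeFourByteSequence` is made up for this example, and it assumes the caller has already checked that the bytes are well formed.

```javascript
// Hypothetical sketch; jquery.base64.js itself is not reproduced here.
// Combines the payload bits of one 4-byte UTF-8 sequence into a code point.
// Assumes b0 has the form 11110xxx and b1..b3 have the form 10xxxxxx;
// no validation is performed.
function decodeFourByteSequence(b0, b1, b2, b3) {
  return ((b0 & 0x07) << 18) |
         ((b1 & 0x3F) << 12) |
         ((b2 & 0x3F) << 6) |
         (b3 & 0x3F);
}

// U+1F600 is encoded in UTF-8 as the bytes F0 9F 98 80.
var codePoint = decodeFourByteSequence(0xF0, 0x9F, 0x98, 0x80);
// String.fromCodePoint produces the two UTF-16 code units for it.
var text = String.fromCodePoint(codePoint);
```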

commentid: 1422
comment_count: 5
attachid: 24
who: Norbert
bug_when: 2012-08-10 11:32:51 -0700

Created attachment 24
test case including supplementary character

commentid: 1423
comment_count: 6
who: Norbert
bug_when: 2012-08-10 11:39:57 -0700

I've added the test file for posterity.

The data loss in bugzilla is apparently caused by this bug:
https://bugzilla.mozilla.org/show_bug.cgi?id=405011

Bill's fix solves the immediate problem, but isn't fully compliant with the UTF-8 spec (Unicode 6.1, chapter 3.9, pages 94-97). Not all byte sequences are allowed in UTF-8: code points end at U+10FFFF, so sequences representing higher values are illegal, and "overlong" sequences representing code points that can be represented using shorter sequences are also illegal. Accepting illegal sequences can be a security issue, so a parser should throw exceptions for them, or at least replace them with U+FFFD.

The one thing we definitely need is a limitation of code points to U+10FFFF.
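
To make the checks concrete, here is a minimal sketch of a decoder that throws on the illegal cases listed above. It is an illustration only, not the committed fix; the name `decodeUtf8Strict` and the choice to take an array of byte values are assumptions. It also rejects encoded surrogates (U+D800..U+DFFF), which the UTF-8 definition excludes as well.

```javascript
// Hypothetical strict UTF-8 decoder; not the code from the harness.
// Takes an array of byte values and returns a string, throwing an Error
// on any malformed sequence instead of silently accepting it.
function decodeUtf8Strict(bytes) {
  var out = "";
  var i = 0;
  while (i < bytes.length) {
    var b0 = bytes[i];
    var needed;     // number of continuation bytes expected
    var codePoint;  // payload bits accumulated so far
    var min;        // smallest code point allowed for this length
    if (b0 < 0x80) { needed = 0; codePoint = b0; min = 0; }
    else if (b0 >= 0xC0 && b0 < 0xE0) { needed = 1; codePoint = b0 & 0x1F; min = 0x80; }
    else if (b0 >= 0xE0 && b0 < 0xF0) { needed = 2; codePoint = b0 & 0x0F; min = 0x800; }
    else if (b0 >= 0xF0 && b0 < 0xF8) { needed = 3; codePoint = b0 & 0x07; min = 0x10000; }
    else { throw new Error("invalid lead byte at index " + i); }

    if (i + needed >= bytes.length) {
      throw new Error("truncated sequence at index " + i);
    }
    for (var k = 1; k <= needed; k++) {
      var b = bytes[i + k];
      if ((b & 0xC0) !== 0x80) {
        throw new Error("invalid continuation byte at index " + (i + k));
      }
      codePoint = (codePoint << 6) | (b & 0x3F);
    }

    // Overlong: the value would have fit in a shorter sequence.
    if (codePoint < min) {
      throw new Error("overlong sequence at index " + i);
    }
    // Code points end at U+10FFFF.
    if (codePoint > 0x10FFFF) {
      throw new Error("code point above U+10FFFF at index " + i);
    }
    // Surrogates are not valid in UTF-8 (extra check beyond the two above).
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
      throw new Error("encoded surrogate at index " + i);
    }

    out += String.fromCodePoint(codePoint);
    i += needed + 1;
  }
  return out;
}
```

A more lenient variant could append U+FFFD instead of throwing, at the cost of hiding malformed input from the caller.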

commentid: 1920
comment_count: 7
who: Norbert
bug_when: 2012-10-09 22:24:13 -0700

Committed Bill's fix with additional check for valid code points.
