Stage 2 Draft / July 5, 2021

Support for properties of strings in Unicode property escapes in regular expressions

Note: This proposal has been subsumed by the RegExp set notation + properties of strings proposal.

The syntax listed in 21.2.1 Patterns is modified as follows.

LeadSurrogate ::
  Hex4Digits but only if the SV of Hex4Digits is in the inclusive range 0xD800 to 0xDBFF

TrailSurrogate ::
  Hex4Digits but only if the SV of Hex4Digits is in the inclusive range 0xDC00 to 0xDFFF

NonSurrogate ::
  Hex4Digits but only if the SV of Hex4Digits is not in the inclusive range 0xD800 to 0xDFFF

IdentityEscape[U] ::
  [+U] SyntaxCharacter
  [+U] /
  [~U] SourceCharacter but not UnicodeIDContinue

DecimalEscape ::
  NonZeroDigit DecimalDigits opt [lookahead ∉ DecimalDigit]

CharacterClassEscape[U, InCharacterClass] ::
  d
  D
  s
  S
  w
  W
  [+InCharacterClass, +U] p{UnicodePropertyValueExpression}
  [~InCharacterClass, +U] p{UnicodePropertyValueOrSequenceExpression}
  [+U] P{UnicodePropertyValueExpression}

UnicodePropertyValueExpression ::
  UnicodePropertyName = UnicodePropertyValue
  LoneUnicodePropertyNameOrValue

UnicodePropertyValueOrSequenceExpression ::
  UnicodePropertyName = UnicodePropertyValue
  LoneUnicodePropertyNameOrValue

UnicodePropertyName ::
  UnicodePropertyNameCharacters

UnicodePropertyNameCharacters ::
  UnicodePropertyNameCharacter UnicodePropertyNameCharacters opt

UnicodePropertyValue ::
  UnicodePropertyValueCharacters

LoneUnicodePropertyNameOrValue ::
  UnicodePropertyValueCharacters

UnicodePropertyValueCharacters ::
  UnicodePropertyValueCharacter UnicodePropertyValueCharacters opt

UnicodePropertyValueCharacter ::
  UnicodePropertyNameCharacter
  0
  1
  2
  3
  4
  5
  6
  7
  8
  9

UnicodePropertyNameCharacter ::
  ControlLetter
  _

CharacterClass[U] ::
  [ [lookahead ∉ { ^ }] ClassRanges[?U] ]
  [^ ClassRanges[?U] ]

ClassRanges[U] ::
  [empty]
  NonemptyClassRanges[?U]

NonemptyClassRanges[U] ::
  ClassAtom[?U]
  ClassAtom[?U] NonemptyClassRangesNoDash[?U]
  ClassAtom[?U] - ClassAtom[?U] ClassRanges[?U]

NonemptyClassRangesNoDash[U] ::
  ClassAtom[?U]
  ClassAtomNoDash[?U] NonemptyClassRangesNoDash[?U]
  ClassAtomNoDash[?U] - ClassAtom[?U] ClassRanges[?U]

ClassAtom[U] ::
  -
  ClassAtomNoDash[?U]

ClassAtomNoDash[U] ::
  SourceCharacter but not one of \ or ] or -
  \ ClassEscape[?U]

ClassEscape[U] ::
  b
  [+U] -
  CharacterClassEscape[?U, InCharacterClass]
  CharacterEscape[?U]
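
The grammar parameters U and InCharacterClass decide which property expression production a \p or \P escape uses. The following is a minimal Python sketch of that selection, read directly off the CharacterClassEscape alternatives. The function name and its arguments are hypothetical and exist only for illustration; they are not part of the specification.

```python
from typing import Optional


def property_expression_production(
    escape_letter: str, unicode_mode: bool, in_character_class: bool
) -> Optional[str]:
    """Name the production used inside the braces of \\p{...} or \\P{...}.

    Returns None when no property escape alternative applies.
    """
    if not unicode_mode:
        # Every property escape alternative is guarded by [+U].
        return None
    if escape_letter == "P":
        # [+U] P{UnicodePropertyValueExpression}
        return "UnicodePropertyValueExpression"
    if escape_letter == "p":
        if in_character_class:
            # [+InCharacterClass, +U] p{UnicodePropertyValueExpression}
            return "UnicodePropertyValueExpression"
        # [~InCharacterClass, +U] p{UnicodePropertyValueOrSequenceExpression}
        return "UnicodePropertyValueOrSequenceExpression"
    return None


outside_class = property_expression_production("p", True, False)
inside_class = property_expression_production("p", True, True)
negated = property_expression_production("P", True, False)
```

Under this reading, only a \p escape outside a character class reaches the production that can yield a set of sequences.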

21.2.1.1 Static Semantics: Early Errors is changed as follows:

UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue

UnicodePropertyValueOrSequenceExpression :: UnicodePropertyName = UnicodePropertyValue

UnicodePropertyValueOrSequenceExpression :: LoneUnicodePropertyNameOrValue

21.2.2.8.3 Runtime Semantics: UnicodeMatchProperty is changed as follows:

Implementations must support the Unicode property names and aliases listed in Table 51, Table 52, and Table 1. To ensure interoperability, implementations must not support any other property names or aliases.

Table 1: Unicode sequence property aliases and their canonical property names
Property name and aliases      Canonical property name
Basic_Emoji                    Basic_Emoji
RGI_Emoji_Modifier_Sequence    RGI_Emoji_Modifier_Sequence
RGI_Emoji_Tag_Sequence         RGI_Emoji_Tag_Sequence
RGI_Emoji_ZWJ_Sequence         RGI_Emoji_ZWJ_Sequence
RGI_Emoji                      RGI_Emoji
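
As an illustration of the lookup this implies, the Python sketch below resolves a property name against three tables and rejects anything unlisted. Only the Table 1 entries come from this document. The entries standing in for Table 51 and Table 52 are a small illustrative subset, and raising SyntaxError is a simplification of the specification's early error plus assertion.

```python
# Illustrative subset standing in for Table 51 (non-binary properties).
NON_BINARY_SUBSET = {
    "General_Category": "General_Category",
    "gc": "General_Category",
    "Script": "Script",
    "sc": "Script",
}

# Illustrative subset standing in for Table 52 (binary properties).
BINARY_SUBSET = {
    "ASCII": "ASCII",
    "Alphabetic": "Alphabetic",
    "Alpha": "Alphabetic",
}

# Table 1: every sequence property is its own canonical name.
SEQUENCE_PROPERTIES = {
    name: name
    for name in (
        "Basic_Emoji",
        "RGI_Emoji_Modifier_Sequence",
        "RGI_Emoji_Tag_Sequence",
        "RGI_Emoji_ZWJ_Sequence",
        "RGI_Emoji",
    )
}


def unicode_match_property(name: str) -> str:
    """Return the canonical property name, or reject an unlisted name."""
    for table in (NON_BINARY_SUBSET, BINARY_SUBSET, SEQUENCE_PROPERTIES):
        if name in table:
            return table[name]
    raise SyntaxError("unsupported Unicode property name: " + name)


canonical_script = unicode_match_property("sc")
canonical_emoji = unicode_match_property("RGI_Emoji")
```

The lookup is an exact string comparison, so a differently cased spelling such as rgi_emoji is rejected because it is not listed in any table.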

21.2.2.1 Notation is changed as follows:

Furthermore, the descriptions below use the following internal data structures:


21.2.2.12 CharacterClassEscape is changed as follows:

The production CharacterClassEscape::p{UnicodePropertyValueOrSequenceExpression} evaluates as follows:

  1. Let v be the result of evaluating UnicodePropertyValueOrSequenceExpression.
  2. If v is a CharSet, then
    1. Return the CharSet containing all Unicode code points included in v.
  3. Assert: v is a SequenceSet.
  4. Return the Disjunction containing an Alternative for each of the Unicode code point sequences in v.
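
The steps above can be sketched in Python as follows. CharSet and SequenceSet are modelled as small hypothetical classes, the Disjunction is rendered as a Python `re` alternation, and the sample data is a hand-picked pair of emoji rather than real property data. Ordering longer sequences first is an assumption of this sketch; the steps above do not state an order for the Alternatives.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CharSet:
    code_points: frozenset  # set of int code points


@dataclass(frozen=True)
class SequenceSet:
    sequences: frozenset  # set of str, one str per code point sequence


def evaluate_property_escape(v):
    """Pass a CharSet through; expand a SequenceSet into an alternation."""
    if isinstance(v, CharSet):
        return v
    assert isinstance(v, SequenceSet)
    # Assumption: try longer sequences first so that a modifier sequence
    # is not cut short by its one-code-point prefix.
    ordered = sorted(v.sequences, key=lambda s: (-len(s), s))
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"


THUMBS_UP = "\U0001F44D"
THUMBS_UP_MEDIUM = "\U0001F44D\U0001F3FD"  # thumbs up + skin tone modifier

sample = SequenceSet(frozenset({THUMBS_UP, THUMBS_UP_MEDIUM}))
pattern = evaluate_property_escape(sample)
match = re.match(pattern, THUMBS_UP_MEDIUM + "!")
```

With the longest-first ordering, `match.group()` is the full two-code-point sequence rather than only the base emoji.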

The production CharacterClassEscape::P{UnicodePropertyValueExpression} evaluates by returning the CharSet containing all Unicode code points not included in the CharSet returned by UnicodePropertyValueExpression.
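
Because \P always receives a CharSet, its result is a plain set complement over all Unicode code points. A minimal Python sketch, with the helper name `negate` chosen here only for illustration:

```python
# Every Unicode code point, U+0000 through U+10FFFF.
ALL_CODE_POINTS = frozenset(range(0x110000))


def negate(char_set):
    """Return the code points that are not in char_set."""
    return ALL_CODE_POINTS - frozenset(char_set)


ascii_digits = frozenset(range(ord("0"), ord("9") + 1))
non_digits = negate(ascii_digits)
```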

The productions UnicodePropertyValueExpression::UnicodePropertyName=UnicodePropertyValue and UnicodePropertyValueOrSequenceExpression::UnicodePropertyName=UnicodePropertyValue evaluate as follows:

  1. Let ps be SourceText of UnicodePropertyName.
  2. Let p be ! UnicodeMatchProperty(ps).
  3. Assert: p is a Unicode property name or property alias listed in the “Property name and aliases” column of Table 51.
  4. Let vs be SourceText of UnicodePropertyValue.
  5. Let v be ! UnicodeMatchPropertyValue(p, vs).
  6. Return the CharSet containing all Unicode code points whose character database definition includes the property p with value v.
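
A rough Python sketch of these steps for the General_Category property, using the standard `unicodedata` module as a stand-in for the character database. The alias dictionaries are a small illustrative subset, the value aliases are mapped to the short codes that `unicodedata.category` returns (not to the canonical names the specification uses), and the Unicode version bundled with Python may differ from the one an ECMAScript implementation targets.

```python
import unicodedata

# Illustrative subset of property aliases (stand-in for the spec table).
PROPERTY_ALIASES = {
    "General_Category": "General_Category",
    "gc": "General_Category",
}

# Illustrative subset of value aliases, mapped to unicodedata's short codes.
GC_VALUE_ALIASES = {
    "Uppercase_Letter": "Lu",
    "Lu": "Lu",
    "Decimal_Number": "Nd",
    "Nd": "Nd",
}


def evaluate_name_value(ps, vs, code_points=range(0x110000)):
    """Return the code points whose property ps has the value vs."""
    p = PROPERTY_ALIASES[ps]  # plays the role of UnicodeMatchProperty
    assert p == "General_Category"  # the only property modelled here
    v = GC_VALUE_ALIASES[vs]  # plays the role of UnicodeMatchPropertyValue
    return frozenset(
        cp for cp in code_points if unicodedata.category(chr(cp)) == v
    )


# Restricting the scan to ASCII keeps the example fast.
ascii_upper = evaluate_name_value("gc", "Lu", range(0x80))
```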

The production UnicodePropertyValueOrSequenceExpression::LoneUnicodePropertyNameOrValue evaluates as follows:

  1. Let s be SourceText of LoneUnicodePropertyNameOrValue.
  2. If ! UnicodeMatchPropertyValue("General_Category", s) is identical to a List of Unicode code points that is the name of a Unicode general category or general category alias listed in the “Property value and aliases” column of Table 53, then
    1. Return the CharSet containing all Unicode code points whose character database definition includes the property “General_Category” with value s.
  3. If s is identical to a List of Unicode code points that is the name of a Unicode sequence property or sequence property alias listed in the “Property name and aliases” column of Table 1, then
    1. Return the SequenceSet containing each Unicode code point sequence included in the Unicode property s.
  4. Let p be ! UnicodeMatchProperty(s).
  5. Assert: p is a binary Unicode property or binary property alias listed in the “Property name and aliases” column of Table 52.
  6. Return the CharSet containing all Unicode code points whose character database definition includes the property p with value “True”.
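
The order of the checks matters: a lone name is tried as a General_Category value first, then as a sequence property, and only then as a binary property. The Python sketch below models that order and returns a description of the result instead of the actual set. The general category and binary property entries are a small illustrative subset, and raising SyntaxError stands in for the specification's early error.

```python
# Illustrative subset of General_Category values and aliases.
GENERAL_CATEGORY_VALUES = {
    "Letter": "Letter",
    "L": "Letter",
    "Uppercase_Letter": "Uppercase_Letter",
    "Lu": "Uppercase_Letter",
}

# Table 1 of this document.
SEQUENCE_PROPERTIES = {
    "Basic_Emoji",
    "RGI_Emoji_Modifier_Sequence",
    "RGI_Emoji_Tag_Sequence",
    "RGI_Emoji_ZWJ_Sequence",
    "RGI_Emoji",
}

# Illustrative subset of binary properties and aliases.
BINARY_PROPERTIES = {
    "ASCII": "ASCII",
    "Alphabetic": "Alphabetic",
    "Alpha": "Alphabetic",
}


def resolve_lone_name(s):
    """Return (result kind, property, value) for a lone name or value."""
    if s in GENERAL_CATEGORY_VALUES:  # step 2
        return ("CharSet", "General_Category", GENERAL_CATEGORY_VALUES[s])
    if s in SEQUENCE_PROPERTIES:  # step 3
        return ("SequenceSet", s, None)
    if s in BINARY_PROPERTIES:  # steps 4 to 6
        return ("CharSet", BINARY_PROPERTIES[s], "True")
    raise SyntaxError("unknown property name or value: " + s)


as_category = resolve_lone_name("Lu")
as_sequence = resolve_lone_name("RGI_Emoji")
as_binary = resolve_lone_name("Alpha")
```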