Stage 3 Draft / June 20, 2017

Unicode property escapes in regular expressions

The syntax listed in 21.2.1 Patterns is modified as follows.

ControlLetter :: one of
  a b c d e f g h i j k l m n o p q r s t u v w x y z
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

CharacterClassEscape[U] ::
  d
  D
  s
  S
  w
  W
  [+U] p{ UnicodePropertyValueExpression }
  [+U] P{ UnicodePropertyValueExpression }

UnicodePropertyValueExpression ::
  UnicodePropertyName = UnicodePropertyValue
  LoneUnicodePropertyNameOrValue

UnicodePropertyNameCharacter ::
  ControlLetter
  _

UnicodePropertyNameCharacters ::
  UnicodePropertyNameCharacter UnicodePropertyNameCharacters opt

UnicodePropertyName ::
  UnicodePropertyNameCharacters

UnicodePropertyValueCharacter ::
  UnicodePropertyNameCharacter
  0
  1
  2
  3
  4
  5
  6
  7
  8
  9

UnicodePropertyValueCharacters ::
  UnicodePropertyValueCharacter UnicodePropertyValueCharacters opt

UnicodePropertyValue ::
  UnicodePropertyValueCharacters

LoneUnicodePropertyNameOrValue ::
  UnicodePropertyValueCharacters
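As an informal illustration of the grammar above, the following sketch shows the expression forms as they would be written in a regular expression literal. It assumes an engine that implements this proposal; the variable names are illustrative only and are not part of the proposal.

```javascript
// Assumes an engine that implements this proposal.
// The [+U] guard means the `u` flag is required for these escapes.

// UnicodePropertyName = UnicodePropertyValue form:
const greek = /\p{Script=Greek}/u;

// LoneUnicodePropertyNameOrValue form (here, a General_Category value):
const uppercaseLetter = /\p{Lu}/u;

// Negated form, written with an uppercase P:
const notUppercaseLetter = /\P{Lu}/u;

const greekMatches = greek.test("\u03B1"); // U+03B1 GREEK SMALL LETTER ALPHA
const upperMatches = uppercaseLetter.test("A");
const negatedMatches = notUppercaseLetter.test("A");
```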

The following items are appended to 21.2.1.1 Static Semantics: Early Errors.

UnicodePropertyName :: UnicodePropertyNameCharacters
UnicodePropertyValue :: UnicodePropertyValueCharacters
UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue
UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue

The following two abstract operations are appended to 21.2.2.8 Atom.

1 Static Semantics: UnicodeMatchProperty ( p )

The abstract operation UnicodeMatchProperty takes a string parameter p and performs the following steps:

  1. Assert: p strictly matches a known Unicode property name or property alias.
  2. Let property be the canonical, unaliased property name that p matches.
  3. Return property.

Implementations must support the following non-binary Unicode properties and their property aliases:

Additionally, implementations must support the following binary Unicode properties and their property aliases:

Note 1

The listed properties form a superset of what UTS18 RL1.2 requires.

To ensure interoperability, implementations must not extend Unicode property support to the remaining properties.

Implementations must only recognize the canonical property aliases listed in PropertyAliases.txt.

Note 2

For example, Script_Extensions (property name) and scx (property alias) are valid, but script_extensions or Scx aren’t.
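The strict matching described above can be sketched as a plain lookup. This is a minimal illustration, not a normative algorithm: the table `PROPERTY_ALIASES` and the function name `unicodeMatchProperty` are hypothetical, the table covers only a handful of properties, and returning `undefined` for an unknown name is a choice made for this sketch (the abstract operation itself asserts that a match exists).

```javascript
// Hypothetical, partial table: canonical property name -> property aliases.
// A real implementation would derive its table from PropertyAliases.txt.
const PROPERTY_ALIASES = new Map([
  ["General_Category", ["gc"]],
  ["Script", ["sc"]],
  ["Script_Extensions", ["scx"]],
  ["Alphabetic", ["Alpha"]],
]);

// Returns the canonical name for p, or undefined when p does not strictly
// match a known name or alias. The comparison is case-sensitive and does
// not ignore underscores, hyphens, or white space.
function unicodeMatchProperty(p) {
  for (const [canonical, aliases] of PROPERTY_ALIASES) {
    if (p === canonical || aliases.includes(p)) {
      return canonical;
    }
  }
  return undefined;
}
```

With this table, `unicodeMatchProperty("scx")` and `unicodeMatchProperty("Script_Extensions")` both resolve to the same canonical name, while `"script_extensions"` and `"Scx"` are not recognized.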

2 Static Semantics: UnicodeMatchPropertyValue ( p, v )

The abstract operation UnicodeMatchPropertyValue takes two string parameters p and v and performs the following steps:

  1. Assert: p is a canonical, unaliased Unicode property name.
  2. Assert: v strictly matches a known property value or property value alias for Unicode property p.
  3. Let value be the canonical, unaliased property value that v matches.
  4. Return value.

Implementations must support any existing property values and their aliases for the following Unicode properties as required by UTS18 RL1.2: General_Category, Script, and Script_Extensions.

Implementations must recognize only the canonical property values and property value aliases listed in PropertyValueAliases.txt.

Note 1

For example, Xpeo and Old_Persian are valid Script_Extensions values, but xpeo and Old Persian aren’t.

Note 2

This algorithm differs from the matching rules for symbolic values listed in UAX44: case, white space, U+002D (HYPHEN-MINUS), and U+005F (LOW LINE) are not ignored, and the Is prefix is not supported.
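To make the contrast with UAX44 concrete, the sketch below places a strict lookup next to a simplified loose-matching key. Everything here is illustrative: `VALUE_ALIASES`, `unicodeMatchPropertyValue`, and `looseKey` are hypothetical names, the table is a small excerpt of what PropertyValueAliases.txt contains, and `looseKey` is only an approximation of the UAX44 loose-matching rule, shown for comparison because this proposal does not use it.

```javascript
// Hypothetical, partial table: property -> canonical value -> value aliases.
const SCRIPT_VALUES = { Old_Persian: ["Xpeo"], Greek: ["Grek"] };
const VALUE_ALIASES = {
  General_Category: { Uppercase_Letter: ["Lu"], Letter: ["L"] },
  Script: SCRIPT_VALUES,
  Script_Extensions: SCRIPT_VALUES, // same value set as Script
};

// Strict matching, as this proposal requires: exact string comparison only.
// Returning undefined for an unknown value is a choice made for this sketch.
function unicodeMatchPropertyValue(p, v) {
  const values = VALUE_ALIASES[p] || {};
  for (const [canonical, aliases] of Object.entries(values)) {
    if (v === canonical || aliases.includes(v)) {
      return canonical;
    }
  }
  return undefined;
}

// Simplified loose matching in the style of UAX44, for contrast only:
// ignores case, white space, hyphens, underscores, and a leading "is".
function looseKey(s) {
  return s.toLowerCase().replace(/[\s_-]/g, "").replace(/^is/, "");
}
```

Under loose matching, `"Old Persian"` and `"Old_Persian"` produce the same key, whereas the strict lookup accepts only the exact spellings `"Old_Persian"` and `"Xpeo"`.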


The following is appended to the list of productions in 21.2.2.12 CharacterClassEscape.

The production CharacterClassEscape :: p{ UnicodePropertyValueExpression } evaluates by returning the CharSet containing all Unicode code points included in the CharSet returned by UnicodePropertyValueExpression.

The production CharacterClassEscape :: P{ UnicodePropertyValueExpression } evaluates by returning the CharSet containing all Unicode code points not included in the CharSet returned by UnicodePropertyValueExpression.
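Because the two productions return complementary CharSets, any single code point is matched by exactly one of them. The following sketch checks that for a few sample code points; it assumes an engine that implements this proposal, and the sample list and variable names are arbitrary choices for illustration.

```javascript
// Anchored patterns so that each test covers exactly one code point.
const letter = /^\p{L}$/u;
const nonLetter = /^\P{L}$/u;

// Each sample is a single code point, including one outside the BMP.
const samples = ["a", "1", "\u00E9", " ", "\u{1F600}"];

// For every sample, record which of the two patterns matches it.
const partition = samples.map((ch) => ({
  ch,
  inSet: letter.test(ch),
  inComplement: nonLetter.test(ch),
}));

// True when every sample is matched by exactly one of the two patterns.
const isPartition = partition.every((r) => r.inSet !== r.inComplement);
```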

The production UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue evaluates as follows:

  1. Let p be ! UnicodeMatchProperty(UnicodePropertyName).
  2. Assert: p is not the name of a binary property in Unicode.
  3. Let v be ! UnicodeMatchPropertyValue(p, UnicodePropertyValue).
  4. Return the CharSet containing all Unicode code points whose character database definition includes the property p with value v.
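As a rough illustration of these steps, a property alias and a property value alias resolve to the same canonical pair, so the two spellings below should describe the same CharSet. The sketch assumes an engine that implements this proposal; the sample strings are arbitrary.

```javascript
// Canonical name and value:
const byName = /^\p{Script=Greek}+$/u;
// Property alias (sc) and property value alias (Grek):
const byAlias = /^\p{sc=Grek}+$/u;

// "logos" written in Greek letters, using escapes to avoid any
// ambiguity about the code points involved.
const greekWord = "\u03BB\u03BF\u03B3\u03BF\u03C2";
const latinWord = "logos";

const nameMatchesGreek = byName.test(greekWord);
const aliasMatchesGreek = byAlias.test(greekWord);
const nameMatchesLatin = byName.test(latinWord);
```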

The production UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue evaluates as follows:

  1. If ! UnicodeMatchPropertyValue("General_Category", LoneUnicodePropertyNameOrValue) is the name of a general category in Unicode, then
    1. Return the CharSet containing all Unicode code points whose character database definition includes the property General_Category with value LoneUnicodePropertyNameOrValue.
  2. Let p be ! UnicodeMatchProperty(LoneUnicodePropertyNameOrValue).
  3. Assert: p is the name of a binary property in Unicode.
  4. Return the CharSet containing all Unicode code points whose character database definition includes the property p with value True.
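The resolution order can be observed from script code: a lone General_Category value behaves like its explicit form, a lone binary property name is accepted, and a lone value of any other property (for example a Script value) is rejected when the pattern is created. The sketch assumes an engine that implements this proposal; `isValidUnicodePattern` is a hypothetical helper written for this example.

```javascript
// Hypothetical helper: reports whether the engine accepts a pattern
// source when compiled with the `u` flag.
function isValidUnicodePattern(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected
  }
}

// Step 1: a lone General_Category value or value alias.
const loneCategory = /^\p{Lu}$/u;
const explicitCategory = /^\p{General_Category=Lu}$/u;

// Remaining steps: a lone binary property name.
const loneBinary = /^\p{Alphabetic}$/u;

const results = {
  loneCategory: loneCategory.test("A"),
  explicitCategory: explicitCategory.test("A"),
  loneBinary: loneBinary.test("a"),
  // A Script value is neither a General_Category value nor a binary property.
  loneScriptValue: isValidUnicodePattern("\\p{Greek}"),
};
```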

The following is appended to the bibliography.

A Bibliography

  1. Unicode Technical Standard #18: Unicode Regular Expressions, available at <http://unicode.org/reports/tr18/>
  2. Unicode Standard Annex #24: Unicode Script Property, available at <http://unicode.org/reports/tr24/>
  3. Unicode Standard Annex #44: Unicode Character Database, available at <http://unicode.org/reports/tr44/>
  4. Unicode Technical Report #51: Unicode Emoji, available at <http://unicode.org/reports/tr51/>