Stage 3 Draft / November 20, 2017

Unicode property escapes in regular expressions

The syntax listed in 21.2.1 Patterns is modified as follows.

ControlLetter :: one of
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

CharacterClassEscape[U] ::
    d
    D
    s
    S
    w
    W
    [+U] p{ UnicodePropertyValueExpression }
    [+U] P{ UnicodePropertyValueExpression }

UnicodePropertyValueExpression ::
    UnicodePropertyName = UnicodePropertyValue
    LoneUnicodePropertyNameOrValue

UnicodePropertyNameCharacter ::
    ControlLetter
    _

UnicodePropertyNameCharacters ::
    UnicodePropertyNameCharacter UnicodePropertyNameCharacters[opt]

UnicodePropertyName ::
    UnicodePropertyNameCharacters

UnicodePropertyValueCharacter ::
    UnicodePropertyNameCharacter
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9

UnicodePropertyValueCharacters ::
    UnicodePropertyValueCharacter UnicodePropertyValueCharacters[opt]

UnicodePropertyValue ::
    UnicodePropertyValueCharacters

LoneUnicodePropertyNameOrValue ::
    UnicodePropertyValueCharacters
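
As an informal illustration of the [+U] parameter in the grammar above, the following JavaScript sketch contrasts a pattern compiled with the u flag against the same pattern compiled without it. The behavior shown for the flagless pattern assumes an engine that implements the Annex B legacy regular expression grammar, where \p is an identity escape for the letter p. The variable names are hypothetical and chosen only for this example.

```javascript
// With the u flag, \p{...} is parsed as a Unicode property escape.
const withUnicodeFlag = /\p{Lu}/u;
console.log(withUnicodeFlag.test("A")); // true: "A" is an uppercase letter
console.log(withUnicodeFlag.test("a")); // false

// Without the u flag (assumption: Annex B legacy grammar), \p matches a
// literal "p" and "{Lu}" is matched character by character.
const withoutUnicodeFlag = /\p{Lu}/;
console.log(withoutUnicodeFlag.test("p{Lu}")); // true
console.log(withoutUnicodeFlag.test("A"));     // false
```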

The following items are appended to 21.2.1.1 Static Semantics: Early Errors.

UnicodePropertyName :: UnicodePropertyNameCharacters
UnicodePropertyValue :: UnicodePropertyValueCharacters
UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue
UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue

The following two abstract operations are appended to 21.2.2.8 Atom.

1 Static Semantics: UnicodeMatchProperty ( p )

The algorithm uses values from the following tables, which associate supported Unicode property names and property aliases with their canonical property names.

Implementations must support the following non-binary Unicode properties and their property aliases:

Table 1: Non-binary Unicode property aliases and their canonical property names
Property name and aliases | Canonical property name
General_Category, gc | General_Category
Script, sc | Script
Script_Extensions, scx | Script_Extensions

Additionally, implementations must support the following binary Unicode properties and their property aliases:

Table 2: Binary Unicode property aliases and their canonical property names
Property name and aliases | Canonical property name
ASCII | ASCII
ASCII_Hex_Digit, AHex | ASCII_Hex_Digit
Alphabetic, Alpha | Alphabetic
Any | Any
Assigned | Assigned
Bidi_Control, Bidi_C | Bidi_Control
Bidi_Mirrored, Bidi_M | Bidi_Mirrored
Case_Ignorable, CI | Case_Ignorable
Cased | Cased
Changes_When_Casefolded, CWCF | Changes_When_Casefolded
Changes_When_Casemapped, CWCM | Changes_When_Casemapped
Changes_When_Lowercased, CWL | Changes_When_Lowercased
Changes_When_NFKC_Casefolded, CWKCF | Changes_When_NFKC_Casefolded
Changes_When_Titlecased, CWT | Changes_When_Titlecased
Changes_When_Uppercased, CWU | Changes_When_Uppercased
Dash | Dash
Default_Ignorable_Code_Point, DI | Default_Ignorable_Code_Point
Deprecated, Dep | Deprecated
Diacritic, Dia | Diacritic
Emoji | Emoji
Emoji_Component | Emoji_Component
Emoji_Modifier | Emoji_Modifier
Emoji_Modifier_Base | Emoji_Modifier_Base
Emoji_Presentation | Emoji_Presentation
Extender, Ext | Extender
Grapheme_Base, Gr_Base | Grapheme_Base
Grapheme_Extend, Gr_Ext | Grapheme_Extend
Hex_Digit, Hex | Hex_Digit
IDS_Binary_Operator, IDSB | IDS_Binary_Operator
IDS_Trinary_Operator, IDST | IDS_Trinary_Operator
ID_Continue, IDC | ID_Continue
ID_Start, IDS | ID_Start
Ideographic, Ideo | Ideographic
Join_Control, Join_C | Join_Control
Logical_Order_Exception, LOE | Logical_Order_Exception
Lowercase, Lower | Lowercase
Math | Math
Noncharacter_Code_Point, NChar | Noncharacter_Code_Point
Pattern_Syntax, Pat_Syn | Pattern_Syntax
Pattern_White_Space, Pat_WS | Pattern_White_Space
Quotation_Mark, QMark | Quotation_Mark
Radical | Radical
Regional_Indicator, RI | Regional_Indicator
Sentence_Terminal, STerm | Sentence_Terminal
Soft_Dotted, SD | Soft_Dotted
Terminal_Punctuation, Term | Terminal_Punctuation
Unified_Ideograph, UIdeo | Unified_Ideograph
Uppercase, Upper | Uppercase
Variation_Selector, VS | Variation_Selector
White_Space, space | White_Space
XID_Continue, XIDC | XID_Continue
XID_Start, XIDS | XID_Start

The abstract operation UnicodeMatchProperty takes a string parameter p and performs the following steps:

  1. Assert: p is a known Unicode property name or property alias as listed in the “Property name and aliases” column of Table 1 or Table 2.
  2. Let property be the canonical property name of p as given in the “Canonical property name” column of the corresponding row.
  3. Return property.
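
The three steps above amount to a table lookup. The following JavaScript sketch illustrates the idea under stated assumptions: PROPERTY_ALIASES and unicodeMatchProperty are hypothetical names, the table holds only a handful of rows from Table 1 and Table 2, and the specification's assertion is modeled by throwing a SyntaxError.

```javascript
// Partial alias table (assumption: only a few rows of Table 1 and Table 2).
const PROPERTY_ALIASES = new Map([
  ["General_Category", "General_Category"],
  ["gc", "General_Category"],
  ["Script", "Script"],
  ["sc", "Script"],
  ["Script_Extensions", "Script_Extensions"],
  ["scx", "Script_Extensions"],
  ["Alphabetic", "Alphabetic"],
  ["Alpha", "Alphabetic"],
  ["White_Space", "White_Space"],
  ["space", "White_Space"],
]);

function unicodeMatchProperty(p) {
  // Step 1: the specification asserts that p is listed; this sketch throws.
  if (!PROPERTY_ALIASES.has(p)) {
    throw new SyntaxError(`Unknown Unicode property name: ${p}`);
  }
  // Steps 2 and 3: return the canonical name from the matching row.
  return PROPERTY_ALIASES.get(p);
}

console.log(unicodeMatchProperty("gc"));    // General_Category
console.log(unicodeMatchProperty("space")); // White_Space
```

The lookup is an exact string comparison, so a name that differs only in case is not found.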

To ensure interoperability, implementations must not extend Unicode property support to properties other than those listed in Table 1 and Table 2.

Implementations must only recognize the property aliases listed in Table 1 and Table 2.

Implementations must only recognize the property value aliases and canonical property value names listed in Table 3 and Table 4.

Note 1

For example, Script_Extensions (property name) and scx (property alias) are valid, but script_extensions or Scx aren’t.

Note 2

The listed properties form a superset of what UTS18 RL1.2 requires.
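
The restrictions and Note 1 above can be observed from script code. The sketch below is illustrative only: isValidPattern is a hypothetical helper, and it assumes an engine that implements this proposal and reports unrecognized property names as a SyntaxError when the pattern is compiled.

```javascript
// Hypothetical helper: reports whether a pattern compiles with the u flag.
function isValidPattern(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Listed property name and listed alias are accepted.
console.log(isValidPattern("\\p{Script_Extensions=Latin}")); // true
console.log(isValidPattern("\\p{scx=Latin}"));               // true

// Names are matched exactly, so case variants are rejected.
console.log(isValidPattern("\\p{script_extensions=Latin}")); // false
console.log(isValidPattern("\\p{Scx=Latin}"));               // false

// Block is a Unicode property, but it is not listed in Table 1 or Table 2.
console.log(isValidPattern("\\p{Block=Basic_Latin}"));       // false
```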

2 Static Semantics: UnicodeMatchPropertyValue ( p, v )

The algorithm uses values from the following tables, which associate canonical Unicode property names and their supported values and value aliases:

Table 3: Value aliases and canonical values for the Unicode property General_Category
Property value and aliases | Canonical property value
Cased_Letter, LC | Cased_Letter
Close_Punctuation, Pe | Close_Punctuation
Connector_Punctuation, Pc | Connector_Punctuation
Control, Cc, cntrl | Control
Currency_Symbol, Sc | Currency_Symbol
Dash_Punctuation, Pd | Dash_Punctuation
Decimal_Number, Nd, digit | Decimal_Number
Enclosing_Mark, Me | Enclosing_Mark
Final_Punctuation, Pf | Final_Punctuation
Format, Cf | Format
Initial_Punctuation, Pi | Initial_Punctuation
Letter, L | Letter
Letter_Number, Nl | Letter_Number
Line_Separator, Zl | Line_Separator
Lowercase_Letter, Ll | Lowercase_Letter
Mark, M, Combining_Mark | Mark
Math_Symbol, Sm | Math_Symbol
Modifier_Letter, Lm | Modifier_Letter
Modifier_Symbol, Sk | Modifier_Symbol
Nonspacing_Mark, Mn | Nonspacing_Mark
Number, N | Number
Open_Punctuation, Ps | Open_Punctuation
Other, C | Other
Other_Letter, Lo | Other_Letter
Other_Number, No | Other_Number
Other_Punctuation, Po | Other_Punctuation
Other_Symbol, So | Other_Symbol
Paragraph_Separator, Zp | Paragraph_Separator
Private_Use, Co | Private_Use
Punctuation, P, punct | Punctuation
Separator, Z | Separator
Space_Separator, Zs | Space_Separator
Spacing_Mark, Mc | Spacing_Mark
Surrogate, Cs | Surrogate
Symbol, S | Symbol
Titlecase_Letter, Lt | Titlecase_Letter
Unassigned, Cn | Unassigned
Uppercase_Letter, Lu | Uppercase_Letter

Table 4: Value aliases and canonical values for the Unicode properties Script and Script_Extensions
Property value and aliases | Canonical property value
Adlam, Adlm | Adlam
Ahom | Ahom
Anatolian_Hieroglyphs, Hluw | Anatolian_Hieroglyphs
Arabic, Arab | Arabic
Armenian, Armn | Armenian
Avestan, Avst | Avestan
Balinese, Bali | Balinese
Bamum, Bamu | Bamum
Bassa_Vah, Bass | Bassa_Vah
Batak, Batk | Batak
Bengali, Beng | Bengali
Bhaiksuki, Bhks | Bhaiksuki
Bopomofo, Bopo | Bopomofo
Brahmi, Brah | Brahmi
Braille, Brai | Braille
Buginese, Bugi | Buginese
Buhid, Buhd | Buhid
Canadian_Aboriginal, Cans | Canadian_Aboriginal
Carian, Cari | Carian
Caucasian_Albanian, Aghb | Caucasian_Albanian
Chakma, Cakm | Chakma
Cham | Cham
Cherokee, Cher | Cherokee
Common, Zyyy | Common
Coptic, Copt, Qaac | Coptic
Cuneiform, Xsux | Cuneiform
Cypriot, Cprt | Cypriot
Cyrillic, Cyrl | Cyrillic
Deseret, Dsrt | Deseret
Devanagari, Deva | Devanagari
Duployan, Dupl | Duployan
Egyptian_Hieroglyphs, Egyp | Egyptian_Hieroglyphs
Elbasan, Elba | Elbasan
Ethiopic, Ethi | Ethiopic
Georgian, Geor | Georgian
Glagolitic, Glag | Glagolitic
Gothic, Goth | Gothic
Grantha, Gran | Grantha
Greek, Grek | Greek
Gujarati, Gujr | Gujarati
Gurmukhi, Guru | Gurmukhi
Han, Hani | Han
Hangul, Hang | Hangul
Hanunoo, Hano | Hanunoo
Hatran, Hatr | Hatran
Hebrew, Hebr | Hebrew
Hiragana, Hira | Hiragana
Imperial_Aramaic, Armi | Imperial_Aramaic
Inherited, Zinh, Qaai | Inherited
Inscriptional_Pahlavi, Phli | Inscriptional_Pahlavi
Inscriptional_Parthian, Prti | Inscriptional_Parthian
Javanese, Java | Javanese
Kaithi, Kthi | Kaithi
Kannada, Knda | Kannada
Katakana, Kana | Katakana
Kayah_Li, Kali | Kayah_Li
Kharoshthi, Khar | Kharoshthi
Khmer, Khmr | Khmer
Khojki, Khoj | Khojki
Khudawadi, Sind | Khudawadi
Lao, Laoo | Lao
Latin, Latn | Latin
Lepcha, Lepc | Lepcha
Limbu, Limb | Limbu
Linear_A, Lina | Linear_A
Linear_B, Linb | Linear_B
Lisu | Lisu
Lycian, Lyci | Lycian
Lydian, Lydi | Lydian
Mahajani, Mahj | Mahajani
Malayalam, Mlym | Malayalam
Mandaic, Mand | Mandaic
Manichaean, Mani | Manichaean
Marchen, Marc | Marchen
Masaram_Gondi, Gonm | Masaram_Gondi
Meetei_Mayek, Mtei | Meetei_Mayek
Mende_Kikakui, Mend | Mende_Kikakui
Meroitic_Cursive, Merc | Meroitic_Cursive
Meroitic_Hieroglyphs, Mero | Meroitic_Hieroglyphs
Miao, Plrd | Miao
Modi | Modi
Mongolian, Mong | Mongolian
Mro, Mroo | Mro
Multani, Mult | Multani
Myanmar, Mymr | Myanmar
Nabataean, Nbat | Nabataean
New_Tai_Lue, Talu | New_Tai_Lue
Newa | Newa
Nko, Nkoo | Nko
Nushu, Nshu | Nushu
Ogham, Ogam | Ogham
Ol_Chiki, Olck | Ol_Chiki
Old_Hungarian, Hung | Old_Hungarian
Old_Italic, Ital | Old_Italic
Old_North_Arabian, Narb | Old_North_Arabian
Old_Permic, Perm | Old_Permic
Old_Persian, Xpeo | Old_Persian
Old_South_Arabian, Sarb | Old_South_Arabian
Old_Turkic, Orkh | Old_Turkic
Oriya, Orya | Oriya
Osage, Osge | Osage
Osmanya, Osma | Osmanya
Pahawh_Hmong, Hmng | Pahawh_Hmong
Palmyrene, Palm | Palmyrene
Pau_Cin_Hau, Pauc | Pau_Cin_Hau
Phags_Pa, Phag | Phags_Pa
Phoenician, Phnx | Phoenician
Psalter_Pahlavi, Phlp | Psalter_Pahlavi
Rejang, Rjng | Rejang
Runic, Runr | Runic
Samaritan, Samr | Samaritan
Saurashtra, Saur | Saurashtra
Sharada, Shrd | Sharada
Shavian, Shaw | Shavian
Siddham, Sidd | Siddham
SignWriting, Sgnw | SignWriting
Sinhala, Sinh | Sinhala
Sora_Sompeng, Sora | Sora_Sompeng
Soyombo, Soyo | Soyombo
Sundanese, Sund | Sundanese
Syloti_Nagri, Sylo | Syloti_Nagri
Syriac, Syrc | Syriac
Tagalog, Tglg | Tagalog
Tagbanwa, Tagb | Tagbanwa
Tai_Le, Tale | Tai_Le
Tai_Tham, Lana | Tai_Tham
Tai_Viet, Tavt | Tai_Viet
Takri, Takr | Takri
Tamil, Taml | Tamil
Tangut, Tang | Tangut
Telugu, Telu | Telugu
Thaana, Thaa | Thaana
Thai | Thai
Tibetan, Tibt | Tibetan
Tifinagh, Tfng | Tifinagh
Tirhuta, Tirh | Tirhuta
Ugaritic, Ugar | Ugaritic
Vai, Vaii | Vai
Warang_Citi, Wara | Warang_Citi
Yi, Yiii | Yi
Zanabazar_Square, Zanb | Zanabazar_Square

The abstract operation UnicodeMatchPropertyValue takes two string parameters p and v and performs the following steps:

  1. Assert: p is a canonical, unaliased Unicode property name as listed in the “Canonical property name” column of Table 1.
  2. Assert: v is a known property value or property value alias for Unicode property p as listed in the “Property value and aliases” column of Table 3 or Table 4.
  3. Let value be the canonical property value of v as given in the “Canonical property value” column of the corresponding row.
  4. Return value.
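
As with property names, the steps above reduce to a lookup keyed first by the canonical property name and then by the value or value alias. The JavaScript sketch below is one possible rendering, not the normative algorithm: all identifiers are hypothetical, the tables contain only a few rows of Table 3 and Table 4, and the assertions are modeled as thrown SyntaxErrors.

```javascript
// Partial excerpt of Table 3 (assumption: only a few rows are included).
const GENERAL_CATEGORY_VALUES = new Map([
  ["Letter", "Letter"],
  ["L", "Letter"],
  ["Uppercase_Letter", "Uppercase_Letter"],
  ["Lu", "Uppercase_Letter"],
  ["Decimal_Number", "Decimal_Number"],
  ["Nd", "Decimal_Number"],
  ["digit", "Decimal_Number"],
]);

// Partial excerpt of Table 4, shared by Script and Script_Extensions.
const SCRIPT_VALUES = new Map([
  ["Greek", "Greek"],
  ["Grek", "Greek"],
  ["Old_Persian", "Old_Persian"],
  ["Xpeo", "Old_Persian"],
]);

const VALUE_TABLES = new Map([
  ["General_Category", GENERAL_CATEGORY_VALUES],
  ["Script", SCRIPT_VALUES],
  ["Script_Extensions", SCRIPT_VALUES],
]);

function unicodeMatchPropertyValue(p, v) {
  // Step 1: p must be a canonical non-binary property name from Table 1.
  const table = VALUE_TABLES.get(p);
  if (table === undefined) {
    throw new SyntaxError(`Not a canonical non-binary property: ${p}`);
  }
  // Step 2: v must be listed for p.
  if (!table.has(v)) {
    throw new SyntaxError(`Unknown value for ${p}: ${v}`);
  }
  // Steps 3 and 4: return the canonical value from the matching row.
  return table.get(v);
}

console.log(unicodeMatchPropertyValue("General_Category", "digit")); // Decimal_Number
console.log(unicodeMatchPropertyValue("Script_Extensions", "Xpeo")); // Old_Persian
```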

Implementations must recognize only the canonical property values and property value aliases listed in Table 3 and Table 4.

Note 1

For example, Xpeo and Old_Persian are valid Script_Extensions values, but xpeo and Old Persian aren’t.

Note 2

This algorithm differs from the matching rules for symbolic values listed in UAX44: case, white space, U+002D (HYPHEN-MINUS), and U+005F (LOW LINE) are not ignored, and the Is prefix is not supported.
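
The two notes above can be checked with a short JavaScript sketch. It assumes an engine that implements this proposal; rejects is a hypothetical helper, and U+103A0 (OLD PERSIAN SIGN A) is used as a sample Old Persian code point.

```javascript
// Hypothetical helper: true if compiling the pattern throws a SyntaxError.
function rejects(source) {
  try {
    new RegExp(source, "u");
    return false;
  } catch (error) {
    return error instanceof SyntaxError;
  }
}

const oldPersianSignA = "\u{103A0}"; // OLD PERSIAN SIGN A

// The canonical value and its listed alias select the same code points.
console.log(/\p{Script_Extensions=Old_Persian}/u.test(oldPersianSignA)); // true
console.log(/\p{Script_Extensions=Xpeo}/u.test(oldPersianSignA));        // true

// Unlike UAX44 loose matching, case, spaces, and hyphens are significant,
// and the Is prefix is not recognized.
console.log(rejects("\\p{Script_Extensions=xpeo}"));        // true
console.log(rejects("\\p{Script_Extensions=Old Persian}")); // true
console.log(rejects("\\p{Script_Extensions=Old-Persian}")); // true
console.log(rejects("\\p{IsLu}"));                          // true
```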


The following is appended to the list of productions in 21.2.2.12 CharacterClassEscape.

The production CharacterClassEscape::\p{UnicodePropertyValueExpression} evaluates by returning the CharSet containing all Unicode code points included in the CharSet returned by UnicodePropertyValueExpression.

The production CharacterClassEscape::\P{UnicodePropertyValueExpression} evaluates by returning the CharSet containing all Unicode code points not included in the CharSet returned by UnicodePropertyValueExpression.
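
Because \P{…} returns the code points that \p{…} does not, exactly one of the two escapes matches any given code point. The following JavaScript sketch checks this for a few sample strings; the sample list and variable names are assumptions made for illustration.

```javascript
// Each sample is a single code point; the last one lies outside the BMP.
const samples = ["A", "a", "1", " ", "\u{1F600}"];

const inSet = /^\p{Lu}$/u;    // code points with General_Category=Uppercase_Letter
const notInSet = /^\P{Lu}$/u; // every other code point

for (const sample of samples) {
  console.log(JSON.stringify(sample), inSet.test(sample), notInSet.test(sample));
}

// Exactly one of the two patterns matches each sample.
const complementary = samples.every(
  (sample) => inSet.test(sample) !== notInSet.test(sample)
);
console.log(complementary); // true
```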

The production UnicodePropertyValueExpression::UnicodePropertyName=UnicodePropertyValue evaluates as follows:

  1. Let p be ! UnicodeMatchProperty(UnicodePropertyName).
  2. Assert: p is not a Unicode binary property or binary property alias as listed in the “Property name and aliases” column of Table 2.
  3. Let v be ! UnicodeMatchPropertyValue(p, UnicodePropertyValue).
  4. Return the CharSet containing all Unicode code points whose character database definition includes the property p with value v.
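
For illustration, the JavaScript sketch below uses the name=value form with Script and Script_Extensions. It assumes an engine that implements this proposal, and the results depend on the Unicode data the engine ships. U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK) is chosen because its Script value is Common while its Script_Extensions value includes Hiragana and Katakana.

```javascript
// Property names and values can be written in long or alias form.
const greekLong = /\p{Script=Greek}/u;
const greekShort = /\p{sc=Grek}/u;
console.log(greekLong.test("\u03B1"));  // true: U+03B1 GREEK SMALL LETTER ALPHA
console.log(greekShort.test("\u03B1")); // true
console.log(greekLong.test("a"));       // false: "a" is Latin

// Script and Script_Extensions can disagree for shared characters.
const prolongedSoundMark = "\u30FC";
console.log(/\p{Script=Hiragana}/u.test(prolongedSoundMark));            // false
console.log(/\p{Script_Extensions=Hiragana}/u.test(prolongedSoundMark)); // true
```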

The production UnicodePropertyValueExpression::LoneUnicodePropertyNameOrValue evaluates as follows:

  1. If ! UnicodeMatchPropertyValue("General_Category", LoneUnicodePropertyNameOrValue) is the name of a known Unicode general category or general category alias as listed in the “Property value and aliases” column of Table 3, then
    1. Return the CharSet containing all Unicode code points whose character database definition includes the property General_Category with value LoneUnicodePropertyNameOrValue.
  2. Let p be ! UnicodeMatchProperty(LoneUnicodePropertyNameOrValue).
  3. Assert: p is a known binary Unicode property or binary property alias as listed in the “Property name and aliases” column of Table 2.
  4. Return the CharSet containing all Unicode code points whose character database definition includes the property p with value True.
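
The lone form therefore covers two cases: a General_Category value without the property name, and a binary property. The JavaScript sketch below shows both and also shows that a Script value is not accepted in lone form. It assumes an engine that implements this proposal; the variable names are hypothetical.

```javascript
// Case 1: a lone General_Category value is shorthand for the explicit form.
const shorthand = /^\p{Lu}$/u;
const explicit = /^\p{General_Category=Uppercase_Letter}$/u;
console.log(shorthand.test("\u03A3")); // true: U+03A3 GREEK CAPITAL LETTER SIGMA
console.log(explicit.test("\u03A3"));  // true

// Case 2: a lone binary property name or alias from Table 2.
const alphabetic = /^\p{Alphabetic}$/u;
const alphabeticAlias = /^\p{Alpha}$/u;
console.log(alphabetic.test("\u00E9"));      // true: U+00E9 is a letter
console.log(alphabeticAlias.test("\u00E9")); // true
console.log(alphabetic.test("1"));           // false

// Script values need the Script= or Script_Extensions= prefix.
let loneScriptRejected = false;
try {
  new RegExp("\\p{Greek}", "u");
} catch (error) {
  loneScriptRejected = error instanceof SyntaxError;
}
console.log(loneScriptRejected); // true
```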

The following is appended to the bibliography.

A Bibliography

  1. Unicode Technical Standard #18: Unicode Regular Expressions, available at <https://unicode.org/reports/tr18/>
  2. Unicode Standard Annex #24: Unicode Script Property, available at <https://unicode.org/reports/tr24/>
  3. Unicode Standard Annex #44: Unicode Character Database, available at <https://unicode.org/reports/tr44/>
  4. Unicode Technical Standard #51: Unicode Emoji, available at <https://unicode.org/reports/tr51/>