Python beat PCRE to Unicode support by several years. (So there at least was an ...

BugBrother · on June 4, 2014

I thought the Unicode support was still spotty (< 3.X)? Or do you mean the support is better than in PCRE?

acdha · on June 4, 2014

It depends on what you mean by Unicode support - this turns out to be a surprisingly painful area if you need something like case folding:

strasse = straße

or treating combining characters the same as their single character equivalents:

ñ = ñ

(That's LATIN SMALL LETTER N WITH TILDE and LATIN SMALL LETTER N followed by COMBINING TILDE)

A surprising number of languages (mostly everything but Perl) won't handle advanced uses like this.
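For plain string comparison (as opposed to regex matching), both cases above can be handled with the Python 3 standard library alone. A minimal sketch, assuming Python 3.3+ for `str.casefold`; the helper name `loosely_equal` is made up for illustration:

```python
import unicodedata


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings, ignoring case and canonical-equivalence differences."""

    def key(s: str) -> str:
        # Decompose, apply full case folding (e.g. sharp s -> "ss"),
        # then decompose again, since folding can produce new composed forms.
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFD", folded)

    return key(a) == key(b)


# Case folding: "strasse" vs. the sharp-s spelling.
print(loosely_equal("strasse", "stra\N{LATIN SMALL LETTER SHARP S}e"))  # True

# Precomposed n-with-tilde vs. "n" followed by COMBINING TILDE.
print(loosely_equal("\N{LATIN SMALL LETTER N WITH TILDE}", "n\N{COMBINING TILDE}"))  # True
```

This only covers whole-string equality; it does not make the stdlib `re` module match these forms inside a pattern.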

The good news is that the next version of the stdlib regex module is being developed independently:

https://pypi.python.org/pypi/regex

Simply "pip install regex" and:

    >>> import regex
    >>> regex.match(r"(?iV1)strasse", "stra\N{LATIN SMALL LETTER SHARP S}e").span()
    (0, 6)
    >>> regex.match(r"(?iV1)stra\N{LATIN SMALL LETTER SHARP S}e", "STRASSE").span()
    (0, 7)

maxerickson · on June 4, 2014

In 2000, PCRE simply didn't support Unicode, while Python 1.6 and 2.0 did (at least, based on some quick searching, PCRE added support for Unicode in 2004).

"Spotty" probably isn't the right word either. The change in 3.0 was to treat text as Unicode by default; the 'unicode' type in 2.x is reasonably complete (as these things go), it just isn't the default treatment for text.
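One visible effect of that default in Python 3, as a small sketch using only the stdlib `re` module (the variable names are just for illustration): with a `str` pattern, `\w` matches Unicode word characters unless `re.ASCII` is passed.

```python
import re

word = "stra\N{LATIN SMALL LETTER SHARP S}e"

# str patterns are Unicode-aware by default in Python 3.
unicode_match = re.match(r"\w+", word).group()

# re.ASCII restricts \w to [a-zA-Z0-9_], so the match stops before the sharp s.
ascii_match = re.match(r"\w+", word, re.ASCII).group()

print(len(unicode_match))  # 6
print(ascii_match)         # stra
```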

BugBrother · on June 4, 2014

As late as 2004? I didn't know that.

'acdha' (in the sibling comment) explained what I meant by "spotty" better and more pedagogically than I ever could. :-)