
Python beat PCRE to Unicode support by several years.

(So there at least was an advantage)



I thought Unicode support was still spotty before 3.x? Or do you mean the support is better than in PCRE?


It depends on what you mean by Unicode support - this turns out to be a surprisingly painful area if you need something like case folding:

strasse = straße

or treating combining characters the same as their single character equivalents:

ñ = ñ

(That's LATIN SMALL LETTER N WITH TILDE and LATIN SMALL LETTER N followed by COMBINING TILDE)

A surprising number of languages (mostly everything but Perl) won't handle advanced uses like this.
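
To make the two cases above concrete, here is a minimal sketch using only Python 3's standard library. It compares plain strings rather than running a regex, and the variable names are just for illustration:

```python
import unicodedata

# Case folding: str.casefold() applies full Unicode case folding,
# so the sharp s expands to "ss".
sharp_s_word = "stra\N{LATIN SMALL LETTER SHARP S}e"
folded_equal = sharp_s_word.casefold() == "strasse".casefold()

# Canonical equivalence: one precomposed code point versus
# a base letter followed by a combining mark.
precomposed = "\N{LATIN SMALL LETTER N WITH TILDE}"
decomposed = "n\N{COMBINING TILDE}"

# A naive comparison sees two different code point sequences.
raw_equal = precomposed == decomposed

# Normalizing to NFC composes the pair into the single code point.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed

print(folded_equal, raw_equal, nfc_equal)  # True False True
```

The catch is that you have to remember to fold and normalize both sides yourself before comparing.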

The good news is that the next version of the stdlib regex module is being developed independently:

https://pypi.python.org/pypi/regex

Simply "pip install regex" and:

    >>> import regex
    >>> regex.match(r"(?iV1)strasse", "stra\N{LATIN SMALL LETTER SHARP S}e").span()
    (0, 6)
    >>> regex.match(r"(?iV1)stra\N{LATIN SMALL LETTER SHARP S}e", "STRASSE").span()
    (0, 7)
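
For contrast, a rough sketch of what the stdlib `re` module does with the same input in Python 3. The workaround at the end (case folding both sides before matching) is my own illustration, not something `re` does for you, and the span it reports refers to the folded text, not the original:

```python
import re

sharp_s_word = "stra\N{LATIN SMALL LETTER SHARP S}e"

# re.IGNORECASE compares character by character, so the two-character
# "ss" in the pattern never matches the single sharp s.
stdlib_match = re.match(r"strasse", sharp_s_word, re.IGNORECASE)

# Workaround: case fold both the pattern and the subject first.
folded_match = re.match(re.escape("strasse".casefold()), sharp_s_word.casefold())

print(stdlib_match)         # None
print(folded_match.span())  # (0, 7)
```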


In 2000, PCRE simply didn't support Unicode, while Python 1.6 and 2.0 did. (At least, based on some quick searching, PCRE added Unicode support in 2004.)

"Spotty" probably isn't the right word either. The change in 3.0 was to treat text as Unicode by default; the 'unicode' type in 2.x is reasonably complete (as these things go), it just isn't the default treatment for text.
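
As a small illustration of the 3.x default (a sketch, with made-up variable names): a plain string literal is already Unicode text, and bytes are a separate type you encode to and decode from explicitly.

```python
# In Python 3 a plain literal is Unicode text (type str).
text = "stra\N{LATIN SMALL LETTER SHARP S}e"

# Encoding produces a separate bytes object; the sharp s takes
# two bytes in UTF-8, so the lengths differ.
encoded = text.encode("utf-8")

print(type(text).__name__, len(text))        # str 6
print(type(encoded).__name__, len(encoded))  # bytes 7

# Decoding gets the original text back.
round_trip = encoded.decode("utf-8")
print(round_trip == text)                    # True
```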


As late as 2004? I didn't know that.

'acdha' (in the sibling comment) explained what I meant by "spotty" better and more pedagogically than I ever could. :-)



