
I have been experimenting with tokenization in Rust in https://github.com/rth/vtext, relying mostly on Unicode segmentation plus a few custom rules. The resulting tokenizer was also around 10x faster than spaCy at comparable precision (see the benchmarks section of the README).
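As a rough illustration of the word-boundary approach, here is a minimal std-only sketch that splits text on non-alphanumeric boundaries. This is not vtext's actual implementation, which follows Unicode segmentation (UAX #29) with additional custom rules; it just shows the basic shape of such a tokenizer:

```rust
// Simplified word tokenizer: emit maximal runs of alphanumeric characters.
// Illustrative only; a real UAX #29 tokenizer handles apostrophes, hyphens,
// scripts without spaces, etc.
fn tokenize(text: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    let mut start = None;
    for (i, c) in text.char_indices() {
        if c.is_alphanumeric() {
            // Record the byte offset where the current token begins.
            if start.is_none() {
                start = Some(i);
            }
        } else if let Some(s) = start.take() {
            // Non-word character ends the current token.
            tokens.push(&text[s..i]);
        }
    }
    // Flush a trailing token that runs to the end of the string.
    if let Some(s) = start {
        tokens.push(&text[s..]);
    }
    tokens
}

fn main() {
    let toks = tokenize("Tokenization in Rust, 10x faster!");
    println!("{:?}", toks);
}
```

Working at byte offsets over `char_indices` keeps the tokens as zero-copy slices of the input, which is part of why this style of tokenizer can be fast.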

