I study German and decided to write an app to simplify some tasks:
1) generate Anki cards from text
2) extract highlighted words from paper and generate Anki cards
3) collection of texts/dialogs on random topics
4) and so on... )
Not sure if anyone else needs it, but it's very helpful for me )
Technologies: Python backend stack, some computer vision, PDF processing (split, merge, OCR, text extraction, fixing all the weird issues, MuPDF, Ghostscript - worked on PDF-related projects), AWS
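For the card generation part, a minimal sketch of one possible approach: write front/back pairs to a tab-separated text file, which Anki's text import can read. The function name `cards_to_tsv` and the two-field layout are assumptions for illustration, not how the app above actually works.

```python
import csv
import io


def cards_to_tsv(cards):
    """Serialize (front, back) pairs as tab-separated lines.

    Hypothetical helper: one line per card, two fields per line.
    """
    buf = io.StringIO()
    # csv.writer quotes a field only if it contains a tab, quote or newline.
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for front, back in cards:
        writer.writerow([front, back])
    return buf.getvalue()


if __name__ == "__main__":
    sample = [("der Hund", "the dog"), ("die Katze", "the cat")]
    print(cards_to_tsv(sample), end="")
```

Save the output as a UTF-8 `.txt` file and import it into a deck, mapping the first field to the front and the second to the back.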
Doesn't work very well.
First of all, there are several separate problems: 1) detecting whether OCR is required 2) image optimization 3) preprocessing of broken PDF files. And none of them is easy:
1) A page can contain selectable text that still can't be copied, because the embedded font doesn't contain a glyph->symbol code mapping. The mapping table can contain complete garbage. Sometimes a page contains long URLs (added by email services) while all the actual text is provided as an image. Sometimes the text is a mix of normal text and garbage. And many, many other cases.
2) Some old scanners generate PDF documents built from 2-5 pixel image stripes. Some of them try to do OCR (poorly). Some of them use a huge DPI. Sometimes you get an uncompressed doc in which each page can take up to 200 MB. So you need to convert the PDF page to an image, but you have to choose the format and compression options. PNG is OK if you choose the correct options (for Ghostscript), but the output image will be huge. JPG is better for size, but quality could be low. Sometimes multistage optimization is required. Also, tools like Ghostscript, fitz or ImageMagick don't handle every possible PDF/image.
3) Weird PDFs - an endless story. Poor fonts, broken fonts, very specific cases in the PDF standard, issues with image extraction, table of contents, viruses, embedded files, annotations, margins/paddings/rotations/translations.
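To make problem 1) a bit more concrete, here is a rough sketch of a "does this page need OCR" heuristic. It assumes you already extracted the page text and estimated what fraction of the page is covered by images with whatever PDF library you use; `needs_ocr`, `plausible_ratio`, `image_coverage` and the thresholds are all made-up names and values for illustration, and real documents will need many more rules than this.

```python
import re
import unicodedata

# URLs injected by email services should not count as "real" page text.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def plausible_ratio(text):
    """Fraction of non-space characters that look like real text."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    good = 0
    for c in chars:
        major = unicodedata.category(c)[0]
        # Letters, numbers, punctuation and symbols count as plausible.
        # Control, private-use and unassigned characters (category C*)
        # and the replacement character U+FFFD count as garbage.
        if major in "LNPS" and c != "\ufffd":
            good += 1
    return good / len(chars)


def needs_ocr(text, image_coverage=0.0, min_chars=20, min_ratio=0.8):
    """Guess whether a page should be sent to OCR.

    text: text extracted from the page (may be empty or garbage).
    image_coverage: assumed 0..1 estimate of page area covered by images.
    """
    body = URL_RE.sub(" ", text)
    n_chars = sum(1 for c in body if not c.isspace())
    if n_chars < min_chars:
        # Almost no usable text: OCR only if the page is mostly an image.
        return image_coverage >= 0.5
    # Enough text, but it may come from a broken glyph mapping.
    return plausible_ratio(body) < min_ratio


if __name__ == "__main__":
    print(needs_ocr("Das ist ein ganz normaler deutscher Satz.", 0.0))
    print(needs_ocr("https://mail.example.com/track?id=123", 0.95))
```

The thresholds are guesses. A page that mixes normal text and garbage can land on either side of `min_ratio`, which is exactly why this problem is hard.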
For news and forum threads you might want to check out Fraidycat. It's a browser extension that handles your feeds (RSS and some others). You can categorize feeds by importance (real-time, frequent, occasional, etc.) and it updates the main page accordingly.
Technologies: Deep Learning/Python backend stack - Keras, PyTorch, Flask, Docker, Kubernetes and so on. Fields - image processing, face detection, face recognition, object detection/classification, segmentation.