Wednesday, June 25, 2008

Optical Character Recognition in Linux [Greg Laden's Blog]

I don't think Optical Character Recognition (OCR) works that well, frankly. But it can be done, and it can be better than retyping piles of text. It does seem to work nicely when the text is clean, on clean white paper, with good contrast between ink and background and no garbage on the page. But in my experience, when I have those conditions, it is because I have an electronic version already! When I have a PDF file that consists of scans of photocopies, OCR tends to see flecks