Hi all,
I thought I'd share some thoughts on using digital cameras, since I'm a
long-time scanner user (my first being a hand scanner - remember them?)
and a relatively new digital camera owner.
Trying out book scanning (to save some money on photocopies at the library) was one of my main reasons for buying one,
so runeberg.org/admin/camera.html was interesting to read before I did
so. But that document is a bit old, so here are my experiences with my
Samsung NV10, a 10-megapixel camera:
One of my first findings is that a ten-megapixel camera can indeed get quite
good resolution (in DPI), comparable to that of a low-end scanner. The NV10 is
a compact camera, but at the shortest focus distance (about 6 cm) I get a resolution of about 2000 DPI, although that number should be taken with a grain
of salt, because of ISO noise and the camera's interpolation. But even half that is still a
pretty good number. For an A4 (or US letter) page, we're talking about 300-400 DPI.
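
For anyone who wants to check the A4 figure for their own camera, here is a rough sketch of the arithmetic. The pixel count is an assumption on my part (a 10-megapixel 4:3 sensor is roughly 3648 x 2736 pixels), and the function name is just something I made up for illustration:

```python
# Rough DPI estimate: pixels along the long edge of the frame divided by
# the long edge of the photographed area, in inches.
# ASSUMPTION: a 10-megapixel 4:3 sensor is about 3648 x 2736 pixels.
SENSOR_LONG_PX = 3648
MM_PER_INCH = 25.4


def effective_dpi(long_edge_mm, long_edge_px=SENSOR_LONG_PX):
    """DPI when the frame's long edge exactly covers long_edge_mm."""
    return long_edge_px / (long_edge_mm / MM_PER_INCH)


for name, long_mm in [("A4", 297.0), ("US letter", 279.4)]:
    print(f"{name}: about {effective_dpi(long_mm):.0f} DPI")
```

In practice the page rarely fills the frame exactly, so the real figure will be somewhat lower than what this prints.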
Now, my experience (using FineReader) is that 10 megapixels is more than
you need for good OCR. I also get the impression that the issue of lighting is
a bit overemphasized; I've had good results even with ambient lighting, as long
as the page, camera and photographer are positioned to avoid shadows.
To rank the pitfalls:
The definite number one issue is page geometry. The page needs to be as
flat as possible. While FineReader didn't seem to have many problems with
low contrast, it definitely had problems with any page that wasn't flat. So the
best way to photograph books seems to be to hold them open at a little over 90 degrees, with the page being photographed as flat as possible. (You can
speed this up by photographing the odd and even pages in two passes.)
The second issue is sharpness, which seems quite vital as well. The camera and book need to be steady, and the camera needs to be correctly focused. With some practice, even hand-held shots work fine; it's just more difficult.
On the positive side, the sharpness can be checked immediately after
taking the picture. If the picture is bad, it can still be salvaged in most cases
with some post-processing in Photoshop or similar.
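
If you'd rather not eyeball every shot, blurry frames can in principle be flagged automatically. A common focus measure is the variance of the Laplacian: sharp text has strong edges, so the value is high. This is only a minimal sketch on toy data, not something I've tuned; reading a photo into grey values and picking a cut-off for "too blurry" are left out, and both would be assumptions that depend on your camera and software.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over a 2D list of grey values.

    Higher values mean more edge contrast, i.e. a sharper picture.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


# Toy data: a hard black/white edge versus the same edge smeared into a ramp.
sharp = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
blurry = [[0, 51, 102, 153, 204, 255] for _ in range(6)]

print(laplacian_variance(sharp) > laplacian_variance(blurry))
```

The score is only meaningful when comparing similar pages shot the same way, so it's better used to pick out the worst frames in a batch than as an absolute measure.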
Third, I would put the aforementioned issue with lighting. And fourth, I'd bring
up lens aberration, the slight 'fishbowl' effect (barrel distortion) you get with the camera. It usually isn't a big enough distortion to affect OCR, but it becomes
very noticeable in illustrations, due to the straight lines (of the image border, if nothing else). Illustrations also suffer a lot from uneven lighting and poor color accuracy.
So to summarize: I've found using a digital camera to be a very fast, portable and convenient way to digitize text. Even with ambient
lighting and the camera hand-held ("field conditions") it's quite possible, with
some practice, to get results that OCR nearly as well as those from a scanner.
The big drawback is images; it's not possible to get a decent image
under "field conditions", at least not without spending quite a bit of time on
post-processing, which cancels out one of the main benefits of using
a camera in the first place.
So for an illustrated book, I wouldn't recommend using a camera, at least not
as a time-saving device. But for plain text the camera can be the better option
(depending on whether you have a sheet-fed scanner, whether you want
to preserve the book, etc.).
When it comes to older Swedish texts in particular, the vast majority of OCR errors seem to be due to archaic spellings that aren't in the software's dictionary, and that's the same whether you use a camera or a scanner.
(Speaking of which, perhaps Runeberg.org could cook up some OCR dictionaries for different time periods?)
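
To make the dictionary idea a bit more concrete, here is a rough sketch of one way it could work: map a few spellings that are common in older Swedish texts (initial "hv", "fv", final "f" for the v sound, "qv") to their modern forms, and only accept the result if it turns up in a modern word list. The rules, the tiny word list and the function names are all made up for illustration; a real period dictionary would need far more than this.

```python
# HYPOTHETICAL rules and word list, for illustration only.
MODERN_WORDS = {"vad", "över", "av", "kvinna", "chef"}


def modern_candidates(word):
    """Yield possible modern spellings of an old-orthography word."""
    w = word.lower()
    yield w  # the word may already be modern
    w = w.replace("qv", "kv").replace("fv", "v")
    if w.startswith("hv"):
        w = w[1:]  # hvad -> vad
    yield w
    if w.endswith("f"):
        yield w[:-1] + "v"  # af -> av


def lookup(word, known=MODERN_WORDS):
    """Return the first candidate found in the modern word list, else None."""
    for candidate in modern_candidates(word):
        if candidate in known:
            return candidate
    return None


for old in ["hvad", "öfver", "af", "qvinna", "chef"]:
    print(old, "->", lookup(old))
```

Checking candidates against a word list, rather than rewriting blindly, keeps modern words that happen to end in "f" (like "chef") from being mangled.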
/Sven