Dear Sven de Marothy,
 
You are doing great work with your camera, as I can see from your report "Digitizing with a camera - some results".
 
I can confirm all the findings in your report about using a digital camera for OCR.
 
Some years ago I used a 4-megapixel Canon S40 camera and did some OCR of single book pages with an old version of FineReader.
 
I photographed maybe 50 to 100 big books, mostly handheld (both camera and book), in daylight from a window. But I OCRed very few of them.
 
In my experience the OCR result was as good as with a scanner, but photographing a book well enough for OCR took more time than simply storing its pages as pictures, so I saved time by going quickly through most of the books.
 
Now I have a 7-megapixel Canon A620 and the results are just a little better. I also travel with a very small tripod and a big screw for attaching it to tables (I put the book on the floor between the window and the table).
 
I do a 60-page book in 5-10 minutes with a double page in each photo, and it takes about twice as long with a single page in each photo. The time also depends on how soft the book binding is and how good my working position is under the window.
 
I think the biggest problems with FineReader are that it does not accept scans that are not flat and that it is sensitive to shadows. FineReader also needs an expandable internal dictionary, one that it uses in the recognition phase rather than later during spell checking. Do you know if FineReader offers this?
 
I also need a FineReader with built-in recognition of diacritical marks, like å, ö and ä in Swedish. I work with books in Indian languages.
 
Best regards,
Mats Eklöf, Huskvarna


2007/2/9, runeberg-request@lists.lysator.liu.se <runeberg-request@lists.lysator.liu.se>:
Send Runeberg mailing list submissions to
       runeberg@lists.lysator.liu.se

To subscribe or unsubscribe via the World Wide Web, visit
       http://lists.lysator.liu.se/mailman/listinfo/runeberg
or, via email, send a message with subject or body 'help' to
       runeberg-request@lists.lysator.liu.se

You can reach the person managing the list at
       runeberg-owner@lists.lysator.liu.se

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Runeberg digest..."


Today's Topics:

  1. Re: Halland (Sven de Marothy)
  2. Digitizing with a camera - some results (Sven de Marothy)


----------------------------------------------------------------------

Message: 1
Date: Thu, 8 Feb 2007 21:04:21 +0100
From: "Sven de Marothy" <svendem@gmail.com>
Subject: Re: [Runeberg] Halland
To: runeberg@lists.lysator.liu.se
Message-ID:
       <94420310702081204u3c0bdf88x8ad213fe3a51442d@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

On 2/8/07, Lars Aronsson <lars@aronsson.se> wrote:


Which authors are from Halland?


Carl Bildt? Admittedly, that only applies to Carl Bildt (1949-) and not
Carl Bildt (1850-1931), who could have been relevant.

Otherwise, Olof Dalin is an obvious choice, I think. Time to digitize
Then Swänska Argus?

And hi, by the way, I'm new to the list. :)

/Sven

------------------------------

Message: 2
Date: Thu, 8 Feb 2007 21:54:31 +0100
From: "Sven de Marothy" <svendem@gmail.com>
Subject: [Runeberg] Digitizing with a camera - some results
To: Runeberg@lists.lysator.liu.se
Message-ID:
       <94420310702081254r4680b4a4r991fddc2fc84671e@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi all,
I thought I'd share some thoughts on the issue of using digital cameras,
since I'm a long-time scanner user (my first being a hand scanner -
remember them?) and a relatively new digital camera owner.

Since trying out book scanning (to save some money on photocopies at the
library) was one of my main reasons for buying a camera,
runeberg.org/admin/camera.html was interesting to read before I did so.
But the document is a little old, so here are my experiences with my
Samsung NV10 10-megapixel camera:

One of my first experiences is that a ten-megapixel camera can indeed get
quite good resolution (in DPI), comparable to that of a low-end scanner.
The NV10 is a compact camera, but at the shortest focus distance (about
6 cm) I get a resolution of about 2000 DPI, although that number should be
taken with a grain of salt because of ISO noise and interpolation. But even
half that is still a pretty good number. For an A4 (or US letter) page,
we're talking about 300-400 DPI.
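
The A4 figure can be checked with a little arithmetic. The sketch below
is only an illustration: the 3648 x 2736 pixel frame is an assumed size
for a typical 4:3 ten-megapixel sensor (it is not stated in this
message), and it assumes the page exactly fills the frame, which rarely
happens in practice.

```python
MM_PER_INCH = 25.4

def effective_dpi(pixels_long, pixels_short, page_long_mm, page_short_mm):
    """Return the DPI obtained when the page just fits inside the frame.

    The limiting axis is whichever one yields fewer pixels per inch.
    """
    dpi_long = pixels_long / (page_long_mm / MM_PER_INCH)
    dpi_short = pixels_short / (page_short_mm / MM_PER_INCH)
    return min(dpi_long, dpi_short)

# Assumption: a 10-megapixel 4:3 frame of 3648 x 2736 pixels.
# A4 paper measures 210 x 297 mm.
a4_dpi = effective_dpi(3648, 2736, 297, 210)
print(round(a4_dpi))  # about 312, inside the 300-400 DPI range
```

Any margin left around the page lowers the result proportionally.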

Now, my experience (using FineReader) is that 10 megapixels is more than
you need for good OCR. I also get the impression that the issue of lighting
is a bit overemphasized; I've had good results even with ambient lighting,
as long as the page, camera and photographer are positioned to avoid
shadows.

To rank the pitfalls:
The definite number one issue is page geometry. The page needs to be as
flat as possible. While FineReader didn't seem to have many problems with
low contrast, it definitely had problems with any page that wasn't flat. So
the best way to photograph books seems to be to position them open at a
little over 90 degrees, with the page being photographed as flat as
possible. (You can speed this up by photographing the odd and even pages in
two passes.)

The second issue is sharpness, which seems quite vital as well. The camera
and book need to be steady, and the camera needs to be correctly focused.
With some practice, even hand-held results work fine; it's just more
difficult. On the positive side, the sharpness can be checked immediately
after taking the picture. If the picture is bad, it can still be salvaged
in most cases with some post-processing in Photoshop or similar.

Third, I would put the aforementioned issue with lighting. And fourth, I'd
bring up lens aberration, the slight 'fishbowl' effect you get with the
camera. It usually isn't a big enough distortion to have an effect on OCR,
but it becomes very noticeable in images, due to the straight lines (of the
image border, if nothing else). Images also suffer a lot from uneven
lighting and lack of color-correctness.

So to summarize: I've found using a digital camera to be a very fast,
portable and convenient way to digitize text, and that even working with
ambient lighting and the camera hand-held ("field conditions") it's quite
possible, with some practice, to get results that OCR nearly as well as
those from a scanner.

The big drawback is images: it's not possible to get a decent image under
"field conditions", at least not without spending quite a bit of time on
post-processing, which cancels out one of the main benefits of using a
camera in the first place.

So for an illustrated book, I wouldn't recommend using a camera, at least
not as a time-saving device. But for plain text the camera can be a better
option (depending on whether you have a sheet-fed scanner, whether you want
to preserve the book, etc.).

When it comes to older Swedish texts in particular, the vast majority of OCR
errors would seem to be due to archaic spelling not included in the
software's dictionary, and that's the same with either a camera or a
scanner.

(Speaking of which, perhaps Runeberg.org could cook up some OCR dictionaries
for different time periods?)
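
As a rough illustration of how such a period word list might be used
outside the OCR program, the sketch below flags OCR words that are
missing from a given list, so that only those need manual review. The
function name, the tiny word list and the sample sentence are all made
up for the example; this is not FineReader's dictionary format or
mechanism.

```python
import re

def unknown_words(ocr_text, wordlist):
    """Return words from ocr_text that are absent from wordlist.

    Matching is case-insensitive; each unknown word is reported once,
    in order of first appearance.
    """
    known = {w.lower() for w in wordlist}
    seen = set()
    result = []
    # [^\W\d_]+ matches runs of letters, including å, ä and ö.
    for token in re.findall(r"[^\W\d_]+", ocr_text):
        key = token.lower()
        if key not in known and key not in seen:
            seen.add(key)
            result.append(token)
    return result

# Hypothetical period word list and OCR output, for illustration only.
period_words = ["then", "swänska", "argus", "hwad", "är"]
sample = "Then Swänska Argus: hwad är thet?"
flagged = unknown_words(sample, period_words)
print(flagged)  # ['thet']
```

A word list per time period would simply be passed in as a different
`wordlist` argument.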

/Sven

------------------------------



End of Runeberg Digest, Vol 22, Issue 2
***************************************