OCR errors - Runeberg

4 May 2004


When I noticed that the same errors kept recurring, I found
it convenient to extract the raw OCR texts into a single file containing
10, or sometimes 100, single pages. Then I could apply the replace
command in MS Word to correct the recurring errors: first removing the
OCR trash (tabs, spurious chars) and the extra spaces (which, although
not appearing in HTML, are a bit annoying when proofreading), then the
most common errors, such as 'rn' instead of 'm' etc. The Swedish words
'han' occurring as 'lima' and 'med' as 'umeå'(!) are among the more
noteworthy.
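As a rough illustration of the clean-up steps described above, here is a minimal Python sketch. The function name and the correction table are hypothetical, built only from the examples mentioned in this thread, and it is not the procedure actually used in Word. Blind whole-word replacement can also damage legitimate words (a text that really mentions 'lima', say), so the result still needs proofreading.

```python
import re

# Hypothetical correction table, using only examples quoted in this thread.
WORD_FIXES = {"lima": "han", "umeå": "med", "sorn": "som", "rned": "med"}


def clean_ocr_text(text, word_fixes=WORD_FIXES):
    """Remove tabs and extra spaces, then apply whole-word corrections."""
    # Tabs become spaces, then runs of spaces collapse to a single space.
    text = text.replace("\t", " ")
    text = re.sub(r" {2,}", " ", text)
    # Drop spaces left at the end of each line.
    text = re.sub(r" +$", "", text, flags=re.MULTILINE)
    # Replace only whole words (case-sensitive), so 'lima' inside
    # 'klimat' is left alone.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in word_fixes) + r")\b"
    )
    return pattern.sub(lambda m: word_fixes[m.group(1)], text)
```

For example, `clean_ocr_text("jag  och\tlima gick  rned")` gives `"jag och han gick med"`.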
Having the text in a separate file makes it possible to arrange the
facsimile (browser) on the left of the screen and the OCR text
(Notepad) on the right, thus facilitating the proofreading.
It is possible to use MS Word to insert tags like <i> with a single 
keypress. After using Word, I find it necessary, however, to filter the 
text through Notepad, in order to be sure that no Microsoft special 
chars are hidden somewhere. For example, Word might convert short dashes 
into long, etc.
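The check for hidden special characters could also be done with a small script. The following Python sketch is only an assumption about which characters are worth flagging (curly quotes, long dashes, the ellipsis character and the no-break space); the set and the function name are hypothetical and can be adjusted.

```python
import unicodedata

# Characters a word processor may substitute silently (an assumed list).
SUSPECT = {
    "\u2018", "\u2019",  # curly single quotes
    "\u201c", "\u201d",  # curly double quotes
    "\u2013", "\u2014",  # en dash, em dash
    "\u2026",            # horizontal ellipsis
    "\u00a0",            # no-break space
}


def find_suspect_chars(text):
    """Return (line, column, Unicode name) for every suspect character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECT:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits
```

Ordinary Swedish and German letters such as å, ä, ö and ü are not in the set, so they are never reported.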
Bernhard Johanson
runeberg-request@lists.lysator.liu.se wrote:
...
Today's Topics:

Most common proofreading edits (Lars Aronsson)
Re: Most common proofreading edits (Lars Aronsson)

--__--__--
Message: 1
Date: Tue, 4 May 2004 01:04:22 +0200 (CEST)
From: Lars Aronsson lars@aronsson.se
To: runeberg@lists.lysator.liu.se
Subject: [Runeberg] Most common proofreading edits
Project Runeberg,
Today (well, yesterday) I noticed that the work
http://runeberg.org/akrell/ was completely proofread.  If you have
been watching the "Recent changes" page, you have seen the signature
"jens.christian.berlin" working on this title over the last few months.
These are the political memoirs of three prominent Swedish gentlemen,
Carl Fredrik Akrell, Samuel Gustaf von Troil, and Per Sahlström,
published in 1884-1885, titled "Minnen från Carl XIV:s, Oscar I:s och
Carl XV:s dagar".  Among other things, they describe the political
debates about the introduction of the electric telegraph in Sweden,
some early railroads and also the first macadam-covered country road
in southern Sweden.  It is a single volume of 562 pages that I scanned
and OCR-ed in August 2003.
So today I wrote a program to see which kinds of edits are most
commonly needed to get the text in order after OCR.  This is easy to
do, as our website saves all old versions of every text.  These are my
findings, listed from the most frequent down:
   378  [---] {+--+}
   128  {+<i>+}
    91  {+</i>+}
Here, [- and -] surround parts that were removed, while {+ and +}
surround parts that were inserted.  This means that in 378 places, a
single dash "-" was changed into a double "--".  In 128 places, an
opening italics tag "<i>" was inserted, and in 91 places, a closing
italics tag "</i>" was inserted.  It may seem odd that these numbers
are not equal.  However, this is only an unfortunate result of how my
program works.  The missing </i> tags are found further down the list.
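To show how such a count can be produced, here is a minimal Python sketch. It is not the program used for the figures above: it assumes that each page is available as a pair of (raw OCR text, proofread text), compares the two word by word with the standard difflib module, and prints the edits in the same [-removed-] {+inserted+} notation. The function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def word_edits(old, new):
    """List the word-level edits between two versions of a text."""
    a, b = old.split(), new.split()
    edits = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed = " ".join(a[i1:i2])
        inserted = " ".join(b[j1:j2])
        parts = []
        if removed:
            parts.append("[-" + removed + "-]")
        if inserted:
            parts.append("{+" + inserted + "+}")
        edits.append(" ".join(parts))
    return edits


def count_edits(version_pairs):
    """Count identical edits over many (old, new) page pairs."""
    counts = Counter()
    for old, new in version_pairs:
        counts.update(word_edits(old, new))
    return counts
```

Calling `count_edits(pairs).most_common()` then lists the edits from the most frequent down. Note that in this sketch, neighbouring changed words are merged into a single entry, so the same correction can appear under more than one key.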
These are by far the most common edits made to this book.  I think you
will all agree that a lot of work could be saved if the OCR software
got this right in the first place.
In total, over all 562 pages, I counted 5268 changes, or
an average of 9.4 changes per page.  The three above (long dashes and
italics) make up 378 + 128 + 91 = 597 changes, or 11 percent of all
changes in the proofreading of this book.
Let's continue down the list. Here, some common OCR errors are
starting to show:
 31  [-l-] {+1+}
28  {+<tab>+}
28  [-sorn-] {+som+}
26  [-.-]
21  [-deri-] {+den+}
20  {+</b>+}
19  {+*+}
18  [-ined-] {+med+}
17  [-rnig-] {+mig+}
17  [-Lubeck-] {+Lübeck+}
16  [-örn-] {+om+}
16  [-rned-] {+med+}
16  [-Munchen-] {+München+}
15  {+</b> <b>+}
12  [-eri-] {+en+}
12  [-*-]
11  [-ä-] {+à+}
10  [-rnin-] {+min+}
10  [-jäg-] {+jag+}
10  [-a-] {+à+}
10  [-'-]
 9  [-ätt-] {+att+}
 9  [-pä-] {+på+}
 8  [-rnå-] {+må+}
 7  {+* <b>+}
 7  [-fastade-] {+fästade+}
 7  [-dä-] {+då+}
 7  [-G-] {+C+}
 7  [-,-]
 6  {+<b>+}
 6  [-mön-] {+mon+}
 6  [-Wurtemberg-] {+Würtemberg+}
 5  [-å-] {+à+}
 5  [-upp--]
 5  [-och.-] {+och+}
 5  [-for-] {+för+}
 5  [-alt-] {+att+}
 5  [-Goswig-] {+Coswig+}
 5  [-Gassel-] {+Cassel+}
 5  [-Biilow-] {+Bülow+}
 5  [---]
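Many entries in this list follow a few character confusions (rn for m, in for m, ri for n, ä for a or å).  As a sketch of how such patterns could be used to suggest corrections, the Python function below tries each confusion on a word that is not in a word list and keeps the results that are.  The confusion table is read off the list above; the word list and the function name are hypothetical.

```python
# Character confusions read off the list above: (OCR output, intended).
CONFUSIONS = [("rn", "m"), ("in", "m"), ("ri", "n"), ("ä", "a"), ("ä", "å")]


def suggest(word, known_words, confusions=CONFUSIONS):
    """Suggest known words reachable from `word` by one confusion fix."""
    if word in known_words:
        return []  # already a valid word, nothing to suggest
    found = []
    for wrong, right in confusions:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in known_words and candidate not in found:
                found.append(candidate)
            start = word.find(wrong, start + 1)
    return found
```

With a word list containing "som", "den" and "barn", `suggest("sorn", ...)` proposes "som" and `suggest("deri", ...)` proposes "den", while "barn" is left alone because it is itself a known word.  Corrections that need two fixes at once, such as "örn" to "om", are not found by this sketch.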


Patterns can also be seen among the changes that only occur once,
e.g., the removal of page numbers:
 1  [-175-]
 1  [-174-]


the editing of numbers and fractions:
 1  [-2J/2-] {+2 1/2+}
 1  [-2:rie-] {+2:ne+}


and the removal of "OCR dirt", small dots that shouldn't be there:
 1  [-.kärft-] {+kärft+}
 1  [-.kryssning-] {+kryssning+}
 1  [-.komminister-] {+komminister+}
 1  [-.klass-] {+klass+}
 1  [-.intellektuel-] {+intellektuel+}
 1  [-.inqvartering-] {+inqvartering+}
 1  [-.idéen-] {+idéen+}
 1  [-.han-] {+han+}
 1  [-.hafva-] {+hafva+}
 1  [-.gjort-] {+gjort+}
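Two of these patterns, stray page numbers and leading dots, are regular enough to be handled by a script.  The following Python sketch assumes that a page number sits alone on its line and that "OCR dirt" is a dot directly in front of a letter at the start of a word; both functions are hypothetical.  A line holding only a year would be dropped too, so the output still needs a look.

```python
import re


def drop_page_number_lines(text):
    """Remove lines that contain nothing but a number of up to 4 digits."""
    kept = [
        line for line in text.split("\n")
        if not re.fullmatch(r"\s*\d{1,4}\s*", line)
    ]
    return "\n".join(kept)


def strip_leading_dots(text):
    """Remove a dot that directly precedes a letter at the start of a word."""
    # (?<!\S): the dot is at the start of the text or follows whitespace.
    # (?=[^\W\d_]): the next character is a letter (not a digit or "_").
    return re.sub(r"(?<!\S)\.(?=[^\W\d_])", "", text)
```

For example, `strip_leading_dots("en .klass och .han kom. Sedan")` gives `"en klass och han kom. Sedan"`; the full stop after "kom" is kept because it follows a letter.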


Sometimes when the OCR software doesn't find a word in its dictionary,
it tries to split it into two recognized words, which the proofreader
then has to join:
 1  [-artilleri vetenskapen-] {+artillerivetenskapen+}