When I noticed that the same errors occurred from time to time, I found it convenient to extract the raw OCR texts into a single file containing 10, or sometimes 100, single pages. Then I could apply the replace command in MS Word to correct the recurring errors: first removing the OCR trash (tabs, spurious chars) and the extra spaces (which, although not appearing in HTML, are a bit annoying when proofreading), then the most common errors, such as 'rn' instead of 'm' etc. The Swedish words 'han' occurring as 'lima' and 'med' as 'umeå'(!) are among the more noteworthy.
Having the text in a separate file makes it possible to arrange the facsimile (browser) on the left of the screen and the OCR text (Notepad) on the right, thus facilitating the proofreading.
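For those who prefer a script to the replace dialog, the same clean-up could be sketched in Python roughly as follows. This is only an illustration, not the procedure I actually used: the function name clean_ocr and the table WORD_FIXES are made up for the example, and the table holds just the two word errors mentioned above. A blind whole-word replacement can of course hit a legitimate word, so the result still has to be proofread.

```python
import re

# Hypothetical replacement table: wrong OCR reading -> intended word.
# Only the two examples from the text above; extend it per book.
WORD_FIXES = {"lima": "han", "umeå": "med"}

def clean_ocr(text, fixes=WORD_FIXES):
    """Remove tabs and extra spaces, then apply whole-word replacements."""
    # Tabs become spaces, runs of spaces collapse to a single space.
    text = text.replace("\t", " ")
    text = re.sub(r" {2,}", " ", text)
    # Drop spaces left dangling at the end of each line.
    text = re.sub(r" +$", "", text, flags=re.MULTILINE)
    # Case-sensitive whole-word replacements, so "Umeå" (the city) is left alone.
    for wrong, right in fixes.items():
        text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text)
    return text

print(clean_ocr("lima  gick\tumeå honom "))
```

The replacements are deliberately case-sensitive and limited to whole words, which keeps the number of false hits down at the cost of missing capitalized occurrences.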
It is possible to use MS Word to insert tags like <i> with a single keypress. After using Word, I find it necessary, however, to filter the text through Notepad, in order to be sure that no Microsoft special chars are hidden somewhere. For example, Word might convert short dashes into long, etc.
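Instead of the detour through Notepad, one could also scan the file for the characters Word typically substitutes. The sketch below is an assumption on my part about which characters to look for (long dashes, curly quotes, ellipsis, non-breaking space); the names WORD_CHARS, find_word_chars and normalize_word_chars are invented for the example, and the chosen replacements are only one possible convention. Swedish letters such as å, ä, ö are left untouched.

```python
# Hypothetical table of characters Word tends to introduce,
# mapped to plain replacements (the mapping is an assumption).
WORD_CHARS = {
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
}

def find_word_chars(text):
    """Return (line, column, character) for every suspicious character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in WORD_CHARS:
                hits.append((lineno, col, ch))
    return hits

def normalize_word_chars(text):
    """Replace the suspicious characters with their plain equivalents."""
    return text.translate(str.maketrans(WORD_CHARS))

sample = "på väg \u2013 till M\u00fcnchen"
print(find_word_chars(sample))
print(normalize_word_chars(sample))
```

Listing the hits first, rather than replacing silently, makes it possible to check each one against the facsimile before changing anything.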
Bernhard Johanson
runeberg-request@lists.lysator.liu.se wrote:
Today's Topics:
- Most common proofreading edits (Lars Aronsson)
- Re: Most common proofreading edits (Lars Aronsson)
--__--__--
Message: 1
Date: Tue, 4 May 2004 01:04:22 +0200 (CEST)
From: Lars Aronsson lars@aronsson.se
To: runeberg@lists.lysator.liu.se
Subject: [Runeberg] Most common proofreading edits
Project Runeberg,
Today (well, yesterday) I noticed that the work http://runeberg.org/akrell/ was completely proofread. If you have been watching the "Recent changes" page, you have seen the signature "jens.christian.berlin" working on this title in recent months.
These are the political memoirs of three prominent Swedish gentlemen, Carl Fredrik Akrell, Samuel Gustaf von Troil, and Per Sahlström, published in 1884-1885, titled "Minnen från Carl XIV:s, Oscar I:s och Carl XV:s dagar". Among other things, they describe the political debates about the introduction of the electric telegraph in Sweden, some early railroads and also the first macadam-covered country road in southern Sweden. It is a single volume of 562 pages that I scanned and OCR-ed in August 2003.
So today I wrote a program to see which kinds of edits are most commonly needed to get the text in order after OCR. This is easy to do, as our website saves all old versions of every text. These are my findings, listed from the most frequent ones, down:
378 [---] {+--+}
128 {+<i>+}
91 {+</i>+}
Here, [- and -] surround parts that were removed, while {+ and +} surround parts that were inserted. This means in 378 places, a single dash "-" was changed into a double "--". In 128 places, an opening italics tag "<i>" was inserted, and in 91 places, a closing italics tag "</i>" was inserted. It could seem like a mystery that these numbers are not equal. However, this is only an unfortunate result of how my program works. The missing </i> tags are found further down the list.
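For readers who want to try something similar on their own texts, a tally in this notation could be sketched as below. This is not the program used for the figures in this message; it is a minimal word-level comparison built on Python's difflib, and the names word_edits and tally are invented for the example. It splits on whitespace only, so its counts would not necessarily match the ones listed here.

```python
from collections import Counter
from difflib import SequenceMatcher

def word_edits(old, new):
    """List the word-level edits between two versions of a text,
    written as [-removed-] {+inserted+}."""
    a, b = old.split(), new.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed = " ".join(a[i1:i2])
        inserted = " ".join(b[j1:j2])
        parts = []
        if removed:
            parts.append("[-" + removed + "-]")
        if inserted:
            parts.append("{+" + inserted + "+}")
        edits.append(" ".join(parts))
    return edits

def tally(version_pairs):
    """Count identical edits over many (old, new) page pairs,
    most frequent first."""
    counts = Counter()
    for old, new in version_pairs:
        counts.update(word_edits(old, new))
    return counts.most_common()

pages = [("jag sorn var", "jag som var"), ("han sorn - kom", "han som -- kom")]
for edit, count in tally(pages):
    print(count, edit)
```

Each pair would be the raw OCR text of a page and its proofread version; how those are fetched from the saved old versions is left out here.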
These are by far the most common edits done to this book. I think you will all agree that a lot of work could be saved if the OCR software got this right in the first place.
In total, over all 562 pages, I counted 5268 changes, or an average of about 9.4 changes per page. The three above (long dashes and italics) make up 378 + 128 + 91 = 597 changes, or 11 percent of all changes in the proofreading of this book.
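The arithmetic is easy to check; the few lines below only recompute the two ratios from the figures given in this message (the variable names are just for the example).

```python
pages = 562
total_changes = 5268
top_three = 378 + 128 + 91   # long dashes, <i>, </i>

per_page = total_changes / pages
share = top_three / total_changes

print(top_three)                              # 597
print(f"{per_page:.2f} changes per page")     # 9.37 changes per page
print(f"{share:.1%} of all changes")          # 11.3% of all changes
```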
Let's continue down the list. Here, some common OCR errors are starting to show:
31 [-l-] {+1+}
28 {+<tab>+}
28 [-sorn-] {+som+}
26 [-.-]
21 [-deri-] {+den+}
20 {+</b>+}
19 {+*+}
18 [-ined-] {+med+}
17 [-rnig-] {+mig+}
17 [-Lubeck-] {+Lübeck+}
16 [-örn-] {+om+}
16 [-rned-] {+med+}
16 [-Munchen-] {+München+}
15 {+</b> <b>+}
12 [-eri-] {+en+}
12 [-*-]
11 [-ä-] {+à+}
10 [-rnin-] {+min+}
10 [-jäg-] {+jag+}
10 [-a-] {+à+}
10 [-'-]
9 [-ätt-] {+att+}
9 [-pä-] {+på+}
8 [-rnå-] {+må+}
7 {+* <b>+}
7 [-fastade-] {+fästade+}
7 [-dä-] {+då+}
7 [-G-] {+C+}
7 [-,-]
6 {+<b>+}
6 [-mön-] {+mon+}
6 [-Wurtemberg-] {+Würtemberg+}
5 [-å-] {+à+}
5 [-upp--]
5 [-och.-] {+och+}
5 [-for-] {+för+}
5 [-alt-] {+att+}
5 [-Goswig-] {+Coswig+}
5 [-Gassel-] {+Cassel+}
5 [-Biilow-] {+Bülow+}
5 [---]
Also among the changes that only occur once, patterns are to be seen, e.g., the removal of page numbers:
1 [-175-]
1 [-174-]
the editing of numbers and fractions:
1 [-2J/2-] {+2 1/2+}
1 [-2:rie-] {+2:ne+}
and the removal of "OCR dirt", small dots that shouldn't be there:
1 [-.kärft-] {+kärft+}
1 [-.kryssning-] {+kryssning+}
1 [-.komminister-] {+komminister+}
1 [-.klass-] {+klass+}
1 [-.intellektuel-] {+intellektuel+}
1 [-.inqvartering-] {+inqvartering+}
1 [-.idéen-] {+idéen+}
1 [-.han-] {+han+}
1 [-.hafva-] {+hafva+}
1 [-.gjort-] {+gjort+}
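Dirt of this particular kind follows a simple pattern, a dot glued to the front of a word, so it could in principle be removed mechanically. The sketch below is one way to do it with a regular expression; the function name strip_leading_dots is invented for the example. It assumes that no legitimate token in the text starts with a dot, so a number written like ".5" would lose its dot as well.

```python
import re

# A dot at the very start of a token (preceded by whitespace or the
# start of the text) and directly followed by a letter or digit.
LEADING_DOT = re.compile(r"(?<!\S)\.(?=\w)")

def strip_leading_dots(text):
    """Remove stray dots glued to the beginning of words."""
    return LEADING_DOT.sub("", text)

print(strip_leading_dots(".kärft och .han kom."))
```

Dots at the end of a word are not touched, so ordinary full stops survive.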
Sometimes when the OCR software doesn't find a word in its dictionary, it tries to split it into two recognized words, which the proofreader then has to join:
1 [-artilleri vetenskapen-] {+artillerivetenskapen+}
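Split words like this could be flagged for the proofreader if a list of known words is available, for instance words collected from pages that are already proofread. That word list is an assumption, as is the function name suggest_joins; the sketch only proposes candidates and changes nothing by itself.

```python
def suggest_joins(text, known_words):
    """Return (split form, joined form) for adjacent word pairs
    whose concatenation is a known word."""
    words = text.split()
    suggestions = []
    for left, right in zip(words, words[1:]):
        joined = left + right
        if joined.lower() in known_words:
            suggestions.append((left + " " + right, joined))
    return suggestions

# Hypothetical word list; in practice it would be much larger.
known = {"artillerivetenskapen"}
print(suggest_joins("om artilleri vetenskapen i Sverige", known))
```

Punctuation attached to a word is not stripped here, so a pair followed by a comma or full stop would be missed.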