Project Runeberg,
Today (well, yesterday) I noticed that the work http://runeberg.org/akrell/ had been completely proofread. If you have been watching the "Recent changes" page, you have seen the signature "jens.christian.berlin" working on this title over the last few months.
These are the political memoirs of three prominent Swedish gentlemen, Carl Fredrik Akrell, Samuel Gustaf von Troil, and Per Sahlström, published in 1884-1885, titled "Minnen från Carl XIV:s, Oscar I:s och Carl XV:s dagar". Among other things, they describe the political debates about the introduction of the electric telegraph in Sweden, some early railroads and also the first macadam-covered country road in southern Sweden. It is a single volume of 562 pages that I scanned and OCR-ed in August 2003.
So today I wrote a program to see which kinds of edits are most commonly needed to get the text in order after OCR. This is easy to do, as our website saves all old versions of every text. These are my findings, listed from the most frequent downward:
378 [---] {+--+}
128 {+<i>+}
91 {+</i>+}
Here, [- and -] surround parts that were removed, while {+ and +} surround parts that were inserted. This means that in 378 places, a single dash "-" was changed into a double "--". In 128 places, an opening italics tag "<i>" was inserted, and in 91 places, a closing italics tag "</i>" was inserted. It may seem a mystery that these two numbers are not equal. However, this is only an unfortunate result of how my program works. The missing </i> tags are found further down the list.
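A count of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the program used for the figures in this post: the tokenization rule (markup tags and whitespace-separated words each count as one token), the function names and the sample sentences are all assumptions made up for the example.

```python
import difflib
import re
from collections import Counter

# Assumed tokenization: a markup tag such as <i> is one token, and so is
# every run of non-space characters between tags.
TOKEN = re.compile(r"<[^>\s]+>|[^\s<]+|<")


def tokenize(text):
    return TOKEN.findall(text)


def wdiff_edits(old, new):
    """Yield one "[-removed-] {+inserted+}" string per changed region."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        parts = []
        if i2 > i1:  # tokens present only in the old version
            parts.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:  # tokens present only in the new version
            parts.append("{+" + " ".join(b[j1:j2]) + "+}")
        yield " ".join(parts)


def count_edits(version_pairs):
    """Count edits over (old_text, new_text) pairs, e.g. one pair per page."""
    counts = Counter()
    for old, new in version_pairs:
        counts.update(wdiff_edits(old, new))
    return counts


if __name__ == "__main__":
    # Invented sample text, not taken from the book.
    pairs = [
        ("det var - sade han - alldeles sorn vanligt",
         "det var -- sade han -- alldeles som vanligt"),
        ("titeln Minnen", "titeln <i>Minnen</i>"),
    ]
    for edit, n in count_edits(pairs).most_common():
        print(n, edit)
```

Note that in this sketch neighbouring changes with no unchanged token between them are reported as one combined edit, so the counts depend on how the text is split into tokens.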
These are by far the most common edits made to this book. I think you will all agree that a lot of work could be saved if the OCR software got this right in the first place.
Over all 562 pages, I counted 5268 changes in total, an average of about 9.4 changes per page. The three above (long dashes and italics) make up 378 + 128 + 91 = 597 changes, or 11 percent of all changes made in proofreading this book.
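The arithmetic is easy to recheck. The snippet below uses only the numbers quoted in this post; the variable names are made up for the example.

```python
pages = 562
total_changes = 5268
top_three = 378 + 128 + 91  # long dashes, <i> and </i>

per_page = total_changes / pages
share = 100 * top_three / total_changes

print(f"{per_page:.2f} changes per page")                     # 9.37 changes per page
print(f"{top_three} changes, {share:.1f} percent of all changes")  # 597 changes, 11.3 percent of all changes
```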
Let's continue down the list. Here, some common OCR errors are starting to show:
31 [-l-] {+1+}
28 {+<tab>+}
28 [-sorn-] {+som+}
26 [-.-]
21 [-deri-] {+den+}
20 {+</b>+}
19 {+*+}
18 [-ined-] {+med+}
17 [-rnig-] {+mig+}
17 [-Lubeck-] {+Lübeck+}
16 [-örn-] {+om+}
16 [-rned-] {+med+}
16 [-Munchen-] {+München+}
15 {+</b> <b>+}
12 [-eri-] {+en+}
12 [-*-]
11 [-ä-] {+à+}
10 [-rnin-] {+min+}
10 [-jäg-] {+jag+}
10 [-a-] {+à+}
10 [-'-]
9 [-ätt-] {+att+}
9 [-pä-] {+på+}
8 [-rnå-] {+må+}
7 {+* <b>+}
7 [-fastade-] {+fästade+}
7 [-dä-] {+då+}
7 [-G-] {+C+}
7 [-,-]
6 {+<b>+}
6 [-mön-] {+mon+}
6 [-Wurtemberg-] {+Würtemberg+}
5 [-å-] {+à+}
5 [-upp--]
5 [-och.-] {+och+}
5 [-for-] {+för+}
5 [-alt-] {+att+}
5 [-Goswig-] {+Coswig+}
5 [-Gassel-] {+Cassel+}
5 [-Biilow-] {+Bülow+}
5 [---]
Patterns can also be seen among the changes that occur only once, e.g., the removal of page numbers:
1 [-175-]
1 [-174-]
the editing of numbers and fractions:
1 [-2J/2-] {+2 1/2+}
1 [-2:rie-] {+2:ne+}
and the removal of "OCR dirt", small dots that shouldn't be there:
1 [-.kärft-] {+kärft+}
1 [-.kryssning-] {+kryssning+}
1 [-.komminister-] {+komminister+}
1 [-.klass-] {+klass+}
1 [-.intellektuel-] {+intellektuel+}
1 [-.inqvartering-] {+inqvartering+}
1 [-.idéen-] {+idéen+}
1 [-.han-] {+han+}
1 [-.hafva-] {+hafva+}
1 [-.gjort-] {+gjort+}
Sometimes when the OCR software doesn't find a word in its dictionary, it tries to split it into two recognized words, which the proofreader then has to join:
1 [-artilleri vetenskapen-] {+artillerivetenskapen+}
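The patterns among these one-off changes could in principle be grouped automatically. The following is a rough sketch under simple assumptions: the function name `classify` and the category labels are invented for the example, and the rules are heuristics (a removed run of digits is only probably a page number).

```python
def classify(removed, inserted):
    """Assign a single edit, given as (removed, inserted) text, to a rough pattern."""
    # A bare number that was deleted without replacement: probably a page number.
    if removed.isdigit() and not inserted:
        return "page number removed"
    # A stray dot glued to the front of an otherwise correct word.
    if inserted and removed == "." + inserted:
        return "leading dot removed"
    # A word that was split in two and then rejoined by the proofreader.
    if " " in removed and removed.replace(" ", "") == inserted:
        return "split word joined"
    return "other"


print(classify("175", ""))                                        # page number removed
print(classify(".kärft", "kärft"))                                # leading dot removed
print(classify("artilleri vetenskapen", "artillerivetenskapen"))  # split word joined
print(classify("sorn", "som"))                                    # other
```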