Project Runeberg,
Today (well, yesterday) I noticed that the work http://runeberg.org/akrell/ had been completely proofread. If you have been watching the "Recent changes" page, you have seen the signature "jens.christian.berlin" working on this title in recent months.
These are the political memoirs of three prominent Swedish gentlemen, Carl Fredrik Akrell, Samuel Gustaf von Troil, and Per Sahlström, published in 1884-1885, titled "Minnen från Carl XIV:s, Oscar I:s och Carl XV:s dagar". Among other things, they describe the political debates about the introduction of the electric telegraph in Sweden, some early railroads, and the first macadam-covered country road in southern Sweden. It is a single volume of 562 pages that I scanned and OCR-ed in August 2003.
So today I wrote a program to see which kinds of edits are most commonly needed to get the text in order after OCR. This is easy to do, as our website saves all old versions of every text. These are my findings, listed from the most frequent downward:
378 [---] {+--+}
128 {+<i>+}
91 {+</i>+}
Here, [- and -] surround parts that were removed, while {+ and +} surround parts that were inserted. This means that in 378 places, a single dash "-" was changed into a double "--". In 128 places, an opening italics tag "<i>" was inserted, and in 91 places, a closing italics tag "</i>" was inserted. It may seem mysterious that these numbers are not equal, but this is only an unfortunate result of how my program works. The missing </i> tags are found further down the list.
These are by far the most common edits done to this book. I think you will all agree that a lot of work could be saved if the OCR software got this right in the first place.
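The program itself is not shown here, so the following is only a minimal sketch of how such a count could be made, assuming each saved version of a page is available as a plain string. The function names change_patterns and count_patterns are hypothetical, and Python's standard difflib stands in for whatever comparison the real program uses.

```python
import difflib
from collections import Counter


def change_patterns(old_text, new_text):
    """Yield one wdiff-style string per changed run of words.

    Hypothetical helper: splits both versions on whitespace and
    compares the resulting word lists.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        parts = []
        if i2 > i1:  # words removed from the old version
            parts.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j2 > j1:  # words inserted in the new version
            parts.append("{+" + " ".join(new_words[j1:j2]) + "+}")
        yield " ".join(parts)


def count_patterns(version_pairs):
    """Count change patterns over an iterable of (old, new) text pairs."""
    counts = Counter()
    for old_text, new_text in version_pairs:
        counts.update(change_patterns(old_text, new_text))
    return counts


if __name__ == "__main__":
    # Made-up sample pages, not taken from the real edit history.
    pairs = [
        ("han sorn var - och ined", "han som var -- och med"),
        ("jag sorn", "jag som"),
    ]
    for pattern, n in count_patterns(pairs).most_common():
        print(n, pattern)
```

In this sketch, changed words that stand next to each other are merged into a single pattern, so a closing tag inserted directly before an opening tag is counted as one combined insertion rather than as two separate ones.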
In total, over all 562 pages, I counted 5268 changes, an average of about 9.4 changes per page. The three above (long dashes and italics) make up 378 + 128 + 91 = 597 changes, or 11 percent of all changes in the proofreading of this book.
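The arithmetic behind these figures can be checked with a few lines. This is only a sketch using the numbers quoted above; the variable names are made up for illustration.

```python
pages = 562
total_changes = 5268
top_three = 378 + 128 + 91  # long dashes, <i> and </i>

changes_per_page = total_changes / pages
top_three_share = 100 * top_three / total_changes

print(f"{changes_per_page:.2f} changes per page")
print(f"{top_three} changes, {top_three_share:.1f} percent of all changes")
```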
Let's continue down the list. Here, some common OCR errors are starting to show:
31 [-l-] {+1+}
28 {+<tab>+}
28 [-sorn-] {+som+}
26 [-.-]
21 [-deri-] {+den+}
20 {+</b>+}
19 {+*+}
18 [-ined-] {+med+}
17 [-rnig-] {+mig+}
17 [-Lubeck-] {+Lübeck+}
16 [-örn-] {+om+}
16 [-rned-] {+med+}
16 [-Munchen-] {+München+}
15 {+</b> <b>+}
12 [-eri-] {+en+}
12 [-*-]
11 [-ä-] {+à+}
10 [-rnin-] {+min+}
10 [-jäg-] {+jag+}
10 [-a-] {+à+}
10 [-'-]
9 [-ätt-] {+att+}
9 [-pä-] {+på+}
8 [-rnå-] {+må+}
7 {+* <b>+}
7 [-fastade-] {+fästade+}
7 [-dä-] {+då+}
7 [-G-] {+C+}
7 [-,-]
6 {+<b>+}
6 [-mön-] {+mon+}
6 [-Wurtemberg-] {+Würtemberg+}
5 [-å-] {+à+}
5 [-upp--]
5 [-och.-] {+och+}
5 [-for-] {+för+}
5 [-alt-] {+att+}
5 [-Goswig-] {+Coswig+}
5 [-Gassel-] {+Cassel+}
5 [-Biilow-] {+Bülow+}
5 [---]
Patterns can also be seen among the changes that occur only once, e.g., the removal of page numbers:
1 [-175-]
1 [-174-]
the editing of numbers and fractions:
1 [-2J/2-] {+2 1/2+}
1 [-2:rie-] {+2:ne+}
and the removal of "OCR dirt", small dots that shouldn't be there:
1 [-.kärft-] {+kärft+}
1 [-.kryssning-] {+kryssning+}
1 [-.komminister-] {+komminister+}
1 [-.klass-] {+klass+}
1 [-.intellektuel-] {+intellektuel+}
1 [-.inqvartering-] {+inqvartering+}
1 [-.idéen-] {+idéen+}
1 [-.han-] {+han+}
1 [-.hafva-] {+hafva+}
1 [-.gjort-] {+gjort+}
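Stray dots of this kind follow a simple pattern: a dot at the very start of a word, directly followed by a letter. As a rough sketch only (the function name strip_leading_dots is hypothetical, and this is not something the website does today), such cases could be found with a regular expression. A proofreader would still need to review the result, since some leading dots may be intentional.

```python
import re

# A dot that is not preceded by a non-space character (so it starts a
# token) and is directly followed by a letter.
LEADING_DOT = re.compile(r"(?<!\S)\.(?=[^\W\d_])")


def strip_leading_dots(text):
    """Remove dots glued to the front of a word; leave other dots alone."""
    return LEADING_DOT.sub("", text)


if __name__ == "__main__":
    # Made-up sample sentence.
    print(strip_leading_dots("att .han skulle .hafva gjort det."))
```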
Sometimes, when the OCR software doesn't find a word in its dictionary, it tries to split it into two recognized words, which the proofreader then has to join:
1 [-artilleri vetenskapen-] {+artillerivetenskapen+}
Another focused proofreader is Steen.Roennow, who has been working on http://runeberg.org/dbl/ or "Dansk biografisk Lexikon", first with indexing and now with proofreading. This work in 19 volumes is not yet completely proofread, but the most frequent changes so far are:
4320 {+</b>+}
2987 {+<i>+}
2957 {+</i> <b>+}
2371 [---] {+--+}
1555 [-f-] {+d.+}
1316 {+<b>+}
1308 [-i-] {+1+}
984 [-(f-] {+(d.+}
875 {+</i>+}
The single dash "-" that becomes a double "--" and the frequent use of italics are familiar from "akrell". DBL has a special pattern, where each article starts with a person's name in boldface and ends with the article author's name in italics, which explains the frequent insertion of </i> just before <b>.
The OCRed "f" that is changed to "d." is the cross or dagger indicating the death year or date of a person, which we transcribe into "d." for "dead", since "f." is used for the birth date (født).
488 [---] {+<b> <tab> -- <tab>+}
421 [-f-] {+</b> d.+}
285 [-Kjøben- havn-] {+Kjøbenhavn+}
282 [-n-] {+11+}
167 [-Kjø- benhavn-] {+Kjøbenhavn+}
125 [-Chri- stian-] {+Christian+}
120 [-°g-] {+og+}
108 [-Dan- mark-] {+Danmark+}
102 [-ud- nævntes-] {+udnævntes+}
99 [-Virk- somhed-] {+Virksomhed+}
87 [-å-] {+à+}
Here we can see the effects of keeping hyphenated words in the OCR text. A lot of proofreading effort must be invested in joining those hyphenated words, which is unfortunate. In the more recently scanned works, the OCR software joins the hyphenated words without losing track of line breaks.
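To illustrate the kind of joining involved, here is a minimal sketch that rejoins a word split by an end-of-line hyphen, but only when the joined form is found in a word list. The function name join_hyphenated and the known_words set are assumptions made for this example; this is not how the OCR software works, and unlike that software the sketch discards the line break.

```python
import re

# A word ending in a hyphen, then whitespace (possibly a line break),
# then the continuation of the word.
HYPHEN_BREAK = re.compile(r"(\w+)-\s+(\w+)")


def join_hyphenated(text, known_words):
    """Join 'Kjøben- havn' into 'Kjøbenhavn' when the result is a known word.

    known_words is a hypothetical set of lowercase words; splits whose
    joined form is not in the set are left untouched.
    """
    def replace(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in known_words:
            return joined
        return match.group(0)

    return HYPHEN_BREAK.sub(replace, text)


if __name__ == "__main__":
    # Made-up sample text and word list.
    words = {"kjøbenhavn", "danmark"}
    print(join_hyphenated("i Kjøben- havn og Dan- mark", words))
```

Checking against a word list is a cautious choice: a hyphen that really belongs in the text stays in place unless the joined form happens to be a known word.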
As with akrell, there are just a few cases that occur a great many times. The 20 change patterns above range from 4320 occurrences down to 87. The rest of the list consists of a large number of different cases that each occur only a few (tens of) times.
83 [-saa- ledes-] {+saaledes+}
83 [-For- hold-] {+Forhold+}
82 [-for- skjellige-] {+forskjellige+}
80 [-ble ven-] {+bleven+}
80 [-Med- lem-] {+Medlem+}
78 [-Sogne- præst-] {+Sognepræst+}
76 {+<tab>+}
76 [-til- bage-] {+tilbage+}
76 [-Oehlenschlager-] {+Oehlenschläger+}
74 [-Frede- rik-] {+Frederik+}
73 [-Gluckstadt-] {+Glückstadt+}
70 [-ii-] {+11+}
67 [-Muller-] {+Müller+}
63 [-Kjøben- havns-] {+Kjøbenhavns+}
60 [-der- efter-] {+derefter+}
58 [-Pro- fessor-] {+Professor+}
58 [-Oehlenschlagers-] {+Oehlenschlägers+}
57 [-oven- nævnte-] {+ovennævnte+}
55 [-alle- rede-] {+allerede+}
50 [-ff-] {+<i> H+}
50 [-For- bindelse-] {+Forbindelse+}
A special feature of DBL is the many corrections printed at the back of the volumes. The proofreader has introduced these into the text, where they seem to appear out of nowhere, e.g.:
1 {+G. døde 26. Maj 1895.+}
1 {+G. døde 24. Juni 1902.+}
1 {+G. døde 23. Juni 1895. <i>+}
1 {+G. døde 22. Okt. 1891. <i>+}
1 {+G. døde 21. Marts 1903. <i>+}
1 {+G. døde 18. Juni 1894. <i>+}
1 {+G. døde 15. Okt. 1902. <i>+}
1 {+G. døde 14. Sept. 1899. <i>+}
1 {+G. døde 13. Sept. 1893.+}
1 {+G. døde 11. Juli 1895. <i>+}