[Runeberg] Most common proofreading edits

4 May 2004


      Project Runeberg,
Today (well, yesterday) I noticed that the work
http://runeberg.org/akrell/ had been completely proofread.  If you have
been watching the "Recent changes" page, you have seen the signature
"jens.christian.berlin" working on this title over the last few months.

These are the political memoirs of three prominent Swedish gentlemen,
Carl Fredrik Akrell, Samuel Gustaf von Troil, and Per Sahlström,
published in 1884-1885, titled "Minnen från Carl XIV:s, Oscar I:s och
Carl XV:s dagar".  Among other things, they describe the political
debates about the introduction of the electric telegraph in Sweden,
some early railroads, and the first macadam-covered country road
in southern Sweden.  It is a single volume of 562 pages that I scanned
and OCR-ed in August 2003.

So today I wrote a program to see which kinds of edits are most
commonly needed to get the text in order after OCR.  This is easy to
do, as our website saves all old versions of every text.  These are my
findings, listed from the most frequent downward:
    378  [---] {+--+}
    128  {+<i>+}
     91  {+</i>+}
Here, [- and -] surround parts that were removed, while {+ and +}
surround parts that were inserted.  This means that in 378 places, a
single dash "-" was changed into a double "--".  In 128 places, an
opening italics tag "<i>" was inserted, and in 91 places, a closing
italics tag "</i>" was inserted.  It may seem puzzling that these two
numbers are not equal, but this is only an unfortunate side effect of
how my program works.  The missing </i> tags are found further down
the list.

These are by far the most common edits made to this book.  I think you
will all agree that a lot of work could be saved if the OCR software
got this right in the first place.
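
The post does not show the counting program, so the following is only
a minimal sketch of how such a count could be made, assuming each page
is available as a pair of strings (OCR text, proofread text).  It uses
Python's difflib on whitespace-separated words and prints edits in the
same [- -] {+ +} notation, which looks like that of GNU wdiff.  The
function names and the sample sentence are made up for illustration.

```python
import difflib
from collections import Counter


def word_edits(old_text, new_text):
    """Yield one "[-removed-] {+inserted+}" string per changed run of words."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        parts = []
        if i2 > i1:  # words present only in the old version
            parts.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j2 > j1:  # words present only in the new version
            parts.append("{+" + " ".join(new_words[j1:j2]) + "+}")
        yield " ".join(parts)


def count_edits(version_pairs):
    """Count identical edits over many (old_text, new_text) pairs."""
    counts = Counter()
    for old_text, new_text in version_pairs:
        counts.update(word_edits(old_text, new_text))
    return counts


# Hypothetical sample page: OCR output first, proofread text second.
sample = [("Jag reste - sedan sorn vanligt - till Lubeck",
           "Jag reste -- sedan som vanligt -- till Lübeck")]
for edit, n in count_edits(sample).most_common():
    print(f"{n:7d}  {edit}")
```

In this sketch, changed words that stand next to each other are
reported as one combined edit, so a tag inserted beside another
correction is counted under a different key.  Something similar may be
behind the unequal <i> and </i> counts, but the post does not say.
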
In total, over all 562 pages, I counted 5268 changes, or
an average of about 9.4 changes per page.  The three above (long dashes
and italics) make up 378 + 128 + 91 = 597 changes, or 11 percent of all
changes made in the proofreading of this book.
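
The arithmetic can be rechecked with a few lines of Python.  The
numbers are taken from the text above; the variable names are only
illustrative.

```python
pages = 562
total_changes = 5268
top_three = 378 + 128 + 91  # long dashes, <i>, </i>

changes_per_page = total_changes / pages
top_three_share = top_three / total_changes

print(f"{top_three} changes in the top three")
print(f"{changes_per_page:.1f} changes per page")
print(f"{top_three_share:.0%} of all changes")
```
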
Let's continue down the list. Here, some common OCR errors are
starting to show:
     31  [-l-] {+1+}
     28  {+<tab>+}
     28  [-sorn-] {+som+}
     26  [-.-]
     21  [-deri-] {+den+}
     20  {+</b>+}
     19  {+*+}
     18  [-ined-] {+med+}
     17  [-rnig-] {+mig+}
     17  [-Lubeck-] {+Lübeck+}
     16  [-örn-] {+om+}
     16  [-rned-] {+med+}
     16  [-Munchen-] {+München+}
     15  {+</b> <b>+}
     12  [-eri-] {+en+}
     12  [-*-]
     11  [-ä-] {+à+}
     10  [-rnin-] {+min+}
     10  [-jäg-] {+jag+}
     10  [-a-] {+à+}
     10  [-'-]
      9  [-ätt-] {+att+}
      9  [-pä-] {+på+}
      8  [-rnå-] {+må+}
      7  {+* <b>+}
      7  [-fastade-] {+fästade+}
      7  [-dä-] {+då+}
      7  [-G-] {+C+}
      7  [-,-]
      6  {+<b>+}
      6  [-mön-] {+mon+}
      6  [-Wurtemberg-] {+Würtemberg+}
      5  [-å-] {+à+}
      5  [-upp--]
      5  [-och.-] {+och+}
      5  [-for-] {+för+}
      5  [-alt-] {+att+}
      5  [-Goswig-] {+Coswig+}
      5  [-Gassel-] {+Cassel+}
      5  [-Biilow-] {+Bülow+}
      5  [---]
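
Many of the entries above come down to a few character confusions
("rn" read for "m", "ri" for "n", a missing or extra diacritic).  As a
rough sketch of how such confusions could be used to propose
corrections, the code below tries them on a word and keeps the results
found in a word list.  The confusion table is read off the list above;
the function names and the tiny lexicon are assumptions, and any
suggestion would still need a proofreader's eye.

```python
# (OCR output, likely intended text), read off the list of edits above.
CONFUSIONS = [
    ("rn", "m"), ("ri", "n"), ("in", "m"),
    ("ä", "a"), ("ä", "å"), ("ö", "o"),
    ("ii", "ü"), ("G", "C"),
]


def candidates(word):
    """Return all strings reachable from word by applying confusions."""
    seen = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for wrong, right in CONFUSIONS:
            start = current.find(wrong)
            while start != -1:
                fixed = current[:start] + right + current[start + len(wrong):]
                if fixed not in seen:
                    seen.add(fixed)
                    frontier.append(fixed)
                start = current.find(wrong, start + 1)
    seen.discard(word)
    return seen


def suggest(word, lexicon):
    """Return the candidates that are known words, sorted."""
    return sorted(candidates(word) & set(lexicon))


# Hypothetical mini lexicon; a real one would be a full Swedish word list.
lexicon = {"som", "den", "med", "mig", "om", "på", "Bülow"}
for ocr_word in ["sorn", "deri", "ined", "örn", "pä", "Biilow"]:
    print(ocr_word, "->", suggest(ocr_word, lexicon))
```
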
Patterns can also be seen among the changes that occur only once,
e.g., the removal of page numbers:
      1  [-175-]
      1  [-174-]
the editing of numbers and fractions:
      1  [-2J/2-] {+2 1/2+}
      1  [-2:rie-] {+2:ne+}
and the removal of "OCR dirt", small dots that shouldn't be there:
      1  [-.kärft-] {+kärft+}
      1  [-.kryssning-] {+kryssning+}
      1  [-.komminister-] {+komminister+}
      1  [-.klass-] {+klass+}
      1  [-.intellektuel-] {+intellektuel+}
      1  [-.inqvartering-] {+inqvartering+}
      1  [-.idéen-] {+idéen+}
      1  [-.han-] {+han+}
      1  [-.hafva-] {+hafva+}
      1  [-.gjort-] {+gjort+}
Sometimes when the OCR software doesn't find a word in its dictionary,
it tries to split it into two recognized words, which the proofreader
then has to join:
      1  [-artilleri vetenskapen-] {+artillerivetenskapen+}
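
The last two patterns, stray leading dots and wrongly split words,
could in principle be flagged automatically.  The sketch below is one
possible heuristic, not something the post describes: it assumes a
word list is available, and the function name and sample words are
invented.  Joining on a lexicon hit can also merge two words that
really were separate, so the output is a suggestion for review.

```python
def clean_tokens(words, lexicon):
    """Strip stray leading dots and join split words found in the lexicon."""
    result = []
    i = 0
    while i < len(words):
        word = words[i]
        # "OCR dirt": a dot in front of an otherwise known word.
        if word.startswith(".") and word[1:] in lexicon:
            word = word[1:]
        # A word split in two: join when the joined form is a known word.
        if i + 1 < len(words) and word + words[i + 1] in lexicon:
            result.append(word + words[i + 1])
            i += 2
            continue
        result.append(word)
        i += 1
    return result


# Hypothetical word list and OCR tokens.
lexicon = {"han", "studerade", "artilleri", "vetenskapen",
           "artillerivetenskapen"}
print(clean_tokens([".han", "studerade", "artilleri", "vetenskapen"], lexicon))
```
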
-- 
  Lars Aronsson (lars@aronsson.se)
  Project Runeberg - free Nordic literature - http://runeberg.org/
