Project Runeberg,
Here is a status report of what's going on right now -- it's a lot! I'll be travelling until the middle of October, so I probably won't have time to post any summary of September's activities before then.
This letter first summarizes the new titles we are working on, then goes into some more detail about our software development and the current state of the project as such.
== Published works ==
All works listed here are in Swedish, except for three that are in Danish: "ankjaer", "gblevned", and "ildalihi" (see below).
Some works that we published long ago as facsimile images only have now been OCRed for the first time, and are waiting to be proofread. They are:
- August Strindberg, "Svenska öden och äfventyr", http://runeberg.org/afventyr/
- "En kort historik öfver Wikningskommissionens uppkomst och utveckling", http://runeberg.org/wiknings/
- "Vägledare för turister på Kinda Kanal", http://runeberg.org/kindakan/
- Frans Michael Franzén, "Samlade dikter", http://runeberg.org/fmfdikt/
- "Hermeskalendern 1934", http://runeberg.org/hermes/
The remaining volumes of Carl August Hagberg's Swedish translation of Shakespeare's dramatic works have been scanned, and all 12 volumes are waiting to be proofread, http://runeberg.org/hagberg/
Many new works have been scanned and are now being indexed; they are already available for proofreading. Often (too often) you will find the words "under arbete" (work in progress) as the only text on the title page of these works.
Wildlife (this is Hans Persson's section):
- Chr. Aurivillius, "Svenska fåglarna", http://runeberg.org/faglarna/
- Thor Högdahl, "Gammel-Ante", http://runeberg.org/gmlante/
Science fiction (this too is Hans Persson's section):
- Elfred Berggren, "Robotarnas gud", http://runeberg.org/robotgud/
- Otto Witt, "Jordens inre", http://runeberg.org/jordinre/
- Otto Witt, "Den underbara spegeln", http://runeberg.org/spegeln/
Science and technology:
- "Radio-Amatören", tidskrift, 1924, http://runeberg.org/radioama/
Geography and topography:
- Stefan Ankjær, "Geografisk-Statistisk Haandbog", 2 volumes, 1858-63, printed in "Fraktur" (German / Gothic type), a Danish encyclopedia or "geographic-statistic dictionary" of places around the world, http://runeberg.org/ankjaer/
Literature and (auto-)biography:
- Georg Brandes, "Levned", 3 volumes, autobiography of Denmark's most famous literary critic, http://runeberg.org/gblevned/
- P. Hansen, "Illustreret dansk Litteraturhistorie", 3 volumes, http://runeberg.org/ildalihi/
- J. L. Runeberg, "Samlade skrifter", 6 volumes, collected works of the Finland-Swedish writer after whom our entire project is named, http://runeberg.org/runeberg/
- Carl Snoilsky, "Minnesanteckningar och andra uppsatser", http://runeberg.org/snoilmin/
- Talis Qualis (C.V.A. Strandberg), "Samlade vitterhetsarbeten", 5 volumes, http://runeberg.org/talis/
- Karl Warburg, "Anna Maria Lenngren", both editions (1887 and 1917), http://runeberg.org/warlenng/
- Karl Warburg, "Carl Snoilsky. Hans lefnad och skaldskap", http://runeberg.org/warsnoil/
== Software development and status of our project ==
Hans Persson, Leif Stensson, and I are improving the software that produces web pages ("digital facsimile editions") from the images and text of scanned books. This is an ongoing automation of our manual routines, which enables us to work faster and get a better overview of what we're doing. The overview matters at least as much as the speed. For the first time in many years, I now feel that I have a good grasp of all works that we have published, or at least all facsimile editions that we have published after 1998. Some of the old works were only scanned as images with no OCR text, but those gaps are now being filled in.
Note to self: In the choice between documenting a complicated manual procedure and automating the procedure so that the need for an explanation goes away, I always prefer the latter. The hard part is to realize that this choice exists. It is often tempting to explain details in written documentation, only to find that nobody reads the documentation. That temptation should be resisted.
The overview of our published digital facsimile editions at http://runeberg.org/korr.html is updated daily. This page is currently only available in Swedish, so a legend and explanation follow below. The overview has one line for every work or volume that we've published, and the volumes are grouped into different tables according to their status. Newly scanned volumes start in the bottom table and work their way up to the top of the page.
Table headings:
- "Verk som...som klara" = Works that we consider to be done. Chapters have full HTML markup. Currently 12 volumes. All our text-only works (some 200 titles) could be added to this category.
- "Helt korrekturlästa..." = Fully proofread works that lack HTML markup. These volumes need the attention of our editors, as we currently don't allow our readers to help with markup. Currently 15 volumes.
- "Nästan fullständigt..." = Almost completely proofread works, where only a little effort remains. Some of our most active proofreaders are competing to move these volumes up to the next stage. If we ever introduce a system to award points for proofreading, I think this should earn a bonus. Currently only 5 volumes.
- "Ofullständigt korre..." = Incompletely proofread works. A lot of hard proofreading work is needed here. The table contains 122 volumes.
- "Ofullständigt index..." = Incompletely indexed works, where we don't yet know how many chapters they contain. Currently 73 volumes, which fall into two categories: encyclopedic works that need a lot of indexing, and newly scanned works that are in the process of being indexed.
- "Sammanfattning" = Summary
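The way a volume climbs from the bottom table to the top one can be sketched as a small decision ladder. This is only an illustration, not the rule the overview page really applies: the `Volume` fields, the English table names, and above all the 90% threshold for "almost completely proofread" are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    """Hypothetical per-volume counts; field names are made up for this sketch."""
    pages: int
    indexed_pages: int
    proofread_pages: int
    has_html: bool


def status_table(v: Volume, almost: float = 0.9) -> str:
    """Pick the table a volume would land in, checking from the bottom table up.

    The `almost` threshold is an assumption; the source does not state
    where "almost completely proofread" begins.
    """
    if v.indexed_pages < v.pages:
        return "incompletely indexed"
    if v.proofread_pages < v.pages:
        if v.proofread_pages >= almost * v.pages:
            return "almost completely proofread"
        return "incompletely proofread"
    return "done" if v.has_html else "fully proofread"


newly_scanned = Volume(pages=300, indexed_pages=0, proofread_pages=0, has_html=False)
nearly_there = Volume(pages=300, indexed_pages=300, proofread_pages=285, has_html=False)
finished = Volume(pages=300, indexed_pages=300, proofread_pages=300, has_html=True)
```

Checking the index first mirrors the order of the tables: a volume cannot be judged on proofreading by chapter until its chapters are known.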
Column headings:
- "Verk eller volym" = Work or volume
- "Antal sidor" = Number of facsimile pages
- "Antal kapitel eller..." = Number of chapters or articles
- "Totalt" = Total
- "Indexerade" = Indexed. Indexing means writing a table of contents that lists every chapter or article of the volume, and indicates which facsimile pages hold text that belongs to each chapter. Three dots ("...") at the end of a line in the table of contents indicate that more articles remain to be indexed. When a new volume is scanned, the first version of the table of contents contains a single line that says "sidor ..." ("pages ...") and points to every scanned page. For encyclopedic works, which take a long time to index completely, a rough next version of the table of contents breaks this down into sections of five or ten pages, each ending in "...". As detailed indexing advances, more and more pages are freed from the "...", and are counted as fully indexed.
- "OCR-tolkade" = OCRed. OCR means optical character recognition, the process that converts a scanned facsimile image of a printed page into a text file that can be edited (proofread). We are using the ABBYY FineReader 6.0 OCR software, which gives excellent results. For works printed in "Fraktur" (currently only "ankjaer"), the results are less accurate, but still useful.
- "Korrekturlästa" = Proofread. Two columns carry this heading, one for the number of pages, and one for the number of chapters. A chapter counts as proofread only when all pages that belong to it have been proofread.
- "Klara som HTML" = With full HTML markup. Currently, we only allow our editors to add HTML markup to the proofread pages of a chapter.
- "Kvar att korrekt..." = Remaining to be proofread, meaning chapters where some pages are not yet proofread.
- "TOTALT 99 volymer" = Total 99 volumes, a summary line at the bottom of each table.
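The "Indexerade" count described above can be illustrated with a short sketch. The real table-of-contents file format is not shown in this letter, so the representation here (a list of title and page-range pairs) and the function name `count_indexed` are assumptions for the example only. A page counts as indexed when no entry ending in "..." still points to it.

```python
def count_indexed(toc):
    """Return (indexed_pages, total_pages) for a table of contents.

    `toc` is a hypothetical structure: a list of (title, pages) pairs,
    where `pages` is any iterable of facsimile page numbers. Entries
    whose title ends in "..." are treated as not yet fully indexed.
    """
    all_pages = set()
    open_pages = set()
    for title, pages in toc:
        pages = set(pages)
        all_pages |= pages
        if title.rstrip().endswith("..."):
            open_pages |= pages
    return len(all_pages - open_pages), len(all_pages)


# A freshly scanned volume: one line pointing to every page.
fresh = [("sidor ...", range(1, 301))]

# A partly indexed volume: two finished chapters and one open section.
partial = [
    ("Förord", range(1, 5)),
    ("A ...", range(5, 15)),
    ("Kapitel 1", range(15, 21)),
]
```

With these inputs, `count_indexed(fresh)` reports no indexed pages out of 300, and `count_indexed(partial)` reports 10 indexed pages out of 20.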
At the end of the overview is a summary. At the time of writing, its bottom line says there are 227 volumes (some of our published works consist of multiple volumes, each presented as a row in one of the tables), comprising 115,484 facsimile pages, of which 112,143 have OCR text that you can proofread over the web. So far, 13,774 (or 12%) of them have been proofread. We used to erase the original OCR text after HTML markup had been added, so the volumes in the top two tables have low numbers for OCR pages and proofread pages.
Some 40,839 pages (35% of all facsimile pages we have ever published) are incompletely indexed. We need a lot of help with indexing the volumes of the "Nordisk familjebok" encyclopedia, and now also with Ankjær's Danish "Geografisk-Statistisk Haandbog".
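For anyone who wants to check the summary figures, the arithmetic can be reproduced in a few lines. All numbers are taken from this letter; the variable names and the English table labels are made up for the example, and the proofread share is computed against the pages that have OCR text.

```python
# Volumes per status table, as listed under "Table headings".
table_sizes = {
    "done": 12,
    "fully proofread": 15,
    "almost completely proofread": 5,
    "incompletely proofread": 122,
    "incompletely indexed": 73,
}
total_volumes = sum(table_sizes.values())

all_pages = 115_484         # facsimile pages published
ocr_pages = 112_143         # pages with OCR text
proofread_pages = 13_774    # pages proofread so far
unindexed_pages = 40_839    # pages in incompletely indexed volumes


def percent(part, whole):
    """Share of `part` in `whole`, rounded to a whole percent."""
    return round(100 * part / whole)


proofread_share = percent(proofread_pages, ocr_pages)
unindexed_share = percent(unindexed_pages, all_pages)
print(total_volumes, proofread_share, unindexed_share)
```

The five tables add up to the 227 volumes in the summary line, and the two shares round to the 12% and 35% quoted above.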