On 15-Jan-05, Hans Persson unicorn@lysator.liu.se wrote:
At URL:http://runeberg.org/upload.pl?mode=ocrlist there is a list of the works we have that currently lack OCR text. The page has links for downloading all the images of a work, and other links for uploading OCR files for a work. Note that you need broadband to download the image files, because the files involved are fairly large. A "(download)" link has also been added to the footer of every page within a work, and the corresponding links can be found through it as well.
Now we in the editorial team can hopefully let the rest of you handle part of the OCR work, while we ourselves scan even more new works or write new features.
One possible problem which this procedure does not address, I think, is what in a database is sometimes called the updating anomaly. Suppose Jörg Vetenskaper and Frederik Pedant both happen to download the same text for OCR conversion, and therefore duplicate the work. It may seem unlikely, but if unnecessary duplication of effort can be avoided, it would be best to do so.
Is it possible to add a mechanism to the http://runeberg.org/upload.pl?mode=ocrlist page that records which files have been "checked out" for OCR conversion, so that no one else will download the same work unnecessarily?
Another point might be to somehow record the name and email address of the person who downloads a ZIP file of images, and then have a method to automatically send that person a friendly email periodically, say every fortnight, requesting a progress report, until such time as the corresponding OCR files are eventually uploaded.
Maybe it's a bit of trouble, but I think that, if I had the appropriate OCR software to do this sort of work, I would be reluctant to undertake it, knowing that another person was doing the same scan conversion at the same time.
Best regards to the directors and to all the wonderful volunteers with Project Runeberg.
Erik Bjørn Pedersen
Victoria, BC, Canada
On Sun 2005-01-16 at 10.04, Erik B. Pedersen wrote:
One possible problem which this procedure does not address, I think, is what in a database is sometimes called the updating anomaly. Suppose Jörg Vetenskaper and Frederik Pedant both happen to download the same text for OCR conversion, and therefore duplicate the work. It may seem unlikely, but if unnecessary duplication of effort can be avoided, it would be best to do so.
Is it possible to add a mechanism to the http://runeberg.org/upload.pl?mode=ocrlist page that records which files have been "checked out" for OCR conversion, so that no one else will download the same work unnecessarily?
I'm considering this. I don't want to lock the work outright (everyone that wants to should be able to download the images), but perhaps there should be some kind of mechanism to state when you are downloading the images that you plan to run OCR, and if someone does this for a work, subsequent downloaders are told this for a certain period after that.
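A rough sketch of what such a non-locking notice could look like, in Python. Nothing here reflects code that exists in Project Runeberg: the class name, the method names and the 14-day notice period are all assumptions made for illustration (the message above only says "a certain period"). Downloads are never blocked; the registry only answers the question "has anyone recently said they plan to OCR this work?"

```python
from datetime import datetime, timedelta

# Assumed value: the message only says "a certain period".
NOTICE_PERIOD = timedelta(days=14)


class IntentRegistry:
    """Hypothetical record of who has said they plan to OCR a work."""

    def __init__(self):
        # work_id -> list of (volunteer name, time the intent was declared)
        self._intents = {}

    def declare(self, work_id, name, now):
        """Record that `name` plans to run OCR on `work_id`."""
        self._intents.setdefault(work_id, []).append((name, now))

    def active_notices(self, work_id, now):
        """Names to show to later downloaders; old declarations drop out."""
        return [
            name
            for name, declared_at in self._intents.get(work_id, [])
            if now - declared_at <= NOTICE_PERIOD
        ]


registry = IntentRegistry()
start = datetime(2005, 1, 16)
registry.declare("some-work", "Jörg Vetenskaper", start)
# A later downloader is told about the declaration, but can still download.
print(registry.active_notices("some-work", start + timedelta(days=3)))
```

Since nothing is locked, two people can still end up doing the same OCR; the notice only makes that an informed choice.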
Another point might be to somehow record the name and email address of the person who downloads a ZIP file of images, and then have a method to automatically send that person a friendly email periodically, say every fortnight, requesting a progress report, until such time as the corresponding OCR files are eventually uploaded.
I can imagine people downloading the tiff images for other purposes than doing OCR (getting at high-resolution illustrations, perhaps), so I don't think this is a good idea.
Maybe it's a bit of trouble, but I think that, if I had the appropriate OCR software to do this sort of work, I would be reluctant to undertake it, knowing that another person was doing the same scan conversion at the same time.
I don't think this will turn out to be a problem in reality, but we'll see. For now, I'll let this be as it is for a little while and complete another related feature.
I'll get back to this feature in a little while, and hopefully it will then have seen some use so that I have some actual use cases to evaluate it with.
Hans
Hans Persson wrote:
I can imagine people downloading the tiff images for other purposes than doing OCR (getting at high-resolution illustrations, perhaps), so I don't think this is a good idea.
Yes, volunteering for OCR should be kept separate from downloading the TIFFs. The two are not necessarily connected.
Erik B. Pedersen wrote:
One possible problem which this procedure does not address, I think, is what in a database is sometimes called the updating anomaly. Suppose Jörg Vetenskaper and Frederik Pedant both happen to download the same text for OCR conversion, and therefore duplicate the work. It may seem unlikely, but if unnecessary duplication of effort can be avoided, it would be best to do so.
You're right. One way to avoid this could be:
1. Register a user identity and log in. (To be implemented.)
2. Sign up for OCRing a work. Your user ID and the date-time is recorded as a volunteer for OCR of this work.
3. If you haven't uploaded the results within 7 days, a reminder e-mail is sent to you and after 14 days, your name is removed so another user can volunteer for OCRing of the same work.
This could of course be a generic mechanism for any kind of work that takes longer than proofreading a single page.
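The steps above could be sketched roughly as follows, in Python. This is only an illustration under stated assumptions, not an existing Project Runeberg feature: the names `Claim`, `ClaimBook`, `sign_up`, `upload_done` and `sweep` are hypothetical, login (step 1) is assumed to have already produced a `user_id`, and actually sending the e-mail is left to the caller. The `task` field is one way to make the mechanism generic for other kinds of longer jobs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

REMIND_AFTER = timedelta(days=7)   # reminder e-mail after 7 days
EXPIRE_AFTER = timedelta(days=14)  # volunteer's name removed after 14 days


@dataclass
class Claim:
    user_id: str
    claimed_at: datetime
    reminded: bool = False


class ClaimBook:
    """Hypothetical sign-up list: at most one volunteer per (work, task)."""

    def __init__(self):
        self.claims: Dict[Tuple[str, str], Claim] = {}

    def sign_up(self, work_id: str, user_id: str, now: datetime,
                task: str = "ocr") -> bool:
        """Step 2: record user ID and date-time; refuse if already claimed."""
        key = (work_id, task)
        if key in self.claims:
            return False
        self.claims[key] = Claim(user_id, now)
        return True

    def upload_done(self, work_id: str, task: str = "ocr") -> None:
        """Results were uploaded, so the claim is no longer needed."""
        self.claims.pop((work_id, task), None)

    def sweep(self, now: datetime) -> List[str]:
        """Step 3: return user IDs due a reminder and drop expired claims."""
        to_remind = []
        for key, claim in list(self.claims.items()):
            age = now - claim.claimed_at
            if age >= EXPIRE_AFTER:
                del self.claims[key]          # another user may now volunteer
            elif age >= REMIND_AFTER and not claim.reminded:
                claim.reminded = True         # remind only once
                to_remind.append(claim.user_id)
        return to_remind
```

A periodic job (once a day, say) would call `sweep` and mail each returned user. Whether downloading stays open to everyone is independent of this: the book only tracks who has volunteered.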
Right now this stops at point 1, which someone (Erik? Hans?) has to implement. We have been talking about this among ourselves for ages already. (A bad sign: I've bought this book on "Procrastination", but I haven't found time to read it yet.)
Best regards to the directors and to all the wonderful volunteers with Project Runeberg.
Thank you!
On 16-Jan-05, at 10:06 AM, Lars Aronsson wrote:
Suppose Jörg Vetenskaper and Frederik Pedant both happen to download the same text for OCR conversion, and therefore duplicate the work. -- Erik Pedersen
You're right. One way to avoid [the updating anomaly] could be:
- Register a user identity and log in. (To be implemented.)
- Sign up for OCRing a work. Your user ID and the date-time is recorded as a volunteer for OCR of this work.
- If you haven't uploaded the results within 7 days, a reminder e-mail is sent to you and after 14 days, your name is removed so another user can volunteer for OCRing of the same work.
This could of course be a generic mechanism for any kind of work that takes longer than proofreading a single page.
This seems to be a better solution than my initial suggestion to remind EVERY downloader, since, as Hans Persson pointed out, there may be other legitimate reasons to download an image file than a commitment to OCR-convert it.
Right now this stops at point 1, which someone (Erik? Hans?) has to implement. We have been talking about this among ourselves for ages already. (A bad sign: I've bought this book on "Procrastination", but I haven't found time to read it yet.)
Ha! That's funny . . .
Regards, Erik