Dec 7th, 2011, 8:05 pm
Hello, I've had a dream for a long while of scanning some out-of-print books in my collection to digital format. Currently I have ABBYY FineReader Pro installed and am using an Epson V30. Sadly, I'm more than a little confused as to how to get everything accomplished. Whenever I click on scan to Word doc, it runs the scanner, says it's processing the page, and sits there until I cancel it. Then I can hit 'Read' and it will OCR the page.

Is there any way I can just use this software to scan page after page to a word doc? Should I try saving it in a different format? Goal is to complete an ebook in mobi format. Appreciate any help I can get.

Thanks!
Dec 7th, 2011, 11:04 pm
don't use those quick tasks or whatever

stolen from another forum:

Doing the Actual Scan

The FIRST thing to do when starting a new project in FineReader is to go to File->Save FineReader document. Otherwise it will save scanned pages off into some temp location and cause you all sorts of grief.

Next go to Tools->Options and look at the Scan/Open tab
- Do not read and analyze acquired page images automatically
- Enable image preprocessing
- For most people, Split dual pages (I don't, but I often work with images that span pages)
- You should see your scanner listed in the 'Driver' dropdown. If you don't, either you don't have the drivers loaded or your scanner isn't connected

There are differing opinions on the best resolution and color mode to scan at. I prefer 300dpi grayscale; others like 600dpi black/white.
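To get a feel for what those two settings mean in raw data, here is a rough sketch of the arithmetic. The 6 x 9 inch page size is only an assumption for illustration, and the byte counts are for uncompressed pixels (8 bits per pixel for grayscale, 1 bit per pixel for black/white), not for whatever file format your scanner software actually writes.

```python
def scan_size(width_in, height_in, dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to whole bytes.
    return width_px, height_px, (total_bits + 7) // 8


# Hypothetical 6 x 9 inch page (an assumption, measure your own book).
gray_300 = scan_size(6, 9, 300, 8)  # 300dpi, 8-bit grayscale
bw_600 = scan_size(6, 9, 600, 1)    # 600dpi, 1-bit black/white

print("300dpi grayscale:", gray_300)
print("600dpi black/white:", bw_600)
```

Under these assumptions the 600dpi black/white page has four times the pixels but half the raw bytes of the 300dpi grayscale page, so raw size alone doesn't settle the argument either way.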

It is important to press down HARD on the spine to eliminate the spine shadow as much as possible.

After a few test scans to make sure everything is working right, you can set it to scan multiple pages so you can just flip pages without hitting any buttons.

Depending on the speed of your scanner and the quality of the book, you can get anywhere from 2.5 to 10+ pages/minute. Most reasonable novels should scan in less than an hour, and most closer to 30 minutes.
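If you want a quick estimate for a particular book, the arithmetic is just pages divided by rate. A minimal sketch, where the 300-page count is a made-up example and the rates are the two ends of the range quoted above:

```python
def scan_minutes(book_pages, pages_per_minute):
    """Estimate scanning time in minutes for a book at a given scan rate."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return book_pages / pages_per_minute


# Hypothetical 300-page novel at the slow and fast ends of the range.
slow = scan_minutes(300, 2.5)
fast = scan_minutes(300, 10)
print(f"slow scanner: {slow:.0f} min, fast scanner: {fast:.0f} min")
```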

Congratulations, you just finished the easy part of converting a book to an ebook.


Optional: Image Prep

Page Flipping and Splitting - If you didn't get your options set right or you decided to manually split pages, now is the time.

Page -> Edit Page Image - There are options on the side to do different things to the image such as Rotate&Flip and Split. If all the pages need to be flipped, check 'Apply to all images'. If you want to manually split all pages, check 'Show next page after operation is finished' at the very bottom. Then as soon as you split one page it will take you to the next (hint: you can use the Enter key to accept a split so you never have to press any buttons and don't have to move your mouse back and forth, saving time)

Page Cropping - while not strictly necessary I like to do this for several reasons
1) it's neater
2) reduces OCR issues
3) when proofing it allows the text to be larger (easier to see) when zooming to the full page

Access Crop the same way (Page -> Edit Page Image). Drag around and resize the crop rectangle with the mouse until it just covers the text. Again be sure to check 'Show next page after operation is finished' and use the Enter key to accept a crop


Optional: Accents

If you're scanning a book with a lot of accented characters, you may want those recognized automatically.

(Of course if the book is completely in a foreign language, you should use that as the recognition language. This is for English with a smattering of accents thrown in)

Tools -> Language Editor

New -> Create a new language based on an existing one: 'English'

Language name: English with accents
Source language: English (United States)
Dictionary: Built-in dictionary
Alphabet -> click the [...] button at the end

Unicode subrange: Latin-1 Supplement

Drag select rows 2 and 3 and then individually click additional characters you don't want included (÷) or do want included (¡¿)

Click Ok and Ok and then on the left sidebar set 'Document Language' to 'English with accents'
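If you'd like to double check which characters you ended up with, here is a small sketch that lists the accented letters in the Latin-1 Supplement block using Python's unicodedata module. It has nothing to do with FineReader's own grid layout; the U+00C0 to U+00FF range and the two extra punctuation marks are my reading of the steps above, so treat it as a cross-check only.

```python
import unicodedata


def latin1_accented_letters():
    """Letters in U+00C0..U+00FF; skips the math signs (× and ÷)."""
    return [
        chr(cp)
        for cp in range(0xC0, 0x100)
        if unicodedata.category(chr(cp)).startswith("L")
    ]


letters = latin1_accented_letters()
# Extra punctuation mentioned above as worth including.
extras = ["¡", "¿"]
print("".join(letters + extras))
```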
Dec 7th, 2011, 11:09 pm
OCR

Start the OCR process by clicking the 'Read' button on the toolbar (or Document -> Read)

This will take a little bit of time as it analyzes each page and tries to convert the image to text.

(If you ever need/want to read just selected pages instead of all pages, you can select the desired pages in the thumbnail pane and right click -> Read Page or Page -> Read Page)

Proofing

This is the real meat of the operation and where you'll spend the majority of your time. It's also where we separate the pretenders from the contenders.

Screen Layout: You should be in the 'standard' 4 pane layout with thumbnails on the far left, page image in the top left, page text in the top right and zoom window in the bottom.

If you aren't, check that the following options are set in the menu:
View -> Pages Window -> Show Pages Window
View -> Pages Window -> Thumbnails
View -> Pages Window -> Left
View -> Images/Text Window -> Show Page Image and Page Text
View -> Images/Text Window -> Highlight uncertain characters
View -> Zoom Window -> Show Zoom Window
View -> Zoom Window -> Dock Bottom

Setup
On the 'Save' button on the toolbar is a small down triangle. Click it and select 'Save as HTM' EVEN IF YOU WANT A DIFFERENT FORMAT

Beside the 'Save' button is a dropdown box. Select 'Flexible layout'. If you don't see 'Flexible layout' it's because you didn't set the save button to HTM as directed in the previous step.

Underneath the dropdown box are 2 sets of toggle buttons.
- set Keep/Remove pictures as desired
- set the other one to 'Remove headers and footers'

If you go to a page with headers/footers, you should see an empty green box where the header is on the text pane. This shows you exactly what it is removing from that page.

Proofing

This isn't complex but it does require good concentration and is absolutely vital to producing a quality scan.

Read the book in the text window, constantly referring to the image window if you have a question. Correct any errors as you find them.

The blue highlights are characters that the program isn't certain of. The blue won't show up in the final save (unless you want it to).

Don't worry so much about formatting, just go for textual correctness.

IF YOU NOTICE AN ERROR THAT WOULD PASS SPELLCHECK (for instance 'the' sometimes gets recognized as 'die') be sure to make a separate note of it so that when you are finished you can do a final check to see if the same error occurred in any other places that you missed when proofing.
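Once the text is saved, that final check can be done with a small script instead of by eye. This is only a sketch: `find_suspects` is a name I made up, and the suspect list is whatever you wrote down in your notes. It matches whole words only, so 'die' won't flag 'died'.

```python
import re


def find_suspects(text, suspects, context=30):
    """Return (word, offset, snippet) for each whole-word suspect match."""
    if not suspects:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in suspects) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        # Flatten line breaks so each snippet prints on one line.
        hits.append((m.group(1), m.start(), text[start:end].replace("\n", " ")))
    return hits


sample = "He opened die door and the cat ran out. The die was cast."
hits = find_suspects(sample, ["die"])
for word, offset, snippet in hits:
    print(offset, repr(snippet))
```

Every hit still needs a human decision, since a word like 'die' is sometimes correct.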

It's just a good idea in general to keep a list of notes to yourself about special formatting (like smallcaps or lists or poetry or whatever) you want to revisit later to make sure it got formatted correctly.

After you have finished reading it, click the 'Check Spelling' button in the Text window or Tools -> Check Spelling

split words - FineReader is usually fairly intelligent about putting words back together that have been split at the end of a line. You can tell whether it will combine words by looking for the line-continuation character (like a dash except it has a 'tail' that drops down on the right side). If you see the line-continuation character you know that it will join the words when saving. If this is incorrect, simply replace it with a regular dash. If you don't see a line-continuation character where there should be one, you will have to manually join the word yourself.

FineReader will NEVER join words across page breaks, so if a word is split across pages, be sure to move it entirely to one side or the other.

Congrats, you've now completed the most difficult part!


Save

This is for html which is what I use, but other formats are similar.

To save . . . click the 'Save' button.

If you get a message like 'Some of the pages have not been recognized and cannot be saved', that means you skipped reading (OCRing) some of the pages. You can either go back and read those pages or just select the read pages in the thumbnail window and right click -> Save Selected Pages As -> HTML Document

On the bottom right of the Save dialog is an 'Options. . .' button, click that.

You should see a tab strip of all the different formats (RTF/DOC/DOCX, XLS/XLSX, PDF, HTML, etc) that is currently on the HTML tab.

Set retain layout to 'Formatted text' and Encoding to 'Unicode (UTF-8)'. Every other option should be UNchecked (including 'Use CSS'). The only exception is if you want to include pictures at this point. I process them separately, but whatever you want.

Click 'OK' to close the options window and then 'Save' to create your file.

The main annoyance here is that setting retain layout to 'Formatted text' affects the view back in the main FineReader window. When proofing you want it on 'Flexible layout' so you can easily match the text to the image. But when saving you want 'Formatted text' because you don't want to preserve linebreaks and stuff like that. So once you finish saving, be sure to go and switch the view back to 'Flexible layout'.


Polishing - General

These steps apply to any format you save in. I'll post some html specific tips later.

1. Go through your notes - Check all spots that need special formatting and check for any OCR errors you noted that might evade spellcheck

2. Check for common OCR issues

Search for instances of '1' and other special characters like:
< > * / \ # @ | ^
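A rough sketch of that search as a script, assuming you run it on plain text. On raw html it will flag every tag because of the < > / characters, so strip the tags or use a text export first. The function name is made up for the example.

```python
import re

# '1' plus the special characters listed above.
SUSPECT = re.compile(r"[1<>*/\\#@|^]")


def suspicious_lines(text):
    """Return (line_number, characters_found, line) for lines worth a look."""
    report = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        found = sorted(set(SUSPECT.findall(line)))
        if found:
            report.append((lineno, found, line.strip()))
    return report


sample = "I went home.\n1 went home ^ too.\nAll fine"
report = suspicious_lines(sample)
for lineno, found, line in report:
    print(lineno, found, line)
```

Legitimate digits and symbols will show up too, so this only narrows down where to look.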


3. Spellcheck in Word. Yes, you spellchecked in FineReader which is good, but Word has some different/better proofing tools which will catch things you missed.

DON'T EDIT HTML DIRECTLY IN WORD! It creates an unholy mess.

Instead create a copy of the html file and open the COPY in Word. Then use the html editor of your choice to manually fix errors in the original as you find them in Word.
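Making the copy is a one-liner in any file manager, but if you do this often a tiny helper keeps you from ever opening the original by mistake. A sketch only; the `_wordcopy` suffix and the function name are my own invention.

```python
import shutil
from pathlib import Path


def make_word_copy(original):
    """Copy book.htm to book_wordcopy.htm next to it and return the new path."""
    src = Path(original)
    dst = src.with_name(src.stem + "_wordcopy" + src.suffix)
    if dst.exists():
        # Refuse to overwrite an earlier copy.
        raise FileExistsError(dst)
    shutil.copy2(src, dst)
    return dst


# Usage (hypothetical file name): make_word_copy("mybook.htm")
```

Open only the returned file in Word and keep fixing the original in your html editor.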

4. Paragraph check

This is a simple check that can make a big difference in the quality of your final output. On the left half of your screen put FineReader with basically just the image and thumbnail pane showing. On the right half of the screen open your document. Scroll through both checking that the paragraphs match exactly. FineReader will sometimes mess up where it puts paragraph breaks, especially at page boundaries where it likes to either unnecessarily split paragraphs or incorrectly join 2 separate paragraphs.

While this is mainly for paragraphs, keep your eyes open. I never fail to catch at least one other issue while going through it, whether it's related to special formatting or something else.

This will also make sure all section breaks show up.


5. Hyphen check

This is completely optional, but I like to find every instance of a hyphen (-) to check
a) if they are words that should be joined
b) if they should actually be em-dashes
c) if they are just stray OCR marks
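Here is a rough sketch of a script that lists every hyphen with a guess at which of the three cases it is. The rules are my own simple heuristics (letters on both sides or a hyphen at the end of a line suggests a word to join, a spaced or doubled hyphen suggests a dash, anything else looks stray), so treat the labels as hints and decide each one yourself.

```python
import re


def classify_hyphens(text, context=15):
    """Return (guess, offset, snippet) for every run of hyphens in text."""
    results = []
    for m in re.finditer(r"-+", text):
        before = text[m.start() - 1] if m.start() > 0 else " "
        after = text[m.end()] if m.end() < len(text) else " "
        if before.isalpha() and (after.isalpha() or after == "\n"):
            guess = "joined word?"
        elif len(m.group()) > 1 or (before.isspace() and after.isspace()):
            guess = "dash?"
        else:
            guess = "stray?"
        snippet = text[max(0, m.start() - context):m.end() + context]
        results.append((guess, m.start(), snippet.replace("\n", " ")))
    return results


sample = "The re-port was late - very late. He said -no."
results = classify_hyphens(sample)
for guess, offset, snippet in results:
    print(f"{guess:13} {offset:4} {snippet!r}")
```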

6. Final read through

This is another optional check and we're certainly reaching the point of diminishing returns, but I like to do a read through of the document in its final format to make sure I didn't miss anything.
Dec 8th, 2011, 10:29 am
rueyew wrote: stolen from another forum:
You mean, 'shared from another forum' ? :lol:

Awesome sharing!

Though I may never be able to test this (lack of time, impatient attitude and so on), I really appreciate your effort! Thanks, dude! :D
Dec 9th, 2011, 4:36 am
I've always guessed that scanning a book would be a lot of work, but I really had no idea how much work it actually is, which is why I always say a mental "thank you" to the people who take the time to do it. Especially 10+ years ago, when the tech wasn't even close to what it is now. Even today, that's still more than a few hours of work.
Dec 9th, 2011, 5:27 am
well that makes it sound a lot more complicated than it actually is because it's trying to cover all the contingencies of what can go wrong.

Normally the scanning's pretty straightforward then you can spend as much or as little time proofing as you want.
Dec 9th, 2011, 7:45 am
Yeah, but the proof IS in the pudding.
Dec 12th, 2011, 3:04 am
rueyew, thanks so very much! I lost track of this post until today. I really appreciate the tips and will try them the next chance I have to sit down.
Dec 13th, 2011, 5:43 pm
zackddog wrote: Yeah, but the proof IS in the pudding.


:lol: The proof is in EATING the pudding!

:lol: