OCR
Start the OCR process by clicking the 'Read' button on the toolbar (or Document -> Read).
This will take a little bit of time as it analyzes each page and tries to convert the image to text.
(If you ever need/want to read just selected pages instead of all pages, you can select the desired pages in the thumbnail pane and right click -> Read Page or Page -> Read Page)
Proofing
This is the real meat of the operation and where you'll spend the majority of your time. It's also where we separate the pretenders from the contenders.
Screen Layout: You should be in the 'standard' 4 pane layout with thumbnails on the far left, page image in the top left, page text in the top right and zoom window in the bottom.
If you aren't, check that the following options are set in the menu:
View -> Pages Window -> Show Pages Window
View -> Pages Window -> Thumbnails
View -> Pages Window -> Left
View -> Images/Text Window -> Show Page Image and Page Text
View -> Images/Text Window -> Highlight uncertain characters
View -> Zoom Window -> Show Zoom Window
View -> Zoom Window -> Dock Bottom
Setup
On the 'Save' button on the toolbar is a small down triangle. Click it and select 'Save as HTM', EVEN IF YOU WANT A DIFFERENT FORMAT.
Beside the 'Save' button is a dropdown box. Select 'Flexible layout'. If you don't see 'Flexible layout' it's because you didn't set the save button to HTM as directed in the previous step.
Underneath the dropdown box are 2 sets of toggle buttons.
- set Keep/Remove pictures as desired
- set the other one to 'Remove headers and footers'
If you go to a page with headers/footers, you should see an empty green box where the header is on the text pane. This shows you exactly what it is removing from that page.
Proofing
This isn't complex but it does require good concentration and is absolutely vital to producing a quality scan.
Read the book in the text window, constantly referring to the image window if you have a question. Correct any errors as you find them.
The blue highlights are characters that the program isn't certain of. The blue won't show up in the final save (unless you want it to).
Don't worry so much about formatting, just go for textual correctness.
IF YOU NOTICE AN ERROR THAT WOULD PASS SPELLCHECK (for instance 'the' sometimes gets recognized as 'die') be sure to make a separate note of it so that when you are finished you can do a final check to see if the same error occurred in any other places that you missed when proofing.
It's just a good idea in general to keep a list of notes to yourself about special formatting (like smallcaps or lists or poetry or whatever) you want to revisit later to make sure it got formatted correctly.
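If you end up with the saved text in a plain file, the "final check" for errors that pass spellcheck can be partly automated. Below is a rough sketch in Python, not anything FineReader provides: the function name `find_suspects` and the sample sentence are made up for illustration, and it assumes you have kept your noted words (like 'die') in a simple list. It only finds candidates; you still have to look at each one and decide whether it is a real word or a misread.

```python
import re


def find_suspects(text, suspects, width=30):
    """Return (word, offset, context) for each whole-word match of a noted suspect.

    `suspects` is your own list of words that pass spellcheck but may be
    OCR misreads (for example 'die' where the book says 'the').
    """
    if not suspects:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in suspects) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        # Collapse whitespace so the context prints on one line.
        context = " ".join(text[start:end].split())
        hits.append((m.group(1), m.start(), context))
    return hits


if __name__ == "__main__":
    sample = "He opened die door. They would rather die than yield."
    for word, offset, context in find_suspects(sample, ["die"]):
        print(f"{offset:>6}  {word}: ...{context}...")
```

Whole-word matching keeps 'die' from also flagging 'diet' or 'soldier', which cuts down the list you have to review by hand.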
After you have finished reading it, click the 'Check Spelling' button in the Text window or Tools -> Check Spelling
split words - FineReader is usually fairly intelligent about putting words back together that have been split at the end of a line. You can tell whether it will combine words by looking for the line-continuation character (like a dash except it has a 'tail' that drops down on the right side). If you see the line-continuation character you know that it will join the words when saving. If this is incorrect, simply replace it with a regular dash. If you don't see a line-continuation character where there should be one, you will have to manually join the word yourself.
FineReader will NEVER join words across page breaks, so if a word is split across pages, be sure to move it entirely to one side or the other.
Congrats, you've now completed the most difficult part!
Save
This is for html which is what I use, but other formats are similar.
To save . . . click the 'Save' button.
If you get a message like 'Some of the pages have not been recognized and cannot be saved', that means you skipped reading (OCRing) some of the pages. You can either go back and read those pages or just select the read pages in the thumbnail window and right click -> Save Selected Pages As -> HTML Document
On the bottom right of the Save dialog is an 'Options. . .' button, click that.
You should see a tab strip of all the different formats (RTF/DOC/DOCX, XLS/XLSX, PDF, HTML, etc) that is currently on the HTML tab.
Set retain layout to 'Formatted text' and Encoding to 'Unicode (UTF-8)'. Every other option should be UNchecked (including 'Use CSS'). The only exception is if you want to include pictures at this point. I process them separately, but whatever you want.
Click 'OK' to close the options window and then 'Save' to create your file.
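If you want to double-check that the file really came out as UTF-8, a quick sanity test is to try decoding its bytes. This is only a sketch: `check_utf8` is a hypothetical helper name, and the file name in the usage line is a placeholder for wherever you saved your book. Note that a file containing nothing but plain ASCII will always pass, so a pass only tells you that nothing in the file contradicts UTF-8.

```python
from pathlib import Path


def check_utf8(path):
    """Return (ok, detail); ok is True when the file's bytes decode as UTF-8."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first undecodable byte.
        return False, f"invalid byte at offset {err.start}"
    return True, f"{len(data)} bytes decoded as UTF-8"


if __name__ == "__main__":
    import sys

    # Usage (placeholder file name): python check_utf8.py mybook.htm
    for name in sys.argv[1:]:
        ok, detail = check_utf8(name)
        print(f"{name}: {'OK' if ok else 'NOT UTF-8'} ({detail})")
```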
The main annoyance here is that setting retain layout to 'Formatted text' affects the view back in the main FineReader window. When proofing you want it on 'Flexible layout' so you can easily match the text to the image. But when saving you want 'Formatted text' because you don't want to preserve linebreaks and stuff like that. So once you finish saving, be sure to go and switch the view back to 'Flexible layout'.
Polishing - General
These steps apply to any format you save in. I'll post some html specific tips later.
1. Go through your notes - Check all spots that need special formatting and check for any OCR errors you noted that might evade spellcheck.
2. Check for common OCR issues
Search for instances of '1' and other special characters like
3. Spellcheck in Word. Yes, you spellchecked in FineReader which is good, but Word has some different/better proofing tools which will catch things you missed.
DON'T EDIT HTML DIRECTLY IN WORD! It creates an unholy mess.
Instead create a copy of the html file and open the COPY in Word. Then use the html editor of your choice to manually fix errors in the original as you find them in Word.
4. Paragraph check
This is a simple check that can make a big difference in the quality of your final output. On the left half of your screen put FineReader with basically just the image and thumbnail pane showing. On the right half of the screen open your document. Scroll through both, checking that the paragraphs match exactly. FineReader will sometimes mess up where it puts paragraph breaks, especially at page boundaries where it likes to either unnecessarily split paragraphs or incorrectly join 2 separate paragraphs.
While this is mainly for paragraphs, keep your eyes open. I never fail to catch at least one other issue while going through it, whether it's related to special formatting or something else.
This will also make sure all section breaks show up
5. Hyphen check
This is completely optional, but I like to find every instance of a hyphen (-) and check whether
a) it splits a word that should be joined
b) it should actually be an em-dash
c) it is just a stray OCR mark
6. Final read through
This is another optional check and we're certainly reaching the point of diminishing returns, but I like to do a read through of the document in its final format to make sure I didn't miss anything.
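For the hyphen check in step 5, a small script can pre-sort the hyphens so you know what to look at first. The sketch below is my own rough heuristic, not a FineReader feature: the function name `classify_hyphens` and the labels are made up, and the rules (spaces on both sides or a doubled hyphen suggests an em-dash, a hyphen followed by a space and a lowercase letter suggests a word that never got joined) are assumptions that will misfire sometimes. It also treats the input as plain text, so if you run it on raw html it will flag hyphens inside the markup too.

```python
import re


def classify_hyphens(text, width=20):
    """Return (offset, label, context) for every run of hyphens in text."""
    results = []
    for m in re.finditer(r"-+", text):
        before = text[m.start() - 1] if m.start() > 0 else " "
        after = text[m.end()] if m.end() < len(text) else " "
        if len(m.group()) > 1 or (before.isspace() and after.isspace()):
            # "--" or " - " is often a dash that should be an em-dash.
            label = "em-dash?"
        elif before.isalpha() and after.isspace():
            # "exam- ple": lowercase after the gap hints at an unjoined word.
            rest = text[m.end():].lstrip()
            label = "split word?" if rest[:1].islower() else "stray?"
        elif before.isalnum() and after.isalnum():
            # "well-known": an ordinary hyphenated compound, usually fine.
            label = "compound"
        else:
            label = "stray?"
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        context = " ".join(text[start:end].split())
        results.append((m.start(), label, context))
    return results


if __name__ == "__main__":
    sample = "A well-known exam- ple - or so -- it -seems"
    for offset, label, context in classify_hyphens(sample):
        print(f"{offset:>6}  {label:<12} ...{context}...")
```

Everything labeled 'compound' can usually be skimmed quickly; the entries with a question mark are the ones worth comparing against the page image.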