Another online PDF to Word conversion service — this time with OCR included

After the not-so-great results I obtained with free online OCR services for PDF files (the main problem being that most services do not perform OCR at all: they simply convert editable PDF text to Word and ignore text embedded as graphics), I may have found a service that actually delivers on this promise: OnlineOCR.net. From the site’s own description:

OnlineOCR.net is a web-based Optical Character Recognition (OCR) service that allows you to convert scanned images and documents into editable Word, Text, Excel, PDF, Html output formats.

A couple of minor caveats
  • You need to get a (free) account if you want to convert PDF>DOC
  • The activation email I received ended up in my Gmail spam. So you may want to check your Spam folder if you think you have not received the activation message.
Testing the system

I did a test with a two-page PDF file containing editable text in fancy formatting on page 1 and text pasted in as lo-res graphics on page 2.

The first thing that you’ll notice when uploading your first document is the language choice: this is a good sign, as it suggests that the service compares the scanned text to a language-specific wordlist to correct any errors.
Options allow you to specify that you are uploading a multi-page document and which pages you want to convert.
After the document has been processed (which took about 30 seconds in my test), you are taken to the “Workspace”, where a list of all processed documents is available. From there you just need to click on the link of your converted document to download it.
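
To illustrate why the language choice matters, here is a minimal sketch of wordlist-based correction. It is not how OnlineOCR.net works internally (the site does not document that); the `correct_tokens` function, the tiny `english` wordlist and the 0.8 similarity cutoff are all assumptions made for this example. It only uses Python's standard `difflib` module.

```python
import difflib

def correct_tokens(tokens, wordlist, cutoff=0.8):
    """Replace tokens missing from the wordlist with their closest entry, if one is similar enough."""
    vocab = set(wordlist)
    corrected = []
    for token in tokens:
        word = token.lower()
        if word in vocab:
            # Known word: keep it untouched.
            corrected.append(token)
            continue
        # Unknown word: look for the single most similar wordlist entry.
        matches = difflib.get_close_matches(word, wordlist, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected

# Hypothetical, very small English wordlist for the example.
english = ["the", "service", "will", "compare", "scanned", "text"]

# "serv1ce" and "scanncd" imitate typical OCR misreadings (i -> 1, e -> c).
print(correct_tokens(["the", "serv1ce", "scanncd", "text"], english))
```

With a wordlist for the wrong language, the same misread tokens would either stay uncorrected or be pulled towards unrelated words, which is why choosing the right language before uploading is worthwhile.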

Results

The system worked fairly well with my test document. Page 1 was rendered without any spelling errors and this confirms my impression that the editable text contained in the PDF is preserved without running it through OCR, which is great. The system has added frames, section breaks and tables in order to render the “fancy” multi-column formatting of the source PDF file.

Page 2 of the DOC file, which contained the graphic text, was rendered with some errors. This was low-resolution text, and you might obtain better results with higher-quality embedded graphic text. In this case, too, the formatting was rendered by inserting tables and section breaks.

One immediately noticeable advantage is that OnlineOCR does a rather good job of preserving the original’s formatting, and does so without adding superfluous carriage returns, which are such a nuisance for translators since they disrupt the sentence-by-sentence sequence used by most CAT tools.
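
For output from converters that do insert those extra carriage returns, a small clean-up pass before importing the text into a CAT tool can help. The following is only a sketch under simple assumptions: it works on plain text, treats a blank line as a real paragraph break, and treats every other line break as superfluous. The `unwrap_hard_breaks` name is made up for this example.

```python
import re

def unwrap_hard_breaks(text):
    """Join the lines inside each paragraph; keep blank lines as paragraph breaks."""
    # A blank line (possibly containing whitespace) separates paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for paragraph in paragraphs:
        # Collapse the hard-wrapped lines of one paragraph into a single line.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        cleaned.append(" ".join(lines))
    return "\n\n".join(cleaned)

sample = "This sentence was\nsplit across lines.\n\nNew paragraph."
print(unwrap_hard_breaks(sample))
```

Once the breaks inside a paragraph are gone, the segmentation rules of the CAT tool see whole sentences again instead of fragments.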

Verdict

I could not find any information on the website that would indicate a payment plan for this service, so I would assume it’s offered for free. Considering the price, I think that this system is well worth a try if you need to convert a PDF file into an editable format. If the PDF document only (or mainly) contains editable text, you will be pleased by the results. If the file also contains text that has been pasted as graphic pages, the output will likely require some post-editing, but I think that will be comparable to what you may obtain with the majority of commercial OCR packages.

OnlineOCR.net

Published by Roberto Savelli

English to Italian translator, translation technology enthusiast. http://www.albatrossolutions.com

Join the Conversation

5 Comments

  1. I just tried it, and it would be great, but it is only “free to try” … first five pages. At this time there is clearly a payment method you must use for anything beyond your trial of 5 pages. You buy blocks of pages … 10c/pg for 30 pages ($3), 5c/pg for 200 pages ($20). Good service, but definitely NOT “free”.

    Thanks for your review.

