About this blog

Translator's Shack is a collection of links, news, reviews and opinions about translation technologies. It's edited and updated by Roberto Savelli, an English to Italian translator, project manager and company owner of Albatros Soluzioni Linguistiche, a team of English-Italian translators, which hosts and supports this blog.

The Life as a PM category, managed by Gabriella Ascari, contains topics that are less technical in nature, but which we're sure will be appreciated by owners of small translation businesses and freelancers.


Another online PDF to Word conversion service — this time with OCR included

After the not-so-great results I obtained with free online OCR services for PDF files (the main problem being that most services do no OCR at all: they just convert editable PDF text to Word and ignore text embedded as graphics), I may have found a service that actually delivers on its promise: OnlineOCR.net. From the site’s own description:

OnlineOCR.net is a web-based Optical Character Recognition (OCR) service that allows you to convert scanned images and documents into editable Word, Text, Excel, PDF, Html output formats.

A couple of minor caveats
  • You need to create a (free) account if you want to convert PDF to DOC.
  • The activation email I received ended up in my Gmail spam, so check your spam folder if the activation message does not seem to arrive.

Testing the system

I did a test with a two-page PDF file containing editable text in fancy formatting on page 1 and text pasted in as lo-res graphics on page 2.

The first thing that you’ll notice when uploading your first document is the language choice: this is very positive, as it means that the service will compare the scanned text to a language-specific wordlist to correct any errors.

The options let you specify that you are uploading a multi-page document and which pages you want to convert.

After the document has been processed (which took about 30 seconds in my test), you are taken to the “Workspace”, where a list of all processed documents is available. From there you just need to click on the link of your converted document to download it.

Results

The system worked fairly well with my test document. Page 1 was rendered without any spelling errors and this confirms my impression that the editable text contained in the PDF is preserved without running it through OCR, which is great. The system has added frames, section breaks and tables in order to render the “fancy” multi-column formatting of the source PDF file.

Page 2 of the DOC file, which contained the graphic text, was rendered with some errors. This was low-resolution text, and you might obtain better results with better-quality embedded graphic text. In this case, too, the formatting was rendered by inserting tables and section breaks.

One immediately noticeable advantage is that OnlineOCR does a rather good job of preserving the original’s formatting, and does so without adding superfluous carriage returns. These are a real nuisance for translators, since they disrupt the sentence-by-sentence segmentation used by most CAT tools.
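For converters that do leave hard line breaks in the middle of sentences, a small clean-up script can help before the text goes into a CAT tool. The following is only a minimal sketch, not something OnlineOCR provides: the function name unwrap_hard_breaks is hypothetical, and it assumes the text has been exported as plain text with blank lines separating real paragraphs.

```python
import re


def unwrap_hard_breaks(text: str) -> str:
    """Join lines that were broken mid-paragraph.

    Assumption: a blank line marks a genuine paragraph break, and any
    single line break inside a paragraph is superfluous.
    """
    # Split on blank lines (possibly containing stray whitespace).
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for para in paragraphs:
        # Drop empty fragments and re-join the rest with single spaces.
        lines = [line.strip() for line in para.splitlines() if line.strip()]
        joined.append(" ".join(lines))
    return "\n\n".join(joined)


sample = "This sentence was\nbroken over two lines.\n\nSecond\nparagraph."
cleaned = unwrap_hard_breaks(sample)
```

Note that this sketch does not repair words hyphenated across line breaks, and it would also merge lines that are meant to stay separate (addresses, list items), so the result still needs a quick visual check.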

Verdict

I could not find any information on the website that would indicate a payment plan for this service, so I would assume it’s offered for free. Considering the price, I think that this system is well worth a try if you need to convert a PDF file into an editable format. If the PDF document only (or mainly) contains editable text, you will be pleased by the results. If the file also contains text that has been pasted as graphic pages, the output will likely require some post-editing, but I think that will be comparable to what you may obtain with the majority of commercial OCR packages.

OnlineOCR.net
