About this blog

Translator's Shack is a collection of links, news, reviews and opinions about translation technologies. It's edited and updated by Roberto Savelli, an English to Italian translator, project manager and company owner of Albatros Soluzioni Linguistiche, a team of English-Italian translators, which hosts and supports this blog.


The Life as a PM category, managed by Gabriella Ascari, contains topics that are less technical in nature, but which we're sure will be appreciated by owners of small translation businesses and freelancers.




Free PDFUnlock! web service removes restrictions from PDF files [@PDFUnlock]

Reference material plays a very important part in most translation projects. We often receive reference files from our clients, and sometimes we have to find them ourselves through web searches or by browsing the client’s website.

The management and use of reference files is one aspect addressed by memoQ’s LiveDocs feature, which lets you create searchable corpora of monolingual source and target files. So it’s finally time to put all those reference PDFs to good use! But wait, there’s a catch…

Very often, publishers put locks on PDF files for various reasons, e.g. intellectual property protection, forced consistency by preventing unwanted changes, etc. Here is an example of the possible locks that can be applied to a PDF file (in this case the file is completely unlocked):

[Image: security settings of a completely unlocked PDF file]

Today we needed to unlock a few PDF files in order to use them in LiveDocs. While looking for a possible solution, I came across the PDFUnlock! web service. It’s very simple to use: you upload a locked PDF file and you immediately receive a link to download the unlocked file. Here are some features from the site’s description:

PDF files can be secured with restrictions that prevent you from for example copying text from them or editing, printing, merging or splitting them. PDFUnlock! can remove these restrictions (a.k.a “owner password”).

If a password is required to open the uploaded file, you will be asked to enter it (a.k.a “user password”). PDFUnlock! cannot, however, recover lost or unknown user passwords.

A PDF file can also be subject to non-standard encryption, such as DRM. PDFUnlock! does not remove such.

There is a further limitation: the maximum file size is 5 MB. And, of course, the rule of thumb that applies to all free, unencrypted, unprotected web services: do not send anything confidential for conversion.
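Before uploading, it can save time to check a file against these limits locally. The following is only a minimal sketch, not part of PDFUnlock!: the function name `precheck_pdf` is hypothetical, the 5 MB limit is assumed to mean 5 × 1024 × 1024 bytes, and looking for an `/Encrypt` entry is a rough heuristic for "this PDF has some kind of protection", not a full PDF parser.

```python
from pathlib import Path

# Assumption: the service's "5 MB" limit is counted as 5 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 5 * 1024 * 1024


def precheck_pdf(path):
    """Return (fits_limit, looks_protected) for a local PDF file.

    fits_limit      -- True if the file is no larger than the assumed limit.
    looks_protected -- True if the raw bytes contain an /Encrypt entry,
                       which protected PDFs normally reference in their
                       trailer. This is a heuristic, not a guarantee.
    """
    data = Path(path).read_bytes()
    fits_limit = len(data) <= MAX_UPLOAD_BYTES
    looks_protected = b"/Encrypt" in data
    return fits_limit, looks_protected
```

For example, `precheck_pdf("reference.pdf")` (a hypothetical file name) returns a pair of booleans. If the second value is `False`, the file probably has no restrictions and does not need unlocking at all.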

PDFUnlock!

Another online PDF to Word conversion service — this time with OCR included

After the not-so-great results I obtained with free online OCR services for PDF files (the main problem being that most services do not do OCR but just convert editable PDF text to Word and do not process text embedded as graphics), I may have found a service that actually delivers on this promise: OnlineOCR.net. From the site’s own description:

OnlineOCR.net is a web-based Optical Character Recognition (OCR) service that allows you to convert scanned images and documents into editable Word, Text, Excel, PDF, Html output formats.

A couple of minor caveats

  • You need to get a (free) account if you want to convert PDF to DOC.
  • The activation email I received ended up in my Gmail spam, so you may want to check your Spam folder if you think you have not received the activation message.

Testing the system

I did a test with a two-page PDF file containing editable text in fancy formatting on page 1 and text pasted in as lo-res graphics on page 2.

The first thing that you’ll notice when uploading your first document is the language choice: this is very positive, as it means that the service will compare the scanned text to a language-specific wordlist to correct any errors.

Options let you specify that you are uploading a multi-page document and which pages you want to convert.

After the document has been processed (which took about 30 seconds in my test), you are taken to the “Workspace”, where a list of all processed documents is available. From there you just need to click on the link of your converted document to download it.

Results

The system worked fairly well with my test document. Page 1 was rendered without any spelling errors and this confirms my impression that the editable text contained in the PDF is preserved without running it through OCR, which is great. The system has added frames, section breaks and tables in order to render the “fancy” multi-column formatting of the source PDF file.

Page 2 of the DOC file, which contained the graphic text, was rendered with some errors. This was low resolution text, and you might obtain better results if using better-quality embedded graphic text. In this case, too, the formatting was rendered by inserting tables and section breaks.

One immediately noticeable advantage is that OnlineOCR does a rather good job of preserving the original’s formatting, and does so without adding superfluous carriage returns, which are such a nuisance for translators since they disrupt the sentence-by-sentence sequence used by most CAT tools.
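When a converter does add those carriage returns, they can often be cleaned up before the text goes into a CAT tool. The snippet below is only a rough sketch of the idea and is not related to OnlineOCR: the function name `unwrap_hard_breaks` is hypothetical, and it assumes that a blank line marks a real paragraph break while a single line break inside a paragraph is superfluous.

```python
import re


def unwrap_hard_breaks(text):
    """Join the lines of each paragraph into one line.

    Blank lines are treated as paragraph breaks and kept;
    single line breaks inside a paragraph are replaced by a space.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for paragraph in paragraphs:
        # Drop empty fragments and trim stray spaces around each line.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        joined.append(" ".join(lines))
    return "\n\n".join(joined)
```

With this sketch, a sentence that was broken over two lines comes back as a single segment. Note that it does not rejoin words hyphenated at line ends, and it would flatten lists or tables that rely on single line breaks, so the result still needs a quick check.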

Verdict

I could not find any information on the website that would indicate a payment plan for this service, so I would assume it’s offered for free. Considering the price, I think that this system is well worth a try if you need to convert a PDF file into an editable format. If the PDF document only (or mainly) contains editable text, you will be pleased by the results. If the file also contains text that has been pasted as graphic pages, the output will likely require some post-editing, but I think that will be comparable to what you may obtain with the majority of commercial OCR packages.

OnlineOCR.net

High-quality, free PDF to Word conversion

Using PDF files as the source for a translation is always challenging, especially with documents that have a non-linear text flow like brochures and presentations.

Our standard policy is to ask our clients to send the original file that was used to produce the PDF file that they want us to translate. This is usually the best option and allows us to deliver a translated document that is editable with the same program that was used to produce the original (although I do not like working with the verbose “tag soup” produced by XPress or InDesign converters and would sometimes rather convert the PDF to Word when dealing with these two translator-unfriendly formats).

Freewaregenius has published a review of PDF to Word Free. Here is a short excerpt:

[…] in terms of conversion quality this is hands down the best free PDF to DOC/RTF converter that I have seen; there is simply nothing that comes close.

The service is still in private beta. The Freewaregenius review contains an invite code that may allow you to join the program.

Update: Lifehacker also posted a review of PDF-to-Word. One piece of information added by this review is that the service actually performs an OCR extraction of the source file. So the conversion to Word lets you extract text from those lousy “static” (non-editable) PDF files that contain text pages pasted as images (yes, sadly we sometimes receive that sort of file from clients). Lifehacker also offers an invite code to test the service.

Update 2: as pointed out by readers on the Lifehacker blog and on this memoQ forum thread, PDFtoWord does not perform any OCR.

PDF to Word Free: a web service that delivers free, high quality PDF to DOC conversions | freewaregenius.com
PDF-to-Word Converter Pulls Readable Text from Scanned Images