About this blog

Translator's Shack is a collection of links, news, reviews and opinions about translation technologies. It's edited and updated by Roberto Savelli, an English to Italian translator, project manager and company owner of Albatros Soluzioni Linguistiche, a team of English-Italian translators, which hosts and supports this blog.

The Life as a PM category, managed by Gabriella Ascari, contains topics that are less technical in nature, but which we're sure will be appreciated by owners of small translation businesses and freelancers.


Translating text in AutoCAD drawings

The excellent Translator’s Tools blog contains a post on TranslateCAD, a utility that can be used to translate files saved in the AutoCAD format. We receive files in this format only occasionally, but I will definitely go back to that post the next time we receive this type of request.

via Translator’s Tools | Translating text in AutoCAD® drawings

High-quality, free PDF to Word conversion

Using PDF files as the source for a translation is always challenging, especially with documents that have a non-linear text flow like brochures and presentations.

Our standard policy is to ask our clients to send the original file that was used to produce the PDF file that they want us to translate. This is usually the best option and allows us to deliver a translated document that is editable with the same program that was used to produce the original (although I do not like working with the verbose “tag soup” produced by XPress or InDesign converters and would sometimes rather convert the PDF to Word when dealing with these two translator-unfriendly formats).

Freewaregenius has published a review of PDF to Word Free. Here is a short excerpt:

[…] in terms of conversion quality this is hands down the best free PDF to DOC/RTF converter that I have seen; there is simply nothing that comes close.

The service is still in private beta. The freewaregenius review contains an invite code that may allow you to join the program.

Update: Lifehacker also posted a review of PDF-to-Word. One piece of information added by this review is that the service actually performs an OCR extraction of the source file. So the conversion to Word makes it possible to extract text from those lousy “static” (non-editable) PDF files that contain text pages pasted as images (yes, sadly we sometimes receive that sort of file from clients). Lifehacker also offers an invite code to test the service.

Update 2: as pointed out by readers on the LifeHacker blog and on this MemoQ forum thread, PDFtoWord does not perform any OCR.

PDF to Word Free: a web service that delivers free, high quality PDF to DOC conversions | freewaregenius.com
PDF-to-Word Converter Pulls Readable Text from Scanned Images

Caterpillar 1.3 tested

Since the previous post generated some interest, I have decided to put the program to the test. I saved the page from http://www.apple.com/mac/ as a local HTML file and fed it to the program. Here’s a screenshot of Caterpillar’s main window:

Caterpillar main window


Just place the source files in the folder indicated by the In path, choose an extraction path and hit Extract. The processing speed on this single, short page was very high and the resulting TXT file was created instantaneously.

Here’s what the program’s output looks like:

Caterpillar output


As you can see, after the first couple of lines containing headers that will help the program reconstruct the file after it has been translated, the structure contains the following fields:

ID=

Type=

Source=

Target=

The translator will then have to translate all the fields preceded by “Target=” (using the CAT tool of choice) and then reconstruct the translated HTML file by using Caterpillar’s Integration command.
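To make the record structure more concrete, here is a minimal Python sketch that reads records of this kind into dictionaries. It is based only on the output shown in this post, not on any documentation of Caterpillar's file format: the function name parse_segments is made up for illustration, the header lines at the top of the file are simply skipped, and I am assuming that a line which does not start with one of the four field names continues the previous field.

```python
# Sketch only: the record layout is inferred from the sample output in this
# post, not from Caterpillar's documentation. parse_segments is a
# hypothetical helper name.
FIELD_KEYS = ("ID", "Type", "Source", "Target")


def parse_segments(text):
    """Return a list of dicts, one per ID=/Type=/Source=/Target= record."""
    segments = []
    current = None      # record being filled, None until the first ID= line
    last_field = None   # field that a continuation line should be appended to
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines between fields
        key, sep, value = line.partition("=")
        if sep and key in FIELD_KEYS:
            if key == "ID":
                # Assumption: every record starts with its ID= line.
                current = {}
                segments.append(current)
            if current is not None:
                current[key] = value
                last_field = key
        elif current is not None and last_field is not None:
            # Assumption: any other line continues the previous field.
            current[last_field] += "\n" + line
    return segments


sample = """ID=134
Type=text
Source=icon to see which items are only available
to ACME
Target=icon to see which items are only available
to ACME
ID=135
Type=text
Source=Plus
Target=Plus
"""

segments = parse_segments(sample)

# In the output shown in this post, Target= starts out as a copy of Source=,
# so comparing the two gives a rough list of segments still to be translated.
untranslated = [s["ID"] for s in segments if s["Target"] == s["Source"]]
print(untranslated)
```

A script along these lines could be used to count or review segments before and after translation. It does not replace Caterpillar's Integration command, which is still needed to rebuild the HTML file.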

Here’s what I found out during this brief test:

  • The tags are completely taken out of the equation when the text is converted to TXT. This can be either good or bad, depending on the translator’s tastes and on the type and complexity of the file being processed
  • The program does not assign internal/external styles to the file, so a translator who wants to use a CAT tool has two options: either move the cursor to the beginning of the “Target=” field after translating each sentence, or prep the file by assigning the “translatable” attribute to the Target sentences and making the rest of the text untranslatable
  • I noticed that if the HTML file contains diacritics (accented letters) or characters that are rendered by using Unicode in the HTML file, these become corrupted during the conversion to the translatable TXT file. This issue might or might not be addressed by the Encoding Converter option available in the program, which I did not test
  • Caterpillar has an option that allows you to merge the source files into one single translatable TXT file. This sounds particularly interesting for translators who like the auto-propagation feature offered by some CAT tools and for those complex projects comprising hundreds of tiny HTML files in multiple sub-folders
  • Interestingly, the file types that are available for processing include (besides HTML) PHP, XML and ASP. I did not test all these formats. However, I did test the program with one of those dreaded XML files that contain embedded HTML code. Surprisingly, Caterpillar did a decent job of extracting the translatable text. On the downside, the program creates a segmentation break at each tag that is preceded and followed by translatable text, so the following code:
    <p><b>Note:</b> When searching, look for the
    <img src="/images/search/plus_icon.gif"
    width=9 height=9> icon to see which items are
    only available to ACME <span class="hlt">Plus</span>
    customers.]]></Data></Cell>

    is rendered as follows:

    Source=When searching, look for the
    Target=When searching, look for the
    ID=133
    Type=text
    Source=icon to see which items are only available
    to ACME
    Target=icon to see which items are only available
    to ACME
    ID=134
    Type=text
    Source=Plus
    Target=Plus
    ID=135
    Type=text
    Source=customers.]]>
    Target=customers.]]>

In conclusion, this tool still requires some extra work to prep the files for processing with a CAT tool, and it takes a rather radical approach to tags (by deleting them from the working file). Even so, it might prove a useful addition to the utilities folder of translators who use basic CAT tools that cannot prep HTML and tagged files, and of advanced users who need a quick way of simplifying complex tagged files, for instance XML with embedded HTML.

However, Caterpillar's reliability should be tested carefully before using the tool on a large-scale project.

Caterpillar 1.3: Wordfast-compatible HTML prepping tool

At 30.00 EUR, this inexpensive tool may be a valuable solution for translators who are using tools that do not have file-prepping capabilities (i.e. externalizing or separating and protecting HTML code that does not need to be translated or otherwise changed).

Has anyone heard of the tool or used it for translation projects?

Caterpillar is a high-speed HTML Text Extractor and Integrator written for translators working with web sites. Process whole folders of web pages with a single click, then translate using your choice of software.

By generating a single output file containing all the text requiring translation Caterpillar provides a simple way to incorporate web page localisation into your existing translation work flow.

WordFast compatible – now you can translate web sites in the familiar environment of MS Word.

Caterpillar 1.3 – HTML Extractor and Integrator