translating RSS

June 13th, 2002

Jenny asks

my next question is how do we get on-the-fly translation into news aggregators, but I’m sure we’re a ways off from that.

Not that far off at all! I wrote Jenny an RSS cleanser in PHP recently, and it took me about 20 minutes tonight to add translation to it. I’ve shown it to Jenny but won’t post it here because (a) it really burdens the translation service and (b) for some reason accented characters in the description break the RSS feed, even if I use their HTML entity equivalents.
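
One likely cause, offered as an assumption rather than a diagnosis of the PHP script: XML predefines only five entities (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`), so an HTML entity such as `&eacute;` is undefined to an XML parser unless a DTD declares it. Here is a minimal Python sketch of the failure and one possible workaround; the helper name `xml_safe` is hypothetical.

```python
import html
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape


def xml_safe(text: str) -> str:
    """Turn HTML entities into real characters, then escape only &, < and >."""
    return escape(html.unescape(text))


# An HTML-only entity inside XML: the parser has no definition for &eacute;.
broken = "<description>caf&eacute;</description>"
try:
    ET.fromstring(broken)
    entity_rejected = False
except ET.ParseError:
    entity_rejected = True

# Same kind of text after cleaning: literal accented characters,
# with the ampersand escaped the way XML expects.
fixed = "<description>" + xml_safe("caf&eacute; &amp; th&eacute;") + "</description>"
description_text = ET.fromstring(fixed).text
```

The cleaned text then has to be written out in the encoding the feed declares, which is the point the second comment below makes.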

2 comments to “translating RSS”

  1. See my blog entry http://www.textartisan.com/caveatlector/archive/2002_06.html#e000211 for why entities fail in RSS feeds and what the fix is.

    I would explain here to save you the step, but it’s a little involuted.

    HTH. Good work you’re doing!

  2. I’ve done considerable research on encoding text in RSS feeds. To cut to the chase, it’s usually best to use UTF-8 encoding: not ISO-8859, as some folks insist on believing, and certainly not HTML entities.

    If not UTF-8, then it’s *very* important that the encoding declared for the XML document matches the actual encoding of the text included inside the items.

    A *great* many feeds are not doing it right.

    In addition, the RSS spec supports a language tag that is supposed to follow internationally recognized formatting. Many feeds are beginning to include a language tag:

    Translating from one language to another could be done quite easily if the feeds themselves start by including tags and text that’s encoded properly. The actual linguistic translation is, of course, as difficult as it’s always been.

    Here’s one that’s wrong:

    It’s got the wrong language tag and is using the wrong XML encoding. As a result you’d never even hope to automatically translate it without serious guesswork. A fix would be to encode the text using UTF-8 (or UTF-16), properly indicate it in the XML declaration and add a language tag.

    I’m generally finding that most developers are simply unaware of how much work has gone into making internationalization work in every browser since NS4. Most of the preconceived notions about not being able to use UTF-8 come from a lack of information.

    To that end, I’ve collected a number of links that might help others learn more about charsets, encoding, RSS and languages. See the reference section on Syndic8.com.

    Being able to do on-the-fly translations from one language to another is something that could be done *very* easily when existing standards are followed in creating the content. It’s really pretty easy once you understand it!
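
The advice in the second comment (encode as UTF-8, declare that encoding in the XML declaration, and include a language tag) can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the commenter's code: the function names, the RSS version string, the example.com link and the sample text are all made up, and `xml_declaration=True` needs Python 3.8 or later.

```python
from xml.etree import ElementTree as ET


def build_feed(title: str, description: str, language: str) -> bytes:
    """Serialize a tiny one-item feed as UTF-8 bytes with a matching declaration."""
    rss = ET.Element("rss", version="0.91")  # version chosen for illustration
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = "http://example.com/"  # placeholder
    ET.SubElement(channel, "description").text = description
    ET.SubElement(channel, "language").text = language  # e.g. "fr" or "en-us"
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "description").text = description
    # The encoding argument controls both the bytes and the declared encoding.
    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)


def parses_cleanly(data: bytes) -> bool:
    """Return True if an XML parser accepts the bytes as they are labeled."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


feed = build_feed("Exemple", "Un café, s’il vous plaît", "fr")

# The mismatch described above: ISO-8859-1 bytes labeled as UTF-8.
mislabeled = b'<?xml version="1.0" encoding="utf-8"?><title>caf\xe9</title>'
# The same bytes with an honest label parse without trouble.
relabeled = b'<?xml version="1.0" encoding="iso-8859-1"?><title>caf\xe9</title>'
```

With the encoding and the language both stated correctly, a translator or aggregator can decode the text and pick the source language without guessing.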