How to convert Unicode characters to XML character references

I scoured the web for how to convert in-line Unicode characters to XML character references in Python, but I could not find a clear way to do it. After hours of experimenting, I finally understood. As usual, it was past 9:30 pm (my standard programming cut-off time), which is why the answer turned out to be just one line of code.

It is not necessary to import the codecs module or do anything fancy with languages or encoding. It is necessary only to read the documentation carefully and completely.

To convert a string containing Unicode characters to the appropriate XML character references automatically:

strResult = strInput.encode('ascii', 'xmlcharrefreplace')

Here strInput is a valid Python Unicode string. That's it. One line of code.
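For anyone who wants to see it run, here is a minimal sketch of that one-liner, assuming Python 3, where encode returns a bytes object rather than a string. The helper name to_xml_charrefs and the sample text are my own inventions for illustration, not anything from the docs.

```python
def to_xml_charrefs(text: str) -> str:
    """Hypothetical helper: replace every non-ASCII character with &#NNNN;."""
    # In Python 3, encode() yields bytes. The result is pure ASCII,
    # so decoding it back to str cannot fail.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "caf\u00e9 \u2603"  # e-acute (U+00E9) and a snowman (U+2603)
result = to_xml_charrefs(sample)
print(result)  # caf&#233; &#9731;
```

The references come out as decimal code points. Note that this only touches characters that ASCII cannot represent; it does not escape &, < or >, so that still has to be handled separately.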

The longer explanation: encode is a built-in method of a Python string. The first parameter is the encoding to output; the second parameter is a directive for what to do if a character cannot be encoded. The standard values include 'strict' (raise an error; this is the default), 'ignore' (just drop the character), and 'xmlcharrefreplace' (convert the offending character to the appropriate character reference). The docs list more.
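To make the difference between those directives concrete, here is a small side-by-side sketch, again assuming Python 3 (so the results are bytes). The sample word is just an example I picked; 'replace' is one of the additional handlers from the docs.

```python
sample = "na\u00efve"  # contains i-diaeresis (U+00EF), which ASCII lacks

# 'strict' is the default: it raises UnicodeEncodeError.
try:
    sample.encode("ascii", "strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

ignored = sample.encode("ascii", "ignore")              # b'nave'
replaced = sample.encode("ascii", "replace")            # b'na?ve'
charrefs = sample.encode("ascii", "xmlcharrefreplace")  # b'na&#239;ve'

print(strict_raised, ignored, replaced, charrefs)
```

Only 'xmlcharrefreplace' keeps the information about which character was there, which is why it is the one to reach for when the output is headed into XML or HTML.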

Very handy.

Many thanks to Sam Ruby for "Unicode and weblogs", which showed me that someone is trying to use Unicode in Cheetah, and which has a quick little collection of Unicode characters courtesy of the commenters. Also a big thank-you to David Goodger for his "Encoding Unicode data for XML and HTML" Python recipe, the example primarily responsible for lighting my bulb.


Written by Andrew Ittner in misc on Thu 20 October 2005. Tags: programming, python