Unicode to ascii mappings for standard characters from wordprocessed documents

SEPTEMBER 25, 2006

Anyone who has converted old word-processed documents to plain ascii text will know that word processors love to insert their own special versions of a few standard characters such as ' and " (- also comes up pretty frequently). I came across this while using odt2txt plus openoffice to convert some old .rtf and .doc files to plain text. By default odt2txt writes the files as utf-8, which is fine, except there is really no reason these shouldn’t be plain ascii (plus the standard vim distribution on mac osx doesn’t support unicode!). So after some digging around, here are the relevant code conversions you will usually need:

\u2018 (curly left single quote) -> '
\u2019 (curly right single quote, also used as the apostrophe) -> '
\u201c (curly left double quote) -> "
\u201d (curly right double quote) -> "

Or in python:

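 # Sample input; replace with your own unicode text.
 out = u'\u201cIt\u2019s here\u201d'
 out = out.replace(u'\u2018', u"'")  # left single quote
 out = out.replace(u'\u2019', u"'")  # right single quote
 out = out.replace(u'\u201c', u'"')  # left double quote
 out = out.replace(u'\u201d', u'"')  # right double quote
 # encode() returns a new object, so keep the result; it raises
 # UnicodeEncodeError if any non-ascii characters remain.
 out = out.encode('ascii')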
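
If you have more than a handful of characters to map, a single translation table may be easier to maintain than a chain of replace calls. The following is only a rough sketch: the names ASCII_MAP and to_ascii are made up for illustration, and the dash and ellipsis entries are my own assumed additions (the list above only covers the four quotes). It also reports any non-ascii characters that are left over, so you can decide what to do with them before encoding.

```python
# Hypothetical mapping table, keyed by code point. The four quote entries
# come from the list above; the dash and ellipsis entries are assumptions.
ASCII_MAP = {
    0x2018: u"'",    # left single quotation mark
    0x2019: u"'",    # right single quotation mark / apostrophe
    0x201C: u'"',    # left double quotation mark
    0x201D: u'"',    # right double quotation mark
    0x2013: u"-",    # en dash (assumed addition)
    0x2014: u"--",   # em dash (assumed addition)
    0x2026: u"...",  # horizontal ellipsis (assumed addition)
}


def to_ascii(text):
    """Apply ASCII_MAP to text.

    Returns a tuple (converted, leftovers), where leftovers is a sorted
    list of the distinct non-ascii characters still present afterwards.
    """
    converted = text.translate(ASCII_MAP)
    leftovers = sorted(set(ch for ch in converted if ord(ch) > 127))
    return converted, leftovers


sample = u"\u201cIt\u2019s done\u201d \u2013 caf\u00e9"
converted, leftovers = to_ascii(sample)
# leftovers now holds the accented e, which this table does not cover
if not leftovers:
    converted.encode('ascii')  # safe: nothing outside ascii remains
```

Checking the leftovers first avoids a UnicodeEncodeError on characters the table doesn't know about; whether to drop them, transliterate them, or just keep the file as utf-8 is up to you.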