Fixing Unicode mistakes and more: the ftfy package

There’s been a great response to my earlier post, Fixing common Unicode mistakes with Python. This is clearly something that people besides me needed. In fact, someone already made the code into a web site, at fixencoding.com. I like the favicon.

I took the suggestion to split the code into a new standalone package. It’s now called ftfy, standing for “fixes text for you”. You can install it with pip install ftfy.

I observed that I was doing interesting things with Unicode in Python, and yet I wasn’t doing it in Python 3, which basically makes me a terrible person. ftfy is now compatible with both Python 2 and Python 3.

Something else amusing happened: At one point, someone edited the previous post and WordPress barfed HTML entities all over its text. All the quotation marks turned into &quot;, for example. So, for a bit, that post was setting a terrible example about how to handle text correctly!

I took that as a sign that I should expand ftfy so that it also decodes HTML entities (though it will leave them alone in the presence of HTML tags). While I was at it, I also made it turn curly quotes into straight ones, convert Windows line endings to Unix, normalize Unicode characters to their canonical forms, strip out terminal color codes, and remove miscellaneous control characters. The original fix_bad_unicode is still in there, if you just want the encoding fixer without the extra stuff.
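
To make the list of cleanups above concrete, here is a minimal sketch of the same kinds of steps using only the Python standard library. This is not ftfy's actual code: the name clean_text, the regular expressions, and the order of the steps are all assumptions made for illustration, and it uses NFKC normalization because the comments below mention that as the default.

```python
import html
import re
import unicodedata

# Hypothetical helper, NOT ftfy's implementation: a stdlib-only illustration.
ANSI_COLOR_RE = re.compile(r"\x1b\[[0-9;]*m")   # terminal color codes
TAG_RE = re.compile(r"<[^>]+>")                  # crude "looks like HTML" check
# C0 control characters other than tab, newline, and carriage return
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
})


def clean_text(text):
    # Decode HTML entities only when no HTML tags are present
    if not TAG_RE.search(text):
        text = html.unescape(text)
    # Strip color codes before control characters, since ESC is one of them
    text = ANSI_COLOR_RE.sub("", text)
    # Convert Windows line endings to Unix
    text = text.replace("\r\n", "\n")
    # Normalize; NFKC deliberately discards some typographic distinctions
    text = unicodedata.normalize("NFKC", text)
    # Straighten curly quotes
    text = text.translate(QUOTE_MAP)
    # Remove the remaining miscellaneous control characters
    return CONTROL_RE.sub("", text)


messy = "&quot;hi&quot;\r\n\x1b[31mred\x1b[0m \u201ccurly\u201d\u2026\x00"
print(repr(clean_text(messy)))
```

Text that contains tags, such as "<b>&amp;</b>", passes through this sketch with its entities untouched.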

5 thoughts on “Fixing Unicode mistakes and more: the ftfy package”

  1. Pingback: Fixing common Unicode mistakes with Python — after they’ve been made | Luminoso Blog

  2. Thanks for posting this information. I had been plagued with these Unicode errors in a SQL script, and the web site you recommended (http://fixencoding.com/) was (almost) perfect for fixing the issues. The window there can handle up to 999 lines of code, and as my code had a little over 5000 lines, it only took six chunks to clear up all the issues. Now, if some bright person could create a small PHP program that would accomplish what your Python program does, it could be run on local hosts to clear up the errors that hide inside program data strings.

  3. Unfortunately, the domain registration at fixencoding.com has lapsed.

  4. Nice utility, but why on Earth would I want to replace properly formatted quotes with straight ones? That seems like a very strange default.

    As it turns out, it isn’t even doing that, reliably, on this piece of text that I was sent from a broken epub:
    “foo,�

    Clearly, utf-8 has been mangled as cp-1252, and it should be
    “foo”
    or
    "foo"
    (with your defaults)

    If I put a newline before the right-double-quote, the left-double-quote is fixed correctly, which makes me think that the right quote has been completely mangled before it reached me.

    • Straightening quotes is the default for the same reason that NFKC normalization is the default: because there’s a difference between text content and typography, and often in code you want to disregard typography and only work with the content. Have you ever seen a regex, for example, that searches for quotation marks and actually matches curly ones?

      If you want to preserve the typography of text you run through ftfy, you should definitely turn off quote-straightening, and also NFKC normalization, and possibly even escape-sequence removal. Maybe all you want to do is fix mojibake, in which case you should call fix_text_encoding directly.

      NFKC normalization, like quote straightening, is useful in many cases, but it definitely loses typographical information. It will convert an ellipsis character into three periods, for example, which could affect the spacing when displayed. It will turn the ligature “ﬂ” into two separate letters. It will also turn ‘Ａ’, which is U+FF21 FULLWIDTH LATIN CAPITAL LETTER A, into a plain ASCII “A”, and that will certainly affect the typography of CJK text.

      And this is why options are options. Just be aware of what they are and choose the right ones for your use case.

      The example you pasted is indeed irreversibly mangled. It has been decoded with Windows-1252, but it ends with the Unicode replacement character, �, which of course is not a particular character in Windows-1252. Data has been lost: the code that produced that string would have produced the same string for “foo‐, ending with a Unicode hyphen instead of a right double quote.
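
The data loss described in the last reply can be reproduced with the standard library alone. The sketch below assumes the mangling step was "encode as UTF-8, decode as Windows-1252, replace anything undecodable with U+FFFD"; the helper name mangle is made up for this example, and the real pipeline that damaged the epub text is unknown.

```python
def mangle(text):
    # Encode as UTF-8, then wrongly decode the bytes as Windows-1252.
    # Bytes with no Windows-1252 meaning become U+FFFD, the replacement character.
    return text.encode("utf-8").decode("cp1252", errors="replace")


quoted = mangle("\u201cfoo\u201d")       # ends with RIGHT DOUBLE QUOTATION MARK
hyphenated = mangle("\u201cfoo\u2010")   # ends with HYPHEN (U+2010)

# Two different inputs collapse to the same output, so no fixer can tell them apart
print(quoted == hyphenated)

# The left quote survives as three Windows-1252 characters and can be recovered
left = mangle("\u201c")
print(left.encode("cp1252").decode("utf-8") == "\u201c")
```

The right quote and the hyphen differ only in their last UTF-8 byte (0x9D versus 0x90), and neither byte is assigned in Python's cp1252 codec, so both end up as the same replacement character.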
