Fixing common Unicode mistakes with Python — after they’ve been made

Update: not only can you fix Unicode mistakes with Python, you can fix Unicode mistakes with our open source Python package ftfy. It’s on PyPI and everything.

You have almost certainly seen text on a computer that looks something like this:

If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s

Somewhere, a computer got hold of a list of numbers that were intended to constitute a quotation and did something distinctly un-beautiful with it. A person reading that can deduce that it was actually supposed to say this:

If numbers aren’t beautiful, I don’t know what is. –Paul Erdős

Here’s what’s going on. A modern computer has the ability to display text that uses over 100,000 different characters, but unfortunately that text sometimes passes through a doddering old program that believes there are only the 256 that it can fit in a single byte. The program doesn’t even bother to check what encoding the text is in; it just uses its own favorite encoding and turns a bunch of characters into strings of completely different characters.

Now, you’re not the programmer causing the encoding problems, right? Because you’ve read something like Joel Spolsky’s The Absolute Minimum Every Developer Absolutely, Positively Must Know About Unicode And Character Sets or the Python Unicode HOWTO and you’ve learned the difference between text and bytestrings and how to get them right.

But the problem is that sometimes you might have to deal with text that comes out of other code. We deal with this a lot at Luminoso, where the text our customers want us to analyze has often passed through several different pieces of software, each with their own quirks, probably with Microsoft Office somewhere in the chain.

So this post isn’t about how to do Unicode right. It’s about a tool we came up with for damage control after some other program does Unicode wrong. It detects some of the most common encoding mistakes and does what it can to undo them.

Here’s the type of Unicode mistake we’re fixing.

  • Some text, somewhere, was encoded into bytes using UTF-8 (which is quickly becoming the standard encoding for text on the Internet).
  • The software that received this text wasn’t expecting UTF-8. It instead decodes the bytes in an encoding with only 256 characters. The simplest of these encodings is the one called “ISO-8859-1”, or “Latin-1” among friends. In Latin-1, you map the 256 possible bytes to the first 256 Unicode characters. This encoding can arise naturally from software that doesn’t even consider that different encodings exist.
  • The result is that every non-ASCII character turns into two or three garbage characters.
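The three steps above can be reproduced on purpose in a few lines. This is a minimal sketch written for Python 3 (the listing later in this post is Python 2), and the variable names are only illustrative.

```python
# Python 3 sketch: make the mistake deliberately, then undo it.
original = "If numbers aren’t beautiful, I don’t know what is. –Paul Erdős"

# Step 1: the text is correctly encoded as UTF-8.
utf8_bytes = original.encode("utf-8")

# Step 2: some other program decodes those bytes with a single-byte codec.
garbled_latin1 = utf8_bytes.decode("latin-1")
garbled_cp1252 = utf8_bytes.decode("cp1252")

# Step 3: each non-ASCII character has turned into two or three characters.
print(garbled_cp1252)
print(len(original), len(garbled_cp1252))

# Reversing the same two steps recovers the original text.
restored = garbled_cp1252.encode("cp1252").decode("utf-8")
print(restored == original)
```

The repair only works this cleanly because we know which codec did the damage. The rest of the post is about what to do when we don't.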

The three most commonly-confused codecs are UTF-8, Latin-1, and Windows-1252. There are lots of other codecs in use in the world, but they are so obviously different from these three that everyone can tell when they’ve gone wrong. We’ll focus on fixing cases where text was encoded as one of these three codecs and decoded as another.

A first attempt

When you look at the kind of junk that’s produced by this process, the character sequences seem so ugly and meaningless that you could just replace anything that looks like it should have been UTF-8. Just find those sequences, replace them unconditionally with what they would be in UTF-8, and you’re done. In fact, that’s what my first version did. Skipping a bunch of edge cases and error handling, it looked something like this:

# POSSIBLE_UTF8_SEQUENCE is a big nasty compiled regex of all sequences that
# look like valid UTF-8.
def naive_unicode_fixer(text):
    while True:
        match = POSSIBLE_UTF8_SEQUENCE.search(text)
        if match:
            fixed = match.group(0).encode('latin-1').decode('utf-8')
            text = text[:match.start()] + fixed + text[match.end():]
        else:
            return text

This does a perfectly fine job at decoding UTF-8 that was read as Latin-1 with hardly any false positives. But a lot of erroneous text out there in the wild wasn’t decoded as Latin-1. It was instead decoded in a slightly different codec, Windows-1252, the default in widely-used software such as Microsoft Office.
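The regex itself isn't shown in this post. As a rough illustration, here is a Python 3 sketch of the same idea with a hypothetical stand-in pattern. It is an assumption about what such a regex might look like, not the original one. It also does a single pass with `re.sub` instead of looping, and leaves a match alone if it turns out not to decode.

```python
import re

# Hypothetical stand-in for POSSIBLE_UTF8_SEQUENCE: a UTF-8 lead byte followed
# by the right number of continuation bytes, as they appear after being
# misread as Latin-1 (each byte became the character with the same number).
POSSIBLE_UTF8_SEQUENCE = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"        # two-byte sequences
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"    # three-byte sequences
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"    # four-byte sequences
)

def naive_unicode_fixer(text):
    """Unconditionally replace anything that looks like misread UTF-8."""
    def fix(match):
        try:
            return match.group(0).encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            # Looked plausible but is not valid UTF-8 after all.
            return match.group(0)
    return POSSIBLE_UTF8_SEQUENCE.sub(fix, text)

print(naive_unicode_fixer("m\u00c3\u00a1s"))  # the misread form of "más"
```

Because this pattern only covers Latin-1 characters, it never touches Windows-1252 gremlins, which is exactly the gap described next.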

Windows-1252 is totally non-standard, but you can see why people want it: it fills an otherwise useless area of Latin-1 with lots of word-processing-friendly characters, such as curly quotes, bullets, the Euro symbol, the trademark symbol, and the Czech letter š. When these characters show up where you didn’t expect them, they’re called “gremlins”.
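To see where the two codecs part ways, it helps to decode the same byte both ways. This is a small Python 3 sketch, and `decode_both` is a name made up for the example. It relies on Python's `cp1252` codec leaving a handful of bytes, such as 0x81, undefined.

```python
def decode_both(byte_value):
    """Decode one byte as Latin-1 and as Windows-1252 (None if unmapped)."""
    raw = bytes([byte_value])
    latin1 = raw.decode("latin-1")
    try:
        cp1252 = raw.decode("cp1252")
    except UnicodeDecodeError:
        cp1252 = None  # Python's cp1252 codec has no mapping for this byte
    return latin1, cp1252

# The codecs agree everywhere except in the range 0x80-0x9F.
for b in (0x41, 0x80, 0x81, 0x85, 0x99, 0xE9):
    latin1, cp1252 = decode_both(b)
    print(hex(b), repr(latin1), repr(cp1252))
```

In Latin-1, the bytes 0x80 to 0x9F are invisible control characters. In Windows-1252, most of them are the gremlins.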

When we might encounter text that was meant to be UTF-8 with these characters in it, the problem isn’t so simple anymore. I started finding things that people might actually say that included these characters and were also valid in UTF-8. Maybe these are improbable edge cases, but I don’t want to write a Unicode fixer that actually introduces errors.

>>> print naive_unicode_fixer(u"“I'm not such a fan of Charlotte Brontë…”")
“I'm not such a fan of Charlotte Bront녔

>>> print naive_unicode_fixer(u'AHÅ™, the new sofa from IKEA®')
AHř, the new sofa from IKEA®
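The reason these false positives happen is that the offending characters, taken as Windows-1252 bytes, really are well-formed UTF-8. Here is a quick Python 3 check. The helper name `reread_cp1252_as_utf8` is just for illustration.

```python
def reread_cp1252_as_utf8(text):
    """Pretend `text` was UTF-8 that somebody decoded as Windows-1252."""
    return text.encode("cp1252").decode("utf-8")

tail = "ë…”"                                 # the end of the Brontë quotation
print(tail.encode("cp1252"))                 # three bytes: a lead and two continuations
print(repr(reread_cp1252_as_utf8(tail)))     # a single Hangul syllable
print(repr(reread_cp1252_as_utf8("Å™")))     # a single Latin letter with a caron
```

Nothing at the byte level says these strings are innocent, so the decision has to be made by looking at the resulting text.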

An intelligent Unicode fixer

Because encoded text can actually be ambiguous, we have to figure out whether the text is better when we fix it or when we leave it alone. The venerable Mark Pilgrim has a key insight when discussing his chardet module:

Encoding detection is really language detection in drag. –Mark Pilgrim, Dive Into Python 3

The reason the word “Bront녔” is so clearly wrong is that the first five characters are Roman letters, while the last one is Hangul, and most words in most languages don’t mix two different scripts like that.

This is where Python’s standard library starts to shine. The unicodedata module can tell us lots of things we want to know about any given character:

>>> import unicodedata
>>> unicodedata.category(u't')
'Ll'
>>> unicodedata.category(u'녔')
'Lo'
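The script of a letter can usually be read off the first word of its Unicode name, which is enough to catch a word like “Bront녔”. The following is a Python 3 sketch of that check. `letter_script` and `mixes_scripts` are hypothetical helpers for this example, not part of the fixer below.

```python
import unicodedata

def letter_script(char):
    """Return the first word of a letter's Unicode name, or None for non-letters."""
    if not unicodedata.category(char).startswith("L"):
        return None
    return unicodedata.name(char, "UNKNOWN").split()[0]

def mixes_scripts(word):
    """True if two adjacent letters in `word` come from different scripts."""
    scripts = [letter_script(c) for c in word]
    return any(a and b and a != b for a, b in zip(scripts, scripts[1:]))

print(letter_script("t"), letter_script("\ub154"))
print(mixes_scripts("Bront\ub154"), mixes_scripts("Brontë"))
```

This treats every name prefix as its own script, which is too strict for real text: Japanese mixes Han and kana all the time, so a real version needs to group related scripts together.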

Now we can write a more complicated but much more principled Unicode fixer by following some rules of thumb:

  • We want to apply a consistent transformation that minimizes the number of “weird things” that happen in a string.
  • Obscure single-byte characters, such as ¬ and ƒ, are weird.
  • Math and currency symbols adjacent to other symbols are weird.
  • Having two adjacent letters from different scripts is very weird.
  • Causing new decoding errors that turn normal characters into � is unacceptable and should count for much more than any other problem.
  • Favor shorter strings over longer ones, as long as the shorter string isn’t weirder.
  • Favor correctly-decoded Windows-1252 gremlins over incorrectly-decoded ones.
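Taken together, the rules amount to this: generate a few candidate readings, give each one a cost, and keep the cheapest. Here is a deliberately tiny Python 3 sketch of that shape. The cost function is a toy that only knows about length, replacement characters, and mixed scripts, and all the names (`toy_cost`, `candidates`, `toy_fix`) are made up for this example.

```python
import unicodedata

def toy_cost(text):
    """Length, plus big penalties for decoding errors and mixed scripts."""
    cost = len(text)
    prev_script = None
    for ch in text:
        if ch == "\ufffd":
            cost += 100                      # a decoding error: unacceptable
        if unicodedata.category(ch).startswith("L"):
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            if prev_script and script != prev_script:
                cost += 10                   # adjacent letters, different scripts
            prev_script = script
        else:
            prev_script = None
    return cost

def candidates(text):
    """The text as-is, plus each way of re-reading it as UTF-8."""
    yield text
    for codec in ("latin-1", "cp1252"):
        try:
            yield text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            pass                             # this reading is impossible; skip it

def toy_fix(text):
    return min(candidates(text), key=toy_cost)

print(toy_fix("\u00c3\u00banico"))
print(toy_fix("Bront\u00eb\u2026\u201d"))
```

Even this toy leaves the Brontë example alone, because the "fixed" version puts a Hangul syllable next to Latin letters and that costs more than the two characters it saves.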

That leads us to a complete Unicode fixer that applies these rules. It does an excellent job at fixing files full of garble line-by-line, such as the University of Leeds Internet Spanish frequency list, which picked up that “mÃ¡s” is a really common word in Spanish text because there is so much incorrect Unicode on the Web.

The final code appears below, as well as in this recipe and in our open source (MIT license) natural language wrangling package, metanl.

# -*- coding: utf-8 -*-

import unicodedata

def fix_bad_unicode(text):
    u"""
    Something you will find all over the place, in real-world text, is text
    that's mistakenly encoded as utf-8, decoded in some ugly format like
    latin-1 or even Windows codepage 1252, and encoded as utf-8 again.

    This causes your perfectly good Unicode-aware code to end up with garbage
    text because someone else (or maybe "someone else") made a mistake.

    This function looks for the evidence of that having happened and fixes it.
    It determines whether it should replace nonsense sequences of single-byte
    characters that were really meant to be UTF-8 characters, and if so, turns
    them into the correctly-encoded Unicode character that they were meant to
    represent.

    The input to the function must be Unicode. It's not going to try to
    auto-decode bytes for you -- then it would just create the problems it's
    supposed to fix.

        >>> print fix_bad_unicode(u'Ãºnico')
        único

        >>> print fix_bad_unicode(u'This text is fine already :þ')
        This text is fine already :þ

    Because these characters often come from Microsoft products, we allow
    for the possibility that we get not just Unicode characters 128-255, but
    also Windows's conflicting idea of what characters 128-160 are.

        >>> print fix_bad_unicode(u'This â€” should be an em dash')
        This — should be an em dash

    We might have to deal with both Windows characters and raw control
    characters at the same time, especially when dealing with characters like
    \x81 that have no mapping in Windows.

        >>> print fix_bad_unicode(u'This text is sad .â\x81”.')
        This text is sad .⁔.

    This function even fixes multiple levels of badness:

        >>> wtf = u'\xc3\xa0\xc2\xb2\xc2\xa0_\xc3\xa0\xc2\xb2\xc2\xa0'
        >>> print fix_bad_unicode(wtf)
        ಠ_ಠ

    However, it has safeguards against fixing sequences of letters and
    punctuation that can occur in valid text:

        >>> print fix_bad_unicode(u'not such a fan of Charlotte Brontë…”')
        not such a fan of Charlotte Brontë…”

    Cases of genuine ambiguity can sometimes be addressed by finding other
    characters that are not double-encoding, and expecting the encoding to
    be consistent:

        >>> print fix_bad_unicode(u'AHÅ™, the new sofa from IKEA®')
        AHÅ™, the new sofa from IKEA®

    Finally, we handle the case where the text is in a single-byte encoding
    that was intended as Windows-1252 all along but read as Latin-1:

        >>> print fix_bad_unicode(u'This text was never Unicode at all\x85')
        This text was never Unicode at all…
    """
    if not isinstance(text, unicode):
        raise TypeError("This isn't even decoded into Unicode yet. "
                        "Decode it first.")
    if len(text) == 0:
        return text

    maxord = max(ord(char) for char in text)
    tried_fixing = []
    if maxord < 128:
        # Hooray! It's ASCII!
        return text
    else:
        attempts = [(text, text_badness(text) + len(text))]
        if maxord < 256:
            tried_fixing = reinterpret_latin1_as_utf8(text)
            tried_fixing2 = reinterpret_latin1_as_windows1252(text)
            attempts.append((tried_fixing, text_cost(tried_fixing)))
            attempts.append((tried_fixing2, text_cost(tried_fixing2)))
        elif all(ord(char) in WINDOWS_1252_CODEPOINTS for char in text):
            tried_fixing = reinterpret_windows1252_as_utf8(text)
            attempts.append((tried_fixing, text_cost(tried_fixing)))
        else:
            # We can't imagine how this would be anything but valid text.
            return text

        # Sort the results by badness
        attempts.sort(key=lambda x: x[1])
        #print attempts
        goodtext = attempts[0][0]
        if goodtext == text:
            return goodtext
        else:
            return fix_bad_unicode(goodtext)

def reinterpret_latin1_as_utf8(wrongtext):
    newbytes = wrongtext.encode('latin-1', 'replace')
    return newbytes.decode('utf-8', 'replace')

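def reinterpret_windows1252_as_utf8(wrongtext):
    altered_bytes = []
    for char in wrongtext:
        if ord(char) in WINDOWS_1252_GREMLINS:
            # Gremlins map back to their Windows-1252 byte...
            altered_bytes.append(char.encode('WINDOWS_1252'))
        else:
            # ...and everything else back to its Latin-1 byte.
            altered_bytes.append(char.encode('latin-1', 'replace'))
    return ''.join(altered_bytes).decode('utf-8', 'replace')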

def reinterpret_latin1_as_windows1252(wrongtext):
    """
    Maybe this was always meant to be in a single-byte encoding, and it
    makes the most sense in Windows-1252.
    """
    return wrongtext.encode('latin-1').decode('WINDOWS_1252', 'replace')

def text_badness(text):
    u"""
    Look for red flags that text is encoded incorrectly:

    Obvious problems:
    - The replacement character \ufffd, indicating a decoding error
    - Unassigned or private-use Unicode characters

    Very weird things:
    - Adjacent letters from two different scripts
    - Letters in scripts that are very rarely used on computers (and
      therefore, someone who is using them will probably get Unicode right)
    - Improbable control characters, such as 0x81

    Moderately weird things:
    - Improbable single-byte characters, such as ƒ or ¬
    - Letters in somewhat rare scripts
    """
    assert isinstance(text, unicode)
    errors = 0
    very_weird_things = 0
    weird_things = 0
    prev_letter_script = None
    for pos in xrange(len(text)):
        char = text[pos]
        index = ord(char)
        if index < 256:
            # Deal quickly with the first 256 characters.
            weird_things += SINGLE_BYTE_WEIRDNESS[index]
            if SINGLE_BYTE_LETTERS[index]:
                prev_letter_script = 'latin'
            else:
                prev_letter_script = None
        else:
            category = unicodedata.category(char)
            if category == 'Co':
                # Unassigned or private use
                errors += 1
            elif index == 0xfffd:
                # Replacement character
                errors += 1
            elif index in WINDOWS_1252_GREMLINS:
                lowchar = char.encode('WINDOWS_1252').decode('latin-1')
                weird_things += SINGLE_BYTE_WEIRDNESS[ord(lowchar)] - 0.5

            if category.startswith('L'):
                # It's a letter. What kind of letter? This is typically found
                # in the first word of the letter's Unicode name.
                name = unicodedata.name(char)
                scriptname = name.split()[0]
                freq, script = SCRIPT_TABLE.get(scriptname, (0, 'other'))
                if prev_letter_script:
                    if script != prev_letter_script:
                        very_weird_things += 1
                    if freq == 1:
                        weird_things += 2
                    elif freq == 0:
                        very_weird_things += 1
                prev_letter_script = script
            else:
                prev_letter_script = None

    return 100 * errors + 10 * very_weird_things + weird_things

def text_cost(text):
    """
    Assign a cost function to the length plus weirdness of a text string.
    """
    return text_badness(text) + len(text)

# The rest of this file is esoteric info about characters, scripts, and their
# frequencies.
# Start with an inventory of "gremlins", which are characters from all over
# Unicode that Windows has instead assigned to the control characters
# 0x80-0x9F. We might encounter them in their Unicode forms and have to figure
# out what they were originally.

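WINDOWS_1252_GREMLINS = [
    # adapted from
    0x0152,  # LATIN CAPITAL LIGATURE OE
    0x0153,  # LATIN SMALL LIGATURE OE
    0x0160,  # LATIN CAPITAL LETTER S WITH CARON
    0x0161,  # LATIN SMALL LETTER S WITH CARON
    0x0178,  # LATIN CAPITAL LETTER Y WITH DIAERESIS
    0x017D,  # LATIN CAPITAL LETTER Z WITH CARON
    0x017E,  # LATIN SMALL LETTER Z WITH CARON
    0x0192,  # LATIN SMALL LETTER F WITH HOOK
    0x02C6,  # MODIFIER LETTER CIRCUMFLEX ACCENT
    0x02DC,  # SMALL TILDE
    0x2013,  # EN DASH
    0x2014,  # EM DASH
    0x2018,  # LEFT SINGLE QUOTATION MARK
    0x2019,  # RIGHT SINGLE QUOTATION MARK
    0x201A,  # SINGLE LOW-9 QUOTATION MARK
    0x201C,  # LEFT DOUBLE QUOTATION MARK
    0x201D,  # RIGHT DOUBLE QUOTATION MARK
    0x201E,  # DOUBLE LOW-9 QUOTATION MARK
    0x2020,  # DAGGER
    0x2021,  # DOUBLE DAGGER
    0x2022,  # BULLET
    0x2026,  # HORIZONTAL ELLIPSIS
    0x2030,  # PER MILLE SIGN
    0x2039,  # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    0x203A,  # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    0x20AC,  # EURO SIGN
    0x2122,  # TRADE MARK SIGN
]

# a list of Unicode characters that might appear in Windows-1252 text
WINDOWS_1252_CODEPOINTS = range(256) + WINDOWS_1252_GREMLINS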

# Rank the characters typically represented by a single byte -- that is, in
# Latin-1 or Windows-1252 -- by how weird it would be to see them in running
# text.
#   0 = not weird at all
#   1 = rare punctuation or rare letter that someone could certainly
#       have a good reason to use. All Windows-1252 gremlins are at least
#       weirdness 1.
#   2 = things that probably don't appear next to letters or other
#       symbols, such as math or currency symbols
#   3 = obscure symbols that nobody would go out of their way to use
#       (includes symbols that were replaced in ISO-8859-15)
#   4 = why would you use this?
#   5 = unprintable control character
# The Portuguese letter Ã (0xc3) is marked as weird because it would usually
# appear in the middle of a word in actual Portuguese, and meanwhile it
# appears in the mis-encodings of many common characters.

SINGLE_BYTE_WEIRDNESS = (
#   0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
    5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 5, 5, 5, 5, 5,  # 0x00
    5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,  # 0x10
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x20
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x30
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x40
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x50
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x60
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,  # 0x70
    2, 5, 1, 4, 1, 1, 3, 3, 4, 3, 1, 1, 1, 5, 1, 5,  # 0x80
    5, 1, 1, 1, 1, 3, 1, 1, 4, 1, 1, 1, 1, 5, 1, 1,  # 0x90
    1, 0, 2, 2, 3, 2, 4, 2, 4, 2, 2, 0, 3, 1, 1, 4,  # 0xa0
    2, 2, 3, 3, 4, 3, 3, 2, 4, 4, 4, 0, 3, 3, 3, 0,  # 0xb0
    0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xc0
    1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xd0
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xe0
    1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xf0
)

# Pre-cache the Unicode data saying which of these first 256 characters are
# letters. We'll need it often.
SINGLE_BYTE_LETTERS = [
    unicodedata.category(unichr(i)).startswith('L')
    for i in xrange(256)
]

# A table telling us how to interpret the first word of a letter's Unicode
# name. The number indicates how frequently we expect this script to be used
# on computers. Many scripts not included here are assumed to have a frequency
# of "0" -- if you're going to write in Linear B using Unicode, you're
# probably aware enough of encoding issues to get it right.
# The lowercase name is a general category -- for example, Han characters and
# Hiragana characters are very frequently adjacent in Japanese, so they all go
# into category 'cjk'. Letters of different categories are assumed not to
# appear next to each other often.
SCRIPT_TABLE = {
    'LATIN': (3, 'latin'),
    'CJK': (2, 'cjk'),
    'ARABIC': (2, 'arabic'),
    'CYRILLIC': (2, 'cyrillic'),
    'GREEK': (2, 'greek'),
    'HEBREW': (2, 'hebrew'),
    'KATAKANA': (2, 'cjk'),
    'HIRAGANA': (2, 'cjk'),
    'HIRAGANA-KATAKANA': (2, 'cjk'),
    'HANGUL': (2, 'cjk'),
    'DEVANAGARI': (2, 'devanagari'),
    'THAI': (2, 'thai'),
    'FULLWIDTH': (2, 'cjk'),
    'MODIFIER': (2, None),
    'HALFWIDTH': (1, 'cjk'),
    'BENGALI': (1, 'bengali'),
    'LAO': (1, 'lao'),
    'KHMER': (1, 'khmer'),
    'TELUGU': (1, 'telugu'),
    'MALAYALAM': (1, 'malayalam'),
    'SINHALA': (1, 'sinhala'),
    'TAMIL': (1, 'tamil'),
    'GEORGIAN': (1, 'georgian'),
    'ARMENIAN': (1, 'armenian'),
    'KANNADA': (1, 'kannada'),  # mostly used for looks of disapproval
    'MASCULINE': (1, 'latin'),
    'FEMININE': (1, 'latin')
}

24 thoughts on “Fixing common Unicode mistakes with Python — after they’ve been made”

    1. chardet is great, and when used responsibly would make this code unnecessary.

      At one point I decided I was just going to add “utf-8 misread as latin-1” and “utf-8 misread as windows-1252” as new encodings by registering them with the Python “codecs” library, and then try to wrap chardet so it understood these new encodings. There are a few reasons I decided against that.

      One of them is that, as useful as the chardet module is, it’s not the most extensible thing ever. It’s mostly a bunch of undocumented finite state machines.

      Another is that chardet reads in bytes, while fix_bad_unicode reads in what is supposedly Unicode. Working with the UTF-8 representation of it would be rather indirect and make the problem much harder (now you’re three steps away from what the text should be instead of two). And then everything that fix_bad_unicode could fix would also appear to chardet as totally valid UTF-8 — I’d have to actually change its UTF-8 detector.

      Basically, chardet is what you should use for reading text in an unknown encoding. fix_bad_unicode is what you should use when someone else didn’t know they were reading text in the wrong encoding. (Perhaps they should have used chardet.)

  1. Thanks for the post, that’s wonderfully enlightening.

    I’m decoding characters sent from a variety of sources, and haven’t (yet) got each source to specify what encoding they are sending me. Many of them are not technically astute, and will likely not be able to tell me. However, when integrating with a new source, I’m in close communication with them, and will be able to request that they send me a particular string. Is there a string of characters I could ask them to send which will disambiguate the encoding they are using?

    I guess the answer depends on the set of possible encodings I expect to be receiving. Just common English language ones, for the time being.

    I just realised this is a bit off-topic, and probably better suited to a Stack Overflow question:

    1. That’s a pretty fun question. I’m not sure it has a single good answer.

      All these encodings can basically be disambiguated in a single character. But when it comes to the single-byte encodings, the characters where they disagree are generally going to be characters another single-byte encoding doesn’t support at all.

      I’d ask them to send something like “¢a$h mon€¥”, which aside from looking like an irritating teenager’s username would make a pretty good test. The plain ASCII characters are a reference point. The ¢ and ¥ will distinguish whether they’re using UTF-8 or a single-byte encoding. The € will distinguish ISO-8859-15 (which I don’t know if anyone uses) from Windows-1252. The ¥ will distinguish all of these from crazy Japanese encodings that put ¥ where a backslash should be.

      However, in many encodings including ISO-8859-1, they won’t have any representation for the Euro symbol. They might not be able to see that you’re asking them for the Euro symbol. On Windows, you could ask them to type Alt+0128 and see what happens.
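For anyone who wants to try the probe string, here is a short Python 3 sketch that shows the bytes it produces under a few candidate encodings. `encode_or_none` is a helper invented for this example, and the list of codecs is only a guess at what a sender might be using.

```python
probe = "¢a$h mon€¥"

def encode_or_none(text, codec):
    """Encode `text`, or return None if the codec cannot represent it."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

# Each encoding leaves a different fingerprint on the same string.
for codec in ("utf-8", "cp1252", "iso-8859-15", "latin-1"):
    print(codec, encode_or_none(probe, codec))
```

Latin-1 comes back as None here because it has no Euro sign at all, which matches the caveat above.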

      1. Interesting reply, thanks for the thoughts Rob.

        I’m communicating with each sender out of band, person-to-person, as they initially integrate, so I’m reasonably confident that I’ll be able to communicate the required string to them, for example by email, assuming that handles the encodings ok, and iterate if the bytes they send don’t make sense to me.

        I guess if I ask for a character they can’t encode, then things will either blow up on their end, or I’ll get some junk or a missing character. I guess I need to try a few experiments. I’ll report back if anything interesting happens.

  2. «the University of Leeds Internet Spanish frequency list, which picked up that “mÃ¡s” is a really common word in Spanish text because there is so much incorrect Unicode on the Web.»

    Actually that’s not true. The page linked is just a text/plain page that is UTF-8 encoded but is not reporting an encoding in the HTTP response headers. So your browser is incorrectly guessing Latin-1, resulting in what you see.

    So the problem is in how your browser is rendering that listing (and the server being slightly rude and not telling it how to decode, forcing a guess), not the amount of incorrect Unicode in the web.

    1. Nope — this just shows how sneaky Unicode problems can be! Indeed, when I click on the link to that file in some browsers, the entire thing looks like UTF-8 misread as Latin-1.

      But if you read the file correctly as UTF-8, you will see that the 24th most common word is “más” and the 175th most common word is “mÃ¡s”.

      Look at the 175th most common word in your browser, then, and it will look like “mÃƒÂ¡s”. It’s gone through the mis-encoding cycle *twice*. And that’s why I made sure fix_bad_unicode works recursively.
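The double cycle is easy to reproduce. The following Python 3 sketch uses two made-up helpers, `misread` and `undo_misreads`, and assumes the wrong codec was Windows-1252 every time. Unlike fix_bad_unicode, it undoes the damage blindly, with no check for whether the result is actually better.

```python
def misread(text, times=1, codec="cp1252"):
    """Encode as UTF-8 and decode with the wrong codec, `times` times over."""
    for _ in range(times):
        text = text.encode("utf-8").decode(codec)
    return text

def undo_misreads(text, codec="cp1252"):
    """Keep reversing the mistake until it can't be reversed any further."""
    while True:
        try:
            candidate = text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return text          # this layer was not a misreading
        if candidate == text:
            return text          # nothing left to change (plain ASCII)
        text = candidate

once = misread("más", 1)
twice = misread("más", 2)
print(once, twice, undo_misreads(twice))
```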

  3. Brilliant thank you!

    What I took from this: The software was not expecting UTF-8 text, so  kept appearing at “random” places.

    I tried encoding it with Latin-1 (.encode('latin-1')) as per example, and it worked!

