We are proud to have launched our second installment of Brand Wars in collaboration with our partners over at iModerate! While the first edition focused on two athletic consumer brands, Nike and Under Armour, this one focuses on brands that provide streaming services: Netflix, Hulu, and Amazon. Of the three, which do you think will be the winner?
Have you ever wondered about the foundations of Luminoso, and from where we came? Check out this short five-part interview featuring Luminoso’s CEO & co-founder, Catherine Havasi. These videos, produced by MIT’s Industrial Liaison Program (ILP), highlight the reasons why Luminoso’s solutions are remarkably different from other solutions out there. The first secret? Common sense!
Please follow the link here to learn more about this, how we work, and about our future in Catherine’s exclusive interview!
ftfy is a Python tool that takes in bad Unicode and outputs good Unicode. I developed it because we really needed it at Luminoso — the text we work with can be damaged in several ways by the time it gets to us. It’s become our most popular open-source project by far, as many other people have the same itch that we’re scratching.
The coolest thing that ftfy does is to fix mojibake: those mix-ups in encodings that cause the word más to turn into mÃ¡s or even mÃƒÂ¡s. (I’ll recap why this happens and how it can be reversed below.) Mojibake is often intertwined with other problems, such as un-decoded HTML entities (m&aacute;s), and ftfy fixes those as well. But as we worked with the ftfy 3 series, it gradually became clear that the default settings were making some changes that were unnecessary, and from time to time they would actually get in the way of the goal of cleaning up text.
ftfy 4 includes interesting new fixes to creative new ways that various software breaks Unicode. But it also aims to change less text that doesn’t need to be changed. This is the big change that made us increase the major version number from 3 to 4, and it’s fundamentally about Unicode normalization. I’ll discuss this change below under the heading “Normalization”.
Mojibake and why it happens
Mojibake is what happens when text is written in one encoding and read as if it were a different one. It comes from the Japanese word “•¶Žš‰»‚¯” — no, sorry, “文字化け” — meaning “character corruption”. Mojibake turns everything but basic ASCII characters into nonsense.
Suppose you have a word such as “más”. In UTF-8 — the encoding used by the majority of the Internet — the plain ASCII letters “m” and “s” are represented by the familiar single byte that has represented them in ASCII for 50 years. The letter “á”, which is not ASCII, is represented by two bytes.
Text:   m   á      s
Bytes:  6d  c3 a1  73
The problem occurs when these bytes get sent to a program that doesn’t quite understand UTF-8. This program probably thinks that every character is one byte, so it decodes each byte as a character, in a way that depends on the operating system it’s running on and the country it was set up for. (This, of course, makes no sense in an era where computers from all over the world can talk to each other.)
If we decode this text using Windows’ most popular single-byte encoding, which is known as “Windows-1252” and often confused with “ISO-8859-1”, we’ll get this:
Bytes:  6d  c3  a1  73
Text:   m   Ã   ¡   s
The real problem happens when this text needs to be sent back over the Internet. The program may well pass the newly-weirdified text to something that does know it needs to encode UTF-8:
Intended text:  m   á             s
Actual text:    m   Ã      ¡      s
Bytes:          6d  c3 83  c2 a1  73
So, the word “más” was supposed to be four bytes of UTF-8, but what we have now is six bytes of what I propose to call “Double UTF-8”, or “WTF-8” for short.
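The round trip above can be reproduced with nothing but Python’s built-in codecs. This is only a sketch of how the damage happens, not anything taken from ftfy itself; the variable names are mine.

```python
# Reproduce the "más" round trip using only Python's built-in codecs.
text = "más"

utf8_bytes = text.encode("utf-8")             # the correct four bytes
misread = utf8_bytes.decode("windows-1252")   # every byte read as one character
double_encoded = misread.encode("utf-8")      # the misread text, encoded again

print(utf8_bytes.hex(" "))        # 6d c3 a1 73
print(misread)                    # mÃ¡s
print(double_encoded.hex(" "))    # 6d c3 83 c2 a1 73
```

Note that bytes.hex() accepts a separator only in Python 3.8 and later.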
WTF-8 is a very common form of mojibake, and the fortunate thing is that it’s reasonably easy to detect. Most possible sequences of bytes are not UTF-8, and most mojibake forms sequences of characters that are extremely unlikely to be the intended text. So ftfy can look for sequences that would decode as UTF-8 if they were encoded as another popular encoding, and then sanity-check by making sure that the new text looks more likely than the old text. By reversing the process that creates mojibake, it turns mojibake into the correct text with a rate of false positives so low that it’s difficult to measure.
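To make the reversal concrete, here is a minimal sketch that undoes this one kind of mojibake by encoding the text as Windows-1252 and decoding the result as UTF-8. It is not ftfy’s implementation: the function names are hypothetical, it only knows about one wrong encoding, and it treats “the bytes decode cleanly as UTF-8” as the only evidence, where ftfy also checks that the new text looks more likely than the old text.

```python
def undo_one_layer(text, wrong_encoding="windows-1252"):
    """Try to reverse one layer of UTF-8 that was misread as a single-byte encoding."""
    if text.isascii():
        return text  # plain ASCII is never affected by this kind of mojibake
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not have the shape of mojibake, so leave it alone
        return text


def undo_mojibake(text):
    """Peel off layers until the text stops changing."""
    while True:
        fixed = undo_one_layer(text)
        if fixed == text:
            return text
        text = fixed


print(undo_mojibake("mÃƒÂ¡s"))   # más
print(undo_mojibake("más"))      # más
```

Correct text such as “más” survives because its Windows-1252 bytes are not valid UTF-8, which is the “most possible sequences of bytes are not UTF-8” observation at work.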
Weird new mojibake
We test ftfy on live data from Twitter, which due to its diversity of languages and clients is a veritable petri dish of Unicode bugs. One thing I’ve found in this testing is that mojibake is becoming a bit less common. People expect their Twitter clients to be able to deal with Unicode, and the bugs are gradually getting fixed. The “you fail at Unicode” character � was 33% less common on Twitter in 2014 than it was in 2013.
Some software is still very bad at Unicode — particularly Microsoft products. These days, Microsoft is in many ways making its software play nicer in a pluralistic world, but they bury their head in the sand when it comes to the dominance of UTF-8. Sadly, Microsoft’s APIs were not designed for UTF-8 and they’re not interested in changing them. They adopted Unicode during its awkward coming-of-age in the mid ’90s, when UTF-16 seemed like the only way to do it. Encoding text in UTF-16 is like dancing the Macarena — you probably could do it under duress, but you haven’t willingly done it since 1997.
Because they don’t match the way the outside world uses Unicode, Microsoft products tend to make it very hard or impossible to export and import Unicode correctly, and easy to do it incorrectly. This remains a major reason that we need ftfy.
Although text is getting a bit cleaner, people are getting bolder about their use of Unicode and the bugs that remain are getting weirder. ftfy has always been able to handle some cases of files that use different encodings on different lines, but what we’re seeing now is text that switches between UTF-8 and WTF-8 in the same sentence. There’s something out there that uses UTF-8 for its opening quotation marks and Windows-1252 for its closing quotation marks, before encoding it all in UTF-8 again, â€œlike this”. You can’t simply encode and decode that string to get the intended text.
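Here is a small demonstration of why the whole-string approach fails on that example, followed by one narrow way to repair just the damaged part. The regular expression is my own illustrative assumption (the letter “â” followed by two non-ASCII characters, which is what three-byte UTF-8 punctuation looks like after being misread as Windows-1252); ftfy’s actual heuristic is not shown here.

```python
import re

mixed = "\u00e2\u20ac\u0153like this\u201d"   # â€œlike this”

# Whole-string approach: the closing quote becomes the lone byte 0x94,
# which is not valid UTF-8, so decoding fails.
try:
    mixed.encode("windows-1252").decode("utf-8")
    whole_fix_worked = True
except UnicodeDecodeError:
    whole_fix_worked = False

# Illustrative pattern only: "â" plus two non-ASCII characters
THREE_BYTE_MOJIBAKE = re.compile(r"\u00e2[^\x00-\x7f]{2}")


def fix_segment(match):
    """Repair one matched segment, or leave it alone if it is not mojibake."""
    segment = match.group(0)
    try:
        return segment.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return segment


partly_fixed = THREE_BYTE_MOJIBAKE.sub(fix_segment, mixed)
print(whole_fix_worked)   # False
print(partly_fixed)       # “like this”
```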
ftfy 4.0 includes a heuristic that fixes some common cases of mixed encodings in close proximity. It’s a bit conservative — it leaves some text unfixed, because if it changed all text that might possibly be in a mixed encoding, it would lead to too many false positives.
Another variation of this is that ftfy looks for mojibake that some other well-meaning software has tried to fix, such as by replacing byte A0 with a space, because in Windows-1252 A0 is a non-breaking space. Previously, ftfy would have to leave the mojibake unfixed if one of its characters was changed. But if the sequence is clear enough, ftfy will put back the A0 byte so that it can fix the original mojibake.
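As a rough illustration of the idea, the sketch below damages the text “voilà le” the way such software would, then restores the A0 byte before decoding. The rule it uses (any “Ã” followed by a space was originally “Ã” plus a non-breaking space) is a deliberately naive assumption of mine, and the function name is hypothetical; ftfy only does this when the sequence is clear enough.

```python
def restore_nbsp_and_fix(text):
    """Put back a non-breaking space after "Ã", then undo the mojibake."""
    # Naive assumption for illustration: "Ã " was "Ã\xa0" before something "fixed" it
    restored = text.replace("\u00c3 ", "\u00c3\u00a0")
    try:
        return restored.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Build the damaged text: UTF-8 read as Windows-1252, then A0 turned into a space
mojibake = "voilà le".encode("utf-8").decode("windows-1252")
damaged = mojibake.replace("\u00a0", " ")

print(repr(damaged))                    # 'voilÃ  le'
print(restore_nbsp_and_fix(damaged))    # voilà le
```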
Does this seem gratuitous? These are things that show up both in ftfy’s testing stream and in real data that we’ve had to handle. We want to minimize the cases where we have to tell a customer “sorry, your text is busted” and maximize the cases where we just deal with it.
Normalization
NFC (the Normalization Form that uses Composition) is a process that should be applied to basically all Unicode input. Unicode is flexible enough that it has multiple ways to write exactly the same text, and NFC merges them into the same sensible way. Here are two ways to write más, listed one code point per line.
This is the NFC normalized way:
U+006D  m  [Ll] LATIN SMALL LETTER M
U+00E1  á  [Ll] LATIN SMALL LETTER A WITH ACUTE
U+0073  s  [Ll] LATIN SMALL LETTER S
And this is a different way that’s not NFC-normalized (it’s NFD-normalized instead):
U+006D  m  [Ll] LATIN SMALL LETTER M
U+0061  a  [Ll] LATIN SMALL LETTER A
U+0301  ́  [Mn] COMBINING ACUTE ACCENT
U+0073  s  [Ll] LATIN SMALL LETTER S
If you want the same text to be represented by the same data, running everything through NFC normalization is a good idea. ftfy does that (unless you ask it not to).
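You can see the two forms for yourself with Python’s standard unicodedata module. This sketch uses only the standard library; the describe helper is a hypothetical name of mine that prints a listing similar to the ones above.

```python
import unicodedata


def describe(text):
    """List each code point with its category and official name."""
    return [
        f"U+{ord(ch):04X} [{unicodedata.category(ch)}] {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


nfc = unicodedata.normalize("NFC", "ma\u0301s")   # composed: 3 code points
nfd = unicodedata.normalize("NFD", "m\u00e1s")    # decomposed: 4 code points

print("\n".join(describe(nfc)))
print("\n".join(describe(nfd)))
print(nfc == nfd)                                  # False: same text, different data
print(unicodedata.normalize("NFC", nfd) == nfc)    # True once both are normalized
```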
Previous versions of ftfy were, by default, not just using NFC normalization, but the more aggressive NFKC normalization (whose acronym is quite unsatisfying because the K stands for “Compatibility”). For a while, it seemed like normalizing even more was even better. NFKC does things like convert ｆｕｌｌｗｉｄｔｈ ｌｅｔｔｅｒｓ into normal letters, and convert the single ellipsis character … into three periods.
But NFKC also loses meaningful information. If you were to ask me what the leading cause of mojibake is, I might answer “Excel™”. After NFKC normalization, I’d instead be blaming something called “ExcelTM”. In cases like this, NFKC is hitting the text with too blunt a hammer. Even when it seems appropriate to normalize aggressively because we’re going to be performing machine learning on text, the resulting words such as “exceltm” are not helpful.
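The difference is easy to check with the standard unicodedata module. This sketch compares NFC and NFKC on the examples mentioned above; the dictionary keys are just labels I chose.

```python
import unicodedata

samples = {
    "fullwidth": "\uff46\uff55\uff4c\uff4c",   # ｆｕｌｌ
    "ellipsis": "\u2026",                       # …
    "trademark": "Excel\u2122",                 # Excel™
}

nfc = {name: unicodedata.normalize("NFC", s) for name, s in samples.items()}
nfkc = {name: unicodedata.normalize("NFKC", s) for name, s in samples.items()}

for name in samples:
    print(f"{name}: NFC gives {nfc[name]!r}, NFKC gives {nfkc[name]!r}")
```

NFC leaves all three samples unchanged, while NFKC rewrites every one of them, including the trademark sign that we would rather keep.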
So in ftfy 4.0, we switched the default normalization to NFC. We didn’t want to lose the nice parts of NFKC, such as normalizing fullwidth letters and breaking up the kind of ligatures that can make the word “ﬂuﬃeﬆ” appear to be five characters long. So we added those back in as separate fixes. By not applying NFKC bluntly to all the text, we change less text that doesn’t need to be changed, even as we apply more kinds of fixes. It’s a significant change in the default behavior of ftfy, but we hope you agree that this is a good thing. A side benefit is that ftfy 4.0 is faster overall than 3.x, because NFC normalization can run very quickly in common cases.
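One way to picture “NFC plus a few targeted fixes” is the sketch below. It is an illustration under my own assumptions, not ftfy’s code: the function name is hypothetical, the fullwidth table covers only the fullwidth ASCII range U+FF01 to U+FF5E, and the ligature table covers only the Latin ligatures U+FB00 to U+FB06.

```python
import unicodedata

# Fullwidth ASCII variants sit exactly 0xFEE0 above their ordinary ASCII twins
FULLWIDTH = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}

# Latin ligatures, expanded using their compatibility decompositions
LIGATURES = {
    cp: unicodedata.normalize("NFKC", chr(cp)) for cp in range(0xFB00, 0xFB07)
}


def gentle_normalize(text):
    """Apply the targeted fixes, then NFC, without the rest of NFKC."""
    text = text.translate(FULLWIDTH).translate(LIGATURES)
    return unicodedata.normalize("NFC", text)


print(gentle_normalize("\ufb02u\ufb03e\ufb06"))       # fluffiest
print(gentle_normalize("\uff46\uff55\uff4c\uff4c"))   # full
print(gentle_normalize("Excel\u2122"))                # Excel™ (left alone)
```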
Future-proofing emoji and other changes
ftfy’s heuristics depend on knowing what kind of characters it’s looking at, so it includes a table where it can quickly look up Unicode character classes. This table normally doesn’t change very much, but we update it as Python’s unicodedata gets updated with new characters, making the same table available even in previous versions of Python.
One part of the table is changing really fast, though, in a way that Python may never catch up with. Apple is rapidly adding new emoji and modifiers to the Unicode block that’s set aside for them, such as 🖖🏽, which should be a brown-skinned Vulcan salute. Unicode will publish them in a standard eventually, but people are using them now.
Instead of waiting for Unicode and then Python to catch up, ftfy just assumes that any character in this block is an emoji, even if it doesn’t appear to be assigned yet. When emoji burritos arrive, ftfy will be ready for them.
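The idea can be sketched as a lookup that checks a code point range before consulting Python’s unicodedata. The range U+1F300 to U+1F6FF, the “emoji” label, and the function name are all assumptions I made for illustration; the block boundaries and class labels that ftfy really uses may differ.

```python
import unicodedata

# Assumed range for illustration only; ftfy's real table may use other boundaries
EMOJI_BLOCK = range(0x1F300, 0x1F700)


def character_class(ch):
    """Classify a character, treating the whole assumed emoji block as emoji."""
    if ord(ch) in EMOJI_BLOCK:
        # Assigned or not in this Python's Unicode data, call it an emoji
        return "emoji"
    return unicodedata.category(ch)


print(character_class("\U0001F596"))   # emoji
print(character_class("a"))            # Ll
```

The answer for a code point in the block does not depend on which Unicode version your Python ships with, which is the point of the shortcut.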
Developers who like to use the UNIX command line will be happy to know that ftfy can be used as a pipe now, as in:
curl http://example.com/api/data.txt | ftfy | sort | uniq -c
The details of all the changes can be found, of course, in the CHANGELOG.
Has ftfy solved a problem for you? Have you stumped it with a particularly bizarre case of mojibake? Let us know in the comments or on Twitter.
Have you seen this new whitepaper? This whitepaper was developed by Data Driven Business with contributions from the speakers associated with the 14th Text Analytics Summit East – taking place on June 15th & 16th in New York City.
In fact, we will be attending the event, and delivering a presentation with our esteemed client, Intel, on June 16 at 9:00 a.m. ET. Josh Ritchie, Regional Sales Director at Luminoso, and Roop Gill, Manager, Insights & Analytics at Intel, will present on the advantages of automated insights and machine learning based on work performed using Luminoso.
But, back to the report – it really hits the nail on the head when it comes to the text analytics world that we know so well, so much that it almost feels like it was written FOR us and our customers!
If you haven’t seen the report, we highly recommend that you download it from this link here. We’ve highlighted a few passages below that really speak to us.
This passage could be referring to the work that we do at-large.
They could be directly speaking to our real-time enterprise listening solution, Compass.
If you’re interested in attending the Text Analytics Summit, please let us know! We can provide you with a discount code that will give you $300 off of the regular purchase (not to mention, we’d just love to see you there anyway). Please reach out to me at firstname.lastname@example.org, if interested.
As always, if you have any questions that we can answer for you about our work, and how we can help you and your business with our best-in-class enterprise feedback and analytics solutions, please contact me at the e-mail above or generally at email@example.com.
Our development team has been working hard to make some really valuable improvements to our solutions. Check out what they’ve been working on here, including:
- We now support Dutch! (Analytics Platform)
- Saved Topics are now presented as a scrolling list (Analytics Platform)
- Fix for an issue where browsers crashed in projects with large numbers of subsets and topics (Analytics Platform)
- Now you can specify negative seeds to exclude keywords used in listening for Twitter content (Compass)
- Fix for an issue where initial topic velocity was sometimes incorrectly calculated (Compass)
- Fix for issue where topic velocity and acceleration did not take into account the sampling rate for Twitter messages (Compass)
If you ever have any questions at all, please don’t hesitate to reach out to us at firstname.lastname@example.org.
Greetings from the world of listening and understanding! And now for another installment from our Denise Christie and what she’s been following through our real-time enterprise listening solution, Compass.
What we saw this week:
- Drought-shaming on Twitter is a thing. People post photos of neighbors who are watering their lawns or otherwise wasting water and add snarky comments.
- Meanwhile, William Shatner has proposed a solution to the drought: building a pipeline from California to Seattle to take advantage of all the rain they get. He’s planning to launch a $30 billion Kickstarter campaign. Most tweeters mocked the plan.
- Starbucks received heavy criticism from tweeters and the general public after an article revealed that one of its bottled water brands, Ethos, is sourced and bottled in California. Californian tweeters were angry that part of the state’s limited water supply is being shipped to and sold in other states without a water crisis. In response, Starbucks announced that they would be moving their bottled-water operations to Pennsylvania.
- Jezebel knows how to get click-throughs! Earlier this week they had an article with the headline “Is the California Drought Producing More Homeless Kittens?” And most of Twitter facepalmed.
- And finally, today it is raining in SoCal. Tweeters are very happy. (Side note: as a native Californian, I have to say that this is the first time I’ve seen Californians happy about a rainstorm.)
More news from the local doughnut shop:
- Dunkies announced this week that they would remove titanium dioxide from their doughnuts after coming under pressure from a public interest group who said that it is not safe for human consumption. (It is apparently used to make powdered sugar appear brighter, and is also used in sunscreen and paint.)
- An entire cluster formed around people tweeting in dismay about how long the line at Dunkin’ Donuts is.
- In other news, Periscope has reached a new low.
Ms. Fiorina announced that she’s running for president on Monday.
- Things did not start out well for Ms. Fiorina. She failed to register carlyfiorina.org as a domain name, and one of her critics took advantage of the oversight to build a visualization of how many people she laid off at Hewlett-Packard.
- Fiorina responded by joking about it during an interview with Seth Meyers. She bought sethmeyers.org, set it up to redirect to her campaign website, and joked with him that he’d better be nice to her during the interview.
- Finally this week, she set up a global town hall meeting on Periscope. Most tweets were just links to the Periscope chat.
And, as always, never hesitate to reach out to us if you’d like to learn more about Luminoso and our enterprise listening and analytics solutions. You can contact us at email@example.com. Until next week, folks!
There were some pretty interesting things going on last week, as there are every week! Let’s take a look at what our Denise Christie was following and what she discovered on Compass!
What we saw last week:
- Blueberry iced coffee and doughnuts are apparently a thing, and lots of people seem to want them.
- On the other hand, people scoffed at a Dunkin’ Donuts commercial announcing that Dunkin now serves guacamole “made with REAL avocados!” Many tweeted something along the lines of “how else would you make guac???”
Obergefell v Hodges
The Supreme Court heard arguments on marriage equality last Tuesday.
- On Monday, people were reminding their followers that arguments for the Obergefell v Hodges case would be heard on Tuesday. There were general messages of support, links to news articles, and requests for people to join their rallies.
- On the flip side, there was a backlash against Ruth Bader Ginsburg’s statement that she has “already made up her mind on gay marriage.” Some supported her statement, while others said that it was clear she would not be listening to the arguments with an open mind, and a few people tweeted that she should go read her Bible.
- Meanwhile, people in Ireland are gearing up for a marriage equality referendum that will take place on May 22.
Prom and Promposals
- Apparently, one boy rented a private plane for prom. (???) Other high schoolers were just as dumbfounded: they commented about it being a waste of money, questioned how he was able to afford it, and wondered why the couple needed to fly in the first place.
- Justin Bieber crashed somebody’s prom. A cluster of conversation very quickly formed around girls asking him if he would go to prom with them too.
- Promposals that have gone viral:
- A straight guy asked his gay best friend to go to prom with him. People on Twitter shared the link to the photo/video, and commented about how sweet a gesture it was.
- A teenager of Middle Eastern descent was suspended and banned from prom for wearing a fake bomb vest as part of his promposal. He defended his actions and said school administrators were racist. While most tweets were just links to relevant articles, those who commented thought the teenager’s actions were idiotic.
Questions? Comments? Controversy? We’d love to hear from you! If you’re curious about how Compass works and how it could work for you and your business, please reach out to us at firstname.lastname@example.org.