How disruptive will the new dawn of artificial intelligence be?

Luminoso was recently featured in a report put out by the Global Research Team at UBS about how artificial intelligence will affect the automation of job functions, and the broader business implications of that shift.

Luminoso was identified as a natural language processing winner that can assist with monitoring (news, e-mails, social media) and with consumer analytics, uncovering hidden patterns in qualitative research.

Check out the report and its findings here!

wordfreq 1.2 is better at Chinese, English, Greek, Polish, Swedish, and Turkish

[Image: wordfreq 1.2 example code, with examples in Chinese and British English.]

In a previous post, we introduced wordfreq, our open-source Python library that lets you ask “how common is this word?”
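
If you haven’t tried it, the basic call looks like this. (A minimal sketch; the frequency values in the comments are approximate and will vary by version.)

    from wordfreq import word_frequency

    # How common is a word, as a proportion of all words in that language?
    print(word_frequency('the', 'en'))    # roughly 0.05
    print(word_frequency('café', 'fr'))
    print(word_frequency('谢谢', 'zh'))    # 'thank you'; no spaces needed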

Wordfreq is an important low-level tool for Luminoso. It’s one of the things we use to figure out which words are important in a set of text data. When we get the word frequencies figured out in a language, that’s a big step toward being able to handle that language from end to end in the Luminoso pipeline. We recently started supporting Arabic in our product and improved Chinese enough to take the “BETA” tag off it; having the right word frequencies for those languages was a big part of that.

I’ve continued to work on wordfreq, putting together more data from more languages. We now have 17 languages that meet the threshold of having three independent sources of word frequencies, which we consider important for those word frequencies to be representative.

Here’s what’s new in wordfreq 1.2:

  • The English word list has gotten a bit more robust and a bit more British by including SUBTLEX, which adds word frequencies from American TV shows as well as the BBC.
  • It can fearlessly handle Chinese now. It uses a lovely pure-Python Chinese tokenizer, Jieba, to handle multiple-word phrases, and Jieba’s built-in wordlist provides a third independent source of word frequencies. Wordfreq can even smooth over the differences between Traditional and Simplified Chinese (see the sketch after this list).
  • Greek has also been promoted to a fully-supported language. With new data from Twitter and OpenSubtitles, it now has four independent sources.
  • In some applications, you want to tokenize a complete piece of text, including punctuation as separate tokens. Punctuation tokens don’t get their own word frequencies, but you can ask the tokenizer to give you the punctuation tokens anyway.
  • We added support for Polish, Swedish, and Turkish. All those languages have a reasonable amount of data that we could obtain from OpenSubtitles, Twitter, and Wikipedia by doing what we were doing already.
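
Here’s a minimal sketch of the Chinese handling and the punctuation option in action, assuming the tokenizer exposes an include_punctuation keyword; the exact tokens may vary by version:

    from wordfreq import tokenize, word_frequency

    # Jieba splits Chinese text into multi-character words
    print(tokenize('谢谢你', 'zh'))    # e.g. ['谢谢', '你']

    # Traditional and Simplified spellings should map to the same frequency
    print(word_frequency('谢谢', 'zh') == word_frequency('謝謝', 'zh'))

    # Ask for punctuation tokens explicitly; they carry no frequencies
    print(tokenize('Hello, world!', 'en', include_punctuation=True))
    # e.g. ['hello', ',', 'world', '!']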

When adding Turkish, we made sure to convert the case of dotted and dotless İ’s correctly. We know that putting the dots in the wrong places can lead to miscommunication and even fatal stabbings.
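
The fix amounts to special-casing the two I’s before the usual lowercasing. Here’s a sketch of the idea (wordfreq’s actual implementation may differ):

    def turkish_lower(text):
        # Python's default str.lower() maps both 'I' and 'İ' toward dotted i,
        # but Turkish pairs dotless I/ı with dotted İ/i
        return text.replace('İ', 'i').replace('I', 'ı').lower()

    print(turkish_lower('İSTANBUL'))   # istanbul (dotted i)
    print(turkish_lower('ISPARTA'))    # ısparta (dotless ı)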

The language in wordfreq that’s still only partially supported is Korean. We still only have two sources of data for it, so you’ll see the disproportionate influence of Twitter on its frequencies. If you know where to find a lot of freely-usable Korean subtitles, for example, we would love to know.

Let’s revisit the top 10 words in the languages wordfreq supports. And now that we’ve talked about getting right-to-left right, let’s add a bit of code that makes Arabic show up with right-to-left words in left-to-right order, instead of middle-to-elsewhere order like it came out before.

[Image: code showing the top ten words in each language wordfreq 1.2 supports.]
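
The code from the image isn’t reproduced here, but a sketch in the same spirit might look like this, wrapping each Arabic word in Unicode directional isolates so the words print in left-to-right list order (this assumes wordfreq’s available_languages and top_n_list functions, and a display that supports isolate characters):

    import wordfreq

    for lang in sorted(wordfreq.available_languages()):
        words = wordfreq.top_n_list(lang, 10)
        if lang == 'ar':
            # U+2068 (first strong isolate) and U+2069 (pop directional
            # isolate) keep each right-to-left word from reshuffling its
            # left-to-right neighbors
            words = ['\u2068' + word + '\u2069' for word in words]
        print(lang, ' '.join(words))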

Wordfreq 1.2 is available on GitHub and PyPI.

#NPS, #WorkforceAnalytics & #Science Webinars in November!

Luminoso has scheduled a series of webinars in November covering a variety of analytics topics and how analytics can support and improve many parts of your business.

Please join us for any and all!


Tapping Into Customer Feedback To Understand NPS Scores

Thursday, November 5, 2015
Time: 2:00 p.m. ET

A global leader in web hosting services surveyed their customer base to gauge satisfaction, pain points, and drivers of loyalty. In addition to providing a quantitative NPS score, respondents answered an open-ended question in which they explained the reasons for their score.

With the knowledge that customer satisfaction scores have a direct impact on revenue, customer retention, word-of-mouth recommendations, and future business, this global web hosting services company put a priority on more than just collecting scores.

Join Luminoso’s 30-minute NPS-focused webinar to learn how we helped this global web hosting services company identify and understand the key issues and themes driving their scores.


Workforce Analytics

Tuesday, November 17, 2015
Time: 2:00 p.m. ET

Is Amazon a sign that the tech sector has a work-life balance problem? Or is a work-hard/play-hard culture an allure for top talent? The NY Times offers one view; backlash and counterpoints to the Times story keep cropping up. The data reveals something altogether different.

We looked at company reviews and uncovered surprising insights.

Join Luminoso’s 30-minute workforce analytics webinar to hear:

  • Counterintuitive links between cutthroat culture and employee satisfaction
  • How Amazon compares to peers competing for talent
  • A methodology to put employee feedback in context
  • Ways to identify if a problem needs fixing, and how pervasive it is


What is ConceptNet?

Thursday, November 19, 2015
Time: 2:00 p.m. ET

You might have heard about ConceptNet in the news. You might also have heard about ConceptNet because not only is it fundamental to our platform, but our founders created it during their time at the MIT Media Lab.

ConceptNet is a semantic network of relationships between words and phrases that computers can use to interpret how humans think about the world. Luminoso is the only company that has been able to commercialize ConceptNet, helping its customers improve various parts of their organizations.

Please join us for 30 minutes to learn more about the brain that powers Luminoso.


Can’t make it to a webinar? Register anyway! We’ll be sure to send you a recording of the webinar.

Celebrating our Partnerships

One of our newest partners is a big box retailer. We use the term “partner” in the sense that we consider all of our customers to be partners. We started to develop a relationship with this company a few months ago, when we learned that they were seeking innovative methods of managing customer experience and customer service feedback.

This big box retailer was primarily working with customer service-focused data, including:

  • post-call chat survey data
  • OpinionLab data (multi-touch customer feedback)
  • traditional open-ended survey data about customer experience and customer service

We worked with this company to identify its goals. They sought to be nimble enough to discover where customers struggle and when they become frustrated on their path to purchase, and to discover issues in particular areas on which they weren’t initially focused. They also wanted the ability to quickly rectify those issues and to engage customers with solutions.

The road to partnership is not always easy. There are others out there who also seek to become partners with the same companies that we do. However, we come to the table with a unique and nimble approach, which our partners have validated time and time again.

When we form new partnerships, we also make a point to explore why we were successful. In this case, here’s why we made a good partner for this big box retailer.

  1. Providing immediate time to value. Our ability to derive insight from your data immediately is unrivaled. Because we don’t require any setup (building dictionaries, keywords, or ontologies), we save you weeks (not to mention professional services dollars)!
  2. Automated and dynamic insight discovery. With our approach to learning your data on the fly, Luminoso has the ability to find those “unknown unknowns” that you might not be focused on or think to search for in the first place.
  3. Continuing on a theme, because we have no need to pre-program our platform, we are extremely adaptable. Our newest partner found considerable value in our ability to identify themes across industry products and services, as well as jargon particular to their business.
  4. Lastly, we were ultimately successful because of our approach to partnership development. We’re not just trying to sell you a box. We see your problem as our own, and we seek to work together to develop solutions.

We hope that you celebrate your partnerships in the same way that we do. We would certainly enjoy the opportunity to learn how we can work together to develop customer experience and customer service solutions.

Capturing the Whole Voice – #CX in the age of #IoT

Last week, Dmitry Grenader, Luminoso’s Director of Product Management, attended Xperience2015, a conference on the Internet of Things, put together by the nice folks at Xively (LogMeIn). The following is his account of the conference.

It was a fascinating conference in many ways, and it opened my eyes to the brave new world of connected devices, ever-increasing computing power, and a potential super-aware future where everything is linked to everything. That also means everything is sending messages and signals: the toaster that is done, the air-conditioning unit that is overheated, the printer about to run out of paper.

It just blew my mind – because that means every Thing will have a voice, just like people. Granted, a printer does not communicate like humans do; it sends discrete signals about its status, and it is probably not going to wax poetic to customer service about its issues. Still, additional messages will be sent – it’s like the twittersphere for things – maybe those printers will complain to each other about a lack of paper :) But even if not, all of that provides an enormous opportunity for companies embracing IoT.

Instead of focusing on devices that are “sick” or in need of “support”, companies will focus on keeping them healthy. Customer support will become not just more proactive but pre-emptive: instead of a customer calling and complaining about an issue, the company will fix the issue before it arises and then notify the customer of the problem that was avoided.

What is amazing is that there are companies today who already do that sort of thing. Lutron for lighting, Sato for connected printers, SureFlap for pets, and Symmons right here in my Boston backyard for connected shower-heads. I am sure there are many others.

Wow. That is a cool world – even if I do not see all the implications (like someone hacking into my toaster). What I do see, however, is a very interesting convergence of various signals:

  • Voice of the Customer, as spoken (shouted?) or otherwise shared through support and other channels
  • Voice of the Device, as captured by to-be-built systems as the IoT era is ushered in

Combining the “Voice of the Customer”, analyzed with NLP and AI, with the “Voice of the Device” will give businesses of the future the ability to capture the Whole Voice.

Are you an IoT company getting signals from your devices? Do you get a flood of feedback from customers? Reach out to me to brainstorm about how to connect the feedback from devices with the feedback from humans.

Thou Shalt Use Open-Ended Questions (or thou shalt not hear your customer’s voice)

This is Dmitry Grenader, Luminoso’s Director of Product Management, writing to challenge the status quo in market research and voice-of-the-customer surveys, where multiple-choice questions dominate the scene and dwarf open-ended questions.

Why do we love multiple-choice questions so much? There are a few reasons.

  • They are easy to create!  Let’s say I am surveying folks on their favorite ice-cream flavors – it’s soooo easy for me to just fill in “strawberry, vanilla, chocolate, green tea, …”.
  • They are also easy to take – they are a perfect “don’t make me think” task – as a respondent I just pick the one I like and move on.
  • They are easy to analyze – you get your bar-graphs and pie-charts quickly, and the statistically inclined might even look at the mean and (ahem) standard deviation.
  • If distributed to enough people, the results are statistically significant.

What about open-ended questions? Why do we shun them like the Wizarding world of Harry Potter shuns squibs? The main reason is that the analysis is just too damn hard.

Imagine some poor schmuck – sorry, I mean a Product Manager (like me) – trying to analyze results and having to look through 10,000 responses. Impossible! I mean c’mon, people – I only get a two-day extension from the CEO to research and validate the product’s viability before we commit like two million dollars to it, and I had to promise my first-born for even that reprieve – I surely don’t have time to read what thousands of people said. So what do I do? Well, I read through a few of them, pick the few juicy comments that support my view of the world plus a couple of negative ones to appear balanced, copy-n-paste them into my PowerPoint deck – and call it done. OK, I don’t really do that, and I am sure you don’t either – but you know, those other people do.

Bottom line – multiple-choice questions are the main dish, whilst the open-ended ones are a side dish at best, an afterthought forever destined to be a lowly “catch-all”. And everyone is OK with it.  Well, I am not!

There are huge problems with this tyranny of the multiple-choice:

  • Options are leading the witness! Let’s face it – when you ask about a favorite ice-cream flavor and give a choice of five, and the user selects one, you have no idea whether the user would actually have chosen something else. They may have a real favorite – but it was not listed.
  • By pre-creating the choices, you are quite literally biasing the user to respond within the framework you have created – so ultimately you are only getting answers within the space of answers you yourself pre-cooked (there is something from quantum theory in this – something about the eye of the beholder and how what you find is determined by how you look for it).
  • They are not allowing your customers to fully express their voice. It is not hard to imagine the internal dialog that happens in your customer’s head when they are asked to choose between strawberry and vanilla ice-cream flavors on the survey: “Well, I kind of like strawberry, but I wish it were more aromatic and not as sweet. The vanilla flavor is OK, but I recently tried this french vanilla by their competitor and it had a hint of pistachios, which I absolutely adored. I wonder if they can tweak their vanilla to be bolder.” – Wow! Mic drop. No, seriously – this is incredible feedback, the voice of the customer in its purest, most glorious, unrefined, primordial, and incredibly useful state. The kind of feedback that, if it falls on hearing ears, can make the product and company great and create lifelong brand champions. But NOOOOO, instead this comment gets lost and ends up on the cutting-room floor by default – ‘cause nobody had time to read it. Or worse, nobody dared to even ask the question in the first place.

I hope you are sharing in my righteous indignation, and pounding your fist on the table in outrage and screaming “IS THERE A BETTER WAY?!” The answer is yes.

There is a way to understand the feedback without having to read through the 10,000 responses yourself. You can do that by relying on the text analytics of a modern voice-of-the-customer platform. The technology has recently come of age: you can rely on machine-learning-based approaches to do the feedback reading for you and provide you with insights and the tools to understand what your customers are saying and how they are feeling. Distilling feedback into concepts and topics, finding the connections between them, and surfacing the emotional context enable you to quickly and effectively analyze responses. Think of it as “augmented reality”: a lens through which you can read thousands of responses and understand what is contained in them. And you know what else? This technology can turn qualitative into quantitative – i.e., you can get actual numbers, statistically significant ones, on how people feel about various aspects of your business.

This breakthrough allows you to rethink how you construct your surveys and hear more of your customers’ voice. Delve into nuances! Be provocative! Ask really open-ended questions like:

What can we do to serve you better, Ms./Mr. Customer?

Do not shy away from the uncertainty – embrace it; you can now, with voice-of-the-customer technology in your corner. You will be amazed at the results you get.

P.S. As for the multiple-choice questions, keep those too, of course – but remember, they are not a panacea.

Can we do Arabic?

And now, a message from our Senior Linguistics Developer, and one of our original Luminosi, Dr. Lance Nathan…

What happens when the CEO of Luminoso comes to my office and asks, “Can we do Arabic?”

In general, when people ask me whether Luminoso’s software can handle a language we don’t yet support–Estonian, Esperanto, Klingon, what have you–my answer is always “yes, of course”. Admittedly, I follow this up with “That is to say, you can put it into the system and see what happens”…which is my answer because “handling” a language involves a number of complicated factors. We’d like to have some background knowledge in the language, and we’d like a word frequency list (see Rob’s blog post from earlier this month for more on that topic).

But the thing we need most is software to parse the text: to break it up into words and to give us base forms we can use to represent those words. Without that, analysts are left looking at our software and thinking, “Well, here’s what e-book users say about ‘reading’, and here’s what they say about ‘read’, and here’s what they say about ‘reads’, and…why are these different concepts?”. Of course, they’re not different concepts, but if you did put Klingon into our system, it wouldn’t know that be’Hom and be’Hompu’ are the same concept. (Those mean “girl” and “girls”. I had to look them up.) You would still find insights–you’d probably learn that “battle” and “happiness” are closely related in Klingon–they just wouldn’t be quite as solid as they would be if we had a parser.
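
To make the “read”/“reads”/“reading” point concrete, here is what base forms look like in English. (This uses NLTK’s WordNet lemmatizer purely as an illustration; it is not the parser Luminoso uses, and it requires downloading the WordNet data first.)

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download('wordnet')    # one-time download of the WordNet data
    lemmatizer = WordNetLemmatizer()
    for form in ['read', 'reads', 'reading']:
        print(form, '->', lemmatizer.lemmatize(form, pos='v'))
    # all three forms map to the single base concept 'read'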

So when the CEO comes to my office and asks, “Can we do Arabic?”, I give this explanation, ending with something like “So all we would need is software that can convert plurals to singulars and so forth.” At which point she says to me, “Terrific! Get right on that” – and I am reminded that talking to your CEO is different from talking to most other people. (Of course, to be fair, she knew we already had software that would do most of the work; my real task would be evaluating it and working around any idiosyncrasies I found.)

In truth, though, while the project looked daunting, it also looked exciting. Developing Russian for our product was an interesting journey, but in some ways a very familiar one. Russian has a different alphabet, but like English it forms plurals by putting a suffix on a noun, and forms tenses and other verb variations by putting a suffix on a verb, and so forth. All a parser has to do is recognize the word, take some letters off the end, and voilà: a root word that represents the base concept! Arabic doesn’t work that way at all.

How does Arabic work?

It turns out that there were two basic challenges to parsing Arabic, and the language’s approach to word formation was only the first one.

Take the Arabic root كتب, which is just the three consonants k, t, and b. It means “write”, and interspersing certain vowels will give you the words for “he wrote” (kataba), or “he writes” (yaktubu), or even “he dictates”, along with other vowels for the “I” form, the “you” form, and so forth. Add different vowels and you get a slew of related nouns: “book” (kitaab) or “library” (maktaba) or “office” (maktab)…to say nothing of the vowels you would change those to if you wanted a plural like “books” (kutub) or “offices” (makatib). All of which would be complicated enough, except that outside of the Qur’an, most of the vowels are almost never written, leaving a parser to reconstruct “yaktubu” from just “yktb”, and to know that “yktb” is the same concept as the verb “write” but not the noun “book”. This bears so little relation to English or French or Russian that I hesitated to even believe anyone could write a parser to handle it.
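
To see how far this is from stripping suffixes, here is a toy sketch of root-and-pattern word formation. (The digit notation for patterns is made up for illustration: 1, 2, and 3 stand in for the three root consonants.)

    def apply_pattern(pattern, root):
        # Interleave a vowel pattern with a three-consonant root; the
        # digits 1-3 in the pattern mark where the consonants go
        for i, consonant in enumerate(root, start=1):
            pattern = pattern.replace(str(i), consonant)
        return pattern

    root = 'ktb'  # the root meaning 'write'
    for pattern, gloss in [('1a2a3a', 'he wrote'),
                           ('ya12u3u', 'he writes'),
                           ('1i2aa3', 'book'),
                           ('ma12a3a', 'library'),
                           ('ma12a3', 'office')]:
        print(apply_pattern(pattern, root), '=', gloss)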

Fortunately, I didn’t have to write the parser; once I had one that worked, I would merely need to offer some guidance, correct it when it went astray, and decide which of its many outputs I wanted (yaktubu? yktb? ktb? something in between?). Unfortunately, the language’s rules for word formation were only the first problem; my second problem was that no one speaks Arabic.

Now, obviously that can’t be true; with over 240 million speakers, Arabic is the fifth most spoken language in the world. It turns out, however, that what no one speaks is standard Arabic – that is, Modern Standard Arabic, or MSA. When speaking formally or in an international setting, as at the United Nations or on Al-Jazeera, speakers do indeed use this standard form. Outside of such settings, speakers use their local dialect: Moroccan, Sudanese, Egyptian, Levantine, and many others. That extends to writing, especially in online forums like Twitter. Often the local written form matches the local spoken form. This is not unknown in online English, where someone might write “deez” instead of “these”, but it is much more common in written Arabic – and in Arabic, rather than producing a nonsense word, a small variation in the spelling of “these” gets you a word meaning “delirious”. (Which actually happens.)

Early in the career of a computational linguist, you learn that most language-processing systems are designed to work on standard versions of languages: a French parser may not handle quirks of Québecois French, an English parser probably used news articles as training data and won’t know many of the words it sees on Twitter. Any Arabic parser would similarly be based on Modern Standard Arabic; could it be convinced to handle dialects?

Of course, there was also a third problem I haven’t even mentioned: I don’t speak Arabic. But here at Luminoso, we don’t let minor technicalities stop us, so we contracted a native speaker to help me, I downloaded a few apps to teach me the alphabet, and off we went.

What a parser can (and can’t) do

On the bright side, writing a program to parse Arabic wouldn’t really be my job; I only needed to evaluate the ones available and build on those. Some initial exploration suggested that pretty good parsers did indeed already exist. All the same, putting Arabic in our system wouldn’t be as simple as dropping one into our software and letting it roam free.

Many Arabic parsers are built on the grammatical structures seen in the Qur’an, which is written in a language essentially the same as Modern Standard Arabic. They may therefore classify the prefix “l-” as ambiguous between the preposition “to” and an indicator of emphasis on the noun, but the latter is only used in literary Arabic (for instance, the Qur’an). We had to tell our software that if the parser categorized anything as an “emphatic particle”, it should go back and find another option.
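
In code, that rule is simple enough. Here is a sketch, with a hypothetical tag name and a made-up parser output format (a list of candidate analyses, each a list of (morpheme, tag) pairs):

    def pick_analysis(analyses):
        # Prefer the first analysis that avoids the literary-only
        # 'emphatic particle' reading of the l- prefix
        for analysis in analyses:
            if all(tag != 'EMPHATIC_PARTICLE' for _, tag in analysis):
                return analysis
        # Fall back to the parser's top choice if every reading uses it
        return analyses[0]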

But there were other, subtler problems inherent to the nature of Arabic grammar. An “a-” prefix on a verb might indicate a causative form; it’s this form that turns “he writes” into “he dictates” (i.e., he causes someone to write), or “to know” into “to inform” (i.e., to cause someone to know something). On the other hand, an “a-” prefix can also indicate that “I” is the subject of the verb. A good Arabic parser may return both alternatives, but we found that we couldn’t necessarily rely on our parser to guess which was right in a particular sentence. For this, I had to sit down with our native speaker and simply look at a lot of sentences and their parses, asking for each, “Did the parser return the right result here? What about here? If the result was wrong, was it at least a reasonable interpretation in context, or can we determine which result we wanted?”

In the end, we did have to accept some limitations of the parser. The Arabic word ما (“maa”) means “what”, but it is also used for negation in some circumstances, and deciding which was which proved too difficult for the computer. You see ambiguity in all languages, of course: in English, “can” might mean “is able to”, in which case it’s an ignorable common word, or it might mean “metal container”, in which case we wouldn’t want to ignore it. But most cases are easy to distinguish – you don’t even need the whole sentence to know which “can” is which in the phrases “the can” or “can see”. In this case, where both meanings are common function words, it became much harder to get reliable results.
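
As a toy illustration of how little context English “can” needs (unlike “maa”, where both readings are common function words), the immediate neighbors usually settle it:

    def can_sense(prev_word, next_word):
        # Hypothetical one-word-window disambiguation for 'can'
        if prev_word in {'the', 'a', 'tin', 'soda'}:
            return 'metal container'
        if next_word in {'see', 'do', 'go', 'help'}:
            return 'is able to'
        return 'unknown'

    print(can_sense('the', 'of'))    # metal container
    print(can_sense('we', 'see'))    # is able to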

The dialect problem never went away, but we did learn to minimize its effects. We included several common dialect spellings of function words on our “words to ignore” list, so that even if the parser thought they were nouns or verbs, we knew to skip them in our analysis. And we found that in an international data set like hotel reviews, there was enough Modern Standard Arabic for us to successfully gain insights from it. I’d want to fine-tune the program before loading, say, thousands of sentences of a single dialect, especially if that dialect varies significantly from the standard (Tunisian Arabic, for example, has influences from several European and African languages), but after the development we’ve already done, I’d be confident in our ability to do that fine-tuning.
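
A sketch of that ignore-list trick, with hypothetical entries (the standard spelling of one function word plus a common informal variant of another):

    # Words to skip during analysis, whatever the parser tags them as
    IGNORE = {
        'في',    # 'in' (standard spelling)
        'علي',   # a common informal spelling of 'على' ('on')
    }

    def content_tokens(tokens):
        # Keep only the tokens that can carry insight
        return [tok for tok in tokens if tok not in IGNORE]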

A final unexpected challenge came when we looked at the results in our visualizer: many things were backwards! Not the words, fortunately, but arrows would point in the wrong direction, text would align flush against the wrong edge, even quotation marks would appear at the wrong edge of the text. It turns out that many, many programs, including web browsers, simply despair when you mix text that reads left-to-right (like English) with text that reads right-to-left (like Arabic).


[Image: on the left, left to right (wrong); on the right, right to left (right). It’s as confusing as it sounds.]

That one turned out to be far easier to fix than we expected: style sheets for web pages allow you to specify that the direction of the text is right-to-left, at which point the browser flips everything to look the way it should.

What now?

In the end, I’m quite pleased at how well our system handles Arabic. Starting as a task that I knew would be hard and I feared would be simply impossible, this project has ended with the ability to find insights in Arabic text that I’d readily put up against our French or Russian capabilities. I can now tell people that I’ve taught a computer to understand Arabic, which may be an exaggeration, but it does still understand more Arabic than I do.

Adding Arabic also means that we can now find insights in the language of nearly 40% of the world’s population, including all six languages of the United Nations; and that we cover four of the five most spoken languages in the world – and who knows, perhaps Hindi will be next (unless Klingon turns out to have higher demand than I anticipated, in which case, Heghlu’meH QaQ jajvam).