February was an exciting month for us! Among other things, we came in first place in nearly every category we participated in at SemEval 2017, an annual international workshop on semantic evaluation. If those last two words have you scratching your head, read on. (If you and semantic vectors are old friends, check out our Chief Science Officer Rob Speer’s more detailed post on the topic here.)
One of the first questions we’re asked when we talk about Luminoso’s software is “So do I have to provide the training data set?” or “What do you use to train the system?” We’re usually met with baffled expressions when we truthfully say, “We don’t; we use word embeddings.”
Many people are familiar with traditional methodologies for analyzing text-based data. A long list of keywords that might be present in the data must be created, and then a computer hunts through text data and returns documents containing those keywords. Alternatively, you can train an algorithm to return documents matching certain criteria – but you need an initial dataset to train your algorithm, and the algorithm won’t know what to do with data that doesn’t match exactly with the training dataset.
Introducing word embeddings
Over the past few years, a new approach called “word embeddings” has been developed to more quickly and accurately analyze text-based data. It’s gotten popular enough that people have started making memes about it, which is saying something.
Rather than matching a new dataset to a predefined list of keywords (the equivalent of doing a Command-F search in a Word doc), word embeddings turn language into mathematical vectors. Vectors that are similar to each other represent words or phrases that are similar or highly related to each other in a dataset.
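To make "vectors that are similar to each other" concrete, here is a minimal sketch using cosine similarity, one common way to compare two vectors. The three-dimensional vectors and the words attached to them are made up for illustration; real embeddings have hundreds of dimensions and are learned from data, and this is not Luminoso's actual code.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, invented for this example only.
toy_vectors = {
    "coffee":   [0.9, 0.1, 0.0],
    "espresso": [0.8, 0.2, 0.1],
    "invoice":  [0.0, 0.1, 0.9],
}

related = cosine_similarity(toy_vectors["coffee"], toy_vectors["espresso"])
unrelated = cosine_similarity(toy_vectors["coffee"], toy_vectors["invoice"])
print(f"coffee vs. espresso: {related:.2f}")
print(f"coffee vs. invoice:  {unrelated:.2f}")
```

In this toy setup, "coffee" and "espresso" point in nearly the same direction and score close to 1, while "coffee" and "invoice" score close to 0.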
These word embeddings aren’t created in a vacuum; there are a number of different sources for creating these vectors. The most well-known are Google’s word2vec and Facebook’s fastText, but there are a number of others, including MIT’s ConceptNet, Stanford’s GloVe, and Luminoso’s proprietary ensemble of multiple systems.
How Luminoso uses word embeddings to analyze your data
Luminoso’s software starts out with its proprietary ensemble of systems that provide background knowledge about how the world works. In other words, our software has a good idea of what words mean before it sees a single sentence of your data.
Our software then reads through your data and refines its understanding of what words and phrases mean based on how they’re used in your data, allowing it to accurately understand jargon, common misspellings, and domain-specific meanings of words. This makes it easier to quickly understand data sets with lots of industry-, company-, or brand-specific terms.
SemEval 2017: Pitting Luminoso against everyone else
SemEval is a long-running evaluation of computational semantic systems, including word embeddings like the ones Luminoso uses. It does an important job of counteracting publication bias. Most organizations will only publish the results of evaluations where their system performs well, and omit findings where it doesn’t.
The evaluation organized by SemEval asks many groups to compete head-to-head on an evaluation they haven’t seen yet (i.e., a test with no practice questions or study guide), with results released all at the same time. When SemEval results come out, you can see a fair comparison of everyone’s approach, with positive and negative results.
This year’s evaluation was a typical word-relatedness task. You get a list of pairs of words, and your system has to assess how related they are, which is a useful thing to know in applications our clients ask for, such as text classification, search, and topic detection. The score is how well your system’s responses correlate with the responses that people give. Systems were evaluated on how well they judged the relatedness of word pairs within a single language (comparing a German word to another German word) or across two languages (comparing a German word to a Farsi word).
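As a rough illustration of scoring by correlation, the sketch below compares a system's relatedness scores with human ratings using Spearman's rank correlation. The choice of Spearman is an assumption made for this example; the task's official metric may differ. The ratings and scores are invented and are not SemEval data.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical ratings for four word pairs (higher = more related).
human_ratings = [3.8, 3.1, 0.4, 1.9]
system_scores = [0.91, 0.75, 0.05, 0.40]

print(f"Spearman correlation: {spearman(human_ratings, system_scores):.2f}")
```

Because the made-up system ranks the four pairs in the same order as the human raters, the correlation here is at its maximum of 1; a system that ranked them in the opposite order would score -1.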
We are thrilled to announce that we outperformed all other evaluated systems in every category we participated in, save one (in which we came in third… still not too shabby!).
In the first task we were asked to complete, identifying the relatedness of words in a single language, Luminoso came in first place in English, German, Spanish, and Italian. We came in third in Farsi. The first- and second-place winners in that category were Farsi-only systems.
The second task was identifying word relatedness for word pairs where each word is in a different language. Luminoso won in every category SemEval offered, including English-Spanish, English-Farsi, German-Spanish, and Spanish-Italian, amongst others.
For the full list of who participated in SemEval 2017 and what our exact scores were, you can find them here. If you have other questions about how Luminoso uses word embeddings, or the pros and cons of word embeddings compared to other systems, drop us a line; we’d love to hear from you!