Updates: Timelines, Subsets and Vectors

Here at Luminoso, we are committed to consistently improving the science of our system. Over the weekend, a few upgrades were made live in our Dashboard.

To help show what the new updates look like, we’ve taken a few screenshots from our Amazon Kindle Fire project.

Timeline Accuracy

Topic timelines are now based on all of the documents instead of a sample, making them more accurate than ever before.

Subsets

The scaling of topic-subset correlations has also been improved, so users can now make direct comparisons to topic timelines, in addition to more accurate comparisons between subsets.

Previous Timelines and Subsets

Old Timeline

Old Subsets

New Timelines and Subsets

New Timeline

New Subsets

Vectors

We’ve increased the accuracy of our vectors (the fundamental representation on which our visualizations and statistics are based), resulting in small improvements in the accuracy of correlation values.

 

Highlights

Searching through documents to understand how they relate to sentiment and topics of discussion can be a cumbersome process, especially with multi-page responses. We’ve introduced highlighting to our system – a feature that brings exact and conceptual matches to the reader’s attention, regardless of their location within the document. This provides users with instant insight into what drives discussion on a document level.

In this example, the term “charger” has been queried. The system automatically highlights all exact matches in blue, and conceptual matches in light grey.

[Image: exact matches for “charger” highlighted in blue, conceptual matches in light grey]

 

Highlights work on complex phrases and parts of multi-word concepts as well.

The system highlights exact matches, as well as logical variations of the word.

[Image: highlighting of exact matches and variations]

“Kindle Fire” – a portion of a multi-word concept – is highlighted when searched, while the entire concept is highlighted if “Kindle Fire HD” is queried.

[Image: highlighting for a “Kindle Fire” search]

For API users, we make the indices into the text for each term available on every document, so you can find relevant portions of documents faster.

[Image: highlighting for a “Kindle Fire HD” search]
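For illustration, here’s a minimal sketch of how a client might use those per-document term indices to pull out snippets around a term. The field names (“text”, “terms”) and the [start, end] span format are assumptions for the example, not the exact response schema.

```python
# A minimal sketch of using per-document term indices to extract snippets.
# The field names ("text", "terms") and the [start, end] span format are
# illustrative assumptions, not the exact Luminoso API response schema.

def snippets_for_term(document, term, context=30):
    """Return short text snippets around every occurrence of `term`."""
    text = document["text"]
    snippets = []
    for doc_term, spans in document.get("terms", []):
        if doc_term != term:
            continue
        for start, end in spans:
            lo = max(0, start - context)
            hi = min(len(text), end + context)
            snippets.append(text[lo:hi])
    return snippets

# Example: snippets_for_term(doc, "charger") would return the text around
# every place "charger" (as indexed by the API) appears in the document.
```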

Token Authentication

The Luminoso API now supports standard token-based authentication, making it easier to write an API client in any programming language: each request is authenticated simply by including a header containing an API token.
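As a sketch of what this looks like in practice, here’s a request made with Python’s requests library. The header scheme and URL below are placeholders; check the Luminoso API documentation for the exact header name and endpoints.

```python
# A minimal sketch of token-based authentication from Python. The header
# scheme and base URL are placeholders, not the documented Luminoso values.
import requests

API_TOKEN = "your-api-token"          # issued for your account
BASE_URL = "https://api.example.com"  # placeholder base URL

response = requests.get(
    BASE_URL + "/projects/",
    headers={"Authorization": "Token " + API_TOKEN},  # assumed header format
)
response.raise_for_status()
print(response.json())
```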

Picking a March Madness bracket using natural language text

The NCAA basketball playoffs are here. It’s time to make complicated bets on the outcome of a single-elimination tournament. Time for basketball fans to act like they know things about statistics, and statistics fans to act like they know things about basketball, and fans of both basketball and statistics to win stuff. Time for March Madness.

I enjoy following the March Madness results but I never have a good plan for how to fill out a bracket. When I saw a link to Coder’s Bracket, I thought it was an awesome idea: instead of making lots of arbitrary decisions, write a computer program to pick the bracket for you. But what do I know about any of these basketball stats that it uses as input?

What I know about is text analytics, and so Luminoso’s text analytics are going to pick a bracket for me. It may not win a billion dollars from Warren Buffett, but it’ll be fun.

Picking winners based on Twitter activity

I told our Twitter listener to find March Madness-related terms such as “NCAA”, “basketball”, “bracket”, and of course “March Madness”. I also gave it the full names of all the teams according to Wikipedia, in case those helped. I collected tweets from about noon to 5 PM today.

My assumption is that the teams that are more likely to win are the ones that more people are talking about on Twitter. If the Florida Gators end up matched against the Tulsa Golden Hurricane, we can see that more people are talking about the Gators than the Golden Hurricane and pick them as the winners.

The only problem is how to measure this. The full names of college basketball teams are often complicated and unlikely to be used in casual conversation, such as the “Stephen F. Austin Lumberjacks” or the “Louisiana-Lafayette Ragin’ Cajuns”. But shorter versions of the names are ambiguous. If someone says “Iowa”, are they talking about the Iowa Hawkeyes or the Iowa State Cyclones? If you say “Go Wildcats”, which of four teams are you cheering for? And are people who tweet #GoState even trying to communicate anything?

First, I should check my assumption. Can I just count the number of occurrences of the full name of the team? Not really. This fails for a few reasons:

  • The results are noisy. The numbers are small (about 5 to 100) and they change drastically when you use a slight variant of the team name.
  • The results may be biased, in that some teams may be more likely to be called by their full name than others.
  • Most of the people who talk like that are spammers auto-generating tweets. Unless I believe that spammers have some key insight on who’s going to win, I should probably disregard this data.

On the one hand, this sounds like a job for Luminoso and semantic similarity.

On the other hand, basically all of the semantic information that would link teams to nicknames is missing here. People will put “Gonzaga” next to other teams in a list of teams they’re rooting for more often than they’ll put it next to “Bulldogs” to form the whole team name, so there is no accurate way to determine automatically that “Gonzaga” and “Bulldogs” are the same team.

But Luminoso’s term statistics can tell us which terms seem to be interesting, and many of those are team names.

Assigning relevance to teams

The relevance function in Luminoso finds words and 2-3 word phrases that appear more than you’d expect from the general distribution of words in English. This isn’t just word frequency: phrases can be more relevant than their individual words if those words usually appear together, and it de-emphasizes common words like “next” (not to mention “the”) in favor of ones that are more interesting in this set of data.
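As a rough illustration of the idea (this is a simplified stand-in, not Luminoso’s actual relevance formula), you can think of relevance as how much more often a term shows up in this dataset than its background English frequency would predict:

```python
# A simplified stand-in for a relevance score: how much more often a term
# appears in this dataset than background English frequencies would predict.
# This is illustrative only, not Luminoso's actual relevance function.
import math

def relevance(term, corpus_counts, corpus_total, background_freq):
    """Log-ratio of observed frequency to expected background frequency."""
    observed = corpus_counts.get(term, 0) / corpus_total
    expected = background_freq.get(term, 1e-9)  # tiny floor for unseen terms
    if observed == 0:
        return float("-inf")
    return math.log(observed / expected)

# A term like "bracket" appears far more often in March Madness tweets than
# in everyday English, so it scores much higher than a common word like "next".
```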

We have an ordered list of relevant terms that we extracted from the collected tweets. It contains things like “tournament bracket, ncaatournament, Iowa, Gators, Duke, MSU, …”. The only problem, as described before, is assigning these relevant terms to teams.

We know now that we can’t hope for a clear one-to-one mapping (some of these aren’t teams anyway). But we can go through the list of terms in order, and ask: “Does this term appear in exactly one team name?”

Each team then ends up with the relevance score of its most relevant unique term. In the list above, for example, we’d start by skipping “tournament bracket” and “ncaatournament” because they don’t go with a team. We then skip Iowa because there are multiple teams from Iowa. We keep “Gators”, and we’d assign its score to the Florida Gators if they don’t already have a higher score.

I had to make sure to allow for abbreviations, such as “UConn” for the University of Connecticut Huskies and “NC State” for the North Carolina State Wolfpack. But once this was done, it seemed to be a fairly resilient way to rank the teams by their Twitter buzz. It was probably a bit unfair to the Kentucky Wildcats — both of their names are ambiguous, so they could only count for the full phrase “Kentucky Wildcats” — but that just goes to show that they should pick a more creative name.
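Here’s that assignment step sketched in Python. It assumes the relevant terms are already ordered from most to least relevant, and the alias table for abbreviations is only partially filled in; both are illustrative, not the exact code used.

```python
# A sketch of assigning relevance scores to teams. Assumes `relevant_terms`
# is a list of (term, score) pairs ordered from most to least relevant.
# The alias table is illustrative and only partially filled in.

TEAM_ALIASES = {
    "Connecticut Huskies": ["UConn"],
    "North Carolina State Wolfpack": ["NC State"],
    # ... one entry per team with a common abbreviation
}

def score_teams(relevant_terms, team_names):
    """Give each team the score of its best term that matches it uniquely."""
    scores = {}
    for term, score in relevant_terms:
        term_lower = term.lower()
        matches = [
            team for team in team_names
            if term_lower in team.lower()
            or any(term_lower == alias.lower()
                   for alias in TEAM_ALIASES.get(team, []))
        ]
        if len(matches) != 1:
            continue  # skip non-team terms and ambiguous ones like "Iowa"
        team = matches[0]
        scores[team] = max(scores.get(team, float("-inf")), score)
    return scores
```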

The results, including which term was the best for each team, are shown in this spreadsheet.

The bracket

Here’s the bracket and the code to generate it, on Coder’s Bracket.

Luminoso's March Madness bracket

There’s a tweak in the code that says that 15th and 16th seeds can’t win. I added that, not just because it’s a very good assumption, but also because otherwise we’d give too much weight to a team such as Cal Poly that was just about to play a play-in game as the data was being collected.
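The actual pick code lives on Coder’s Bracket, but the decision rule is simple enough to sketch here in Python, with the seed override included. The buzz scores come from the team-scoring step above.

```python
# A sketch of the pick rule: the team with more Twitter buzz advances, except
# that 15 and 16 seeds are never allowed to win their games.

def pick_winner(team_a, team_b, seeds, buzz):
    """Return whichever team should advance in a head-to-head game."""
    if seeds[team_a] >= 15 and seeds[team_b] < 15:
        return team_b
    if seeds[team_b] >= 15 and seeds[team_a] < 15:
        return team_a
    return team_a if buzz.get(team_a, 0) >= buzz.get(team_b, 0) else team_b
```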

Finally, a concept cloud

One of these posts wouldn’t seem complete without showing the semantic space of all these mostly-March Madness tweets as a concept cloud. The words and phrases here are sized by their relevance score, colored by major topics (the tournament in general, and teams that people are rooting for), and positioned so that related concepts are near each other.

Disregard Putin, he’s getting into everything these days.

March Madness concept cloud

The State of the Union

Tonight, President Obama gives his State of the Union address. It’s that time of year when we reflect on where the country is going, and then gossip about who in the audience made a faux pas on national TV, invent drinking games, and start up text analytics software.

The SotU address is a pretty good case for text analytics. It’s important. You want to know fundamentally what it’s about. But it’s also rather long and frequently takes a while to get to the point. Can’t we get a computer to listen to it, and just glance at a computer-generated visualization of what the President said, instead?

This is, of course, the kind of thing we do at Luminoso. And while tonight’s speech hasn’t happened yet, we can apply Luminoso’s software to examine the content of State of the Union addresses over the history of our country. Political rhetoric changes with the times, but we can use Luminoso to open new avenues of investigation into a general model of how politicians talk.

We built this model from transcripts of all 227 SotU speeches. The model captures the semantics of how words and phrases are used, based on its general background knowledge and the way words are used in these speeches in particular. For example, the model knows that the President is talking about approximately the same thing when he mentions “negotiation” and “arbitration”, or for that matter “poverty” and the “needy”. It knows that “combat”, “conquer”, “troops”, and “invasion” are all ways to talk about “war”, which is important because there is a lot of talk about war.

Here’s the overview Concept Cloud for all of the speeches.

[Image: Concept Cloud for all State of the Union speeches]

There’s more going on in a Concept Cloud than in a typical word cloud visualization. The words are grouped together in space so that words used in similar ways appear near each other, or in similar colors. The size of a word indicates how much information we get from it: this is why the largest words aren’t “the”, “I”, and “America”, but words relating to particular meaningful topics. Some of the informative words were predictable (“taxpayers”, “troops”, “negotiated”) and some of them less so (“civilized”, “gratifying”, “enlightened”).

We can track the usage of named entities (“Congress”, “Britain”, “Communists”), actions (“enacted”, “negotiated”), and descriptions (“civilized”, “belligerent”, “patriotic”). Some of these topics were legislative, some involved foreign policy, and quite a lot of our analytical topics were military, but then again, so were many of the speeches.

One thing we can do is break down the word usage by political party:

Concept Cloud for Democratic-Republican presidents (29 speeches)

Concept Cloud for Republican presidents (89 speeches)

Concept Cloud for Democratic presidents (90 speeches)

The Concept Clouds for speeches from different political parties turned up some interesting contrasts. Democratic-Republican presidents demonstrated presidential priorities from a different century, including the “Barbary” coast pirates, “fortifications”, and management of “militias”. Republican presidents tended to discuss foreign relations more bluntly (“confront”, “cooperate”, “allies”), while Democrats tended to be multi-lateral (“United Nations”, “aggression”, “neighbors”). On economic policies, Republicans discussed “inflation”, “incentives”, and “taxpayers”, while Democrats talked of “unemployment” and the “minimum wage”.

Now we can break it down by President, looking at the correlation between the content of their speeches and various topics. This is something we measure as a percentage, with 0% being the typical correlation between two arbitrary words. These numbers measure whether each President talks about various topics more or less than usual, based on the appearance of the key word and many related words in their speeches.

Presidents who discussed…

God
  The most:  Ronald Reagan (11% more), George H. W. Bush (9% more), George W. Bush (7% more)
  The least: Calvin Coolidge (11% less), Warren G. Harding (11% less), William Taft (10% less)

war
  The most:  Franklin D. Roosevelt (33% more), Harry S. Truman (25% more), James Madison (24% more)
  The least: Bill Clinton (7% less), Barack Obama (4% less), Warren G. Harding (2% less)

peace
  The most:  John F. Kennedy (32% more), Harry S. Truman (30% more), Dwight D. Eisenhower (27% more)
  The least: Theodore Roosevelt (28% less), Calvin Coolidge (15% less), Benjamin Harrison (11% less)

taxpayers
  The most:  Barack Obama (48% more), Bill Clinton (47% more), Lyndon B. Johnson (46% more)
  The least: John Adams (34% less), Zachary Taylor (32% less), Abraham Lincoln (31% less)

the Constitution
  The most:  Andrew Johnson (57% more), James Buchanan (57% more), Rutherford B. Hayes (40% more)
  The least: Barack Obama (18% less), John F. Kennedy (17% less), Bill Clinton (15% less)

When the SotU speeches were analyzed by president, some interesting variations turned up, some of which were caused by events (“war”, “peace”) and some of which were the result of period vocabulary (“taxpayers”, “Constitution”).

A Luminoso analysis can output a lot of numbers, but usually we make the most out of topic-topic correlations, which tell you to what extent any two concepts are related. Many strong correlations were predictable (Senate/ratified 62%, repeal/law 74%), but some were quite interesting. The correlation between “Supreme Court” and “impartial” (77%) suggests that the Supreme Court has always been viewed as the fair arbiter of American legislation. The strong link between “Communists” and “Fascists” (41%) indicates that US presidents have talked about these different political movements in quite similar ways. “England” and “Britain” were 91% related, while “Russia” and “Soviet Union” were only 35% correlated, accurately reflecting the differences that these names represent.
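As a rough illustration of what sits underneath a topic-topic correlation (the exact measure Luminoso uses isn’t spelled out here), you can picture cosine similarity between two concept vectors, reported as a percentage:

```python
# An illustrative stand-in for a topic-topic correlation: cosine similarity
# between two concept vectors, expressed as a percentage. The `vectors` dict
# is hypothetical; this is not necessarily Luminoso's exact measure.
import numpy as np

def correlation_pct(vec_a, vec_b):
    """Cosine similarity between two concept vectors, as a percentage."""
    cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    return 100.0 * cosine

# e.g. correlation_pct(vectors["Senate"], vectors["ratified"]) could land
# near the 62% reported above, under this kind of measure.
```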

The most interesting results came from the topic-timeline data, which measures how much any given concept (and its related ideas) were discussed over time:

[Image: topic timelines for political entities]

Republicans and Democrats alike will unite in bipartisan indignation to find that, as far as presidential SotU addresses are concerned, they are practically identical entities. However, they may be heartened to learn that their share of the conversation is increasing. Meanwhile, the “Constitution” has been virtually ignored since the start of the 20th century, and both “Congress” and the “Senate” have held steady in terms of relevance in SotU speeches.

[Image: topic timelines for legislation-related terms]

Interestingly, terms relating to legislation have been in decline since the early 20th century. This could be because presidents switched from discussing specific legislation to broad policy goals, but it’s more likely that this is the result of emphasizing foreign over domestic policy.

[Image: topic timelines for rhetorical terms]

After World War II, the SotU became a message not just for American “citizens”, but for the international community. The role of America in the world changed, and the role of the State of the Union speech did as well. Presidents abruptly began using the SotU address to tell everyone what they were going to do about the communists, fascists, and the imperialists.

[Image: topic timelines for military vocabulary]

World War II profoundly changed the way US presidents discuss all military actions. It used to be about “maritime” rights and dealing with “belligerents” and “hostilities”. After WWII, the new watchwords were “security” and countering “aggression”. Most notably, it was after WWII that US presidents decided that they were the leaders of the “free world” and said so.

Will Obama’s address tonight represent the natural continuation of historical trends, or will it cover unprecedented new topics? We’ll find out tonight, and so will our software.

Luminoso Webinar TOMORROW, Wednesday, January 15: The State of the Union Address

There have been 224 State of the Union addresses delivered by 44 sitting United States Presidents, and Luminoso has read them all. What are some of the general themes over time? Has the language, tenor or sentiment changed since George Washington delivered the first State of the Union address in 1790?

During this interactive webinar, analytics engineer Dr. Alice Kaanta will demonstrate how Luminoso came to understand each of the State of the Union addresses. Alice will also be available to answer your general questions about how Luminoso derives deep insights from unstructured text.

We hope you will listen in!

Please register here.

Luminoso on Italian food

Italian food

As we did with the local Chinese restaurants (http://blog.luminoso.com/2014/01/08/luminoso-on-chinese-food/), Luminoso analyzed thousands of Yelp reviews for all Italian restaurants within a five mile radius of our offices in Cambridge, which helpfully includes Boston’s North End. While the results from both analyses are very similar, one key difference stuck out, indicating that the type of cuisine we choose to eat creates different customer expectations.

Discussion of the food dominates reviews of Italian restaurants on Yelp, and the more the review talks about the food the more likely the reviewer is to rate the restaurant highly. Calling out dishes by name correlates with higher ratings; our Insight Engine can quickly determine which terms are food names, based on word usage, even if the reviewer spelled “gnocchi” and “tortellini” with more imagination than accuracy. And, as with Chinese restaurants, authenticity is a desired quality. If the reviewers find the food “al dente,” ratings go up.

Just as we found for Chinese restaurants, the service (waiters, waitresses, and the seldom-appreciated hostess) is only mentioned in lower-rated reviews, able to garner much blame but little praise.

Where Italian restaurant reviews differ is in reviewers’ attitudes to ambiance. Looking at the Concept Cloud generated by the reviews, we see the expected clusters around the topics of food and service, but also a third cluster that isn’t in the Chinese restaurant data.  This cluster focuses on the décor, the music, the noise, the mood.

This implies that Bostonians go to Chinese restaurants to be served food, but visit Italian restaurants for something more. They go for the experience.  They’re not all going for the same experience, however. Some want a romantic Italian restaurant where they can take a date. Others want a family restaurant where they feel like they’ve come home.

Therefore, it’s much more important for the owner of an Italian restaurant to focus on how their location feels, looks, and sounds, and how their staff interacts with their customers, than it is for the owner of a Chinese restaurant. They won’t get good reviews for service, because good reviews for service are very rare, but that good service will translate into a good experience, and thus a good review for the ambiance.

Of course, it still comes back to the food, in both cuisines. If the food isn’t good, if it’s bland or boring, then no one will stay, even in the nicest atmosphere. If the food is delicious, spicy and authentic, then a good ambience is nice, but your customers would still buy it out of the back of a truck.

Luminoso on Chinese food

Chinese food

We have used Luminoso’s state-of-the-art text analytics software to tackle Chinese food. Thousands of Yelp reviews from hundreds of Chinese restaurants were entered into Luminoso’s Insight Engine for analysis. (We will neither confirm nor deny that we ran this analysis to pick a lunch spot.)

What we found is that the quality of food is by far the most important factor in reviewer sentiment. Across the price spectrum, and over the 15-mile radius encompassed, people rated restaurants based on the dishes they consumed. Restaurants were judged especially on their authenticity, particularly in Boston’s Chinatown.

A good example of this phenomenon is crab rangoon. Crab rangoon is American, probably invented at Trader Vic’s in San Francisco in 1956, and it’s not served at more authentic Chinese restaurants. Mentions of “crab,” “crab rangoon,” and “crab [various misspellings of “rangoon”]” pop up a lot in reviews of restaurants rated at three stars and below. In reviews of top-rated restaurants, the term “crab” disappears entirely, giving way to scallion pancakes and soup dumplings.

In contrast, the discussion of table service correlates only slightly with positive reviews, and is strongly linked to negative reviews. Good and great service is easily ignored by food-focused patrons, but bad service can ruin a meal. Reviews which mentioned a “waiter” or “waitress” were usually negative. Table service, like editing, is an invisible art, noticed only in the breach. However, reviews that mentioned the “chef” were likely to be positive, especially when chefs were mentioned by name.

As with service, restaurant décor and ambiance are noticed more in negative reviews than positive ones. However, this effect is easy to miss, because it shows up in how people talk about their food. If you search for “cockroach” (which, thanks to Luminoso’s sophisticated modeling, includes reviews that refer to “a thing scurrying across the floor”), you’ll find the expected majority of negative reviews. In these reviews, “cockroach” is closely associated with concepts like “tasteless”, “bland”, and “disgusting”. From this, we can infer that the off-putting setting made the food less enjoyable.

Our takeaway is that while the quality of food is the driving factor in how people rate a restaurant, bad service and décor (particularly of the many-legged variety) can ruin one’s enjoyment of it. Even though Yelp reviews may be focused on food, the entire restaurant is being reviewed, consciously or not.