Luminoso Software Updates

As you may know, the development team at Luminoso works tirelessly to make sure that our solutions are doing what they should without any hiccups. When we make improvements and fix things, we want you to be with us every step of the way.

Check out our most recent software update here, highlighting improvements to the following:

  • Retrieving Documents in Bulk via API
  • Improved Collocation
  • Improved API Documentation

If you ever have any questions, comments or feedback about your Luminoso experience, or want to know more about our solutions and software updates, please feel free to reach out to us at support@luminoso.com.

Client Guest Blog Post: Text Analytics for Tobacco Control Research – University of Illinois at Chicago’s Experience with Luminoso

The following is a reposted blog post by Hy Tran discussing Luminoso, which UIC used to analyze a collection of tweets from the Centers for Disease Control and Prevention’s (CDC) Tips From Former Smokers (Tips) II campaign.

In 2013 the CDC launched the second round of Tips from Former Smokers – a national media campaign that used real-life stories from smokers suffering from the health consequences of smoking. A hallmark of Tips is its use of very graphic images to elicit strong emotional responses.

Our challenge was to analyze 140,000 Tips-relevant tweets to see how Twitter users react to Tips messages. Specifically, what emotions do the ads make people feel? Do people accept or reject the messages? To help answer these research questions, we turned to the text analytics software Luminoso.

Luminoso offers several helpful features: it is cloud-based, so any computer with an internet connection can access the analysis; it has a friendly user interface, so users need not be familiar with natural language processing (NLP) to use it; and it comes with a number of handy tools to visualize the complexity of the corpus. Below is the Concept Cloud, a main part of the user interface. Users can manipulate the cloud directly to extract information and create visualizations.

Concept Cloud

Figure 1 – Concept Clouds representing associations between concepts found in the CDC Tips Tweets Corpus. Each concept is a word or phrase that has a certain meaning. Concepts that appear more often are displayed in bigger fonts. Concepts that are strongly correlated are located close to each other. Related concepts are displayed in various shades of the same color; e.g., the concepts in light green are about Terrie, the lady with a hole in her throat.

What really sets Luminoso apart is that it comes with a built-in network of “common sense” knowledge, which allows the machine to understand implicit connections in language that we humans take for granted. In Luminoso’s own words, this tool recognizes the difference between “I find your pants disturbing” and “I find your lack of pants disturbing.” To have that kind of understanding, one needs the common sense knowledge that people are expected to wear pants. Of course, for humans this kind of knowledge is taken for granted and never spelled out in day-to-day conversation. Computers, however, do not have such knowledge; someone has to teach it to them first. Any attempt by a machine to analyze human language will be incomplete unless the computer possesses an equivalent body of common sense knowledge.

Luminoso acquires common sense knowledge from ConceptNet, a semantic network built at the MIT Media Lab. It then takes advantage of a technique known as AnalogySpace to combine “big picture” knowledge from ConceptNet with specific knowledge learned from the text data at hand. The technical details behind ConceptNet and AnalogySpace are beyond the scope of this blog post. Interested readers are encouraged to start with this paper, which provides a comprehensive introduction to these topics. For further reading, please refer to this publication list by Catherine Havasi, one of the active developers of ConceptNet.

Our challenge with the Tips data was to extract sentiments from text and see which sentiments are strongly related to each other. Sentiments are difficult to analyze – they can be hard to define, and understanding them requires a good deal of implicit knowledge. Luminoso’s ability to incorporate common sense knowledge made it attractive for our purpose.

Table 1

Table 1 – Correlation among some Sentiment Concepts found in Tips Corpus. Each row and column indicates a concept. The numbers are correlation coefficients between concepts.

For example, Luminoso understands that sentiments such as “depressing” and “sad” are strongly correlated, whereas “sad” and “funny” are not correlated. While this seems obvious to a human reader, it is no simple task to teach such ideas to a machine. This kind of understanding, though simplified, can serve as a good starting point for further development in sentiment analysis.
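As a rough, toy-scale illustration of how this kind of understanding can arise from blending background knowledge with the corpus itself (this is only in the spirit of AnalogySpace, not Luminoso’s actual implementation; the vocabulary, matrices, and weights below are all invented), consider the following Python sketch:

    import numpy as np

    # Toy vocabulary shared by the background knowledge and the domain text.
    vocab = ["sad", "depressing", "funny", "cancer", "smoking"]
    index = {w: i for i, w in enumerate(vocab)}

    # Hypothetical "common sense" associations (rows and columns follow vocab order).
    background = np.array([
        [1.0, 0.8, 0.0, 0.3, 0.2],
        [0.8, 1.0, 0.0, 0.3, 0.2],
        [0.0, 0.0, 1.0, 0.0, 0.0],
        [0.3, 0.3, 0.0, 1.0, 0.6],
        [0.2, 0.2, 0.0, 0.6, 1.0],
    ])

    # Term-document counts from a tiny, made-up corpus of "tweets".
    docs = [
        "smoking causes cancer and that is sad",
        "this ad is depressing and sad",
        "that joke was funny",
    ]
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.split():
            if word in index:
                counts[index[word], j] += 1

    # Blend the background knowledge with the domain-specific evidence, then
    # reduce the combined matrix with an SVD (the general spirit of AnalogySpace).
    blended = np.hstack([0.5 * background, 0.5 * counts])
    U, S, Vt = np.linalg.svd(blended, full_matrices=False)
    concept_vectors = U[:, :3] * S[:3]          # keep the top 3 dimensions

    def similarity(a, b):
        va, vb = concept_vectors[index[a]], concept_vectors[index[b]]
        return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

    print(similarity("sad", "depressing"))      # expected to be relatively high
    print(similarity("sad", "funny"))           # expected to be much lower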

Result

One of our main goals was to understand how people react to the graphic depictions of the consequences of smoking shown in the Tips ads. The table below provides correlations between specific smoking-related diseases and sentiment reactions.

Table 2

Table 2 – Correlation between smoking consequences (found by Search Rule) and sentiments (found by Luminoso).

Here we can see a strong positive correlation between ads showing patients with tobacco-related cancer and sadness. This indicates that talking about cancer elicits sad emotions in the audience, which may be a sign that people are engaged with this message. Likewise, it seems that a stoma can freak and creep people out, suggesting a fear response to the effects of the disease as opposed to a humor response. COPD and stroke have weak correlations with sentiments, meaning ads featuring these conditions provoke a weaker emotional response.

We also have concepts such as “laugh” or “funny.” Positive correlations with these concepts can indicate that audiences are rejecting the message. In general, correlations with ads are weak for these concepts. The moderately negative correlations between “lol”, “funny”, and the stoma ads suggest that the seriousness of the disease caused by smoking is relatively well accepted among audiences.

Another useful feature of Luminoso is the ability to correlate a topic with users’ metadata. Below is an example with topics related to Terrie – one of the former smokers featured in Tips whose story was the most prominent topic of discussion in our corpus.

Figure 2 – Correlations between Topics about Terrie (vertical axis) and Number of Followers of the tweet authors (horizontal axis).

The trend indicates that people with lower numbers of followers are more likely to talk about Terrie ads than those with many followers. Since people with fewer followers tend to be organic as opposed to commercial users, this indicates that Terrie ads managed to engage organic audiences.

We can also see if people talk about quitting smoking after viewing the ads.

Figure 3

Figure 3 – Correlations between Topics about Not Smoking (vertical axis) and Number of Followers of the tweet authors (horizontal axis).

“Not Start” refers to users telling others not to start smoking. “Stop Smoking” refers to people telling others to stop smoking. “Quit Smoking” refers to discussion about quitting smoking. Note the very large correlation with Quit Smoking talk for users with large numbers of followers. However, these are mainly institutional users, such as state public health agencies giving advice about how to quit. It appears that people with fewer followers (likely to be organic users) are motivated to tell others to stop smoking after seeing these ads.

These insights, drawn from over 140,000 Tweets about the CDC Tips campaign, would not be possible using human coders alone. Thus Luminoso proves to be a promising tool for analyzing sentiment in social media data. These analyses reveal important differences in emotional responses to different kinds of ads, show how people talk about smoking after viewing the ads, and provide further support for running graphic anti-smoking media campaigns in the future.

This blog post was written by Research Assistant Hy Tran, a doctoral candidate in biostatistics.

Updates: Timelines, Subsets and Vectors

Here at Luminoso, we are committed to consistently improving the science of our system. Over the weekend, a few upgrades were made live in our Dashboard.

To help show what the new updates look like, we’ve taken a few screenshots from our Amazon Kindle Fire project.

Timeline Accuracy

Topic timelines are now based on all of the documents rather than a sample, making them more accurate than ever before.

Subsets

The scaling of topic-subset correlations has also been improved, so users can now make direct comparisons to topic timelines, in addition to more accurate comparisons between subsets.

Previous Timelines and Subsets

Old Timeline

Old Subsets

New Timelines and Subsets

New Timeline

New Subsets

Vectors

We’ve increased the accuracy of our vectors (the fundamental representation on which our visualizations and statistics are based), resulting in small improvements in the accuracy of correlation values.

 

Highlights

Searching through documents to understand their association with sentiment and topics of discussion can be a cumbersome process, especially with multi-page responses. We’ve introduced highlighting to our system – a feature that brings exact and conceptual matches to the reader’s attention, regardless of their location within the document. This gives users instant insight into what drives discussion at the document level.

In this example, the term “charger” has been queried. The system automatically highlights all exact matches in blue, and conceptual matches in light grey.

chargerhighlight

 

Highlights work on complex phrases and parts of multi-word concepts as well.

The system highlights exact matches, as well as logical variations of the word.

nothighlight

“Kindle Fire” – a portion of a multi-word concept – is highlighted when searched, while the entire concept is highlighted if “Kindle Fire HD” is queried.

kindlefiremagnify

For API users, we make the indices into the text for each term available on every document, so you can find relevant portions of documents faster.

kindlefirehdmagnify
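As a sketch of how those per-term indices might be used on the client side (the field names and layout of the document below are hypothetical, not the documented response format), you could wrap each matched span in markers:

    # Hypothetical document shape: the API exposes character offsets for each
    # term; the field names and tuple layout here are illustrative only.
    doc = {
        "text": "My Kindle Fire charger stopped working after a week.",
        "terms": [                     # (term, start offset, end offset)
            ("kindle fire", 3, 14),
            ("charger", 15, 22),
        ],
    }

    def highlight(document, query):
        """Wrap every span whose term matches the query in [[...]] markers."""
        text = document["text"]
        spans = [(s, e) for term, s, e in document["terms"] if term == query]
        for start, end in sorted(spans, reverse=True):   # right to left, so offsets stay valid
            text = text[:start] + "[[" + text[start:end] + "]]" + text[end:]
        return text

    print(highlight(doc, "charger"))
    # My Kindle Fire [[charger]] stopped working after a week.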

Token Authentication

The Luminoso API now supports standard token-based authentication, making it easier to write an API client in any programming language: each request is authenticated simply by including a header containing an API token.
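A minimal Python client might look like the sketch below; the header scheme and endpoint path are assumptions for illustration rather than quotes from the API documentation.

    import requests

    API_TOKEN = "your-api-token"               # issued from your Luminoso account
    BASE_URL = "https://api.example.com"       # placeholder, not the real API base URL

    # Each request simply carries the token in a header; the exact header
    # format shown here is an assumption for illustration.
    headers = {"Authorization": "Token " + API_TOKEN}

    response = requests.get(BASE_URL + "/projects", headers=headers)
    response.raise_for_status()
    print(response.json())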

Picking a March Madness bracket using natural language text

The NCAA basketball playoffs are here. It’s time to make complicated bets on the outcome of a single-elimination tournament. Time for basketball fans to act like they know things about statistics, and statistics fans to act like they know things about basketball, and fans of both basketball and statistics to win stuff. Time for March Madness.

I enjoy following the March Madness results but I never have a good plan for how to fill out a bracket. When I saw a link to Coder’s Bracket, I thought it was an awesome idea: instead of making lots of arbitrary decisions, write a computer program to pick the bracket for you. But what do I know about any of these basketball stats that it uses as input?

What I know about is text analytics, and so Luminoso’s text analytics are going to pick a bracket for me. It may not win a billion dollars from Warren Buffett, but it’ll be fun.

Picking winners based on Twitter activity

I told our Twitter listener to find March Madness-related terms such as “NCAA”, “basketball”, “bracket”, and of course “March Madness”. I also gave it the full names of all the teams according to Wikipedia, in case those helped. I collected tweets from about noon to 5 PM today.

My assumption is that the teams that are more likely to win are the ones that more people are talking about on Twitter. If the Florida Gators end up matched against the Tulsa Golden Hurricane, we can see that more people are talking about the Gators than the Golden Hurricane and pick them as the winners.

The only problem is how to measure this. The full names of college basketball teams are often complicated and unlikely to be used in casual conversation, such as the “Stephen F. Austin Lumberjacks” or the “Louisiana-Lafayette Ragin’ Cajuns”. But shorter versions of the names are ambiguous. If someone says “Iowa”, are they talking about the Iowa Hawkeyes or the Iowa State Cyclones? If you say “Go Wildcats”, which of four teams are you cheering for? And are people who tweet #GoState even trying to communicate anything?

First, I should check my assumption. Can I just count the number of occurrences of the full name of the team? Not really. This fails for a few reasons:

  • The results are noisy. The numbers are small (about 5 to 100) and they change drastically when you use a slight variant of the team name.
  • The results may be biased, in that some teams may be more likely to be called by their full name than others.
  • Most of the people who talk like that are spammers auto-generating tweets. Unless I believe that spammers have some key insight on who’s going to win, I should probably disregard this data.

On the one hand, this sounds like a job for Luminoso and semantic similarity.

On the other hand, basically all of the semantic information that would link teams to nicknames is missing here. People will put “Gonzaga” next to other teams in a list of teams they’re rooting for more often than they’ll put it next to “Bulldogs” to form the whole team name, so there is no accurate way to determine automatically that “Gonzaga” and “Bulldogs” are the same team.

But Luminoso’s term statistics can tell us which terms seem to be interesting, and many of those are team names.

Assigning relevance to teams

The relevance function in Luminoso finds words and 2-3 word phrases that appear more than you’d expect from the general distribution of words in English. This isn’t just word frequency: phrases can be more relevant than their individual words if those words usually appear together, and it de-emphasizes common words like “next” (not to mention “the”) in favor of ones that are more interesting in this set of data.
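Luminoso’s exact scoring isn’t spelled out here, but the flavor of the idea can be approximated with a simple comparison against background English frequencies; all the numbers in this sketch are invented.

    import math
    from collections import Counter

    # Made-up background frequencies (per million words of general English).
    background_per_million = {"the": 50000, "next": 700, "gators": 2, "bracket": 5}

    def relevance(term, corpus_counts, corpus_size):
        """Score how much more often a term appears than general English predicts."""
        observed = corpus_counts[term]
        expected = background_per_million.get(term, 1) / 1_000_000 * corpus_size
        if observed == 0:
            return 0.0
        # A log-likelihood-ratio-style score: common words like "the" end up
        # scoring lower than words that are unusually frequent in this data.
        return observed * math.log(observed / expected)

    tweets = "the gators win the bracket the gators look great".split()
    counts = Counter(tweets)
    for term in ("the", "gators", "bracket"):
        print(term, round(relevance(term, counts, len(tweets)), 2))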

We have an ordered list of relevant terms that we extracted from the collected tweets. It contains things like “tournament bracket, ncaatournament, Iowa, Gators, Duke, MSU, …”. The only problem, as described before, is assigning these relevant terms to teams.

We know now that we can’t hope for a clear one-to-one mapping (some of these aren’t teams anyway). But we can go through the list of terms in order, and ask: “Does this term appear in exactly one team name?”

Each team then ends up with the relevance score of its most relevant unique term. In the list above, for example, we’d start by skipping “tournament bracket” and “ncaatournament” because they don’t go with a team. We then skip Iowa because there are multiple teams from Iowa. We keep “Gators”, and we’d assign its score to the Florida Gators if they don’t already have a higher score.

I had to make sure to allow for abbreviations, such as “UConn” for the University of Connecticut Huskies and “NC State” for the North Carolina State Wolfpack. But once this was done, it seemed to be a fairly resilient way to rank the teams by their Twitter buzz. It was probably a bit unfair to the Kentucky Wildcats — both of their names are ambiguous, so they could only count for the full phrase “Kentucky Wildcats” — but that just goes to show that they should pick a more creative name.
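In code, the matching step described above is a single pass over the ranked terms; the scores and team list below are made up to keep the sketch short.

    # Relevant terms, highest relevance first (illustrative scores only).
    ranked_terms = [
        ("tournament bracket", 41.0),
        ("ncaatournament", 39.5),
        ("iowa", 30.2),
        ("gators", 28.7),
        ("uconn", 25.1),
    ]

    teams = [
        "Florida Gators",
        "Iowa Hawkeyes",
        "Iowa State Cyclones",
        "Connecticut Huskies",
    ]

    # Abbreviations such as "UConn" that don't appear in the official team name.
    aliases = {"uconn": "Connecticut Huskies"}

    def teams_matching(term):
        """All teams whose full name contains the term, or that the term abbreviates."""
        matches = [t for t in teams if term in t.lower()]
        if term in aliases:
            matches.append(aliases[term])
        return matches

    scores = {}
    for term, score in ranked_terms:
        matches = teams_matching(term)
        if len(matches) != 1:      # skip non-team terms and ambiguous ones like "iowa"
            continue
        team = matches[0]
        # Each team keeps the score of its most relevant uniquely-matching term.
        scores[team] = max(scores.get(team, 0.0), score)

    print(scores)
    # {'Florida Gators': 28.7, 'Connecticut Huskies': 25.1}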

The results, including which term was the best for each team, are shown in this spreadsheet.

The bracket

Here’s the bracket and the code to generate it, on Coder’s Bracket.

Luminoso's March Madness bracket

There’s a tweak in the code that says that 15th and 16th seeds can’t win. I added that, not just because it’s a very good assumption, but also because otherwise we’d give too much weight to a team such as Cal Poly that was just about to play a play-in game as the data was being collected.
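The real bracket code runs on Coder’s Bracket (in JavaScript), but the seed rule boils down to something like this Python sketch, with invented buzz scores:

    def pick_winner(team_a, team_b, scores):
        """Pick the team with more Twitter buzz, but never advance a 15 or 16 seed."""
        # team_a and team_b are (name, seed) pairs; scores maps name -> buzz score.
        for high, low in ((team_a, team_b), (team_b, team_a)):
            if high[1] >= 15 and low[1] < 15:
                return low                    # a 15/16 seed can't win
        return max(team_a, team_b, key=lambda t: scores.get(t[0], 0.0))

    buzz = {"Florida Gators": 28.7, "Cal Poly Mustangs": 31.0}   # illustrative only
    print(pick_winner(("Florida Gators", 1), ("Cal Poly Mustangs", 16), buzz))
    # ('Florida Gators', 1): the seed rule overrides the buzz score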

Finally, a concept cloud

One of these posts wouldn’t seem complete without showing the semantic space of all these mostly-March Madness tweets as a concept cloud. The words and phrases here are sized by their relevance score, colored by major topics (the tournament in general, and teams that people are rooting for), and positioned so that related concepts are near each other.

Disregard Putin, he’s getting into everything these days.

March Madness concept cloud

The State of the Union

Tonight, President Obama gives his State of the Union address. It’s that time of year when we reflect on where the country is going, and then gossip about who in the audience made a faux pas on national TV, invent drinking games, and start up text analytics software.

The SotU address is a pretty good case for text analytics. It’s important. You want to know fundamentally what it’s about. But it’s also rather long and frequently takes a while to get to the point. Can’t we get a computer to listen to it, and just glance at a computer-generated visualization of what the President said, instead?

This is, of course, the kind of thing we do at Luminoso. And while tonight’s speech hasn’t happened yet, we can apply Luminoso’s software to examine the content of State of the Union addresses over the history of our country. Political rhetoric changes with the times, but we can use Luminoso to open new avenues of investigation into a general model of how politicians talk.

We built this model from transcripts of all 227 SotU speeches. The model captures the semantics of how words and phrases are used, based on its general background knowledge and the way words are used in these speeches in particular. For example, the model knows that the President is talking about approximately the same thing when he mentions “negotiation” and “arbitration”, or for that matter “poverty” and the “needy”. It knows that “combat”, “conquer”, “troops”, and “invasion” are all ways to talk about “war”, which is important because there is a lot of talk about war.

Here’s the overview Concept Cloud for all of the speeches.

SotUCloud

There’s more going on in a Concept Cloud than in a typical word cloud visualization. The words are grouped together in space so that words that are used in similar ways appear near each other, or in similar colors. The size of a word indicates how much information we get from it; this is why the largest words aren’t “the”, “I”, and “America”, but words relating to particular meaningful topics. Some of the informative words were predictable (“taxpayers”, “troops”, “negotiated”) and some of them less so (“civilized”, “gratifying”, “enlightened”).

We can track the usage of named entities (“Congress”, “Britain”, “Communists”), actions (“enacted”, “negotiated”), and descriptions (“civilized”, “belligerent”, “patriotic”). Some of these topics were legislative, some involved foreign policy, and quite a lot of our analytical topics were military, but then again, so were many of the speeches.

One thing we can do is break down the word usage by political party:

Concept Cloud for Democratic-Republican presidents (29 speeches)

SotUCloudRepublican
Concept Cloud for Republican presidents (89 speeches)

SotUCloudDemocrat
Concept Cloud for Democratic presidents (90 speeches)

The Concept Clouds for speeches from different political parties turned up some interesting contrasts. Democratic-Republican presidents demonstrated presidential priorities from a different century, including the “Barbary” coast pirates, “fortifications”, and management of “militias”. Republican presidents tended to discuss foreign relations more bluntly (“confront”, “cooperate”, “allies”), while Democrats tended to be multilateral (“United Nations”, “aggression”, “neighbors”). On economic policies, Republicans discussed “inflation”, “incentives”, and “taxpayers”, while Democrats talked of “unemployment” and the “minimum wage”.

Now we can break it down by President, looking at the correlation between the content of their speeches and various topics. This is something we measure as a percentage, with 0% being the typical correlation between two arbitrary words. These numbers measure whether each President talks about various topics more or less than usual, based on the appearance of the key word and many related words in their speeches.

Presidents who discussed…

God
  The most: Ronald Reagan (11% more), George H. W. Bush (9% more), George W. Bush (7% more)
  The least: Calvin Coolidge (11% less), Warren G. Harding (11% less), William Taft (10% less)

war
  The most: Franklin D. Roosevelt (33% more), Harry S. Truman (25% more), James Madison (24% more)
  The least: Bill Clinton (7% less), Barack Obama (4% less), Warren G. Harding (2% less)

peace
  The most: John F. Kennedy (32% more), Harry S. Truman (30% more), Dwight D. Eisenhower (27% more)
  The least: Theodore Roosevelt (28% less), Calvin Coolidge (15% less), Benjamin Harrison (11% less)

taxpayers
  The most: Barack Obama (48% more), Bill Clinton (47% more), Lyndon B. Johnson (46% more)
  The least: John Adams (34% less), Zachary Taylor (32% less), Abraham Lincoln (31% less)

the Constitution
  The most: Andrew Johnson (57% more), James Buchanan (57% more), Rutherford B. Hayes (40% more)
  The least: Barack Obama (18% less), John F. Kennedy (17% less), Bill Clinton (15% less)

When the SotU speeches were analyzed by president, some interesting variations turned up, some of which were caused by events (“war”, “peace”) and some of which were the result of period vocabulary (“taxpayers”, “Constitution”).

A Luminoso analysis can output a lot of numbers, but usually we make the most out of topic-topic correlations, which tell you to what extent any two concepts are related. Many strong correlations were predictable (Senate/ratified 62%, repeal/law 74%), but some were quite interesting. The correlation between “Supreme Court” and “impartial” (77%) suggests that the Supreme Court has always been viewed as the fair arbiter of American legislation. The strong link between “Communists” and “Fascists” (41%) indicates that US presidents have talked about these different political movements in quite similar ways. “England” and “Britain” were 91% related, while “Russia” and “Soviet Union” were only 35% correlated, accurately reflecting the differences that these names represent.

The most interesting results came from the topic-timeline data, which measures how much any given concept (and its related ideas) were discussed over time:

SotU-PoliticalEntities

Republicans and Democrats alike will unite in bipartisan indignation to find that, as far as presidential SotU addresses are concerned, they are practically identical entities. However, they may be heartened to learn that their share of the conversation is increasing. Meanwhile, the “Constitution” has been virtually ignored since the start of the 20th century, and both “Congress” and the “Senate” have held steady in terms of relevance in SotU speeches.

SotU-Legislation

Interestingly, terms relating to legislation have been in decline since the early 20th century. This could be because presidents switched from discussing specific legislation to broad policy goals, but it’s more likely that this is the result of emphasizing foreign over domestic policy.

SotU-Rhetoric

After World War II, the SotU became a message not just for American “citizens”, but for the international community. The role of America in the world changed, and the role of the State of the Union speech did as well. Presidents abruptly began using the SotU address to tell everyone what they were going to do about the communists, fascists, and the imperialists.

SotU-MilitaryVocabulary

World War II profoundly changed the way US presidents discuss all military actions. It used to be about “maritime” rights and dealing with “belligerents” and “hostilities”. After WWII, the new watchwords were “security” and countering “aggression”. Most notably, it was after WWII that US presidents decided that they were the leaders of the “free world” and said so.

Will Obama’s address tonight represent the natural continuation of historical trends, or will it cover unprecedented new topics? We’ll find out tonight, and so will our software.

Luminoso Webinar TOMORROW, Wednesday, January 15: The State of the Union Address

There have been 224 State of the Union addresses delivered by 44 sitting United States Presidents, and Luminoso has read them all. What are some of the general themes over time? Has the language, tenor or sentiment changed since George Washington delivered the first State of the Union address in 1790?

During this interactive webinar, analytics engineer Dr. Alice Kaanta will demonstrate how Luminoso came to understand each of the State of the Union addresses. Alice will also be available to answer your general questions about how Luminoso derives deep insights from unstructured text.

We hope you will listen in!

Please register here.