Automated Insights, Inc.

Defining the Context Layer in NLG Part 1: What Just Happened?

Posted by Joe Procopio on Oct 4, 2017

In my last blog post, I outlined the differences between good Natural Language Generation and bad NLG. My conclusion was that all of the hard work and sweat should be going into defining the NLG context layer, the logical decision matrix that sits between the data and the words.

This is because NLG content is less about the words and more about the insights, the data. The context layer will help make sure the words are used to convey the data to the reader in a manner that can be absorbed quickly and easily.

So how do you define a context layer?

First and foremost, defining the context layer is a business process, not a technical process. It's akin to defining business requirements for custom software, and then translating those business requirements into technical requirements. The beauty of Wordsmith is that it allows anyone, regardless of technical aptitude, to translate their own business requirements into automated content.

But as with any business initiative worth pursuing, a process that isn't technical can still be complex. In fact, when we work with customers to create massive automated content projects, like Yahoo's Fantasy Football Recaps or the AP's Quarterly Earnings Reports, we spend the vast majority of our time working with those customers to define the context layer, and we take most of our cues from their industry experts.

The goal when creating good automated content is to unlock the story hidden within the data. And those stories can usually be boiled down to a few basic structural elements.

In this post, we'll look at the first and most important of those structural elements, and talk about building a context layer around it.

What just happened?

The shortest form of automated content gets right to the point:

Your team won.

Your stock price rose.

Your customer churn decreased.

Those examples are more alerts than insights, and while the science here is still NLG, the complexity is rather low, and we're still pretty much just reporting simple data back to the end user.

Because those examples lack any insight, there is no context layer to contend with. Once you add context, you move from alert to sentence, and from raw data to NLG.

Your team won, against a better team.

Your stock price rose to a new 52-week high.

Your customer churn decreased, leading to an increase in profits.

Now you've got stories. They're small stories, but stories nonetheless. That first bit of context, which should also be the most important bit of context, is the lede. As in a news story, the lede is a single sentence that quickly summarizes the most important aspects of the story.

The first step in defining the context layer is outlining every single viable lede for the story being told by the data. This isn't as hard as it sounds, because ledes can be grouped. In a strictly data-focused sense, ledes are generally grouped into categories of movement within the data, namely: records, trends, and deltas.

Records tend to denote the highest/lowest, quickest/slowest, longest/shortest data points within a given period of time. The relevance of the peaks and valleys, as well as the time frame over which they're determined, are business decisions.

With records, a good rule of thumb is that the less frequently a record is surpassed, or the further back in time the old record was set, the more significant the new record. In stocks, a weekly high can be insignificant when compared to a 52-week high or an all-time high. You can also look at the gap between the new record and the old record, with a larger gap implying greater significance. When a record is smashed, so to speak, it's usually a bigger deal.
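To make that rule of thumb concrete, here is a minimal sketch of how a record check might look. The function names, the window sizes (roughly a week, a quarter, and 52 weeks of trading days), and the idea of measuring the gap as a fraction of the old record are all assumptions for illustration. The real thresholds are business decisions.

```python
from typing import Optional, Sequence


def longest_record_window(
    history: Sequence[float],
    new_value: float,
    windows: Sequence[int] = (5, 63, 252),  # hypothetical: ~week, ~quarter, ~52 weeks
) -> Optional[int]:
    """Return the longest lookback window over which new_value is a new high.

    Returns None if new_value is not a high over even the shortest window,
    or if there is not enough history to fill that window.
    """
    best = None
    for window in sorted(windows):
        recent = history[-window:]
        if len(recent) < window:
            break  # not enough history to claim a record over this window
        if new_value > max(recent):
            best = window
        else:
            break  # longer windows contain this one, so they cannot be records either
    return best


def record_margin(history: Sequence[float], new_value: float, window: int) -> float:
    """Gap between the new value and the old record, as a fraction of the old record."""
    old_record = max(history[-window:])
    return (new_value - old_record) / old_record


# 252 observations with one old peak of 120 sitting about 52 observations back.
history = [100.0] * 200 + [120.0] + [100.0] * 51
weekly_only = longest_record_window(history, 110.0)
full_year = longest_record_window(history, 126.0)
margin = record_margin(history, 126.0, 252)
```

With this made-up history, a close of 110 is only a weekly high, while a close of 126 is a high across the full 252-observation window and beats the old record by 5%. The longer the window and the wider the margin, the stronger the case for making the record the lede.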

Trends are series of data points that show growth/contraction or positive/negative or any other pattern, usually over time. Again, how these trends are interpreted and their inherent value to the end-reader are business decisions.

Usually, the longer the trend, the more significant it is. However, significance also depends on how often the direction of the trend normally changes. If the direction rarely changes, as with days at work without an accident, each additional day is probably less significant the longer the trend goes on, and the day the trend finally breaks is more significant.
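A trend check can be sketched as a simple streak counter. This is only one possible reading of "trend": it counts consecutive same-direction changes ending at the latest data point. The function name and the return shape are assumptions for illustration.

```python
from typing import List, Tuple


def current_streak(values: List[float]) -> Tuple[int, int]:
    """Return (direction, length) for the run of changes ending at the latest point.

    direction is 1 for rising, -1 for falling, and 0 when there is no streak
    (fewer than two points, or the latest change is flat).
    length counts consecutive changes in that direction.
    """

    def sign(x: float) -> int:
        return (x > 0) - (x < 0)

    if len(values) < 2:
        return 0, 0
    direction = sign(values[-1] - values[-2])
    if direction == 0:
        return 0, 0
    length = 0
    # Walk backward through consecutive pairs until the direction changes.
    for prev, cur in zip(reversed(values[:-1]), reversed(values[1:])):
        if sign(cur - prev) != direction:
            break
        length += 1
    return direction, length


rising = current_streak([1, 2, 3, 2, 3, 4, 5])
falling = current_streak([5, 4, 3])
```

In the first series the latest run is three straight increases, and in the second it is two straight decreases. To treat a broken trend as the story, as in the accident-free example, one option is to compare the streak on the series without its latest point against the streak with it.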

Deltas are the differences between two data points from one point on an axis to the next, usually a time axis but also potentially location or any other attribute that suggests similarity. Deltas are important when they're larger or smaller than expected, or when they're related to another data point in the same set.

Significant changes here might include a delta that is X times larger than the median, e.g., a stock that usually rises or falls no more than 1% rises by 10%. Another might be a delta that is markedly different from the expected delta, for example, a team wins by 2 points but was expected to lose by 20.
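Both delta checks reduce to a bit of arithmetic. The sketch below is illustrative only: the function names are made up, and using the median of absolute past moves as the "usual" move is an assumption about how to read "X times larger than the median."

```python
from statistics import median
from typing import Sequence


def delta_multiple(past_deltas: Sequence[float], new_delta: float) -> float:
    """How many times larger the new move is than the typical (median absolute) move."""
    typical = median(abs(d) for d in past_deltas)
    if typical == 0:
        return float("inf") if new_delta else 0.0
    return abs(new_delta) / typical


def surprise(actual: float, expected: float) -> float:
    """Difference between what happened and what was expected."""
    return actual - expected


# A stock whose daily moves are usually around 1% rises by 10%.
multiple = delta_multiple([0.5, -1.0, 1.0, -0.8, 1.2], 10.0)

# A team expected to lose by 20 wins by 2.
swing = surprise(2, -20)
```

Here the 10% move is 10 times the typical move, and the team beat expectations by 22 points. Where to draw the line for "significant" is, again, a business decision.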

Groupings

Once the groupings are established and the ledes are determined and outlined, they need to be prioritized. This is because more than one significant scenario can happen with any new measurement of data. For example, a stock price hits a 52-week high due to its largest single-day gain in three months. Only one of these scenarios should be the lede.
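One way to picture a prioritized outline of ledes is as a list of candidates, each with a rank, a test against the data, and a template. This is a hedged sketch and not how Wordsmith itself works. The `Lede` class, the field names, the priorities, and the data keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Lede:
    name: str
    priority: int  # lower number = more important (a business decision)
    applies: Callable[[Dict[str, Any]], bool]
    template: str


# Hypothetical outline: the 52-week high outranks the single-day gain.
LEDES: List[Lede] = [
    Lede(
        "52_week_high",
        1,
        lambda d: d["is_52_week_high"],
        "{ticker} closed at a new 52-week high.",
    ),
    Lede(
        "largest_gain",
        2,
        lambda d: d["largest_gain_in_months"] >= 3,
        "{ticker} posted its largest single-day gain in {largest_gain_in_months} months.",
    ),
]


def pick_lede(data: Dict[str, Any], ledes: List[Lede]) -> Optional[Lede]:
    """Return the highest-priority lede whose scenario occurred, or None."""
    fired = [lede for lede in ledes if lede.applies(data)]
    return min(fired, key=lambda lede: lede.priority, default=None)


data = {"ticker": "ABCD", "is_52_week_high": True, "largest_gain_in_months": 3}
chosen = pick_lede(data, LEDES)
sentence = chosen.template.format(**data)
```

Both scenarios occur in this sample data, but only the 52-week high becomes the lede, so `sentence` reads "ABCD closed at a new 52-week high."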

Well, I say that, but you can prepare for two-part ledes:

ABCD rode its largest single-day gain in three months to a new 52-week high.

Or even three-part ledes:

ABCD rode its largest single-day gain in three months to its best week in five years and a new 52-week high.

But you can see how even the reading of that sentence gets overly complex. NLG is intended to make sense of a lot of data as concisely as possible. So while you can automate as much content as you'd like, NLG gets more complex the more you write.

Now that we have a prioritized outline of ledes, however, we can create additional groupings of these ledes to accommodate multiple scenarios into our NLG. Of course, this assumes we have the space to include them, because we're now moving from a single sentence to a paragraph. We'll use a primary lede and then a secondary lede:

ABCD soared 3.5% yesterday to close at a new 52-week high. The stock also saw weekly gains of 6.2%, marking its best week since 2012.
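The primary-plus-secondary pattern can be sketched by giving each scenario two phrasings, one for when it leads and one for when it follows. Everything here is an assumption for illustration: the dictionary layout, the key names, and the `build_paragraph` helper are hypothetical, and the templates are simply written to reproduce the example paragraph.

```python
from typing import Any, Dict, List

# Hypothetical scenarios, each with a lead phrasing and a follow-up phrasing.
SCENARIOS: List[Dict[str, Any]] = [
    {
        "priority": 1,
        "applies": lambda d: d["is_52_week_high"],
        "primary": "{ticker} soared {day_gain}% yesterday to close at a new 52-week high.",
        "secondary": "The stock also closed at a new 52-week high.",
    },
    {
        "priority": 2,
        "applies": lambda d: d["best_week_since"] is not None,
        "primary": "{ticker} gained {week_gain}% this week, its best week since {best_week_since}.",
        "secondary": (
            "The stock also saw weekly gains of {week_gain}%, "
            "marking its best week since {best_week_since}."
        ),
    },
]


def build_paragraph(
    data: Dict[str, Any],
    scenarios: List[Dict[str, Any]],
    max_sentences: int = 2,
) -> str:
    """Lead with the top-priority scenario, then add lower-priority ones as follow-ups."""
    fired = sorted(
        (s for s in scenarios if s["applies"](data)),
        key=lambda s: s["priority"],
    )
    sentences = []
    for position, scenario in enumerate(fired[:max_sentences]):
        template = scenario["primary"] if position == 0 else scenario["secondary"]
        sentences.append(template.format(**data))
    return " ".join(sentences)


data = {
    "ticker": "ABCD",
    "day_gain": 3.5,
    "week_gain": 6.2,
    "is_52_week_high": True,
    "best_week_since": 2012,
}
paragraph = build_paragraph(data, SCENARIOS)
```

The `max_sentences` cap is where the "how many words" decision lives. Set it to 1 and the same data yields only the primary lede.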

You might note that the language is now able to shift to accommodate all that information without sounding as robotic. You can plan for this when outlining these groups of multiple ledes, something that's much harder to do when trying to cram three scenarios into a single sentence. It's not just an aesthetic choice. You're actually conveying better information to the end-reader when you use more human-sounding language.

But just as important a business decision as what you say is how many words you take to say it. And what you leave out of the content can be just as important as what you leave in. Not every event in the data is significant, and adding too much complexity to the narrative can actually defeat the purpose of automated content, which, again, is to make sense of a lot of data as concisely as possible.

If you're working with anything longer than a couple of sentences, you'll want to make sure your first one or two sentences are lede only. In other words, use the lede to tell the most important part of what just happened, with context, and save the detail for later.

We'll discuss the middle and the end of our context layer in the next post.