Natural language generation (NLG). It’s what we do here at Automated Insights. But what does it mean? And how do you do it well?

At one extreme of the NLG spectrum, there’s true robot writing. This involves training a deep-learning algorithm on a corpus of text, then having it produce content based on its learnings. From a pure research and artificial intelligence perspective, this represents the holy grail of NLG. In our AI-addled, buzzword-fueled media landscape (robots are stealing your jobs!), many people assume that this is what companies like ours do. Some of our competitors also peddle this mythology. As seen in this example (and countless others), this type of writing, while fascinating to explore (and a key part of the Ai Labs roadmap), is not nearly ready for primetime.

For customers like the Associated Press, an organization with famously precise editorial and journalistic standards, there must be some degree of determinism in the NLG process. In fact, all of our customers are interested in controlling the tone and message of their automated content in order to make it consistent with their branding and use case. There are certainly parts of NLG that can be controlled entirely by the robots (think: recommendations on synonyms and restructured sentences, or automating the data science to choose what’s most interesting to talk about). But, in general, doing NLG well is still a robots + humans proposition.

At the other end of the spectrum is a pure Mad Libs- or mail merge-style approach in which a static template is used to merely substitute in data variables at specified locations. It’s easy to identify this type of overly-simplified NLG, as it’s characterized by robotic-sounding prose and a complete lack of variability.

In between these two extremes is where all the current players in the NLG space are living. Some level of determinism (mandated by human domain experts) is still required, but the ability to quickly create nuanced and variable narratives is what separates the good from the bad.

A Narrative-Driven or Data-Driven Approach?

Truly fantastic automated content can be found at the confluence of rich data and great storytelling. In general, a more robust set of data variables will enable a more nuanced narrative. But more and better data can only complement story structure, not replace it.

So what do we mean by “story structure?” The world of automated writing is really not that different from its more manual predecessor. Like a human-written piece of content, an automated narrative contains a lede, some supporting material, and a conclusion. Instead of just writing about a single event, however, an automated story structure must account for the entire universe of possibilities. That universe ranges from extreme edge cases/outliers to mundane occurrences. Once created, this codified three-dimensional story structure can quickly generate billions of personalized narratives that encompass dozens, hundreds, or even thousands of unique scenarios.

Variability vs. Complexity

Let’s start by defining what we mean by these terms as they pertain to automated content.

Variability can be thought of as “within-narrative” differentiation. That is, if you were to automate the same story (i.e., one powered by the same underlying data) several times, or several thousand times, how much would each iteration of that story vary from the others? This type of content differentiation does not depend on conditional logic or rules. It merely iterates the rendered narrative based on some user-specified distribution of variations.

While variability accounts for differences within narratives, complexity can be considered “between-narrative” differentiation. If you were to automate 50 stories from your dataset (assuming that the underlying data were randomly selected), complexity measures how much each story would vary from the others. This is done through conditional logic (what we generally call “branching” or, even more colloquially, “rules”), which is driven by the richness of the data and/or the thoroughness of the data wrangling/pre-processing. In other words, the content that is produced directly depends on the data.

Depending on your use case or vertical, you may care more about variability, more about complexity, or lots about both. For example, a product-description use case for e-commerce is probably more concerned with variability (for SEO purposes) than with complex, topic-level differentiation. A minor-league baseball recap or earnings report probably needs more complexity than variability. For wide-scale, hyper-personalized content (say, a fantasy football or video game recap), high levels of both variability and complexity will work best.

Adding Variability

Let’s start with the following lede sentence taken from one of Automated Insights’ automated basketball recaps:

“Led by LeBron James, who had a triple-double with 25 points, 13 rebounds, and a season-high 12 assists, the Cleveland Cavaliers defeated the Golden State Warriors 113-102 on Thursday at Oracle Arena.”

Basic variability, as defined in the previous section, can be added in a couple of key ways.

I. Adding synonyms

This is simply replacing words or phrases with similar words or phrases. Examples from the above lede include:

  1. “Led” can be iterated with words like “Carried / Lifted / Powered”
  2. “defeated” can alternatively be “beat / knocked off”
  3. “had” can be “recorded / put up / registered / accumulated”

Not all words can or should be iterated using synonyms. In some cases, there might be a dozen worthy synonyms to choose from; in others, only a couple. In general, adding synonyms introduces a trade-off between variability and robotic- or clunky-sounding prose. As the number of synonyms increases, the permutations of possible outputs likewise rise, but the probability of having your favorite (i.e., the best-sounding) option selected falls. Words that might stick out to readers if used too often (like replacing “three” with “a triumvirate”) are best avoided, or used sparingly via the weights assigned to them.
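Weighted synonym selection can be sketched as a lookup from template slots to weighted word pools. This is an illustrative Python sketch under our own assumptions, not Wordsmith’s implementation; the slot names, pools, and weights are all hypothetical:

```python
import random

# Hypothetical weighted synonym pools. Higher weights favor the
# best-sounding options; rarer words get small weights so they only
# appear occasionally.
SYNONYMS = {
    "led": [("Led", 5), ("Carried", 3), ("Lifted", 3), ("Powered", 2)],
    "had": [("had", 4), ("recorded", 3), ("put up", 2), ("registered", 1)],
    "defeated": [("defeated", 5), ("beat", 4), ("knocked off", 2)],
}

def pick(slot):
    """Select one synonym for a slot, respecting the assigned weights."""
    words, weights = zip(*SYNONYMS[slot])
    return random.choices(words, weights=weights, k=1)[0]

lede = (f"{pick('led')} by LeBron James, who {pick('had')} a triple-double, "
        f"the Cleveland Cavaliers {pick('defeated')} the Golden State "
        f"Warriors 113-102 on Thursday at Oracle Arena.")
```

Because selection is weighted rather than uniform, a rarely appropriate word can stay in the pool without dominating the output.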

II. Restructuring sentences

Entire sentences or phrases can also be re-written to add variability. Going back to our previous lede example, it can also be structured as:

  1. A triple-double by LeBron James lifted the Cleveland Cavaliers to a 113-102 victory over the Golden State Warriors. He had 25 points, 13 rebounds, and 12 assists.
  2. With 25 points, 13 rebounds, and 12 assists, LeBron James’ triple-double carried the Cleveland Cavaliers to a 113-102 win over the Golden State Warriors.
  3. Leading the Cleveland Cavaliers to a 113-102 win over the Golden State Warriors, LeBron James recorded a triple double with 25 points, 13 rebounds, and 12 assists.

As with adding synonyms, there is often a trade-off between the number of iterations included and the “humanness” of the resulting prose. Try to find a happy medium: add variability without introducing robotic-sounding narratives.
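Sentence restructuring amounts to maintaining several full-sentence templates that render from the same underlying data. A minimal sketch, assuming hypothetical field names:

```python
import random

# Three alternative sentence structures for the same lede, all filled
# from the same underlying game data. Field names are hypothetical.
TEMPLATES = [
    "A triple-double by {player} lifted the {winner} to a {ws}-{ls} "
    "victory over the {loser}.",
    "With {pts} points, {reb} rebounds, and {ast} assists, {player}'s "
    "triple-double carried the {winner} to a {ws}-{ls} win over the {loser}.",
    "Leading the {winner} to a {ws}-{ls} win over the {loser}, {player} "
    "recorded a triple-double with {pts} points, {reb} rebounds, and "
    "{ast} assists.",
]

data = {"player": "LeBron James", "winner": "Cleveland Cavaliers",
        "loser": "Golden State Warriors", "ws": 113, "ls": 102,
        "pts": 25, "reb": 13, "ast": 12}

def render(game):
    """Render the lede using a randomly chosen sentence structure."""
    return random.choice(TEMPLATES).format(**game)
```

Each call to `render(data)` yields the same facts in a different structure, which is exactly the “within-narrative” differentiation described above.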

Adding Complexity

In general, the lede sentence acts as the primary driver of complexity. By creating a rich set of potential lede topics, you can greatly increase the number of stories that can be told. The degree of topic-level complexity that a narrative needs is a function of several things, most importantly the vertical or use case. A vertical like e-commerce (with a use case like product descriptions) will generally have relatively few lede topics. A well-designed sports game recap, on the other hand, will require heavy lede-level complexity. Verticals like finance and business intelligence also generally call for a higher degree of complexity (but, again, it depends on the use case within those verticals).

Since a sports recap is a canonical example of a story demanding topic-level complexity (read through a few human-penned NBA recaps and note the number of different lede types), let’s return to our lede sentence from earlier:

“Led by LeBron James, who had a triple-double with 25 points, 13 rebounds, and a season-high 12 assists, the Cleveland Cavaliers defeated the Golden State Warriors 113-102 on Thursday at Oracle Arena.”

While lede-level complexity can be added in several ways, a couple of the most powerful are:

I. Adding conditional logic (“rules”) to account for different types of topics

This is where automating the data science (or data pre-processing) can pay big dividends. In the case of NBA recaps, your pre-processing script can parse play-by-play data to determine whether a game had a “big play” (e.g., a buzzer-beater, game-winning shot, late lead change, comeback, or back-and-forth finish), or one/multiple “big players.” Or maybe the most notable thing to lead with is a team-level insight (a season-high in points or a long winning streak, for example). Designing a lede paragraph that’s flexible enough to deal with this entire range of possibilities is the first (and most important) step in creating a complex piece of automated content.

Examples of how the lede sentence might differ depending on which option is triggered include:

  1. Big play: Vince Carter banked in a 15-foot jumper to beat the buzzer, lifting the Memphis Grizzlies to a 99-98 win over the Utah Jazz on Friday.
  2. Comeback win: The San Antonio Spurs outscored the Dallas Mavericks 14-4 over the final 2:45, capping off a come-from-behind 104-100 victory on Monday.
  3. Team superlative: The Los Angeles Lakers set season highs in points (132) and made 3-pointers (21) in a 28-point blowout victory over the Portland Trail Blazers on Sunday.
  4. Battle of top scorers: The teams had dueling 40-point scorers, as the Golden State Warriors beat the Oklahoma City Thunder 117-111 on Tuesday. Steph Curry led the Warriors with 45, while Kevin Durant topped OKC with a season-high 47.
  5. Balanced scoring: The Boston Celtics had a balanced attack with eight double-digit scorers in a 115-100 victory over the Brooklyn Nets on Wednesday. Isaiah Thomas led the way with 21 points, followed by Tyler Zeller (17), Jae Crowder (17), Evan Turner (15), and Jared Sullinger (14).
  6. 2 big players: John Wall and Bradley Beal lifted the Washington Wizards to a 96-90 win over the Toronto Raptors on Wednesday. Wall had 28 points, 12 assists, and four steals, while Beal chipped in with 31 points and six rebounds.
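Conditional logic like this is often just a prioritized list of rules: check the rarest, most interesting conditions first and fall through to a catch-all. Here is a minimal sketch; the field names on `game` and the thresholds are hypothetical, not a real pre-processing schema:

```python
# A sketch of lede-type selection: rules are checked in priority order,
# and the first condition that matches determines the lede topic.
# All field names and thresholds are illustrative assumptions.

def choose_lede_type(game):
    if game.get("buzzer_beater"):
        return "big_play"
    if game.get("largest_deficit_overcome", 0) >= 15:
        return "comeback_win"
    if game.get("team_season_high_points"):
        return "team_superlative"
    if len(game.get("forty_point_scorers", [])) >= 2:
        return "dueling_scorers"
    if len(game.get("double_digit_scorers", [])) >= 8:
        return "balanced_scoring"
    return "standard"  # catch-all lede for ordinary games

print(choose_lede_type({"largest_deficit_overcome": 18}))  # -> comeback_win
```

The rule order encodes editorial judgment: a buzzer-beater outranks a comeback, which outranks a team superlative, and so on down to the ordinary-game fallback.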

There are dozens of data-driven lede types that can be constructed to maximize complexity. While some will rarely hit (like a perfect game or cycle in baseball, or an all-time high for a stock/portfolio summary), it’s important to include these edge cases for the sake of completeness. There’s also a trade-off between how frequently a condition is triggered and how much it needs to be iterated. You’ll want to dedicate your marginal iteration effort to the phrases that are used most frequently. That is, you don’t need seven different ways to describe an event that might only happen once every 10,000 narratives.

Finally, the number of “options” in your lede isn’t always directly correlated with its complexity. It’s important to understand how “theoretical complexity” (i.e., assuming all variations are equally likely to occur) relates to “empirical complexity” (i.e., how frequently your variations actually occur in the dataset). Having 20 options that each hit five percent of the time is different from having 19 options that each hit one percent of the time, plus a catch-all option that hits for the remaining 81 percent of stories. Ultimately, the amount of differentiation depends on the frequency distribution of those options which, in turn, is a function of the distribution of the underlying data (and the design of the conditional logic).
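One simple way to see empirical complexity is to run your rules over a historical dataset and count how often each lede option actually fires. A quick sketch with made-up counts for illustration:

```python
from collections import Counter

# Hypothetical lede types triggered across 100 historical games; in
# practice you'd produce this list by running your rules over real data.
triggered = (["standard"] * 81 + ["comeback_win"] * 8 + ["big_play"] * 5 +
             ["team_superlative"] * 4 + ["dueling_scorers"] * 2)

counts = Counter(triggered)
total = sum(counts.values())
for lede_type, n in counts.most_common():
    print(f"{lede_type:>18}: {n / total:.1%}")
```

If the catch-all dominates the distribution like this, the story structure has far less empirical complexity than its option count suggests, and iteration effort belongs on the catch-all phrasing.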

II. Within-condition branching

Rather than creating an entirely new topic, using branching within an existing topic is also a way to increase complexity. The quintessential example of this type of differentiation is a branch with options for ‘rose/fell/held steady’ depending on the direction of a period-over-period change. An intermediate version of this branch can add options for things like ‘rose slightly’ or ‘fell significantly.’ A more advanced version could even add options for things like ‘more than tripled,’ ‘nearly doubled,’ or ‘fell by over half.’
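The rose/fell branch boils down to mapping a period-over-period ratio onto a phrase. This sketch uses illustrative thresholds (a production system would tune them and handle more cases):

```python
# Within-condition branching on a period-over-period change: the ratio
# new/old selects a phrase, from coarse buckets ("rose") to finer ones
# ("more than tripled"). All thresholds are illustrative assumptions.

def describe_change(old, new):
    if old <= 0:
        return "changed"  # real logic would handle zero/negative baselines
    ratio = new / old
    if ratio > 3:
        return "more than tripled"
    if ratio >= 1.9:
        return "nearly doubled"
    if ratio > 1.05:
        return "rose"
    if ratio >= 0.95:
        return "held steady"
    if ratio >= 0.5:
        return "fell"
    return "fell by over half"
```

The surrounding sentence stays static; only the verb phrase returned here changes, which is what distinguishes this from creating a whole new lede topic.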

In cases like this, most of the text for a given option remains static. All that will change is the word (or phrase) being impacted by the branch (i.e., replacing ‘rose’ with ‘fell’ when appropriate). Other examples of this type of complexity are:

  1. Player stat lines: depending on how a player performed, you might replace (for example) ‘had a 17-point, 13-rebound double-double’, with ‘had 28 points and seven assists’; sometimes you’ll want to talk about things like steals and blocks (if they exceed a specified threshold), sometimes you won’t
  2. Type of victory: might choose between things like ‘blew out’ or ‘came back to defeat’ or ‘upset’ or ‘snuck past’ depending on the nature of the win
  3. Type of scoring play: in a basketball context, you’d be choosing between things like ‘a deep 3-pointer’ or ‘a 15-foot pull-up jumper’ or ‘a driving finger roll’ or ‘a pair of free throws’ to describe a score

Wrapping It Up

The amount of complexity and variability that an automated story structure needs is always a function of use case. It’s also a function of the richness of the underlying data. You might want to increase topic-level complexity, but be constrained by data availability (e.g., you can’t talk about how single plays impact win probability without adding a win probability model to the data pre-processing step). The best examples of NLG combine rich data with nuanced conditional logic/story structures. While the heavy lifting of unearthing the most interesting insights can be automated, there’s still a human component that’s necessary to codify domain-specific business logic into all-encompassing story arcs.

With the right data and the right software (Wordsmith!), anyone who can write one great story can write millions of them. Using the best practices detailed above can help you keep those narratives nuanced and variable, even as the scale explodes.