On April 23rd, I had the honor of representing Automated Insights on a panel at the National Association of Broadcasters annual conference, NABShow, in Las Vegas, with folks from IBM Watson and Salesforce Einstein. I wasn’t there to talk about automating television from data, although — technically — that’s already possible. However, if there’s one thing I do know about video, it’s that there’s a lot of it. Way too much of it. And we don’t need any more of it, whether it’s humans or robots making it.

“It is estimated that media companies and user-generated content creates over 2 billion digital images and over 1 billion hours of video watch time every day.” — Justin Pang, head of publishing partnerships at Google

But more is coming. A lot more. I don’t have to tell you that a digital video revolution is underway. That you know. Video is no longer just there for our entertainment; it’s quickly catching up to written content as a primary means of communication.

What I do want to get across in this article is that we’re quickly approaching a critical point at which the explosion of unstructured data generated by digital video content will make it next to impossible to understand, utilize, or even recall most of the information contained in all of that video.

Automation can fix that.

Natural Language Generation (NLG) should be telling the stories behind all that video, using all that unstructured data. I’ve been automating content for seven years, since the very inception of our company, Automated Insights, and have produced billions of unique, insightful, human-sounding narratives from raw data for companies like the Associated Press and Yahoo.

All along, I’ve been fighting a battle for acceptance of automated content in the universe of traditional journalism.

Last week, the Associated Press published a report that neatly summarized that battle and declared it all but over. Augmented journalism, the term the AP uses for the integration of human and machine in the creation of news stories, is not meant to take journalism jobs away from humans, the report said. It should stand side by side with traditional journalism, incorporating the data science contemporary journalism requires while complementing the investigative process and conclusive reasoning inherent in the journalist’s job.

That’s the message I took to the Columbia School of Journalism in 2013, and to SXSW in 2016, and again to SXSW earlier this year when I spoke about the Automated Future of Journalism with executives from the Washington Post and the New York Times. Each time I relayed this message, it resonated to a greater degree with journalists and media executives.

However, at this year’s SXSW talk, I also started discussing the role of automation around video content. It’s something I touched on at the end of this interview with NPR in late March, and it was the focus of my comments at NABShow. I’ve been researching, strategizing, and prototyping for about a year now, and I’ve figured out where automation plays with video.

It’s not where you might think.

Despite recent media speculation, we’re not heading for a video future in which one automated talking head holds a conversation with another automated talking head. This isn’t happening. It’s the same kind of misunderstanding of the medium I had to debunk when people thought robots would be writing all the news, all the time.

This speculation ignores the fact that machines and humans will continue to work together, as they have throughout 100 years of automation history. I get it: ignoring the symbiotic working relationship between machine and human is easy. It makes for good dystopian movies and novels. But like I’ve said from the beginning of Automated Insights’ NLG adventure, if you want automation to work well, it has to be a partnership between human and machine. The focus shouldn’t be on making the machine independent; it should be on removing the most expensive, time- and resource-consuming tasks from the human’s plate.

Video Data: What we know now

As video publishing formats and distribution models evolve, we’re creating more meta-data around video content. In most cases, a lot of this meta-data isn’t automated, but it is required for the video to be published and discovered on channels like YouTube or Facebook.

We get what I call the basics: title, category, keywords, length, and even who is in the video and where and when it was shot. Quite a bit of content information can be gleaned from this meta-data. In a lot of cases, a description is also entered at publishing, although these descriptions can be lacking. They’re far from in-depth, they’re unstructured, and they’re usually an afterthought.
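To make those basics concrete, here’s a minimal sketch of what publish-time meta-data might look like as a structured record. The field names and example values are my own illustration, not any platform’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VideoMetadata:
    """Illustrative publish-time meta-data for a single video (hypothetical schema)."""
    title: str
    category: str
    keywords: List[str]
    duration_seconds: int
    people: List[str] = field(default_factory=list)  # who appears in the video
    location: Optional[str] = None                   # where it was shot
    shot_date: Optional[str] = None                  # when it was shot (ISO 8601)
    description: str = ""                            # free-text, unstructured afterthought

# Example record with invented values
clip = VideoMetadata(
    title="Panel: Automation and the Future of Broadcast",
    category="Technology",
    keywords=["NLG", "automation", "NABShow"],
    duration_seconds=2700,
    people=["Moderator", "Panelist A"],
    location="Las Vegas, NV",
    description="A conference panel on automation in media.",
)
print(clip.title, clip.keywords)
```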

As automation comes to digital video publishing, we’re starting to learn a lot more about the content itself, not just the meta-data.

Auto-captioning, using Natural Language Processing (NLP) on the audio portion of the video to capture what is being said, is now standard on Facebook. Of course, those auto-captions are far from accurate today, but they’re going to get better, and they have already fortified the concept of standard subtitle files (SubRip Text or SRT) for digital video.

It’s also a lot easier to edit a caption file than to create one, and the editing tools Facebook provides turn fixing an auto-generated caption file into a time commitment equal to or less than the length of the video itself. As those NLP algorithms improve, that editing time will shrink even further, and with the onset of Alexa, Siri, Google Home, and the rest, there’s a lot of incentive for them to improve quickly.
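To see just how simple, and how thin, that caption data is, here’s a minimal sketch of parsing an SRT file into timestamped cues. The parser is deliberately naive and assumes a well-formed file; the sample captions are invented.

```python
import re
from typing import List, Tuple

# Matches an SRT timestamp line, e.g. "00:00:01,000 --> 00:00:04,000"
TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})")

def parse_srt(text: str) -> List[Tuple[str, str, str]]:
    """Naive SRT parser: returns (start, end, caption_text) for each cue."""
    cues = []
    # Cues are separated by blank lines: index line, timestamp line, then text lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        match = TIMESTAMP.search(lines[1])
        if not match:
            continue
        start, end = match.groups()
        cues.append((start, end, " ".join(lines[2:])))
    return cues

sample = """1
00:00:01,000 --> 00:00:04,000
Hello and welcome to the panel.

2
00:00:04,500 --> 00:00:07,250
Today we're talking about automating video meta-data.
"""

for start, end, caption in parse_srt(sample):
    print(f"[{start} -> {end}] {caption}")
```

Because a cue is just timestamped text, fixing an auto-generated file is ordinary text editing, which is exactly why confirming captions is so much cheaper than transcribing them from scratch.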

What we will know soon

As the audio detection technology that provides the automation of these SRT files becomes more accurate and robust, video detection technology will be right behind it. This will allow for the automatic recognition of people and objects in the video, and will not only provide information as to who said what, but will also provide context, based on what objects are in the shot.

Again, this is not as dystopian as you might fear. Think of how Facebook identifies faces in photos, recognizes some, and prompts the publisher to name the others. The same process is happening in video, and it’s close, but not quite ready for prime time yet. And it’s much easier to confirm who’s in a video than to figure out who’s in it from scratch.

But if you’re shooting a video meant for publishing, chances are you already know who’s in it. What facial recognition can add is the “who said what” part of the equation.
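As a rough illustration of that detect-then-confirm workflow, here’s a minimal sketch using OpenCV’s bundled Haar cascade to count faces in sampled frames of a video. It only detects faces, it doesn’t identify anyone; the video filename, sampling rate, and detection parameters are illustrative choices, not a production setup.

```python
import cv2  # pip install opencv-python

# Bundled frontal-face Haar cascade; detection only, no identification.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def faces_per_sampled_frame(video_path: str, every_n_frames: int = 30):
    """Yield (frame_index, face_count) for sampled frames of a video."""
    capture = cv2.VideoCapture(video_path)
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % every_n_frames == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            yield frame_index, len(faces)
        frame_index += 1
    capture.release()

# Hypothetical usage: a human confirms names for frames where faces were detected.
# for index, count in faces_per_sampled_frame("panel_interview.mp4"):
#     if count:
#         print(f"frame {index}: {count} face(s) detected, ready for a human to name")
```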

And speaking of context, IBM Watson is working with facial sentiment recognition, which will provide another layer of context. Not only will we know what was said and who said it, but how it was said. Was it a joke? Was it exclamatory? Was it forceful? A hint? Answering these types of questions can help us get a handle on context that we don’t have, and might never have, with written content today.

IBM Watson is also working on facial sentiment analysis of the crowd, essentially the people in the frame who aren’t speaking, to provide context on how what was said was interpreted.

Finally, object recognition can give us another layer of context, with or without location meta-data, and can add details specific to the action taking place within the content. Are we in a gym? A restaurant? Is someone holding something? Did someone throw something?

What we can do

Once this information is automated and available, it can be used to further refine the categorization of the content of the video, a sort of audio and video topic modeling. The end result is that the more you know about the video before you watch it, the better you can determine whether it’s something you’ll want to watch.

With automation, unstructured data becomes structured. We’ll be able to auto-summarize video content the same way we auto-summarize written content, allowing for a much broader and richer viewing experience, more meaningful engagement with video, and more useful information delivered to the end user in a shorter amount of time.
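As a toy example of unstructured data becoming structured, the sketch below folds caption text and detected object labels into a single structured record with crude keyword topics. Real topic modeling and summarization would be far more sophisticated; the stopword list, captions, and labels here are invented for illustration.

```python
import json
from collections import Counter

# Tiny illustrative stopword list, not a real NLP resource.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "we", "into", "on"}

def structure_video_record(captions, detected_objects, top_n=5):
    """Toy example: derive crude topic keywords from caption text and
    bundle them with detected object labels into one structured record."""
    words = [
        word.strip(".,!?").lower()
        for caption in captions
        for word in caption.split()
    ]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return {
        "topics": [word for word, _ in counts.most_common(top_n)],
        "objects": sorted(set(detected_objects)),
        "caption_count": len(captions),
    }

captions = [
    "Welcome to the panel on automating video meta-data.",
    "Automation turns unstructured video data into structured data.",
]
detected = ["person", "microphone", "person", "stage"]

print(json.dumps(structure_video_record(captions, detected), indent=2))
```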

You’ll also be more likely to find it using existing search algorithms, which means you’ll be able to find exactly what you need, at the exact point in the video where it exists. If you’ve ever done a video search, you know it can be a time-consuming process, with a number of false positives that have to be eliminated manually.

With structured data describing the content, searching a video becomes as straightforward and accurate as searching a document.
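And once that structured, timestamped data exists, video search really does start to look like document search. Here’s a toy keyword search over timestamped captions that returns the exact points in the video where a term appears; the caption data is invented for illustration.

```python
from typing import List, Tuple

# (start_timestamp, caption_text) pairs, e.g. as parsed from an SRT file.
Captions = List[Tuple[str, str]]

def search_captions(captions: Captions, query: str) -> Captions:
    """Return the timestamped captions whose text contains the query."""
    query = query.lower()
    return [(start, text) for start, text in captions if query in text.lower()]

captions = [
    ("00:00:01,000", "Welcome to the panel on automation."),
    ("00:12:42,500", "Auto-captioning is now standard on Facebook."),
    ("00:31:05,000", "Object recognition adds another layer of context."),
]

for start, text in search_captions(captions, "object recognition"):
    print(f"Jump to {start}: {text}")
```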

Why we should do it

When I talk about automating narrative content from data, especially in a media and journalism context, I often come back to increasing an organization’s reach, depth, and speed in bringing unique, relevant, and personalized information to its audience.

We do this very well with written content today. But as video eclipses the written word as the preferred method of information delivery, and we can all agree that’s happening at a pace we didn’t imagine just five years ago, we need to be able to do the same with video content.

Earlier this year, Automated Insights co-hosted a hackathon with the Amazon Alexa team and 15 teams ranging from startups to Fortune 500 companies. Using our NLG technology and Alexa’s Natural Language Processing (NLP) and speech technology, we created mind-blowing applications that allowed end users to receive spoken personalized news, financial, school, weather, and all sorts of other information, just by asking a question.

This is where we’re going with video, and this is why we need to tell these data stories around that video.