Finance Questions, Answered: Can We Mine Text for Economic and Financial Insights?


Gordon Phillips
Laurence F. Whittemore Professor of Business Administration

Gordon Phillips had a problem. He was trying to answer financial and economic questions about corporations and industries, but the standard data was inadequate for many firms. The Standard Industrial Classification (SIC) system and the North American Industry Classification System (NAICS), which federal statistical agencies use to collect, analyze, and publish statistical data on the U.S. economy, couldn’t differentiate between, say, Microsoft and Apple. The systems classified them both as being in the computer industry, even though each offers many products and services beyond traditional computers. It’s hard to compare firms when you don’t know what they actually make.

While other academics might have moved on to questions with better data, Phillips and his colleague Gerard Hoberg created an entirely novel system that could produce the data they needed. They began analyzing text in 2008, making them among the first to use text as data in economics and finance. That work has grown to become the Hoberg-Phillips Data Library, a website with 48,000 distinct users from more than 100 countries. These visitors, who include academics, practitioners and government researchers, flock to the data library because it is the only place on earth with text-based network industry classifications (TNIC). Over the years, Phillips and Hoberg have used natural language processing and machine learning to analyze the text of the product descriptions in corporations’ 10-K filings. Through this process, they identify similarities in the product descriptions of different firms and assign a numerical value to how close or far the firms are from each other, based on the products they sell. This comparative data is then visualized in a spatial format that shows clusters of firms making similar products, or concentrations of firms with little competition.
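For readers curious how a numerical closeness score between two product descriptions can be computed, here is a minimal sketch. It assumes cosine similarity over simple word counts, which is one common way to compare texts; the article does not specify the exact measure used in the data library, and the firm names, descriptions, and stop-word list below are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a far larger one.
STOP_WORDS = {"we", "and", "the", "of", "a"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts (0 to 1)."""
    va, vb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented product descriptions standing in for 10-K text.
descriptions = {
    "FirmA": "We design and sell smartphones, tablets and personal computers.",
    "FirmB": "We sell personal computers, tablets and cloud software.",
    "FirmC": "We operate restaurants and sell coffee and baked goods.",
}

# Score every pair of firms once.
names = sorted(descriptions)
similarity = {
    (x, y): cosine_similarity(descriptions[x], descriptions[y])
    for i, x in enumerate(names)
    for y in names[i + 1:]
}

for (x, y), score in similarity.items():
    print(f"{x} vs {y}: {score:.3f}")
```

In this toy example, the two firms that both describe computers and tablets score closer to each other than either does to the restaurant firm, which is the intuition behind grouping firms by what they say they sell.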

“One important implication for economics and finance is, if firms are close together, their stock prices should co-move,” Phillips explains. “If there are too many firms close together, their profits should be low, because you’ve got too many competitors doing the same thing.” In other words, the TNIC system can give people a precise measure of product differentiation, and differentiation is the best predictor of profits. A free website that predicts corporate profits? No wonder it has so many visitors.

But Phillips and Hoberg didn’t stop with TNIC. In fact, it led to a one-million-dollar grant from the National Science Foundation, and nearly a dozen academic papers on topics such as industry momentum, product market threats, booms and busts, and mergers and acquisitions, to name a few. More recently, Phillips and Hoberg have deployed cutting-edge machine learning techniques to analyze even broader datasets, such as the texts of patents (to determine firms’ IP vulnerability) and text on the internet (to examine, say, firm entry threats from private firms and ESG corporate claims). “We’re getting more sophisticated and analyzing more patterns with more data,” Phillips says. “Right now, we’ve got 14 terabytes of data on the Dartmouth machines.”

While they are among the first to use natural language processing in finance and economics, Phillips and Hoberg don’t claim to have invented using text as data. It’s happening all over. Some online services use it to compare users’ self-descriptions and make connections. Universities use it to detect plagiarism by comparing the text in students’ papers to text on the internet. And of course, the CIA and NSA are using it to gather intelligence. But “in the business context, we’re pretty unique,” Phillips says. “We are helping lead the charge in this area.”

This article was originally published in print in Tuck Today magazine.
