Firms apply for patents for a variety of reasons. In addition to trying to prevent others from copying a commercialized patented invention, firms also patent one technology to support the commercialization of their other innovations (which we call “pre-emptive patents”). At the same time, many patent applications fail to yield commercialized products. A variety of policy debates revolve around the relative rates of these different uses of patents. In particular, there is a concern that a large number of patents are not commercialized (a common folk statistic holds that 10% or fewer are), which suggests over-patenting, with possible adverse effects on the innovation system. However, it has been difficult to get systematic evidence on the relative incidence of commercialized versus pre-emptive or failed patents. This paper applies advanced natural language processing (NLP) and machine learning (ML) methods to US patent documents to estimate, at scale and over time, the rates of different types of patent use. We first use a survey of US inventors on the commercialization and other outcomes of their patents as independently labeled data for training our ML models. Using a combination of contextual embeddings of the patent text (based on BERT) and bibliometric indicators from the patent documents, we develop a random forest model that predicts different outcomes for patented inventions. We find that adding BERT encodings of patents’ text offers new information beyond commonly used numeric and categorical variables reflecting patent characteristics (technology class, number of claims, patent class span, etc.) or token-based text analytics indicators, highlighting the benefits of adding contextual embedding NLP when categorizing patents.
We check the validity of the trained model using external data on Virtual Patent Marking and show that our model predicts high commercialization rates among patents that are indeed associated with commercialized products. We apply this trained model to the universe of all granted US patents, 1981-2015, to estimate the probabilities of various uses of the patents in this population. These estimates of usage probabilities are publicly available for other researchers to use in follow-on studies. Among US patents granted 1981-2015, the mean probability of commercialization is 0.59, and the mean probability of pre-emptive use is 0.19. Finally, we show how these probabilities vary by year, patent class, firm size, and government interest. The paper makes several contributions to understanding the uses of patents, as well as how to use ML to analyze patent data. In particular, we show that estimated rates of commercialized patents are substantially higher than is often asserted in policy discussions.
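For readers who want to work with per-patent usage probabilities of the kind described above, the following minimal sketch shows one way to aggregate them into mean rates by group (here, grant year). It is an illustration under stated assumptions, not the authors' code: the record layout, the function name `mean_rates_by_group`, and all numeric values are hypothetical and are not taken from the paper's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (grant_year, p_commercialized, p_preemptive).
# The values are made up for illustration; they are NOT the paper's estimates.
records = [
    (1981, 0.50, 0.25),
    (1981, 0.70, 0.15),
    (2015, 0.60, 0.20),
    (2015, 0.40, 0.30),
    (2015, 0.80, 0.10),
]


def mean_rates_by_group(rows):
    """Return {group: (mean p_commercialized, mean p_preemptive)}."""
    grouped = defaultdict(list)
    for group, p_comm, p_pre in rows:
        grouped[group].append((p_comm, p_pre))
    # Average each probability separately within every group.
    return {
        group: (
            mean(p_comm for p_comm, _ in probs),
            mean(p_pre for _, p_pre in probs),
        )
        for group, probs in sorted(grouped.items())
    }


rates = mean_rates_by_group(records)
for year, (comm, pre) in rates.items():
    print(f"{year}: commercialized={comm:.2f}, pre-emptive={pre:.2f}")
```

The same grouping could be applied to patent class, firm size, or government interest by changing the first field of each record.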
Contact person: Elisabeth Hofmeister
To subscribe to the invitation mailing list or find more information, see the seminar page.