Article
Extracting and matching patent in-text references to scientific publications
Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) (2019)
  • Suzan Verberne, Leiden University
  • Ioannis Chios, Leiden University
  • Jian Wang, Leiden University
Abstract
References in patent texts to scientific publications are valuable for studying the links between science and technology but are difficult to extract. This paper tackles this challenge: we extract references embedded in USPTO patent full texts and match them to Web of Science (WoS) publications. We approach the reference extraction problem as a sequence labelling task, training CRF and Flair models. We then match references to the WoS using regular expression patterns. We train and evaluate the reference extraction models using cross-validation on a sample of 22 patents with 1,952 manually annotated in-text references. Then we apply the models to a large collection of 33,338 biotech patents. We find that CRF obtains better results on citation extraction than Flair, with precision scores of around 90% and recall of around 85%. However, Flair extracts many more references from the large collection than CRF, and more of those can be matched to WoS publications. We find that 88% of the extracted in-text references are not listed on the patent front page, suggesting distinct roles played by in-text and front-page references. CRF and Flair collectively extract 603,457 references to WoS publications that are not listed on the front page. Relative to the 1.17 million front-page references in the collection, this is a 51% increase in identified patent–publication links compared with relying on front-page references alone.
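To illustrate the sequence-labelling framing described in the abstract, here is a minimal sketch of how per-token labels can be turned back into reference strings. It assumes a simple B/I/O tagging scheme, which the abstract does not specify. The function name `bio_to_spans` and the example sentence are hypothetical and are not taken from the paper or its data.

```python
from typing import List


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Join tokens tagged B (begin) / I (inside) into reference strings.

    Assumed scheme: "B" starts a reference, "I" continues it, and any
    other tag (e.g. "O") is outside a reference.
    """
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new reference starts; close any open one first.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the reference that is currently open.
            current.append(token)
        else:
            # Outside a reference (or a stray "I"): close any open span.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Made-up example sentence with hand-assigned labels (not model output).
tokens = "as described in Smith et al. , Nature 410 : 120 ( 2001 ) and elsewhere".split()
tags = ["O", "O", "O", "B"] + ["I"] * 10 + ["O", "O"]
references = bio_to_spans(tokens, tags)
print(references)
```

In a setup like the one the abstract describes, the labels would come from a trained sequence labeller such as a CRF or Flair model, and the recovered strings would then be passed to the pattern-based matching step.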
Keywords
  • citations
  • patents
  • sequence labelling
Publication Date
July 25, 2019
Citation Information
Suzan Verberne, Ioannis Chios and Jian Wang. "Extracting and matching patent in-text references to scientific publications." Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) (2019), pp. 56-69
Available at: http://works.bepress.com/jwang/26/