
Erik Buunk, M.Sc.
Project/Data Science Officer
Innovation and Entrepreneurship Research
+49 89 24246-583
erik.buunk(at)ip.mpg.de
Areas of Interest
Data Science, Information Management, Business Consultancy, Project Management, Process optimization (Lean Six Sigma), Software development, Data Visualization, Graphic Design, Geographical Information Systems (GIS)
Résumé
since 2021
Project/Data Science Officer at the Max Planck Institute for Innovation and Competition (Innovation and Entrepreneurship Research)
2019 – 2020
Institute Fellow, Institute for Quantitative Social Science, Harvard University, MA, USA
2016 – 2019
Information management consultant, Security Region Utrecht, Netherlands
2011 – 2016
Senior IT Consultant, Municipality of Amersfoort, Netherlands
2007 – 2011
Information Consultant/Project manager Social Services, Municipality of Amersfoort, Netherlands
2003 – 2006
Graphic Design
1994 – 1996
Environmental Sciences, M.Sc. Degree (1996)
1991 – 1994
Science, Business and Administration, Propaedeutic Diploma, with Distinction (1992)
Publications
Conference papers
(2024). Logic Mill - A Knowledge Navigation System, CEUR Workshop Proceedings 3775, 25-35.
- Logic Mill is a scalable and openly accessible software system that identifies semantically similar documents within either one domain-specific corpus or multi-domain corpora. It uses advanced Natural Language Processing (NLP) techniques to generate numerical representations of documents. It leverages a large pre-trained language model to generate these document representations. The system focuses on scientific publications and patent documents and contains more than 200 million documents. It is easily accessible via a simple Application Programming Interface (API) or via a web interface. Moreover, it is continuously being updated and can be extended to text corpora from other domains. We see this system as a general-purpose tool for future research applications in the social sciences and other domains.
- https://ceur-ws.org/Vol-3775/paper7.pdf
- Also published as: arXiv preprint 2301.00200
- Event: 5th Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech 2024) co-located with the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024), Washington D.C., 2024-10-24
Further Publications, Press Articles, Interviews
(2024). Tracing the Flow of Knowledge From Science to Technology Using Deep Learning. Project Report for the Academic Research Programme of the EPO, 2024.
- This project aims to enhance the tracing of knowledge flows from scientific research to technology using advanced deep-learning techniques. By developing models such as Pat-SPECTER and PaECTER, the project seeks to improve the accuracy of identifying connections between patents and scientific literature, surpassing the limitations of traditional citation-based analysis. Key findings include the analysis of the performance of Pat-SPECTER in predicting scientific citations for patents and the performance of PaECTER in predicting citations among patents. The evaluation process was made difficult by the incompleteness of the open-access database OpenAlex, which lacks abstracts for a portion of the scientific literature. Real-world tests demonstrated Pat-SPECTER’s effectiveness in identifying relevant prior art documents (patents and publications), improving the efficiency of prior art search. The project highlights the potential of advanced machine learning models and advances their use in tracing knowledge flows. It provides tools that can enhance patent examination processes, innovation tracking, and research and development strategies. These efforts help foster innovation by revealing the intricate connections between science and technology.
- https://link.epo.org/elearning/en-ARP2021_Harhoff.pdf
Discussion Papers
(2025). Tracing the Flow of Knowledge From Science to Technology Using Deep Learning, arXiv:2512.24259.
- We develop a language similarity model suitable for working with patents and scientific publications at the same time. In a horse race-style evaluation, we subject eight language (similarity) models to predict credible Patent-Paper Citations. We find that our Pat-SPECTER model performs best, which is the SPECTER2 model fine-tuned on patents. In two real-world scenarios (separating patent-paper-pairs and predicting patent-paper-pairs) we demonstrate the capabilities of the Pat-SPECTER. We finally test the hypothesis that US patents cite papers that are semantically less similar than in other large jurisdictions, which we posit is because of the duty of candor. The model is open for the academic community and practitioners alike.
(2025). Educational Backgrounds in Inventor Teams: The Role of Complementarities between Academic and Vocational Education in Team Performance, Swiss Leading House "Economics of Education" Working Paper, No. 248. Zürich: Universität Zürich, IBW - Institut für Betriebswirtschaftslehre.
- This paper analyzes whether inventor teams composed of members with diverse educational backgrounds, both academic and vocational, exhibit higher performance than teams with the same educational backgrounds. To exploit the different educational backgrounds among patent inventors in Switzerland, we construct a unique dataset of 35,486 inventors. This dataset links individual patenting activities from European Patent Office data from 1980–2021, with detailed biographical information obtained from LinkedIn. Using a supermodularity framework to assess complementarity, we find that inventor teams composed of members with academic and vocational backgrounds (as opposed to members with the same background) achieve higher team performance, measured by the quality of their jointly filed patents. This complementarity is even stronger in teams with at least one team member from a University of Applied Sciences. Further analysis reveals heterogeneous effects across technological fields. Overall, our findings show the importance of strategically combining different educational backgrounds in inventor teams, thereby highlighting the value of maintaining a balanced educational landscape.
- http://repec.business.uzh.ch/RePEc/iso/leadinghouse/0248_lhwpaper.pdf
(2024). PaECTER: Patent-level Representation Learning using Citation-informed Transformers, arXiv preprint 2402.19411.
- PaECTER is a publicly available, open-source document-level encoder specific for patents. We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations for patent documents. PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain. More specifically, our model outperforms the next-best patent-specific pre-trained language model (BERT for Patents) on our patent citation prediction test dataset on two different rank evaluation metrics. PaECTER predicts at least one most similar patent at a rank of 1.32 on average when compared against 25 irrelevant patents. Numerical representations generated by PaECTER from patent text can be used for downstream tasks such as classification, tracing knowledge flows, or semantic similarity search. Semantic similarity search is especially relevant in the context of prior art search for both inventors and patent examiners. PaECTER is available on Hugging Face.
(2022). Logic Mill - A Knowledge Navigation System, arXiv preprint 2301.00200.
- Logic Mill is a scalable and openly accessible software system that identifies semantically similar documents within either one domain-specific corpus or multi-domain corpora. It uses advanced Natural Language Processing (NLP) techniques to generate numerical representations of documents. Currently it leverages a large pre-trained language model to generate these document representations. The system focuses on scientific publications and patent documents and contains more than 200 million documents. It is easily accessible via a simple Application Programming Interface (API) or via a web interface. Moreover, it is continuously being updated and can be extended to text corpora from other domains. We see this system as a general-purpose tool for future research applications in the social sciences and other domains.
- https://doi.org/10.48550/arXiv.2301.00200
- Also published in: CEUR Workshop Proceedings 3775
Presentations
23.09.2024
Tracing the Flow of Knowledge from Science to Technology Using Deep Learning
Research Seminar
Martinsried/Planegg
27.03.2024
Revelio data - First Impressions
Research Seminar
Tutzing
25.03.2024
Logic Mill & Project Management
Research Seminar
Tutzing
19.09.2023
Logic Mill / Tracing The Flow Of Knowledge
Research Seminar
Ringberg Castle
18.09.2023
Research Project Management
Research Seminar
Ringberg Castle
12.07.2023
Tracing The Flow Of Knowledge
EPO/ARP Workshop
online
03.07.2023
Tracing The Flow Of Knowledge
Poster presentation
Board of Trustees, Max Planck Institute for Innovation and Competition
Munich
27.02.2023
Startup Data
Research Seminar
Frauenchiemsee
06.09.2022
Startup Data Project and GDPR
Research Seminar
Bernried
04.07.2022
Logic Mill – Applications of Machine Learning to Patents, Publications, and Other Text Corpora
Poster presentation
Board of Trustees, Max Planck Institute for Innovation and Competition
Munich
09.06.2022
Logic Mill – Applications of Machine Learning to Patents, Publications, and Other Text Corpora
Poster presentation
Munich Summer Institute
Munich
13.04.2022
New Data Sources
Research Seminar
Ohlstadt
02.12.2021
Logic Mill
Research Seminar
Ringberg Castle
01.12.2021
Dataroom Reproducibility
Research Seminar
Ringberg Castle
01.10.2021
Tools and Resources for Reproducibility
Research Seminar
Feldkirchen-Westerham
30.09.2021
Logic Mill
Research Seminar
Feldkirchen-Westerham
27.07.2021
Information and Data Management at MPI-IC: Human Research Data in Practice
online
06.07.2021
Logic Mill, Applications of Machine Learning to Patents, Publications, and Other Text Corpora
online
26.03.2021
Replicability
Research Seminar
online