RESEARCH

Publications

Informing climate risk analysis using textual information - A research agenda (2024)
With Malte Schierholz, Bolei Ma, Jacob Beck, Andreas Dimmelmeier, Hendrik Christian Doll, Maurice Fehr, Frauke Kreuter, and Alex Fraser. 

Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024) @ ACL 2024, available via this link.

This project is part of the larger research agenda GIST - Greenhouse Gas Insights and Sustainability Tracking, a research collaboration between Deutsche Bundesbank and LMU Munich to generate high-quality, granular firm-level emissions and sustainability data. More information can be found here.

Abstract:
We present a research agenda focused on efficiently extracting, assuring quality, and consolidating textual company sustainability information to address urgent climate change decision-making needs. Starting from the goal to create integrated FAIR (Findable, Accessible, Interoperable, Reusable) climate-related data, we identify research needs pertaining to the technical aspects of information extraction as well as to the design of the integrated sustainability datasets that we seek to compile. Regarding extraction, we leverage technological advancements, particularly in large language models (LLMs) and Retrieval-Augmented Generation (RAG) pipelines, to unlock the underutilized potential of unstructured textual information contained in corporate sustainability reports. In applying these techniques, we review key challenges, which include the retrieval and extraction of CO2 emission values from PDF documents, especially from unstructured tables and graphs therein, and the validation of automatically extracted data through comparisons with human-annotated values. We also review how existing use cases and practices in climate risk analytics relate to choices of what textual information should be extracted and how it could be linked to existing structured data.
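
To give a flavour of the retrieval step such a pipeline involves, here is a minimal sketch, assuming pypdf and scikit-learn and a made-up report file name; the TF-IDF ranking stands in for the embedding-based retrieval a full RAG setup would typically use and is not the project's actual implementation.

```python
# Minimal retrieval sketch: rank report pages by similarity to an
# emissions-related query, then hand the top pages to an LLM prompt.
from pypdf import PdfReader
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

QUERY = "Scope 1 Scope 2 Scope 3 greenhouse gas emissions tCO2e"

def retrieve_candidate_pages(pdf_path: str, top_k: int = 3) -> list[tuple[int, str]]:
    """Return the top_k report pages most similar to the emissions query."""
    pages = [page.extract_text() or "" for page in PdfReader(pdf_path).pages]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(pages + [QUERY])
    scores = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()
    top = sorted(enumerate(scores), key=lambda s: s[1], reverse=True)[:top_k]
    return [(page_no, pages[page_no]) for page_no, _ in top]

# The retrieved pages would then be placed into an LLM prompt asking for the
# value and unit of each emission scope, with the answers validated against
# human-annotated values.
```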

Working papers

Do Investors Use Sustainable Assets as Carbon Offsets? (2024)
With Jakob Famulok and Daniel Worring.

SAFE Working Paper No. 431, available at SSRN.

Abstract:
We present novel evidence that retail investors attempt to offset their carbon footprints by investing sustainably. Using highly granular transaction data from bank clients, we find that higher footprints are linked to greener portfolios. In an experiment with clients from the same bank, we show that an exogenous shock to the salience of participants' own emissions causally shifts sustainable asset allocations upward. Finally, we identify a substitution effect between offsetting through donations and offsetting through sustainable assets. Our findings add to the understanding of the behavioral drivers of sustainable investing, which is crucial for designing effective policies that align financial markets with environmental goals.

Addressing Data Gaps in Sustainability Reporting: A Benchmark Dataset for Greenhouse Gas Emission Extraction (2024)
With Jacob Beck, Anna Steinberg, Andreas Dimmelmeier, Laia Domenech Burin, Maurice Fehr, and Malte Schierholz.

Under review at Scientific Data (Nature).

This project is part of the larger research agenda GIST - Greenhouse Gas Insights and Sustainability Tracking, a research collaboration between Deutsche Bundesbank and LMU Munich to generate high-quality, granular firm-level emission and sustainability data. More information can be found here.

Abstract:
Company-level greenhouse gas (GHG) emissions data are essential for various stakeholders due to their relevance in addressing the climate crisis. However, obtaining reliable emissions data remains a significant challenge, as existing datasets are often fragmented, inconsistent, and reliant on opaque methodologies. To address this gap, we provide a gold standard dataset containing emission metrics extracted from 139 sustainability reports collected from company websites. This gold standard dataset is an intermediate step and can be used to validate and fine-tune models to build high-quality extraction pipelines that scale to extracting emission metrics from tens of thousands of sustainability reports. Using a data extraction pipeline powered by a Large Language Model (LLM), we automatically extract emission metrics. The metrics are assessed and corrected independently by two human non-expert annotators. Reports with full agreement are added directly to the gold standard; those with discrepancies are reviewed by two groups of expert annotators. Remaining disagreements between experts are resolved through in-person expert discussions. This sequential process ensures high data quality while reducing reliance on human experts. The resulting dataset constitutes a benchmark for human and automated annotation. It bears significant reuse potential for information extraction tasks in a sustainable finance context as well as for other downstream tasks such as greenwashing analysis.
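
A minimal sketch of the sequential agreement logic described above, with hypothetical field names; the published dataset's schema and annotation tooling may differ.

```python
# Route a report depending on whether the two non-expert annotations agree.
from dataclasses import dataclass

@dataclass(frozen=True)
class EmissionAnnotation:
    scope: str    # e.g. "scope_1"
    value: float  # reported emissions figure
    unit: str     # e.g. "tCO2e"

def route_report(annotator_a: set[EmissionAnnotation],
                 annotator_b: set[EmissionAnnotation]) -> str:
    """Full agreement goes straight to the gold standard; anything else is escalated."""
    if annotator_a == annotator_b:
        return "gold_standard"
    return "expert_review"  # then expert groups, then in-person discussion if needed
```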

The "President Reacts to News" Channel of Government Communication (2023)
With Farshid Abdi, Loriana Pelizzon, Mila Getmansky Sherman, and Zorka Simon. 

SAFE Working Paper No. 314, available at SSRN.
Revise & Resubmit at Management Science. 

Abstract:
Studying about 1,200 economy-related tweets of President Trump, we establish the "President reacts to news" channel of stock returns. Using high-frequency identification of market movements and machine learning to classify the topics and textual sentiment of tweets, we address the observed heterogeneity in the aggregate stock market response to these messages. After controlling for market trends preceding tweets, we find that 80% of tweets are reactive and predictable rather than novel and informative. The exceptions are trade war tweets, where the President has direct policy authority, and his tweets can reveal investable private information or information about his policy function.

Do Gamblers Invest in Lottery Stocks? (2023)
With Tobin Hanspal and Andreas Hackethal. 

SAFE Working Paper No. 373, available at SSRN.
Reject & Resubmit at Management Science.

Abstract:
Previous studies document a relationship between gambling activity at the aggregate level and investments in securities with lottery-like features. We combine data on individual gambling consumption with portfolio holdings and trading records to examine whether gambling and trading act as substitutes or complements. We find that gamblers are more likely than the average investor to hold lottery stocks, but significantly less likely than active traders who do not gamble. Our results suggest that gambling behavior across domains is less relevant than other portfolio characteristics in predicting investment in high-risk and high-skew securities, and that gambling on and off the stock market act as substitutes for satisfying the same need, e.g., sensation seeking.

Policy papers and gray literature

Houston, we have a problem: Can satellite information bridge the climate-related data gap? (2024)
With Andrés Alonso Robisco, José Manuel Carbó Martínez, and Elena Triebskorn.

Documento Ocasional No. 2428, available via Banco de España and Deutsche Bundesbank.
Under review at LAJCB (Latin American Journal of Central Banking).

Abstract:
Central banks and international supervisors have identified the difficulty of obtaining climate information as one of the key obstacles to the development of green financial products and markets. To bridge this data gap, the use of satellite information from Earth Observation (EO) systems may be necessary. To better understand this process, we analyse the potential of applying satellite data to green finance. First, we summarise the policy debate from a central banking perspective. We then briefly describe the main challenges for economists in dealing with the EO data format and quantitative methodologies for measuring its economic materiality. Finally, using topic modelling, we perform a systematic literature review of recent academic studies to identify the research areas in which satellite data are currently being used in green finance. We find the following topics: physical risk materialisation (including both acute and chronic risk), deforestation, energy and emissions, agricultural risk, and land use and land cover. We conclude with a comprehensive analysis of the financial materiality of this alternative data source, a mapping of these application domains to new green financial instruments and markets under development, such as thematic bonds or carbon credits, and some key considerations for policy discussion.
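
As an illustration of the topic-modelling step, here is a short sketch using scikit-learn's LDA on a made-up corpus; the paper's actual model, corpus, and preprocessing choices are not reproduced here.

```python
# Fit a small LDA model over example abstracts and print the top terms per topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [
    "flood exposure of mortgage portfolios estimated from satellite imagery",
    "deforestation alerts from Sentinel-2 linked to commodity supply chains",
    "methane plumes detected by Earth observation and firm-level emissions",
]

counts = CountVectorizer(stop_words="english")
doc_term = counts.fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(doc_term)

terms = counts.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {k}: {', '.join(top_terms)}")
```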

The climate data iceberg – A depth of information to integrate (2024)
With Hendrik Christian Doll, Susanne Walter, and Gabriela Alves Werb.

Forthcoming in the IFC Bulletin on the 12th biennial IFC Conference on "Statistics and beyond: new data for decision making in central banks"; slides available via this download link.

Abstract:
Central banks need climate-related data to align evidence-based climate change considerations with their core tasks. While structured data from administrative and proprietary sources are limited and contain considerable gaps, a wealth of climate-related information is dispersed and lies below the surface in unstructured form, such as sustainability reports or satellite images. To characterise this situation, we introduce the image of the climate data iceberg. Information from unstructured sources can bridge current data gaps and enhance the usability of existing data by improving its accuracy, extending its scope, and reducing data sharing barriers. In this paper, we discuss the challenges and opportunities central banks and supervisors face in leveraging this unstructured information for climate analysis and research. We further investigate how innovative efforts between central banks and other institutions can help generate actionable and usable climate-related data, exemplified by our own experiences and early-stage learnings from such collaborations.

Extracting Data Citations with Large Language Models (2024)
With Hendrik Christian Doll and Sebastian Seltmann.

Abstract:
Empirical researchers and research data centers (RDCs) face challenges in efficiently understanding and categorizing data sources and methodologies used in scholarly papers. This process currently relies on human readers and is time-consuming and prone to errors. To address this, we explore the potential of using Large Language Models (LLMs), specifically GPT-3.5, to automate the identification and categorization of research data sources. We analyze the accuracy of GPT-3.5 in detecting and summarizing data sources and methods in economics and finance papers. By employing web-scraping techniques, we collect a comprehensive sample of research papers and create human-labeled validation datasets. We evaluate the detection and prediction accuracy and address the issue of false answers provided by the model. Additionally, we assess the pre-processing requirements of GPT-3.5 for cost-effective implementation. Our paper also provides a guide for implementing our proposed solution at research institutions and RDCs worldwide, aiming to enhance data analysis and research data provision services.
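
A minimal sketch of the detection step, assuming the openai>=1.0 Python client with an API key in the environment; the prompt wording and model configuration are illustrative, not the project's exact setup.

```python
# Ask GPT-3.5 to list and classify the data sources mentioned in a paper excerpt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "List every dataset or data source used in the following excerpt of an "
    "economics or finance paper, one per line, and classify each as "
    "administrative, proprietary, or public. Answer 'none' if no data are used.\n\n{text}"
)

def detect_data_sources(paper_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(text=paper_text[:12000])}],
        temperature=0,
    )
    return response.choices[0].message.content
```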

Selected work in progress

Extraction of CO2 emissions from corporate sustainability reports (2024)
With Malte Schierholz, Anna Steinberg, Jacob Beck, Laia Domenech Burin, and Lisa Reichenbach.

Accepted for presentation at the 65th ISI World Statistics Congress 2025, The Hague, NL.

This project is part of the larger research agenda GIST - Greenhouse Gas Insights and Sustainability Tracking, a research collaboration between Deutsche Bundesbank and LMU Munich to generate high-quality, granular firm-level emission and sustainability data. More information can be found here.

Abstract:
Financial regulators and central banks are increasingly integrating sustainability aspects into their operations, but significant data gaps remain. The EU Corporate Sustainability Reporting Directive (CSRD) requires all large European enterprises to publish their greenhouse gas emissions (CO2 equivalents) annually in their management report, annual report, or sustainability report. The amount of information available, i.e., the value and unit for each scope (Scope 1 direct emissions, Scope 2 indirect energy-related emissions, and Scope 3 other indirect emissions), is immense, but the data are spread over thousands of PDF documents published online on company websites, historically often without adherence to official standards or guidelines. Until now, private companies have extracted carbon emissions and other indicators from these PDF documents and sold them in a structured, tabular data format to the Bundesbank and other public authorities. However, although value extraction from PDF documents appears straightforward, the agreement between values extracted by different providers is rather low. Given this situation, we leverage Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) to build several fully automated data extraction pipelines, which are then compared with data bought from private providers and evaluated against a specially curated gold standard dataset of our own. We share open-source software with the community that enables anyone to extract CO2-related indicators from company sustainability reports.
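
To illustrate the evaluation step, a small sketch assuming a hypothetical tabular layout with columns company, year, scope, and value; the project's actual evaluation code may differ.

```python
# Compare extracted emission values against a reference table within a tolerance.
import pandas as pd

def agreement_rate(pipeline: pd.DataFrame, reference: pd.DataFrame,
                   tol: float = 0.01) -> float:
    """Share of (company, year, scope) cells where the two sources agree within a relative tolerance."""
    merged = pipeline.merge(reference, on=["company", "year", "scope"],
                            suffixes=("_pred", "_ref"))
    rel_err = (merged["value_pred"] - merged["value_ref"]).abs() / merged["value_ref"].abs()
    return float((rel_err <= tol).mean())

# The same function can score the pipeline against the gold standard and
# against commercially purchased data, making inter-provider reliability measurable.
```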

ClimXtract: An open-source data extraction pipeline for company-level greenhouse gas emissions (2024)
With Anna Steinberg, Laia Domenech Burin, Ailin Liu, Malte Schierholz, Andreas Dimmelmeier, Lisa Reichenbach, and Maurice Fehr.

Accepted for presentation at DagStat 2025 (7th Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik, Berlin, Germany).

This project is part of the larger research agenda GIST - Greenhouse Gas Insights and Sustainability Tracking, a research collaboration between Deutsche Bundesbank and LMU Munich to generate high-quality, granular firm-level emission and sustainability data. More information can be found here.

Abstract:
Developing methods to supervise and regulate companies' contributions to the climate crisis through emissions requires access to consistent and reliable data on company-level greenhouse gas (GHG) emissions. Currently, companies publish their sustainability reports in PDF format, which document GHG releases in a non-standardized and unstructured manner, and upload them to their own websites rather than a central repository. Commercial data providers collect GHG emission data from the PDF files and other sources through non-transparent methods, raising doubts concerning the validity of the traded data. As an alternative, we present an open-source data extraction pipeline, which extracts the emissions from a given corporate sustainability report using a Large Language Model (LLM). The pipeline (1) identifies relevant pages in a sustainability report, (2) prompts the LLM for the emission value and unit of each scope (direct, indirect energy-related, or other indirect emissions), and (3) parses the output to save the emission data in a database. Since emission values are often captured in tables inside company reports, we augment our pipeline with a table-specific extraction routine. We evaluate our pipeline using a curated gold standard dataset validated in a multi-stage annotation process. Our pipeline scales easily and is set up modularly for fast adaptability.
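
A skeleton of the three stages named above, with hypothetical function names and a stubbed model call; the released ClimXtract code may be organised differently.

```python
# Three-stage sketch: page selection, LLM prompting, and output parsing/storage.
import json
import sqlite3

def identify_relevant_pages(pages: list[str]) -> list[str]:
    """Stage 1: keep pages that mention emission scopes (simple keyword filter)."""
    return [p for p in pages if "Scope 1" in p or "Scope 2" in p or "Scope 3" in p]

def prompt_llm(page_text: str) -> str:
    """Stage 2: ask the LLM for value and unit per scope, returned as JSON."""
    raise NotImplementedError("model call omitted in this sketch")

def parse_and_store(llm_output: str, db: sqlite3.Connection) -> None:
    """Stage 3: parse the JSON answer and write one row per scope to the database."""
    for scope, record in json.loads(llm_output).items():
        db.execute("INSERT INTO emissions (scope, value, unit) VALUES (?, ?, ?)",
                   (scope, record["value"], record["unit"]))
```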

GeoCSR: Leveraging Geospatial Data from Corporate Reports for Sustainable Finance Insights (2025)
With Felicitas Sommer, Andreas Dimmelmeier, and Christophe Christiaen.

This project is part of the larger research agenda GIST - Greenhouse Gas Insights and Sustainability Tracking, a research collaboration between Deutsche Bundesbank and LMU Munich to generate high-quality, granular firm-level emission and sustainability data. More information can be found here.

The presentations listed include upcoming ones. Unpublished papers are available upon request.