Towards the goal of understanding the causal structure underlying complex systems (such as the Earth, the climate, or the brain), integrating Large Language Models (LLMs) with data-driven and domain-expertise-driven approaches has the potential to become a game-changer, especially in data- and expertise-limited scenarios. Debates persist around LLMs’ causal reasoning capacities. However, rather than engaging in philosophical debates, we propose integrating LLMs into a scientific framework for causal hypothesis generation alongside expert knowledge and data. Our goals include formalizing LLMs as probabilistic imperfect experts, developing adaptive methods for causal hypothesis generation, and establishing universal benchmarks for comprehensive comparisons. Specifically, we introduce a spectrum of integration methods for experts, LLMs, and data-driven approaches. We review existing approaches for causal hypothesis generation and classify them within this spectrum. As an example, our hybrid (LLM + data) causal discovery algorithm illustrates ways for deeper integration. Characterizing imperfect experts along dimensions such as 1) reliability, 2) consistency, 3) uncertainty, and 4) content vs. reasoning is emphasized as a basis for developing adaptable methods. Lastly, we stress the importance of model-agnostic benchmarks.
App. Soft. Comp.
Pairwise causal discovery with support measure machines
Gherardo Varando, Salvador Catsis, Emiliano Diaz, and Gustau Camps-Valls
Bivariate causal discovery amounts to inferring the causal association between two random variables, usually from observational data. This task is the simplest and most fundamental causal discovery problem from which more complex discovery methods can be envisioned and developed. Classical bivariate causal discovery methods exploit a combination of specific sets of assumptions and data to obtain identifiability of the causal direction. Data-driven supervised approaches train machine learning models over large sets of causally-labeled bivariate datasets to learn the task of inferring the causal relationship from data. In this work, an ensemble algorithm based on support measure machines is proposed with the aim of combining the strength of different classical approaches (base methods) with data-driven decisions. In particular, support measure machine classifiers are trained to estimate the performance of each base method. Their decision functions are then used as data-dependent weights of a weighted voting scheme to estimate the causal direction in a bivariate causal discovery problem. This work demonstrates that the proposed algorithm, denoted as Causal Ensemble Measure Machine, performs equal to or better than state-of-the-art methods on a wide range of synthetic and real-world bivariate problems. Perhaps more importantly, this method enables a closer examination of the assumption dependence of existing algorithms on observational data.
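The weighted voting step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `weighted_vote`, the encoding of votes as +1 (X causes Y) and -1 (Y causes X), and the example weights are all assumptions. In the paper, the weights would come from the decision functions of trained support measure machine classifiers, which are not reproduced here.

```python
import numpy as np


def weighted_vote(base_votes, weights):
    """Combine base-method votes with data-dependent weights.

    base_votes: +1 if a base method says X -> Y, -1 if it says Y -> X.
    weights:    one real-valued weight per base method (hypothetical here;
                stands in for a classifier's decision-function value).
    Returns the voted direction and the signed aggregate score.
    """
    base_votes = np.asarray(base_votes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    score = float(np.dot(weights, base_votes))
    if score > 0:
        return "X->Y", score
    if score < 0:
        return "Y->X", score
    return "undecided", score


# Three hypothetical base methods: one trusted method outvotes two weak ones.
direction, score = weighted_vote([+1, -1, -1], [2.0, 0.5, 0.5])
print(direction, score)
```

A negative weight flips a method's vote, which matches the intuition that a method predicted to be wrong on a given dataset carries evidence for the opposite direction.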
NeurIPS Workshop
3D Cloud reconstruction through spatially aware masked autoencoders.
Stella Girtsou, Emiliano Díaz Salas-Porras, Lilli Freischem, Joppe Massant, Kyriaki-Margarita Bintsi, Giuseppe Castiglione, and 4 more authors
Physics is a field of science that has traditionally used the scientific method to answer questions about why natural phenomena occur and to make testable models that explain the phenomena. Discovering equations, laws, and principles that are invariant, robust, and causal has been fundamental in physical sciences throughout the centuries. Discoveries emerge from observing the world and, when possible, performing interventions on the system under study. With the advent of big data and data-driven methods, the fields of causal and equation discovery have developed and accelerated progress in computer science, physics, statistics, philosophy, and many applied fields. This paper reviews the concepts, methods, and relevant works on causal and equation discovery in the broad field of physics and outlines the most important challenges and promising future lines of research. We also provide a taxonomy for data-driven causal and equation discovery, point out connections, and showcase comprehensive case studies in Earth and climate sciences, fluid dynamics and mechanics, and the neurosciences. This review demonstrates that discovering fundamental laws and causal relations by observing natural phenomena is revolutionised with the efficient exploitation of observational data and simulations, modern machine learning algorithms and the combination with domain knowledge. Exciting times are ahead with many challenges and opportunities to improve our understanding of complex systems.
MLST
Learning latent functions for causal discovery
Emiliano Díaz, Gherardo Varando, J Emmanuel Johnson, and Gustau Camps-Valls
Machine Learning: Science and Technology, Jul 2023
Causal discovery from observational data offers unique opportunities in many scientific disciplines: reconstructing causal drivers, testing causal hypotheses, and comparing and evaluating models for optimizing targeted interventions. Recent causal discovery methods have focused on estimating the latent space of the data to get around a lack of causal sufficiency or additivity constraints. However, estimating the latent space significantly increases model complexity, compromising causal identifiability and making it hard to compare models that correspond to different causal hypotheses. We propose a kernel, non-parametric latent-space modelling approach and deal with the difficulty of comparing causal directions by measuring and controlling for the level of causal assumption fulfilment. We introduce a latent noise causal inference framework to estimate latent factors associated with the hypothesized causal direction by optimizing a loss function with kernel independence criteria. We extend the framework to work with time series using an additional time-dependent kernel regularizer. We discuss the additivity assumption and model complexity and give empirical evidence of performance in a wide range of synthetic and real causal discovery problems.
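To make the phrase "kernel independence criteria" concrete, the sketch below computes the biased empirical Hilbert-Schmidt Independence Criterion (HSIC) with Gaussian kernels. The abstract does not say which criterion the paper uses, so choosing HSIC is an assumption, as are the function names `rbf_gram` and `hsic` and the fixed bandwidths. The latent-factor optimization itself is not shown.

```python
import numpy as np


def rbf_gram(x, sigma):
    """Gaussian-kernel Gram matrix for samples stored along the first axis."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


def hsic(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2, with H the centering matrix."""
    n = len(x)
    K = rbf_gram(x, sigma_x)
    L = rbf_gram(y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / n ** 2


rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dependent = x ** 2 + 0.1 * rng.normal(size=200)  # nonlinear function of x
y_independent = rng.normal(size=200)               # unrelated to x

print("dependent pair:  ", hsic(x, y_dependent))
print("independent pair:", hsic(x, y_independent))
```

A loss built on such a criterion would typically penalize dependence between the hypothesized cause and the estimated latent noise, so a smaller value indicates a better fit to the assumed causal direction.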
arXiv
Large Language Models for Constrained-Based Causal Discovery
Process understanding and modeling is at the core of scientific reasoning. Principled parametric and mechanistic modeling dominated science and engineering until the recent emergence of machine learning (ML). Despite great success in many areas, ML algorithms in the Earth and climate sciences, and more broadly in physical sciences, are not explicitly designed to be physically consistent and may, therefore, violate the most basic laws of physics. In this work, motivated by the field of algorithmic fairness, we reconcile data-driven ML with physics modeling by illustrating a nonparametric and nonlinear physics-aware regression method. By incorporating a dependence-based regularizer, the method leads to models that are consistent with domain knowledge, as reflected by either simulations from physical models or ancillary data. The idea can conversely encourage independence of model predictions from other variables that are known to be uncertain either in their representation or magnitude. The method is computationally efficient and comes with a closed-form analytic solution. Through a consistency-vs-accuracy path diagram, one can assess the consistency between data-driven models and physical models. We demonstrate in three examples on simulations and measurement data in Earth and climate studies that the proposed ML framework allows us to trade off physical consistency and accuracy.
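A closed-form solution of this kind can be illustrated with a simplified linear version. The sketch assumes a ridge objective plus a linear-kernel HSIC penalty between the predictions and an auxiliary variable `s`, that is, minimizing ||y - Xw||^2 + lam ||w||^2 + mu HSIC(Xw, s). Setting the gradient to zero gives w = (X^T X + lam I + (mu / n^2) X^T H s s^T H X)^{-1} X^T y. The paper's method is nonparametric and nonlinear, so this linear form, the names `dependence_regularized_ridge` and `linear_hsic`, and the synthetic data are assumptions made for illustration. The sketch shows the independence-encouraging direction (positive `mu`).

```python
import numpy as np


def centered_gram(s):
    """Centered linear-kernel Gram matrix H (s s^T) H of the auxiliary variable."""
    s = np.asarray(s, dtype=float).reshape(len(s), -1)
    n = len(s)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ (s @ s.T) @ H


def linear_hsic(pred, s):
    """Linear-kernel HSIC between 1-D predictions and s: p^T H K_s H p / n^2."""
    pred = np.asarray(pred, dtype=float)
    return float(pred @ centered_gram(s) @ pred) / len(pred) ** 2


def dependence_regularized_ridge(X, y, s, lam=1e-3, mu=0.0):
    """Closed-form weights of ridge regression with a linear HSIC penalty."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    penalty = X.T @ centered_gram(s) @ X / n ** 2
    return np.linalg.solve(X.T @ X + lam * np.eye(d) + mu * penalty, X.T @ y)


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
s = X[:, 0] + 0.1 * rng.normal(size=200)            # auxiliary variable tied to feature 0
y = 2.0 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)

for mu in (0.0, 1e2, 1e4):
    w = dependence_regularized_ridge(X, y, s, mu=mu)
    mse = float(np.mean((y - X @ w) ** 2))
    print(f"mu={mu:g}  mse={mse:.4f}  hsic={linear_hsic(X @ w, s):.6f}")
```

Sweeping `mu` traces a path between fit quality and dependence on `s`, which is the kind of trade-off a consistency-vs-accuracy diagram would display.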
Scientific Reports
Inferring causal relations from observational long-term carbon and water fluxes records
Emiliano Díaz, Jose Adsuara, Alvaro Moreno, Maria Piles, and Gustau Camps-Valls
Land, atmosphere and climate interact constantly and at different spatial and temporal scales. In this paper we rely on causal discovery methods to infer spatial patterns of causal relations between several key variables of the carbon and water cycles: gross primary productivity, latent heat energy flux for evaporation, surface air temperature, precipitation, soil moisture and radiation. We introduce a methodology based on the convergent cross-mapping (CCM) technique. Despite its good performance in general, CCM is sensitive to (even moderate) noise levels and hyper-parameter selection. We present a robust CCM (RCCM) that relies on temporal bootstrapping decision scores and the derivation of more stringent cross-map skill scores. The RCCM method is combined with the information-geometric causal inference (IGCI) method to address the problem of strong and instantaneous variable coupling, another important and long-standing issue of CCM. The proposed methodology allows us to derive spatially explicit global maps of causal relations between the involved variables and to retrieve the underlying complexity of the interactions. Results are generally consistent with reported patterns and process understanding, and constitute a new way to quantify and understand carbon and water flux interactions.
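For readers unfamiliar with cross-mapping, the following sketch computes a plain CCM cross-map skill score. It covers only the basic technique: the bootstrapping and stricter skill scores of RCCM and the combination with IGCI are not reproduced. The function names, the embedding dimension and delay, and the coupled logistic maps used as toy data are assumptions chosen for illustration.

```python
import numpy as np


def delay_embed(series, dim, tau):
    """Delay embedding with rows (s[t], s[t - tau], ..., s[t - (dim - 1) * tau])."""
    series = np.asarray(series, dtype=float)
    start = (dim - 1) * tau
    cols = [series[start - k * tau: len(series) - k * tau] for k in range(dim)]
    return np.column_stack(cols)


def cross_map_skill(source, target, dim=2, tau=1):
    """Estimate `target` from the shadow manifold of `source`.

    Returns the Pearson correlation between the estimate and the true target.
    In CCM, high skill when mapping from Y's manifold to X is read as
    evidence that X influences Y.
    """
    manifold = delay_embed(source, dim, tau)
    aligned = np.asarray(target, dtype=float)[(dim - 1) * tau:]
    dists = np.linalg.norm(manifold[:, None, :] - manifold[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbour
    idx = np.argsort(dists, axis=1)[:, :dim + 1]     # dim + 1 nearest neighbours
    near = np.take_along_axis(dists, idx, axis=1)
    weights = np.exp(-near / (near[:, [0]] + 1e-12))  # scale by the closest distance
    weights /= weights.sum(axis=1, keepdims=True)
    estimate = np.sum(weights * aligned[idx], axis=1)
    return float(np.corrcoef(estimate, aligned)[0, 1])


# Toy data: two coupled logistic maps (parameters are illustrative).
n = 400
x = np.empty(n)
y = np.empty(n)
x[0], y[0] = 0.4, 0.2
for t in range(n - 1):
    x[t + 1] = x[t] * (3.8 - 3.8 * x[t] - 0.02 * y[t])
    y[t + 1] = y[t] * (3.5 - 3.5 * y[t] - 0.1 * x[t])

print("x estimated from y's manifold:", cross_map_skill(y, x))
print("y estimated from x's manifold:", cross_map_skill(x, y))
```

A full CCM analysis would also check that the skill converges as the library length grows, which this sketch omits.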
NeurIPS Workshop
Identifying the Causes of Pyrocumulonimbus (PyroCb)
Emiliano Díaz Salas-Porras, Kenza Tazi, Ashwin Braude, Daniel Okoh, Kara D. Lamb, and 3 more authors