Symbolic regression
Symbolic regression (SR) [1] is a popular paradigm for marrying classical modelling (think writing down an inverse-square law for gravitational potentials) with modern ML model fitting (think fitting a neural network to predict a function). The interplay between choosing a representation of the data and learning its parameters (a very physics-y notion), and learning the representation purely from data (a very ML-y notion), is an important area of study, and SR attempts to bridge this gap.
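To give a flavour of the search space SR explores, here is a minimal toy sketch (not PySR, and much cruder than a real genetic algorithm — pure random search over small expression trees stands in for evolution). The dataset, operator set, and tree depth are all invented for illustration; the point is that candidate models are symbolic expressions, not opaque parameter vectors.

```python
import math
import random

random.seed(0)

# Toy "dataset": samples from an inverse-square law, y = 1 / r^2.
xs = [0.5 + 0.1 * i for i in range(20)]
ys = [1.0 / x**2 for x in xs]

# Expressions are trees: leaves are "x" or a constant; internal nodes are
# tuples ("op", child, ...) drawn from a small operator library.
UNARY = {"neg": lambda a: -a, "inv": lambda a: 1.0 / a if a != 0 else float("inf")}
BINARY = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def random_expr(depth=3):
    """Grow a random expression tree up to the given depth."""
    if depth == 0 or random.random() < 0.3:
        return "x" if random.random() < 0.7 else random.choice([1.0, 2.0])
    if random.random() < 0.4:
        return (random.choice(list(UNARY)), random_expr(depth - 1))
    return (random.choice(list(BINARY)), random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    """Recursively evaluate an expression tree at a point x."""
    if expr == "x":
        return x
    if isinstance(expr, float):
        return expr
    op, *args = expr
    return (UNARY | BINARY)[op](*(evaluate(a, x) for a in args))

def loss(expr):
    """Sum-of-squares fit of the expression to the toy data."""
    try:
        return sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys))
    except (OverflowError, ZeroDivisionError):
        return float("inf")

# Random search over candidate expressions, in place of a genetic algorithm.
best = min((random_expr() for _ in range(5000)), key=loss)
print(best, loss(best))
```

A real SR package replaces the random search with mutation and crossover of trees, and typically penalises expression complexity alongside the fit loss.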
Compression and model selection
The task when fitting an ML model can be split into two parts: compressing the information contained in the data into the parameters of the model, and selecting the model to fit in the first place. The promise of neural techniques is that the latter doesn't matter: you can simply overparameterise the model and learn a useful representation. For scientific purposes we are less concerned with practical usefulness and more concerned with uncovering ground truth about the world, so the model specification problem is much harder to gloss over.
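One standard way to make the trade-off between fit quality and model complexity concrete is an information criterion. The sketch below (an illustration, not the project's method — the quadratic ground truth, noise level, and use of BIC rather than the full Bayesian evidence are all choices made here for brevity) fits polynomials of increasing degree and scores each with BIC, which penalises extra parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known quadratic ground truth plus Gaussian noise.
x = np.linspace(0, 5, 100)
y = 2 + 3 * x - x**2 + rng.normal(0, 0.5, x.size)

def bic(degree):
    """BIC = n log(RSS/n) + k log(n) for a polynomial fit of given degree."""
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    k = degree + 1  # number of fitted coefficients
    n = x.size
    return n * np.log(rss / n) + k * np.log(n)

scores = {d: bic(d) for d in range(1, 6)}
best = min(scores, key=scores.get)
print(scores, best)
```

BIC is only a large-sample approximation to the log evidence; nested sampling (discussed below) computes the evidence itself, which is why it gives a more precise handle on model comparison.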
The project aim
SR uses a genetic algorithm to perform compression and model selection at once, and there have been previous attempts to put this on a more formal probabilistic footing [2]. In this project we will bring the state of the art in precise model comparison (in the guise of nested sampling [3]) to bear, studying how well the genetic algorithms at the heart of SR tackle the model selection problem.
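Nested sampling estimates the Bayesian evidence Z = ∫ L(θ) π(θ) dθ by repeatedly discarding the lowest-likelihood member of a population of "live points" and tracking the shrinking prior volume. The following from-scratch toy (a uniform prior on [0, 1] and a Gaussian likelihood, chosen here purely for illustration — this is not PolyChord, which handles high dimensions and multimodality) shows the core loop; for this setup the true log evidence is approximately 0:

```python
import math
import random

random.seed(1)

SIGMA = 0.1

def loglike(theta):
    # Gaussian likelihood centred at 0.5 with width SIGMA; prior is Uniform(0, 1).
    return -0.5 * ((theta - 0.5) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2 * math.pi))

def logaddexp(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def nested_sampling(nlive=100, niter=800):
    live = [random.random() for _ in range(nlive)]
    ll = [loglike(t) for t in live]
    logz = -math.inf
    x_prev = 1.0  # enclosed prior volume, shrinking geometrically each step
    for i in range(1, niter + 1):
        worst = min(range(nlive), key=ll.__getitem__)
        lstar = ll[worst]
        x = math.exp(-i / nlive)  # expected prior-volume shrinkage per iteration
        logz = logaddexp(logz, lstar + math.log(x_prev - x))
        x_prev = x
        # Replace the worst point: rejection-sample the prior above the
        # likelihood threshold (real samplers do this far more cleverly).
        while True:
            t = random.random()
            if loglike(t) > lstar:
                break
        live[worst], ll[worst] = t, loglike(t)
    # Termination: credit the remaining live points with the leftover volume.
    for l in ll:
        logz = logaddexp(logz, l + math.log(x_prev) - math.log(nlive))
    return logz

print(nested_sampling())
```

The evidence values this produces are exactly the quantities one would compare across competing symbolic models, which is what makes nested sampling a natural yardstick for the genetic algorithm's implicit model selection.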
Technical skills and references
- This project will primarily involve developing research code in Python
- Familiarity with ML libraries in Python is a plus, e.g. sklearn, torch
- Little knowledge of astronomy or cosmology is required; a grounding in Bayesian statistics/ML will be developed through the project
[1] Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl, Cranmer [2305.01582]
[2] Priors for symbolic regression, Bartlett et al. [2304.06333]
[3] PolyChord: next-generation nested sampling, W. Handley et al. [1506.00171]