Chemistry-Informed Machine Learning

When the Laws of Physics Belong in the Loss Function

The Problem with Pure Data

Machine learning has transformed analytical chemistry. Models that once required weeks of manual calibration can now be trained in hours, and the explosion of available spectral data has made purely data-driven approaches increasingly attractive. Feed a neural network enough NIR spectra with known reference values, and it will find patterns — reliably, quickly, and without needing to understand why they exist.

But understanding why they exist turns out to matter.

A model trained purely on data learns correlations. It does not learn chemistry. Given a sufficiently large and representative training set, those correlations may be accurate and stable — but they are fragile in precisely the ways that matter most in real analytical deployments: novel sample matrices, instrument variation, measurement conditions that fall outside the training distribution. A spectral model that has never seen a sample at 35°C has no principled basis for predicting what one looks like. A model that has learned to associate a particular absorbance band with fat content in hazelnut paste may fail on cocoa mass, even though the underlying chemistry is closely related.

Chemistry-informed machine learning is the discipline of building physical and chemical constraints directly into the learning process — not as post-hoc corrections, but as structural properties of the model itself.

What a Loss Function Actually Does

Training a machine learning model is an optimisation problem. Given a model with parameters θ, a dataset of inputs X and targets y, and a prediction function f(X; θ), the goal is to find the parameter values that minimise the discrepancy between predictions and ground truth. That discrepancy is measured by a loss function L(θ).

The standard choice for regression tasks — including NIR calibration — is mean squared error:

L(θ) = (1/n) Σ (yᵢ − f(xᵢ; θ))²

This is clean, differentiable, and well-understood. It tells the model to minimise prediction error across the training set, and nothing else. The model is free to find any mapping from spectra to predicted values that achieves low error — including mappings that violate known chemistry, exploit spurious correlations in the training data, or extrapolate in physically implausible ways.
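To make that blind spot concrete, the minimal sketch below computes MSE for two sets of predictions. The reference values and predictions are invented for illustration, and plain Python is used so the example has no dependencies. Both sets receive the same loss, even though one of them contains a negative concentration.

```python
def mse(y_true, y_pred):
    """Mean squared error between reference values and predictions."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


# Hypothetical reference concentrations in % w/w (illustrative values only).
y_true = [0.25, 5.0]

# Two candidate prediction sets with the same absolute errors.
plausible = [0.75, 5.0]     # overestimates the first sample by 0.5
implausible = [-0.25, 5.0]  # underestimates by 0.5, giving a negative concentration

loss_plausible = mse(y_true, plausible)
loss_implausible = mse(y_true, implausible)
```

From the point of view of the optimiser the two candidates are indistinguishable, which is why the chemistry has to enter through some other route.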

Chemistry-informed machine learning adds structure to this objective. The loss function is augmented with additional terms that penalise physically implausible behaviour:

L(θ) = L_data(θ) + λ · L_physics(θ)

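Here L_data is the standard data-fidelity term, L_physics encodes domain knowledge as a differentiable constraint, and λ is a regularisation coefficient that controls the relative weight of the two objectives. Setting λ = 0 recovers the purely data-driven objective, while a very large λ lets the physics term dominate the fit.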
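A minimal sketch of this composite objective, assuming MSE as L_data, is given below. The function names and the example penalty (mean squared excess above 100 % w/w) are hypothetical choices made for illustration, not a prescribed form of L_physics.

```python
def mse(y_true, y_pred):
    """Data-fidelity term L_data: mean squared error."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def total_loss(y_true, y_pred, physics_penalty, lam):
    """L = L_data + lam * L_physics, with L_data taken to be MSE."""
    return mse(y_true, y_pred) + lam * physics_penalty(y_pred)


def upper_bound_penalty(y_pred, limit=100.0):
    """Hypothetical L_physics: mean squared excess above `limit` (% w/w)."""
    return sum(max(0.0, p - limit) ** 2 for p in y_pred) / len(y_pred)


# Illustrative values only: the first prediction exceeds 100 %.
y_true = [98.0, 50.0]
y_pred = [102.0, 50.0]

loss_data_only = total_loss(y_true, y_pred, upper_bound_penalty, lam=0.0)
loss_weighted = total_loss(y_true, y_pred, upper_bound_penalty, lam=10.0)
```

With lam=0.0 the physics term is ignored, and raising lam makes the same violation progressively more costly. In practice λ is usually tuned like any other hyperparameter, for example by cross-validation.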

What Chemical Constraints Look Like in Practice

The form of L_physics depends on what is known about the system. In NIR spectroscopy, several constraints are well-established and naturally expressible as differentiable penalties.

Beer-Lambert adherence. The Beer-Lambert law states that absorbance is linearly proportional to concentration for dilute solutions: A = ε · c · l, where ε is the molar absorptivity, c is concentration, and l is path length. A model whose predictions violate this relationship in the linear regime is physically implausible. A penalty term that measures deviation from Beer-Lambert linearity in known concentration ranges discourages this.
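One possible way to express such a penalty is sketched below. It assumes a single absorbing analyte, a fixed path length, and spectra free of baseline offset and scattering, so that diluting the sample by a factor s scales the whole absorbance spectrum by s. Under those assumptions a concentration model evaluated on s · a should return s times its prediction for a. The function name, the toy models, and all numerical values are hypothetical.

```python
def beer_lambert_penalty(model, reference_spectrum, scales):
    """Mean squared deviation of model(s * a) from s * model(a).

    `model` is any callable mapping a list of absorbances to a predicted
    concentration. `scales` should stay within the range where Beer-Lambert
    linearity is expected to hold.
    """
    base = model(reference_spectrum)
    total = 0.0
    for s in scales:
        scaled = [s * a for a in reference_spectrum]
        total += (model(scaled) - s * base) ** 2
    return total / len(scales)


# Hypothetical absorbances at three wavelengths and two toy models.
spectrum = [0.2, 0.4, 0.1]
weights = [0.5, 1.0, -0.2]


def linear_model(a):
    return sum(w * x for w, x in zip(weights, a))


def curved_model(a):
    # Adds a quadratic term, so predictions no longer scale with absorbance.
    return linear_model(a) + 0.5 * sum(a) ** 2


scales = [0.25, 0.5, 0.75]
penalty_linear = beer_lambert_penalty(linear_model, spectrum, scales)
penalty_curved = beer_lambert_penalty(curved_model, spectrum, scales)
```

The linear model incurs essentially no penalty (up to floating-point rounding), while the curved model is penalised. Real NIR spectra of turbid samples would need scatter correction before a check of this kind is meaningful.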

Non-negativity of concentrations. Predicted constituent concentrations cannot be negative. Standard regression models are not constrained to produce non-negative outputs, and training sets that do not include near-zero concentrations may not enforce this implicitly. A non-negativity penalty encodes this directly as a soft constraint. Alternatively, a constrained output activation enforces it as a hard constraint, making negative predictions impossible by construction.
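The two routes, a penalty term and a constrained output activation, can be sketched side by side. The squared-hinge penalty and the softplus activation shown here are common choices but are assumptions of this example, as are the function names and input values.

```python
import math


def negativity_penalty(y_pred):
    """Soft constraint: mean squared magnitude of any negative predictions."""
    return sum(min(0.0, p) ** 2 for p in y_pred) / len(y_pred)


def softplus(z):
    """Hard constraint: maps any real z to a non-negative output.

    Written as max(z, 0) + log1p(exp(-|z|)) to avoid overflow for large |z|.
    """
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


raw_outputs = [-1.0, 0.0, 2.0]  # hypothetical unconstrained model outputs

penalty = negativity_penalty(raw_outputs)         # only the -1.0 contributes
constrained = [softplus(z) for z in raw_outputs]  # no negative values remain
```

The penalty discourages negative predictions during training but does not rule them out at inference time, whereas the activation guarantees non-negativity at the cost of changing the shape of the output mapping near zero.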

Closure constraints. In many analytical applications, the sum of all predicted constituents must equal a known total. In a proximate analysis of a food matrix, for example, moisture, fat, protein, ash, and carbohydrate must together sum to 100%. A closure penalty penalises predictions that violate this mass balance, improving both physical plausibility and predictive consistency across constituents. It applies only when the model predicts every constituent that contributes to the total.
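A closure penalty of this kind might look like the following minimal sketch. The squared-deviation form, the function name, and the example values are assumptions for illustration, and each sample is assumed to list every constituent that contributes to the total.

```python
def closure_penalty(predictions, total=100.0):
    """Mean squared deviation of each sample's constituent sum from `total`.

    `predictions` is a list of samples; each sample is a list of predicted
    constituent percentages.
    """
    deviations = [(sum(sample) - total) ** 2 for sample in predictions]
    return sum(deviations) / len(deviations)


# Hypothetical predictions (% w/w) for two samples, five constituents each.
balanced = [[10.0, 30.0, 20.0, 5.0, 35.0], [8.0, 32.0, 18.0, 4.0, 38.0]]
unbalanced = [[10.0, 30.0, 20.0, 5.0, 32.0], [8.0, 32.0, 18.0, 4.0, 38.0]]

penalty_balanced = closure_penalty(balanced)      # both samples sum to 100
penalty_unbalanced = closure_penalty(unbalanced)  # first sample sums to 97
```

Because the penalty couples the outputs, an error in one constituent is only free if it is compensated elsewhere, which is what ties the individual predictions together.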

Spectral smoothness. NIR absorption bands are broad and continuous — physically meaningful spectral features do not vary arbitrarily between adjacent wavelengths. A smoothness regulariser on the model’s learned spectral weights discourages high-frequency noise from being interpreted as chemically meaningful signal.
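One common way to write such a regulariser is as a sum of squared second differences of the weights, sketched below. This assumes the weights are ordered by wavelength on an evenly spaced grid; the function name and the example weight vectors are hypothetical.

```python
def smoothness_penalty(weights):
    """Sum of squared second differences of weights at adjacent wavelengths."""
    return sum(
        (weights[i - 1] - 2.0 * weights[i] + weights[i + 1]) ** 2
        for i in range(1, len(weights) - 1)
    )


smooth_weights = [0.0, 1.0, 2.0, 3.0]    # linear ramp, zero curvature
jagged_weights = [1.0, -1.0, 1.0, -1.0]  # sign flips at every wavelength

penalty_smooth = smoothness_penalty(smooth_weights)
penalty_jagged = smoothness_penalty(jagged_weights)
```

A second-difference penalty leaves broad, slowly varying weight profiles almost untouched while heavily penalising point-to-point oscillation, which matches the expectation that real NIR bands are broad.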

Why This Matters Beyond Accuracy

The case for chemistry-informed machine learning is not primarily about achieving lower RMSEP on a held-out test set. It is about building models that behave predictably and fail gracefully.

A purely data-driven model that encounters an out-of-distribution sample has no mechanism for recognising that it is extrapolating. It will produce a prediction, possibly a confident one, that may be chemically nonsensical. A chemistry-informed model, by contrast, is constrained toward physically admissible outputs. It can still be wrong outside its training distribution, but its predictions remain easier to interpret at the boundaries of that distribution, and its failure modes are more likely to be conservative than catastrophic.

For analytical applications where predictions drive real decisions — release or reject a batch, accept or return a delivery, flag or pass a process stream — this distinction is not academic. Interpretable, physically grounded models are also substantially easier to validate under regulatory frameworks that require documented justification of analytical methods.

There is also a data efficiency argument. Chemical constraints effectively supply additional information to the training process — information that does not need to be learned from labelled examples because it is already known. This is particularly valuable in NIR calibration, where acquiring reference measurements for training data is expensive, time-consuming, and often the primary bottleneck in deploying a new analytical method.

The Chemometric Tradition and Where It Leads

Classical chemometrics, including partial least squares (PLS) regression, principal component regression, and their variants, is implicitly physics-informed: its linear model form mirrors the linear, additive relationship between absorbance and concentration that the Beer-Lambert law describes. PLS, for instance, finds latent variables that maximise covariance between spectral data and reference values, and the resulting regression vectors have a natural interpretation in terms of spectral loadings and chemical contributions. These methods have served analytical chemistry well for decades precisely because they impose structure that unconstrained neural networks do not.
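The covariance-maximising property can be illustrated for the first latent variable of single-response PLS. With mean-centred spectra Xc and reference values yc, the first weight vector is proportional to Xcᵀ yc, and no other unit-length direction gives scores with a larger covariance with y. The sketch below checks this against the three coordinate axes using invented data; the function names are hypothetical and only the first component is computed.

```python
import math


def center(values):
    """Subtract the mean from a sequence of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def first_pls_weight(X, y):
    """Unit vector proportional to Xc^T yc (first PLS1 weight vector)."""
    yc = center(y)
    columns = [center(col) for col in zip(*X)]
    w = [sum(c * t for c, t in zip(col, yc)) for col in columns]
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]


def score_covariance(X, y, w):
    """Sample covariance between the scores Xc w and the centred y."""
    yc = center(y)
    columns = [center(col) for col in zip(*X)]
    scores = [
        sum(col[i] * wj for col, wj in zip(columns, w)) for i in range(len(y))
    ]
    return sum(s * t for s, t in zip(scores, yc)) / (len(y) - 1)


# Hypothetical data: 4 samples, 3 wavelengths (illustrative values only).
X = [
    [0.10, 0.30, 0.20],
    [0.20, 0.50, 0.10],
    [0.30, 0.70, 0.30],
    [0.40, 0.80, 0.20],
]
y = [1.0, 2.0, 3.0, 4.0]

w = first_pls_weight(X, y)
cov_pls = score_covariance(X, y, w)
axes = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
cov_axes = [score_covariance(X, y, axis) for axis in axes]
```

Here cov_pls is at least as large as every entry of cov_axes, which follows from the Cauchy-Schwarz inequality. A full PLS fit would go on to deflate X and y and extract further components, which this sketch omits.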

The emerging field of chemistry-informed deep learning does not replace this tradition — it extends it. Physics-informed neural networks (PINNs), hybrid mechanistic-statistical models, and constraint-regularised deep architectures are bringing the expressiveness of modern machine learning to bear on problems where domain knowledge has always been available but difficult to encode formally.

For NIR spectroscopy specifically, this convergence is particularly productive. The physics of near-infrared absorption is well-understood. The Beer-Lambert law, the spectral signatures of known functional groups, and the expected behaviour of scattering in turbid media together constitute a rich body of prior knowledge that, when encoded into a learning objective, can make models more accurate, more robust, and more trustworthy than data alone can achieve.