“Critical evaluation of drug response prediction models with DrEval” has been published in Nature Communications.
ML-based models of cancer cell line drug response are well motivated, and significant research effort has gone into developing complex modeling approaches (over 100 papers in 2025). The problem: under rigorous evaluation, we found none that actually works. Most are published on the basis of inflated metrics, break down when probed in realistic application scenarios, and are outperformed by simple baselines. Judith Bernett and Pascal Iversen, together with Mario Picciani, Katharina Baum, Markus List, and Mathias Wilhelm, built DrEval, a pipeline for unbiased evaluation of these models, to encourage more meaningful progress in the field.
DrEval could also serve as an unbiased reward signal for AI-agent-based model development in biological ML. The project is under active maintenance to provide a robust, shared, living benchmark with uniformly processed data for bias-free, reproducible, and application-oriented model evaluation.
Link to the paper: https://doi.org/10.1038/s41467-026-72903-w
Link to GitHub for drevalpy, the standalone Python package: https://github.com/daisybio/drevalpy
Link to GitHub for nf-core/drugresponseeval, the accompanying Nextflow pipeline for reproducibility and scalability: https://github.com/nf-core/drugresponseeval/
