Steering Sequence Generation in Protein Language Models through Iterative Lookback Monte Carlo Sampling

This paper is a preprint and has not been certified by peer review.


Authors

Calvanese, F.; Lombardi, G.; Weigt, M.; Fernandez-de-Cossio-Diaz, J.

Abstract

Protein language models (pLMs) leverage large-scale evolutionary data to generate novel sequences, but steering generation toward desired physicochemical properties without sacrificing diversity remains a major challenge. Existing approaches often induce severe diversity loss or require computationally expensive retraining. We introduce Iterative Lookback Monte Carlo (ILMC), a training-free inference-time sampling strategy that interleaves autoregressive elongation with Metropolis-Hastings refinement to approximate sampling from a maximum-entropy target distribution balancing generative quality and steering objectives. We show theoretically that this target distribution is entropy-maximizing under fixed generative quality and steering constraints, and empirically that ILMC produces more diverse samples than standard autoregressive baselines at matched generative quality. Using simple steering potentials, ILMC improves desired molecular properties, including generating proteins with up to 12 °C higher predicted melting temperature than compute-matched alternative strategies. ILMC naturally applies to classifier-guided steering, where it outperforms purely autoregressive guidance in diversity while maintaining comparable enrichment of target properties. We validate ILMC on family-specific pLMs and on the multi-family model ProGen3.
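To make the abstract's elongation/refinement loop concrete, here is a minimal, runnable Python sketch of an ILMC-style sampler. It is an illustration under stated assumptions, not the authors' implementation: ToyModel (a uniform stand-in for a real pLM), hydrophobic_fraction (a toy steering potential), and all parameter names are hypothetical. The target distribution is taken to be p_model(s) · exp(-beta · U(s)), and a symmetric single-site proposal is used so the Metropolis-Hastings ratio needs no proposal correction.

import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids

class ToyModel:
    """Stand-in for a pLM: uniform conditionals over ALPHABET.
    A real run would score sequences with pLM log-likelihoods instead."""
    def sample_next(self, seq):
        return random.choice(ALPHABET)

    def log_prob(self, seq):
        # Uniform model: constant per-token log-probability.
        return -len(seq) * math.log(len(ALPHABET))

def hydrophobic_fraction(seq):
    """Toy steering potential: fraction of hydrophobic residues.
    With exp(-beta * U), lower values of U are favored."""
    hydro = set("AILMFWVY")
    return sum(aa in hydro for aa in seq) / max(len(seq), 1)

def ilmc_sample(model, potential, seq_len, block=10, mh_steps=50, beta=2.0):
    """Sketch of the ILMC loop described in the abstract: alternate
    autoregressive elongation with Metropolis-Hastings refinement of the
    prefix, targeting p_model(s) * exp(-beta * potential(s))."""
    seq = []
    while len(seq) < seq_len:
        # 1) Elongation: extend autoregressively by up to `block` tokens.
        for _ in range(min(block, seq_len - len(seq))):
            seq.append(model.sample_next(seq))
        # 2) Lookback: MH moves over the whole prefix generated so far.
        for _ in range(mh_steps):
            pos = random.randrange(len(seq))
            new = random.choice(ALPHABET)  # symmetric proposal
            if new == seq[pos]:
                continue
            proposed = seq[:pos] + [new] + seq[pos + 1:]
            # Acceptance combines model likelihood and steering potential
            # (the likelihood term cancels for this uniform toy model).
            log_a = (model.log_prob(proposed) - model.log_prob(seq)
                     - beta * (potential(proposed) - potential(seq)))
            if math.log(random.random()) < log_a:
                seq = proposed
    return "".join(seq)

if __name__ == "__main__":
    random.seed(0)
    s = ilmc_sample(ToyModel(), hydrophobic_fraction, seq_len=60)
    print(s, f"hydrophobic fraction: {hydrophobic_fraction(s):.2f}")

Because refinement revisits the entire prefix rather than only the newest tokens, errors committed early in autoregressive generation can still be corrected, which is the "lookback" the method's name refers to.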
