Greenshields-Watson A., Agarwal P., Robinson SA., Williams BH., Gordon GL., Capel HL., Li Y., Spoendlin FC., Aguilar-Sanjuan B., Boyles F., Deane CM.

Abstract

Antigen receptor numbering allows delineation of the antigen-binding regions of antibodies and T cell receptors from sequence alone. Numbering is currently achieved by aligning to a reference set. This approach may produce different numbering depending on the reference set used, or fail on sequences from rare species or formats. We present a method, ANARCII, which requires no alignment step and is based on a Seq2Seq language model. ANARCII improves upon existing methods through more consistent numbering of key regions, robustness to truncations, generalisation to unseen species, and easier user installation. The lightweight architecture allows numbering of 90,000 sequences per minute on a high-end GPU. The software is available as a web app (https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabpred/anarcii/) and as a package (https://github.com/oxpig/ANARCII). Ultimately, ANARCII allows numbering of more antibody-like sequences, with better recovery of full-length regions from existing databases, and enables comparative analysis of new receptors not numbered by existing tools.

ANARCII enables alignment-free antigen receptor numbering using a generalised language model
