The Observed T cell receptor Space database enables paired-chain repertoire mining, coherence analysis and language modelling
Raybould M., Greenshields-Watson A., Agarwal P., Aguilar-Sanjuan B., Olsen T., Turnbull OM., Quast N., Deane C.
T cell activation is governed through T cell receptors (TCRs), heterodimers of two sequence-variable chains (often an alpha [α] and beta [β] chain) that synergistically recognise antigen fragments presented on cell surfaces. Despite this, existing repositories collect only single-chain, not paired-chain, TCR sequence data. We have addressed this gap by creating the Observed T cell receptor Space (OTS) database, a source of consistently processed and annotated, full-length, paired-chain TCR sequences. Currently, OTS contains 5.35M redundant (1.63M non-redundant) predominantly human sequences from 50 studies and at least 75 individuals. Using OTS, we identify pairing biases, public TCRs, and chain coherence patterns distinct from those of antibodies. We also release a paired-chain TCR language model, which provides paired embedding representations and a method for residue in-filling conditional on the partner chain. OTS will be updated as a central community resource, freely downloadable and available as a web application at https://opig.stats.ox.ac.uk/webapps/ots.