{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting cattle T cell Epitopes in M.bovis\n", "\n", "This is a simplified version of an analysis we used to predict cattle T cell epitopes in the proteome of M.bovis. We apply three binding prediction methods to the M. bovis proteome using a subset of human MHC-II alleles to approximate the bovine immune response. Two different strategies are then applied to filter the resulting set of binders: global binder ranking and epitope density. Both these sets were used to make 20mer peptides for synthesis. \n", "\n", "Several other metrics are applied to produce a final list of peptide candidates.\n", "\n", "## References\n", "\n", "* Integrated computational prediction and experimental validation identifies promiscuous T cell epitopes in the proteome of pathogenic mycobacteria. Damien Farrell, Gareth Jones, Chris Pirson, Kerri M Malone, Kevin Rue-Albrecht, Anthony J Chubb, H Martin Vordermeier, Stephen V. Gordon. 2016\n", "* M. Pandya, “Definition of Bovine Leukocyte Antigen Diversity and Peptide Binding Profiles for Epitope Discovery,” 2016.\n", "* O. T. Schubert et al., “Absolute Proteome Composition and Dynamics during Dormancy and Resuscitation of Mycobacterium tuberculosis,” Cell Host Microbe, 2015." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os, math, time, pickle, subprocess\n", "from importlib import reload\n", "from collections import OrderedDict\n", "import numpy as np\n", "import pandas as pd\n", "pd.set_option('display.width', 100)\n", "import epitopepredict as ep\n", "from epitopepredict import base, sequtils, plotting, peptutils, analysis\n", "from IPython.display import display, HTML, Image\n", "%matplotlib inline\n", "import matplotlib as mpl\n", "import pylab as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Alleles\n", "* 8 bovine similar HLA MHC-II alleles were used to approximate cattle responses\n", "* BoLA class I alleles used here are those most frequent in 81 Holstein cattle from the UVM Dairy Center of Excellence research herd assayed by PCR-SSP. (Different alleles were used in the original study)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "mhc2alleles = ep.get_preset_alleles('bovine_like_mhc2')\n", "mhc1alleles = ['BoLA-1:01901', 'BoLA-2:00801','BoLA-2:01201','BoLA-4:02401','BoLA-2:07001',\n", " 'BoLA-3:01701','BoLA-1:02301','BoLA-2:01801','BoLA-6:01302','BoLA-3:00201']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering with MTB srm data\n", "\n", "Here we load the protein sequences from a genbank file. MTB is used since it is 99% similar to Bovis. You can find this file at https://www.ncbi.nlm.nih.gov/genome/166?genome_assembly_id=159857\n", "\n", "The proteins are filtered using absolute abundance levels of proteins in an unfractionated mixed lysate of M. tuberculosis H37Rv cultures identified by Schubert et al. All proteins undetected using selected reaction monitoring (SRM) (concentration = 0 and <2 peptides detected) in this study were filtered out. This leaves 1870 out of ~4000 proteins." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1870\n" ] }, { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>type</th>\n", " <th>protein_id</th>\n", " <th>locus_tag</th>\n", " <th>gene</th>\n", " <th>db_xref</th>\n", " <th>product</th>\n", " <th>note</th>\n", " <th>translation</th>\n", " <th>pseudo</th>\n", " <th>start</th>\n", " <th>end</th>\n", " <th>strand</th>\n", " <th>length</th>\n", " <th>order</th>\n", " <th>concentration</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>CDS</td>\n", " <td>CCP42723.1</td>\n", " <td>Rv0001</td>\n", " <td>dnaA</td>\n", " <td>GI:444893470</td>\n", " <td>Chromosomal replication initiator protein DnaA</td>\n", " <td>Rv0001, (MT0001, MTV029.01, P49993), len: 507 ...</td>\n", " <td>MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQRA...</td>\n", " <td>NaN</td>\n", " <td>0</td>\n", " <td>1524</td>\n", " <td>1</td>\n", " <td>507</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>CDS</td>\n", " <td>CCP42724.1</td>\n", " <td>Rv0002</td>\n", " <td>dnaN</td>\n", " <td>GI:444893471</td>\n", " <td>DNA polymerase III (beta chain) DnaN (DNA nucl...</td>\n", " <td>Rv0002, (MTV029.02, MTCY10H4.0), len: 402 aa. ...</td>\n", " <td>MDAATTRVGLTDLTFRLLRESFADAVSWVAKNLPARPAVPVLSGVL...</td>\n", " <td>NaN</td>\n", " <td>2051</td>\n", " <td>3260</td>\n", " <td>1</td>\n", " <td>402</td>\n", " <td>2</td>\n", " <td>9</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>CDS</td>\n", " <td>CCP42725.1</td>\n", " <td>Rv0003</td>\n", " <td>recF</td>\n", " <td>GI:444893472</td>\n", " <td>DNA replication and repair protein RecF (singl...</td>\n", " <td>Rv0003, (MTCY10H4.01), len: 385 aa. RecF, DNA ...</td>\n", " <td>MYVRHLGLRDFRSWACVDLELHPGRTVFVGPNGYGKTNLIEALWYS...</td>\n", " <td>NaN</td>\n", " <td>3279</td>\n", " <td>4437</td>\n", " <td>1</td>\n", " <td>385</td>\n", " <td>3</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>CDS</td>\n", " <td>CCP42727.1</td>\n", " <td>Rv0005</td>\n", " <td>gyrB</td>\n", " <td>GI:444893474</td>\n", " <td>DNA gyrase (subunit B) GyrB (DNA topoisomerase...</td>\n", " <td>Rv0005, (MTCY10H4.03), len: 675 aa. GyrB, DNA ...</td>\n", " <td>MAAQKKKAQDEYGAASITILEGLEAVRKRPGMYIGSTGERGLHHLI...</td>\n", " <td>NaN</td>\n", " <td>5239</td>\n", " <td>7267</td>\n", " <td>1</td>\n", " <td>675</td>\n", " <td>5</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>CDS</td>\n", " <td>CCP42728.1</td>\n", " <td>Rv0006</td>\n", " <td>gyrA</td>\n", " <td>GI:444893475</td>\n", " <td>DNA gyrase (subunit A) GyrA (DNA topoisomerase...</td>\n", " <td>Rv0006, (MTCY10H4.04), len: 838 aa. GyrA, DNA ...</td>\n", " <td>MTDTTLPPDDSLDRIEPVDIEQEMQRSYIDYAMSVIVGRALPEVRD...</td>\n", " <td>NaN</td>\n", " <td>7301</td>\n", " <td>9818</td>\n", " <td>1</td>\n", " <td>838</td>\n", " <td>6</td>\n", " <td>10</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " type protein_id locus_tag gene db_xref \\\n", "0 CDS CCP42723.1 Rv0001 dnaA GI:444893470 \n", "1 CDS CCP42724.1 Rv0002 dnaN GI:444893471 \n", "2 CDS CCP42725.1 Rv0003 recF GI:444893472 \n", "3 CDS CCP42727.1 Rv0005 gyrB GI:444893474 \n", "4 CDS CCP42728.1 Rv0006 gyrA GI:444893475 \n", "\n", " product \\\n", "0 Chromosomal replication initiator protein DnaA \n", "1 DNA polymerase III (beta chain) DnaN (DNA nucl... \n", "2 DNA replication and repair protein RecF (singl... \n", "3 DNA gyrase (subunit B) GyrB (DNA topoisomerase... \n", "4 DNA gyrase (subunit A) GyrA (DNA topoisomerase... \n", "\n", " note \\\n", "0 Rv0001, (MT0001, MTV029.01, P49993), len: 507 ... \n", "1 Rv0002, (MTV029.02, MTCY10H4.0), len: 402 aa. ... \n", "2 Rv0003, (MTCY10H4.01), len: 385 aa. RecF, DNA ... \n", "3 Rv0005, (MTCY10H4.03), len: 675 aa. GyrB, DNA ... \n", "4 Rv0006, (MTCY10H4.04), len: 838 aa. GyrA, DNA ... \n", "\n", " translation pseudo start end strand length order \\\n", "0 MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQRA... NaN 0 1524 1 507 1 \n", "1 MDAATTRVGLTDLTFRLLRESFADAVSWVAKNLPARPAVPVLSGVL... NaN 2051 3260 1 402 2 \n", "2 MYVRHLGLRDFRSWACVDLELHPGRTVFVGPNGYGKTNLIEALWYS... NaN 3279 4437 1 385 3 \n", "3 MAAQKKKAQDEYGAASITILEGLEAVRKRPGMYIGSTGERGLHHLI... NaN 5239 7267 1 675 5 \n", "4 MTDTTLPPDDSLDRIEPVDIEQEMQRSYIDYAMSVIVGRALPEVRD... NaN 7301 9818 1 838 6 \n", "\n", " concentration \n", "0 1 \n", "1 9 \n", "2 1 \n", "3 5 \n", "4 10 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#srm data\n", "srm = pd.read_csv('srm_mtb.csv')\n", "#srm = srm[srm.length<=400]\n", "proteome = sequtils.genbank_to_dataframe('MTB-H37Rv.gb',cds=True)\n", "#filter proteins on srm data\n", "proteome = proteome.merge(srm[['locus_tag','concentration']],on='locus_tag',how='inner')\n", "proteome = proteome[proteome.concentration>0]\n", "print (len(proteome))\n", "proteome.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predict binders\n", "\n", "Run binding predictions for each of the three and save to disk. This enables us to reload the results to re-analyse." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "took 75.966 seconds\n", "predictions done for 1870 sequences in 8 alleles\n", "results saved to /home/damien/gitprojects/epitopepredict/examples/mtb_tepitope\n" ] } ], "source": [ "P1 = base.get_predictor('tepitope')\n", "b = P1.predict_sequences(proteome, alleles=mhc2alleles, threads=8, path='mtb_tepitope')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "P2 = base.get_predictor('netmhciipan')\n", "b = P2.predict_sequences(proteome, alleles=mhc2alleles, threads=8, path='mtb_netmhciipan',overwrite=False)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "took 1496.114 seconds\n", "predictions done for 200 sequences in 10 alleles\n", "results saved to /home/damien/gitprojects/epitopepredict/examples/mtb_netmhcpan\n" ] } ], "source": [ "reload(base)\n", "P3 = base.get_predictor('netmhcpan')\n", "b = P3.predict_sequences(proteome[:200], alleles=mhc1alleles, threads=8, path='mtb_netmhcpan',overwrite=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute clusters of promiscuous binders\n", "\n", "Here we have detected clusters of MHC II binders. These clusters vary in length and for peptides synthesis we must create a list of 20-mers covering shorter sequences or splitting longer sequences into 2. For the 20mers hydrophobicity and net charge are calculated." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "reload(analysis)\n", "predictors = ['netmhcpan','tepitope','netmhciipan']\n", "clusters = {}\n", "for name in predictors:\n", " P = base.get_predictor(name)\n", " path = 'mtb_'+name\n", " P.load(path) \n", " pb=P.promiscuous_binders(n=3)\n", " #print (pb[:20])\n", " cl = analysis.find_clusters(pb, min_binders=3)\n", " clusters[name] = cl\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## overlapping clusters?" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " name start end binders length overlap\n", "0 Rv0002 175 193 3 18 0\n", "2 Rv0003 300 331 5 31 0\n", "1 Rv0003 153 174 4 21 0\n", "3 Rv0005 472 493 3 21 0\n", "4 Rv0005 544 558 3 14 0\n", "5 Rv0006 311 335 3 24 0\n", "8 Rv0006 433 450 3 17 0\n", "6 Rv0006 369 384 3 15 0\n", "7 Rv0006 386 399 3 13 0\n", "9 Rv0006 565 578 3 13 0\n", "10 Rv0015c 4 22 3 18 0\n", "11 Rv0018c 0 16 3 16 0\n", "12 Rv0019c 18 36 4 18 0\n", "13 Rv0034 82 95 3 13 0\n", "14 Rv0036c 215 230 3 15 0\n", "16 Rv0037c 199 217 3 18 0\n", "17 Rv0037c 274 288 3 14 0\n", "15 Rv0037c 31 44 3 13 0\n", "20 Rv0041 661 684 5 23 0\n", "18 Rv0041 91 111 4 20 0\n", "19 Rv0041 265 282 4 17 0\n", "21 Rv0041 782 798 3 16 0\n", "22 Rv0041 812 827 3 15 0\n", "23 Rv0046c 165 184 3 19 0\n", "25 Rv0050 393 412 3 19 0\n", "24 Rv0050 183 196 3 13 0\n", "26 Rv0060 160 186 5 26 0\n", "27 Rv0062 28 48 6 20 0\n", "28 Rv0062 67 92 4 25 0\n", "29 Rv0066c 253 284 6 31 0\n", "... ... ... ... ... ... ...\n", "1207 Rv3856c 192 211 3 19 0\n", "1208 Rv3863 286 322 5 36 0\n", "1209 Rv3863 337 351 3 14 0\n", "1210 Rv3864 156 181 3 25 0\n", "1211 Rv3868 304 323 3 19 0\n", "1212 Rv3869 0 33 7 33 0\n", "1213 Rv3869 82 105 3 23 0\n", "1214 Rv3870 196 227 7 31 0\n", "1215 Rv3870 616 630 3 14 0\n", "1216 Rv3876 439 458 4 19 0\n", "1217 Rv3882c 59 73 3 14 0\n", "1219 Rv3887c 286 309 4 23 0\n", "1218 Rv3887c 260 278 4 18 0\n", "1220 Rv3887c 464 482 3 18 0\n", "1221 Rv3888c 190 214 6 24 0\n", "1222 Rv3895c 11 34 4 23 1\n", "1223 Rv3902c 41 63 3 22 0\n", "1225 Rv3907c 323 356 8 33 0\n", "1224 Rv3907c 178 202 3 24 0\n", "1226 Rv3909 743 764 3 21 0\n", "1229 Rv3910 411 430 3 19 0\n", "1230 Rv3910 453 468 3 15 0\n", "1228 Rv3910 229 243 3 14 0\n", "1227 Rv3910 160 173 3 13 0\n", "1231 Rv3913 170 186 4 16 0\n", "1232 Rv3917c 191 208 4 17 0\n", "1234 Rv3919c 143 168 3 25 0\n", "1233 Rv3919c 85 103 3 18 0\n", "1235 Rv3920c 112 125 3 13 0\n", "1236 Rv3921c 46 69 6 23 0\n", "\n", "[1237 rows x 6 columns]\n" ] } ], "source": [ "c1=clusters['netmhciipan']\n", "c2=clusters['tepitope']\n", "x = analysis.get_overlaps(c1,c2,how='inside')\n", "print (x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "final = ep.analysis.create_nmers(cl, proteome, key='20mer', length=20, margin=2) \n", "final = analysis.peptide_properties(final, colname='20mer')\n", "print (final.head()) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reload(plotting)\n", "#print (P.get_names())\n", "ax = plotting.plot_tracks([P],name='Bla-g-5',cutoff=.95, cutoff_method='default',n=2, legend=True, figsize=(12,5),regions=cl)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 4 }