{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Predicting cattle T cell Epitopes in M.bovis\n",
    "\n",
    "This is a simplified version of an analysis we used to predict cattle T cell epitopes in the proteome of M.bovis. We apply three binding prediction methods to the M. bovis proteome using a subset of human MHC-II alleles to approximate the bovine immune response. Two different strategies are then applied to filter the resulting set of binders: global binder ranking and epitope density. Both these sets were used to make 20mer peptides for synthesis. \n",
    "\n",
    "Several other metrics are applied to produce a final list of peptide candidates.\n",
    "\n",
    "## References\n",
    "\n",
    "* Integrated computational prediction and experimental validation identifies promiscuous T cell epitopes in the proteome of pathogenic mycobacteria. Damien Farrell, Gareth Jones, Chris Pirson, Kerri M Malone, Kevin Rue-Albrecht, Anthony J Chubb, H Martin Vordermeier, Stephen V. Gordon. 2016\n",
    "* M. Pandya, “Definition of Bovine Leukocyte Antigen Diversity and Peptide Binding Profiles for Epitope Discovery,” 2016.\n",
    "* O. T. Schubert et al., “Absolute Proteome Composition and Dynamics during Dormancy and Resuscitation of Mycobacterium tuberculosis,” Cell Host Microbe, 2015."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, math, time, pickle, subprocess\n",
    "from importlib import reload\n",
    "from collections import OrderedDict\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "pd.set_option('display.width', 100)\n",
    "import epitopepredict as ep\n",
    "from epitopepredict import base, sequtils, plotting, peptutils, analysis\n",
    "from IPython.display import display, HTML, Image\n",
    "%matplotlib inline\n",
    "import matplotlib as mpl\n",
    "import pylab as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Alleles\n",
    "* 8 bovine similar HLA MHC-II alleles were used to approximate cattle responses\n",
    "* BoLA class I alleles used here are those most frequent in 81 Holstein cattle from the UVM Dairy Center of Excellence research herd assayed by PCR-SSP. (Different alleles were used in the original study)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "mhc2alleles = ep.get_preset_alleles('bovine_like_mhc2')\n",
    "mhc1alleles = ['BoLA-1:01901', 'BoLA-2:00801','BoLA-2:01201','BoLA-4:02401','BoLA-2:07001',\n",
    "               'BoLA-3:01701','BoLA-1:02301','BoLA-2:01801','BoLA-6:01302','BoLA-3:00201']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filtering with MTB srm data\n",
    "\n",
    "Here we load the protein sequences from a genbank file. MTB is used since it is 99% similar to Bovis. You can find this file at https://www.ncbi.nlm.nih.gov/genome/166?genome_assembly_id=159857\n",
    "\n",
    "The proteins are filtered using absolute abundance levels of proteins in an unfractionated mixed lysate of M. tuberculosis H37Rv cultures identified by Schubert et al. All proteins undetected using selected reaction monitoring (SRM) (concentration = 0 and <2 peptides detected) in this study were filtered out. This leaves 1870 out of ~4000 proteins."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1870\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>type</th>\n",
       "      <th>protein_id</th>\n",
       "      <th>locus_tag</th>\n",
       "      <th>gene</th>\n",
       "      <th>db_xref</th>\n",
       "      <th>product</th>\n",
       "      <th>note</th>\n",
       "      <th>translation</th>\n",
       "      <th>pseudo</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>strand</th>\n",
       "      <th>length</th>\n",
       "      <th>order</th>\n",
       "      <th>concentration</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CDS</td>\n",
       "      <td>CCP42723.1</td>\n",
       "      <td>Rv0001</td>\n",
       "      <td>dnaA</td>\n",
       "      <td>GI:444893470</td>\n",
       "      <td>Chromosomal replication initiator protein DnaA</td>\n",
       "      <td>Rv0001, (MT0001, MTV029.01, P49993), len: 507 ...</td>\n",
       "      <td>MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQRA...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1524</td>\n",
       "      <td>1</td>\n",
       "      <td>507</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CDS</td>\n",
       "      <td>CCP42724.1</td>\n",
       "      <td>Rv0002</td>\n",
       "      <td>dnaN</td>\n",
       "      <td>GI:444893471</td>\n",
       "      <td>DNA polymerase III (beta chain) DnaN (DNA nucl...</td>\n",
       "      <td>Rv0002, (MTV029.02, MTCY10H4.0), len: 402 aa. ...</td>\n",
       "      <td>MDAATTRVGLTDLTFRLLRESFADAVSWVAKNLPARPAVPVLSGVL...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2051</td>\n",
       "      <td>3260</td>\n",
       "      <td>1</td>\n",
       "      <td>402</td>\n",
       "      <td>2</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CDS</td>\n",
       "      <td>CCP42725.1</td>\n",
       "      <td>Rv0003</td>\n",
       "      <td>recF</td>\n",
       "      <td>GI:444893472</td>\n",
       "      <td>DNA replication and repair protein RecF (singl...</td>\n",
       "      <td>Rv0003, (MTCY10H4.01), len: 385 aa. RecF, DNA ...</td>\n",
       "      <td>MYVRHLGLRDFRSWACVDLELHPGRTVFVGPNGYGKTNLIEALWYS...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3279</td>\n",
       "      <td>4437</td>\n",
       "      <td>1</td>\n",
       "      <td>385</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CDS</td>\n",
       "      <td>CCP42727.1</td>\n",
       "      <td>Rv0005</td>\n",
       "      <td>gyrB</td>\n",
       "      <td>GI:444893474</td>\n",
       "      <td>DNA gyrase (subunit B) GyrB (DNA topoisomerase...</td>\n",
       "      <td>Rv0005, (MTCY10H4.03), len: 675 aa. GyrB, DNA ...</td>\n",
       "      <td>MAAQKKKAQDEYGAASITILEGLEAVRKRPGMYIGSTGERGLHHLI...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5239</td>\n",
       "      <td>7267</td>\n",
       "      <td>1</td>\n",
       "      <td>675</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>CDS</td>\n",
       "      <td>CCP42728.1</td>\n",
       "      <td>Rv0006</td>\n",
       "      <td>gyrA</td>\n",
       "      <td>GI:444893475</td>\n",
       "      <td>DNA gyrase (subunit A) GyrA (DNA topoisomerase...</td>\n",
       "      <td>Rv0006, (MTCY10H4.04), len: 838 aa. GyrA, DNA ...</td>\n",
       "      <td>MTDTTLPPDDSLDRIEPVDIEQEMQRSYIDYAMSVIVGRALPEVRD...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7301</td>\n",
       "      <td>9818</td>\n",
       "      <td>1</td>\n",
       "      <td>838</td>\n",
       "      <td>6</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  type  protein_id locus_tag  gene       db_xref  \\\n",
       "0  CDS  CCP42723.1    Rv0001  dnaA  GI:444893470   \n",
       "1  CDS  CCP42724.1    Rv0002  dnaN  GI:444893471   \n",
       "2  CDS  CCP42725.1    Rv0003  recF  GI:444893472   \n",
       "3  CDS  CCP42727.1    Rv0005  gyrB  GI:444893474   \n",
       "4  CDS  CCP42728.1    Rv0006  gyrA  GI:444893475   \n",
       "\n",
       "                                             product  \\\n",
       "0     Chromosomal replication initiator protein DnaA   \n",
       "1  DNA polymerase III (beta chain) DnaN (DNA nucl...   \n",
       "2  DNA replication and repair protein RecF (singl...   \n",
       "3  DNA gyrase (subunit B) GyrB (DNA topoisomerase...   \n",
       "4  DNA gyrase (subunit A) GyrA (DNA topoisomerase...   \n",
       "\n",
       "                                                note  \\\n",
       "0  Rv0001, (MT0001, MTV029.01, P49993), len: 507 ...   \n",
       "1  Rv0002, (MTV029.02, MTCY10H4.0), len: 402 aa. ...   \n",
       "2  Rv0003, (MTCY10H4.01), len: 385 aa. RecF, DNA ...   \n",
       "3  Rv0005, (MTCY10H4.03), len: 675 aa. GyrB, DNA ...   \n",
       "4  Rv0006, (MTCY10H4.04), len: 838 aa. GyrA, DNA ...   \n",
       "\n",
       "                                         translation pseudo  start   end  strand  length  order  \\\n",
       "0  MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQRA...    NaN      0  1524       1     507      1   \n",
       "1  MDAATTRVGLTDLTFRLLRESFADAVSWVAKNLPARPAVPVLSGVL...    NaN   2051  3260       1     402      2   \n",
       "2  MYVRHLGLRDFRSWACVDLELHPGRTVFVGPNGYGKTNLIEALWYS...    NaN   3279  4437       1     385      3   \n",
       "3  MAAQKKKAQDEYGAASITILEGLEAVRKRPGMYIGSTGERGLHHLI...    NaN   5239  7267       1     675      5   \n",
       "4  MTDTTLPPDDSLDRIEPVDIEQEMQRSYIDYAMSVIVGRALPEVRD...    NaN   7301  9818       1     838      6   \n",
       "\n",
       "   concentration  \n",
       "0              1  \n",
       "1              9  \n",
       "2              1  \n",
       "3              5  \n",
       "4             10  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#srm data\n",
    "srm = pd.read_csv('srm_mtb.csv')\n",
    "#srm = srm[srm.length<=400]\n",
    "proteome = sequtils.genbank_to_dataframe('MTB-H37Rv.gb',cds=True)\n",
    "#filter proteins on srm data\n",
    "proteome = proteome.merge(srm[['locus_tag','concentration']],on='locus_tag',how='inner')\n",
    "proteome = proteome[proteome.concentration>0]\n",
    "print (len(proteome))\n",
    "proteome.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predict binders\n",
    "\n",
    "Run binding predictions for each of the three and save to disk. This enables us to reload the results to re-analyse."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "took 75.966 seconds\n",
      "predictions done for 1870 sequences in 8 alleles\n",
      "results saved to /home/damien/gitprojects/epitopepredict/examples/mtb_tepitope\n"
     ]
    }
   ],
   "source": [
    "P1 = base.get_predictor('tepitope')\n",
    "b = P1.predict_sequences(proteome, alleles=mhc2alleles, threads=8, path='mtb_tepitope')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P2 = base.get_predictor('netmhciipan')\n",
    "b = P2.predict_sequences(proteome, alleles=mhc2alleles, threads=8, path='mtb_netmhciipan',overwrite=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "took 1496.114 seconds\n",
      "predictions done for 200 sequences in 10 alleles\n",
      "results saved to /home/damien/gitprojects/epitopepredict/examples/mtb_netmhcpan\n"
     ]
    }
   ],
   "source": [
    "reload(base)\n",
    "P3 = base.get_predictor('netmhcpan')\n",
    "b = P3.predict_sequences(proteome[:200], alleles=mhc1alleles, threads=8, path='mtb_netmhcpan',overwrite=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compute clusters of promiscuous binders\n",
    "\n",
    "Here we have detected clusters of MHC II binders. These clusters vary in length and for peptides synthesis we must create a list of 20-mers covering shorter sequences or splitting longer sequences into 2. For the 20mers hydrophobicity and net charge are calculated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "reload(analysis)\n",
    "predictors = ['netmhcpan','tepitope','netmhciipan']\n",
    "clusters = {}\n",
    "for name in predictors:\n",
    "    P = base.get_predictor(name)\n",
    "    path = 'mtb_'+name\n",
    "    P.load(path)   \n",
    "    pb=P.promiscuous_binders(n=3)\n",
    "    #print (pb[:20])\n",
    "    cl = analysis.find_clusters(pb, min_binders=3)\n",
    "    clusters[name] = cl\n",
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## overlapping clusters?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "         name  start  end  binders  length  overlap\n",
      "0      Rv0002    175  193        3      18        0\n",
      "2      Rv0003    300  331        5      31        0\n",
      "1      Rv0003    153  174        4      21        0\n",
      "3      Rv0005    472  493        3      21        0\n",
      "4      Rv0005    544  558        3      14        0\n",
      "5      Rv0006    311  335        3      24        0\n",
      "8      Rv0006    433  450        3      17        0\n",
      "6      Rv0006    369  384        3      15        0\n",
      "7      Rv0006    386  399        3      13        0\n",
      "9      Rv0006    565  578        3      13        0\n",
      "10    Rv0015c      4   22        3      18        0\n",
      "11    Rv0018c      0   16        3      16        0\n",
      "12    Rv0019c     18   36        4      18        0\n",
      "13     Rv0034     82   95        3      13        0\n",
      "14    Rv0036c    215  230        3      15        0\n",
      "16    Rv0037c    199  217        3      18        0\n",
      "17    Rv0037c    274  288        3      14        0\n",
      "15    Rv0037c     31   44        3      13        0\n",
      "20     Rv0041    661  684        5      23        0\n",
      "18     Rv0041     91  111        4      20        0\n",
      "19     Rv0041    265  282        4      17        0\n",
      "21     Rv0041    782  798        3      16        0\n",
      "22     Rv0041    812  827        3      15        0\n",
      "23    Rv0046c    165  184        3      19        0\n",
      "25     Rv0050    393  412        3      19        0\n",
      "24     Rv0050    183  196        3      13        0\n",
      "26     Rv0060    160  186        5      26        0\n",
      "27     Rv0062     28   48        6      20        0\n",
      "28     Rv0062     67   92        4      25        0\n",
      "29    Rv0066c    253  284        6      31        0\n",
      "...       ...    ...  ...      ...     ...      ...\n",
      "1207  Rv3856c    192  211        3      19        0\n",
      "1208   Rv3863    286  322        5      36        0\n",
      "1209   Rv3863    337  351        3      14        0\n",
      "1210   Rv3864    156  181        3      25        0\n",
      "1211   Rv3868    304  323        3      19        0\n",
      "1212   Rv3869      0   33        7      33        0\n",
      "1213   Rv3869     82  105        3      23        0\n",
      "1214   Rv3870    196  227        7      31        0\n",
      "1215   Rv3870    616  630        3      14        0\n",
      "1216   Rv3876    439  458        4      19        0\n",
      "1217  Rv3882c     59   73        3      14        0\n",
      "1219  Rv3887c    286  309        4      23        0\n",
      "1218  Rv3887c    260  278        4      18        0\n",
      "1220  Rv3887c    464  482        3      18        0\n",
      "1221  Rv3888c    190  214        6      24        0\n",
      "1222  Rv3895c     11   34        4      23        1\n",
      "1223  Rv3902c     41   63        3      22        0\n",
      "1225  Rv3907c    323  356        8      33        0\n",
      "1224  Rv3907c    178  202        3      24        0\n",
      "1226   Rv3909    743  764        3      21        0\n",
      "1229   Rv3910    411  430        3      19        0\n",
      "1230   Rv3910    453  468        3      15        0\n",
      "1228   Rv3910    229  243        3      14        0\n",
      "1227   Rv3910    160  173        3      13        0\n",
      "1231   Rv3913    170  186        4      16        0\n",
      "1232  Rv3917c    191  208        4      17        0\n",
      "1234  Rv3919c    143  168        3      25        0\n",
      "1233  Rv3919c     85  103        3      18        0\n",
      "1235  Rv3920c    112  125        3      13        0\n",
      "1236  Rv3921c     46   69        6      23        0\n",
      "\n",
      "[1237 rows x 6 columns]\n"
     ]
    }
   ],
   "source": [
    "c1=clusters['netmhciipan']\n",
    "c2=clusters['tepitope']\n",
    "x = analysis.get_overlaps(c1,c2,how='inside')\n",
    "print (x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "final = ep.analysis.create_nmers(cl, proteome, key='20mer', length=20, margin=2)     \n",
    "final = analysis.peptide_properties(final, colname='20mer')\n",
    "print (final.head())   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reload(plotting)\n",
    "#print (P.get_names())\n",
    "ax = plotting.plot_tracks([P],name='Bla-g-5',cutoff=.95, cutoff_method='default',n=2, legend=True, figsize=(12,5),regions=cl)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}