Predicting peptide-protein binding with a random forest

Intro

TDC-2 introduces the Protein-Peptide Binding Affinity prediction task. It is a predictive, non-generative task: learn a model that estimates a function of a protein, a peptide, the antigen processing pathway, biological context, and interaction features, and outputs either a binding affinity value or a binary label indicating strong or weak binding. The binary label can also fold in additional biomarkers, for example assigning a positive label if and only if the binding interaction is specific. To account for biomarkers beyond the binding affinity value, the task here is specified with a binary label. TDC-2 provides datasets and benchmarks for a generalized protein-peptide binding interaction prediction task and for a TCR-Epitope binding interaction prediction task.

X: protein, peptide, antigen processing pathway, biological context, and interaction features

y: binding affinity value or binary label indicating strong or weak binding

  1. Find data sources
  2. Explore and visualize the data
  3. Clean the data
  4. Engineer features
  5. Gather additional data
  6. Train models
  7. Deploy the best model

Find data sources

The data comes from the Therapeutics Data Commons (TDC): https://tdcommons.ai/benchmark/proteinpeptide_group/overview/

from tdc.benchmark_group.protein_peptide_group import ProteinPeptideGroup
group = ProteinPeptideGroup()
train, val = group.get_train_valid_split() # the val split is empty; use the train set if fine-tuning is desired
test = group.get_test()
X_train, y_train = train[0], train[1]
X_test, y_test = test[0], test[1]
Found local copy...
Loading...
Done!

Explore and visualize the data

X_train.head()
      Sequence       Protein Target_ID       ...  Reference                         ...
2165  FADMPDYAGNVGK  Anti-HA antibody 12ca5  ...  DOI:10.26434/chemrxiv-2023-tws4n  ...
1207  GEEFHDYRDYADK  Anti-HA antibody 12ca5  ...  DOI:10.26434/chemrxiv-2023-tws4n  ...
215   ASFAEYWNLLSAK  MDM2                    ...  DOI:10.1038/s42004-022-00737-w    ...
4156  FLNDKEDYSSAQK  Anti-HA antibody 12ca5  ...  DOI:10.26434/chemrxiv-2023-tws4n  ...
3630  WMMPMDAKDYAEK  Anti-HA antibody 12ca5  ...  DOI:10.26434/chemrxiv-2023-tws4n  ...

5 rows × 26 columns

The X data are amino acid sequences, the binding targets they were measured against, and some information about the source of the data.

y_train.value_counts()
Y
0    440
1      3
Name: count, dtype: int64

The y data are binary: bind or doesn’t bind. Only 3 of 443 sequence-target pairs result in binding.
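For scale, the normalized counts make the imbalance explicit; roughly 99.3% of pairs are non-binders. A quick pandas check (the numbers below are implied by the counts above):

y_train.value_counts(normalize=True)
# 0    0.993
# 1    0.007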

Feature engineering

I found a package, protpy, that extracts biochemical features from amino acid sequences; I'm going to try it:

https://github.com/amckenna41/protpy

import protpy
import pandas as pd
function_list = [protpy.amino_acid_composition, 
                 protpy.dipeptide_composition,
                 protpy.tripeptide_composition,
                 protpy.moreaubroto_autocorrelation,
                 #using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True
                 protpy.moran_autocorrelation,
                 #using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True
                 protpy.geary_autocorrelation,
                 #using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True,
                 protpy.conjoint_triad,
                 #protpy.sequence_order_coupling_number_, ### <--------- this returns a float; I will handle it separately below
                 protpy.sequence_order_coupling_number,
                 #using default parameters: lag=30, distance_matrix="schneider-wrede"
                 protpy.quasi_sequence_order,
                 #using default parameters: lag=30, weight=0.1, distance_matrix="schneider-wrede"
                ]

# Keep only the peptide sequence (X1) and its protein target
X_train = X_train[['X1','Protein Target_ID']]
X_train = X_train.reset_index()

# Each protpy function returns a one-row DataFrame of features for a single
# sequence; stack the per-sequence rows and join them on as new columns.
for function in function_list:
    df_list = []
    for sequence in X_train['X1']:
        df_list.append(function(sequence))
    tempdf = pd.concat(df_list).reset_index(drop=True)
    X_train = X_train.join(tempdf)

# sequence_order_coupling_number_ returns a float rather than a DataFrame
X_train['SOCN'] = X_train['X1'].apply(protpy.sequence_order_coupling_number_)

#There's probably a faster way to do this, but this way works - I had trouble applying protpy functions to entire dataframes in parallel. 
X_train.head()
   index  X1             Protein Target_ID       A       C    D       E       F       G       H      ...  QSO_SW50  SOCN
0   2165  FADMPDYAGNVGK  Anti-HA antibody 12ca5  15.385  0.0  15.385   0.000   7.692  15.385  0.000  ...       0.0  4.913
1   1207  GEEFHDYRDYADK  Anti-HA antibody 12ca5   7.692  0.0  23.077  15.385   7.692   7.692  7.692  ...       0.0  5.913
2    215  ASFAEYWNLLSAK  MDM2                    23.077  0.0   0.000   7.692   7.692   0.000  0.000  ...       0.0  3.860
3   4156  FLNDKEDYSSAQK  Anti-HA antibody 12ca5   7.692  0.0  15.385   7.692   7.692   0.000  0.000  ...       0.0  3.024
4   3630  WMMPMDAKDYAEK  Anti-HA antibody 12ca5  15.385  0.0  15.385   7.692   0.000   0.000  0.000  ...       0.0  4.325

5 rows × 9567 columns
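The same featurization has to run again later for the test set, so it's worth wrapping in a helper. A minimal sketch, reusing the function_list and imports above (the featurize name is mine):

def featurize(df):
    """Compute protpy features for the peptide sequences in df['X1']."""
    df = df[['X1', 'Protein Target_ID']].reset_index()
    for function in function_list:
        # One one-row DataFrame per sequence, stacked and joined as columns
        rows = [function(seq) for seq in df['X1']]
        df = df.join(pd.concat(rows).reset_index(drop=True))
    df['SOCN'] = df['X1'].apply(protpy.sequence_order_coupling_number_)
    return df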

y_train = y_train.reset_index(drop=True)
y_train[y_train == '1']  # the labels are stored as strings, hence '1'
155    1
391    1
412    1
Name: Y, dtype: object
X_train.iloc[[155, 391, 412]]
     index  X1             Protein Target_ID  A       C    D    E      F      G    H    ...  QSO_SW50  SOCN
155     17  TSFAAYWAALSAK  MDM2               38.462  0.0  0.0  0.000  7.692  0.0  0.0  ...       0.0  2.641
391      1  ASFAAYWNLLSP   MDM2               25.000  0.0  0.0  0.000  8.333  0.0  0.0  ...       0.0  2.009
412      7  TSFAEYANLLAP   MDM2               25.000  0.0  0.0  8.333  8.333  0.0  0.0  ...       0.0  3.073

3 rows × 9567 columns

All 3 positive data points are against the protein target MDM2. I'm going to work with only the MDM2 subset of the data, as a down-sampling technique that removes uninformative negative examples.

# Positional indices line up because both X_train and y_train were reset above
MDM2_list = X_train.index[X_train['Protein Target_ID']=='MDM2'].tolist()
X_train_MDM2 = X_train[X_train['Protein Target_ID']=='MDM2']
y_train_MDM2 = y_train.iloc[MDM2_list]
# Also drop the index, sequence, and Protein Target ID columns - we don't need them anymore
X_train_MDM2 = X_train_MDM2.drop(columns=['index','X1','Protein Target_ID'])
X_train_MDM2.head()
    A       C    D    E      F      G    H      I    K      L       ...  QSO_SW50  SOCN
2   23.077  0.0  0.0  7.692  7.692  0.0  0.000  0.0  7.692  15.385  ...       0.0  3.860
15  38.462  0.0  0.0  7.692  7.692  0.0  0.000  0.0  7.692   7.692  ...       0.0  3.662
24  53.846  0.0  0.0  0.000  7.692  0.0  0.000  0.0  7.692  15.385  ...       0.0  2.815
49   7.692  0.0  0.0  7.692  7.692  0.0  7.692  0.0  7.692   7.692  ...       0.0  3.957
57  38.462  0.0  0.0  7.692  7.692  0.0  0.000  0.0  7.692   7.692  ...       0.0  3.586

5 rows × 9564 columns

Adding SMOTE oversampling

I also want to do some strategic over-sampling to boost the positive signal.

import imblearn.over_sampling

# k_neighbors must be smaller than the number of minority-class samples;
# with only 3 positives, k_neighbors=2 is the most SMOTE can use.
sm = imblearn.over_sampling.SMOTE(random_state=42, k_neighbors=2)
X_res, y_res = sm.fit_resample(X_train_MDM2, y_train_MDM2)
X_res
     A          C    D    E         F         G    H      I    K         L          ...  QSO_SW50  SOCN
0    23.077000  0.0  0.0  7.692000  7.692000  0.0  0.000  0.0  7.692000  15.385000  ...       0.0  3.860000
1    38.462000  0.0  0.0  7.692000  7.692000  0.0  0.000  0.0  7.692000   7.692000  ...       0.0  3.662000
2    53.846000  0.0  0.0  0.000000  7.692000  0.0  0.000  0.0  7.692000  15.385000  ...       0.0  2.815000
3     7.692000  0.0  0.0  7.692000  7.692000  0.0  7.692  0.0  7.692000   7.692000  ...       0.0  3.957000
4    38.462000  0.0  0.0  7.692000  7.692000  0.0  0.000  0.0  7.692000   7.692000  ...       0.0  3.586000
..   ...        ...  ...  ...       ...       ...  ...    ...  ...       ...        ...       ...  ...
57   37.999062  0.0  0.0  0.286560  7.714043  0.0  0.000  0.0  7.427483   8.000637  ...       0.0  2.655856
58   25.000000  0.0  0.0  0.755633  8.333000  0.0  0.000  0.0  0.000000  16.667000  ...       0.0  2.105483
59   28.483696  0.0  0.0  6.176586  8.167122  0.0  0.000  0.0  1.990536  14.344450  ...       0.0  2.961207
60   25.000000  0.0  0.0  5.520798  8.333000  0.0  0.000  0.0  0.000000  16.667000  ...       0.0  2.713924
61   34.265745  0.0  0.0  0.000000  7.891807  0.0  0.000  0.0  5.294318  10.489607  ...       0.0  2.443999

62 rows × 9564 columns
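A quick sanity check on the labels: SMOTE's default strategy synthesizes minority samples until the classes balance, so the 62 rows above imply the MDM2 subset carried 31 negatives alongside its 3 positives:

y_res.value_counts()
# 0    31
# 1    31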

Train the model

Random forest has worked best for me.

from sklearn.ensemble import RandomForestClassifier

rf_classifier_smote = RandomForestClassifier(random_state=42)
rf_classifier_smote.fit(X_res, y_res)
RandomForestClassifier(random_state=42)
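As an aside, RandomForestClassifier also accepts a class_weight parameter that reweights classes inversely to their frequency, which could be tried instead of (or alongside) SMOTE. A sketch on the un-resampled MDM2 subset:

# 'balanced' weights each class inversely to its frequency in y
rf_weighted = RandomForestClassifier(class_weight='balanced', random_state=42)
rf_weighted.fit(X_train_MDM2, y_train_MDM2)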

Process the test data in the same way as the training data

# Reuse the same function_list defined in the feature-engineering section.
X_test = X_test[['X1','Protein Target_ID']]
X_test = X_test.reset_index()

# Same per-sequence feature extraction as for the training set
for function in function_list:
    df_list = []
    for sequence in X_test['X1']:
        df_list.append(function(sequence))
    tempdf = pd.concat(df_list).reset_index(drop=True)
    X_test = X_test.join(tempdf)

X_test['SOCN'] = X_test['X1'].apply(protpy.sequence_order_coupling_number_)

#There's probably a faster way to do this, but this way works - I had trouble applying protpy functions to entire dataframes in parallel. 
X_test.head()
   index  X1             Protein Target_ID       A       C    D       E      F       G    H       ...  QSO_SW50  SOCN
0   3910  LDLKDYADQRKAK  Anti-HA antibody 12ca5  15.385  0.0  23.077  0.000   0.000  0.0   0.000  ...       0.0  7.133
1   4145  TQHLDKHDYAVYK  Anti-HA antibody 12ca5   7.692  0.0  15.385  0.000   0.000  0.0  15.385  ...       0.0  4.970
2   4277  SKFMFDPRDYAAK  Anti-HA antibody 12ca5  15.385  0.0  15.385  0.000  15.385  0.0   0.000  ...       0.0  6.074
3   3523  EFVHDLKDYAAPK  Anti-HA antibody 12ca5  15.385  0.0  15.385  7.692   7.692  0.0   7.692  ...       0.0  5.177
4   3548  PAFFWDLNDYAFK  Anti-HA antibody 12ca5  15.385  0.0  15.385  0.000  23.077  0.0   0.000  ...       0.0  5.147

5 rows × 9567 columns

X_test = X_test.drop(columns=['index','X1','Protein Target_ID'])
predictions = rf_classifier_smote.predict(X_test)

# The evaluate function requires a very particular structure:
# string labels in an object array
predictions = predictions.astype(str).astype(object)
out = group.evaluate(predictions)
out
[0.75, 0.1935483870967742, 0.993231386312359, 0.3076923076923077]

The F1 score (0.31) is not great, but it may be acceptable for such an imbalanced dataset. The four numbers appear to be precision, recall, accuracy, and F1: with precision 0.75 and recall 0.19, F1 = 2 * 0.75 * 0.19 / (0.75 + 0.19) ≈ 0.31. Based on my knowledge of the biological problem, you'd rather have false positives than false negatives here; in other words, it's okay to sacrifice some precision for improved recall, a preference the balanced F1 score doesn't capture.
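One cheap way to act on that preference, assuming the fitted model above: lower the decision threshold on the forest's predicted probabilities instead of using the 0.5 default baked into predict. A sketch (the 0.3 threshold is arbitrary, for illustration):

# Probability of the positive class ('1') for each test peptide
pos_idx = list(rf_classifier_smote.classes_).index('1')
proba = rf_classifier_smote.predict_proba(X_test)[:, pos_idx]

# A lower threshold flags more peptides as binders: recall up, precision down
predictions_lenient = (proba >= 0.3).astype(int).astype(str).astype(object)
out_lenient = group.evaluate(predictions_lenient)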

Another approach that may be useful is reducing the dimensionality of the sequence representation by mapping residues onto the commonly used amino acid functional groups (see the sketch after this list):

  1. Cationic side chains
  2. Anionic side chains
  3. Polar uncharged side chains
  4. Hydrophobic side chains
  5. Special cases

As seen here: https://www.technologynetworks.com/applied-sciences/articles/essential-amino-acids-chart-abbreviations-and-structure-324357
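A minimal sketch of that idea: map each residue to its functional-group class and compute a five-dimensional composition vector per peptide. The grouping follows the chart linked above; the helper name is mine:

# Standard grouping of amino acids by side-chain chemistry
RESIDUE_CLASSES = {
    'cationic':    set('RHK'),
    'anionic':     set('DE'),
    'polar':       set('STNQ'),     # polar, uncharged
    'hydrophobic': set('AVILMFYW'),
    'special':     set('CUGP'),     # Cys, Sec, Gly, Pro
}

def functional_group_composition(sequence):
    """Fraction of residues falling in each functional-group class."""
    counts = {name: 0 for name in RESIDUE_CLASSES}
    for residue in sequence:
        for name, members in RESIDUE_CLASSES.items():
            if residue in members:
                counts[name] += 1
    return {name: count / len(sequence) for name, count in counts.items()}

functional_group_composition('ASFAEYWNLLSAK')
# {'cationic': 0.077, 'anionic': 0.077, 'polar': 0.231, 'hydrophobic': 0.615, 'special': 0.0}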