Predicting peptide-protein binding with a random forest
Intro
TDC-2 introduces the Protein-Peptide Binding Affinity prediction task. This predictive, non-generative task is to learn a model that maps a protein, peptide, antigen processing pathway, biological context, and interaction features to either a binding affinity value or a binary label indicating strong or weak binding. The binary label can also encode additional biomarkers, e.g. assigning a positive label if and only if the binding interaction is specific. To account for biomarkers beyond the raw affinity value, our task is specified with a binary label. TDC-2 provides datasets and benchmarks for a generalized protein-peptide binding interaction prediction task and a TCR-Epitope binding interaction prediction task.
X: protein, peptide, antigen processing pathway, biological context, and interaction features
y: binding affinity value or binary label indicating strong or weak binding
- Find data sources
- Explore and visualize the data
- Clean data
- Feature engineer
- Additional data
- Train models
- Deploy best model
Find data sources
Data is from TDC: https://tdcommons.ai/benchmark/proteinpeptide_group/overview/
from tdc.benchmark_group.protein_peptide_group import ProteinPeptideGroup
group = ProteinPeptideGroup()
train, val = group.get_train_valid_split() # val dataset will be empty. use the train dataset if fine-tuning desired.
test = group.get_test()
X_train, y_train = train[0], train[1]
X_test, y_test = test[0], test[1]
Found local copy...
Loading...
Done!
Explore and visualize the data
X_train.head()
Sequence_ID | Sequence | Protein Target_ID | Protein Target | Name | ALC (sequencing confidence) | KD (nM) | Reference | License | Unnamed: 7 | ... | lower | expected | upper | protein_or_rna_sequence | gene_type | X1 | X2 | ID1 | ID2 | KD (nm) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2165 | NaN | FADMPDYAGNVGK | Anti-HA antibody 12ca5 | Anti-HA antibody 12ca5 | NaN | >85 | Putative binder | DOI:10.26434/chemrxiv-2023-tws4n\n | Creative Commons Attribution-NonCommercial-NoD... | NaN | ... | 0.0 | 0.0 | 0.0 | Protein sequence not found for the given gene ... | Could not find type of gene | FADMPDYAGNVGK | Protein sequence not found for the given gene ... | NaN | Anti-HA antibody 12ca5 | 1 |
1207 | NaN | GEEFHDYRDYADK | Anti-HA antibody 12ca5 | Anti-HA antibody 12ca5 | NaN | >85 | Putative binder | DOI:10.26434/chemrxiv-2023-tws4n\n | Creative Commons Attribution-NonCommercial-NoD... | NaN | ... | 0.0 | 0.0 | 0.0 | Protein sequence not found for the given gene ... | Could not find type of gene | GEEFHDYRDYADK | Protein sequence not found for the given gene ... | NaN | Anti-HA antibody 12ca5 | 1 |
215 | NaN | ASFAEYWNLLSAK | MDM2 | MDM2 | NaN | 88 | Putative binder | DOI:10.1038/s42004-022-00737-w | Creative Commons Attribution 4.0 International... | NaN | ... | 0.0 | 0.0 | 0.0 | MCNTNMSVPTDGAVTTSQIPASEQETLVRPKPLLLKLLKSVGAQKD... | protein-coding | ASFAEYWNLLSAK | MCNTNMSVPTDGAVTTSQIPASEQETLVRPKPLLLKLLKSVGAQKD... | NaN | MDM2 | 1 |
4156 | NaN | FLNDKEDYSSAQK | Anti-HA antibody 12ca5 | Anti-HA antibody 12ca5 | NaN | >85 | Putative binder | DOI:10.26434/chemrxiv-2023-tws4n\n | Creative Commons Attribution-NonCommercial-NoD... | NaN | ... | 0.0 | 0.0 | 0.0 | Protein sequence not found for the given gene ... | Could not find type of gene | FLNDKEDYSSAQK | Protein sequence not found for the given gene ... | NaN | Anti-HA antibody 12ca5 | 1 |
3630 | NaN | WMMPMDAKDYAEK | Anti-HA antibody 12ca5 | Anti-HA antibody 12ca5 | NaN | >85 | Putative binder | DOI:10.26434/chemrxiv-2023-tws4n\n | Creative Commons Attribution-NonCommercial-NoD... | NaN | ... | 0.0 | 0.0 | 0.0 | Protein sequence not found for the given gene ... | Could not find type of gene | WMMPMDAKDYAEK | Protein sequence not found for the given gene ... | NaN | Anti-HA antibody 12ca5 | 1 |
5 rows × 26 columns
The X data are amino acid sequences, the binding targets they were measured against, and some information about the source of the data.
y_train.value_counts()
Y
0 440
1 3
Name: count, dtype: int64
The y data are binary: bind or doesn’t bind. Only 3 of 443 sequence-target pairs result in binding.
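With imbalance this extreme, it's worth quantifying before choosing a strategy. A quick sketch using the counts from the `value_counts()` output above (hard-coded here rather than recomputed from the dataframe):

```python
# Label counts copied from the value_counts() output above
n_neg, n_pos = 440, 3
total = n_neg + n_pos
pos_rate = n_pos / total

print(f"{n_pos}/{total} positives ({pos_rate:.2%})")   # 3/443 positives (0.68%)
print(f"imbalance ratio ~ 1:{n_neg // n_pos}")         # imbalance ratio ~ 1:146
```

At roughly 1 positive per 146 negatives, a naive classifier that always predicts "no binding" would score 99.3% accuracy, which is why the plan below leans on down-sampling and oversampling rather than accuracy alone.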
Feature Engineer
I found this package to extract biochemical features from amino acid sequences; I'm going to try it:
https://github.com/amckenna41/protpy
import protpy
import pandas as pd
function_list = [protpy.amino_acid_composition,
protpy.dipeptide_composition,
protpy.tripeptide_composition,
protpy.moreaubroto_autocorrelation,
#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True
protpy.moran_autocorrelation,
#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True
protpy.geary_autocorrelation,
#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True,
protpy.conjoint_triad,
#protpy.sequence_order_coupling_number_, ### <--------- this returns a float; I will have to handle it separately
protpy.sequence_order_coupling_number,
#using default parameters: lag=30, distance_matrix="schneider-wrede"
protpy.quasi_sequence_order,
#using default parameters: lag=30, weight=0.1, distance_matrix="schneider-wrede"
]
X_train = X_train[['X1','Protein Target_ID']]
X_train = X_train.reset_index()
for function in function_list:
    df_list = []
    for index, sequence in enumerate(X_train['X1']):
        AA = function(sequence)
        df_list.append(AA)
    tempdf = pd.concat(df_list)
    tempdf = tempdf.reset_index(drop=True)
    X_train = X_train.join(tempdf)
X_train['SOCN'] = X_train['X1'].apply(protpy.sequence_order_coupling_number_)
# There's probably a faster way to do this, but this way works - I had trouble applying protpy functions to entire dataframes in parallel.
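For reference, the shape of this loop — apply each featurizer to each sequence, then merge the resulting feature columns — can be sketched in plain Python. The two featurizers below are hypothetical stand-ins for illustration, not protpy functions:

```python
def length_feature(seq):
    # hypothetical stand-in featurizer (not part of protpy)
    return {"length": len(seq)}

def lysine_count(seq):
    # hypothetical stand-in featurizer (not part of protpy)
    return {"K_count": seq.count("K")}

def featurize(sequences, functions):
    """Apply every featurizer to every sequence and merge the feature dicts."""
    rows = []
    for seq in sequences:
        row = {}
        for fn in functions:
            row.update(fn(seq))  # each featurizer contributes its own columns
        rows.append(row)
    return rows

rows = featurize(["FADMPDYAGNVGK", "ASFAEYWNLLSAK"],
                 [length_feature, lysine_count])
# rows[0] -> {'length': 13, 'K_count': 1}
```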
X_train.head()
index | X1 | Protein Target_ID | A | C | D | E | F | G | H | ... | QSO_SW42 | QSO_SW43 | QSO_SW44 | QSO_SW45 | QSO_SW46 | QSO_SW47 | QSO_SW48 | QSO_SW49 | QSO_SW50 | SOCN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2165 | FADMPDYAGNVGK | Anti-HA antibody 12ca5 | 15.385 | 0.0 | 15.385 | 0.000 | 7.692 | 15.385 | 0.000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.913 |
1 | 1207 | GEEFHDYRDYADK | Anti-HA antibody 12ca5 | 7.692 | 0.0 | 23.077 | 15.385 | 7.692 | 7.692 | 7.692 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.913 |
2 | 215 | ASFAEYWNLLSAK | MDM2 | 23.077 | 0.0 | 0.000 | 7.692 | 7.692 | 0.000 | 0.000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.860 |
3 | 4156 | FLNDKEDYSSAQK | Anti-HA antibody 12ca5 | 7.692 | 0.0 | 15.385 | 7.692 | 7.692 | 0.000 | 0.000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.024 |
4 | 3630 | WMMPMDAKDYAEK | Anti-HA antibody 12ca5 | 15.385 | 0.0 | 15.385 | 7.692 | 0.000 | 0.000 | 0.000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.325 |
5 rows × 9567 columns
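As a sanity check on the composition columns above, percent amino-acid composition is easy to recompute by hand. This is a rough stdlib reimplementation, assuming protpy reports percentages rounded to three decimals (which matches the table):

```python
from collections import Counter

def amino_acid_composition(seq: str) -> dict:
    """Percent of the sequence made up of each residue."""
    counts = Counter(seq)
    return {aa: round(100 * n / len(seq), 3) for aa, n in counts.items()}

comp = amino_acid_composition("FADMPDYAGNVGK")
# comp["A"] -> 15.385 (2 of 13 residues), matching the A column above
```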
y_train = y_train.reset_index(drop=True)
y_train[y_train=='1']
155 1
391 1
412 1
Name: Y, dtype: object
X_train.iloc[[155, 391, 412]]
index | X1 | Protein Target_ID | A | C | D | E | F | G | H | ... | QSO_SW42 | QSO_SW43 | QSO_SW44 | QSO_SW45 | QSO_SW46 | QSO_SW47 | QSO_SW48 | QSO_SW49 | QSO_SW50 | SOCN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
155 | 17 | TSFAAYWAALSAK | MDM2 | 38.462 | 0.0 | 0.0 | 0.000 | 7.692 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.641 |
391 | 1 | ASFAAYWNLLSP | MDM2 | 25.000 | 0.0 | 0.0 | 0.000 | 8.333 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.009 |
412 | 7 | TSFAEYANLLAP | MDM2 | 25.000 | 0.0 | 0.0 | 8.333 | 8.333 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.073 |
3 rows × 9567 columns
All 3 positive data points have MDM2 as the protein target. I'm going to work only with the MDM2 subset of the data, as a down-sampling technique to remove excess negative signal.
MDM2_list = X_train.index[X_train['Protein Target_ID']=='MDM2'].tolist()
X_train_MDM2 = X_train[X_train['Protein Target_ID']=='MDM2']
y_trian_MDM2 = y_train.iloc[MDM2_list]
#Also drop the Protein Target ID column because... we don't need it anymore.
X_train_MDM2 = X_train_MDM2.drop(columns=['index','X1','Protein Target_ID'])
X_train_MDM2.head()
A | C | D | E | F | G | H | I | K | L | ... | QSO_SW42 | QSO_SW43 | QSO_SW44 | QSO_SW45 | QSO_SW46 | QSO_SW47 | QSO_SW48 | QSO_SW49 | QSO_SW50 | SOCN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 23.077 | 0.0 | 0.0 | 7.692 | 7.692 | 0.0 | 0.000 | 0.0 | 7.692 | 15.385 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.860 |
15 | 38.462 | 0.0 | 0.0 | 7.692 | 7.692 | 0.0 | 0.000 | 0.0 | 7.692 | 7.692 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.662 |
24 | 53.846 | 0.0 | 0.0 | 0.000 | 7.692 | 0.0 | 0.000 | 0.0 | 7.692 | 15.385 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.815 |
49 | 7.692 | 0.0 | 0.0 | 7.692 | 7.692 | 0.0 | 7.692 | 0.0 | 7.692 | 7.692 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.957 |
57 | 38.462 | 0.0 | 0.0 | 7.692 | 7.692 | 0.0 | 0.000 | 0.0 | 7.692 | 7.692 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.586 |
5 rows × 9564 columns
Adding SMOTE oversampling
I also want to do some strategic over-sampling to boost the positive signal.
import imblearn.over_sampling
sm = imblearn.over_sampling.SMOTE(random_state=42, k_neighbors=2)
X_res, y_res = sm.fit_resample(X_train_MDM2, y_trian_MDM2)
X_res
A | C | D | E | F | G | H | I | K | L | ... | QSO_SW42 | QSO_SW43 | QSO_SW44 | QSO_SW45 | QSO_SW46 | QSO_SW47 | QSO_SW48 | QSO_SW49 | QSO_SW50 | SOCN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23.077000 | 0.0 | 0.0 | 7.692000 | 7.692000 | 0.0 | 0.000 | 0.0 | 7.692000 | 15.385000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.860000 |
1 | 38.462000 | 0.0 | 0.0 | 7.692000 | 7.692000 | 0.0 | 0.000 | 0.0 | 7.692000 | 7.692000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.662000 |
2 | 53.846000 | 0.0 | 0.0 | 0.000000 | 7.692000 | 0.0 | 0.000 | 0.0 | 7.692000 | 15.385000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.815000 |
3 | 7.692000 | 0.0 | 0.0 | 7.692000 | 7.692000 | 0.0 | 7.692 | 0.0 | 7.692000 | 7.692000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.957000 |
4 | 38.462000 | 0.0 | 0.0 | 7.692000 | 7.692000 | 0.0 | 0.000 | 0.0 | 7.692000 | 7.692000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.586000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
57 | 37.999062 | 0.0 | 0.0 | 0.286560 | 7.714043 | 0.0 | 0.000 | 0.0 | 7.427483 | 8.000637 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.655856 |
58 | 25.000000 | 0.0 | 0.0 | 0.755633 | 8.333000 | 0.0 | 0.000 | 0.0 | 0.000000 | 16.667000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.105483 |
59 | 28.483696 | 0.0 | 0.0 | 6.176586 | 8.167122 | 0.0 | 0.000 | 0.0 | 1.990536 | 14.344450 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.961207 |
60 | 25.000000 | 0.0 | 0.0 | 5.520798 | 8.333000 | 0.0 | 0.000 | 0.0 | 0.000000 | 16.667000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.713924 |
61 | 34.265745 | 0.0 | 0.0 | 0.000000 | 7.891807 | 0.0 | 0.000 | 0.0 | 5.294318 | 10.489607 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.443999 |
62 rows × 9564 columns
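The fractional compositions in the later rows are SMOTE's synthetic minority samples. The core of what imblearn is doing can be sketched in a few lines: each synthetic point is a random interpolation between a minority point and one of its k nearest minority neighbors. This sketch skips the neighbor search and shows only the interpolation step:

```python
import random

def smote_interpolate(x, neighbor, rng):
    """Create one synthetic point on the segment between x and a neighbor."""
    u = rng.random()  # uniform in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
x = [23.077, 7.692]        # two features of one real minority point
neighbor = [38.462, 0.0]   # the same features of a nearby minority point
synthetic = smote_interpolate(x, neighbor, rng)
# each coordinate of `synthetic` lies between the two endpoints
```

This is why the synthetic rows look like blends of the real MDM2 binders rather than new sequences: SMOTE operates in feature space, not sequence space.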
Train the model
A random forest has worked best for me.
import numpy as np
#from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
rf_classifier_smote = RandomForestClassifier(random_state=42)
#X_train, y_train = X_res, y_res
rf_classifier_smote.fit(X_res, y_res)
RandomForestClassifier(random_state=42)
Process the test data in the same way as the training data
X_test = X_test[['X1','Protein Target_ID']]
X_test = X_test.reset_index()
for function in function_list:
    df_list = []
    for index, sequence in enumerate(X_test['X1']):
        AA = function(sequence)
        df_list.append(AA)
    tempdf = pd.concat(df_list)
    tempdf = tempdf.reset_index(drop=True)
    X_test = X_test.join(tempdf)
X_test['SOCN'] = X_test['X1'].apply(protpy.sequence_order_coupling_number_)
# There's probably a faster way to do this, but this way works - I had trouble applying protpy functions to entire dataframes in parallel.
X_test.head()
index | X1 | Protein Target_ID | A | C | D | E | F | G | H | ... | QSO_SW42 | QSO_SW43 | QSO_SW44 | QSO_SW45 | QSO_SW46 | QSO_SW47 | QSO_SW48 | QSO_SW49 | QSO_SW50 | SOCN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3910 | LDLKDYADQRKAK | Anti-HA antibody 12ca5 | 15.385 | 0.0 | 23.077 | 0.000 | 0.000 | 0.0 | 0.000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.133 |
1 | 4145 | TQHLDKHDYAVYK | Anti-HA antibody 12ca5 | 7.692 | 0.0 | 15.385 | 0.000 | 0.000 | 0.0 | 15.385 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.970 |
2 | 4277 | SKFMFDPRDYAAK | Anti-HA antibody 12ca5 | 15.385 | 0.0 | 15.385 | 0.000 | 15.385 | 0.0 | 0.000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.074 |
3 | 3523 | EFVHDLKDYAAPK | Anti-HA antibody 12ca5 | 15.385 | 0.0 | 15.385 | 7.692 | 7.692 | 0.0 | 7.692 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.177 |
4 | 3548 | PAFFWDLNDYAFK | Anti-HA antibody 12ca5 | 15.385 | 0.0 | 15.385 | 0.000 | 23.077 | 0.0 | 0.000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.147 |
5 rows × 9567 columns
X_test = X_test.drop(columns=['index','X1','Protein Target_ID'])
predictions = rf_classifier_smote.predict(X_test)
## The evaluate function requires a very particular structure for the data ##
predictions = predictions.astype(str)
predictions = predictions.astype(object)
out = group.evaluate(predictions)
out
[0.75, 0.1935483870967742, 0.993231386312359, 0.3076923076923077]
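If the first two entries are precision and recall — my reading of the evaluate output, not something documented here — the final entry checks out as their harmonic mean, i.e. the F1 score quoted below:

```python
precision = 0.75               # assumed: first entry of the output above
recall = 0.1935483870967742    # assumed: second entry of the output above

f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
# f1 -> 0.30769..., matching the final entry of the output above
```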
The F-score (0.31) is not great, but may be acceptable for such an imbalanced dataset. Based on my knowledge of the biological problem, false positives are preferable to false negatives. In other words, it's okay to sacrifice some precision for improved recall — a preference the F-score, which weights precision and recall equally, does not capture.
Another approach that may be useful is reducing the dimensionality of the sequence representation by grouping residues into the commonly used amino acid side-chain functional classes:
- Cationic side chains
- Anionic side chains
- Polar uncharged side chains
- Hydrophobic side chains
- Special cases
As seen here: https://www.technologynetworks.com/applied-sciences/articles/essential-amino-acids-chart-abbreviations-and-structure-324357
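A minimal sketch of that reduced alphabet, mapping each of the 20 standard residues to one of the five classes listed above (the exact grouping of borderline residues such as glycine and cysteine varies between sources):

```python
SIDE_CHAIN_CLASS = {
    **{aa: "cationic" for aa in "RHK"},          # arginine, histidine, lysine
    **{aa: "anionic" for aa in "DE"},            # aspartate, glutamate
    **{aa: "polar" for aa in "STNQ"},            # polar uncharged
    **{aa: "hydrophobic" for aa in "AVILMFYW"},  # nonpolar side chains
    **{aa: "special" for aa in "CGP"},           # cysteine, glycine, proline
}

def reduce_sequence(seq):
    """Collapse a peptide to its side-chain functional classes."""
    return [SIDE_CHAIN_CLASS[aa] for aa in seq]

classes = reduce_sequence("ASFAEYWNLLSAK")
# e.g. classes[0] is 'hydrophobic' (alanine), classes[-1] is 'cationic' (lysine)
```

Feeding a model the class-level composition instead of (or alongside) the full 9,564-column protpy feature set would shrink the feature space dramatically, which may help given only a few dozen MDM2 rows.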