Predicting peptide-protein binding with a random forest

Intro

TDC-2 introduces the Protein-Peptide Binding Affinity prediction task. It is a predictive, non-generative task: learn a model that estimates a function of a protein, a peptide, the antigen processing pathway, biological context, and interaction features, and outputs either a binding affinity value or a binary label indicating strong or weak binding. The binary label can also fold in additional biomarkers, for example assigning a positive label if and only if the binding interaction is specific. To account for biomarkers beyond the binding affinity value, the task here is specified with a binary label. TDC-2 provides datasets and benchmarks for a generalized protein-peptide binding interaction prediction task and for a TCR-Epitope binding interaction prediction task.

X: protein, peptide, antigen processing pathway, biological context, and interaction features

y: binding affinity value or binary label indicating strong or weak binding

  1. Find data sources
  2. Explore and visualize the data
  3. Clean the data
  4. Engineer features
  5. Gather additional data
  6. Train models
  7. Deploy the best model

Find data sources

The data comes from the Therapeutics Data Commons (TDC): https://tdcommons.ai/benchmark/proteinpeptide_group/overview/

from tdc.benchmark_group.protein_peptide_group import ProteinPeptideGroup
group = ProteinPeptideGroup()
train, val = group.get_train_valid_split() # the val split is empty; use the train set if fine-tuning is desired
test = group.get_test()
X_train, y_train = train[0], train[1]
X_test, y_test = test[0], test[1]
Found local copy...
Loading...
Done!

Explore and visualize the data

X_train.head()
      Sequence       Protein Target_ID       ...  Reference                         ...
2165  FADMPDYAGNVGK  Anti-HA antibody 12ca5  ...  DOI:10.26434/chemrxiv-2023-tws4n  ...
1207  GEEFHDYRDYADK  Anti-HA antibody 12ca5  ...  DOI:10.26434/chemrxiv-2023-tws4n  ...
215   ASFAEYWNLLSAK  MDM2                    ...  DOI:10.1038/s42004-022-00737-w    ...
4156  FLNDKEDYSSAQK  Anti-HA antibody 12ca5  ...  DOI:10.26434/chemrxiv-2023-tws4n  ...
3630  WMMPMDAKDYAEK  Anti-HA antibody 12ca5  ...  DOI:10.26434/chemrxiv-2023-tws4n  ...

5 rows × 26 columns

The X data are amino acid sequences, the binding targets they were measured against, and some information about the source of the data.

y_train.value_counts()
Y
0    440
1      3
Name: count, dtype: int64

The y data are binary: bind or doesn’t bind. Only 3 of 443 sequence-target pairs result in binding.
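For scale, the normalized counts make the imbalance explicit; roughly 99.3% of pairs are non-binders. A quick pandas check (the numbers below are implied by the counts above):

y_train.value_counts(normalize=True)
# 0    0.993
# 1    0.007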

Feature engineering

I found a package, protpy, that extracts biochemical features from amino acid sequences; I'm going to try it:

https://github.com/amckenna41/protpy

import protpy
import pandas as pd
function_list = [protpy.amino_acid_composition, 
                 protpy.dipeptide_composition,
                 protpy.tripeptide_composition,
                 protpy.moreaubroto_autocorrelation,
                 #using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True
                 protpy.moran_autocorrelation,
                 #using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True
                 protpy.geary_autocorrelation,
                 #using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True,
                 protpy.conjoint_triad,
                 #protpy.sequence_order_coupling_number_, ### <--------- this returns a float; I will handle it separately below
                 protpy.sequence_order_coupling_number,
                 #using default parameters: lag=30, distance_matrix="schneider-wrede"
                 protpy.quasi_sequence_order,
                 #using default parameters: lag=30, weight=0.1, distance_matrix="schneider-wrede"
                ]

# Keep only the peptide sequence (X1) and its protein target
X_train = X_train[['X1','Protein Target_ID']]
X_train = X_train.reset_index()

# Each protpy function returns a one-row DataFrame of features for a single
# sequence; stack the per-sequence rows and join them on as new columns.
for function in function_list:
    df_list = []
    for sequence in X_train['X1']:
        df_list.append(function(sequence))
    tempdf = pd.concat(df_list).reset_index(drop=True)
    X_train = X_train.join(tempdf)

# sequence_order_coupling_number_ returns a float rather than a DataFrame
X_train['SOCN'] = X_train['X1'].apply(protpy.sequence_order_coupling_number_)

#There's probably a faster way to do this, but this way works - I had trouble applying protpy functions to entire dataframes in parallel. 
X_train.head()
   index  X1             Protein Target_ID       A       C    D       E       F       G       H      ...  QSO_SW50  SOCN
0   2165  FADMPDYAGNVGK  Anti-HA antibody 12ca5  15.385  0.0  15.385   0.000   7.692  15.385  0.000  ...       0.0  4.913
1   1207  GEEFHDYRDYADK  Anti-HA antibody 12ca5   7.692  0.0  23.077  15.385   7.692   7.692  7.692  ...       0.0  5.913
2    215  ASFAEYWNLLSAK  MDM2                    23.077  0.0   0.000   7.692   7.692   0.000  0.000  ...       0.0  3.860
3   4156  FLNDKEDYSSAQK  Anti-HA antibody 12ca5   7.692  0.0  15.385   7.692   7.692   0.000  0.000  ...       0.0  3.024
4   3630  WMMPMDAKDYAEK  Anti-HA antibody 12ca5  15.385  0.0  15.385   7.692   0.000   0.000  0.000  ...       0.0  4.325

5 rows × 9567 columns
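The same featurization has to run again later for the test set, so it's worth wrapping in a helper. A minimal sketch, reusing the function_list and imports above (the featurize name is mine):

def featurize(df):
    """Compute protpy features for the peptide sequences in df['X1']."""
    df = df[['X1', 'Protein Target_ID']].reset_index()
    for function in function_list:
        # One one-row DataFrame per sequence, stacked and joined as columns
        rows = [function(seq) for seq in df['X1']]
        df = df.join(pd.concat(rows).reset_index(drop=True))
    df['SOCN'] = df['X1'].apply(protpy.sequence_order_coupling_number_)
    return df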

y_train = y_train.reset_index(drop=True)
y_train[y_train == '1']  # the labels are stored as strings, hence '1'
155    1
391    1
412    1
Name: Y, dtype: object
X_train.iloc[[155, 391, 412]]
     index  X1             Protein Target_ID  A       C    D    E      F      G    H    ...  QSO_SW50  SOCN
155     17  TSFAAYWAALSAK  MDM2               38.462  0.0  0.0  0.000  7.692  0.0  0.0  ...       0.0  2.641
391      1  ASFAAYWNLLSP   MDM2               25.000  0.0  0.0  0.000  8.333  0.0  0.0  ...       0.0  2.009
412      7  TSFAEYANLLAP   MDM2               25.000  0.0  0.0  8.333  8.333  0.0  0.0  ...       0.0  3.073

3 rows × 9567 columns

All 3 positive data points are against the protein target MDM2. I'm going to work with only the MDM2 subset of the data, as a down-sampling technique that removes uninformative negative examples.

# Positional indices line up because both X_train and y_train were reset above
MDM2_list = X_train.index[X_train['Protein Target_ID']=='MDM2'].tolist()
X_train_MDM2 = X_train[X_train['Protein Target_ID']=='MDM2']
y_train_MDM2 = y_train.iloc[MDM2_list]
# Also drop the index, sequence, and Protein Target ID columns - we don't need them anymore
X_train_MDM2 = X_train_MDM2.drop(columns=['index','X1','Protein Target_ID'])
X_train_MDM2.head()
    A       C    D    E      F      G    H      I    K      L       ...  QSO_SW50  SOCN
2   23.077  0.0  0.0  7.692  7.692  0.0  0.000  0.0  7.692  15.385  ...       0.0  3.860
15  38.462  0.0  0.0  7.692  7.692  0.0  0.000  0.0  7.692   7.692  ...       0.0  3.662
24  53.846  0.0  0.0  0.000  7.692  0.0  0.000  0.0  7.692  15.385  ...       0.0  2.815
49   7.692  0.0  0.0  7.692  7.692  0.0  7.692  0.0  7.692   7.692  ...       0.0  3.957
57  38.462  0.0  0.0  7.692  7.692  0.0  0.000  0.0  7.692   7.692  ...       0.0  3.586

5 rows × 9564 columns

Adding SMOTE oversampling

I also want to do some strategic over-sampling to boost the positive signal.

import imblearn.over_sampling

# k_neighbors must be smaller than the number of minority-class samples;
# with only 3 positives, k_neighbors=2 is the most SMOTE can use.
sm = imblearn.over_sampling.SMOTE(random_state=42, k_neighbors=2)
X_res, y_res = sm.fit_resample(X_train_MDM2, y_train_MDM2)
X_res
     A          C    D    E         F         G    H      I    K         L          ...  QSO_SW50  SOCN
0    23.077000  0.0  0.0  7.692000  7.692000  0.0  0.000  0.0  7.692000  15.385000  ...       0.0  3.860000
1    38.462000  0.0  0.0  7.692000  7.692000  0.0  0.000  0.0  7.692000   7.692000  ...       0.0  3.662000
2    53.846000  0.0  0.0  0.000000  7.692000  0.0  0.000  0.0  7.692000  15.385000  ...       0.0  2.815000
3     7.692000  0.0  0.0  7.692000  7.692000  0.0  7.692  0.0  7.692000   7.692000  ...       0.0  3.957000
4    38.462000  0.0  0.0  7.692000  7.692000  0.0  0.000  0.0  7.692000   7.692000  ...       0.0  3.586000
..   ...        ...  ...  ...       ...       ...  ...    ...  ...       ...        ...       ...  ...
57   37.999062  0.0  0.0  0.286560  7.714043  0.0  0.000  0.0  7.427483   8.000637  ...       0.0  2.655856
58   25.000000  0.0  0.0  0.755633  8.333000  0.0  0.000  0.0  0.000000  16.667000  ...       0.0  2.105483
59   28.483696  0.0  0.0  6.176586  8.167122  0.0  0.000  0.0  1.990536  14.344450  ...       0.0  2.961207
60   25.000000  0.0  0.0  5.520798  8.333000  0.0  0.000  0.0  0.000000  16.667000  ...       0.0  2.713924
61   34.265745  0.0  0.0  0.000000  7.891807  0.0  0.000  0.0  5.294318  10.489607  ...       0.0  2.443999

62 rows × 9564 columns
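A quick sanity check on the labels: SMOTE's default strategy synthesizes minority samples until the classes balance, so the 62 rows above imply the MDM2 subset carried 31 negatives alongside its 3 positives:

y_res.value_counts()
# 0    31
# 1    31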

Train the model

Random forest has worked best for me.

from sklearn.ensemble import RandomForestClassifier

rf_classifier_smote = RandomForestClassifier(random_state=42)
rf_classifier_smote.fit(X_res, y_res)
RandomForestClassifier(random_state=42)
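As an aside, RandomForestClassifier also accepts a class_weight parameter that reweights classes inversely to their frequency, which could be tried instead of (or alongside) SMOTE. A sketch on the un-resampled MDM2 subset:

# 'balanced' weights each class inversely to its frequency in y
rf_weighted = RandomForestClassifier(class_weight='balanced', random_state=42)
rf_weighted.fit(X_train_MDM2, y_train_MDM2)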

Process the test data in the same way as the training data

# Reuse the same function_list defined in the feature-engineering section.
X_test = X_test[['X1','Protein Target_ID']]
X_test = X_test.reset_index()

# Same per-sequence feature extraction as for the training set
for function in function_list:
    df_list = []
    for sequence in X_test['X1']:
        df_list.append(function(sequence))
    tempdf = pd.concat(df_list).reset_index(drop=True)
    X_test = X_test.join(tempdf)

X_test['SOCN'] = X_test['X1'].apply(protpy.sequence_order_coupling_number_)

#There's probably a faster way to do this, but this way works - I had trouble applying protpy functions to entire dataframes in parallel. 
X_test.head()
   index  X1             Protein Target_ID       A       C    D       E      F       G    H       ...  QSO_SW50  SOCN
0   3910  LDLKDYADQRKAK  Anti-HA antibody 12ca5  15.385  0.0  23.077  0.000   0.000  0.0   0.000  ...       0.0  7.133
1   4145  TQHLDKHDYAVYK  Anti-HA antibody 12ca5   7.692  0.0  15.385  0.000   0.000  0.0  15.385  ...       0.0  4.970
2   4277  SKFMFDPRDYAAK  Anti-HA antibody 12ca5  15.385  0.0  15.385  0.000  15.385  0.0   0.000  ...       0.0  6.074
3   3523  EFVHDLKDYAAPK  Anti-HA antibody 12ca5  15.385  0.0  15.385  7.692   7.692  0.0   7.692  ...       0.0  5.177
4   3548  PAFFWDLNDYAFK  Anti-HA antibody 12ca5  15.385  0.0  15.385  0.000  23.077  0.0   0.000  ...       0.0  5.147

5 rows × 9567 columns

X_test = X_test.drop(columns=['index','X1','Protein Target_ID'])
predictions = rf_classifier_smote.predict(X_test)

# The evaluate function requires a very particular structure:
# string labels in an object array
predictions = predictions.astype(str).astype(object)
out = group.evaluate(predictions)
out
[0.75, 0.1935483870967742, 0.993231386312359, 0.3076923076923077]

The F1 score (0.31) is not great, but it may be acceptable for such an imbalanced dataset. The four numbers appear to be precision, recall, accuracy, and F1: with precision 0.75 and recall 0.19, F1 = 2 * 0.75 * 0.19 / (0.75 + 0.19) ≈ 0.31. Based on my knowledge of the biological problem, you'd rather have false positives than false negatives here; in other words, it's okay to sacrifice some precision for improved recall, a preference the balanced F1 score doesn't capture.
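One cheap way to act on that preference, assuming the fitted model above: lower the decision threshold on the forest's predicted probabilities instead of using the 0.5 default baked into predict. A sketch (the 0.3 threshold is arbitrary, for illustration):

# Probability of the positive class ('1') for each test peptide
pos_idx = list(rf_classifier_smote.classes_).index('1')
proba = rf_classifier_smote.predict_proba(X_test)[:, pos_idx]

# A lower threshold flags more peptides as binders: recall up, precision down
predictions_lenient = (proba >= 0.3).astype(int).astype(str).astype(object)
out_lenient = group.evaluate(predictions_lenient)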

Another approach that may be useful is reducing the dimensionality of the sequence representation by mapping residues onto the commonly used amino acid functional groups (see the sketch after this list):

  1. Cationic side chains
  2. Anionic side chains
  3. Polar uncharged side chains
  4. Hydrophobic side chains
  5. Special cases

As seen here: https://www.technologynetworks.com/applied-sciences/articles/essential-amino-acids-chart-abbreviations-and-structure-324357
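A minimal sketch of that idea: map each residue to its functional-group class and compute a five-dimensional composition vector per peptide. The grouping follows the chart linked above; the helper name is mine:

# Standard grouping of amino acids by side-chain chemistry
RESIDUE_CLASSES = {
    'cationic':    set('RHK'),
    'anionic':     set('DE'),
    'polar':       set('STNQ'),     # polar, uncharged
    'hydrophobic': set('AVILMFYW'),
    'special':     set('CUGP'),     # Cys, Sec, Gly, Pro
}

def functional_group_composition(sequence):
    """Fraction of residues falling in each functional-group class."""
    counts = {name: 0 for name in RESIDUE_CLASSES}
    for residue in sequence:
        for name, members in RESIDUE_CLASSES.items():
            if residue in members:
                counts[name] += 1
    return {name: count / len(sequence) for name, count in counts.items()}

functional_group_composition('ASFAEYWNLLSAK')
# {'cationic': 0.077, 'anionic': 0.077, 'polar': 0.231, 'hydrophobic': 0.615, 'special': 0.0}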