Machine Learning – Porto Reports

Main Findings

The preprint by Li et al. describes a new deep learning model for prediction of antimicrobial peptides and its applications to identify these peptides on bullfrog genome.

Strengths

Deep learning application is a hot topic in machine learning area
AMPlify outperforms the methods in the benchmarking
AMP Scanner was retrained with AMPlify data sets
Careful selection of non-AMP sequences
Application in a real world scenario (screening the bullfrog genome)
Antimicrobial activity determined for pathogens from WHO priority list

Limitations

Section Hyperparameter tuning and model architecture is not biologist-friendly
Loose (https://doi.org/10.1038/nature05233) and Nagarajan (https://doi.org/10.3390/data4010027) datasets were not used as external validation data sets
The benchmarking lacks classical prediction systems (e.g. AntiBP2 and CAMP)
The problem of shuffled peptides was not addressed
The preprint lacks a pipeline flowchart figure
The web server was not implemented
The peptide screening did not include peptides predicted as non-AMPs

Comments

In the field of antimicrobial activity prediction, there are some classical problems that were not overcome in more than ten years of research. The first one is the absence of a non-antimicrobial peptides data set. It seems that we just accepted the use of sequences from Swissprot without the ‘antimicrobial’ annotation to create this data set. Li et al. were more rigorous with this data, which could help to explain AMPlify best performance on the benchmarking.

The second problem is related to the descriptors, which are not necessary when using deeplearning. However, the key problem of shuffled peptides (https://doi.org/10.1016/j.jtbi.2017.05.011) was not addressed by the authors. And this problem could explain some of their results in the bullfrog genome screening.

From the eleven predicted sequences, only four demonstrated antimicrobial activity, resulting in a probability of correct prediction of positive peptides of 0.36. In fact, the eleven peptides have characteristics of AMPs, however, because the shuffled problem was not addressed, we don’t know if these results could be due to the compositional bias. In addition as the authors themselves stated “the size of the training data is still small relative to the data typically employed in most deep learning applications”.

An interesting feature is that they retrained the AMP Scanner with their own data, allowing the comparison between the algorithms, not the systems. This reinforces what other manuscripts have shown, regardless the algorithm, if the system is trained with similar data, the outcome is similar. Because AMPlify has a slightly outperformed AMP Scanner (~5%), but both systems showed statistics higher than 90%.

Besides, AMP Scanner is not the only deep learning predictor available on the web, there is another system which would be interesting to compare, AxPEP (https://doi.org/10.1016/j.omtn.2020.05.006).

Regarding the antimicrobial screening on bullfrog genome, I checked the peptide molecular masses using protparam, and they didn’t match. It is not clear whether some modifications were made on peptides. Also, the peptides presented a rana box motif (https://doi.org/10.3389/fmicb.2018.02846), but it was not clear wheter they were synthesized with or without the disulfide bridge.

There is a very specific point that should be highlighted. In discussion the authors stated “it has the potential to play a role in de novo AMP design or enhancement”, well, considering that designed peptides are quite similar to AMPs, but a number of them are inactive, AMPlify should not be used for such purpose, mainly because the Loose data set was not included in the system assessments.

Quality assessment:
Originality: ☆☆☆☆✭
Rigor: ☆☆☆✭✭
Significance to the field: ☆☆☆✭✭
Interest to general audience: ☆☆☆✭✭
Quality of writing: ☆☆☆☆✭
Overall quality of the study: ☆☆☆✭✭

Reference:

Li et al. 2020. AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens. BioRxiv. (Version 1). doi: https://doi.org/10.1101/2020.06.16.155705

Ten years ago, I was publishing my first manuscript as 1st author, a manuscript about antimicrobial activity prediction, which would be one of the pillars of Porto Reports. Because of its inaugural character, I choose that article to be subject of the first post on Porto Reports Legacy.

“An SVM model based on physicochemical properties to predict antimicrobial activity from protein sequences with cysteine knot motifs”, this manuscript is far from being a perfect manuscript, but it has some strengths, including an innovative strategy for antimicrobial activity prediction. In fact, the innovation was the main strength for worthing the publication. And by today, I use this work to teach what to do and what not to do in an scientific publication.

Briefly, the manuscript describes the construction of an antimicrobial activity prediction system using support vector machine as the machine learning algorithm and physicochemical properties as the sequence descriptors. The system reached a good accuracy (~80%) using the polynomial kernel.

The main limitations were (i) the weak English, and in fact, the incorrect use of verb tenses in several manuscript sections; and (ii) the non-contextualization of the computational problem, which makes the manuscript hard to understand for scientists from biological sciences.

Nevertheless, I need to talk about the strengths! To create something innovative, we need to be creative and in 2010 there was a wide field to explore in this topic. In fact, there was only one manuscript until then related to prediction of antimicrobial activity. Thus in this condition, the idea does not need to be bright, it just need to be different.

The difference was not in the algorithm itself, but on how to train the machine learning algorithm. Because the first manuscript demonstrated that there are only slight differences in the predictive power of different algorithms with the same training schemes. Thus, we used physicochemical properties to train the system, reaching a good accuracy. However, there were some limitations on our technique, that were well described and properly addressed, which is always a strength.

This system was the precursor of CS-AMPPred, and due to some errors in the choice of methods, including the support vector machine engine, the original system did not reach its actual potential.

Quality assessment:

Originality: ☆☆☆☆✭
Rigor: ☆☆☆✭✭
Significance to the field: ☆☆☆☆✭
Interest to general audience: ☆☆☆✭✭
Quality of writing: ☆☆✭✭✭
Overall quality of the study: ☆☆☆✭✭

Reference:
Porto W.F., Fernandes F.C., Franco O.L. (2010) An SVM Model Based on Physicochemical Properties to Predict Antimicrobial Activity from Protein Sequences with Cysteine Knot Motifs. In: Ferreira C.E., Miyano S., Stadler P.F. (eds) Advances in Bioinformatics and Computational Biology. BSB 2010. Lecture Notes in Computer Science, vol 6268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15060-9_6

Tag: Machine Learning

#PrePrintFeedback: “AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens”

An SVM model based on physicochemical properties to predict antimicrobial activity from protein sequences with cysteine knot motifs