In silico characterization of class II plant defensins from Arabidopsis thaliana

Until now, all of our posts were basically about the development of machine learning models for the prediction of antimicrobial peptides. However, there is more to explore than machine learning. In our latest paper, published in Phytochemistry, we characterized two defensins from Arabidopsis thaliana. Despite being an in silico study, it is closer to biology than to informatics.

Being a model plant, A. thaliana has an array of resources available on the web; despite that, our paper shows there is more to be discovered about this plant. It has more than 300 defensin genes described. Defensins are small proteins involved in plant defense against biotic and abiotic stresses.

Finding new information in this context would be unexpected. However, we found two class II defensins, which could help in understanding the evolution and distribution of defensins among flowering plants.

The web resources for A. thaliana played a critical role in this study. By applying a classical strategy for identifying cysteine-rich peptides to the A. thaliana predicted proteome, we found those two defensins, but a number of questions arose from that, including their tissue of expression. However, this information is sometimes inaccessible, depending on the tissue of expression, the need for a specific stimulus, or even the amount of protein or RNA produced.

Fortunately, there is a high-resolution transcript map for A. thaliana, where we could identify the expression of both defensins in flowers, ovules and seeds. This is interesting because the other known class II defensins are expressed in a similar context: in flowers for solanaceous species and in seeds for poaceous species.

In addition, given the evolutionary distance among the Brassicaceae, Solanaceae and Poaceae families, class II defensins could be spread among all flowering plants. We do not know the function of the A. thaliana class II defensins, but those from solanaceous and poaceous species have antimicrobial function. Do the A. thaliana class II defensins have the same function?

Well, we do not know the actual function, but the predicted structures seem to be very similar to those of classical plant defensins. In addition, the genes that encode these defensins in A. thaliana seem to be the result of a duplication event, because they are neighbors and their sequences share ~70% identity.

In fact, despite this being an in silico study, a number of hypotheses emerged, which reminds me of an article by Markowetz in PLOS Biology, “All biology is computational biology” (https://doi.org/10.1371/journal.pbio.2002050). The application of computational methods to study the structure, function and evolution of proteins is a very exciting field, and this article is a good example of how modern biology is done.

Quality assessment:
Originality: ☆☆☆☆✭
Rigor: ☆☆☆☆✭
Significance to the field: ☆☆☆☆☆
Interest to general audience: ☆☆☆☆✭
Quality of writing: ☆☆☆☆✭
Overall quality of the study: ☆☆☆☆✭

Reference:
Costa et al. (2020) In silico characterization of class II plant defensins from Arabidopsis thaliana. Phytochemistry, vol 179, 112511. https://doi.org/10.1016/j.phytochem.2020.112511

#PrePrintFeedback: “AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens”

Main Findings

The preprint by Li et al. describes a new deep learning model for the prediction of antimicrobial peptides and its application to identify these peptides in the bullfrog genome.

Strengths

  • Deep learning is a hot topic in the machine learning field
  • AMPlify outperforms the other methods in the benchmark
  • AMP Scanner was retrained with AMPlify data sets
  • Careful selection of non-AMP sequences
  • Application in a real world scenario (screening the bullfrog genome)
  • Antimicrobial activity determined for pathogens from WHO priority list

Limitations

  • The section “Hyperparameter tuning and model architecture” is not biologist-friendly
  • Loose (https://doi.org/10.1038/nature05233) and Nagarajan (https://doi.org/10.3390/data4010027) datasets were not used as external validation data sets
  • The benchmarking lacks classical prediction systems (e.g. AntiBP2 and CAMP)
  • The problem of shuffled peptides was not addressed
  • The preprint lacks a pipeline flowchart figure
  • The web server was not implemented
  • The peptide screening did not include peptides predicted as non-AMPs

Comments

In the field of antimicrobial activity prediction, there are some classical problems that have not been overcome in more than ten years of research. The first one is the absence of a data set of non-antimicrobial peptides. It seems that we have simply accepted the use of Swiss-Prot sequences without the ‘antimicrobial’ annotation to create this data set. Li et al. were more rigorous with this data, which could help to explain AMPlify’s best performance in the benchmark.

The second problem is related to the descriptors, which are not necessary when using deep learning. However, the key problem of shuffled peptides (https://doi.org/10.1016/j.jtbi.2017.05.011) was not addressed by the authors, and this problem could explain some of their results in the bullfrog genome screening.

Of the eleven predicted sequences, only four demonstrated antimicrobial activity, resulting in a probability of correct prediction of positive peptides of 0.36 (4/11). In fact, the eleven peptides have characteristics of AMPs; however, because the shuffled-peptide problem was not addressed, we don’t know whether these results could be due to compositional bias. In addition, as the authors themselves stated, “the size of the training data is still small relative to the data typically employed in most deep learning applications”.
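The 0.36 figure is simply the fraction of predicted AMPs that were confirmed in the assay (4 out of 11), often called precision or positive predictive value. A minimal sketch of that arithmetic follows; the function name is ours, chosen for illustration, and is not taken from the preprint.

```python
def positive_predictive_value(confirmed_active: int, predicted_active: int) -> float:
    """Fraction of predicted positives that were experimentally confirmed."""
    if predicted_active <= 0:
        raise ValueError("at least one predicted positive is required")
    if not 0 <= confirmed_active <= predicted_active:
        raise ValueError("confirmed positives cannot exceed predicted positives")
    return confirmed_active / predicted_active


# Counts reported in the text: 11 predicted peptides, 4 with measured activity.
ppv = positive_predictive_value(4, 11)
print(f"{ppv:.2f}")  # prints 0.36
```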

An interesting feature is that they retrained AMP Scanner with their own data, allowing a comparison between the algorithms rather than the systems. This reinforces what other manuscripts have shown: regardless of the algorithm, if the systems are trained with similar data, the outcome is similar. AMPlify slightly outperformed AMP Scanner (~5%), but both systems showed statistics higher than 90%.

Besides, AMP Scanner is not the only deep learning predictor available on the web. There is another system that would be interesting to compare, AxPEP (https://doi.org/10.1016/j.omtn.2020.05.006).

Regarding the antimicrobial screening of the bullfrog genome, I checked the peptide molecular masses using ProtParam, and they didn’t match. It is not clear whether some modifications were made to the peptides. Also, the peptides presented a Rana box motif (https://doi.org/10.3389/fmicb.2018.02846), but it was not clear whether they were synthesized with or without the disulfide bridge.

There is a very specific point that should be highlighted. In the discussion, the authors stated that “it has the potential to play a role in de novo AMP design or enhancement”. However, designed peptides are quite similar to AMPs, yet a number of them are inactive. AMPlify should therefore not be used for such a purpose, mainly because the Loose data set was not included in the system assessments.

Quality assessment:
Originality: ☆☆☆☆✭
Rigor: ☆☆☆✭✭
Significance to the field: ☆☆☆✭✭
Interest to general audience: ☆☆☆✭✭
Quality of writing: ☆☆☆☆✭
Overall quality of the study: ☆☆☆✭✭

Reference:

Li et al. 2020. AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens. BioRxiv. (Version 1). doi: https://doi.org/10.1101/2020.06.16.155705

An SVM model based on physicochemical properties to predict antimicrobial activity from protein sequences with cysteine knot motifs

Ten years ago, I published my first manuscript as first author, a manuscript about antimicrobial activity prediction, which would become one of the pillars of Porto Reports. Because of its inaugural character, I chose that article to be the subject of the first post on Porto Reports Legacy.

“An SVM model based on physicochemical properties to predict antimicrobial activity from protein sequences with cysteine knot motifs” is far from being a perfect manuscript, but it has some strengths, including an innovative strategy for antimicrobial activity prediction. In fact, the innovation was the main strength that made it worth publishing. To this day, I use this work to teach what to do and what not to do in a scientific publication.

Briefly, the manuscript describes the construction of an antimicrobial activity prediction system using support vector machine as the machine learning algorithm and physicochemical properties as the sequence descriptors. The system reached a good accuracy (~80%) using the polynomial kernel.

The main limitations were (i) the weak English, in particular the incorrect use of verb tenses in several manuscript sections; and (ii) the lack of context for the computational problem, which makes the manuscript hard to understand for scientists from the biological sciences.

Nevertheless, I need to talk about the strengths! To create something innovative, we need to be creative, and in 2010 there was a wide field to explore in this topic. In fact, until then there was only one manuscript related to the prediction of antimicrobial activity. Under these conditions, the idea does not need to be bright; it just needs to be different.

The difference was not in the algorithm itself, but in how the machine learning algorithm was trained, because the first manuscript had demonstrated that there are only slight differences in the predictive power of different algorithms under the same training schemes. Thus, we used physicochemical properties to train the system, reaching a good accuracy. There were some limitations to our technique, but they were well described and properly addressed, which is always a strength.

This system was the precursor of CS-AMPPred, and due to some errors in the choice of methods, including the support vector machine engine, the original system did not reach its actual potential.

Quality assessment:

Originality: ☆☆☆☆✭
Rigor: ☆☆☆✭✭
Significance to the field: ☆☆☆☆✭
Interest to general audience: ☆☆☆✭✭
Quality of writing: ☆☆✭✭✭
Overall quality of the study: ☆☆☆✭✭

Reference:
Porto W.F., Fernandes F.C., Franco O.L. (2010) An SVM Model Based on Physicochemical Properties to Predict Antimicrobial Activity from Protein Sequences with Cysteine Knot Motifs. In: Ferreira C.E., Miyano S., Stadler P.F. (eds) Advances in Bioinformatics and Computational Biology. BSB 2010. Lecture Notes in Computer Science, vol 6268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15060-9_6

How to use our RESTful APIs

We offer RESTful APIs for some of our services. Currently, however, only Sense the Moment has this feature enabled.

Using the GET method, you just need to append the sequence after the last slash:

http://portoreports.com/api/stm/<SEQUENCE>

For instance, if you want to run the sequence “WILLIAMFARIASP”, you need to request http://portoreports.com/api/stm/WILLIAMFARIASP. You should get the following JSON as the result:

{"success":1, "1":{"score":0.63485344605918}}
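A GET call like the one above can be sketched in Python with the standard library only. This is a minimal sketch, not an official client: the function names are ours, and treating any "success" value other than 1 as an error is an assumption, since the failure format is not documented here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://portoreports.com/api/stm/"  # endpoint given in the text


def build_url(sequence: str) -> str:
    """Append the URL-quoted sequence after the last slash of the endpoint."""
    return BASE_URL + quote(sequence.strip(), safe="")


def parse_scores(payload: str) -> dict:
    """Convert the JSON reply into {entry_number: score}."""
    data = json.loads(payload)
    if data.get("success") != 1:
        # Assumption: anything other than 1 signals a failed request.
        raise ValueError("the service did not report success")
    # Scores may arrive as numbers or as strings, so float() handles both.
    return {int(key): float(entry["score"])
            for key, entry in data.items() if key != "success"}


def score_sequence(sequence: str, timeout: float = 30.0) -> float:
    """Send one GET request and return the score of the single entry."""
    with urlopen(build_url(sequence), timeout=timeout) as response:
        return parse_scores(response.read().decode("utf-8"))[1]
```

Calling score_sequence("WILLIAMFARIASP") would send the request shown above; parse_scores can also be tried offline on the example reply.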

Using the POST method, you need to send the FASTA file in the form field named ‘file’, and then you should get a JSON with multiple entries:

{"success":1,"1":{"score":"0.18667964985098"},"2":{"score":"0.63485344605918"}}

If you want to submit multiple sequences using the RESTful API, we recommend the POST method: a burst of individual GET requests could look like an unintentional SYN flood, and freewha could block your IP address.
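A POST upload can be sketched in Python with the standard library by building the multipart/form-data body by hand. Several things here are assumptions: the text does not state the POST URL, so the GET endpoint without a sequence is used as a placeholder; the file name and content type are arbitrary; and the function names are ours.

```python
import json
import uuid
from urllib.request import Request, urlopen

# Assumption: the text does not give the POST URL; this is a placeholder.
POST_URL = "http://portoreports.com/api/stm/"


def encode_fasta_upload(fasta_text: str, filename: str = "sequences.fasta"):
    """Build a multipart/form-data body with the FASTA text in the 'file' field."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: text/plain\r\n"
        "\r\n"
        f"{fasta_text}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def scores_from_reply(payload: str) -> dict:
    """Convert the JSON reply into {entry_number: score}."""
    data = json.loads(payload)
    # In the example reply the scores are strings, so float() converts them.
    return {int(key): float(entry["score"])
            for key, entry in data.items() if key != "success"}


def score_fasta(fasta_text: str, timeout: float = 60.0) -> dict:
    """Upload all sequences in a single POST request and return their scores."""
    body, content_type = encode_fasta_upload(fasta_text)
    request = Request(POST_URL, data=body,
                      headers={"Content-Type": content_type}, method="POST")
    with urlopen(request, timeout=timeout) as response:
        return scores_from_reply(response.read().decode("utf-8"))
```

Sending every sequence in one request keeps the number of connections low, which is the reason POST is recommended for batches.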