Minería de Patrones Secuenciales aplicada a la Predicción del Plegamiento de Proteínas (Sequential Pattern Mining Applied to Protein Folding Prediction)

Published in: Industry, Innovation, and Infrastructure for Sustainable Cities and Communities: Proceedings of the 17th LACCEI International Multi-Conference for Engineering, Education and Technology
Date of Conference: July 24-26, 2019
Location of Conference: Montego Bay, Jamaica
Authors: Julio Quintana-Zaez (Universidad de Ciego de Ávila, CU)
Hector R. Velarde-Bedregal (Universidad Católica de Santa María, PE)
Guillermo E. Calderón-Ruiz (Universidad Católica de Santa María, PE)
Cosme E. Santisteban-Toca (Instituto Tecnológico de Ciudad Cuauhtémoc, MX)
(Universidad de Ciego de Ávila)
Full Paper: #37

Abstract:

Sequence mining consists of finding statistically relevant patterns in collections of sequentially represented data. Sequences are an important type of data in which the order of the elements matters, and they have a wide range of applications in bioinformatics and computational biology. Protein structure prediction is one of these applications: a protein is a sequence of amino acids that forms patterns known as alpha helices, beta sheets, and turns. For the purposes of our investigation, these secondary structures are the itemsets, while the amino acids that make up the sequence are the items. Despite multiple attempts to predict protein folding, the algorithms developed to date reach an effectiveness of only 35%. We therefore propose SPMCcm, an algorithm based on frequent sequences and a scheme of classifiers, which uses the information provided by the amino acid sequence in two stages. The first stage learns the interactions between the secondary structures of the proteins, which it extracts as frequent sequences (itemsets). The second stage learns the interactions between the amino acids (items) present in the interacting structures. The experimental evaluation showed that SPMCcm behaves similarly regardless of the base classifier used, reaching prediction accuracies of up to 48%, higher than the 35% reported in the literature, without requiring large computational resources and while retaining explanatory capacity.
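
To make the itemset/item framing concrete, the following is a minimal Python sketch of frequent-sequence counting over secondary-structure labels. It is not the SPMCcm algorithm itself, whose details the abstract does not give. The toy proteins, the labels H/E/T, the function name frequent_structure_patterns, and the parameters min_support and max_len are all assumptions made for illustration, and the sketch only considers contiguous patterns.

```python
from collections import Counter

# Hypothetical toy data (not real proteins): each protein is a list of
# (structure_label, residues) segments. The segment plays the role of an
# itemset and its residues are the items.
# H = alpha helix, E = beta sheet, T = turn.
PROTEINS = [
    [("H", "AEELLK"), ("T", "GN"), ("E", "VIVT")],
    [("H", "MEELLR"), ("T", "GD"), ("E", "VKVY"), ("T", "PG")],
    [("E", "TVIV"), ("T", "NG"), ("H", "AELLKK")],
]


def frequent_structure_patterns(proteins, min_support, max_len=3):
    """Return contiguous label patterns present in >= min_support proteins.

    Support is the number of proteins that contain the pattern at least once.
    """
    support = Counter()
    for segments in proteins:
        labels = [label for label, _ in segments]
        seen = set()  # count each pattern at most once per protein
        for n in range(1, max_len + 1):
            for i in range(len(labels) - n + 1):
                seen.add(tuple(labels[i:i + n]))
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}


patterns = frequent_structure_patterns(PROTEINS, min_support=2)
for pattern, count in sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0])):
    print("-".join(pattern), count)
```

In this sketch, a pattern such as helix-turn-sheet is kept because it appears in two of the three toy proteins. A second stage in the spirit of the abstract could then take the residues of the segments matching each frequent pattern as input to a base classifier; that step is left out here because the abstract does not specify it.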