- Split View
-
Views
-
Cite
Cite
Arun S. Konagurthu, Lloyd Allison, Peter J. Stuckey, Arthur M. Lesk, Piecewise linear approximation of protein structures using the principle of minimum message length, Bioinformatics, Volume 27, Issue 13, July 2011, Pages i43–i51, https://doi.org/10.1093/bioinformatics/btr240
- Share Icon Share
Abstract
Simple and concise representations of protein-folding patterns provide powerful abstractions for visualizations, comparisons, classifications, searching and aligning structural data. Structures are often abstracted by replacing standard secondary structural features—that is, helices and strands of sheet—by vectors or linear segments. Relying solely on standard secondary structure may result in a significant loss of structural information. Further, traditional methods of simplification crucially depend on the consistency and accuracy of external methods to assign secondary structures to protein coordinate data. Although many methods exist automatically to identify secondary structure, the impreciseness of definitions, along with errors and inconsistencies in experimental structure data, drastically limit their applicability to generate reliable simplified representations, especially for structural comparison.
This article introduces a mathematically rigorous algorithm to delineate protein structure using the elegant statistical and inductive inference framework of minimum message length (MML). Our method generates consistent and statistically robust piecewise linear explanations of protein coordinate data, resulting in a powerful and concise representation of the structure. The delineation is completely independent of the approaches of using hydrogen-bonding patterns or inspecting local substructural geometry that the current methods use. Indeed, as is common with applications of the MML criterion, this method is free of parameters and thresholds, in striking contrast to the existing programs which are often beset by them.
The analysis of results over a large number of proteins suggests that the method produces consistent delineation of structures that encompasses, among others, the segments corresponding to standard secondary structure.
Availability: http://www.csse.monash.edu.au/~karun/pmml.
Contact: arun.konagurthu@monash.edu; lloyd.allison@monesh.edu
1 INTRODUCTION
With the rapid growth in the corpus of known structures, concise representations are increasingly preferred to inspect and analyze protein folding patterns (Abagyan and Maiorov, 1988; Lesk, 1995; Richardson, 1981; Taylor et al., 1983). At the core of this simplification is the idea that proteins contain repetitive substructural elements and that the essence of a fold lies in the assembly and interaction of these elements (Kamat and Lesk, 2007; Konagurthu and Lesk, 2010; Lesk and Chothia, 1980; Lesk, 1995).
The appearance of some of these elements arises from the periodicity in the patterns of hydrogen bonds between backbone nitrogen and carbonyl groups along the protein polypeptide chain. Among the standard secondary structure definitions are: helix (α-helix, π-helix and 310-helix) and strand of sheet (Edsall et al., 1966). Ideally, the spatial trace of α-Carbon (Cα) atoms of standard secondary structure show a linear trend allowing them to be abstracted using vectors or line segments, without much loss of structural information about the fold. The common practice is to fit an axis to a helix and a least-square line to Cα or main chain atoms of strands of sheet (Chothia et al., 1981; Lesk, 1995).
Replacement of secondary structural elements with line segments is therefore one of the common methods to abstract protein structures and construct concise representation of their folding patterns. The number of standard secondary structural elements observed in a protein is typically an order of magnitude smaller than the number of residues in a chain. Therefore methods that utilize concise representations clearly benefit from a massive space and computational saving, especially when comparing and analyzing structures on a large scale (Abagyan and Maiorov, 1988; Konagurthu et al., 2008; Mizuguchi and Go, 1995; Shi et al., 2007).
Methods that abstract protein structure at the level of secondary structure generally rely on external programs that can automatically assign secondary structures to coordinate data. However, accurate identification and assignment of secondary structure is an inexact process (Cuff and Barton, 1999). Although definitions based on hydrogen bonding provides some rigor in assigning secondary structure, the standard definition of what constitutes a hydrogen bond is based on the notion of bond energy whose measurement can be imprecise and acutely sensitive even to small differences in the position of nitrogen and carbonyl atoms, especially the carbonyl oxygen positions. Two popular programs that use hydrogen bonding as a basis for assignment of secondary structure are DSSP (Kabsch and Sander, 1983) and STRIDE (Frishman and Argos, 1995).
On the other hand, secondary structure can be defined using geometric features such as distances and dihedral angles of Cα atoms along the backbone in addition to other local structural features. In fact, there is a direct correlation between patterns of hydrogen bonding and the geometry that arise out of them. However, secondary structural elements can deviate substantially from ideal geometry, therefore posing severe challenges to detect such elements using geometric features alone. Among the methods that rely primarily on geometry to assign secondary structure are (Dupuis et al., 2004; Labesse et al., 1997; Levitt and Greer, 1977; Majumdar et al., 2005; Richards and Kundrot, 1988; Sklenar et al., 1989; Srinivasan and Rose, 1999; Taylor, 2001). (See (Majumdar et al., 2005) for details of popular programs that assign secondary structural elements.)
We note that previous comparative studies have highlighted the difficulties of existing programs to assign consistently secondary structure to coordinate data and have proposed using a ‘consensus’ definition—secondary structure assignment that is at the intersection of all the methods—to arrive at a reliable simplification of protein structures (Colloc'h et al., 1993; Cuff and Barton, 1999).
The main goal for abstracting protein structures must be to achieve maximal economy of description with minimal loss of structural information (Taylor, 2001). However, simplifying structures at the level of standard secondary structure is lossy because the loop regions are ignored. Therefore, a reliable method that achieves the above goal and that is tolerant to measurement error and noise is preferred. Even better would be a method entirely independent of preconceived notions of what substructures are being sought.
Here, we describe a method that generates a principled abstractions of protein structures. Our method uses the rigorous statistical framework of minimum message length (MML). In fact, the realization of the goal to maximize economy and minimize loss of information fits squarely into the MML criterion, making it extremely well-suited for this specific problem. In this work, we treat a protein as an ordered list of Cα coordinates. Our method uses an information-theoretic approach to explain as a line segment the points between any pair of residues in the structure. Each such explanation is encoded in a certain number of bits (or code length). Using these code lengths, a globally optimal explanation is computed which minimizes the total encoded (message) length of the given coordinate data. The code lengths contributing to this minimum message length result in the best piecewise linear approximation of the structure. In a stark contrast to the existing methods, our method is completely free of parameters and thresholds. We emphasize that our method is not a method for delineating secondary structures. However, as expected from such a method, our results show that the line segments generated by this method correspond well with standard secondary structures of proteins.
We note that this article generalizes to three dimensions the work of Banerjee et al. (1996), who described a polygonal approximation method on general two dimensional sequence of points.1 Indeed, it can be shown that our method described in this paper can be generalized to arbitrary dimensions and other types of structural data (over and beyond proteins). We have attempted to keep the notations in this paper consistent with those described in the work of Banerjee et al. (1996) for the convenience of the reader.
Section 2 briefly summarizes the MML framework, followed by Sections 3–6 which describe the mechanics of our approach. Section 7 presents an analysis of the results of our method over a large number of protein structures.
2 THE MINIMUM MESSAGE LENGTH FRAMEWORK
Wallace and Boulton (1968) first proposed the theory of MML, where given a set of competing hypotheses (or models) that can explain some observed data, the MML criterion provides a statistically rigorous framework for selecting the best hypothesis to describe the data. In many ways, MML is a formal information-theoretic realization of the principle of Occam's razor.
MML is best understood through a communication process where a transmitter and a receiver are connected through one of Shannon's communication channels. The objective is that a transmitter must send some data D to the receiver. The transmitter and receiver must have previously agreed on a set of rules (that is, a code book) of communication using common knowledge and prior expectations. If the transmitter can find a good hypothesis, H*, to fit the data, (s)he will be able to transmit the data economically.
In MML, an explanation of the data comes as a two part message: Such a message paradigm ensures complete transparency in communication. That is, any information that is not common knowledge cannot be included except as a part of the message sent by the transmitter. Otherwise, the message sent will be indecipherable by the receiver. There can be no hidden parameters in this framework of communication. In fact, this issue extends to stating and inferring real-valued parameters to an ‘appropriate’ level of precision, which is pertinent to the current problem on hand.
transmit the hypothesis H* taking l(H*) bits, and
transmit the observed data D given H* taking l(D|H*) bits.
The MML framework additionally offers ‘safety’ in that if an inefficient code is used to encode a message, it can only make the hypothesis look less attractive than otherwise. Note that MML yields a natural hypothesis test: the null-model corresponds to transmitting the data raw. If a stated hypothesis takes longer than what is required by a null-model, then clearly such an hypothesis is unacceptable. A more complex hypothesis fits the data better than a simpler model, in general. We see that MML encoding gives a trade-off between hypothesis complexity (l(H)), and its goodness of fit to the data (l(D|H)). Therefore, MML criterion formally justifies and realises Occam's razor.
An important aspect of MML framework is that it is tolerant to measurement accuracy and noise in the underlying data. For a justification of this and a comprehensive study of the principle of MML, refer (Wallace, 2005).
3 FORMULATING THE PROBLEM USING MINIMUM MESSAGE LENGTH
A protein 𝒫={P1,···,Pn} is a sequence2 of n three-dimensional points corresponding to the coordinates (in ℝ3) of Cα atoms along the protein backbone, from its N- to C- terminus.3
Define a piecewise linear approximation of 𝒫 as a subsequence of k≤n points from 𝒫 of the form 𝒬={Q1≡Pi1,…,Qk≡Pik} such that 1=i1<···<ik=n, and the first and last points of 𝒬 are the same as the first and last points of 𝒫 (i.e. Q1=P1 and Qk=Pn).
Given some subsequence 𝒬 of sequence of points 𝒫, the protein can be approximated (or simplified) using line segments drawn between every successive pair of points in the subsequence, Qr′ and Qr′+1, 1≤r′<k. We will use the term delineation to describe this piecewise linear approximation. Further, we will use the term endpoint to describe any point in 𝒬. This is because any pair of consecutive points, Qr′≡Pir′ and Qr′+1≡Pir′+1, form endpoints for abstracting the points between Pir′ and Pir′+1 (inclusive) in the protein with a line segment. Note that a subsequence 𝒬 with k endpoints yields a delineation containing k−1 line segments between successive endpoints.
The goal this article is to find the best delineation of a given set of coordinate data, where the objective to select the best comes from defining the problem using the minimum message length criterion. Consistent to the communication process described in Section 2, the transmitter explains the data in 𝒫 with a hypothesis 𝒬 and sends it as a message whose code length is globally minimum over all possible hypotheses. Receiver will then able to infer the entire data 𝒫 from the received message to a reasonable level of precision using the general rules they have agreed upon as a part of the code book.
For the problem of delineating a structure from coordinate data, the transmitter will send the following two part message (refer Section 2): Therefore, as a part of the codebook, the transmitter and receiver must have agreed upon the encoding of the endpoints in 𝒬 and the encoding of deviations of points 𝒫−𝒬 explained by line segments between successive endpoints in 𝒬. Since the coordinate data of proteins is available at some fixed precision, the transmitter and receiver agree on the specific precision at which the data should be sent. We emphasize that the encoding of the above should allow the receiver to decode the message to the agreed precision.
The first part is the subsequence of points 𝒬 which denotes the delineation of 𝒫. This is equivalent to transmitting the hypothesis 𝒬 in l(𝒬) bits.
The second part will contain the remainder of points in 𝒫 (that is, 𝒫−𝒬) that weren't sent in the first part. In other words, these are the points in 𝒫 that are between the endpoints stated in 𝒬. The statement of these points will be encoded as spatial deviations with respect to the line segments between endpoints. This is equivalent to transmitting the observed data 𝒫 given the hypothesis 𝒬 over l(𝒫|𝒬) bits.
4 CODE LENGTH TO STATE THE DELINEATION AND DATA UNDER MML CRITERION
In this section, we will discuss the statement and transmission of the two part message described in Section 3.
4.1 Encoding the first part of the message
The first part pertains to the transmission of the delineation 𝒬 containing k endpoints. The transmitter must therefore state the number of points k. There are several optimal universal prefix codes available to encode integers. Here, we use an asymptotically optimal Elias omega code which encodes the integral value k in log*k bits4 (Elias, 1975).
Next, the coordinates of all endpoints are to be encoded. Each endpoint is a set of three real numbers of the form (x,y,z). Published protein coordinate data contain three putatively significant figures after the decimal point, in Angstrom Å units. The transmitter can scale this data to one decimal precision and treat the coordinates as integers. Now, an optimal code to send these coordinates is for the transmitter to first send the coordinates of a bounding rectangular box, (xmin,ymin,zmin) and (xmax,ymax,zmax) over all possible values of x,y and z in the given data. Once this bounding box is specified, any (x,y,z) coordinates within the box can be coded in log(xmax−xmin)+log(ymax−ymin)+log(zmax−zmin)=log((xmax−xmin)(ymax−ymin)(zmax−zmin))=log V bits, where V is the volume of the bounding rectangular box. It follows from here that all the k endpoints in 𝒬 can be stated in klog V bits.5
Therefore, the message length to state the first part of the transmission requires log*k+klog V bits.
4.2 Encoding the second part of the message
In the second part, the transmitter has to encode the data, 𝒫−𝒬, between endpoints stated in the first part of the message. For a successive pair of endpoints (Qr′≡Pi,Qr′+1≡Pj), 1≤i<j≤n, 1≤r′≤k in 𝒬, there are j−i−1 intermediate points between Pi and Pj in 𝒫. In this work, these intermediate data points will be treated as noisy samples and will be stated as a set of spatial deviations with respect to the line segment between Pi and Pj.
If such a scheme is used to communicate the second part of the message, for each line segment in 𝒬 between successive endpoints, the second part of the message will encode the following information:
the number of points explained by the line segment.
three spatial deviations for each intermediate point with respect to the line that will allow the receiver to recover the original location of the intermediate point up to a reasonable approximation.
the parameters of the probability distribution associated with each of the three sets of spatial deviations, over all intermediate points.
To explain the encoding of this part more clearly, consider Fig. 1. Let Lij denote the line segment between two successive endpoints in 𝒬, Qr′≡Pi and Qr′+1≡Pj. This line will be used to explain the intermediate points Pi+1···Pj−1∈𝒫. For any intermediate point Pr, i+1≤r≤j−1, define three spatial deviations Δ sr, tr and ur. In the reverse order, ur is the signed distance of Pr to the plane defined by vectors Pj−Pi and z-axis. To define tr, first project Pr to the plane defined above. Call this projection point . Given this projection, tr is the signed perpendicular distance of to the line Lij. Finally, the deviation Δsr is the (unsigned) lateral distance along the line Lij between points of projection of and onto the line (Fig. 1). (Refer the supplementary note containing a discussion on these deviations under arbitrary rotation of the coordinates.) Note that once the endpoints Pi and Pj are specified, and given the sets of spatial deviations Δ sr's, tr's and ur's for the intermediate points Pr, ∀i<r<j, the receiver can entirely recover the coordinates of all intermediate points.
In this work, we assume the three spatial deviations Δ s's, t's and u's of the intermediate points to be independent and normally distributed. Individual variables of each distribution are considered independent and random. (See supplementary note for a discussion on these assumptions.) Given these assumptions we have three distributions of the form: Δs~𝒩(μΔs,σ2Δs), t~𝒩(μt,σ2t), and u~ 𝒩(μu,σ2u), where μ and σ2 are the mean and variance of the respective normal distributions. For the structural coordinate data, we assume that the mean of the distributions of t's and u's is zero: t~𝒩(0, σ2t), and u~𝒩(0, σ2u). Therefore, to communicate the three distribution, the transmitter has to state the following four parameters: (μΔs, σ2Δs, σ2t, σ2u).
Therefore, using Shannon's insight, the optimal code length to describe the entire sets of individual deviations of Δs's for a line Lij will require −log(p(Δsi+1,…,Δsj−1|𝒩(μΔs, σ2Δs))) bits.
Following a similar expansion, we can show that the code lengths for the deviation tr's and ur's are and , respectively.
So far in this second part, we have computed the code lengths required to state intermediate points explained by the line Lij. Note that a delineation of a structure containing k endpoints defines k−1 such line segments. For convenience in notation, assume the endpoints of each line segment is of the form (Pir,Pjr), 1≤r<k. (In practice, for a delineation, Pir of the r-th line segment is equivalent to Pjr−1 of (r−1)th line segment.) Then the total code length of the second part is the sum of the following terms:
∑r=1k−1log* mrij, where mrij=jr−ir−1, representing the total code length to encode the number of intermediate points described by all line segments in the delineation put together.
bits to encode the parameters (three per line segment) corresponding to the distribution of spatial deviations for all lines.
bits to encode Δsr's over all line segments
bits to encode tr's over all line segments
bits to encode ur's over all line segments
4.3 Problem statement
The problem:
Given 𝒫 containing a sequence of n points, find a subsequence 𝒬∈𝒫 containing k≤n points such that the total message length to explain 𝒫 with 𝒬, , is globally minimum.
5 FINDING THE OPTIMAL DELINEATION
This section will describe the procedure to compute the optimal delineation 𝒬* for a given coordinate data. Broadly, the search for the optimal delineation has two steps.
Potentially every pair of points Pi and Pj, 1≤i<j≤n can be a part of the delineation in 𝒬*. (We note here that the segments in the delineation must not overlap, except for successive regions, and those only at their endpoints.) Therefore, we will first build a matrix ℋ=(ℋij)1≤i<j≤n of code lengths for all possible pairs of points in 𝒫.
Then, the matrix ℋ will be used to find a subsequence of points 𝒬* such that the total code length ℋ(𝒬*) of the delineation is minimized, using a one-dimensional dynamic program.
5.1 Computation of code length over all possible segments
Equation (2) expresses the message length ℋij required to describe any line segment Lij between two points Pi and Pj. We will examine the complexity of computing each of the components that constitute Equation (2).
For the n points in 𝒫, there are possible line segments. The log V term in Equation (2) is constant across all possible segments and is computed once while reading the data points of 𝒫. Next, for each line segment, there are three parameters whose code lengths depend on the number of points in between the endpoints. This is trivially computed in constant time as j−i−1.
The relatively complex part is to compute the code lengths of the spatial deviations of the line, Δ s's t's and u's. Each of these three deviations have code lengths that depend on their respective variance, σ2Δs, σ2t and σ2u. While one can compute the variance of each set of deviations from the coordinate data, such a computation is linear in the number of points that each line segment explains. If this näive approach is followed, the computation of ℋ requires O(n3) operations. We will show in the later Section 6 that this is redundant and that the total time required to compute ℋ can be achieved in O(n2) operations, by computing the variances of all three spatial deviations incrementally from previous computations using a set of sufficient statistics. But before that we will describe the method to compute the optimal delineation given the matrix ℋ.
5.2 Optimal delineation as a one-dimensional dynamic program
Dynamic programming is perfectly suited when dealing with problems that contain sequential constraints, where the solutions to the subproblems have a recursive overlapping substructure (Bellman, 1957). The problem statement in Section 4.3 is an ideal candidate for the search strategy of dynamic programming. Since a delineation is a subsequence which preserves the linear ordering of its elements, the optimal delineation of the given data can be derived by computing and memoizing (i.e. caching) the optimal delineation of its subproblems.
We will use the matrix ℋ of code length between all possible endpoints to find the optimal delineation 𝒬* that minimizes ℋ(𝒬*) using a one-dimensional dynamic program.
Using the above relationship, the array C is filled iteratively from 1 to n. Upon completion, the value Cn gives the optimal message length corresponding to the best delineation 𝒬* of 𝒫, where ℋ(𝒬*)≡Cn is globally minimum. The subsequence of endpoints of this optimal delineation can be computed by storing, for each j, the back pointer i<j of the array from which the optimal value Cj was derived. With these back pointers, a simple traceback from Cn (until C1 is reached) gives the set endpoints (in reverse order) that form the best delineation 𝒬*.
6 EFFICIENT COMPUTATION OF MATRIX ℋ
As mentioned in Section 5.1 the matrix of code lengths ℋ can be computed efficiently in O(n2) operations and this section will show how this can be achieved.
For the matrix ℋ to be computable in O(n2) operations, each element ℋij in the matrix should be computable in constant time. However terms σrΔs2, σrt2, and σru2 in Equation (2) cannot be computed in constant time. For a line segment Lij, näively, these three variances take time proportional to the number of points explained by the line to compute, leading to a O(n3) algorithm for computing the matrix ℋ. Below we will show that each of σrΔs2, σrt2, and σru2 can indeed be computed incrementally and in constant time from previous computations resulting in a O(n2) algorithm.
6.1 Constant-time update of σ2Δs's
Consider first these notations: for any vector with direction ratios 〈x,y,z〉, let represents the vector norm of . Let any point Pi∈𝒫 have the direction ratios of the form 〈xi,yi,zi〉.
Now, let Sijxx, Sijyy, Sijzz, Sijxy, Sijyz, Sijxz be a set of variables which we will call here sufficient statistics. These variables are of the form: , where A and B take the values {x,y,z}.
Therefore, using the sufficient statistics the computation of σΔs for a line segment can be computed incrementally in constant time.
6.2 Constant-time update of σ2t's
Let be the normal to a plane defined by , where is the unit vector along z-axis with the direction cosines 〈0,0,1〉. It follows that the direction ratios of are 〈−(yj−yi),(xj−xi), 0〉.
Define as a vector which is normal to the plane . The direction ratios of will be: 〈−(xj−xi)(zj−zi),−(yj−yi)(zj−zi),(xj−xi)2+(yj−yi)2〉 Let represent the direction cosines of , where , and .
6.3 Constant-time update of σ2u's
We have seen above that . Let represent the direction cosines of , where , and . (Note ).
Therefore, the update rules in Equation (4)–(6) allows an efficient computation of the matrix ℋ of code lengths in O(n2) operations.
7 RESULTS
In the previous sections, we have demonstrated an efficient and statistically robust algorithm to simplify a protein structure with piecewise linear segments. We implemented the described algorithm (in C++). Our implementation is available from http://www.csse.monash.edu.au/~karun/pmml/.
We evaluated our method using a non-redundant dataset containing 15 399 protein structures obtained from the protein data bank (Berman et al., 2002). (The non-redundancy here implies that no two structures in this dataset share a sequence identity >65%.) This dataset was culled using the program PISCES (Wang and Dunbrack, 2003). The list of proteins structures in the dataset and the results of their delineation produced by our method can be obtained from the aforementioned link.
Figure 2 gives the distribution of the measure of simplification of structures over the entire dataset. For a structure, the measure of simplification is the ratio of number of line segments identified by the program over the number of residues in the structure. On an average over the entire dataset the delineation size (that is, the number of line segments in the delineation) constitutes 13.85% of the total size of structure (in residues). In addition, the average segment length over the entire dataset is observed to be 8.11 residues. In general, the number of segments is correlated to total size of the protein structure.
It is of considerable interest to evaluate the agreement of standard secondary structural elements—helices and strands of sheets—with the delineation identified by the program. We note that an ideal delineation of a structure must encompass these elements since they are ideal candidates for approximation with lines or vectors given the linear spatial trend in their geometry. In order to evaluate the agreement, we coarsely classify each segment to one of three secondary structure states: ‘Helix’, ‘Strand’ and ‘Other’. This three-state classification is based on certain geometric characteristics of the segments in the delineation. Specifically, we compute the following geometric profiles for each segment: ‘rise’, ‘pitch’ and backbone dihedral angles ϕ and ψ. The rise (ρ) of the segment with endpoints Pi and Pj is ρ=Dij/(j−i+1), where Dij is the Euclidean distance between the endpoints. In other words, the rise gives the average translation of points along the line between endpoints. The rise of a standard secondary structure is directly related to the pitch (p) of the segment. For a substructure with a geometry that repeats itself every n residues, the relationship between rise and pitch is given by p=nρ. Table 1 summarizes the geometric profiles of ideal secondary structures (Taylor, 2001). Inspecting these profiles per segment, a coarse characterisation for each segment in the delineation is achieved.
Type . | ϕ . | ψ . | n . | ρ . | p . |
---|---|---|---|---|---|
310-Helix | −57.1 | −69.7 | 3.0 | 2.0 | 6.0 |
α-Helix | −57.8 | −47.0 | 3.6 | 1.5 | 5.5 |
π-Helix | −74.0 | −4.0 | 4.4 | 1.1 | 5.0 |
β-Strand | −139.0 | 135.0 | 2.0 | 3.4 | 6.8 |
Type . | ϕ . | ψ . | n . | ρ . | p . |
---|---|---|---|---|---|
310-Helix | −57.1 | −69.7 | 3.0 | 2.0 | 6.0 |
α-Helix | −57.8 | −47.0 | 3.6 | 1.5 | 5.5 |
π-Helix | −74.0 | −4.0 | 4.4 | 1.1 | 5.0 |
β-Strand | −139.0 | 135.0 | 2.0 | 3.4 | 6.8 |
Type . | ϕ . | ψ . | n . | ρ . | p . |
---|---|---|---|---|---|
310-Helix | −57.1 | −69.7 | 3.0 | 2.0 | 6.0 |
α-Helix | −57.8 | −47.0 | 3.6 | 1.5 | 5.5 |
π-Helix | −74.0 | −4.0 | 4.4 | 1.1 | 5.0 |
β-Strand | −139.0 | 135.0 | 2.0 | 3.4 | 6.8 |
Type . | ϕ . | ψ . | n . | ρ . | p . |
---|---|---|---|---|---|
310-Helix | −57.1 | −69.7 | 3.0 | 2.0 | 6.0 |
α-Helix | −57.8 | −47.0 | 3.6 | 1.5 | 5.5 |
π-Helix | −74.0 | −4.0 | 4.4 | 1.1 | 5.0 |
β-Strand | −139.0 | 135.0 | 2.0 | 3.4 | 6.8 |
Examining the coarse segment level assignment for the structures in the dataset, we note that the average length of segments assigned as ‘Helix’ is 13.01 residues while the same for those assigned as ‘Strand’ is 7.33 residues.
To evaluate our coarse assignment, we choose two popular and extensively used secondary structure assignment programs, DSSP (Kabsch and Sander, 1983) and STRIDE (Frishman and Argos, 1995). DSSP and STRIDE assign each residue to one of multiple secondary structural states, including 310-, α-, π-helices and β-strands of sheet. For the structures in our dataset, we generate the respective secondary structure assignments using DSSP and STRIDE. We note that both these programs assign secondary structure definitions at a residue level, while the coarse assignment for our method described above is at a segment level. Therefore, to enable a comparison between the methods we assign all residues within a segment to the segment level secondary structure state.
Table 2 gives the concordance of Helix6 and Strand assignments between DSSP, STRIDE, and our method, PMML.
Comparison . | Helices (%) . | Strands (%) . |
---|---|---|
PMML (coarse)_vs_DSSP | 79.0 | 83.3 |
PMML (coarse)_vs_STRIDE | 79.3 | 83.1 |
PMML (refine)_vs_DSSP | 92.6 | 92.4 |
PMML (refine)_vs_STRIDE | 91.3 | 92.1 |
STRIDE_vs_DSSP | 95.7 | 96.9 |
Comparison . | Helices (%) . | Strands (%) . |
---|---|---|
PMML (coarse)_vs_DSSP | 79.0 | 83.3 |
PMML (coarse)_vs_STRIDE | 79.3 | 83.1 |
PMML (refine)_vs_DSSP | 92.6 | 92.4 |
PMML (refine)_vs_STRIDE | 91.3 | 92.1 |
STRIDE_vs_DSSP | 95.7 | 96.9 |
Comparison . | Helices (%) . | Strands (%) . |
---|---|---|
PMML (coarse)_vs_DSSP | 79.0 | 83.3 |
PMML (coarse)_vs_STRIDE | 79.3 | 83.1 |
PMML (refine)_vs_DSSP | 92.6 | 92.4 |
PMML (refine)_vs_STRIDE | 91.3 | 92.1 |
STRIDE_vs_DSSP | 95.7 | 96.9 |
Comparison . | Helices (%) . | Strands (%) . |
---|---|---|
PMML (coarse)_vs_DSSP | 79.0 | 83.3 |
PMML (coarse)_vs_STRIDE | 79.3 | 83.1 |
PMML (refine)_vs_DSSP | 92.6 | 92.4 |
PMML (refine)_vs_STRIDE | 91.3 | 92.1 |
STRIDE_vs_DSSP | 95.7 | 96.9 |
Although even a coarse segment level assignment by our method produced a satisfactory concordance with DSSP and STRIDE, there is still a disagreement of ~15% between PMML and the other two methods. Inspecting these differences we note that the majority of them came from the terminal parts of the segments delineated by our program. Therefore, we refine the coarse level assignment produced by PMML using the hydrogen bonding patterns of residues within each segment to reassign the secondary structure state at a residue level. We use a simple proximity (of backbone nitrogen and carbonyl groups) and angle (of N, O, C atoms) based computation of hydrogen bonds. Comparing our refined assignments at a residue level with DSSP and STRIDE we notice a substantial improvement in the concordance of helix and stand assignments with DSSP and STRIDE. (See rows 3 and 4 in Table 2.) We emphasize that although PMML can be used to generate protein secondary structure assignments, its real aim is to generate concise representations of structures, irrespective of the nature of the segments of which they are composed. For instance, PMML could be applied to RNA structures without needing any appeal to the types of substructure anticipated.
Manually evaluating the delineation of a large number of structures we notice that although PMML's delineation identifies the regions of helix and strand consistently, there remain small discrepancies in assigning precise beginning and end residues of secondary structure elements as ascertained by an expert. To highlight these differences consider the following example of the delineation produced by PMML. Figure 3 shows the structure of oxidized Clostridium beijerinckii flavodoxin. This protein binds a cofactor, flavin mononucleotide (FMN). Flavodoxin is a small α/β protein, containing a 5-stranded parallel β-sheet (β1,…,β5), with two helices packed against each face of the sheet (αA,αE and αC,αD). There is also a short helix (αB) located near the N-terminus of the protein. (Fig. 3.) Different segments produced by PMML are shown in different colors. The elements of secondary structure shown as thick ribbons are the secondary structure assignments taken from the structure's wwPDB file (5NLL). Table 3 gives the residue ranges (that is, start and end residues) for each secondary structural element (SSE) of the flavodoxin structure listed in its wwPDB file. The residue ranges of the corresponding segmentation produced by PMML is also presented in the table. Broadly, the program correctly assigns segments to the SSEs. However, minor differences can be observed in the locations of their start and end residues. In most cases, we notice an absolute difference of 1 or 2 residues in the N- or C- terminal regions of these SSEs. The segmentation in the regions around the SSEs αE, β2 and β5 show some discrepancies. The residue range from wwPDB corresponding to αE was approximated by PMML using 2 segments instead of one. The first segment is composed of roughly one turn of the helix at αE's N-terminal end. This is understandable as this turn is substantially skewed from the main helical axis and, indeed, there is an interruption in the hydrogen bonding. However, the second segment composed of 11 residues in this region is consistent with the assignment in the wwPDB file. In the case of β2, the start location identified by PMML precedes the start location identfied in the wwPDB file by four residues. On inspecting the flavodoxin structure, there appears to be a backbone hydrogen bond between the carbonyl group of residue Asp29 and the nitrogen of Met1 (of strand β1), so the β2 strand may well start at residue Lys28 or Asp29. Similarly, for β5, the start location of the segment from PMML was identified to be three residues before the location identified in the wwPDB file, and inspecting the structure, we note the β−bulge in strand β5, and hydrogen bonds between atoms 80O···109N and 82N···109O; assignment of the start of the strand β5 to residue 109 is not indefensible.
SSE . | wwPDB . | PMML . |
---|---|---|
β1 | Lys2-Trp6 | Met1-Tyr5 |
αA | Asn11-Glu25 | Asn11-Glu27 |
β2 | Asn31–Asn34 | Gly27-Ile33 |
αB | Ile40-Asn45 | Asn39-Glu46 |
β3 | Ile48–Cys53 | Asp47-Cys53 |
αC | Phe66-Lys76 | Glu65-Thr75 |
β4 | Lys81–Tyr88 | Gly79-Ser87 |
αD | Lys94-Gly105 | Gly91-Gly107 |
β5 | Leu115–Gln118 | Glu112-Gln118 |
αE | Asp122-Ile136 | Glu120-Gln126,Gln126-Ile136 |
SSE . | wwPDB . | PMML . |
---|---|---|
β1 | Lys2-Trp6 | Met1-Tyr5 |
αA | Asn11-Glu25 | Asn11-Glu27 |
β2 | Asn31–Asn34 | Gly27-Ile33 |
αB | Ile40-Asn45 | Asn39-Glu46 |
β3 | Ile48–Cys53 | Asp47-Cys53 |
αC | Phe66-Lys76 | Glu65-Thr75 |
β4 | Lys81–Tyr88 | Gly79-Ser87 |
αD | Lys94-Gly105 | Gly91-Gly107 |
β5 | Leu115–Gln118 | Glu112-Gln118 |
αE | Asp122-Ile136 | Glu120-Gln126,Gln126-Ile136 |
The SSEs in the rows follow the order of their appearance along the chain of the protein from its N- to C-terminus. The column wwPDB gives the residue ranges of various SSEs as indicated in the wwPDB file 5NLL. The column PMML gives the corresponding residue ranges of the segmentation produced by PMML.
SSE . | wwPDB . | PMML . |
---|---|---|
β1 | Lys2-Trp6 | Met1-Tyr5 |
αA | Asn11-Glu25 | Asn11-Glu27 |
β2 | Asn31–Asn34 | Gly27-Ile33 |
αB | Ile40-Asn45 | Asn39-Glu46 |
β3 | Ile48–Cys53 | Asp47-Cys53 |
αC | Phe66-Lys76 | Glu65-Thr75 |
β4 | Lys81–Tyr88 | Gly79-Ser87 |
αD | Lys94-Gly105 | Gly91-Gly107 |
β5 | Leu115–Gln118 | Glu112-Gln118 |
αE | Asp122-Ile136 | Glu120-Gln126,Gln126-Ile136 |
SSE . | wwPDB . | PMML . |
---|---|---|
β1 | Lys2-Trp6 | Met1-Tyr5 |
αA | Asn11-Glu25 | Asn11-Glu27 |
β2 | Asn31–Asn34 | Gly27-Ile33 |
αB | Ile40-Asn45 | Asn39-Glu46 |
β3 | Ile48–Cys53 | Asp47-Cys53 |
αC | Phe66-Lys76 | Glu65-Thr75 |
β4 | Lys81–Tyr88 | Gly79-Ser87 |
αD | Lys94-Gly105 | Gly91-Gly107 |
β5 | Leu115–Gln118 | Glu112-Gln118 |
αE | Asp122-Ile136 | Glu120-Gln126,Gln126-Ile136 |
The SSEs in the rows follow the order of their appearance along the chain of the protein from its N- to C-terminus. The column wwPDB gives the residue ranges of various SSEs as indicated in the wwPDB file 5NLL. The column PMML gives the corresponding residue ranges of the segmentation produced by PMML.
8 CONCLUSION
We have presented a novel and efficient method to delineate protein structures using the MML framework; MML is tolerant to measurement error and other inaccuracies. The model used in this work is independent of preconceived notions of what substructures are being sought to simplify the observed coordinate data. Our method maximizes the economy of representation while minimizing the loss of information, taking into account even the loop regions of proteins. Analysis of the delineations of a large number of protein structures suggests that the method is consistent in, among others, delineating standard secondary structures. The concise representations produced by this method have a potential use for rapid and accurate structure comparison and lookup. An implementation of our program is available from http://www.csse.monash.edu.au/~karun/pmml/.
1 Banerjee et al., 1996) use a related minimum description length principle for their approach, which is a technique that was introduced a decade after Wallace and Boulton (1968) proposed the MML criterion. The two approaches are significantly different. See Wallace (2005) for a comparison.
2We use the term sequence in this paper to mean an ordered list. This should not be confused with the primary sequence of amino acids of a protein.
3Assume that the protein 𝒫 is oriented such that P1 is the origin, P2 lies on the positive x−axis, and P3 lies on the xy-plane. This is one of the many possible schemes that ensures that our method is invariant to rotation and translation of the frame-of-reference in which the coordinate data is defined. (See supplementary note for a detailed discussion on this issue.)
4log*x=logx+loglogx+··· (over all positive terms)
5Note that the coordinates of the bounding rectangular box is a constant given the data, so it can be ignored at least for the purposes of comparing two hypotheses.
6We do not distinguish between the three types of helices and treat them as one state.
ACKNOWLEDGEMENTS
We thank the anonymous referees for comments that improved the manuscript. L.A. and A.S.K. thank Nathan Hurst for useful pointers during the development of this work.
Funding: ASK's research is supported by Monash University's Talent Enhancement and Larkins Fellowship. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council.
Conflict of Interest: none declared.