Abstract

Metagenomic projects using whole-genome shotgun (WGS) sequencing produces many unassembled DNA sequences and small contigs. The step of clustering these sequences, based on biological and molecular features, is called binning. A reported strategy for binning that combines oligonucleotide frequency and self-organising maps (SOM) shows high potential. We improve this strategy by identifying suitable training features, implementing a better clustering algorithm, and defining quantitative measures for assessing results. We investigated the suitability of each of di-, tri-, tetra-, and pentanucleotide frequencies. The results show that dinucleotide frequency is not a sufficiently strong signature for binning 10 kb long DNA sequences, compared to the other three. Furthermore, we observed that increased order of oligonucleotide frequency may deteriorate the assignment result in some cases, which indicates the possible existence of optimal species-specific oligonucleotide frequency. We replaced SOM with growing self-organising map (GSOM) where comparable results are obtained while gaining speed improvement.

1. Introduction

Metagenomics is an emerging area of genome research that allows culture-independent, functional, and sequence-based studies of microbial communities in environmental samples. Whole-genome shotgun (WGS) sequencing has been applied to most of the metagenomic projects [16]. These projects have unveiled remarkable information on microbial genomics and also brought an unprecedentedly comprehensive and clearer picture of microbial communities. In the WGS sequencing approach, random sampling of DNA fragments of all microbes that form a community in an environmental sample is performed. The individual DNA fragments are sequenced and then assembled into genomes by using computing techniques. However, a fundamental limit of WGS sequencing is that only the genomes of high-abundance species can be completely or near-completely assembled [7] due to the requirement of multiple overlapping fragments for a confident assembly. In one of the prominent metagenomic studies conducted by Venter et al. [2], about 1?Gb of DNA sequences has been successfully sequenced from Sargasso Sea samples. This study has clearly indicated the existence of far more diverse microbial communities than previously thought. Most of the environmental genomes sequenced to date contain only few high-abundance species but many low-abundance species in the communities that account for a large portion of the total genome size of an environmental sample. The presence of large amount of DNA fragments from the low-abundance species poses a problem for assembling genomes. In order to infer the biological functions of a microbial community from sequences, a process named “binning” is used to group these unassembled DNA sequence fragments and small contigs into biologically meaningful “bins,” such as phylogenetic groups [8].

There are a number of tools currently available for the binning process. These include Chisel System [9, 10], MetaClust [4, 11], TETRA [12, 13], PhyloPythia [14], and the combination of oligonucleotide frequency and SOM [15]. The Chisel System helps binning the sequences according to the identification, characterisation, and comparative analysis of taxonomic and evolutionary variations of enzymes. The remaining above-listed tools use the method of analysing nucleotide composition of sequences that is considered to have the potential of working well for the binning process in WGS sequencing [8]. MetaClust computes different DNA signatures followed by the use of a clustering algorithm to assign sequences into bins. TETRA bins the species-specific sequences by the use of tetranucleotide-derived -score correlations. PhyloPythia uses a supervised-learning approach where it trains a multiclass support vector machine (SVM) classifier using all the known genome sequences in the existing database then assigns the unknown environmental sequences to the closest clade in the selected taxonomic level. This method has been demonstrated to be able to classify most DNA sequence fragments with high accuracy. However, considering the current amount of known genomes which is far less than 1% of the entire microbial genomes [7], it is reasonable to assume that the currently available training data is insufficient to represent all the extremely diverse microbial genomes for supervised-learning methods. Unsupervised-learning may provide the answer to this problem. The combination of oligonucleotide frequency and the well-known unsupervised learning method self-organising map (SOM) was used by Abe et al. [15] to explore genome signatures. They used the di-, tri-, and tetranucleotide frequencies as the training features of SOM to cluster the 1?kb and 10?kb DNA sequence fragments derived from 65 bacteria and 6 eukaryotes. Clear species-specific separations of sequences were obtained in the 10?kb fragment tests. Their results showed that the combination of oligonucleotide frequency and SOM can be used as a powerful tool to cluster or bin the DNA sequence fragments after WGS sequencing.

In order to successfully bin the DNA sequence fragments, using an appropriate genome signature as the training feature is important. In recent years, researchers have found that, due to oligonucleotide frequency bias in various prokaryotic genomes, the oligonucleotide frequency can be used as a possible genome signature. The di-, tri-, and tetranucleotide frequencies, which are the frequencies using two, three, and four nucleotides respectively, have been well studied. Karlin et al. [16, 17] has shown the compositional bias of the di- and tetranucleotide contents of 15 prokaryotic genomes. Weinel et al. [18] found that 80% of Pseudomonas putida KT2440 genome have a similar bias in GC contents and di- and tetranucleotide contents. Teeling et al. [12] showed that the tetranucleotide frequency has a higher discriminatory power than GC content and used it for the assignment of genomic fragments to the taxonomic group. In addition, Sandberg et al. [19] employed a Bayesian approach to classify the short sequences and found that the classification accuracy increases with a higher-order oligonucleotide frequency. Above-mentioned papers provide evidences that there is a trend of using oligonucleotide frequency as prokaryotic genome signature, rather than the GC content. Thus, high-order oligonucleotide frequency may also be an appropriate training feature for binning DNA sequence fragments by unsupervised clustering methods.

Since the combination of oligonucleotide frequency and SOM appears as a promising binning strategy that can be further explored, we focus in this paper on improving the training features and the clustering algorithm. In Abe et al.’s work [15], there was no systematic way of comparing the quality of the SOM results. We tested the traditional clustering evaluation measures (recall, precision, and F-measure) and discovered the inadequacy of using them for examining the similarity of phylogenetic levels. Therefore, we introduce a method to quantitatively measure and assess the results of clustering DNA sequence fragments from a collection of species. In the investigation of evaluating suitable training features, we attempt to compare results for the di-, tri-, and tetranucleotide frequencies as well as the pentanucleotide frequency (the frequency usage of five nucleotides) to test if higher-order oligonucleotide frequency yields better binning of DNA sequence fragments. We also study the effectiveness and efficiency of the combination of oligonucleotide frequency and SOM by employing alternative clustering algorithms. We compare SOM with a variant of it called growing self-organising map (GSOM), which has been successfully applied in several different applications [2025] including microarray clustering [26]. These comparisons allow us to suggest a better compositional binning strategy for WGS sequencing using the method of combining oligonucleotide frequency and SOM-based clustering algorithm.

This paper is organized as follows: Section 2.1 gives a brief introduction to the SOM and GSOM clustering algorithms; Section 2.2 proposes a method of measuring inseparable species when DNA sequence fragments are clustered; Section 2.3 describes the procedures of preparing the three datasets used in this paper, and the data preprocessing step for preparing the input vectors; and Section 2.4 shows the details of the algorithm settings and the experiment set up for repeatability of the experiments. Section 3 presents the results of comparing the four orders of oligonucleotide frequencies and the comparison between SOM and GSOM; Finally, Section 4 gives the discussion, conclusion, and future work.

2. Methods

2.1. Growing Self-Organising Map

Growing self-organising map (GSOM) [27, 28] is an extension of self-organising map (SOM) [29]. GSOM is a dynamic SOM which overcomes the weakness of a static map structure of SOM. Both SOM and GSOM are used for clustering high-dimensional data. This is achieved by projecting the high-dimensional data onto a two- or three-dimensional feature map with lattice structure where every point of interest in the lattice represents a neuron or a node in the map. The mapping preserves the data topology, so that similar samples can be found close to each other on the 2D/3D feature map.

The SOM training consists of three phases: initialisation phase, ordering phase, and fine-tuning phase. The initialisation is crucial to achieve a quality-clustering result. The following parameters are determined in this phase:

(i) the map topology (either rectangular or hexagonal);(ii)the number of nodes which is the resolution of the map;(iii)the weight vector initialization of nodes;(iv)the width/height (or aspect) ratio of the map.The user determines the first two parameters and generally principle components analysis (PCA) is used for setting the last two parameters. The weight vectors are initialised by the first two principle vectors of the inputs and the aspect ratio of the map is determined based on the ratio of magnitudes of the first two principle components. In the ordering and fine-tuning phases, each input is presented to the map and the best matching unit or “winner,” which has the smallest Euclidean distance to the presented input, is identified. The weight vector of the winner and its neighbouring nodes are updated bywhere is the weight vector of the node at time , is the th input vector ( where is the dimensionality of data), is the learning rate and is the neighbourhood kernel function.

GSOM employs the same weight adaptation and neighbourhood kernel learning as SOM, but has a global parameter of growth named Growth Threshold (GT) that controls the resolution of the map. The Growth Threshold is defined as where is the user defined spread factor with 0 representing minimum growth (coarsest resolution) and 1 representing maximum growth (finest resolution).

There are three phases in the GSOM training: initialisation phase, growing phase, and smoothing phase. In the initialisation phase, the GSOM is initialised with a minimum single “lattice grid” depending on whether the rectangular or hexagonal topology is chosen. Due to the small number of nodes in the beginning of training, the weight vector initialisation has less effect on the clustering quality and these weights will be corrected quickly in the growing phase. During the growing phase, every node has an accumulated error counter and the counter of the winner () is updated by If the winner is at the boundary of the current map and exceeds GT, new nodes will be added to the surrounding vacant slots of the winner. In the case when exceeds GT and the winner is not a boundary node, is evenly distributed outwards to the winner’s neighbouring nodes. The smoothing phase is for fine-tuning the weights of nodes and no new node will be added to the map.

The major advantages of GSOM over SOM are summarised as follows.

(i)The shape of GSOM represents the hidden data structure better than SOM that leads to better identifiable clusters.(ii)New nodes are added to the necessary regions while keeping the order of nodes. Therefore, neither PCA nor ordering phase is required in the training.(iii)Fewer nodes at the beginning of the training leads to the speed improvement.

2.2. Quality Measurement of the Clustering Performance in the Mixing Region

In our preliminary test, the well-known F-measure [30], which computes both recall and precision into a single index from a contingency table, was used to evaluate the clustering results. However, after examining the cluster contents, it is apparent that, for binning applications, F-measure does not provide sufficient insight and description of ambiguities in terms of phylogenetic relationships (refer to Section 3). More specifically, one would expect phylogenetically-close groups as highly likely to be ambiguous, but F-measure does not account for such likelihoods. Therefore, we propose an alternative clustering evaluation measure specifically for this application.

When an SOM or GSOM is used to group species fragments into clusters on a 2D/3D map, it is often inevitable that regions with overlapping clusters (mixing regions where a neuron represents DNA sequence fragments from more than one species) will exist. To evaluate a clustering algorithm’s ability to group DNA sequence fragments into species-specific or “pure” clusters, we define two criteria that measure the clustering quality in the mixing region: intensity of mix (IoM) and level of mix (LoM), where the former measures the percentage of mixing and the later indicates the taxonomic level of ambiguity for a given pair of clusters.

The IoM is evaluated based on the concept of mixed pair described below. Let A and B be sets of vectors belonging to species A and B, respectively, and n(X) is the number of elements in set X. If A and B is a mixed pair, then the percentage of A in the mixing region of the two classes is and the percentage of B is . As illustrated in Figure 1, 11.6% of A sequences is mixed with B sequences in the B cluster and the complementary mix indicates that 10.6% of B sequences is mixed with A sequences in the A cluster. The same concept of mixed pair applies for B and C. Therefore, there are two mixed pairs in Figure 1, one is A and B and the other is B and C. For number of species, there can be up to mixed pairs. Additionally, a pair of clusters is only considered to be truly mixed when both clusters are heavily overlapped. Thus, as in Figure 1, when but < it indicates that only a small number of outliers of one species (C) is mixed with the other species (B). Therefore, this mixed pair is not considered as truly mixed. We use TH = 5% for the threshold of being truly mixed meaning that, statistically, we have a nonmixing confidence of 95%. The IoM measures the amount of mixing sequences and it is nonlinearly categorised into five levels: low (L) 5%–10%, medium low (ML) 10%–20%, medium (M) 20%–40%, medium high (MH) 40%–60%, and high (H) 60%–100%. For example, the IoM is ML for the truly mixed pair A and B in Figure 1.

To evaluate clustering results of species, we use LoM to describe the taxonomic level of the mixed species. For example, as in Figure 1, Bacillus subtilis is classified in Kingdom Bacteria and Phylum Firmicutes. Acinetobacter is classified in Kingdom Bacteria but Phylum Proteobacteria. Then the two species are mixed at the Phylum level. Because of the evolution of organisms, nucleotide composition of genomes belonging to the same lower taxonomic levels can be very similar. Clustering organisms at higher level of taxonomy should be easier than at lower level of taxonomy. Therefore, if truly mixed pair occurs, lower LoM (e.g., Species) is more acceptable and more desirable than higher LoM (e.g., Kingdom).

In summary, the proposed two measures are defined as

The two proposed measures, IoM and LoM, are only defined for truly mixed pairs to evaluate the clustering quality in the mixing regions of a map by the following steps.

(i)Find truly mixed pairs for all pairs of species where if and , then X and Y is a truly mixed pair.(ii)If X and Y are truly mixed, determine IoM according to min.(iii)Identify LoM of X and Y.

Clustering results can now be assessed based on three criteria: number of truly mixed pairs, IoM, and LoM. However, which criterion should have higher priority may vary between applications. Therefore, in our assessment, one result is better than another only when it is superior on at least two of the three measures.

2.3. Dataset Preparation and Data Preprocessing

The NCBI database (http://www.ncbi.nlm.nih.gov) contains 370 completed microbial genomes in early 2006, which includes 28 Archaea and 342 Bacteria. As the investigation seeks to cluster sequences of species, genomes are filtered so that there is no duplicating species. One strain was arbitrarily chosen if the same species genome contains more than one strain. After this process, there were 283 genomes remaining. Considering the available computing resources and the algorithm comparison focus of this paper, two artificial sets of prokaryotic DNA sequences (each of 10 different species out of the 283 species) were randomly sampled from the NCBI database.

In addition, three simulated metagenomic datasets were created by Mavromatis et al. [31] to facilitate benchmarking of metagenomic data processing methods, which include, but not limited to, binning methods. The three datasets vary in relative abundance and number of species that represent different complexity levels of real-world microbial communities. The sequence fragments in the simulated datasets were assembled using three commonly used sequence assembling programs: JAZZ, Arachne, and Phrap at U.S. Department of Energy (Wash, USA), DOE Joint Genome Institute (Calif, USA). In this paper, we tested one of the three simulated datasets named simMC and was assembled by Phrap (http://www.phrap.com). For simplicity, this dataset will be represented as simMC_Phrap throughout the paper.

The taxonomic distributions of the three sets of species are displayed graphically in Figure 3. Each letter represents a single species and species within a single rectangle have the same taxonomy at the specific level. The names of the species can be found in Section 1 of the supplementary material which consists of 6 sections showing the species names in Section 1, clustering evaluation methods in Sections 2 and 3, and the labelled cluster maps in Section 4 to 6 (http://www.mame.mu.oz.au/˜ckkc/Binning). The numbers below the taxonomic levels in Figure 3 indicate the maximum possible number of mixed pairs at that taxonomic level. For example, in Figure 3(a), the maximum number of mixed pairs at taxonomic level of Class is 12, which consists a mixed pair each from (a,j) and (c,e) and 5 pairs each from (c,b,d,g,h,i) and (e,b,d,g,h,i).

In the experiments of Abe et al. [15] that attempts to separate 1?kb and 10?kb DNA sequence fragments of 65 bacteria genomes containing 54 different species, it is visually shown that 1?kb DNA sequence fragments do not carry enough discriminatory information and hence could not completely separate the fragments into species-specific groups. Therefore, a sequence fragment length of 10?kb is used for the analysis in the two artificial datasets to ensure appropriate separation of species-specific groups. Sequences used in the two artificial datasets are produced from complete genome sequences to simulate the environment of WGS sequencing. Such a complete genome sequence is segmented into 10?kb nonoverlapping fragments. A sliding window with the size n is used for counting the oligonucleotide frequency for each of the fragments in which n is the nucleotide length. For example, the dinucleotide frequency (n = 2) for a short sequence “AATACTTT” is shown graphically in Figure 2. The oligonucleotide frequency count for each of the fragments yields a single input vector for clustering. The input vectors to the clustering algorithm will have 4n dimensions.

Whereas, the simMC_Phrap was preprocessed by extracting all sequences with contig length = 8?kb. The oligonucleotide frequency count is applied to these sequences to generate the input vectors for clustering. Finally, each input vector is normalised by the sequence length.

2.4. Algorithms Parameters and Experiment Details

In order to avoid the algorithm implementation bias, an in-house clustering program was developed consisting a data preprocessor, both SOM and GSOM algorithms, and a graphical neuron-map output. The program is written in C# with a Windows form interface. This program can be requested from the author for academic use. Since the hexagonal topology has better topology preservation [26, 29], it is used in both SOM and GSOM. In addition, the tests were conducted on a Pentium 4 3.2?GHz desktop PC running Windows XP and the same parameter settings are used in both algorithms for a fair computational speed comparison (as listed in Table 1).

To compare results from the 4 orders of oligonucleotide frequencies, we obtain similar map resolution (number of nodes) for both algorithms and for all nucleotide frequencies. Since the GSOM algorithm automatically determines the number of nodes, it can be used to determine the total number of nodes of SOM. This can be achieved by training GSOM with a specified resolution (we used SF = 0.4) for the dinucleotide frequency then the final number of nodes in GSOM is used to set the number of nodes in SOM, as well as determine the SF for other nucleotide frequencies. Using this scheme, we set SF = 0.4 for dinucleotide frequency (for a higher-resolution map) and experimentally determined that SF = 0.6 for trinucleotide frequency, SF = 0.8 for tetranucleotide frequency and SF = 0.9 for pentanucleotide frequency will result in similar map resolution. The SOM also requires setting the aspect ratio of the map and initializing the weight vectors. These two parameters are set by using PCA. The schematic diagram of this approach is shown in Figure 4.

3. Results

The proposed binning method is tested on two artificial datasets and a simulated metagenomic dataset (simMC_Phrap) which was created and published for benchmarking the metagenomic data processing methods. The two artificial datasets of prokaryotic DNA sequences (each of 10 different species out of the 283 species) were randomly sampled from the NCBI database. Each set of the genome sequences was preprocessed to obtain the 4 orders of oligonucleotide frequencies (di-, tri-, tetra-, and pentanucleotide frequencies) to form 4 datasets. This data preprocessing involves segmenting each genome sequence in the set into 10?kb lengths then produce input vectors by calculating the specific oligonucleotide frequency. After preprocessing, each of the 4 datasets from species Set 1 contains 4,145 input vectors. Whereas, each of the 4 datasets from Set 2 contains 2,398 input vectors. The simMC_Phrap was preprocessed by extracting all sequences with contig length = 8?kb then obtaining the 4 orders of oligonucleotide frequencies to form 4 datasets. The produced input vectors were normalised by the sequence length. After preprocessing, each of the 4 datasets contains 401 input vectors. The details of preparing these three datasets can be found in Section 2.

To evaluate the clustering performance, we used well-known clustering evaluation measure F-measure. The results for the three datasets are shown in Table 2. A summary of F-measure calculation can be found in Section 2 of the supplementary material.

From these results, we can observe that F-measure does not distinguish the clustering quality clearly enough for this species separation application. For example, in the results of Set 1 using GSOM, F-measure equals 0.97 for the tri-, tetra-, and pentanucleotide frequencies. However, by using the proposed evaluation, more details of the ambiguities can be seen. It shows that there are only two mixed pairs with low taxonomic level in the mixing region when using the pentanucleotide frequency. However, there are four mixed pairs with two of them having higher taxonomic levels when using the tri- and tetranucleotide frequencies (as shown in Table 3). This suggests that the pentanucleotide frequency provides a higher level of phylogenetic resolution, which cannot be detected via F-measure because the numbers of incorrectly assigned sequence fragments are similar for these three nucleotide frequencies.

We use two approaches to evaluate the performance of clustering DNA sequence fragments of species. The first approach is to observe the cluster formation of species sequences to verify the cluster formation similar to the method used by Abe et al. [15]. The second approach is to compare the LoM and IoM in the mixing region. A simple example is given in Section 3 of the supplementary material. It highlights the difference between the calculation of F-measure and IoM.

After the training is completed, for display purpose, we use the label information of the input data to display the labelled cluster map. The labelled cluster maps from the training of SOM and GSOM for the pentanucleotide frequency of species Set 1 are shown in Figure 5. The following points can be observed from this labelled cluster maps.

(i)All species are clearly clustered and are marked in the figures.(ii)The nodes that contain more than 2 species (which are coloured in grey) are mostly located at the border of the clusters.(iii)Species “d” is clustered as one group in GSOM, but separated by species “c” into two groups in SOM.These observations show that the GSOM have better cluster formation in terms of cluster identification than the SOM due to the flexibility of feature map shape. The labelled cluster maps for other datasets also show a clear cluster formation and can be found in Sections 4, 5, and 6 of the supplementary material.

We also interpret the clustering results of the species mixing regions by summarising the IoM and LoM in Tables 3, 4, and 5 for Set 1, Set 2, and simMC_Phrap, respectively.

By comparing the results of the four orders of oligonucleotide frequencies for Set 1, the number of truly mixed pairs for dataset that uses the dinucleotide frequency is almost twice the number of truly mixed pairs of the higher-order oligonucleotide frequencies as shown in Table 3. In addition, the LoM is also high for the dinucleotide frequency. Similarly in Table 4, there are three to four truly mixed pairs when using the dinucleotide frequency, but no more than one truly mixed pair when higher-order oligonucleotide frequencies are used. All four truly mixed pairs are of a very high LoM at the Phylum level. Since there are only very few species and number of sequence fragments in simMC_Phrap, the difference of using different nucleotide frequencies is not obvious. Nevertheless, the results of our two artificial sets indicate that dinucleotide frequency is not a strong signature for clustering the 10?kb fragments of species.

Furthermore, from Set 1, we can see that IoM and LoM tend to decrease as the order of oligonucleotide frequency increases. One would naturally suspect that higher-order oligonucleotide frequencies may carry more information so they can be used to achieve better clustering results. However, it is not the case for Set 2 and simMC_Phrap. In Set 2, there is a truly mixed pair in the tetranucleotide frequency but no mixed pair in the tri- and pentanucleotide frequencies when using SOM. For the GSOM algorithm, a truly mixed pair appears in the pentanucleotide frequency but not in the tri- and tetranucleotide frequencies. A detailed examination shows that they are the same pair, Acinetobacter sp.ADP1 and Bacillus subtilis subsp. subtilis str. 168, in both cases. Both of the IoM is low indicating a much better result than the dinucleotide frequency. As in Table 5 for simMC_Phrap, GSOM performs well for all four orders of nucleotide frequencies, whereas SOM shows inconsistent clustering quality for different nucleotide frequencies. There are two mixed pairs in the di- and pentanucleotide frequencies but only one mixed pair in the other two nucleotide frequencies. The mixed pair in the tetranucleotide frequency has the lowest IoM (IoM = M). These results also do not support the hypothesis that higher-order oligonucleotide frequencies are better clustering features. Therefore, we can only conclude that higher-order oligonucleotide frequencies are better features for clustering the species than using dinucleotide frequency. However, the optimal oligonucleotide frequency may vary in different species.

In terms of clustering quality, both SOM and GSOM have similar results. However, besides the mixing quality comparison, we also compare the training speed of them. The speed comparisons for the first two training phases and the overall training time of both algorithms for all 12 datasets are shown in Tables 6 and 7, respectively.

Comparing the time taken for SOM and GSOM to finish the first two training phases (as in Table 6), GSOM has more than 37% speed improvement than SOM. This speed improvement can be explained by considering the initial formation of the map structure. As discussed in Section 2, PCA is used to initialise SOM. However, it increases the computational cost exponentially when the data dimension and size increases. Therefore, even though time consumed for PCA initialisation in this experiment is negligible, it is expected to be significant in large-scale metagenomic analysis. In addition, SOM starts with all nodes fully covering the whole input space then adjusts the weights of all nodes to represent the input data better. On the other hand, GSOM starts with the minimum number of nodes and grows more nodes in the required direction to get an abstract representation of the input data while still correcting the weights of existing nodes. At the end of the second phase, both algorithms will be roughly representing the abstraction of the data. However, GSOM saves time by avoiding PCA calculation and operates on fewer numbers of nodes in the first two phases. Although the last training phase is basically identical for both algorithms and the learning length of the last training phase is much longer than the first two phases, GSOM still has 7%–15% of speed improvement for the overall training. It was reported that the use of this strategy involving SOM was used for analysing a large amount of eukaryote genomes and one of the highest performance supercomputers in the world was required [32]. For the large-scale analysis, which can take weeks to complete, 7%–15% of speed improvement means 1-2 days of time saving for a two-week computation.

Besides using Tables 6 and 7 to compare the training speed of SOM and GSOM, the tables also show that the training time grows exponentially from one order of oligonucleotide frequency to the higher-order oligonucleotide frequency. It is because of the rapid increment of dimensions when the order of oligonucleotide frequency increases.

4. Discussion, Conclusion, and Future Work

We have investigated four orders of oligonucleotide frequencies: di-, tri-, tetra-, and pentanucleotide frequencies on two artificial sets of species and a published simulated metagenomic dataset. Each of the two artificial sets contains 10 randomly selected species from NCBI database. We noticed that the F-measure can not distinguish the clustering quality for this application. Therefore, two methods have been defined for evaluating the performance of clustering DNA sequence fragments of species. This is done by observing the cluster formation with the labelled cluster map and then qualitatively and quantitatively comparing the LoM and IoM in the mixing region.

The results have shown that dinucleotide frequency is not a sufficiently strong signature for the tested 10?kb DNA sequences on the SOM-based algorithm. Similar to other reports, we also found that higher-order oligonucleotide frequencies, such as tri-, tetra-, and pentanucleotide frequencies, are carrying reasonably adequate genomic information to group intraspecies sequences and separate interspecies sequences [12, 19]; but the required computational power increases exponentially for each increased order of oligonucleotide frequency. Additionally, we noticed that increase of the order of oligonucleotide frequency may deteriorate the assignment of DNA sequence fragments to classes in some cases, which indicates the possible existence of optimal species-specific oligonucleotide frequency. For example, the trinucleotide frequency has a better discrimination power for Acinetobacter sp. ADP1 and Bacillus subtilis subsp. subtilis str. 168 then the tetra- and pentanucleotide frequencies. Therefore, analysts are recommended to start with trinucleotide frequency in large-scale projects and higher-order oligonucleotide frequencies may not always be better.

We also compare the SOM and GSOM algorithms for clustering the DNA sequence fragments of species. Both SOM and GSOM have shown similar results. However, in term of speed comparison, GSOM has more than 37% speed improvement over SOM in the first two training phases and 7%–15% speed improvement in the overall training. Therefore, GSOM is potentially a better alternative clustering tool. As a result of this study, we would suggest to use GSOM and a higher-order oligonucleotide frequency (at least trinucleotide frequency) to improve the strategy proposed by Abe et al. [15] for the binning process after WGS sequencing.

The method of combining oligonucleotide frequency and the SOM-based algorithm has provided a promising way of binning after WGS sequencing. However, there are limitations with this method. Since SOM-based algorithms are essentially data visualisation techniques, it is difficult to identify the exact cluster boundaries when clusters severely overlap with each other. The overlapping cluster can often be misinterpreted as a single cluster when no label is available. Therefore, a further development to overcome this cluster-overlapping problem is necessary for such SOM-based binning method to be fully practical. Additionally, at the current state, due to the high diversity of microbial communities and the nature of WGS sequencing, most of the unassembled sequences are less than 10?kb. In order to maximize the use of this binning strategy, more investigation on the optimal sequence length will need to be performed in the future work. On the other hand, the rapidly advancing sequencing technology and techniques that are capable of faster sequencing, higher coverage, and longer contig length are continuously being developed [33, 34]. The length of the unassembled fragment is expected to increase in the near future. Therefore, this binning strategy is useful for the analysis after WGS sequencing. Alternatively, when one is attempting to identify a specific species in the metagenome, which has already been sequenced, supervised learning methods can be applied. While PhyloPythia employs SVM as its supervised learning classifier, one can opt for other well-known supervised learning methods that has been used in other various applications [35, 36].