1 Introduction

Gelatine is a mixture of insoluble collagenous polypeptides derived from the acidic or alkaline hydrolysis of animal bones, cartilage, hides, tendons, skins and sinews [1]. Gelatine production, worth USD 1.34 billion, is sourced from porcine skin (80%), bovine hide (15%), and porcine bone, bovine bone and fish skin (5% combined). Gelatine is manufactured for various applications, including dairy products, sausages, candies, gummies and marshmallows, capsules [2], excipients and beauty products. Gelatine contains 92% polypeptides, comprising 18 amino acids (AAs), together with 2% salt and 6% moisture [3]. Although the human body can synthesize protein from naturally developed AAs, some AAs cannot be produced by the human body and must therefore be obtained from food [4]. These essential AAs are His, Ile, Leu, Lys, Met, Phe, Thr, tryptophan and Val [5]. In contrast, the human body naturally synthesizes the nonessential AAs, which include Ala, Arg, asparagine, Asp, cysteine, Glu, glutamine, Gly, Pro, Ser and Tyr. There are also conditional AAs, which are nonessential but become important for the human body during stress and illness; these include Arg, cysteine, glutamine, Tyr, Gly, ornithine, Pro and Ser. For this reason, gelatine attracts interest from food manufacturers seeking to cater to human needs. However, some manufacturers make false claims about the gelatine source for business purposes, which has raised concern among certain groups of consumers such as vegetarians [6], Jews [7] and Muslims [8]. Such false claims taint food integrity, because the food does not correctly represent its claim [9], and may expose consumers to diseases such as bovine spongiform encephalopathy (mad cow disease) through the consumption of bovine-sourced food [10].

Various testing methods have been used to address the issue of false claims and food integrity. Most gelatine testing uses the polymerase chain reaction (PCR) method to identify the source of the gelatine. Besides PCR, an immunological method using polyclonal anti-peptide antibodies in indirect and competitive indirect enzyme-linked immunosorbent assays (ELISA) has also been reported as successful; however, these methods are sophisticated, very costly and prone to contamination [2]. Liquid chromatography methods, e.g. liquid chromatography quadrupole time-of-flight mass spectrometry (LC-QTOF/MS) and liquid chromatography-mass spectrometry (LC/MS), are capable of differentiating the animal sources. Still, most of their applications have been in meat speciation, and they are highly expensive [11] to maintain in testing laboratories, especially new ones. A much more affordable liquid chromatography method, high-performance liquid chromatography, has nevertheless been applied successfully to differentiate gelatine sources [10]. To improve the sensitivity of the AA analysis in gelatine, this study employed Ultra-High-Performance Liquid Chromatography with a Diode-Array Detector (UHPLC-DAD), which has not been reported previously.

Although UHPLC-DAD may improve the sensitivity of AA detection, it cannot differentiate the gelatine source by comparing the individual AAs of the gelatine against the AA standard, since gelatines from all animal sources possess a similar distribution of AAs. This was evident from a comparison study of AAs from porcine and bovine skins, which did not indicate a significant difference between the two gelatines [12]. Previous research also did not identify which AAs were selected as biomarkers to differentiate the gelatine source. Thus, our study performed an analysis of variance (ANOVA) to identify which AAs were significantly different among fish, bovine and porcine gelatines. Nevertheless, ANOVA alone was not enough to discriminate the gelatine source unless the dataset was subjected to multivariate data analysis such as principal component analysis [14].

Principal component analysis (PCA) is an unsupervised technique for exploring multivariate datasets. It requires prerequisite analyses, including (1) removal of outliers [15], (2) confirmation of dataset adequacy [16] and (3) data transformation towards a normal distribution [17]. However, most exploratory PCA studies in food analysis, including AA analysis [18], did not undertake these prerequisite steps. Other studies performed dataset transformation based on previous reports without exhaustively investigating the suitable transformation method for a specific matrix. For instance, although the standardize (n − 1) transformation has been tested on the gelatine matrix [10], the same gelatine dataset should also be tested with other transformation methods, such as the log transformation, before concluding which transformation is most suitable for the gelatine matrix. Without these prerequisite analyses, the PCA may lead to erroneous results and interpretation. Previous studies have also neglected the correlation test between AAs, which supports the PCA result and explains the effect of the factor loading (FL) of each AA. The FL can be divided into strong, moderate and weak, and it plays an essential role in determining the apportionment of AAs in fish, bovine and porcine gelatines. Hence, this study provides a step-by-step PCA procedure to explore the gelatine dataset and identify the apportionment of the AAs in the fish, bovine and porcine gelatine sources. This study also anticipates that certification or regulatory bodies at the governmental level may adopt this step-by-step PCA procedure when developing guidelines or standards for testing laboratories.

2 Methodology

2.1 Preparation of calibration standard solution of amino acids

A standard stock solution (SSS) of AA hydrolysate containing a mixture of 17 AAs was purchased from Waters, USA. It contained 2.5 µmol/mL each of L-Histidine (His), L-Serine (Ser), L-Arginine (Arg), Glycine (Gly), L-Aspartic acid (Asp), L-Glutamic acid (Glu), L-Threonine (Thr), L-Alanine (Ala), L-Proline (Pro), L-Lysine (Lys), L-Tyrosine (Tyr), L-Methionine (Met), L-Valine (Val), L-Isoleucine (Ile), L-Leucine (Leu) and L-Phenylalanine (Phe), and 1.25 µmol/mL of L-Cystine (Cys). An SSS of L-Hydroxyproline (Hyp) and an internal standard solution (ISS) of L-Aminobutyric acid (AABA) were prepared at 2500 µmol/mL each. A series of calibration standard solutions (CSS) at 37.5 pmol/µL, 100 pmol/µL, 250 pmol/µL, 500 pmol/µL and 1000 pmol/µL was prepared from the SSS, while 100 pmol/µL of AABA was prepared from the ISS for each CSS.

2.2 Sample preparation

Before sample preparation, gelatines from cold-water fish skin (G7041), bovine skin (G9382) and porcine skin (G6144) were freeze-dried to ensure that their moisture content was less than 10%. Portions of 0.1–0.2 g from each of 54 G7041, 50 G9382 and 54 G6144 gelatine samples were acid-hydrolysed with 5 mL of 6 N hydrochloric acid at 110 °C for 24 h. The hydrolysate was mixed with 100 pmol/µL AABA, diluted to 100 mL with distilled water and filtered through a 0.45 µm cellulose acetate membrane to produce mixture A.

2.3 Sample derivatization

A volume of 10 µL of mixture A was buffered with 70 µL of AccQ.Fluor borate buffer (Waters, Massachusetts, USA) to generate a pH of 8.2–10.0 and then derivatized with 20 µL of prepared AccQ.Fluor reagent (Waters, Massachusetts, USA) to produce mixture B. The prepared AccQ.Fluor reagent consists of a mixture of the 6-aminoquinolyl-N-hydroxysuccinimidyl carbamate (AQC) derivatizing reagent and AccQ.Fluor reagent diluent. Mixture B was heated at 55 °C for 10 min to ensure complete derivatization of the primary and secondary AAs by AQC prior to injection into the Ultra-High-Performance Liquid Chromatography Diode-Array Detector (UHPLC-DAD).

2.4 Amino acids analysis of gelatine samples

The binary solvent system consisted of (A) an aqueous solution of AccQ.Tag™ Eluent A concentrate (WAT052890) (1:10) and (B) acetonitrile in deionised water (60:40), both filtered through 0.45 µm membrane filters prior to injection. A volume of 1 µL of mixture B was injected into the UHPLC-DAD (Agilent, USA) and eluted by the mobile phases according to the following gradient: 0–0.5 min: 99.9% A, 0.1% B; 0.5 min: 99.9% A, 0.1% B; 5.7 min: 90.9% A, 9.1% B; 6.4 min: 87.4% A, 12.6% B; 6.6 min: 87% A, 13% B; 7.7 min: 78.8% A, 21.2% B; 8.0 min: 40.4% A, 59.6% B; 8.6 min: 40.4% A, 59.6% B; 8.73 min: 99.9% A, 0.1% B; and 9.5 min: 99.9% A, 0.1% B. The AAs in mixture B were separated on a Waters AccQ.Tag column (3.9 mm × 150 mm) at 1 mL/min and 36 °C and detected at 260 nm. The AA peaks were recorded in a chromatogram.

2.5 Method linearity and accuracy

The method linearity was established by injecting the 37.5 pmol/µL, 100 pmol/µL, 250 pmol/µL, 500 pmol/µL and 1000 pmol/µL CSS into the UHPLC-DAD. Each concentration was measured in triplicate, and the ratio of the peak area of the CSS to the peak area of AABA was plotted against the ratio of the CSS concentration to the AABA concentration. These plots yielded 17 linear regression lines with their respective slopes, intercepts and coefficients of determination (R2). A linear regression line with R2 close to 1.00 indicates method linearity [19].
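The regression described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the function name `fit_calibration` is hypothetical, and the peak-area ratios are invented values used only to exercise the calculation. The concentration levels and the 100 pmol/µL AABA level are taken from the text.

```python
def fit_calibration(conc_ratio, area_ratio):
    """Ordinary least-squares line through (concentration ratio, peak-area ratio) pairs.

    Returns the slope, the intercept and the coefficient of determination (R2).
    """
    n = len(conc_ratio)
    mean_x = sum(conc_ratio) / n
    mean_y = sum(area_ratio) / n
    sxx = sum((x - mean_x) ** 2 for x in conc_ratio)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(conc_ratio, area_ratio))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(conc_ratio, area_ratio))
    ss_tot = sum((y - mean_y) ** 2 for y in area_ratio)
    return slope, intercept, 1.0 - ss_res / ss_tot


# CSS levels (pmol/uL) from the text, divided by the 100 pmol/uL internal standard.
CSS_LEVELS = [37.5, 100.0, 250.0, 500.0, 1000.0]
AABA_LEVEL = 100.0
conc_ratio = [c / AABA_LEVEL for c in CSS_LEVELS]

# Hypothetical peak-area ratios (illustrative only, not measured data).
area_ratio = [0.36, 1.02, 2.47, 5.05, 9.96]

slope, intercept, r2 = fit_calibration(conc_ratio, area_ratio)
print(f"slope={slope:.4f} intercept={intercept:.4f} R2={r2:.4f}")
```

In practice one such fit would be made per amino acid, each using the mean of its triplicate injections.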

The method accuracy was assessed by repeated analysis of fish, bovine and porcine gelatines spiked with 50 pmol/µL, 250 pmol/µL and 1000 pmol/µL amino acid standards, which covered the working range. Each spiked concentration was analysed separately in ten replicates, and the percentage recovery of the spiked concentration was determined using the following formula:

$${\text{Recovery}}\;(\% ),\;R = \frac{{\bar{x}^{\prime} - \bar{x}}}{{x_{spike} }} \times 100$$

where \(\bar{x}^{\prime}\) is the mean concentration of the amino acid in the spiked gelatine, \(\bar{x}\) is the mean concentration of the amino acid in the unspiked gelatine and \(x_{spike}\) is the spiking concentration of the amino acid standard.
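As a hedged illustration of the recovery calculation, the sketch below assumes the spike-recovery convention of the Eurachem guide, in which the mean result of the unspiked sample is subtracted from the mean result of the spiked sample before dividing by the spiking concentration. The function name `recovery_percent` and all numbers are hypothetical.

```python
def recovery_percent(spiked_results, unspiked_results, spike_conc):
    """Relative spike recovery (%) from replicate results.

    Assumption: recovery = (mean spiked - mean unspiked) / spike * 100.
    """
    mean_spiked = sum(spiked_results) / len(spiked_results)
    mean_unspiked = sum(unspiked_results) / len(unspiked_results)
    return (mean_spiked - mean_unspiked) / spike_conc * 100.0


# Hypothetical replicate results in pmol/uL (illustrative only, not study data).
spiked = [345.0, 352.0, 349.0]
unspiked = [100.0, 102.0, 98.0]
print(f"recovery = {recovery_percent(spiked, unspiked, 250.0):.1f} %")
```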

2.6 Dataset pre-processing

The peak areas of the AAs from each gelatine were imported into a dataset table in XLSTAT 2016 software. The dataset of 54 fish, 50 bovine and 54 porcine gelatines with 17 identified AAs was pre-processed to facilitate the differentiation among samples. This pre-processing was applied to reduce the variation of the amino acids in the dataset. The pre-processing tests, consisting of the analysis of variance (ANOVA), box and whisker plot, dataset transformation, Kaiser–Meyer–Olkin (KMO) test and correlation test, together with the principal component analysis (PCA), were conducted using XLSTAT-Pro (2017) statistical software (Addinsoft, Paris, France).

2.6.1 Outlier removal by box and whisker plot

Prior to the ANOVA and principal component analysis (PCA), the individual datasets of fish, bovine and porcine gelatines were subjected to outlier removal using the box and whisker plot (BWP) method on the standardized dataset. The confidence interval of the BWP was set at 95%. The skewness of the BWP was examined to confirm the need for dataset transformation. An observation that showed a different pattern within an individual AA was discriminated and shown in the BWP as an outlier. An outlier value that exceeded three times the box’s height was marked with a dot, star or asterisk [20] and subjected to removal. After removing the outliers, the new dataset, consisting of 41 fish, 40 bovine and 45 porcine gelatines, was subjected to the KMO test.
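A box-plot outlier screen of this kind can be sketched as follows. The exact rule applied by XLSTAT is not stated in the text, so this sketch assumes the common fence convention: a value is flagged when it lies more than `k` box heights (interquartile ranges) below the first quartile or above the third quartile, with `k = 3.0` mirroring the "three times the box's height" criterion. The function names and the sample values are hypothetical.

```python
import statistics


def box_fences(values, k=3.0):
    """Lower and upper fences placed k box heights beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_height = q3 - q1
    return q1 - k * box_height, q3 + k * box_height


def split_outliers(values, k=3.0):
    """Return (kept, flagged) lists for one amino acid within one gelatine type."""
    low, high = box_fences(values, k)
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if v < low or v > high]
    return kept, flagged


# Hypothetical concentrations for a single amino acid (illustrative only).
sample = [10, 11, 12, 13, 14, 15, 16, 100]
kept, flagged = split_outliers(sample)
print("kept:", kept)
print("flagged:", flagged)
```

Because the fences are built from quartiles, the screen gives the same flags whether it is applied to raw or standardized values.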

2.6.2 Analysis of variance

Results were expressed as the mean ± standard deviation of triplicates for the distribution of AAs in the gelatines. ANOVA followed by Tukey’s test was applied to determine significant differences between the means of the AAs at the 95% confidence level (p < 0.05).

2.6.3 Kaiser–Meyer–Olkin test

The dataset was analysed for adequacy using the KMO test. An adequate dataset determines the ability of a generated model to extract latent variables from the dataset. In this study, the KMO test was employed at a significance level (α) of 0.05. The calculated KMO was ranked as follows: KMO < 0.5, inadequate; 0.5 < KMO < 0.7, mediocre; 0.7 < KMO < 0.8, good; 0.8 < KMO < 0.9, very good; and KMO > 0.9, excellent. Only KMO > 0.5 was acceptable for PCA [16, 21].
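For readers who want to reproduce the adequacy check outside XLSTAT, the overall KMO statistic can be sketched with NumPy as below. It uses the standard definition (sum of squared correlations divided by the sum of squared correlations plus squared partial correlations, off-diagonal elements only). The function names are hypothetical, the demonstration data are synthetic, and the handling of values that fall exactly on a class boundary is an assumption, since the text leaves the boundaries open.

```python
import numpy as np


def kmo_overall(data):
    """Overall Kaiser-Meyer-Olkin measure; rows are samples, columns are variables."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations from the inverse correlation matrix.
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    p2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + p2)


def kmo_rank(kmo):
    """Adequacy class following the ranking quoted in the text."""
    if kmo < 0.5:
        return "inadequate"
    if kmo < 0.7:
        return "mediocre"
    if kmo < 0.8:
        return "good"
    if kmo < 0.9:
        return "very good"
    return "excellent"


# Synthetic data: four variables sharing one common factor (not gelatine data).
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
demo = common + 0.5 * rng.normal(size=(200, 4))
kmo = kmo_overall(demo)
print(f"KMO = {kmo:.3f} ({kmo_rank(kmo)})")
```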

2.6.4 Dataset transformation

To ensure that the dataset followed a normal distribution before the PCA, the dataset normality was tested using the Shapiro–Wilk, Anderson–Darling and Lilliefors tests at α of 0.05. The dataset was transformed using the standardize (n-1), standardize (n), centre, standard deviation⁻¹ (n-1), standard deviation⁻¹ (n), rescale from 0 to 1, rescale from 0 to 100, Pareto and log transformation methods. For the log transformation, AA values of 0 pmol/L were removed before the transformation. The transformation of each AA was employed to ensure that the transformed dataset remained close to the original AA dataset [22]. The normal distribution of the transformed dataset was evaluated using the Shapiro–Wilk, Anderson–Darling and Lilliefors normality tests at α of 0.05. The best transformation method and normality test were selected from the results.
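Several of the transformations listed above can be written out explicitly. The sketch below uses textbook definitions and is not taken from the XLSTAT implementation: the function names are hypothetical, the Pareto scaling is assumed to divide the centred values by the square root of the (n-1) standard deviation, and the log transformation is assumed to be base 10. Zero values are dropped before taking logarithms, as described in the text.

```python
import math
import statistics


def standardize(values, sample=True):
    """Centre on the mean and divide by the SD; sample=True uses the n-1 denominator."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values) if sample else statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def rescale(values, upper=1.0):
    """Linear rescaling onto the interval from 0 to `upper` (1 or 100 in the text)."""
    low, high = min(values), max(values)
    return [upper * (v - low) / (high - low) for v in values]


def pareto(values):
    """Pareto scaling (assumed): centred values divided by the square root of the SD."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / math.sqrt(sd) for v in values]


def log_transform(values):
    """Base-10 logarithm (assumed base); zero values are removed first."""
    return [math.log10(v) for v in values if v > 0]


# Hypothetical values for one amino acid (illustrative only).
raw = [0.0, 10.0, 100.0, 1000.0]
print(standardize(raw))
print(rescale(raw, upper=100.0))
print(log_transform(raw))
```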

2.6.5 Correlation test

The correlation between AAs was studied using the Pearson correlation to measure the strength (weak, moderate or strong) and direction (positive or negative) of the linear relationship between two AAs. The formula of the Pearson correlation matrix (R) of AAs a and b is as follows:

$$R_{ab} = \frac{{n\sum a_{i} b_{i} - \sum a_{i} \sum b_{i} }}{{\sqrt {n\sum a_{i}^{2} - (\sum a_{i} )^{2} } \times \sqrt {n\sum b_{i}^{2} - (\sum b_{i} )^{2} } }}$$

where n is the number of gelatines, \(a_{i}\) is the value of AA a for the ith gelatine and \(b_{i}\) is the value of AA b for the ith gelatine. In this study, strong, moderate and weak correlations were determined from the correlation matrix (CM) value: |0.000| < R < |0.300| for weak, |0.300| < R < |0.700| for moderate and |0.700| < R < |1.000| for strong CMs.
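The Pearson formula and the strength classes can be checked with a short script. The sketch below implements the computational form of the coefficient term by term; the function names are hypothetical, the input values are invented, and the assignment of values lying exactly on 0.300 or 0.700 to the higher class is an assumption, since the text gives open intervals.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equally long lists of AA values."""
    n = len(a)
    sum_a, sum_b = sum(a), sum(b)
    sum_ab = sum(x * y for x, y in zip(a, b))
    sum_a2 = sum(x * x for x in a)
    sum_b2 = sum(y * y for y in b)
    numerator = n * sum_ab - sum_a * sum_b
    denominator = (math.sqrt(n * sum_a2 - sum_a ** 2)
                   * math.sqrt(n * sum_b2 - sum_b ** 2))
    return numerator / denominator


def correlation_strength(r):
    """Classify |R| as weak, moderate or strong using the ranges in the text."""
    magnitude = abs(r)
    if magnitude < 0.3:
        return "weak"
    if magnitude < 0.7:
        return "moderate"
    return "strong"


# Hypothetical values of two amino acids across five gelatines (illustrative only).
aa_a = [1.0, 2.0, 3.0, 4.0, 5.0]
aa_b = [9.8, 8.1, 6.2, 3.9, 2.0]
r = pearson_r(aa_a, aa_b)
print(f"R = {r:.3f} ({correlation_strength(r)}, {'positive' if r > 0 else 'negative'})")
```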

2.7 Dataset exploratory by principal component analysis

The PCA based on the Pearson correlation was applied to recognize the dataset pattern, explore the contribution of each AA to the gelatines, find and explain the variance of intercorrelated AAs and transform the dataset into a smaller set of new independent variables called principal components (PCs). The principle of PCA was to reduce significantly (p < 0.05) the dataset dimensionality to achieve these aims. The PCs can be expressed as:

$$C_{xy} = a_{x1} b_{1y} + a_{x2} b_{2y} + a_{x3} b_{3y} + \cdots + a_{xn} b_{ny}$$

where C is the component score, a is the factor loading (FL), b is the AA concentration, x is the PC number, y is the sample number and n is the total number of the AA.

In this study, the number of PCs selected was based on a cumulative variability of ≥ 75% and an eigenvalue criterion of ≥ 1. An AA with FL ≥ |0.750| was considered to have a strong factor loading, while AAs with |0.500| < FL < |0.749| and FL ≤ |0.499| were considered to have moderate and weak factor loadings, respectively [23]. To determine the contribution of AAs to the gelatine sources, the correlation, symmetric and distance biplots were plotted; the biplot that most clearly showed the gelatine groupings was selected, and the AAs that contributed to each grouping were examined and further explained.
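The selection rules above can be illustrated with a small NumPy sketch of a correlation-matrix PCA. It is an assumption-laden illustration rather than the XLSTAT procedure: factor loadings are taken as eigenvectors scaled by the square root of their eigenvalues, the function names are hypothetical, the demonstration data are synthetic, and loadings between the quoted class limits (for example 0.7495) are assigned to the lower class by the thresholds chosen here.

```python
import numpy as np


def pca_on_correlation(data):
    """PCA of the Pearson correlation matrix; rows are samples, columns are variables."""
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)   # returned in ascending order
    order = np.argsort(eigenvalues)[::-1]              # largest eigenvalue first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Factor loading = eigenvector scaled by the square root of its eigenvalue.
    loadings = eigenvectors * np.sqrt(np.clip(eigenvalues, 0.0, None))
    variability = 100.0 * eigenvalues / eigenvalues.sum()
    return eigenvalues, loadings, variability


def count_retained(eigenvalues, min_eigenvalue=1.0):
    """Number of PCs meeting the eigenvalue criterion."""
    return int(np.sum(eigenvalues >= min_eigenvalue))


def loading_strength(fl):
    """Classify a factor loading as strong, moderate or weak."""
    magnitude = abs(fl)
    if magnitude >= 0.75:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    return "weak"


# Synthetic data: three correlated variables plus one independent variable.
rng = np.random.default_rng(0)
factor = rng.normal(size=(100, 1))
demo = np.hstack([factor + 0.3 * rng.normal(size=(100, 3)),
                  rng.normal(size=(100, 1))])
eigenvalues, loadings, variability = pca_on_correlation(demo)
n_retained = count_retained(eigenvalues)
cumulative = np.cumsum(variability)
print("retained PCs:", n_retained)
print("cumulative variability (%):", cumulative.round(1))
print("PC1 loading classes:", [loading_strength(v) for v in loadings[:, 0]])
```

The cumulative variability of the retained PCs can then be compared against the 75% criterion before the biplots are drawn.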

2.8 Apportionment of amino acids in gelatine groupings by principal component analysis

The AAs with moderate and weak FL and the AAs that did not contribute to the gelatine grouping (Ile, Glu, Ala, Lys and Asp) were removed from the gelatine dataset. Using the 12 remaining AAs with strong FL, a new PCA was executed to produce a new AA plot and correlation biplot. The FL and correlations of the AAs were assessed, the apportionment of AAs to the gelatine groupings was examined, and the improvement in the gelatine groupings was observed and compared with the previous PCA of 17 AAs.

3 Result and discussion

3.1 Method linearity and accuracy

Table 1 exhibits the linearity of the CSS by providing the linear regression line for each amino acid. The relationships between amino acid concentration and peak area were linear over the 37.5–1000 pmol/µL range, with R2 between 0.96 and 1.00. Since R2 > 0.95 denotes a strong linear relationship between the amino acid concentration and the peak area [19], the R2 values established in this study indicated method linearity.

Table 1 Linearity and accuracy of method of analysis for amino acid

Table 1 also presents the method accuracy; overall, the amino acids had method recoveries between 85 and 111%. The 50 pmol/µL, 250 pmol/µL and 1000 pmol/µL amino acid standards rendered recoveries of 85.3–110.7%, 92.4–111.4% and 87.6–110.8%, respectively. Among the standards, the 50 pmol/µL Leu had the lowest recovery of 85.3%, while the 1000 pmol/µL Gly had the highest recovery. These results indicated that the method bias was minute, which was in accordance with the recovery result reported by Azilawati [19] and fulfilled the requirement of the Eurachem guidelines [24].

3.2 Amino acids concentration in fish, bovine and porcine gelatines

The concentrations of individual AAs in fish, bovine and porcine gelatines are presented in Table 2, and the chromatograms of the fish, bovine and porcine gelatines are depicted in Fig. 1. For fish gelatine, the lowest and highest AA concentrations were Tyr, at a mean value of 113 pmol/L, and Gly, at a mean value of 12,268 pmol/L, respectively. The ranking of AA concentrations in fish gelatine was as follows: Gly > Ala > Pro > Arg > Glu > Hyp > Ser > Lys > Asp > Thr > Val > Leu > Met > Phe > Ile > His > Tyr. This ranking supported the finding of AAs in Chitala ornata gelatines, with a Gly > Ala > Pro ranking [25], although other studies reported slight differences; Gly, Pro and Ala were the most abundant AAs, with a Gly > Pro > Ala ranking, in Oreochromis nilotica, Clarias batrachus and Pangasius sutchi [26]. To date, many studies of fish-sourced gelatine have found Gly at the highest concentration, in Epinephelus sexfasciatus, Lutjanus argentimaculatus, Rastrelliger kanagurta and Pristipomoides typus [27], Oreochromis mossambicus [28] and Probarbus jullieni [29], as in the fish skin in our study. Jamilah et al. [26] also identified Leu > Lys > Tyr as the AAs with the lowest concentrations in Oreochromis nilotica, which lacked the His and Phe obtained in our finding. These different findings arose because our fish gelatine was from cold-water fish skin, while Jamilah et al. [26] analysed freshwater fish skin.

Table 2 The concentration of amino acids in fish, bovine and porcine gelatines

Fig. 1 Chromatogram of amino acids of (a) fish, (b) bovine and (c) porcine gelatines

The ranking of AA concentrations in bovine gelatine was as follows: Gly > Pro > Hyp > Ala > Glu > Arg > Lys > Asp > Ser > Leu > Val > Thr > Phe > Ile > Met > Tyr > His (Table 2). The presence of Gly, Pro and Hyp was anticipated since these AAs are the building blocks of collagen; thus, bovine gelatine is preferred for producing collagen products [30]. Likewise, Gly and Pro had the highest concentrations in the same type of bovine skin [12], in agreement with our finding, although Hyp was not analysed in Hafidz’s study.

The ranking of AA concentrations in porcine gelatine was as follows: Gly > Pro > Hyp > Ala > Glu > Arg > Lys > Asp > Ser > Leu > Val > Thr > Phe > Ile > Met > Tyr > His. Azilawati et al. [10] identified Gly, Pro and Hyp as the AAs with the highest concentrations and Tyr, Met and Ile as those with the lowest concentrations in porcine gelatine. The difference in Tyr, Met and Ile from our finding might be due to the UHPLC-DAD employed in our study.

By comparing the amino acid rankings, bovine and porcine gelatines showed a similar ranking, while the Ala, Arg and Glu concentrations differentiated the fish gelatine from the bovine and porcine gelatines. These findings indicated that (1) fish gelatine could be clearly discriminated from bovine and porcine gelatines and (2) bovine and porcine gelatines could hardly be differentiated via UHPLC-DAD alone. Some studies suggested applying multivariate dataset analysis to discriminate observations with similar distributions [13, 14]. Nevertheless, the discrimination of these gelatines could be achieved by assessing the AAs that were significantly different (p < 0.05) among the gelatines in Table 2. The table exhibited the highest concentrations of Ser, Arg, Gly, Thr and Met in fish gelatine and of Hyp, Pro, Tyr, Val, Leu and Phe in porcine gelatine. These AAs may characterize the gelatine sources, as with the concentrations of Hyp, Ser, Thr, Pro and Met discovered by Azilawati et al. [10] and of Gly, Pro and Hyp by Gómez-Guillén et al. [31]. Azilawati et al. [10] also recommended applying multivariate dataset analysis, such as principal component analysis (PCA) [18], to assist the exploration of the gelatine dataset. Although only limited studies have employed PCA in gelatine studies, the exploration of the gelatine dataset may contribute to upholding food integrity, since gelatine that does not represent the source it claims may erode confidence in a gelatine product, especially among vegetarians [6], Jews [7] and Muslim consumers [8]. A false claim of the gelatine source may also expose consumers to potential infection with bovine spongiform encephalopathy or ‘mad cow disease’ [10].

3.3 Outlier removal

Figure 2 shows the BWP of fish, bovine and porcine gelatines. The BWP of fish gelatine in Fig. 2a demonstrated outliers in Hyp (1987 pmol/L), Ser (1807 pmol/L), Arg (2130 pmol/L), Gly (13,215 pmol/L), Thr (13,215 pmol/L) and Met (549 pmol/L), where observations appeared beyond the box’s whiskers [32]. All of these outliers were of AAs with concentrations lower than the mean, except Gly. The dataset of fish gelatine in Fig. 2a exhibited five AAs (Hyp, Ser, Pro, Tyr and Met) whose median value equalled the mean value, where the mean value is marked with a red plus. His, Arg, Gly, Thr and Phe demonstrated a right-skewed distribution because the median value > mean value. In addition, Asp, Glu, Ala, Lys, Val, Ile and Leu showed a left-skewed distribution since the median value < mean value [15].

Fig. 2 Box plots of standardized amino acids of (a) fish, (b) bovine and (c) porcine gelatines

The dataset of bovine gelatine in Fig. 2b showed that Ser and Arg had median value = mean value. The other AAs demonstrated a right-skewed distribution, except for Tyr and Phe. The BWP of bovine gelatine in Fig. 2b also showed outliers in Arg, Gly and Phe, where observations appeared beyond the box’s whiskers. All of these outliers were of AAs with low concentrations.

The dataset of porcine gelatine in Fig. 2c exhibited three AAs (Pro, Val and Ile) that had median value = mean value. Hyp, Arg, Gly, Tyr, Met, Leu and Phe demonstrated a right-skewed distribution because the median value > mean value. In addition, Ser, Asp, Glu, Thr, Ala and Lys showed a left-skewed distribution. The BWP of porcine gelatine in Fig. 2c demonstrated outliers in Hyp, His, Val and Ile, where observations appeared beyond the box’s whiskers. Of these, the outliers of Hyp, Val and Ile were of AAs with low concentrations, unlike His. Hyp, His and Val exhibited more than one outlier.

The removal of outliers reduced the dataset distortion of the PCA, as achieved in the AA study by Azilawati et al. [10]. After outlier removal via BWP, the new dataset comprised 41 fish, 40 bovine and 45 porcine gelatines. The AAs with skewed distributions in the BWP were also subjected to dataset normalization before the PCA.

3.4 Kaiser–Meyer–Olkin test

The KMO test calculated a value of 0.7542 for the new dataset. Studies from different fields have set appropriate KMO values prior to the KMO test: KMO > 0.6 for studies on sleep quality [33], customer attitude [34] and seafood spoilage [35], and KMO > 0.7 for water quality assessment [36]. To the authors’ knowledge, very few studies have tested their dataset adequacy with the KMO test prior to dataset transformation and PCA. This absence was evident in the gelatine studies by Azilawati et al. [10] and Widyaninggar et al. [18], which may render erroneous findings owing to a small dataset for PCA [16]. Moreover, the preliminary study of AAs in gelatine by Widyaninggar et al. [18] did not state the total number of samples. Based on these studies, KMO > 0.6 has been accepted as appropriate before performing PCA. Since our KMO value was between 0.7 and 0.8, the dataset was deemed suitable for PCA and subjected to dataset transformation and PCA.

3.5 Dataset transformation

This study investigated issues of (1) which dataset transformation was suitable for AAs analysis in gelatine and (2) which normality test was the best to examine dataset normality.

The most common dataset transformation was the standardize (n-1), followed by other transformations deemed suitable according to sample type. After the dataset was tested with all dataset transformations, the AAs of the transformed dataset still demonstrated non-normal distributions, except under the log transformation.

Prior to dataset transformation, the Shapiro–Wilk normality test showed that three AAs, Asp, Ala and Lys, followed a normal distribution (Table 3). After log transformation, the normality test for Glu gave a p-value of 0.1642, which indicated a normal distribution. However, the log transformation also changed Ala and Lys to non-normal distributions, with p-values changing from 0.0940 to 0.0145 and from 0.0794 to 0.0045, respectively. Since the premise of normal distribution applied to the individual AAs, Asp, Ala, Lys and Glu conformed to the normal distribution after dataset transformation.

Table 3 Normality test of Shapiro–Wilk after dataset transformations

The Anderson–Darling normality test showed that Ala, Lys and Val followed a normal distribution (Table 4) before the dataset transformation. Asp and Glu also showed a normal distribution after log transformation, with p-values of 0.1703 and 0.1023, respectively. Nonetheless, Lys exhibited the opposite result after the same transformation, as its p-value changed from 0.1274 to 0.0380. On this basis, Ala, Lys, Val, Asp and Glu followed the normal distribution after the dataset transformation.

Table 4 Normality test of Anderson–Darling after dataset transformations

The Lilliefors normality test showed that Asp, Ala, Lys, Val and Ile followed the normal distribution prior to the dataset transformation (Table 5). After the log transformation, His and Glu exhibited normal distributions with p-values of 0.0723 and 0.0717, respectively, while none of the AAs changed to a non-normal distribution after the log transformation. Notably, the AAs of porcine gelatine contained 0 pmol/L of His, for which the log transformation is undefined. This study removed the 0 pmol/L values before the log transformation and normality test; hence, the p-value of His can be omitted because not all His concentrations were included in this dataset transformation. Consequently, Asp, Ala, Lys, Val, Ile and Glu achieved the normal distribution after the dataset transformation. Notably, Glu was the only AA that followed the normal distribution after log transformation under all three normality tests.

Table 5 Normality test of Lilliefors after dataset transformation

Other studies have investigated the suitability of different dataset transformations for a specific matrix before confirming the most suitable one. For instance, the log, Box-Cox, standardize (n), 0 to 100 rescaling and Pareto transformations were confirmed suitable for microbiological, freshwater, sugarcane spirit [37], water quality [38] and plant volatile [39] matrices, respectively. For the gelatine matrix, Azilawati et al. [10] transformed the gelatine dataset via standardize (n-1) but did not test the same dataset with other transformations; our study showed that the log transformation was also suitable for the gelatine matrix, provided the dataset only has non-zero values. On the contrary, a study of a medical dataset recommended omitting any dataset transformation because the dataset distribution will never achieve a normal distribution [40]. Our study suggests testing the gelatine dataset with different transformations before confirming the most suitable transformation for the gelatine matrix. Hence, for further analysis in this study, 16 AAs were transformed using standardize (n-1), while Glu was transformed using the log transformation.

This study also recommends the Lilliefors test as the best normality test for investigating the dataset normality of AAs in gelatine, contrary to the study of Razali et al. [41], which found the Shapiro–Wilk test to be the best normality test. However, Razali et al. [41] also found that the Shapiro–Wilk test was ineffective at low sample sizes (n < 10,000), which is in line with our result given our low sample size (n = 125). Nonetheless, Bower [42] accepted a non-normal distribution prior to further dataset analysis because (1) the instrumental measurement of a food sample is deemed a continuous dataset that follows a normal distribution and (2) a sample size of n > 100 already satisfies the central limit theorem of the normal distribution [40, 43, 44].

3.6 Correlation test

The correlation test sought to select a suitable correlation type for determining correlations between two AAs. The XLSTAT2017 software provided two correlation tests: Pearson and Spearman. This study employed the Pearson correlation to investigate the strength (weak, moderate or strong) and direction (positive or negative) of the linear relationship between two AAs. Our study preferred the Pearson correlation to the Spearman correlation since the former investigates the linear relationship between two continuous or parametric variables, i.e. AAs. In contrast, the Spearman correlation is more suitable for non-parametric datasets such as ordinal or ranking datasets [45]. Thus, our study used the Pearson correlation to explore the gelatine dataset.

Table 6 shows the correlation matrix (CM) of the AAs in fish, bovine and porcine gelatines. The CM measures the strength and direction of the linear relationship between two AAs. The CM indicated positive and negative correlations among the AAs at |0.700| < R < |1.000| for strong CM, |0.300| < R < |0.700| for moderate CM and |0.000| < R < |0.300| for weak CM, as in the generic guideline [46], although various research fields applied lower ranges for strong CM: |0.500| < R < |1.000| in household food security [47] and in phenolic compounds of mungbean [48]. A study on AAs in Limonia acidissima fruit adopted the generic guideline of |0.700| < R < |1.000| for strong CM [49]; therefore, our study used this range to study the strong CMs of the AAs. Table 6 exhibited the CMs of the 17 AAs: Ser had 7 strong correlation matrices (CMs); Hyp and His had 6 strong CMs; Thr, Pro, Met and Leu had 5 strong CMs; Arg and Val had 4 strong CMs; Asp, Ala, Glu and Lys had 3 strong CMs; Tyr and Phe had 2 strong CMs; Gly had 1 strong CM; and Ile had no strong CM. Although these AAs rendered strong CMs, the CM value only explains the correlation between two AAs [46], whereas the correlations among more than two AAs bring more meaningful information, a purpose that the PCA can serve [1].

Table 6 Correlation matrix and factor loading of amino acids in bovine, porcine and fish gelatines

3.7 Dataset exploratory by principal component analysis

The aim of using PCA in this study was to explore the gelatine dataset by (1) identifying AAs with different factor loadings, (2) differentiating type of biplots to study the correlation of the AAs and gelatine sources and (3) proposing the significant AAs that might contribute to the gelatine sources.

The exploration of the dataset via PCA demonstrated PC1, PC2 and PC3 with eigenvalue (EV) > 1 [50], which explained 93.366% cumulative variability (CV) of the dataset (Table 6). The EV and percentage variability (PV) reflect the size and significance of a PC (p < 0.05), whereby PC1 has a larger EV than the next PC. This supported our result that the EVs decreased as the PC number increased, i.e. PC1 (EV = 7.656, PV = 45.036) > PC2 (EV = 5.436, PV = 31.977). Although there is no cut-off value of EV or PV, our study adopted the suggested EV > 1 as a principal guideline for food composition studies [50] and further determined the PC selection based on the PC numbers that explained the highest CV [51] via visual dimension. According to this principle, PC1 and PC2 explained 77.013% CV at an EV of 5.436 in this study. Furthermore, although the PCA generated three PCs, only PC1 and PC2 provided AAs with FL ≥ |0.75|: Hyp, His, Ser, Thr, Pro, Met and Leu in PC1 and Asp, Glu, Ala and Lys in PC2. The AAs with moderate FL (|0.500| < FL < |0.749|) were Arg, Gly and Val in PC1, and those with weak FL (FL ≤ |0.499|) were Tyr, Ile and Phe in PC1. Our study found that the AAs with strong FL in PC1 had polar and nonpolar side chains: Hyp, His, Ser and Thr had polar side chains, while Pro, Met and Leu had nonpolar side chains. The AAs with strong FL in PC2 had polar side chains. The AAs with moderate and weak FL in PC1 had nonpolar side chains [52]. Therefore, the PCA had identified the PCs according to the polarity of the AA side chains.
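As a quick arithmetic check, assuming the PCA was computed on the Pearson correlation matrix of the p = 17 AAs (so that the eigenvalues sum to 17), the reported percentage variabilities can be recovered from the eigenvalues. The symbol p is introduced here for illustration only.

$$\text{PV}_{k} = \frac{{\text{EV}_{k} }}{p} \times 100,\quad \text{PV}_{1} = \frac{7.656}{17} \times 100 \approx 45.04,\quad \text{PV}_{2} = \frac{5.436}{17} \times 100 \approx 31.98,\quad \text{CV}_{1,2} = \text{PV}_{1} + \text{PV}_{2} \approx 77.01$$

These values agree with the PV of PC1 and PC2 and with the 77.013% CV quoted for the first two PCs.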

The AAs with strong FL in PC1 were the same AAs that had 7, 6 and 5 strong CMs in the correlation test (Table 6). In contrast, the AAs with strong FL in PC2 were those with 3 strong CMs in the correlation matrix. Surprisingly, the AAs with moderate FL were those with 1 and 4 strong CMs, and the AAs with weak FL were those with 0 and 2 strong CMs. This finding indicated that the AAs with 0 and 2 strong CMs contributed the least to the dataset variability and, because dataset variability assists the determination of gelatine sources, the least to source determination.
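The strong, moderate and weak FL bands used in this section can be written as a small classifier. This is a minimal sketch: the thresholds follow the text, values falling between the quoted three-decimal bounds are assigned to the lower band by assumption, and the loadings in the example are made-up numbers, not values from Table 6.

```python
def classify_loading(fl):
    """Label a factor loading by its absolute size.

    Bands follow the text: strong |FL| >= 0.75, moderate around 0.50 to 0.749,
    weak below 0.50. Boundary handling here is an assumption.
    """
    size = abs(fl)
    if size >= 0.75:
        return "strong"
    if size >= 0.50:
        return "moderate"
    return "weak"

# Hypothetical loadings for illustration only (not taken from Table 6).
example = {"AA_1": -0.91, "AA_2": 0.62, "AA_3": 0.18}
labels = {name: classify_loading(fl) for name, fl in example.items()}
print(labels)
```

The sign of the loading is ignored for the band, since the band describes strength only; the sign carries the direction of the relationship.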

The positive and negative signs of the FL (Table 6), or of the AA plot in Fig. 3a, indicate the direction of the linear relationship between two AAs. A positive correlation means that an increase in one AA is accompanied by an increase in the other, while a negative correlation means that the two change inversely. AAs whose FL have the same sign are positively correlated, and AAs whose FL have opposite signs are negatively correlated. Based on this principle, Hyp, Pro and Leu were positively correlated with one another and negatively correlated with His, Ser, Thr and Met, which agreed with Azilawati et al. [10], except that our study (1) did not identify the positive correlation of Ile and Val and (2) identified His as an additional AA. For PC2, Asp, Glu, Ala and Lys were positively correlated, which differed slightly from Azilawati et al. [10], in whose result Ala and Lys were absent.

Fig. 3 a Amino acid plot and biplot of b correlation, c symmetric, and d distance of amino acids and gelatines

Figure 3a shows the AAs with weak or no correlation, i.e. those whose FL directions were separated by ~90° [53]: Met, Thr, Ser and His against Glu, Ala and Lys; Hyp against Glu, Ala and Lys; and Val, Pro and Leu against Asp, Gly and Arg. These results supported the CM in Table 6 but were not discussed in Azilawati et al. [10] and did not agree with the result of Widyaninggar et al. [18]. Differences in instrumentation set-up, low sample numbers or a lack of dataset pre-processing before PCA might have caused these discrepancies.
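The reading of angles in a loading plot can be illustrated numerically. In the sketch below, the angle between two loading vectors in the PC1/PC2 plane is computed from their cosine: a small angle suggests positive correlation, an angle near 180° suggests negative correlation, and an angle near 90° suggests weak or no correlation. This reading is only approximate when the two PCs do not represent the variables well. The loading vectors and the 15° tolerance are hypothetical choices for illustration, not values from this study.

```python
import numpy as np

def loading_angle_deg(u, v):
    """Angle in degrees between two loading vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def describe(angle, tol=15.0):
    """Interpret an angle; tol is a hypothetical tolerance around 90 degrees."""
    if abs(angle - 90.0) <= tol:
        return "weak or no correlation"
    return "positive" if angle < 90.0 else "negative"

# Hypothetical (PC1, PC2) loadings for four variables.
a, b, c, d = (0.9, 0.1), (0.8, 0.3), (-0.85, 0.2), (0.05, 0.9)
for name, other in [("a-b", b), ("a-c", c), ("a-d", d)]:
    angle = loading_angle_deg(a, other)
    print(name, round(angle, 1), describe(angle))
```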

3.8 Apportionment of amino acids in gelatine groupings by principal component analysis

The biplots in Fig. 3 showed the AAs that significantly contributed to the gelatine sources. This study compared the correlation (CB), symmetric (SB) and distance (DB) biplots to determine which gave the best visual display of the AAs and gelatine sources. All biplots showed the same pattern of AA and gelatine distributions, and the strong FL and correlations in the biplots supported the FL results in Table 6. Generally, each biplot exhibited three gelatine groupings, and the biplots did not differ visually in this respect. However, the CB in Fig. 3b showed the gelatines in close proximity within each grouping, while the SB in Fig. 3c showed a little space among the gelatines in each grouping. The DB in Fig. 3d showed the largest space among the gelatines in each grouping, although the three gelatine groupings remained separated. Hence, we proposed the CB as the best visual display for exploring the gelatine dataset, because the gelatines within each grouping were closer to each other than in the SB and DB. In the CB, fish (100%), bovine (65%) and porcine (71%) gelatines dominated groupings 1, 2 and 3, respectively; hence, we may conclude that the fish, bovine and porcine gelatines represented groupings 1, 2 and 3, respectively (Fig. 3b).
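One common way to obtain the three biplot types is to split the singular values of the standardized data between observation scores and variable loadings. The sketch below uses that generic scaling (exponent 0 for a correlation-type biplot, 0.5 for symmetric, 1 for distance); it is an assumption that the software used in this study follows the same convention, constant factors such as the square root of n - 1 are omitted, and the data are synthetic. The function name `biplot_coordinates` is illustrative.

```python
import numpy as np

def biplot_coordinates(X, alpha, n_components=2):
    """Scores and loadings for a biplot with singular-value exponent alpha.

    alpha = 0   -> correlation-type biplot (singular values on the variables)
    alpha = 0.5 -> symmetric biplot
    alpha = 1   -> distance biplot (singular values on the observations)
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize columns
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    k = n_components
    scores = U[:, :k] * s[:k] ** alpha
    loadings = Vt[:k].T * s[:k] ** (1.0 - alpha)
    return scores, loadings

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))        # synthetic data, not the gelatine dataset
coords = {}
for name, alpha in [("correlation", 0.0), ("symmetric", 0.5), ("distance", 1.0)]:
    coords[name] = biplot_coordinates(X, alpha)
    scores, _ = coords[name]
    print(name, np.round(np.ptp(scores, axis=0), 2))   # spread of observations
```

Under this scaling, all three biplots reproduce the same rank-2 approximation of the data and differ only in how spread out the observations appear, which matches the observation that the groupings stay the same while the spacing among gelatines grows from CB to SB to DB.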

Fig. 3b showed which AAs significantly contributed (p < 0.05) to each gelatine grouping, determined by examining which variables fell into each observation grouping [54]: (1) high concentration of Ser, Arg, Gly, Thr, Met, His and Ile and low concentration of Hyp for fish gelatine; (2) high concentration of Hyp and low concentration of Met, Thr, Ser, His, Gly, Arg and Ile for bovine gelatine; and (3) high concentration of Pro, Tyr, Val, Leu and Phe for porcine gelatine. The association of Ser, Thr and Met with fish gelatine was in line with Azilawati et al. [10]. Nonetheless, the AAs associated with bovine and porcine gelatines contradicted Azilawati et al. [10].

In Fig. 3b, Glu, Ala, Lys and Asp did not contribute to any gelatine grouping, while Ile had moderate FL; thus our study removed these AAs from the dataset and performed another PCA to assess (1) the FL and AA correlations, (2) the AA contributions to the gelatine groupings and (3) the gelatine types in the gelatine groupings. The AA plot in Fig. 4a showed a higher CV, at 90.89%, than that in Fig. 3a. The FLs were visually improved and maintained their strength and direction (Fig. 4a). The apportionment of the AAs to the gelatine groupings was also maintained (Fig. 4b) as follows: (1) high concentration of Ser, Arg, Gly, Thr, Met and His and low concentration of Hyp for fish gelatine; (2) high concentration of Hyp and low concentration of Met, Thr, Ser, His, Gly and Arg for bovine gelatine; and (3) high concentration of Pro, Tyr, Val, Leu and Phe for porcine gelatine. Fig. 4b also showed that groupings 1, 2 and 3 consisted of 100% fish, 100% bovine and 100% porcine gelatines, respectively, an improvement on the gelatine grouping in the CB in Fig. 3b. These results showed that retaining the AAs with strong FL improved the CB and the gelatine grouping. Hence, our study selected Hyp, His, Ser, Arg, Gly, Thr, Pro, Tyr, Met, Val, Leu and Phe as the biomarkers for identification of the gelatine source.
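The step of removing non-contributing variables and repeating the PCA can be sketched as follows. The example compares the CV of the first two PCs before and after dropping selected columns. It uses synthetic data built from two latent factors plus three unrelated noise columns, so the column names and the size of the improvement are illustrative assumptions and do not reproduce the gelatine dataset.

```python
import numpy as np

def cumulative_variability(X, n_components=2):
    """CV (%) of the first n_components PCs of a correlation-matrix PCA."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    ev = np.sort(np.linalg.eigvalsh(R))[::-1]      # eigenvalues, largest first
    return float(ev[:n_components].sum() / ev.sum() * 100.0)

# Synthetic data: six columns driven by two latent factors, three noise columns.
rng = np.random.default_rng(1)
n = 30
f1, f2 = rng.normal(size=n), rng.normal(size=n)
signal = np.column_stack([f1, -f1, f1, f2, f2, -f2]) + 0.1 * rng.normal(size=(n, 6))
noise = rng.normal(size=(n, 3))
X = np.column_stack([signal, noise])
names = ["s1", "s2", "s3", "s4", "s5", "s6", "n1", "n2", "n3"]  # hypothetical names

drop = {"n1", "n2", "n3"}                           # variables to remove
keep = [i for i, name in enumerate(names) if name not in drop]

cv_before = cumulative_variability(X)
cv_after = cumulative_variability(X[:, keep])
print(round(cv_before, 2), round(cv_after, 2))
```

A rise in CV after dropping variables shows only that the first two PCs summarize the remaining variables better; whether the groupings of observations also improve has to be checked on the biplot, as was done with Fig. 4b.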

Fig. 4 a Amino acid plot and b correlation biplot of strong factor loading amino acids on fish, bovine and porcine gelatines

The AAs that significantly contributed (p < 0.05) to the fish gelatine agreed with our ANOVA (Table 2), except that the CB rendered His and Hyp as additional AAs. For porcine gelatine, the CB (Fig. 4b) identified all the significant AAs (p < 0.05) found by ANOVA (Table 2) except Hyp. The PCA identified Hyp, Met, Thr, Ser, His, Gly and Arg as biomarkers (p < 0.05) of bovine gelatine, which ANOVA failed to identify. From these results, it was evident that PCA is a useful tool for analysing a multivariate dataset and can provide a more in-depth understanding of AA distributions than ANOVA and the correlation test.

4 Conclusion

The demand for gelatine in food, pharmaceutical and cosmetic products has grown exponentially, which calls for gelatine authentication as a forensic food tool in testing laboratories. AA analysis by UHPLC-DAD combined with PCA is imperative for addressing false claims by manufacturers and the food integrity of gelatines. To explore the gelatine dataset, this manuscript explained only PCA, one of the frequently utilized multivariate data analysis (MDA) methods for authentication, while other MDA methods such as multiple linear regression, discriminant analysis and partial least squares were not discussed. The analysis follows a step-by-step procedure of gelatine hydrolysis, AA derivatization, dataset pre-processing and PCA to avoid false-positive and false-negative results. Since only a limited number of testing laboratories provide AA testing, this manuscript may guide testing laboratories in extending their scope of analysis, and it suggests UHPLC-DAD combined with PCA as an alternative to the PCR method. Certification or regulatory bodies at the governmental level and testing laboratories could adopt this step-by-step procedure to develop a standard for AA analysis and achieve quality testing for the authentication of gelatine products.