Implications of deep learning for the automation of design patterns organization

https://doi.org/10.1016/j.jpdc.2017.06.022

Highlights

  • There is a need to bridge the gap between the semantic relationships among design patterns and the features used to organize them.

  • We propose an approach that leverages a powerful deep learning algorithm, the Deep Belief Network (DBN).

  • The DBN learns from the semantic representation of documents formulated as feature vectors.

  • We performed a case study in the context of a text categorization based automated system.

  • The promising experimental results suggest that the proposed approach constructs a more representative feature set.

Abstract

Text categorization has been used to automate tasks in domains such as email filtering, web page classification, sentiment analysis, and author identification, and researchers have likewise employed it to automate the organization and selection of design patterns. However, there is still a need to bridge the gap between the semantic relationships among design patterns (i.e., documents) and the features used to organize them. In this study, we propose an approach that leverages a powerful deep learning algorithm, the Deep Belief Network (DBN), which learns from the semantic representation of documents formulated as feature vectors. We performed a case study in the context of a text categorization based automated system used for the classification and selection of software design patterns. In the case study, we focused on two main research objectives: 1) to empirically compare the feature sets constructed through global filter-based feature selection methods with those constructed through the proposed approach, and 2) to evaluate whether the proposed approach significantly improves the classification decisions (i.e., pattern organization) of classifiers. Adjusting DBN parameters such as the number of hidden layers, nodes, and iterations can help a developer construct a more illustrative feature set. The promising experimental results suggest that the proposed approach constructs a more representative feature set and improves classifier performance in organizing design patterns.

Introduction

The text categorization approach has been applied in many domains to build automated systems that reduce cost and computation time. The aim of such systems is to classify texts into appropriate classes according to their content; examples include spam email filtering [17], web page classification [29], sentiment analysis [23], and author identification [43].

The steady increase in the number of design patterns in the literature and in online repositories raises several issues [2], such as (1) the deep knowledge required to understand the classification scheme used to organize the patterns [3], (2) the lack of a classification scheme [11], (3) the formal specification of patterns [19], [20], (4) the heterogeneous description of design patterns [38], and (5) the selection of the right design patterns by inexperienced developers in the presence of complex [3] and spoiled patterns [4]. To address these issues, the text categorization approach has been employed to automate the organization and selection of design patterns [15], [16]. In text categorization based automated systems, it has been reported that using an excessive number of features may have undesirable effects on classification accuracy and computation time [8], [14], [37]. Wrapper, embedded, and filter are the three common schemes used to formulate feature selection methods. Wrapper and embedded methods require interaction with a classifier, while filter-based methods construct a feature set without such interaction. As a result, the running time of wrapper and embedded methods may increase [36], [42], and these methods are specific to the learning model, unlike filter-based methods. Consequently, filter-based methods are recommended for automated systems implemented via the text categorization approach. The effect of global filter-based feature selection methods has been reported in the context of text categorization based automated systems for the organization and selection of design patterns [8], [37]. However, the association between the semantic relationships among design patterns (i.e., documents) and the constructed features (i.e., attributes of the feature space constructed through a feature selection method) still needs to be addressed.
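To make the idea of a global filter-based method concrete, the following minimal sketch scores terms by Document Frequency and Information Gain without consulting any classifier. The toy pattern descriptions, class labels, and function names are assumptions made for illustration; they are not taken from the study's data or implementation.

```python
import math
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs."""
    present = Counter(l for d, l in zip(docs, labels) if term in d)
    absent = Counter(l for d, l in zip(docs, labels) if term not in d)
    h_before = entropy(list(Counter(labels).values()))
    h_after = 0.0
    for part in (present, absent):
        size = sum(part.values())
        if size:
            h_after += (size / len(docs)) * entropy(list(part.values()))
    return h_before - h_after

def select_top_k(docs, labels, k):
    """Rank the vocabulary by Information Gain (ties broken alphabetically)."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]

# Hypothetical, already tokenized pattern descriptions and their classes.
docs = [
    ["object", "creation", "factory"],
    ["object", "creation", "singleton"],
    ["object", "behavior", "observer"],
    ["object", "behavior", "strategy"],
]
labels = ["creational", "creational", "behavioral", "behavioral"]

df = document_frequency(docs)
top_terms = select_top_k(docs, labels, k=2)
```

In this toy data, "object" appears in every document and therefore carries no information about the class, whereas "creation" and "behavior" separate the two classes perfectly. The scores depend only on term and label statistics, which is why filter methods need no classifier interaction.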

Recently, researchers have adopted deep learning algorithms for research tasks such as improving fault localization [21], modeling programming languages to suggest code [40], and retrieving a program’s structural information and producing features from source code for defect prediction [39], [41]. The applicability of unsupervised feature learning and deep learning in certain areas has also been reported [1]. In this paper, we look at the capacity of deep learning [13] and propose an approach that leverages a powerful representation learning algorithm, the Deep Belief Network [12], to learn features from feature vectors extracted from the problem domain of design pattern collections [6], [9], [12]. The aim of the proposed approach is to fill the gap between the semantic relationships among documents (or patterns) and the features used for the organization of design patterns. The Deep Belief Network (DBN) is a generative graphical model that learns a representation from which the training data can be reconstructed with high probability. Before applying the DBN to learn features from a pattern catalog, we perform pre-processing tasks such as stopword removal and word stemming, and preserve the features (with their frequencies) in feature vector form [14]. These feature vectors are then used as input to the DBN algorithm. We performed experiments to evaluate the effectiveness of the proposed approach. Our main contributions in this paper are as follows.

  • We leverage the Deep Belief Network (DBN) algorithm to extract features learned from the semantic relationships among the design patterns of a collection, in order to improve the performance of classifiers employed to automate the organization of design patterns.

  • We propose an evaluation model to evaluate the performance of classifiers in the context of organization of design patterns.

  • We evaluate the capacity of the proposed approach in the construction of more illustrative feature sets as compared to existing filter-based feature selection methods, such as Document Frequency (DF) [44], Improved Gini Index [31], Correlation, Gain Ratio, Information Gain [22].

  • We evaluate the effectiveness of the proposed approach using three well-known classifiers [37] and three widely used design pattern collections [6], [9], [12] that have served as benchmarks in different domains.
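The pipeline described in the Introduction (stopword removal, stemming, frequency-based feature vectors, then DBN feature learning) can be sketched as follows. This is an illustrative sketch only, under several assumptions: the stopword list and suffix-stripping stemmer are toy stand-ins, the texts are invented, a DBN is approximated as a greedy stack of restricted Boltzmann machines trained with one-step contrastive divergence, and the layer sizes and epoch count are arbitrary. None of these choices is claimed to match the configuration used in the study.

```python
import math
import random
from collections import Counter

# Toy stopword list (assumption); a real system would use a fuller list.
STOPWORDS = {"a", "an", "the", "of", "to", "for", "and", "is", "in"}

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_feature_vectors(texts):
    """Drop stopwords, stem, and return (vocabulary, term-frequency vectors)."""
    docs = [Counter(stem(w) for w in t.lower().split() if w not in STOPWORDS)
            for t in texts]
    vocab = sorted(set().union(*docs))
    return vocab, [[d[t] for t in vocab] for d in docs]

def scale(vector):
    """Scale counts into [0, 1] so they can act as visible-unit activations."""
    top = max(vector) or 1
    return [x / top for x in vector]

def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))

class RBM:
    """Restricted Boltzmann machine trained with one-step contrastive divergence."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.w = [[rng.gauss(0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_v = [0.0] * n_visible
        self.b_h = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.b_h[j] + sum(v[i] * self.w[i][j] for i in range(len(v))))
                for j in range(len(self.b_h))]

    def visible_probs(self, h):
        return [sigmoid(self.b_v[i] + sum(h[j] * self.w[i][j] for j in range(len(h))))
                for i in range(len(self.b_v))]

    def train(self, data, epochs, lr=0.1):
        for _ in range(epochs):
            for v0 in data:
                h0 = self.hidden_probs(v0)                      # positive phase
                sample = [1.0 if self.rng.random() < p else 0.0 for p in h0]
                v1 = self.visible_probs(sample)                 # reconstruction
                h1 = self.hidden_probs(v1)                      # negative phase
                for i in range(len(v0)):
                    for j in range(len(h0)):
                        self.w[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
                    self.b_v[i] += lr * (v0[i] - v1[i])
                for j in range(len(h0)):
                    self.b_h[j] += lr * (h0[j] - h1[j])

def train_dbn(vectors, layer_sizes, epochs=20, seed=0):
    """Greedy layer-wise training: each RBM learns from the previous layer's output."""
    rng = random.Random(seed)
    layers, data = [], vectors
    for n_hidden in layer_sizes:
        rbm = RBM(len(data[0]), n_hidden, rng)
        rbm.train(data, epochs)
        data = [rbm.hidden_probs(v) for v in data]
        layers.append(rbm)
    return layers, data

# Hypothetical one-line pattern descriptions.
texts = [
    "Defines an interface for creating objects",
    "Ensures a class has a single instance",
    "Notifies dependent objects of state changes",
    "Encapsulates interchangeable algorithms in classes",
]
vocab, tf = to_feature_vectors(texts)
layers, features = train_dbn([scale(v) for v in tf], layer_sizes=[6, 3])
```

Here `features` holds one learned vector per document, taken from the top hidden layer, which could then be passed to any classifier. The number of hidden layers, the nodes per layer, and the number of iterations are exposed as `layer_sizes` and `epochs`, mirroring the DBN parameters the abstract identifies as tunable.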

The rest of the paper is organized into eight sections. In Section 2, we summarize the related work on the classification of design patterns, feature selection methods, and applications of deep learning. In Section 3, we briefly introduce the text categorization approach. In Section 4, we describe the working procedure of the proposed approach. In Section 5, we describe the experimental procedure and present a case study in the domain of design pattern classification. In Section 6, we discuss the results of the case study. In Section 7, we summarize the evaluation results. Finally, in Sections 8 and 9, we discuss threats to validity and conclude our work, respectively.

Section snippets

Related work

We performed a case study to describe the effectiveness of the proposed approach. The case study includes an empirical investigation of existing filter-based feature selection methods (such as Document Frequency, Improved Gini Index, Correlation, Gain Ratio, and Information Gain) and the proposed approach on three widely known design pattern collections. Consequently, we summarized the related work with respect to the use of filter-based feature selection methods, organization of

Brief introduction of text categorization approach

In machine learning and text mining domains, automatic text categorization is performed through either supervised or unsupervised learning techniques. The former use class labels assigned to documents, while the latter use attributes of the data and (dis)similarity measures to automate their learning. Text preprocessing is mandatory for text categorization and should be performed before the learning process [19], [20]. The main steps of text preprocessing are:

Proposed approach

We consider design pattern organization as an information retrieval problem [15], [16]. We look at the capacity of a deep learning algorithm to address the gap between the semantic relationships among design patterns and the features used for the organization of design patterns. The aim of the proposed approach is to construct a more illustrative feature set to improve the performance of classifiers employed in the text categorization based automated system. The proposed approach

Experimental setup

We performed a set of experiments to compare the effectiveness of the proposed approach with the individual effects of four global filter-based feature selection methods, namely Document Frequency (DF) [42], Improved Gini Index [31], Gain Ratio, and Information Gain [22], [37], on the performance of classifiers for the organization of a target pattern catalog. We summarize the experimental procedure in the following pseudocode.

Results and discussion

For each design pattern collection under study, we ran the pseudocode (Section 5) to determine (1) the best weighting method for each supervised learner, (2) the best-performing classifiers with their best weighting method, (3) the effect of the constructed feature set (via the proposed approach and the global filter-based feature selection methods under study) on the performance of classifiers, and (4) the improvement in classifier performance obtained by applying the proposed approach. Subsequently, in each

Evaluation summary

The main objective of the proposed framework is to employ a well-known deep learning algorithm to construct a more representative feature set and to evaluate its impact on the organization of design patterns. We applied the proposed approach to three widely used design pattern collections. The main findings of our study are as follows.

  • We applied three supervised learning techniques to evaluate the effectiveness of the proposed approach. In the

Threats to validity

Our study is subject to some threats to validity. The first threat relates to the generalization of the experimental results. We performed experiments and drew conclusions with a limited number of pattern collections and classifiers. To generalize the applicability of the proposed approach, we need to extend our work to a larger number of datasets and classifiers. The second threat relates to the use of default values for the DBN parameters. The effectiveness of the proposed approach

Conclusion

The experimental results illustrate that the proposed approach can be applied to construct a more illustrative feature set via a deep learning algorithm named Deep Belief Network (DBN) and can help improve classification performance. We leveraged a DBN algorithm to construct a more illustrative feature set on the basis of the semantic relationships among patterns across the collection. The aim of the proposed approach is to improve the classifier’s performance in order to organize the design patterns. We

Acknowledgments

This work is supported in part by the General Research Fund of the Research Grants Council of Hong Kong (No. 11208017 and 11214116), the research funds of City University of Hong Kong (No. 7004683 and 7004474), and the NRF of Korea Grant funded by the Korean Government (2015R1D1A1A01058171).

References (44)

  • Zhang C. et al. Authorship identification from unstructured texts. Knowl. Based Syst. (2014)

  • Bengio Y. et al. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. (2013)

  • A. Birukou, A survey of existing approaches for pattern search and selection, in: Proceeding of PLoP,...

  • Booch G. Handbook of Software Architecture (2006)

  • Bouhours C. et al. Spoiled patterns: How to extend the GoF. Softw. Qual. J. (2015)

  • Coad P. et al. Object Models: Strategies, Patterns, Applications (1995)

  • Douglass B.P. Real-Time Design Patterns: Robust Scalable Architecture for Real-Time Systems (2002)

  • Edwards C. Growing pains for deep learning. Commun. ACM (2015)

  • Forman G. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. (2003)

  • Gamma E. et al. Design Patterns: Elements of Reusable Object-Oriented Software (1995)

  • Gunal S. Hybrid feature selection for text classification. Turk. J. Electr. Eng. Comput. Sci. (2012)

  • S. Hasso, C.R. Carlson, A theoretically-based process for organizing design patterns, in: Proceedings of 12th Pattern...

    Shahid Hussain received his M.Sc. (Computer Science) and M.S. (Software Engineering) degrees from Gomal University, DIK and City University of Science and Information Technology (CUSIT), Peshawar, Pakistan. Currently, he is pursuing a Ph.D. (Software Engineering) with the Department of Computer Science, City University of Hong Kong. His research interests include software design patterns and metrics, text mining, empirical studies, and software defect prediction. Mr. Shahid Hussain is a student member of ACM and IEEE.

    Jacky Keung received his B.Sc. (Hons) in Computer Science from the University of Sydney, and his Ph.D. in Software Engineering from the University of New South Wales, Australia. He is an Assistant Professor in the Department of Computer Science, City University of Hong Kong. His research interests include software effort and cost estimation, empirical modeling and evaluation of complex systems, and intensive data mining for software engineering datasets. He has published papers in prestigious journals including IEEE-TSE, IEEE-SOFTWARE, EMSE and many other leading journals and conferences.

    Arif Ali Khan received his B.S. in Software Engineering from University of Science and Technology Bannu, Pakistan in 2010. He holds an M.Sc. by research in Information Technology from Universiti Teknologi PETRONAS, Malaysia. He is currently pursuing a Ph.D. degree with the Department of Computer Science, City University of Hong Kong. He is an active researcher in the field of empirical software engineering. He is interested in Software Process Improvement, 3C’s (Communication, Coordination, Control), Global Software Development and Systematic Reviews. He has participated in and managed several software engineering related research projects. He is a student member of IEEE and ACM.

    Awais Ahmad received his B.S. in Computer Science and M.S. in Telecommunication and Networking from University of Peshawar and Bahria University Islamabad in 2008 and 2010, respectively. During his master's research work, he worked on energy efficient congestion control schemes in Mobile Wireless Sensor Networks (WSN). In 2011, he was appointed as a Lecturer in Comsats Institute of Information Technology, Islamabad. In mid-2012, he moved to the Republic of Korea for his Ph.D. course at Kyungpook National University (KNU), Daegu. Throughout his time at KNU, as a student and researcher, Dr. Awais has been a key contributor in the fields of Big Data, Social Internet of Things, Internet of Things, Vehicular Communication, WSN, and Machine-to-Machine Communication. He has also served as Lab Admin of CCMP Labs since 2013 and was named Best Outgoing Researcher of CCMP Labs. Dr. Awais was awarded a Ph.D. in February 2017. After graduating, he was selected as an Assistant Professor in the Department of Information and Communication Engineering, Yeungnam University, Korea. He serves as Guest Editor in various Elsevier and Springer journals. He is an invited reviewer for IEEE Communication Letters, IEEE JSTAR, IEEE Transactions on Intelligent Transportation Systems, and several other IEEE and Elsevier journals. He has also published more than 65 research papers (journals and conferences) and several book chapters related to big data and IoT. Dr. Ahmad was the recipient of three prestigious awards: (1) Research Award from the President of Bahria University Islamabad, Pakistan in 2011, (2) Best Paper Nomination Award in WCECS 2011 at UCLA, USA, and (3) Best Paper Award in the 1st Symposium on CS&E, Moju Resort, South Korea in 2013.

    Salvatore Cuomo is an Assistant Professor in Numerical Analysis at University of Naples Federico II with interests in several Applied Mathematics topics. In more detail, he works on: (i) numerical approximation problems (theory, practice and applications); (ii) parallel, GPU and scientific computing issues; (iii) inverse problems. His passion is technology transfer in the data processing context and smart education environments. He is a co-founder of Educabile, an innovative start-up company.

    Francesco Piccialli is a researcher at the University of Naples “Federico II”, Department of Mathematics and Applications “Renato Caccioppoli”, Italy. His research topics are Smart Environment design, Internet Technologies, Data Mining on Internet of Things systems with application in the Smart City framework and the Cultural Heritage domain.

    Gwanggil Jeon received the B.S., M.S., and Ph.D. (summa cum laude) degrees from the Department of Electronics and Computer Engineering from Hanyang University, Seoul, Korea, in 2003, 2005, and 2008, respectively. From 2008 to 2009, he was with the Department of Electronics and Computer Engineering, Hanyang University, from 2009 to 2011, he was with the School of Information Technology and Engineering (SITE), University of Ottawa, as a postdoctoral fellow, and from 2011 to 2012, he was with the Graduate School of Science & Technology, Niigata University, as an Assistant Professor. He is currently an Associate Professor with the Department of Embedded Systems Engineering, Incheon National University, Incheon, Korea. His research interests fall under the umbrella of image processing, particularly image compression, motion estimation, demosaicking, and image enhancement as well as computational intelligence such as fuzzy and rough sets theories.

    Adnan Akhunzada is currently an Assistant Professor at CIIT Islamabad. He received his Ph.D. from University of Malaya, Malaysia. He had a great experience in teaching international modules of the University of Bradford, United Kingdom. He has published several high impact research journal papers. His current research interests include secure design and modeling of software defined networks, man-at-the-end attacks, lightweight cryptography, human attacker attribution and profiling, and remote data auditing.
