Abstract
The quality of predictions depends heavily on the features that a classification system relies on. This is one of the reasons why approaches focused on feature selection and reduction play a significant role in data mining. Among all available attributes, those of the highest relevance and importance for a given task should be identified. This objective can be achieved by applying a feature ranking algorithm. Some data exploration methods have their own inherent mechanisms dedicated to feature reduction; decision reducts, defined within rough set theory, offer such an option. The chapter presents research on the application of reduct-based characterisation of features, employed to support classification by selected inducers working outside the rough set domain. The problem to be solved comes from the field of stylometry, the study of writing styles, whose main task is authorship attribution based on characteristic features of a quantitative rather than qualitative type.
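To make the idea of reduct-based characterisation concrete, the following minimal sketch finds all decision reducts of a tiny, consistent decision table by brute force and then ranks attributes by how many reducts contain them. The toy table, the function names, and the frequency-based score are illustrative assumptions, not the procedure or data used in the chapter, and exhaustive search is only feasible for a handful of attributes.

```python
from collections import Counter
from itertools import combinations


def preserves_decisions(rows, decisions, attrs):
    """Return True if rows that agree on `attrs` always share one decision."""
    seen = {}
    for row, dec in zip(rows, decisions):
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, dec) != dec:
            return False
    return True


def decision_reducts(rows, decisions):
    """Brute-force all minimal attribute subsets that preserve decisions.

    Assumes the full table is consistent (hypothetical helper).
    """
    n_attrs = len(rows[0])
    reducts = []
    # Subsets are visited by increasing size, so any qualifying subset that
    # contains no previously found reduct is itself minimal.
    for size in range(1, n_attrs + 1):
        for subset in combinations(range(n_attrs), size):
            if any(set(r) <= set(subset) for r in reducts):
                continue  # superset of a known reduct, hence not minimal
            if preserves_decisions(rows, decisions, subset):
                reducts.append(subset)
    return reducts


def rank_by_reduct_frequency(reducts, n_attrs):
    """Order attribute indices by the number of reducts that include them."""
    counts = Counter(a for r in reducts for a in r)
    return sorted(range(n_attrs), key=lambda a: (-counts[a], a))


# Toy decision table (assumed data): three discrete attributes, two classes.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
decisions = ["x", "y", "y", "x"]

reducts = decision_reducts(rows, decisions)
ranking = rank_by_reduct_frequency(reducts, n_attrs=3)
print(reducts)
print(ranking)
```

In this toy table attributes 1 and 2 carry the same information, so each can replace the other in a reduct, while attribute 0 is needed in every reduct and therefore ranks first. A ranking of this kind could then be passed to any classifier outside the rough set domain to decide which attributes to keep.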
Acknowledgements
The research presented in the chapter was performed within the statutory project of the Department of Graphics, Computer Vision and Digital Systems (RAU-6, 2021), at the Silesian University of Technology, Gliwice, Poland.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Stańczyk, U. (2022). Application of Rough Set-Based Characterisation of Attributes in Feature Selection and Reduction. In: Virvou, M., Tsihrintzis, G.A., Jain, L.C. (eds) Advances in Selected Artificial Intelligence Areas. Learning and Analytics in Intelligent Systems, vol 24. Springer, Cham. https://doi.org/10.1007/978-3-030-93052-3_3
DOI: https://doi.org/10.1007/978-3-030-93052-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93051-6
Online ISBN: 978-3-030-93052-3
eBook Packages: Intelligent Technologies and Robotics (R0)