Abstract
Log line clusters usually lack the meaningful descriptions that are required to understand the information provided by the log lines within a cluster. Template generators make it possible to produce such descriptions in the form of patterns that match all log lines within a cluster and therefore describe the common features, e.g., substrings, of the lines. Current approaches only allow the generation of token-based templates (tokens being, e.g., space-separated words), which are often inaccurate for log lines because they usually do not account for existing string similarities in, for instance, fully qualified system names or domain names. Consequently, novel character-based template generators are required that provide robust templates for any type of computer log data, which can be applied in security information and event management (SIEM) solutions, for continuous auditing, quality inspection, and control. In this chapter, we propose a novel approach for computing character-based templates that combines comparison-based methods and heuristics. To achieve this goal, we solve the problem of efficiently calculating a multi-line alignment for a group of log lines and compute an accurate approximation of the optimal character-based template.
Parts of this chapter have been published in [119].
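The difference between token-based and character-based templates can be illustrated with a minimal sketch. It is not the approach proposed in the chapter: it folds the lines pairwise with Python's standard `difflib.SequenceMatcher` instead of computing a multi-line alignment, and the function names (`token_template`, `merge`, `char_template`), the `<*>` placeholder, and the example host names are assumptions made for illustration only.

```python
from difflib import SequenceMatcher
from functools import reduce

GAP = "\x00"         # internal gap marker; assumed never to occur in a log line
PLACEHOLDER = "<*>"  # hypothetical rendering of a variable part


def token_template(lines):
    """Naive token-based template; assumes every line has the same token count."""
    rows = [line.split() for line in lines]
    return " ".join(
        column[0] if len(set(column)) == 1 else PLACEHOLDER
        for column in zip(*rows)
    )


def merge(template, line):
    """Keep the characters shared by template and line; mark the rest as a gap."""
    matcher = SequenceMatcher(None, template, line, autojunk=False)
    parts = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(template[i1:i2])
        elif not parts or parts[-1] != GAP:
            parts.append(GAP)  # collapse any differing span into a single gap
    return "".join(parts)


def char_template(lines):
    """Greedy pairwise folding; only a rough stand-in for a multi-line alignment."""
    return reduce(merge, lines).replace(GAP, PLACEHOLDER)


lines = [
    "Connection from host-a.example.com closed",
    "Connection from host-b.example.com closed",
    "Connection from host-c.example.com closed",
]
print(token_template(lines))  # Connection from <*> closed
print(char_template(lines))   # Connection from host-<*>.example.com closed
```

The token-based template discards the whole host name, whereas the character-based template keeps the shared prefix `host-` and the shared domain suffix. The result of the pairwise folding depends on the order of the lines and can contain spurious short matches for less similar lines, which is one reason a dedicated multi-line alignment is of interest.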
Notes
- 1. A sequence alignment is the result of an algorithm that arranges two strings so that the fewest operations (i.e., insertions, deletions, or replacements of characters) are required to transform one string into the other, i.e., it yields the highest possible similarity.
- 2. Note that the direction is also diagonal when a character is replaced.
- 3.
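The pairwise alignment described in note 1 and the traceback direction mentioned in note 2 can be sketched with the classic dynamic-programming scheme behind the Levenshtein distance and the Needleman-Wunsch algorithm. The unit costs for insertion, deletion, and replacement, the gap symbol `-`, and the function name `align` are assumptions of this sketch, not necessarily the scoring used in the chapter.

```python
def align(a, b, gap="-"):
    """Return (edit distance, aligned a, aligned b) using unit edit costs."""
    n, m = len(a), len(b)
    # dist[i][j] = minimum number of edits that turn a[:i] into b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # diagonal: match or replacement
                dist[i - 1][j] + 1,         # up: delete a character of a
                dist[i][j - 1] + 1,         # left: insert a character of b
            )

    # Traceback from the bottom-right cell to the origin.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        cost = int(i > 0 and j > 0 and a[i - 1] != b[j - 1])
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            # Diagonal step, taken for a match and also for a replacement.
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            out_a.append(a[i - 1])
            out_b.append(gap)
            i -= 1
        else:
            out_a.append(gap)
            out_b.append(b[j - 1])
            j -= 1
    return dist[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


distance, row_a, row_b = align("kitten", "sitting")
print(distance)  # 3
print(row_a)
print(row_b)
```

Both returned rows have the same length, and removing the gap symbols restores the input strings. Positions where the two rows agree are candidates for the fixed characters of a template, while differing or gapped positions mark variable parts.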
References
Wael H Gomaa and Aly A Fahmy. A survey of text similarity approaches. International Journal of Computer Applications, 68(13):13–18, 2013.
Pinjia He, Jieming Zhu, Shilin He, Jian Li, and Michael R Lyu. An evaluation study on log parsing and its use in log mining. In 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 654–661. IEEE, 2016.
D. Jurafsky and J.H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson international edition. Prentice Hall, 2009.
Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710, 1966.
Saul B Needleman and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
Cédric Notredame. Recent evolutions of multiple sequence alignment algorithms. PLoS Computational Biology, 3(8):e123, 2007.
Markus Wurzenberger, Georg Höld, Max Landauer, Florian Skopik, and Wolfgang Kastner. Creating Character-based Templates for Log Data to Enable Security Event Classification. In Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, pages 141–152, 2020.
Copyright information
© 2021 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Skopik, F., Wurzenberger, M., Landauer, M. (2021). Generating Character-Based Templates for Log Data. In: Smart Log Data Analytics. Springer, Cham. https://doi.org/10.1007/978-3-030-74450-2_4
DOI: https://doi.org/10.1007/978-3-030-74450-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-74449-6
Online ISBN: 978-3-030-74450-2
eBook Packages: Computer Science (R0)