
DISET: a distance based semi-supervised self-training for automated users’ agent activity detection from web access log

Multimedia Tools and Applications

Abstract

Detecting automated users’ agent activities in a web application from its web access logs is a challenging problem. Many machine-learning-based solutions exist to address it. However, the existing supervised learning methods depend heavily on fully labeled data, and labeled access-log data remain scarce and costly to produce. Some unsupervised learning-based solutions have also been proposed, but their accuracy is questionable. Semi-supervised self-training works with a small set of labeled data, but it lacks both a suitable metric for selecting high-confidence predictions and a reliable base learner. In this paper, we propose a new semi-supervised self-training method, named DIstance-based SElf-Training (DISET), for detecting automated users’ agent activities. DISET uses probability-based selection criteria with Mahalanobis distance to select a high-confidence subset of predictions. The DISET framework works in four steps. The first step, data preprocessing, performs data cleaning, session identification, feature extraction, and session labeling. The second step segments the data into labeled and unlabeled datasets. The third step, model self-training, performs the subset selection using six different supervised base learners independently. The fourth step tests the performance of the trained model. The performance of DISET is evaluated on the NASA95 and E-commerce weblog datasets using three-fold cross-validation for training and testing. The datasets are also divided into different ratios of labeled and unlabeled instances for the experiments. Performance is measured by accuracy, precision, recall, F1 score, and the Matthews Correlation Coefficient (MCC), and the model’s performance is compared with six different base classifiers. We also plotted the ROC and PR curves to confirm and compare the performance of the different base learners with the DISET method. Of the six base learners, XGBoost performed best on both datasets at the 30:70 data segmentation ratio. The results show that DISET achieves a minimum percentage improvement of 1.91% in accuracy, 2.70% in precision, 3.65% in sensitivity (recall), and 1.00% in F1 score with large unlabeled datasets.
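
The abstract does not spell out the exact selection rule, so the following is only a minimal sketch of how a probability threshold and a Mahalanobis-distance check might be combined in a self-training loop. Everything in it is an assumption made for illustration: the k-nearest-neighbour base learner, the restriction to two features and two classes, the thresholds `p_min` and `d_max`, and all function names. It is not the authors' implementation.

```python
import math

def mean_and_inv_cov(rows, ridge=1e-6):
    """Mean and inverse covariance of 2-D points (needs at least two rows)."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1) + ridge
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1) + ridge
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    det = sxx * syy - sxy * sxy  # positive thanks to the small ridge term
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv

def mahalanobis(x, mean, inv):
    """Mahalanobis distance of point x from a class distribution."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(max(q, 0.0))

def knn_proba(x, X, y, k=3):
    """Fraction of the k nearest labeled neighbours that belong to class 1."""
    nearest = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))[:k]
    return sum(y[i] for i in nearest) / len(nearest)

def self_train(X_lab, y_lab, X_unl, p_min=0.9, d_max=2.0, k=3, max_rounds=10):
    """Pseudo-label points that are both confident and close to their class."""
    X, y, pool = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(max_rounds):
        # Per-class statistics from the current (labeled + pseudo-labeled) set.
        stats = {c: mean_and_inv_cov([x for x, t in zip(X, y) if t == c])
                 for c in (0, 1)}
        accepted, rest = [], []
        for x in pool:
            p1 = knn_proba(x, X, y, k)
            label = 1 if p1 >= 0.5 else 0
            confidence = max(p1, 1.0 - p1)
            distance = mahalanobis(x, *stats[label])
            # Assumed rule: both the probability and the distance must pass.
            if confidence >= p_min and distance <= d_max:
                accepted.append((x, label))
            else:
                rest.append(x)
        if not accepted:
            break  # nothing new was selected, so further rounds change nothing
        for x, label in accepted:
            X.append(x)
            y.append(label)
        pool = rest
    return X, y, pool

# Toy data: two well-separated clusters and one ambiguous point at (3, 3).
X_lab = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5), (6, 5), (5, 6), (6, 6)]
y_lab = [0, 0, 0, 0, 1, 1, 1, 1]
X_unl = [(0.5, 0.5), (5.5, 5.5), (3.0, 3.0), (0.4, 0.6)]
X_new, y_new, pool = self_train(X_lab, y_lab, X_unl)
```

In this toy run the points near each cluster are pseudo-labeled, while the ambiguous midpoint stays in the unlabeled pool because it fails both checks. In a real setting, the base learner would be swapped for whichever supervised classifier is under evaluation, and the thresholds would need tuning.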



Author information

Corresponding author

Correspondence to Rikhi Ram Jagat.

Ethics declarations

Conflicts of interest/Competing interests

None; we declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jagat, R.R., Sisodia, D.S. & Singh, P. DISET: a distance based semi-supervised self-training for automated users’ agent activity detection from web access log. Multimed Tools Appl 82, 19853–19876 (2023). https://doi.org/10.1007/s11042-022-14258-0

  • DOI: https://doi.org/10.1007/s11042-022-14258-0
