
Automatic image captioning system using a deep learning approach


Abstract

This paper tailors a residual network to improve the generation of high-quality image captions that accurately interpret the relevant image content. The research develops a Residual Attention Generative Adversarial Network (RAGAN), which introduces attention-based residual learning into a Generative Adversarial Network (GAN) to improve the diversity and fidelity of the generated captions. By grounding words in the attended feature maps, RAGAN generates high-quality captions faster, produces more diverse captions, and achieves higher language-metric scores. The generator is an encoder-decoder network trained in an unsupervised manner, with residual learning adopted between the encoder and the decoder. The discriminator is connected to a language evaluator unit that feeds its assessment forward to both the generator and the discriminator, influencing the captioning process positively or negatively. Experiments show that the proposed RAGAN outperforms state-of-the-art GAN models.
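To make the architecture described above concrete, the sketch below shows one way the stated components could fit together in PyTorch: an encoder-decoder generator with an attention-weighted residual link between encoder and decoder, and a discriminator that scores (image, caption) pairs. This is an illustration under stated assumptions, not the authors' implementation; the module names, layer sizes, single-layer LSTM/GRU choices, and eight-head attention are all hypothetical.

```python
# Minimal sketch of the RAGAN structure described in the abstract, in PyTorch.
# All names, dimensions, and configuration choices are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualAttentionGenerator(nn.Module):
    """Encoder-decoder caption generator with an attention-weighted
    residual link between the image encoder and the language decoder."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden_dim)   # project image region features
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decode = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):
        # img_feats: (B, R, feat_dim) region features; captions: (B, T) token ids
        enc = self.encode(img_feats)                    # (B, R, H)
        words = self.embed(captions)                    # (B, T, H)
        ctx, _ = self.attn(words, enc, enc)             # attend over image regions
        h, _ = self.decode(words + ctx)                 # residual link: words + attended context
        return self.out(h)                              # per-step vocabulary logits

class CaptionDiscriminator(nn.Module):
    """Scores an (image, caption) pair as real or generated."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.img = nn.Linear(feat_dim, hidden_dim)
        self.score = nn.Linear(2 * hidden_dim, 1)

    def forward(self, img_feats, captions):
        _, h = self.rnn(self.embed(captions))           # (1, B, H) caption summary
        v = self.img(img_feats.mean(dim=1))             # (B, H) pooled image vector
        return self.score(torch.cat([h[-1], v], dim=-1))  # real/fake logit
```

In the scheme the abstract describes, a language evaluator unit additionally scores each generated caption and feeds that signal forward to both networks; in a sketch like this, that role could plausibly be played by combining the discriminator logit with a sentence-level language metric (e.g., CIDEr) used as a reward, though the paper's exact mechanism is not reproduced here.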


Data availability

The data are included in the article and its supplementary material, or are referenced in the article.


Funding

Not applicable.

Author information


Corresponding author

Correspondence to Gerard Deepak.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest relevant to the content of this article.

Human and animal rights

This research does not involve human participants or animals; statements on informed consent and animal welfare are therefore not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Deepak, G., Gali, S., Sonker, A. et al. Automatic image captioning system using a deep learning approach. Soft Comput (2023). https://doi.org/10.1007/s00500-023-08544-8

