Research Article · Open Access
DOI: 10.1145/3503161.3548282
MM '22: Proceedings of the 30th ACM International Conference on Multimedia

Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion

Published: 10 October 2022

ABSTRACT

Digital art synthesis is receiving increasing attention in the multimedia community because it engages the public with art in an effective way. Current digital art synthesis methods usually rely on single-modality inputs as guidance, which limits the expressiveness of the model and the diversity of the generated results. To address this problem, we propose the multimodal guided artwork diffusion (MGAD) model, a diffusion-based digital artwork generation approach that uses multimodal prompts to guide a classifier-free diffusion model. In addition, the contrastive language-image pre-training (CLIP) model is used to unify the text and image modalities. Extensive qualitative and quantitative results on generated digital art paintings confirm the effectiveness of combining the diffusion model with multimodal guidance. Code is available at https://github.com/haha-lisa/MGAD-multimodal-guided-artwork-diffusion.
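The abstract describes CLIP as the bridge that unifies the text and image prompts into a single guidance signal for the diffusion sampler. The snippet below is a minimal, hypothetical sketch of that idea, not the authors' released MGAD code: it assumes OpenAI's `clip` package is installed, uses a placeholder `denoise_fn` (an assumed differentiable function returning the diffusion model's current clean-image estimate), and combines the two prompt embeddings with a simple weighted average, which may differ from the paper's actual weighting scheme.

```python
# Hypothetical sketch of CLIP-guided diffusion with a multimodal (text + image)
# prompt, in the spirit of the abstract above. NOT the authors' MGAD code.
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

# CLIP's standard input normalization constants.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)


def multimodal_prompt_embedding(text_prompt, image_prompt, w_text=0.7, w_image=0.3):
    """Unify a text prompt and a PIL image prompt in CLIP space.

    Simple weighted average of the two normalized embeddings (an assumption,
    not necessarily the paper's scheme).
    """
    with torch.no_grad():
        text_emb = clip_model.encode_text(clip.tokenize([text_prompt]).to(device))
        image_emb = clip_model.encode_image(clip_preprocess(image_prompt).unsqueeze(0).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    emb = w_text * text_emb + w_image * image_emb
    return emb / emb.norm(dim=-1, keepdim=True)


def clip_guidance_grad(x_t, denoise_fn, prompt_emb, scale=1000.0):
    """Gradient of a CLIP-similarity loss w.r.t. the noisy sample x_t.

    `denoise_fn(x_t)` is assumed to return a differentiable estimate of the
    clean image in [-1, 1]. The returned gradient can shift the predicted
    mean at each reverse diffusion step, as in classifier guidance.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0 = (denoise_fn(x_t).clamp(-1, 1) + 1) / 2              # clean-image estimate in [0, 1]
    x0 = F.interpolate(x0, size=(224, 224), mode="bilinear", align_corners=False)
    x0 = (x0 - CLIP_MEAN) / CLIP_STD                         # CLIP input normalization
    img_emb = clip_model.encode_image(x0.type(clip_model.dtype))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = scale * (1.0 - (img_emb * prompt_emb).sum())      # 1 - cosine similarity
    return torch.autograd.grad(loss, x_t)[0]
```

In a complete sampler, the gradient returned by `clip_guidance_grad` would perturb the predicted mean at every reverse step, on top of the classifier-free guidance the abstract mentions; for the authors' actual procedure and hyperparameters, see the linked repository.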
