ABSTRACT
Digital art synthesis is receiving increasing attention in the multimedia community because it engages the public with art effectively. Current digital art synthesis methods usually rely on single-modality inputs as guidance, which limits the expressiveness of the model and the diversity of the generated results. To address this problem, we propose the multimodal guided artwork diffusion (MGAD) model, a diffusion-based digital artwork generation approach that uses multimodal prompts to guide a classifier-free diffusion model. Additionally, the contrastive language-image pretraining (CLIP) model is used to unify the text and image modalities. Extensive qualitative and quantitative results on generated digital art paintings confirm the effectiveness of combining the diffusion model with multimodal guidance. Code is available at https://github.com/haha-lisa/MGAD-multimodal-guided-artwork-diffusion.
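The core idea of multimodal guidance — steering generation toward both a text prompt and an image prompt in a shared CLIP embedding space — can be sketched as a weighted similarity loss. The function names, the linear weighting scheme, and the toy vectors below are illustrative assumptions, not the paper's actual implementation; in MGAD the embeddings would come from CLIP's text and image encoders and the loss would steer the diffusion sampling process.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def multimodal_guidance_loss(image_emb, text_emb, style_emb,
                             w_text=0.5, w_style=0.5):
    """Hypothetical combined guidance loss in a shared embedding space.

    Lower is better: the candidate image embedding should be close to
    both the text-prompt embedding and the style-image embedding.
    """
    text_term = 1.0 - cosine_sim(image_emb, text_emb)
    style_term = 1.0 - cosine_sim(image_emb, style_emb)
    return w_text * text_term + w_style * style_term

# Toy 2-D embeddings standing in for CLIP outputs.
text_emb = np.array([1.0, 0.0])
style_emb = np.array([0.0, 1.0])
candidate = np.array([1.0, 1.0])   # close to both prompts
print(multimodal_guidance_loss(candidate, text_emb, style_emb))
```

A guided sampler would differentiate a loss of this shape with respect to the intermediate image at each denoising step and nudge the sample in the direction that reduces it, balancing text fidelity against style fidelity via the two weights.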
Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion