Text-to-image model

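A text-to-image model is a machine learning model that takes a natural-language description as input and produces an image matching that description. Development of such models began in the mid-2010s, driven by advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models, such as OpenAI's DALL-E 2, Google Brain's Imagen, and StabilityAI's Stable Diffusion, began to approach the quality of real photographs and human-drawn art.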

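Text-to-image models generally combine a language model, which transforms the input text into a machine representation, with a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.^[1]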

History

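Before the rise of deep learning, attempts to build text-to-image models were limited to arranging existing component images, such as items from a clip art database, into collage-like pictures.^[2]^[3]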

The inverse task, image captioning, was more tractable, and a number of image captioning models appeared before the first text-to-image models.^[4]

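The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers at the University of Toronto. It extended the earlier DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) so that it could take a text sequence as input.^[4] Images generated by alignDRAW were blurry and not photorealistic, but the model was able to generalize to objects not present in the training data (such as a red school bus) and handled novel prompts such as "a stop sign is flying in blue skies" appropriately, showing that it was not merely "replaying" data from the training set.^[4]^[5]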

Eight images generated by alignDRAW (2015) from the text prompt "A stop sign is flying in blue skies" (enlarged to show detail).^[6]

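In 2016, Reed, Akata, Yan et al. were the first to apply generative adversarial networks to the text-to-image task.^[5]^[7] With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from captions such as "an all black bird with a distinct thick, rounded bill". A model trained on the more diverse COCO dataset produced images that were "from a distance... encouraging" but lacked coherence in their details.^[5] Later systems include VQGAN+CLIP,^[8] XMC-GAN, and GauGAN2.^[9]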

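One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a Transformer-based system announced in January 2021.^[10] Its successor DALL-E 2, capable of generating more complex and realistic images, was unveiled in April 2022,^[11] followed by the public release of Stable Diffusion in August 2022.^[12]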

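Following text-to-image models, language-model-powered text-to-video platforms began to emerge, such as Runway, Make-A-Video,^[13] Imagen Video,^[14] Midjourney,^[15] and Phenaki.^[16] These can generate video from text and/or text-and-image prompts.^[17]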

Architecture and training

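Text-to-image models have been built using a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though Transformer models have since become more popular. For the image generation step, conditional generative adversarial networks have commonly been used, and diffusion models have also become popular in recent years. Rather than directly training a model that takes text as input and outputs a high-resolution image, a common technique is to first train a model that generates low-resolution images and then use one or more auxiliary deep learning models to upscale them, filling in finer details.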
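The staged pipeline described above (text encoder, low-resolution generator, upscaler) can be illustrated with a minimal structural sketch. Everything in it is a toy assumption made for illustration: `encode_text`, `generate_low_res`, `upsample`, and `text_to_image` are hypothetical names, the "encoder" is just a hash, the "generator" is seeded noise, and the "upscaler" is nearest-neighbour repetition. It shows only how the stages hand data to each other, not how any real model works.

```python
import hashlib
import random


def encode_text(prompt, dim=8):
    """Toy stand-in for a text encoder: map a prompt to a fixed-length vector."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def generate_low_res(embedding, size=8):
    """Toy stand-in for a conditional generator: a size x size grayscale grid
    whose content depends only on the text embedding."""
    rng = random.Random(int(sum(embedding) * 1_000_000))
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def upsample(image, factor=4):
    """Toy stand-in for a super-resolution model: nearest-neighbour upscaling."""
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def text_to_image(prompt):
    """Chain the three stages: encode, generate at low resolution, upscale."""
    embedding = encode_text(prompt)
    low_res = generate_low_res(embedding)
    return upsample(low_res)


image = text_to_image("a stop sign is flying in blue skies")
print(len(image), len(image[0]))  # 32 32
```

In a real system each stage would be a trained neural network, and the upscaling stage may itself be conditioned on the text; the sketch leaves that out for brevity.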

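Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. For its 2022 Imagen model, Google Brain reported positive results from using a large language model trained separately on text-only data (with its weights subsequently frozen), a departure from the previously standard approach.^[18]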

Datasets

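Training a text-to-image model requires a dataset of images paired with text captions. One dataset commonly used for this purpose is COCO (Common Objects in Context), released by Microsoft in 2014, which consists of around 123,000 images depicting a variety of objects, each with five captions written by human annotators. Oxford-102 Flowers and CUB-200 Birds are smaller datasets of around 10,000 images each, restricted to flowers and birds respectively. Because their subject matter is narrow, training a high-quality in-domain text-to-image model on them is considered less difficult.^[7]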

Evaluation

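Evaluating the quality of text-to-image models is challenging and involves assessing several different properties. As with any generative image model, the generated images should be realistic (in the sense of looking like meaningful images that could plausibly have come from the training set) and diverse in style. A requirement specific to text-to-image models is that generated images should semantically match the captions used to generate them. A number of schemes have been devised for measuring these properties, some automated and others based on human judgement.^[7]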

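A common algorithmic metric for assessing image quality and diversity is the Inception score (IS), which is based on the distribution of labels predicted by a pretrained Inception v3 image classification model when applied to a sample of images generated by the text-to-image model. The score increases when the classifier assigns a single label with high probability, a scheme intended to favour "distinct" generated images. Another popular metric is the related Fréchet inception distance (FID), which compares the distribution of generated images with that of real training images, according to features extracted by one of the final layers of a pretrained image classification model.^[7]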
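As a rough illustration of the Inception score, the sketch below computes it from a table of per-image class probabilities using the commonly cited definition, the exponential of the average KL divergence between each image's label distribution and the marginal label distribution over all images. The function name `inception_score` and the hand-written probability tables are assumptions for illustration; in practice the probabilities would come from a pretrained Inception v3 classifier, which is not included here.

```python
import math


def inception_score(probs):
    """Compute exp(mean_x KL(p(y|x) || p(y))).

    probs: list of per-image class-probability lists, each summing to 1.
    """
    n_images = len(probs)
    n_classes = len(probs[0])
    # Marginal label distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n_images for j in range(n_classes)]
    total_kl = 0.0
    for p in probs:
        for j in range(n_classes):
            if p[j] > 0:  # terms with p = 0 contribute nothing
                total_kl += p[j] * math.log(p[j] / marginal[j])
    return math.exp(total_kl / n_images)


# Confident predictions spread evenly over three classes: highest possible score.
confident_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Every image gets the same uninformative prediction: lowest possible score.
uncertain = [[1 / 3, 1 / 3, 1 / 3]] * 3

print(round(inception_score(confident_diverse), 3))  # 3.0
print(round(inception_score(uncertain), 3))          # 1.0
```

Under this definition the score ranges from 1 up to the number of classes, and a set of confident but identical predictions also scores 1, which is how the metric accounts for diversity as well as distinctness.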

Impact and applications

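The exhibition "Thinking Machines: Art and Design in the Computer Age, 1959–1989" at the Museum of Modern Art in New York provided an overview of applications of AI in art, architecture, and design. Exhibitions showcasing the use of AI to produce art include the 2016 Google-sponsored benefit and auction at the Gray Area Foundation in San Francisco, where artists experimented with the DeepDream algorithm, and the 2017 exhibition "Unhuman: Art in the Age of AI", held in Los Angeles and Frankfurt. In spring 2018, the Association for Computing Machinery dedicated a magazine issue to the subject of computers and art. In June 2018, "Duet for Human and Machine", an artwork that allows viewers to interact with an AI, premiered at the Beall Center for Art + Technology. The Austrian Ars Electronica and the Museum of Applied Arts, Vienna opened exhibitions on AI in 2019. Ars Electronica's 2019 festival, themed "Out of the box", explored the role of art in a sustainable societal transformation.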

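Online applications of generated images have begun to flourish and to extend into fields that are not traditionally high-tech. For example, 小秋子绘本^[19] uses text-to-image generation to assist speech therapy.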

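In September 2022, one expert concluded that "AI art is everywhere right now", with even experts not knowing what it will mean.^[20] One news outlet established that "AI-generated art booms" and reported on issues of copyright and the automation of professional artists,^[21] another investigated how online communities reacted when confronted with large numbers of such works,^[22] and another raised concerns about deepfakes.^[23] A magazine highlighted the possibility of enabling "new forms of artistic expression",^[24] and an editorial noted that it may be seen as a welcome "augmentation of human capability" (Vincent, James. Anyone can use this AI art generator — that's the risk. The Verge. 2022-09-15. Retrieved 2022-11-09. Archived from the original on 2023-02-14).^[25]^[26]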

Examples of such augmentation may include enabling amateurs to expand noncommercial niche genres (commonly cyberpunk derivatives such as solarpunk).

Synthetic media, which includes AI art, was described in 2022 as a major technology-driven trend that may affect business in the coming years.^[26]

See also

References

[1] Vincent, James. All these images were generated by Google's latest text-to-image AI. The Verge (Vox Media). 2022-05-24. Retrieved 2022-05-28. Archived from the original on 2023-02-15.
[2] Agnese, Jorge; Herrera, Jonathan; Tao, Haicheng; Zhu, Xingquan. A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis (PDF). 2019-10. Retrieved 2023-01-14. arXiv:1910.09399. Archived (PDF) from the original on 2023-03-16.
[3] Zhu, Xiaojin; Goldberg, Andrew B.; Eldawy, Mohamed; Dyer, Charles R.; Strock, Bradley. A text-to-picture synthesis system for augmenting communication (PDF). AAAI. 2007, 7: 1590–1595. Retrieved 2023-01-14. Archived (PDF) from the original on 2022-09-07.
[4] Mansimov, Elman; Parisotto, Emilio; Lei Ba, Jimmy; Salakhutdinov, Ruslan. Generating Images from Captions with Attention. ICLR. 2015-11. Retrieved 2023-01-14. arXiv:1511.02793. Archived from the original on 2023-04-14.
[5] Reed, Scott; Akata, Zeynep; Logeswaran, Lajanugen; Schiele, Bernt; Lee, Honglak. Generative Adversarial Text to Image Synthesis (PDF). International Conference on Machine Learning. 2016-06. Retrieved 2023-01-14. Archived (PDF) from the original on 2023-03-16.
[6] Mansimov, Elman; Parisotto, Emilio; Ba, Jimmy Lei; Salakhutdinov, Ruslan. Generating Images from Captions with Attention (PDF). International Conference on Learning Representations. 2016-02-29. Retrieved 2023-01-14. arXiv:1511.02793. Archived (PDF) from the original on 2023-02-03.
[7] Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas. Adversarial text-to-image synthesis: A review. Neural Networks. 2021-12, 144: 187–209. PMID 34500257. S2CID 231698782. doi:10.1016/j.neunet.2021.07.019.
[8] Rodriguez, Jesus. 🌅 Edge#229: VQGAN + CLIP. thesequence.substack.com. Retrieved 2022-10-10. Archived from the original on 2022-12-04.
[9] Rodriguez, Jesus. 🎆🌆 Edge#231: Text-to-Image Synthesis with GANs. thesequence.substack.com. Retrieved 2022-10-10. Archived from the original on 2022-12-04.
[10] Coldewey, Devin. OpenAI's DALL-E creates plausible images of literally anything you ask it to. TechCrunch. 2021-01-05. Retrieved 2023-01-14. Archived from the original on 2021-01-06.
[11] Coldewey, Devin. OpenAI's new DALL-E model draws anything — but bigger, better and faster than before. TechCrunch. 2022-04-06. Retrieved 2023-01-14. Archived from the original on 2023-05-06.
[12] Stable Diffusion Public Release. Stability.Ai. Retrieved 2022-10-27. Archived from the original on 2022-08-30.
[13] Kumar, Ashish. Meta AI Introduces 'Make-A-Video': An Artificial Intelligence System That Generates Videos From Text. MarkTechPost. 2022-10-03. Retrieved 2022-10-03. Archived from the original on 2022-12-01.
[14] Edwards, Benj. Google's newest AI generator creates HD video from text prompts. Ars Technica. 2022-10-05. Retrieved 2022-10-25. Archived from the original on 2023-02-07.
[15] Rodriguez, Jesus. 🎨 Edge#237: What is Midjourney?. thesequence.substack.com. Retrieved 2022-10-26. Archived from the original on 2022-12-04.
[16] Phenaki. phenaki.video. Retrieved 2022-10-03. Archived from the original on 2022-10-07.
[17] Edwards, Benj. Runway teases AI-powered text-to-video editing using written prompts. Ars Technica. 2022-09-09. Retrieved 2022-09-12. Archived from the original on 2023-01-27.
[18] Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Kamyar Seyed Ghasemipour, Seyed; Karagol Ayan, Burcu; Sara Mahdavi, S.; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; J Fleet, David; Norouzi, Mohammad. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. 2022-05-23. Retrieved 2023-01-14. arXiv:2205.11487. Archived from the original on 2023-03-25.
[19] 小秋子绘本 (小秋子繪本).
[20] Ocampo, Rodolfo. AI art is everywhere right now. Even experts don't know what it will mean. techxplore.com. Retrieved 2022-09-15. Archived from the original on 2023-01-19.
[21] As AI-generated art takes off - who really owns it?. Thomson Reuters Foundation. Retrieved 2022-09-15. Archived from the original on 2022-09-23.
[22] Edwards, Benj. Flooded with AI-generated images, some art communities ban them completely. Ars Technica. 2022-09-12. Retrieved 2022-09-15. Archived from the original on 2023-01-31.
[23] Wiggers, Kyle. Deepfakes: Uncensored AI art model prompts ethics questions. TechCrunch. 2022-08-24. Retrieved 2022-09-15. Archived from the original on 2022-08-31.
[24] AI is reshaping creativity, and maybe that's a good thing. Dazed. 2022-08-18. Retrieved 2022-09-15. Archived from the original on 2023-01-23.
[25] AI-generated art illustrates another problem with computers | John Naughton. The Guardian. 2022-08-20. Retrieved 2022-09-15. Archived from the original on 2023-02-06.
[26] Elgan, Mike. How 'synthetic media' will transform business forever. Computerworld. 2022-11-01. Retrieved 2022-11-09. Archived from the original on 2023-02-10.
