GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models (2024)

Jian Ma1, Yonglin Deng2, Chen Chen1#, Haonan Lu1, Zhenyu Yang1. The author did this work during an internship at OPPO AI Center. # denotes corresponding authors.

Abstract

Posters play a crucial role in marketing and advertising, contributing significantly to industrial design by enhancing visual communication and brand visibility. With recent advances in controllable text-to-image diffusion models, a growing body of research now focuses on rendering text within synthetic images. Despite improvements in text rendering accuracy, the field of end-to-end poster generation remains underexplored. This complex task involves striking a balance between text rendering accuracy and automated layout to produce high-resolution images with variable aspect ratios. To tackle this challenge, we propose an end-to-end text rendering framework employing a triple cross-attention mechanism rooted in alignment learning, designed to create precise poster text within detailed contextual backgrounds. Additionally, we introduce a high-resolution dataset that exceeds 1024 pixels in image resolution. Our approach leverages the SDXL architecture. Extensive experiments validate the ability of our method to generate poster images featuring intricate and contextually rich backgrounds. Code will be available at https://github.com/OPPO-Mente-Lab/GlyphDraw2.

[Figure 1]

Introduction

The impressive generative capabilities of large text-to-image diffusion models (Nichol et al. 2021; Ramesh et al. 2022; Rombach et al. 2022; Saharia et al. 2022; Podell et al. 2023) enable the creation of highly realistic and detailed images. Substantial research efforts have focused on addressing the text rendering limitations of diffusion models, yielding promising results. Building upon these advancements, our goal is to endow diffusion systems with the capability for end-to-end poster generation. Posters serve as a prominent medium of visual communication. In industrial design, there has been growing demand for personalized and customized posters in advertising, publicity, marketing, and other fields. Poster generation based on diffusion models offers a novel solution in this domain and represents an important direction for the industrial application of text-to-image generation.

The challenge of controllable image generation has gained significant attention recently because text-based conditioning by itself often falls short of providing the precise control necessary to meet the varied and complex requirements of users. To create personalized images that satisfy user preferences and steer text-to-image diffusion models with novel conditions, one method involves using extra adapter modules to encode these new conditions and then applying a cross-attention mechanism to blend the encoded features into the diffusion generation process. Visual text rendering is a subset of personalized text-to-image generation, as it involves supplying glyph reference images and specific text layout details to steer the generation process. Recent studies in this field can be broadly classified into two primary approaches: layout control (Chen et al. 2024, 2023a) and text accuracy (Yang et al. 2024; Tuo et al. 2023; Zhao and Lian 2023). Nonetheless, there is a lack of comprehensive research on creating an end-to-end text rendering diffusion model that automates layout according to user inputs and produces images with high text accuracy. This is particularly advantageous for poster creation, where end-to-end systems can offer immense convenience and enhance user experiences by eliminating manual layout adjustments. Ensuring high text rendering accuracy and rich visual backgrounds is vital for these generated posters. Furthermore, such posters require high resolution and adaptable aspect ratios to accommodate various display scenarios and design requirements.

To address the aforementioned challenges, we introduce a controllable text generation framework with a triple cross-attention mechanism based on alignment learning. This is an end-to-end poster generation framework that leverages user prompts during inference to automatically generate layouts and ensures high text accuracy while generating poster images with visually rich backgrounds. In particular, regarding layout, a large language model is fine-tuned to produce the positional information of text bounding boxes (bboxes). This allows for more adaptable and resilient layout strategies in poster generation. When adding newly drawn elements, it is crucial to integrate them smoothly and consistently with the background. Studies (Li, Li, and Hoi 2024; Ma et al. 2023b; Tuo et al. 2023; Li et al. 2023b) have shown that using conditional representations in text prompts to guide generation can achieve a seamless blend between conditions and background. Therefore, we utilize PP-OCR (Li et al. 2022) to encode text stroke attributes into embeddings, which are integrated with image caption embeddings as text features for SDXL. For glyph control, the ControlNet architecture (Zhang, Rao, and Agrawala 2023) is employed, which utilizes positional text information as the image feature and incorporates PP-OCR embeddings as text input. To enhance text generation accuracy, besides the inherent cross-attention mechanism applied to text features in the U-Net, we introduce two additional cross-attention mechanisms: a glyph image prompt adapter akin to IP-Adapter (Ye et al. 2023) and a ControlNet-based adaptive fusion module. Finally, an auxiliary alignment loss on the generated image is implemented to maintain the visual richness of the background. In summary, our contributions are threefold.

  • We propose an end-to-end solution for poster generation by fine-tuning large language models (LLMs) for layout planning. A glyph generation framework based on alignment learning and triple cross-attention can accurately place text in appropriate positions while preserving the visually rich background of the poster.

  • We introduce a higher-resolution dataset that includes image-text pairs of both Chinese and English glyphs, as well as high-quality poster data.

  • Both quantitative and qualitative experimental results demonstrate the excellent performance of our proposed architecture in generating posters.

Related Work

Controllable Text-to-Image Diffusion Models. In recent years, text-to-image (T2I) diffusion models have demonstrated state-of-the-art performance in image generation. Although text-based conditioning has advanced the field of controllable generation, it still lacks the capacity to fully satisfy the diverse needs of all users. Consequently, a growing body of recent research has shifted its focus towards integrating novel conditioning beyond the scope of text descriptions into T2I diffusion models, aiming to address more specific requirements of users across various applications. A prevalent approach involves the integration of model-based conditioning, wherein an auxiliary model is utilized to encode novel conditioning factors and the encoded features are fed into the diffusion model. A prominent instance of this approach is IP-Adapter, which introduced a decoupled cross-attention mechanism to separate the cross-attention layers for text features and image features. The cross-attention mechanism employed by IP-Adapter for image-based conditioning has demonstrated efficacy and has been widely adopted in numerous subsequent investigations into controllable generation (Ma et al. 2023b; Wang et al. 2024). ControlNet also stands out for controllable generation by incorporating an additional copy of the encoder into the U-Net structure. This supplementary encoder is connected to the original U-Net layers through the proposed zero convolution to prevent overfitting and catastrophic forgetting. ControlNet can incorporate task-specific input conditions as priors for controllable generation, and has been widely studied in spatial control (Jia et al. 2024; Qin et al. 2023; Zavadski, Feiden, and Rother 2023), text rendering (Yang et al. 2024; Zhang et al. 2023a), and 3D generation (Chen et al. 2023c; Yu et al. 2023).

Text Rendering. Text rendering is a pivotal task in controllable image generation. The fundamental objective is to generate accurate, well-laid-out text that blends seamlessly with the image background. Initially, GlyphDraw (Ma et al. 2023c) modified the network structure to allow the model to learn to draw language characters with the help of glyph and position information; it lacks multi-line text generation and has limited layout planning capabilities. Meanwhile, GlyphControl (Yang et al. 2024) and Brush Your Text (Zhang et al. 2023a) enhance off-the-shelf text-to-image diffusion models by leveraging the shape information of glyph images with a ControlNet branch, thus imbuing text-to-image diffusion models with the capacity to generate text. The latter introduces localized attention constraints to solve the problem of unreasonable positioning of scene text. In addition, TextDiffuser (Chen et al. 2024) and TextDiffuser-2 (Chen et al. 2023a) automate layout by using a layout transformer and a large language model, respectively, to predict the layout from the input prompts. The TextDiffuser series is limited to monolingual generation, and the accuracy of the generated text requires further improvement. AnyText (Tuo et al. 2023) comprises a diffusion pipeline with an auxiliary latent module and a text embedding module. During training, it employs a text-control diffusion loss and a text perceptual loss to further enhance writing accuracy. It requires users to specify the layout; otherwise, it can only generate a random layout, which may lack aesthetic appeal and rational structure. Further, UDiffText (Zhao and Lian 2023) and Glyph-ByT5 (Liu et al. 2024) design and train character-aware, glyph-aligned text encoders to provide more robust text embeddings as conditional guidance. Poster generation is a type of text rendering. In this study, we take into account the automatic generation of layouts and improve the accuracy of the generated text while maintaining the visually captivating nature of the background.

LLM-Generated Text-to-Image Conditions. Controllable image generation typically requires users to provide more intricate and nuanced conditions. For instance, ControlNet requires the provision of depth maps, edge detection maps, semantic segmentation maps, etc. However, these additional conditions often demand further human effort to acquire, imposing a burden on the user in comparison to the relative simplicity of text-based prompts. Meanwhile, existing generative models face challenges in fully comprehending complex, lengthy texts with elaborate descriptions. To mitigate these problems, recent studies (Nie et al. 2024; Zhang et al. 2023b; Gani et al. 2023) have explored the use of LLMs to generate new comprehensive conditions based on user prompts, such as blob representations, sketches with descriptions, object descriptions, and layout specifications to guide image generation. Especially for layout, LayoutGPT (Feng et al. 2024) and LayoutPrompter (Lin et al. 2024) leverage LLMs to generate style sheet languages for each object, such as CSS, HTML, and XML. Furthermore, TextDiffuser-2, LLM Blueprint (Gani et al. 2023), and Reason Out Your Layout (Chen et al. 2023b) have explored utilizing LLMs to generate a bbox for each object as a new condition. Generating layout bboxes can be achieved through two main approaches: prompt engineering for advanced proprietary models such as GPT-4, and fine-tuning open-source LLMs. Compared to prompt engineering, fine-tuning LLMs is more efficient and facilitates the development of end-to-end poster generation models. Based on the above, we fine-tune LLMs on poster layout information to generate bboxes that guide the positioning of textual elements within posters.

Build the Dataset

Motivation

To endow the diffusion model with the ability to produce effective poster images, it is necessary to construct a comprehensive dataset with the following characteristics: diverse glyph distributions, aesthetically pleasing layouts and compositions, and visually appealing backgrounds. In addition, our specific objective is to achieve versatility through bilingual poster generation. However, existing datasets mainly provide text-image pairs tailored for monolingual text rendering, such as LAION-Glyph (Yang et al. 2024) and MARIO-10M (Chen et al. 2024) for English generation, and they also show limitations in terms of text layout. AnyWord-3M (Tuo et al. 2023) is a bilingual dataset predominantly sourced from e-commerce and advertising contexts. While it is well-suited for text rendering training, it lacks sufficient text layout and background appeal for poster generation tasks, making it less than ideal as a standalone training dataset for poster generation.

Therefore, we developed two large-scale, high-resolution image datasets to enhance the accuracy of the generated text and improve the overall aesthetic quality of the poster generation task. These datasets have resolutions exceeding 1024 pixels, enabling the creation of more precise and visually appealing posters. The first dataset, referred to as the general dataset, is designed to train the model's text rendering capabilities. It consists of two parts, Chinese and English, with the Chinese data being approximately twice the size of the English data. The second dataset is specifically tailored for poster generation, predominantly featuring Chinese glyphs in the rendered text of poster images; around 10% of the data in this dataset contains English words.

Data Collection and Processing

Two datasets are prepared for poster generation: a general dataset and a poster dataset. To collect high-quality images embedded with visual text suitable for poster generation tasks, we adopt a data pre-processing procedure to filter data and extract text and location information. The process is depicted in Figure 2.

[Figure 2]

The initial step entails processing the general dataset. Specifically, high-resolution images are first selected, as our chosen base model is SDXL. Following this, PP-OCR is employed to precisely locate and recognize text elements within the images, encompassing both English and Chinese characters. In order to mitigate potential noise in the dataset, we employ filtering strategies specifically designed for the identified text bboxes. Additionally, we leverage BLIP-2 (Li et al. 2023a) to generate captions for the collected images. The extracted text is enclosed within quotation marks and seamlessly integrated into the image captions.

For the poster dataset, supplementary resolution-based filtering rules are introduced to carefully select landscape- and portrait-oriented posters. Furthermore, aesthetic scoring is utilized to identify visually captivating images of higher quality. In addition, to enhance the dataset's overall quality, we improve the image processing pipeline. Because PP-OCR may fail to locate and extract all text and its corresponding positions, the unrecognized portions of text introduce noise during training; consequently, during inference, the model generates garbled and malformed text outside the target regions, which degrades the resulting images. To address this issue, we adopt a specific procedure for handling small text in the poster dataset: masks are added over regions containing small text, and the LaMa (Suvorov et al. 2022) model is used to inpaint these regions. Small text areas are defined as OCR-detected regions whose area is less than 0.001 of the total image area. The restored images are then incorporated into the poster dataset, ensuring improved quality.
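To make the small-text cleaning step concrete, the following is a minimal sketch of how such a filter could be implemented; the 0.001 area threshold comes from the text above, while the PP-OCR box format and the `lama_inpaint` helper are assumptions rather than the paper's actual code.

```python
import numpy as np
import cv2

SMALL_TEXT_RATIO = 0.001  # threshold from the paper: box area / image area


def polygon_area(box: np.ndarray) -> float:
    """Shoelace area of a quadrilateral OCR box with shape (4, 2)."""
    x, y = box[:, 0], box[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))


def build_small_text_mask(image: np.ndarray, ocr_boxes) -> np.ndarray:
    """Return a binary mask covering text boxes too small to train on reliably.

    `ocr_boxes` is assumed to follow the PP-OCR quadrilateral format:
    a list of [[x1, y1], [x2, y2], [x3, y3], [x4, y4]] point lists.
    """
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    for box in ocr_boxes:
        box = np.asarray(box, dtype=np.int32)
        if polygon_area(box) / float(h * w) < SMALL_TEXT_RATIO:
            cv2.fillPoly(mask, [box], 255)
    return mask


# Hypothetical usage; `lama_inpaint` stands in for whatever LaMa wrapper is used:
# mask = build_small_text_mask(image, ocr_boxes)
# if mask.any():
#     image = lama_inpaint(image, mask)
```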

Based on the aforementioned process, detailed filtering strategies and statistical distribution of the datasets are presented in the appendix.

Methodology

[Figure 3]

Model Overview

The entire framework is divided into four parts, as shown in Fig. 3. The first component, the Fusion Text Encoder (FTE) with glyph embedding, operates in a relatively traditional manner. Its primary objective is to integrate the features of two modalities from the perspective of the SD text encoder, thereby ensuring a cohesive amalgamation of the two modalities in the generated images. The second, and more pivotal, element of our framework is the introduction of Triples of Cross-Attention (TCA). In this stage, we incorporate two additional cross-attention layers into the SD decoder. The first new cross-attention layer facilitates the interaction between glyph features and the hidden variables within the image; building on earlier work such as IP-Adapter, it enhances the accuracy of glyph rendering. The second new cross-attention layer enables interaction between ControlNet features and the hidden variables in the image; by engaging with ControlNet information, this layer adaptively learns intrinsic attributes such as a harmonious glyph layout. In the third part, we add an Auxiliary Alignment Loss (AAL) for semantic consistency, in order to improve the overall layout and enrich the background information of the poster. Finally, at the inference stage, we employ a fine-tuned LLM to automatically analyze user descriptions and generate the corresponding glyphs and bbox coordinates as conditions for the framework, enabling end-to-end poster generation.

Preliminaries

The diffusion model is composed of a forward diffusion process and a reverse denoising process. The forward process gradually adds random noise to clean data and diffuses it into pure Gaussian noise, while the reverse process reverses the diffusion to create satisfactory samples from Gaussian noise (Ho, Jain, and Abbeel 2020).

Specifically, for an input image $x_0 \in \mathbb{R}^{H\times W\times 3}$, the encoder $\mathcal{E}$ of the auto-encoder transforms it into a latent representation $z_0 \in \mathbb{R}^{h\times w\times c}$, where $f = H/h = W/w$ is the downsampling factor and $c$ is the latent feature dimension. The diffusion process is then performed in the latent space, where a conditional U-Net (Ronneberger, Fischer, and Brox 2015) denoiser $\epsilon_\theta$ is employed to predict the noise $\epsilon$ given the current timestep $t$, the noisy latent $z_t$, and the generation condition $C$.

The condition information $C$ is fed into each cross-attention block $i$ of the U-Net model as

$$S = \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)\cdot V, \qquad (1)$$

where $Q = W_q^{(i)}\,\varphi_i(z_t)$, $K = W_k^{(i)}\,C$, and $V = W_v^{(i)}\,C$. Here, $d$ denotes the output dimension of the key ($K$) and query ($Q$) features, $\varphi_i(z_t)$ is a flattened intermediate representation of the noisy latent $z_t$ within the U-Net $\epsilon_\theta$, and $W_q^{(i)}, W_k^{(i)}, W_v^{(i)}$ are learnable projection matrices.
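As a point of reference, the following is a minimal, single-head PyTorch sketch of the cross-attention in Eq. (1); the module and argument names are illustrative, not taken from any released code.

```python
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Single-head cross-attention: S = softmax(QK^T / sqrt(d)) V, as in Eq. (1)."""

    def __init__(self, query_dim: int, context_dim: int, d: int):
        super().__init__()
        self.d = d
        self.to_q = nn.Linear(query_dim, d, bias=False)    # W_q
        self.to_k = nn.Linear(context_dim, d, bias=False)  # W_k
        self.to_v = nn.Linear(context_dim, d, bias=False)  # W_v

    def forward(self, z_t: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # z_t: flattened latent features (B, N, query_dim)
        # context: condition C (B, M, context_dim), e.g. CLIP text embeddings
        q, k, v = self.to_q(z_t), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        return attn @ v
```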

In text-to-image scenarios, the condition $C=\tau_\theta(y)$ is produced by encoding the text prompt $y$ with a pretrained CLIP (Radford et al. 2021) text encoder $\tau_\theta$. Therefore, the overall training objective of SD is defined as

$$\mathcal{L}_{\text{SD}}=\mathbb{E}_{\mathcal{E}(x_{0}),\,C,\,\epsilon\sim\mathcal{N}(0,1),\,t}\Big[\lVert\epsilon-\epsilon_{\theta}(z_{t},t,C)\rVert_{2}^{2}\Big]. \qquad (2)$$

Fusion Text Encoder

This approach draws on ideas from earlier works such as BLIP-Diffusion (Li, Li, and Hoi 2024), Subject-Diffusion (Ma et al. 2023b), and AnyText, and is commonly used as a global condition control strategy. First, the input glyph condition is rendered into a glyph image, which is then passed to PP-OCR to extract the corresponding glyph features. Following the same logic as AnyText, the glyph features go through a linear layer for feature alignment before being fused with the caption at the corresponding positions; this keeps the module plug-and-play without fine-tuning the text encoder.
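The following is a minimal sketch of this fusion idea, assuming PP-OCR features of dimension 512 and a text embedding dimension of 2048 (both illustrative) with the text encoder kept frozen; the actual FTE implementation may differ.

```python
import torch
import torch.nn as nn


class FusionTextEncoder(nn.Module):
    """Project OCR glyph features and splice them into the (frozen) caption
    embedding at the token positions corresponding to the quoted text."""

    def __init__(self, ocr_dim: int = 512, text_dim: int = 2048):
        super().__init__()
        # Only this projection is trained; the text encoder stays frozen (plug-and-play).
        self.glyph_proj = nn.Linear(ocr_dim, text_dim)

    def forward(self, caption_emb: torch.Tensor, glyph_feat: torch.Tensor,
                token_slots: torch.Tensor) -> torch.Tensor:
        # caption_emb: (B, L, text_dim) caption embedding from the frozen text encoder
        # glyph_feat:  (B, T, ocr_dim) PP-OCR features of the rendered glyph lines
        # token_slots: (B, T) indices of the caption tokens to overwrite
        fused = caption_emb.clone()
        aligned = self.glyph_proj(glyph_feat)                      # (B, T, text_dim)
        batch_idx = torch.arange(caption_emb.size(0)).unsqueeze(-1)
        fused[batch_idx, token_slots] = aligned                    # inject glyph features at text positions
        return fused
```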

Triples of Cross-Attention

In order to ensure the accuracy of glyph generation, we still introduce a ControlNet module. However, instead of directly adding its features to the decoder as in prior work, we introduce an additional adaptive cross-attention layer after the original cross-attention layer, as shown in Fig. 3. The output of this new cross-attention layer, $S'$, is computed as follows:

$$S' = \mathrm{Attention}(Q,K',V') = \mathrm{softmax}\left(\frac{Q{K'}^{T}}{\sqrt{d}}\right)\cdot V', \qquad (3)$$

where $K' = {W'_k}^{(j)}\cdot C'$ and $V' = {W'_v}^{(j)}\cdot C'$, the features $C'$ come from the corresponding block of ControlNet, ${W'_k}^{(j)}, {W'_v}^{(j)}$ are learnable projection matrices, and $j$ indexes the block in the U-Net decoder. Due to the asymmetric structure of SDXL's encoder and decoder layers, we omit the interaction in the first block of the first two decoder stages. The motivation is that the glyph condition occupies only a small proportion of the generated image, so the ControlNet driven by the glyph condition must be prevented from degrading the richness of the generated background. Adaptive local position learning therefore preserves glyph accuracy while generating images with better layouts and backgrounds.

Moreover, it is worth noting that we borrow the approach of InstantID (Wang et al. 2024), in which the input condition of the ControlNet contains only glyph information, excluding text information.

Furthermore, the accurate generation of paragraphs or larger blocks of text remains a significant challenge. To address this issue, we introduce a second new cross-attention layer, whose output $S''$ is computed as follows:

$$S'' = \mathrm{Attention}(Q,K'',V'') = \mathrm{softmax}\left(\frac{Q{K''}^{T}}{\sqrt{d}}\right)\cdot V'', \qquad (4)$$

where $K'' = {W''_k}^{(j)}\cdot C''$ and $V'' = {W''_v}^{(j)}\cdot C''$, the features $C''$ come from the glyph features extracted by PP-OCR, and ${W''_k}^{(j)}, {W''_v}^{(j)}$ are learnable projection matrices. This idea is inspired by the earlier IP-Adapter. It is worth noting that we insert this cross-attention layer only into the corresponding blocks of the SD decoder, since modifying the encoder layers would disrupt the features obtained by the ControlNet. Through multiple experiments, we find that the functioning of the ControlNet is highly dependent on keeping its encoder structure relatively intact; moreover, it is crucial that the ControlNet maintains a duplicate of the SD encoder and uses zero initialization.

In combination with the existing cross-attention layer of each block, the final TCA output is the sum of the three layers as follows:

$$S_{TCA} = \alpha S + \beta S' + \gamma S'', \qquad (5)$$

where $\alpha, \beta, \gamma$ are constants that balance the importance of the three cross-attention layers.
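Reusing the CrossAttention sketch above, a TCA block could be assembled roughly as follows; the weighting constants and feature dimensions are illustrative rather than the paper's exact configuration.

```python
import torch.nn as nn


class TCABlock(nn.Module):
    """Triples of Cross-Attention (Eq. 5): the original text CA plus two new CA
    layers whose K/V come from the matching ControlNet block (C') and from the
    PP-OCR glyph features (C''). Inserted only in the U-Net decoder blocks."""

    def __init__(self, dim: int, text_dim: int, ctrl_dim: int, glyph_dim: int,
                 alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0):
        super().__init__()
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.ca_text = CrossAttention(dim, text_dim, dim)    # existing CA on text features (S)
        self.ca_ctrl = CrossAttention(dim, ctrl_dim, dim)    # new: ControlNet features as K/V (S')
        self.ca_glyph = CrossAttention(dim, glyph_dim, dim)  # new: glyph features as K/V (S'')

    def forward(self, z, text_cond, ctrl_cond, glyph_cond):
        s = self.ca_text(z, text_cond)
        s_ctrl = self.ca_ctrl(z, ctrl_cond)
        s_glyph = self.ca_glyph(z, glyph_cond)
        return self.alpha * s + self.beta * s_ctrl + self.gamma * s_glyph
```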

Auxiliary Alignment Loss

Considering the application context of poster generation in our paper, in addition to the accuracy of glyph generation and the harmony of the background, we also need to focus on the richness of the image background itself. Our approach inevitably introduces additional condition injection, including the ControlNet feature addition as well as the TCA strategy, which increases the number of decoder components. The fundamental purpose of these conditions is to ensure the controllability of the generated image. However, many studies have shown that controllability often comes at the cost of editability or text consistency. Therefore, we introduce AAL in our approach. The alignment model employs SDXL as its backbone, similar to how ControlNet utilizes a duplicated SD encoder; in our method, however, we duplicate the SD decoder and apply AAL between the cross-attention outputs in each block of the duplicated decoder and those of the original cross-attention layer in the TCA. The primary objective of this approach is to minimize the impact of the added glyph-learning modules on the overall layout and image quality. Therefore, our AAL for semantic consistency, $\mathcal{L}'$, can be formulated as follows:

$$\mathcal{L}' = \Big\lVert \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)\cdot V - \mathrm{softmax}\left(\frac{QK_{c}^{T}}{\sqrt{d}}\right)\cdot V_{c} \Big\rVert, \qquad (6)$$

where $K_c, V_c$ are the key and value of the cross-attention in each block of the duplicated U-Net decoder, so the second term is that block's CA output. Our final loss can be formulated as follows, with an important hyperparameter $\lambda$:

$$\mathcal{L}=\mathbb{E}_{\mathcal{E}(x_{0}),\,C,\,\epsilon\sim\mathcal{N}(0,1),\,t}\Big[\lVert\epsilon-\epsilon_{\theta}(z_{t},t,C)\rVert_{2}^{2}\Big]+\lambda\,\mathcal{L}'. \qquad (7)$$
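A minimal sketch of how Eqs. (6) and (7) might be combined in a training step is shown below; the per-block averaging and the value of $\lambda$ are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F


def auxiliary_align_loss(tca_text_ca_outs, ref_ca_outs):
    """Eq. (6) sketch: match the original text CA outputs inside each TCA block
    against the CA outputs of a duplicated (frozen) SDXL decoder.
    Both arguments are lists of tensors, one per decoder block."""
    loss = 0.0
    for s, s_ref in zip(tca_text_ca_outs, ref_ca_outs):
        loss = loss + torch.norm(s - s_ref.detach())
    return loss / max(len(tca_text_ca_outs), 1)


def total_loss(noise_pred, noise, tca_text_ca_outs, ref_ca_outs, lam=0.01):
    """Eq. (7): standard diffusion loss plus lambda-weighted AAL.
    lam = 0.01 is purely illustrative."""
    l_sd = F.mse_loss(noise_pred, noise)
    return l_sd + lam * auxiliary_align_loss(tca_text_ca_outs, ref_ca_outs)
```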

Inference with Fine-tuned LLM

To ensure end-to-end poster generation, the last problem that needs to be solved is the elimination of manual intervention, i.e., the predefined image layout. We rely entirely on the user's caption description and introduce an LLM to solve this problem. For convenience of invocation, we construct our own instruction data and fine-tune an open-source language model.
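At inference time, the layout step can be pictured roughly as below; the prompt template, JSON response format, and `llm_generate` helper are assumptions for illustration, not the actual instruction format used for fine-tuning.

```python
import json
import re


def predict_layout(caption: str, width: int, height: int, llm_generate) -> list:
    """Ask the fine-tuned layout LLM for the text lines and their bboxes.

    `llm_generate` is a placeholder for the fine-tuned Baichuan2 inference call.
    Returns an empty list on malformed output so the caller can fall back to
    the rule-based random layout."""
    prompt = (
        f'Image size: {width}x{height}. Caption: "{caption}". '
        'Output each text line and its bbox as JSON: '
        '[{"text": "...", "bbox": [x1, y1, x2, y2]}, ...]'
    )
    response = llm_generate(prompt)
    try:
        layout = json.loads(re.search(r"\[.*\]", response, re.S).group(0))
        assert all(len(item["bbox"]) == 4 for item in layout)
        return layout
    except (AttributeError, KeyError, AssertionError, TypeError, json.JSONDecodeError):
        return []
```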

Experiments

Implementation Details

The model we intend to train comprises two main components. The first component is a controllable text-to-image poster model, with the backbone of our framework based on SDXL. To adapt the SDXL encoder to multilingual understanding and maintain linguistic coherence between the prompt's description of the poster background and the generated text, we incorporate the PEA-Diffusion strategy (Ma et al. 2023a) into the backbone architecture. This strategy entails replacing the original SDXL encoder with a multilingual CLIP encoder and an adapter, followed by knowledge distillation to align semantic representations. Our model has a total of 1.6 billion trainable parameters, comprising the ControlNet and the two additional cross-attention structures. Owing to the characteristics of ControlNet and adapters, our solution is highly portable. We use the AdamW optimizer (Loshchilov and Hutter 2017) with a learning rate of 3e-5. During training, we adopt a two-stage progressive strategy. In the initial stage, the objective is to impart text generation capabilities: the model is trained for 80,000 steps on the synthetic dataset without the AAL for semantic consistency. In the second stage, a poster dataset with rich layouts is utilized; to maintain a diverse range of backgrounds in the generated posters, the model is trained for a further 20,000 steps with AAL. The entire diffusion model is trained on 64 A100 GPUs for 100,000 steps with a batch size of 2 per GPU.

The second component is a layout generation model based on an LLM. We employ Baichuan2 (Yang et al. 2023) for this task, using a training dataset consisting exclusively of poster data. Given that the task involves predicting two position coordinates, it poses a major challenge to the language model. To improve prediction accuracy, we normalize the coordinate points and use only the top-left and bottom-right corner points. In addition, to maintain the stability of the end-to-end generation process, a rule-based random layout generation approach is used when the LLM produces inaccurate predictions. Random strategies are also mixed into the layout generation procedure at a ratio of approximately 5% to strike a balance between stability and variability in the generated layouts. The LLM for layout generation is trained on 64 A100 GPUs for 30,000 steps with a batch size of 10 per GPU.
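Reusing `predict_layout` from the earlier sketch, the fallback path could look roughly like the sketch below; the concrete box-sizing rules are assumptions, since the paper only states that a random rule-based strategy (about 5% of the time, or on inaccurate LLM output) backs up the predictions.

```python
import random


def denormalize(bbox, width, height):
    """Map normalized [0, 1] corner coordinates back to pixel space."""
    x1, y1, x2, y2 = bbox
    return [x1 * width, y1 * height, x2 * width, y2 * height]


def random_layout(texts, width, height):
    """Rule-based fallback: stack roughly centered boxes from the top down,
    sizing each box by its character count. The exact rules are illustrative."""
    layout, y = [], 0.10 * height
    for text in texts:
        box_w = min(0.8 * width, 0.06 * height * max(len(text), 1))
        box_h = 0.08 * height
        x1 = (width - box_w) / 2 + random.uniform(-0.05, 0.05) * width
        layout.append({"text": text, "bbox": [x1, y, x1 + box_w, y + box_h]})
        y += box_h * 1.4
    return layout


def get_layout(caption, texts, width, height, llm_generate, random_ratio=0.05):
    """Use the LLM prediction unless it is malformed or the random branch is drawn."""
    layout = [] if random.random() < random_ratio else predict_layout(
        caption, width, height, llm_generate)
    return layout or random_layout(texts, width, height)
```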

Evaluation

The evaluation set can be divided into two parts, which are used to assess the performance of the model.

The first part is the AnyText-Benchmark (Tuo et al. 2023), which contains one thousand English images and one thousand Chinese images from LAION (Schuhmann et al. 2021) and Wukong (Gu et al. 2022), respectively. However, we found that the 1,000 images in the AnyText-Benchmark used to test Chinese generation capabilities were mixed with English data, so we removed that portion, leaving 915 images as the ground truth for evaluation. Following AnyText, we evaluate text rendering quality on the AnyText-Benchmark from two aspects: (1) Position Word Accuracy (PWAcc) calculates the accuracy of the words generated at a specific position; a prediction is counted as correct only when it perfectly matches the ground truth. (2) Normalized Edit Distance (NED) is a measure of the similarity between two strings and is commonly used for text comparison. It is typically computed by first using a dynamic programming algorithm to calculate the Levenshtein distance between the two strings and then dividing it by the maximum length of the strings for normalization.
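For concreteness, a straightforward implementation of the normalized edit distance described above is sketched below; note that the benchmark tables report NED as a similarity score where higher is better, which presumably corresponds to one minus this normalized distance, so treat the exact reporting convention as an assumption.

```python
def normalized_edit_distance(pred: str, gt: str) -> float:
    """Levenshtein distance via dynamic programming, divided by the longer
    string's length (0 = identical strings, 1 = completely different)."""
    m, n = len(pred), len(gt)
    if max(m, n) == 0:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # cost of deleting i characters
    for j in range(n + 1):
        dp[0][j] = j  # cost of inserting j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, n)
```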

It is worth noting that in the AnyText-Benchmark, the majority of English evaluation sets contain only one English word per bbox, resulting in a lack of precision when evaluating English sentences. Therefore, it is necessary to construct more complex evaluation sets.

The second part of the evaluation set consists of two subsets constructed by us, the Complex-Benchmark and the Poster-Benchmark, forming four evaluation subsets in total that cover bilingual Chinese and English evaluation. The Complex-Benchmark consists of 100 prompts: in the Chinese prompts, the characters to be rendered are randomly combined and arranged, while the English prompts feature longer words with consecutive repeated letters. The primary objective of this evaluation set is to assess the accuracy of text rendering. Furthermore, the Poster-Benchmark includes 120 prompts that describe the generation of posters; its purpose is to evaluate the layout accuracy, robustness, and overall aesthetic quality of end-to-end poster generation. For these evaluation sets, we utilize three evaluation metrics to assess the accuracy and quality of poster generation: (1) Accuracy (Acc) calculates the proportion of correctly generated characters in the rendered text relative to the total number of characters to be rendered. (2) ClipScore measures how well the generated image aligns with the textual prompt or description provided. (3) HPSv2 (Wu et al. 2023) measures whether the generated images align with human preferences and serves as an indicator of preference quality.

In our comparison, we evaluate various approaches, including not only AnyText but also methods based on ControlNet and Stable Diffusion 3 (SD3) (Esser et al. 2024). Given that SD3 does not support the rendering of Chinese text, the calculation of Chinese indicators is omitted for it in our subsequent analysis. Additionally, as NED computations generally depend on anchoring based on the positions of text bboxes, SD3's NED calculations have also been excluded.

Experimental Results

In the following section, we provide a comprehensive analysis of both quantitative and qualitative results, comparing our method with state-of-the-art approaches in the fields of text rendering and poster generation.

Table 1: Quantitative comparison on the AnyText-Benchmark, Complex-Benchmark, and Poster-Benchmark. "-" denotes metrics that are not applicable.

| Benchmark | Model | Acc (zh) | NED (zh) | ClipScore (zh) | HPSv2 (zh) | Acc (en) | NED (en) | ClipScore (en) | HPSv2 (en) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AnyText-Benchmark | SD3 | - | - | - | - | 0.3261 | - | 0.4517 | 0.2215 |
| AnyText-Benchmark | ControlNet | 0.7598 | 0.8254 | 0.3749 | 0.2347 | 0.7098 | 0.8467 | 0.4558 | 0.2245 |
| AnyText-Benchmark | AnyText-v1.1 | 0.7661 | 0.8423 | 0.3968 | 0.2272 | 0.7108 | 0.8564 | 0.4721 | 0.2121 |
| AnyText-Benchmark | GlyphDraw2 w/o LLM | 0.7892 | 0.8476 | 0.3921 | 0.2555 | 0.7369 | 0.8921 | 0.4616 | 0.2350 |
| Complex-Benchmark | SD3 | - | - | - | - | 0.2515 | - | 0.4391 | 0.2492 |
| Complex-Benchmark | ControlNet | 0.6943 | 0.8745 | 0.3589 | 0.2364 | 0.2254 | 0.4025 | 0.4214 | 0.2385 |
| Complex-Benchmark | AnyText-v1.1 | 0.5749 | 0.8560 | 0.3633 | 0.2434 | 0.0342 | 0.3755 | 0.4104 | 0.2312 |
| Complex-Benchmark | GlyphDraw2 w/o LLM | 0.7176 | 0.8991 | 0.3600 | 0.2422 | 0.2791 | 0.4332 | 0.4160 | 0.2395 |
| Complex-Benchmark | LLM+ControlNet | 0.5812 | 0.8012 | 0.3687 | 0.2365 | 0.1856 | 0.5841 | 0.4215 | 0.2356 |
| Complex-Benchmark | LLM+AnyText-v1.1 | 0.4850 | 0.7888 | 0.3697 | 0.2534 | 0.0455 | 0.4680 | 0.4038 | 0.2380 |
| Complex-Benchmark | GlyphDraw2 | 0.6215 | 0.8479 | 0.3756 | 0.2427 | 0.2264 | 0.6273 | 0.4362 | 0.2415 |
| Poster-Benchmark | SD3 | - | - | - | - | 0.2310 | - | 0.4128 | 0.2337 |
| Poster-Benchmark | ControlNet | 0.7878 | 0.8453 | 0.3844 | 0.2298 | 0.3421 | 0.7514 | 0.3902 | 0.2125 |
| Poster-Benchmark | LLM+AnyText-v1.1 | 0.7421 | 0.8894 | 0.3956 | 0.2362 | 0.2604 | 0.7120 | 0.4093 | 0.2289 |
| Poster-Benchmark | GlyphDraw2 | 0.8215 | 0.9590 | 0.3908 | 0.2378 | 0.3999 | 0.7667 | 0.3984 | 0.2297 |

Comparison results of AnyText-Benchmark. AnyText-Benchmark is utilized to assess the model’s proficiency in rendering Chinese and English text independently. To specifically assess the models’ Chinese text generation capabilities, we exclude all English texts from the Chinese evaluation set, including samples with only a single English text in the prompt. This results in 915 remaining samples for the experimental evaluation. The English evaluation set remains unchanged. Additionally, the evaluation metrics employed align with those used in AnyText, encompassing word accuracy and NED.

To ensure fair evaluation, all methods employ the DDIM sampler with 50 sampling steps, a CFG scale of 9, and a fixed random seed of 100. Each prompt generates a single image with identical positive and negative prompts. Quantitative comparison results are presented in Table 1. From the results, it is evident that our model achieves significantly higher accuracy in rendering both Chinese and English text compared to AnyText, although GlyphDraw2's ClipScore is slightly lower than AnyText's in this setting. It should be noted that the Acc metric here is calculated based on the previously mentioned PWAcc rule.

Comparison results of Complex-Benchmark. To comprehensively evaluate the model's text rendering capabilities, we devised a more sophisticated evaluation set. Specifically, for Chinese, we randomly combined characters from a pool of 2,000 commonly used Chinese characters as the text to be rendered, resulting in a set of 100 prompts. The number of rows and characters per row were also randomly determined, ensuring prompts with a complete sense of randomness. The set of 100 prompts comprises characters with intricate strokes and structures, such as "薯" (potato), "寨" (stockade), and "聚" (gather). Although the number of evaluation samples is limited, they cover a diverse range of frequently encountered Chinese characters, including some structurally complex characters that are rarely represented in the training dataset. Consequently, these prompts provide a robust means to holistically assess the model's Chinese character generation capability. For English text, we selected words with consecutive repeated letters and some longer words for rendering. These words are error-prone, making them persuasive indicators of rendering proficiency for English words. Also, in contrast to the AnyText-Benchmark, we provide a bbox that can hold phrases and sentences, not just single words, which inevitably increases the difficulty of rendering.

In terms of evaluation metrics, we opted for accuracy to gauge the precision of the generated text, ClipScore to assess the alignment between image and text prompts, and HPSv2 to capture human preferences for the generated images.In addition to assessing the text rendering capabilities, it is crucial to validate the overall performance of end-to-end generation. To facilitate a more comprehensive comparative analysis, our research experiments focused on two key aspects: randomly generated bboxes and the utilization of LLM predicted bboxes. This approach allows for a more in-depth evaluation and comparison of the end-to-end text generation functionality.

In the experiments reported in Table 1, all methods use predefined rules with randomly initialized bbox coordinates for the text prompts during image generation. Based on quantitative comparisons, the results indicate that our model outperforms AnyText in terms of text generation accuracy. Except for the slightly lower Chinese ClipScore and HPSv2 compared to AnyText under randomly assigned bbox coordinates, our approach outperforms AnyText in all other metrics. On the complex sentence-level English evaluation set, AnyText's text rendering accuracy is quite low; although GlyphDraw2's accuracy is not high either, it significantly surpasses AnyText.

The second part of the experiment uses a fine-tuned LLM to generate the positions of text bboxes, followed by generating images with text based on these bbox positions. According to the results presented in Table 1, adding LLM-predicted layouts leads to a decrease in text rendering accuracy, because the bbox coordinates generated by random rules tend to enclose larger areas, resulting in higher performance compared to the LLM-predicted scenario. However, in comparison to AnyText, our proposed model still exhibits relatively high accuracy.

Comparison results of Poster-Benchmark. To assess the end-to-end capabilities of our poster generation model, we specifically designed a dedicated dataset for poster evaluation, encompassing a variety of prompt forms for poster generation. This dataset comprises 120 prompts that describe posters in both English and Chinese, enabling the generation of images in various resolutions, including landscape, portrait, and square formats. During image generation, our model utilizes the LLM to predict the positions of text description boxes, facilitating seamless end-to-end poster generation without requiring users to specify text placement. Unlike the AnyText-Benchmark, which only allows single English words in text prompts, our model accommodates complete English sentences, thus facilitating the presentation of the desired text.

The quantitative results of poster generation are presented in Table 1. Similarly, the results reveal that our model attains the highest text rendering accuracy in end-to-end poster generation scenarios, although its ClipScore is still slightly lower.

LLM layout prediction experiment. Firstly, we constructed four tasks according to the difficulty level.

[Figure 4]
  • 1. Input: a caption describing the image containing the glyphs to be rendered and the size of the image to be generated. Output: the glyphs to be rendered and the four coordinate points of the corresponding bbox, with multiple such tuples for multiple positions.

  • 2. Input: a caption describing the image containing the glyphs to be rendered. Output: the glyphs to be rendered and the four normalized coordinate points of the corresponding bbox, with multiple such tuples for multiple positions.

  • 3. Input: a caption describing the image containing the glyphs to be rendered and the size of the image to be generated. Output: the glyphs to be rendered and the two coordinate points (top left and bottom right) of the corresponding bbox, with multiple such tuples for multiple positions.

  • 4. Input: a caption describing the image containing the glyphs to be rendered. Output: the glyphs to be rendered and the two normalized coordinate points (top left and bottom right) of the corresponding bbox, with multiple such tuples for multiple positions.

The first two tasks require predicting four position coordinates, which is the most challenging setting but best matches the requirements. Normalization reduces the task difficulty, but it sacrifices some diversity by shrinking the solution range. The last two tasks lower the fine-tuning difficulty, but similarly sacrifice the diversity of the predicted coordinates, since two corner points constrain the bbox to an axis-aligned rectangle.
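To make the instruction data concrete, one fine-tuning sample for task mode 3 might look like the following; the template wording, caption, and coordinates are purely illustrative and not taken from the actual training set.

```python
# Illustrative instruction-tuning sample for task mode 3
# (input: caption + image size; output: text lines with top-left / bottom-right corners).
sample = {
    "instruction": (
        "Image size: 768x1024. Caption: a spring sale poster with the title "
        '"Spring Flash Sale" and the subtitle "Up to 50% off, March 1-7".'
    ),
    "output": (
        '[{"text": "Spring Flash Sale", "bbox": [[96, 120], [672, 240]]}, '
        '{"text": "Up to 50% off, March 1-7", "bbox": [[160, 300], [608, 360]]}]'
    ),
}
```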

We randomly test 1,000 prompts, using the correctness of the predicted format as the basis for calculating accuracy. Although a correctly predicted format does not necessarily mean the actual rendering position is correct, this kind of error is relatively minor. We select three models for comparison, namely Qwen1.5 (Bai et al. 2023), Baichuan2, and Llama2 (Touvron et al. 2023). Among them, we experiment with three model sizes for Qwen1.5, while the other two models are tested with two sizes each. The experimental results are shown in Fig. 5, where the numerical suffix in each model name denotes the task mode ID. We first observe that the larger the model, the better the fine-tuning results; in addition, tasks with normalized outputs achieve higher accuracy. In the end, we chose the Baichuan2-13B model with the third task mode.

Fig. 4 shows the results after fine-tuning the LLM on our custom evaluation set. The main advantages are seen in three aspects. Firstly, for the poster's title, the model tends to predict a bbox with a relatively large area. Secondly, the continuity of content in adjacent bboxes provides contextual meaning, allowing the model to learn the semantic information required to render the glyphs. Lastly, the size of the bboxes tends to be proportional to the number of characters or words they contain.

[Figure 5]

Ablation Studies

Given the number of ablation experiments and to reduce training costs, we uniformly set the first training phase of each experiment to 20,000 steps and the second phase to 10,000 steps, and evaluate on the Chinese evaluation set. The ablation studies examine four main aspects: 1) the impact of TCA and its specific modules; 2) the impact of AAL; 3) the impact of the fusion text encoder; 4) the impact of ControlNet's condition input.

Table 2: Ablation results on the Chinese evaluation set.

| Model | Acc | NED | ClipScore | HPSv2 |
| --- | --- | --- | --- | --- |
| w/o CAG | 0.7841 | 0.8970 | 0.4058 | 0.2446 |
| w/o CAC | 0.7985 | 0.9024 | 0.3974 | 0.2401 |
| w/o TCA | 0.7802 | 0.8795 | 0.3964 | 0.2405 |
| w/o AAL | 0.8198 | 0.9345 | 0.3884 | 0.2301 |
| w/o FTE | 0.7965 | 0.9010 | 0.4012 | 0.2382 |
| w/o CC | 0.7845 | 0.8975 | 0.4001 | 0.2422 |
| GlyphDraw2 | 0.8058 | 0.9125 | 0.3996 | 0.2412 |

Effectiveness of TCA. TCA adds two CA layers, and here we ablate each added CA layer individually. CAG denotes removing the CA interaction in which the glyph features serve as K and V. Since this CA layer is intended to improve glyph accuracy, removing it, as shown in Table 2, results in a slight drop in accuracy but some improvement in ClipScore and preference score. This indicates that while CAG improves the accuracy of text rendering, it sacrifices some text semantic alignment capability.

CAC denotes removing the adaptive CA interaction that derives features from the ControlNet encoder. Here, all metrics drop slightly, implying that the adaptive feature interaction indeed enhances the accuracy of text rendering as well as text semantic alignment and the preference score.

TCA denotes the ablation of the entire TCA block. Similar to CAC, both accuracy and preference score decrease, further illustrating that the TCA module positively affects both text rendering accuracy and the preference score of the image.

Effectiveness of AAL. As seen in Table 2, this strategy does indeed enhance semantic alignment and image quality to a certain degree, but it also sacrifices some text rendering accuracy. However, the overall impact is still positive.

Effectiveness of FTE. The primary purpose of the FTE is to ensure harmony between the font and the background. As can be observed from the ablation results in Table 2, all metrics are influenced to a certain extent. The FTE incorporates font feature information, which enhances the accuracy of text rendering; however, the fusion of image modalities may weaken the alignment of text semantics, leading to a slight decline in ClipScore. Lastly, the enhanced compatibility between font and background positively affects the preference score.

Effectiveness of ControlNet's condition input. The condition input of ControlNet (CC) mainly affects glyph accuracy: it reduces the influence of the image's descriptive caption on text rendering and thereby improves glyph accuracy to some extent.

Conclusion and Limitation

The high cost and limited availability of manual labeling have presented significant challenges to the practical deployment of glyph generation models. In this study, we first collected high-resolution images containing Chinese and English glyphs and constructed an automatic screening process to build a large-scale dataset. We then established a comprehensive framework that merges text and glyph semantics, leveraging multiple tiers of information to optimize text rendering accuracy and background richness. Empirical results demonstrate that our methodology surpasses existing models on various evaluation sets, suggesting its potential to serve as a foundation for end-to-end poster generation.

Limitation

Although our method can generate posters of arbitrary resolution end to end, some issues remain. Firstly, for the glyph bboxes predicted by the LLM, prediction accuracy is quite low in complex scenarios, such as when a user inputs a paragraph of text without quotation marks marking it as a bbox prompt. Secondly, balancing the richness of background generation and the accuracy of text rendering remains difficult; in our current approach, we prioritize glyph accuracy, so the visual appeal of the background may be weaker. Additionally, the generation accuracy for tiny glyphs or paragraph-level text still needs improvement. In the future, we may explore solutions on the text encoder side to address these issues.

References

  • Bai et al. (2023) Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; Hui, B.; Ji, L.; Li, M.; Lin, J.; Lin, R.; Liu, D.; Liu, G.; Lu, C.; Lu, K.; Ma, J.; Men, R.; Ren, X.; Ren, X.; Tan, C.; Tan, S.; Tu, J.; Wang, P.; Wang, S.; Wang, W.; Wu, S.; Xu, B.; Xu, J.; Yang, A.; Yang, H.; Yang, J.; Yang, S.; Yao, Y.; Yu, B.; Yuan, H.; Yuan, Z.; Zhang, J.; Zhang, X.; Zhang, Y.; Zhang, Z.; Zhou, C.; Zhou, J.; Zhou, X.; and Zhu, T. 2023. Qwen Technical Report. arXiv preprint arXiv:2309.16609.
  • Chen et al. (2023a) Chen, J.; Huang, Y.; Lv, T.; Cui, L.; Chen, Q.; and Wei, F. 2023a. TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering. arXiv preprint arXiv:2311.16465.
  • Chen et al. (2024) Chen, J.; Huang, Y.; Lv, T.; Cui, L.; Chen, Q.; and Wei, F. 2024. Textdiffuser: Diffusion models as text painters. Advances in Neural Information Processing Systems, 36.
  • Chen et al. (2023b) Chen, X.; Liu, Y.; Yang, Y.; Yuan, J.; You, Q.; Liu, L.-P.; and Yang, H. 2023b. Reason out your layout: Evoking the layout master from large language models for text-to-image synthesis. arXiv preprint arXiv:2311.17126.
  • Chen et al. (2023c) Chen, Y.; Pan, Y.; Li, Y.; Yao, T.; and Mei, T. 2023c. Control3d: Towards controllable text-to-3d generation. In Proceedings of the 31st ACM International Conference on Multimedia, 1148–1156.
  • Esser et al. (2024) Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; Müller, J.; Saini, H.; Levi, Y.; Lorenz, D.; Sauer, A.; Boesel, F.; Podell, D.; Dockhorn, T.; English, Z.; Lacey, K.; Goodwin, A.; Marek, Y.; and Rombach, R. 2024. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. arXiv:2403.03206.
  • Feng et al. (2024) Feng, W.; Zhu, W.; Fu, T.-J.; Jampani, V.; Akula, A.; He, X.; Basu, S.; Wang, X. E.; and Wang, W. Y. 2024. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36.
  • Gani et al. (2023) Gani, H.; Bhat, S. F.; Naseer, M.; Khan, S.; and Wonka, P. 2023. Llm blueprint: Enabling text-to-image generation with complex and detailed prompts. arXiv preprint arXiv:2310.10640.
  • Gu et al. (2022) Gu, J.; Meng, X.; Lu, G.; Hou, L.; Minzhe, N.; Liang, X.; Yao, L.; Huang, R.; Zhang, W.; Jiang, X.; et al. 2022. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35: 26418–26431.
  • Ho, Jain, and Abbeel (2020) Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 6840–6851.
  • Jia et al. (2024) Jia, C.; Luo, M.; Dang, Z.; Dai, G.; Chang, X.; Wang, M.; and Wang, J. 2024. SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation. arXiv:2308.10156.
  • Li et al. (2022) Li, C.; Liu, W.; Guo, R.; Yin, X.; Jiang, K.; Du, Y.; Du, Y.; Zhu, L.; Lai, B.; Hu, X.; Yu, D.; and Ma, Y. 2022. PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System. arXiv:2206.03001.
  • Li, Li, and Hoi (2024) Li, D.; Li, J.; and Hoi, S. 2024. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems, 36.
  • Li et al. (2023a) Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023a. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 19730–19742. PMLR.
  • Li et al. (2023b) Li, Z.; Cao, M.; Wang, X.; Qi, Z.; Cheng, M.-M.; and Shan, Y. 2023b. Photomaker: Customizing realistic human photos via stacked id embedding. arXiv preprint arXiv:2312.04461.
  • Lin et al. (2024) Lin, J.; Guo, J.; Sun, S.; Yang, Z.; Lou, J.-G.; and Zhang, D. 2024. LayoutPrompter: Awaken the Design Ability of Large Language Models. Advances in Neural Information Processing Systems, 36.
  • Liu et al. (2024) Liu, Z.; Liang, W.; Liang, Z.; Luo, C.; Li, J.; Huang, G.; and Yuan, Y. 2024. Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering. arXiv preprint arXiv:2403.09622.
  • Loshchilov and Hutter (2017) Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Ma et al. (2023a) Ma, J.; Chen, C.; Xie, Q.; and Lu, H. 2023a. PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation. arXiv preprint arXiv:2311.17086.
  • Ma et al. (2023b) Ma, J.; Liang, J.; Chen, C.; and Lu, H. 2023b. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. arXiv preprint arXiv:2307.11410.
  • Ma et al. (2023c) Ma, J.; Zhao, M.; Chen, C.; Wang, R.; Niu, D.; Lu, H.; and Lin, X. 2023c. Glyphdraw: Learning to draw chinese characters in image synthesis models coherently. arXiv preprint arXiv:2303.17870.
  • Nichol et al. (2021) Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; and Chen, M. 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741.
  • Nie et al. (2024) Nie, W.; Liu, S.; Mardani, M.; Liu, C.; Eckart, B.; and Vahdat, A. 2024. Compositional Text-to-Image Generation with Dense Blob Representations. arXiv preprint arXiv:2405.08246.
  • Podell et al. (2023) Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; and Rombach, R. 2023. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952.
  • Qin et al. (2023) Qin, C.; Zhang, S.; Yu, N.; Feng, Y.; Yang, X.; Zhou, Y.; Wang, H.; Niebles, J. C.; Xiong, C.; Savarese, S.; et al. 2023. Unicontrol: A unified diffusion model for controllable visual generation in the wild. arXiv preprint arXiv:2305.11147.
  • Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763. PMLR.
  • Ramesh et al. (2022) Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; and Chen, M. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2): 3.
  • Rombach et al. (2022) Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.
  • Ronneberger, Fischer, and Brox (2015) Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, 234–241. Springer.
  • Saharia et al. (2022) Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E. L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35: 36479–36494.
  • Schuhmann et al. (2021) Schuhmann, C.; Vencu, R.; Beaumont, R.; Kaczmarczyk, R.; Mullis, C.; Katta, A.; Coombes, T.; Jitsev, J.; and Komatsuzaki, A. 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114.
  • Suvorov et al. (2022) Suvorov, R.; Logacheva, E.; Mashikhin, A.; Remizova, A.; Ashukha, A.; Silvestrov, A.; Kong, N.; Goka, H.; Park, K.; and Lempitsky, V. 2022. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2149–2159.
  • Touvron et al. (2023) Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; Bikel, D.; Blecher, L.; Ferrer, C. C.; Chen, M.; Cucurull, G.; Esiobu, D.; Fernandes, J.; Fu, J.; Fu, W.; Fuller, B.; Gao, C.; Goswami, V.; Goyal, N.; Hartshorn, A.; Hosseini, S.; Hou, R.; Inan, H.; Kardas, M.; Kerkez, V.; Khabsa, M.; Kloumann, I.; Korenev, A.; Koura, P. S.; Lachaux, M.-A.; Lavril, T.; Lee, J.; Liskovich, D.; Lu, Y.; Mao, Y.; Martinet, X.; Mihaylov, T.; Mishra, P.; Molybog, I.; Nie, Y.; Poulton, A.; Reizenstein, J.; Rungta, R.; Saladi, K.; Schelten, A.; Silva, R.; Smith, E. M.; Subramanian, R.; Tan, X. E.; Tang, B.; Taylor, R.; Williams, A.; Kuan, J. X.; Xu, P.; Yan, Z.; Zarov, I.; Zhang, Y.; Fan, A.; Kambadur, M.; Narang, S.; Rodriguez, A.; Stojnic, R.; Edunov, S.; and Scialom, T. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.
  • Tuo et al. (2023) Tuo, Y.; Xiang, W.; He, J.-Y.; Geng, Y.; and Xie, X. 2023. AnyText: Multilingual Visual Text Generation And Editing. arXiv preprint arXiv:2311.03054.
  • Wang et al. (2024) Wang, Q.; Bai, X.; Wang, H.; Qin, Z.; and Chen, A. 2024. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519.
  • Wu et al. (2023) Wu, X.; Hao, Y.; Sun, K.; Chen, Y.; Zhu, F.; Zhao, R.; and Li, H. 2023. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. arXiv:2306.09341.
  • Yang et al. (2023) Yang, A.; Xiao, B.; Wang, B.; Zhang, B.; Bian, C.; Yin, C.; Lv, C.; Pan, D.; Wang, D.; Yan, D.; Yang, F.; Deng, F.; Wang, F.; Liu, F.; Ai, G.; Dong, G.; Zhao, H.; Xu, H.; Sun, H.; Zhang, H.; Liu, H.; Ji, J.; Xie, J.; Dai, J.; Fang, K.; Su, L.; Song, L.; Liu, L.; Ru, L.; Ma, L.; Wang, M.; Liu, M.; Lin, M.; Nie, N.; Guo, P.; Sun, R.; Zhang, T.; Li, T.; Li, T.; Cheng, W.; Chen, W.; Zeng, X.; Wang, X.; Chen, X.; Men, X.; Yu, X.; Pan, X.; Shen, Y.; Wang, Y.; Li, Y.; Jiang, Y.; Gao, Y.; Zhang, Y.; Zhou, Z.; and Wu, Z. 2023. Baichuan 2: Open Large-scale Language Models. arXiv:2309.10305.
  • Yang et al. (2024) Yang, Y.; Gui, D.; Yuan, Y.; Liang, W.; Ding, H.; Hu, H.; and Chen, K. 2024. GlyphControl: Glyph Conditional Control for Visual Text Generation. Advances in Neural Information Processing Systems, 36.
  • Ye et al. (2023) Ye, H.; Zhang, J.; Liu, S.; Han, X.; and Yang, W. 2023. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.
  • Yu et al. (2023) Yu, C.; Zhou, Q.; Li, J.; Zhang, Z.; Wang, Z.; and Wang, F. 2023. Points-to-3d: Bridging the gap between sparse points and shape-controllable text-to-3d generation. In Proceedings of the 31st ACM International Conference on Multimedia, 6841–6850.
  • Zavadski, Feiden, and Rother (2023) Zavadski, D.; Feiden, J.-F.; and Rother, C. 2023. ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models. arXiv preprint arXiv:2312.06573.
  • Zhang et al. (2023a) Zhang, L.; Chen, X.; Wang, Y.; Lu, Y.; and Qiao, Y. 2023a. Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model. arXiv:2312.12232.
  • Zhang, Rao, and Agrawala (2023) Zhang, L.; Rao, A.; and Agrawala, M. 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3836–3847.
  • Zhang et al. (2023b) Zhang, T.; Zhang, Y.; Vineet, V.; Joshi, N.; and Wang, X. 2023b. Controllable text-to-image generation with gpt-4. arXiv preprint arXiv:2305.18583.
  • Zhao and Lian (2023) Zhao, Y.; and Lian, Z. 2023. UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models. arXiv preprint arXiv:2312.04884.

Appendix

GlyphDraw2 Dataset

Dataset building strategy

To ensure high-resolution images in the dataset, we apply resolution filtering, retaining only images whose resolution exceeds 1024×1024 and whose shorter side is at least 768 pixels. To extract clean glyph-containing images from a vast amount of data, we implement the following filtering rules (a sketch of this filtering follows the list):
(1) Only bboxes whose OCR recognition confidence for the individual text exceeds 0.8 are retained.
(2) Only text bboxes with fewer than 15 characters are retained, and each image is limited to a maximum of ten bboxes.
(3) Bboxes whose center falls within 5% of the image boundaries are excluded, eliminating the influence of bottom watermarks.
(4) The center of each bbox must be at least 15% away from the image boundaries in at least one direction.
(5) Only bboxes whose area per character exceeds 2000 square pixels are kept.
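A minimal Python sketch of how these rules might be applied is shown below. It assumes each OCR box is a dict with `bbox`, `text`, and `conf` fields, and the resolution check and rule (4) reflect one plausible reading of the criteria above; the function names are illustrative.

```python
def keep_ocr_box(box, img_w, img_h):
    """Bbox-level filters; `box` = {"bbox": (x0, y0, x1, y1), "text": str, "conf": float}."""
    x0, y0, x1, y1 = box["bbox"]
    text, conf = box["text"], box["conf"]
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2

    if conf <= 0.8:                                   # rule (1): OCR confidence
        return False
    if len(text) >= 15:                               # rule (2): character count
        return False
    # rule (3): drop boxes whose center lies within 5% of any image border (watermarks)
    if min(cx / img_w, 1 - cx / img_w, cy / img_h, 1 - cy / img_h) < 0.05:
        return False
    # rule (4), one reading: on at least one axis the center is >= 15% away from both borders
    ok_x = 0.15 <= cx / img_w <= 0.85
    ok_y = 0.15 <= cy / img_h <= 0.85
    if not (ok_x or ok_y):
        return False
    # rule (5): average area per character must exceed 2000 square pixels
    if (x1 - x0) * (y1 - y0) / max(len(text), 1) <= 2000:
        return False
    return True


def keep_image(img_w, img_h, boxes):
    """Image-level filters: resolution (one reading of the criterion) and at most ten boxes."""
    if img_w * img_h <= 1024 * 1024 or min(img_w, img_h) < 768:
        return False
    kept = [b for b in boxes if keep_ocr_box(b, img_w, img_h)]
    return 0 < len(kept) <= 10
```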

For the poster dataset, we employ refined processing techniques to ensure high quality. In addition to the aforementioned filtering criteria, aesthetic scoring is performed on the poster data, and LaMa restoration is applied to small glyph regions using added masks.
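The masking step could look roughly like the sketch below: bboxes of small glyphs are turned into a binary mask that is handed to an inpainting model (LaMa in our pipeline). The threshold, data layout, and `lama_inpaint` call are placeholders, not an actual API.

```python
import numpy as np


def small_glyph_mask(img_h, img_w, boxes, area_thresh=2000):
    """Binary mask over glyph bboxes whose area per character falls below the threshold."""
    mask = np.zeros((img_h, img_w), dtype=np.uint8)
    for (x0, y0, x1, y1), text in boxes:
        if (x1 - x0) * (y1 - y0) / max(len(text), 1) < area_thresh:
            mask[y0:y1, x0:x1] = 255
    return mask


# cleaned = lama_inpaint(image, small_glyph_mask(h, w, ocr_boxes))  # hypothetical call
```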

Statistics and Comparison

The statistics of our general dataset and poster dataset are shown below.

Dataset            # Samples    # Chars       # Unique Chars
General dataset    3,726,397    35,741,039    5,631
Poster dataset     1,321,943    19,160,954    5,496
Total              5,048,340    54,901,993    5,742

Dataset            # Samples    # Chars       # Words      # Unique Chars
General Dataset    1,929,981    13,870,183    2,339,693    339,055

[Figure 6: distribution of the number of words per image in the general dataset.]

Figure 6 shows the distribution of the number of words per image in the general dataset. Images with a single word account for the largest share, around 60%, and the proportion gradually decreases as the number of words per image increases toward 10.

[Figure 7: distribution of the number of text boxes in the general dataset and the poster dataset (two panels).]

Figure 7 analyzes the distribution of the number of text boxes in the general dataset and the poster dataset. In the general dataset, most images contain 1-3 text boxes, a few contain 4-5, and 6 or more are relatively rare. In contrast, the poster dataset shows a more diverse distribution: images with 10 text boxes account for approximately 11.55%, the third-largest share, and the proportions for 5 to 9 text boxes are relatively balanced. This richer diversity in the number of text boxes may be advantageous for training models to generate posters with more varied layouts.

[Figure 8: the 100 most frequent Chinese characters in the general dataset and the poster dataset (two panels).]

In addition, Figure 8 illustrates the 100 most frequent Chinese characters in the general dataset and the poster dataset. The two datasets differ only subtly in their most common characters, but the poster dataset shows a more concentrated character-frequency distribution, whereas the general dataset exhibits greater variance in character frequency.
