WordArt Designer: User-Driven Artistic Typography Synthesis using Large Language Models

This paper introduces WordArt Designer, a user-driven framework for artistic typography synthesis, relying on the Large Language Model (LLM). The system incorporates four key modules: the LLM Engine, SemTypo, StyTypo, and TexTypo modules. 1) The LLM Engine, empowered by the LLM (e.g., GPT-3.5), interprets user inputs and generates actionable prompts for the other modules, thereby transforming abstract concepts into tangible designs. 2) The SemTypo module optimizes font designs using semantic concepts, striking a balance between artistic transformation and readability. 3) Building on the semantic layout provided by the SemTypo module, the StyTypo module creates smooth, refined images. 4) The TexTypo module further enhances the design's aesthetics through texture rendering, enabling the generation of inventive textured fonts. Notably, WordArt Designer highlights the fusion of generative AI with artistic typography. Experience its capabilities on ModelScope: https://www.modelscope.cn/studios/WordArt/WordArt.


Introduction
Typography, a critical intersection of language and design, finds extensive applications across various domains like advertising (Cheng et al., 2016, 2017a,b; Sun et al., 2018), early childhood education (Vungthong et al., 2017), and historical tourism (Amar et al., 2017). Despite its widespread relevance, mastering typography design remains an intricate task for non-professional designers. Although attempts have been made to bridge this gap between amateur designers and typography (Iluz et al., 2023; Tanveer et al., 2023), existing solutions mainly generate semantically coherent and visually pleasing typography within predefined concepts. These solutions often lack adaptability, creativity, and computational efficiency.
To overcome these limitations, we introduce WordArt Designer (Fig. 1), a system composed of four primary modules: the LLM Engine, SemTypo Module, and StyTypo Module, supplemented by the TexTypo Module for texture rendering. This user-focused system allows users to define their design needs, including design concepts and domains. The system consists of:
1. LLM Engine: Based on the power of the LLM (e.g., GPT-3.5), this engine interprets user input and produces prompts for the SemTypo, StyTypo, and TexTypo modules.
The workflow, illustrated in Fig. 1, begins with the LLM module interpreting user input. The output of each module serves as the input for the next, with the final design decision made by the TexTypo module. This sequence ensures the final design aligns with the user's intent and maintains a unique aesthetic appeal. This design process is iterative, involving constant interaction between the user's input and the system's modules. This user-centered approach supports the creation of high-quality WordArt designs (see Fig. 2), making it an effective tool in creative design-dependent industries, such as food and jewelry.
Extensive experiments on WordArt Designer have validated its creativity, artistic expression, and expandability to different languages. The inclusion of a ranking model significantly improves the success rate and overall quality of stylized images, ensuring the production of high-quality WordArt designs.
In essence, WordArt Designer provides a creative, artistic, and fully automated solution for generating word art. Our research not only lays the groundwork for future text synthesis studies but also introduces numerous practical applications. WordArt Designer can be employed in various areas, including media propaganda and product design, enhancing the efficiency and effectiveness of these systems, thereby making them more practical for everyday use.

Related work
LLMs and their Apps. Large Language Models (LLMs) have been progressively improved and utilized in a wide range of applications (Anil et al., 2023; Raffel et al., 2020; Shoeybi et al., 2019; Rajbhandari et al., 2020; Devlin et al., 2019; Cheng et al., 2023). The outstanding performance exhibited by ChatGPT and the GPT series (Radford et al., 2018; Brown et al., 2020; OpenAI, 2023) has stimulated the widespread use of LLMs. These models are adept at learning context from simple prompts, leading to their increasing use as the controlling component in intelligent systems (Wu et al., 2023; Shen et al., 2023). Building on these insights, WordArt Designer incorporates the LLM to enhance system creativity and diversity.
Text Synthesis. While significant progress has been made in image synthesis, integrating legible text into images remains challenging (Rombach et al., 2022; Saharia et al., 2022). Some solutions, such as eDiff-I (Balaji et al., 2022) and DeepFloyd (Lab, 2023), employ robust LLMs, such as T5 (Raffel et al., 2020), for improved visual text generation. Recent studies (Yang et al., 2023; Ma et al., 2023) have also looked into using glyph images as extra control conditions, while others like DS-Fusion (Tanveer et al., 2023) introduce additional constraints to synthesize more complex text forms, such as hieroglyphics.
Image Synthesis. The surge in demand for personalized image synthesis has spurred advances in interactive image editing (Meng et al., 2022; Gal et al., 2023; Brooks et al., 2022; Zhao et al., 2018) and techniques incorporating additional conditions, such as masks and depth maps (Rombach et al., 2022; Huang et al., 2020). New research (Zhang and Agrawala, 2023; Mou et al., 2023; Huang et al., 2023) is exploring multi-condition controllable synthesis. For instance, ControlNet (Zhang and Agrawala, 2023) learns task-specific conditions end-to-end, providing more nuanced control over the synthesis process.

WordArt Designer
The WordArt Designer system utilizes an assortment of typography synthesis modules, propelled by a Large Language Model (LLM) such as GPT-3.5, facilitating an interactive, user-centered design process. As illustrated in Fig. 3, users define their design needs, including design concepts and domains, e.g., "A cat in jewelry design." The LLM engine interprets the input, generating prompts to guide the SemTypo, StyTypo, and TexTypo modules, thus executing the user's design vision.
To achieve automated WordArt design, we introduce a quality assessment feedback mechanism, which is vital for successful synthesis. The output from the ranking model is evaluated by the LLM engine to validate the quality of the synthesized image, ensuring the creation of at least K qualified transformations. If this criterion is not met, the LLM engine, along with the SemTypo and StyTypo modules and format directives, is restarted for another design iteration. Subsequent sections will delve into the details of each module's functionality and operation.
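The feedback mechanism described above can be pictured as a retry loop. The following is a minimal sketch under stated assumptions, not the system's actual control flow: `generate` and `score` are hypothetical stand-ins for the SemTypo/StyTypo pipeline and the ranking model, and the quality `threshold` and the `max_rounds` cap are illustrative parameters that the paper does not specify. Each round is assumed to start from fresh candidates.

```python
from typing import Callable, List, Tuple


def design_until_qualified(
    generate: Callable[[int], List[str]],   # hypothetical: round index -> candidate images
    score: Callable[[str], float],          # hypothetical: ranking-model score of a candidate
    k: int,                                 # required number of qualified transformations (K)
    threshold: float,                       # assumed quality cut-off on the ranking score
    max_rounds: int = 5,                    # assumed cap so the loop always terminates
) -> Tuple[List[str], int]:
    """Rerun generation until at least k candidates pass the quality check."""
    qualified: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        candidates = generate(round_idx)
        # Rank candidates from best to worst, then keep those above the cut-off.
        ranked = sorted(candidates, key=score, reverse=True)
        qualified = [c for c in ranked if score(c) >= threshold]
        if len(qualified) >= k:
            return qualified[:k], round_idx
    # Criterion never met: return whatever the last round produced.
    return qualified, max_rounds


# Toy usage with canned candidates and scores.
toy_scores = {"a1": 0.2, "a2": 0.6, "b1": 0.9, "b2": 0.7, "b3": 0.1}
toy_rounds = {1: ["a1", "a2"], 2: ["b1", "b2", "b3"]}
best, rounds_used = design_until_qualified(
    lambda r: toy_rounds.get(r, []), toy_scores.__getitem__, k=2, threshold=0.5
)
```

In this toy run the first round yields only one qualified candidate, so a second round is triggered.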

LLM Engine
The Large Language Model (LLM) engine is a crucial component of WordArt Designer. It serves as a knowledge engine and concretizes abstract notions, like "vegetables" and "fruit", into texture prompts in the context of food, for the eventual synthesis of the artistic text. For most concrete nouns, such as "cat", "dog", "flower", etc., semantic typography can be successfully generated. However, for abstract nouns and verbs, such as "winter", "hit", etc., users often fail to provide the desired descriptions. The reason is that images of abstract concepts tend to depict highly complex scenes, which are not suitable for our WordArt Designer system.
To address this issue, we employ the LLM to render abstract concepts into representative objects that can be easily converted. Specifically, we can build our LLM engine using models like GPT-3.5 and other LLMs, all of which have in-context learning capabilities. The prompts for input parsing, stylization, and texture rendering are generated as:
$$A_{inp} = \mathcal{M}(Q_{inp}), \quad A_{sty} = \mathcal{M}(Q_{sty}), \quad A_{tex} = \mathcal{M}(Q_{tex}),$$
where $\mathcal{M}$ denotes the LLM engine, $A_{inp}$, $A_{sty}$, and $A_{tex}$ are its outputs, and $Q_{inp}$, $Q_{sty}$, and $Q_{tex}$ represent the standard prompts for input parsing, stylization, and texture rendering, respectively. $Q_{sty}$ and $Q_{tex}$ are built using formatted prompt templates with concepts derived from the input parsing. The LLM engine has ample capabilities to imbue our system with a creative and engaging "soul", ensuring the quality of artistic text synthesis. We provide detailed templates and full examples of prompts in Appendix A.
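As an illustration of how formatted prompt templates might be filled with concepts obtained from input parsing, consider the sketch below. The template strings are abbreviated from the examples shown in Fig. 3 (their elided tails are left out rather than guessed), and the `llm` callable is a hypothetical stand-in for the LLM engine that is assumed to return a single representative object name. The full templates are those of Appendix A, not the ones here.

```python
from typing import Callable, Dict

# Input-parsing query, as shown in Fig. 3.
Q_INP = (
    'Please list the representative category or object name in/of "{word}" '
    "including in real-life, artist, and film works."
)
# Stylization and texture templates; the remainder of each is elided in Fig. 3
# and is therefore omitted here.
Q_STY = 'A black-and-white drawing of "{obj}"'
Q_TEX = '"{material}" texture, 8k'


def build_prompts(concept: str, domain: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Concretize the concept and domain via the LLM, then fill the templates."""
    obj = llm(Q_INP.format(word=concept))       # e.g. "cat" -> "Hello kitty"
    material = llm(Q_INP.format(word=domain))   # e.g. "jewelry" -> "gold"
    return {"sty": Q_STY.format(obj=obj), "tex": Q_TEX.format(material=material)}


# Toy usage with a canned "LLM" that looks answers up in a dictionary.
canned = {Q_INP.format(word="cat"): "Hello kitty", Q_INP.format(word="jewelry"): "gold"}
prompts = build_prompts("cat", "jewelry", canned.__getitem__)
```

A real engine would return several candidate objects per query; picking one per query is a simplification made for this sketch.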

SemTypo Module
The Semantic Typography (SemTypo) module alters typographies based on a given semantic concept.It unfolds in three stages: (1) Character Extraction and Parameterization, (2) Region Selection for Transformation, and (3) Semantic Transformation and Differentiable Rasterization.
Character Parameterization. The first stage, as displayed in Fig. 3, starts by transforming the natural language input into a JSON format, specifying the characters to modify, the semantic concept, and the application domain. The FreeType font library (David Turner et al., 1996) is then employed to extract character contours and convert them into cubic Bézier curves characterized by a trainable set of parameters. For characters with surplus control points, a subdivision routine fine-tunes the control points $\theta$, using a differentiable vector graphic rasterization scheme (Iluz et al., 2023).
Region Selection. Our unique contribution is the region-based transformation method, the second stage of the SemTypo module. This approach facilitates the selective transformation of certain character segments, effectively reducing distortions that typically affect typography generation in languages with single-character words. We choose to transform a random contiguous subset of control points within a character, instead of the entire character. We establish a splitting threshold of 20 pixels, with the size of the control-point subset randomly determined within the range [500, min(1000, control point count)], starting from a random point.
In contrast to previous methods, such as that of Iluz et al. (2023), which used extra loss terms with inadequate success to maintain the legibility of the synthesized typography, our method only involves loss computation from the transformed sections of the characters. This approach increases efficiency and guarantees careful manipulation of character shapes, thus improving transformation quality.
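The region selection step can be sketched as picking a random contiguous run of control-point indices. This is a minimal illustration under assumptions: the function name and arguments are hypothetical, the contour is treated as closed so the run may wrap around, and when a character has fewer control points than the lower bound the bound is clamped to the available count (a case the text does not discuss).

```python
import random
from typing import List, Optional


def select_region(
    num_points: int,
    lo: int = 500,
    hi: int = 1000,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return indices of a random contiguous subset of control points."""
    if num_points <= 0:
        raise ValueError("num_points must be positive")
    rng = rng or random.Random()
    upper = min(hi, num_points)       # upper bound: min(1000, control point count)
    lower = min(lo, upper)            # assumption: clamp when the character is small
    length = rng.randint(lower, upper)
    start = rng.randrange(num_points)  # run starts at a random point
    # Assumption: the contour is closed, so indices wrap around modulo num_points.
    return [(start + i) % num_points for i in range(length)]


# Only the returned indices would be treated as trainable; the rest stay fixed.
region = select_region(1200, rng=random.Random(0))
```

Keeping the remaining control points frozen is what restricts the loss computation to the transformed section of the character.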
Transformation and Rasterization. In the final stage, the parameters are transformed and rasterized through the Differentiable Vector Graphics (DiffVG) scheme (Li et al., 2020). As shown in Fig. 4, the transformed glyph image $I_{sem}$ is created from the trainable parameters $\theta$ of the SVG-format character input, using the DiffVG rasterizer $\phi(\cdot)$. A segment of the chosen character $x$ is optimized and cropped to yield an augmented image batch $X_{aug}$ (Frans et al., 2022). The semantic concept $S$ and the augmented image batch $X_{aug}$ are both input into a vision-language backbone model to compute the loss for parameter optimization. The Score Distillation Sampling (SDS) loss is applied to the latent code $z$, as per the DreamFusion method (Poole et al., 2023):
$$\nabla_{\theta} \mathcal{L}_{SDS} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat{\epsilon}(z_t; S, t) - \epsilon\big)\, \frac{\partial z}{\partial X_{aug}} \frac{\partial X_{aug}}{\partial \theta} \right].$$
Here, $\hat{\epsilon}$ is the noise predicted by the backbone, $t \in \{1, 2, \ldots, T\}$ is uniformly sampled to define a noised latent code $z_t = \alpha_t z + \sigma_t \epsilon$, with $\epsilon \sim \mathcal{N}(0, 1)$, and $\alpha_t$, $\sigma_t$ act as noise schedule regulators at time $t$. The multiplier $w(t)$ is a constant that depends on $\alpha_t$. This revised process refines expression and amplifies the variety of output.
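To make the SDS update more concrete, the sketch below computes the latent-space part of the gradient, i.e. the weighted difference between predicted and injected noise, before it is back-propagated through the encoder and the rasterizer. It uses NumPy only; `denoiser` is a hypothetical callable standing in for the pretrained diffusion backbone (conditioning on the concept and timestep is folded into it), and the variance-preserving relation between the two schedule values in the usage lines is an assumption made for the toy example.

```python
import numpy as np


def sds_latent_grad(z, denoiser, alpha_t, sigma_t, w_t, rng):
    """Weighted noise residual w(t) * (eps_hat - eps) for one sampled timestep."""
    eps = rng.standard_normal(z.shape)    # injected Gaussian noise
    z_t = alpha_t * z + sigma_t * eps     # noised latent code
    eps_hat = denoiser(z_t)               # backbone's noise prediction (hypothetical)
    return w_t * (eps_hat - eps)


# Toy usage: an "oracle" denoiser that recovers the injected noise exactly,
# so the residual, and hence the update, vanishes.
rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))
sigma_t = 0.6
alpha_t = np.sqrt(1.0 - sigma_t**2)       # assumed variance-preserving schedule
oracle = lambda z_t: (z_t - alpha_t * z) / sigma_t
grad = sds_latent_grad(z, oracle, alpha_t, sigma_t, 1.0, rng)
```

In a full pipeline this residual would be treated as a constant and multiplied by the Jacobians of the latent code with respect to the augmented images and of the images with respect to the control points, which automatic differentiation through the rasterizer would supply.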

StyTypo Module
The Stylization Typography (StyTypo) module's main purpose is to generate smoother and more detailed images, enhancing the semantic layout image $I_{sem}$. To speed up $I_{sty}$ generation, we use short iteration settings. However, this approach might lead to a lack of smoothness and details. To overcome these potential drawbacks, the StyTypo module introduces two main components: (1) stylized image generation, and (2) stylized image ranking and selection.
Stylized Image Generation. The Latent Diffusion Model (LDM) (Rombach et al., 2022) has gained attention for its ability to generate images based on given input shapes. Therefore, we employ the LDM's depth2image methodology to stylize typographic layouts, enhancing smoothness and infusing additional detail to create a unique "sketch" for texture rendering.

TexTypo Module
To advance the styling capacities of the Stylization Typography (StyTypo) module, we adapted ControlNet (Zhang and Agrawala, 2023) for the purpose of texture rendering, resulting in the creation of the Texture Typography (TexTypo) module.
As can be seen in Fig. 6, ControlNet's original control conditions relied heavily on the Canny Edge and Depth data.This constraint tended to produce fonts that were lacking in creativity and artistic flair.
To counter this, we introduced Scribble conditions as an alternate control condition into ControlNet, which encourage the generation of more creatively textured fonts without compromising on readability.Furthermore, to cater to a range of industrial sectors, we have reconfigured ControlNet to incorporate pre-trained stable diffusion models that are relevant to different fields.These include, but are not limited to, commercial advertising, fashion design, gaming interfaces, tech products, and artistic creations.
Technically, we provide the ControlNet parameters with the conditions Canny Edge, Depth, and Scribble, as well as the original font images. The TexTypo model receives these parameters and generates the textured font image from $A_{tex}$ and $P_{cond}$, where $A_{tex}$ represents the prompts synthesized by the LLM engine $\mathcal{M}$, and $P_{cond}$ stands for the control parameters, resulting in a creatively rendered textured font as the output.
Figure 7: Results showcasing the adaptability of the WordArt Designer. The first row targets the concept of "food", which is further specified to "candy", "pasta", "cheese", "fruits", "bread", "vegetables" or "chocolate". The second row targets "jewelry", concretized to "jewels", "gold" or "jade". The variety of styles highlighted underscores WordArt Designer's versatility in creating unique artistic typography, pushing past traditional design boundaries.

Experiments
Creativity & Artistic Ability. We operationalize the concept of texture rendering to evaluate the Creativity and Artistic Ability of the WordArt Designer. The outcomes are demonstrated in Fig. 7. The first row of art words is generated by embodying the concept "food", which is further specified to "candy", "pasta", "cheese", "fruits", "bread", "vegetables" or "chocolate". The second row represents the concept "jewelry", concretized to "jewels", "gold" or "jade". The smart and reasonable texture rendering contributes to the creativity and artistic appeal of the output.
Expandability to Different Languages. Our SemTypo module, grounded on differentiable rasterization, is theoretically compatible with all types of languages. Beyond Chinese (i.e., hieroglyphs), we explore the expandability of WordArt Designer with the representative language, English (i.e., the Latin alphabet); see Fig. 8.
Figure 9: Various notable applications of our WordArt Designer, including art word poster creation (row 1) and urban master plan design (row 2). Note that revAnimated is used as the base LDM (Rombach et al., 2022). For rows 1-2, we further apply the LoRA models Blindbox and MasterPlan, respectively.

Application
WordArt Image. We experiment with various application possibilities for WordArt Designer. The results, exhibited in Fig. 9, are representative and not cherry-picked. WordArt Designer exhibits promising potential in areas like art word poster design and even city planning. We are confident that WordArt Designer will bring innovative inspiration to professional designers.
WordArt Animation. We also utilize ControlVideo (Zhang et al., 2023) to synthesize art word videos, illustrating the transformation of the word/character. The Chinese characters for "Bamboo" and "Flower" are used in the video generation process, with the "Van Gogh's painting" style applied to the animations, proving useful for Chinese

Ethical Considerations
Potential ethical concerns include perpetuating cultural stereotypes due to the use of certain imagery or symbols in the process of artistic transformations, or introducing bias against under-represented cultures. Another issue could be the potential inclusion of copyrighted graphics. Users need to pay attention to these issues to ensure responsible and respectful use of the system.

Conclusion
This paper presents WordArt Designer, a framework that harnesses Large Language Models (LLMs), such as GPT-3.5, to automatically generate multilingual artistic typography. This system uses an LLM engine to parse and translate user input into directives, guiding three modules, each accountable for different aspects of the typographic design. The superior performance of WordArt Designer highlights the potential of AI to augment artistic typography. Future work aims to further explore the possibilities of integrating this technology into other aspects of design, such as graphics and interactive media.

Figure 1: Demonstration of WordArt Designer: Leveraging the power of the LLM (e.g., GPT-3.5), it integrates four modules (LLM Engine, SemTypo, StyTypo, TexTypo) to transform user inputs into visually striking and semantically rich multilingual typographic designs. It democratizes the art of typography design, enabling non-professionals to realize their creative visions.

2. SemTypo Module: The SemTypo Module transforms typography based on a provided semantic concept. It involves a three-step process, including Character Extraction and Parameterization, Region Selection for Transformation, and Semantic Transformation & Differentiable Rasterization.

Figure 2: Examples of artistic typography generated by WordArt Designer. These instances demonstrate the system's ability to produce aesthetically pleasing, semantically coherent, and stylistically diverse typographic designs.
3. StyTypo Module: The StyTypo Module generates smoother, more detailed images based on the semantic layout image provided by the SemTypo module.
4. TexTypo Module: The TexTypo Module modifies ControlNet for texture rendering, ensuring creativity while preserving legibility.
Example prompts shown in Fig. 3. Input parsing: Please list the representative category or object name in/of "cat" including in real-life, artist, and film works. Please list the representative category or object name in/of "jewelry" including in real-life, artist, and film works. … StyTypo (formatted prompt): A black-and-white drawing of "Hello kitty", … TexTypo (formatted prompt): "gold" texture, 8k …

Figure 3: The architectural framework of the proposed WordArt Designer system. This structure involves an LLM engine, the SemTypo module for Semantic Typography, the StyTypo module for Stylization Typography, and the TexTypo module for Texture Typography. These modules operate coherently, guided by a preset control flow, to facilitate a seamless and innovative transformation of text into artistic typography.

Figure 4: Differentiable rasterization scheme of semantic typography. The character stroke inside the purple box is the selected part for optimization.
Figure 5: The top-row images, generated by the SemTypo module, provide a comprehensive object representation despite lacking smoothness and detail. After being processed by the StyTypo module, the stylized images in the lower row display an abundance of detail and inventive renderings for each semantic image input.

Figure 6: Comparison between Canny Edge and Scribble conditions for ControlNet texture rendering. The first row is generated using the Canny Edge condition, while the rest are from the Scribble condition.
Figure 8: Chinese characters and their corresponding English art words. "狐" (Fox)
education. Please refer to Fig. 10 for additional animations.

Figure 10: Art word animations derived from the SemTypo optimization process. CLICK the image to PLAY ANIMATION! Best viewed with Adobe Acrobat DC.

Table 1: Ablation study of the ranking model on the validation set. 'p', 'r', and 's' stand for precision, recall, and success rate, respectively. 'X' in 'TopX' indicates the number of stylized images retained. In the ranking-based method, 'TopX' are selected based on ranking scores, while for the random-based method, 'TopX' are selected randomly. Results of the random-based method are obtained by averaging over 10,000 iterations. Increased values are indicated in blue.