Generative artificial intelligence faces a crucial challenge in balancing autonomy and controllability. While significant progress has been made in autonomy through powerful generative models, controllability has become a focal point for machine learning researchers. Text-based control has become particularly critical as natural language provides an intuitive interface for human-machine interaction. This approach has driven remarkable advancements in image editing, audio synthesis, and video generation.
In recent years, text-to-data generative models, especially those using diffusion techniques, have achieved impressive results by leveraging semantic insights from large datasets of data-text pairs. However, these models face significant obstacles in low-resource environments where obtaining sufficient text-paired data becomes extremely expensive or complex. These challenges are especially prominent in key areas like molecular data, motion capture, and time series, where adequate text labels are often lacking, limiting supervised learning capabilities and hindering the deployment of advanced generative models. These limitations often result in poor generation quality, model overfitting, bias, and insufficient output diversity, highlighting the substantial challenge of optimizing text representation for better data alignment in data-limited scenarios.
To address these challenges in low-resource contexts, several mitigation methods have emerged, each with its inherent limitations. Data augmentation techniques often fail to accurately align synthetic data with original text descriptions and risk overfitting while increasing computational demands for diffusion models. Semi-supervised learning struggles with the inherent ambiguity of textual data, making correct interpretation of unlabeled samples challenging. Transfer learning, while helpful for limited datasets, is often affected by catastrophic forgetting, where models forget previously acquired knowledge when adapting to new text descriptions.
Given these methodological shortcomings, designing more robust approaches for text-to-data generation in low-resource environments has become essential. In this paper, researchers from Salesforce AI Research introduce Text2Data, a technique that employs a diffusion-based framework to enhance text-to-data controllability in low-resource settings through a two-phase approach.
First, Text2Data uses an unsupervised diffusion model to learn the underlying data distribution from unlabeled data, avoiding the semantic ambiguity common in semi-supervised methods. Next, it performs controllable fine-tuning on the text-annotated data without expanding the training dataset. Instead, Text2Data adopts a constrained-optimization learning objective that prevents drastic parameter changes, effectively avoiding catastrophic forgetting.
This framework effectively combines labeled and unlabeled data, preserving the fine-grained data distribution while achieving strong controllability. Theoretical analysis justifies the optimization constraint and establishes generalization bounds, and comprehensive experiments across three modalities demonstrate Text2Data's advantages in generation quality and controllability over baseline methods.
Text2Data frames controllable data generation as learning the conditional distribution pθ(x|c), where limited paired data makes direct optimization difficult. The framework operates in two distinct phases. In the first phase, it leverages the more abundant unlabeled data to learn the marginal distribution pθ(x), yielding optimal parameters θ̂. This exploits the mathematical relationship between the marginal and conditional distributions: pθ(x) is the expectation of pθ(x|c) over the text distribution. Subsequently, Text2Data fine-tunes these parameters on the available data-text pairs while using constrained optimization to keep the updated parameters θ̂′ within the parameter set that preserves the previously learned distribution.
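In notation, the relationship exploited by the first phase, and that phase's training objective, can be written as follows (a schematic formulation consistent with the description above, not the paper's exact diffusion losses):

```latex
% Marginal as the expectation of the conditional over text prompts c
p_\theta(x) \;=\; \mathbb{E}_{c \sim p(c)}\!\left[\, p_\theta(x \mid c) \,\right]

% Phase 1: learn the marginal distribution from all (largely unlabeled) data
\hat{\theta} \;=\; \arg\min_{\theta}\; -\,\mathbb{E}_{x}\!\left[\log p_\theta(x)\right]
```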
This constraint ensures the model retains its knowledge of the overall data distribution while gaining text controllability, effectively preventing catastrophic forgetting during fine-tuning. Concretely, Text2Data first learns the overall data distribution from all available data conditioned on NULL tokens, then introduces a constrained optimization framework to fine-tune the model on text-annotated data while preventing the parameters from drifting away from the previously learned distribution.
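A minimal sketch of the first phase in PyTorch-style Python, assuming an illustrative noise-prediction interface eps_model(x_t, t, cond), a learned NULL embedding, and a noise scheduler; these names and interfaces are assumptions for illustration, not the authors' code:

```python
import torch
import torch.nn.functional as F

def phase1_step(eps_model, x0, null_emb, scheduler, optimizer):
    """One unsupervised step: every sample is conditioned on the NULL token,
    so the model learns the marginal data distribution p_theta(x)."""
    b = x0.shape[0]
    t = torch.randint(0, scheduler.num_steps, (b,), device=x0.device)  # random diffusion timestep
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)              # forward process q(x_t | x_0) (assumed scheduler API)

    cond = null_emb.expand(b, -1)                        # NULL condition for every sample
    loss = F.mse_loss(eps_model(x_t, t, cond), noise)    # standard denoising objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```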
Mathematically, this is expressed as minimizing the negative log-likelihood of the conditional distribution pθ(x|c) while constraining the marginal-distribution objective to remain close to the optimal value ξ established in the first phase. This constraint-based approach directly addresses catastrophic forgetting by ensuring model parameters remain within an optimal set that balances general data representation with text-specific controllability.
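Putting the two phases together, the fine-tuning stage can be summarized as a constrained program, with ξ denoting the near-optimal marginal objective value from the first phase (a schematic rendering of the description above):

```latex
% Phase 2: text-conditioned fine-tuning under a marginal-preservation constraint
\min_{\theta}\; -\,\mathbb{E}_{(x,\,c)}\!\left[\log p_\theta(x \mid c)\right]
\quad \text{s.t.} \quad
-\,\mathbb{E}_{x}\!\left[\log p_\theta(x)\right] \;\le\; \xi
```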
By translating these theoretical objectives into practical loss functions, Text2Data builds on classifier-free diffusion guidance. The framework optimizes three key components: L1(θ) for learning the general data distribution, L'1(θ) for maintaining the distribution on labeled data, and L2(θ) for text-conditioned generation, each estimated empirically from the available data samples.
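Under the same illustrative interface as above, the three components can be instantiated as standard denoising losses, with the NULL embedding used for the unconditional terms and text embeddings for the conditional term (the exact formulation and weighting in the paper may differ):

```python
import torch
import torch.nn.functional as F

def denoising_loss(eps_model, x0, cond, scheduler):
    """Empirical estimate of E ||eps - eps_theta(x_t, t, cond)||^2."""
    t = torch.randint(0, scheduler.num_steps, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)
    return F.mse_loss(eps_model(x_t, t, cond), noise)

def text2data_losses(eps_model, x_all, x_lab, c_lab, null_emb, scheduler):
    """L1: all data with NULL condition; L'1: labeled data with NULL condition;
    L2: labeled data with its text condition."""
    L1     = denoising_loss(eps_model, x_all, null_emb.expand(x_all.shape[0], -1), scheduler)
    L1_lab = denoising_loss(eps_model, x_lab, null_emb.expand(x_lab.shape[0], -1), scheduler)
    L2     = denoising_loss(eps_model, x_lab, c_lab, scheduler)
    return L1, L1_lab, L2
```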
The constrained optimization is carried out by dynamically adjusting gradient updates with a parameter λ, enforcing the constraint while still allowing effective learning. The update rule modifies θ according to a weighted combination of the gradients of the two objectives. The constraint can be relaxed during training to improve convergence, recognizing that the parameters need not stay strictly within the originally learned parameter set but should remain close to it, preserving the learned distribution while gaining controllability.
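One way to realize such an update in the same illustrative setup, where the constraint is checked against the phase-one reference value ξ and λ switches the constraint gradient on when the (relaxed) constraint is violated; this is a deliberate simplification of the dynamic adjustment described above:

```python
def constrained_update(optimizer, L2, L1_lab, xi, slack=0.0):
    """Fine-tuning step on tensors L2 and L1_lab from the loss sketch above:
    follow the text-conditioned objective, and blend in the marginal-preservation
    gradient when the relaxable constraint L'1(theta) <= xi + slack is violated."""
    lam = 1.0 if L1_lab.item() > xi + slack else 0.0   # dynamic weight on the constraint gradient
    loss = L2 + lam * L1_lab                           # weighted combination of the two objectives

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return lam
```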
Text2Data provides theoretical foundations for its constrained optimization method by leveraging the sub-Gaussian property of random variables derived from the diffusion process, enabling the formulation of strict generalization bounds.