Summary:
This thesis explores multi-modal generation and semi-supervised learning, addressing two critical challenges: supporting flexible configurations of input and output across multiple domains, and developing efficient training strategies for semi-supervised data settings.
As artificial intelligence systems advance, there is a growing need for models that can flexibly integrate and generate multiple modalities, mirroring human cognitive abilities. Conventional deep learning systems often struggle when the inputs available at test time deviate from their training configuration, as happens when certain modalities are unavailable in real-world applications. For instance, in medical settings, patients may not undergo every scan that a comprehensive analysis system expects. Additionally, obtaining finer control over generated modalities is crucial for enhancing generation capabilities and providing richer contextual information. As the number of domains increases, obtaining simultaneous supervision across all domains becomes increasingly challenging.
We focus on multi-domain translation in a semi-supervised setting, extending the classical domain translation paradigm. Rather than addressing specific translation directions or limiting translation to pairs of domains, we develop methods that support translation between any configuration of input and output domains, with the configuration chosen at test time. The semi-supervised aspect reflects real-world scenarios where complete data annotation is often infeasible or prohibitively expensive.