Publications | Fabio Pizzati

2024

Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing

Iakovleva*, Ekaterina, Pizzati*, Fabio, Torr, Philip, and Lathuilière, Stéphane

In arxiv 2024

Abs arXiv Bib

Text-based editing diffusion models exhibit limited performance when the user’s input instruction is ambiguous. To solve this problem, we propose Specify ANd Edit (SANE), a zero-shot inference pipeline for diffusion-based editing systems. We use a large language model (LLM) to decompose the input instruction into specific instructions, i.e. well-defined interventions to apply to the input image to satisfy the user’s request. We benefit from the LLM-derived instructions along the original one, thanks to a novel denoising guidance strategy specifically designed for the task. Our experiments with three baselines and on two datasets demonstrate the benefits of SANE in all setups. Moreover, our pipeline improves the interpretability of editing models, and boosts the output diversity. We also demonstrate that our approach can be applied to any edit, whether ambiguous or not.

Star
@inproceedings{sane, title = {Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing}, author = {Iakovleva*, Ekaterina and Pizzati*, Fabio and Torr, Philip and Lathuilière, Stéphane}, booktitle = {arxiv}, year = {2024}, }
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch

Hammoud, Hasan Abed Al Kader, Michieli, Umberto, Pizzati, Fabio, Torr, Philip, Bibi, Adel, Ghanem, Bernard, and Ozay, Mete

In EMNLP Findings 2024

Abs arXiv Bib

Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model, retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques, demonstrating that existing methods do not only transfer domain expertise but also propagate misalignment. We propose a simple two-step approach to address this problem: (i) generating synthetic safety and domain-specific data, and (ii) incorporating these generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments illustrate the effectiveness of integrating alignment-related data during merging, resulting in models that excel in both domain expertise and alignment.
@inproceedings{merging, title = {Model Merging and Safety Alignment: One Bad Model Spoils the Bunch}, author = {Hammoud, Hasan Abed Al Kader and Michieli, Umberto and Pizzati, Fabio and Torr, Philip and Bibi, Adel and Ghanem, Bernard and Ozay, Mete}, booktitle = {EMNLP Findings}, year = {2024}, }
Material Palette: Extraction of Materials from a Single Image

Lopes, Ivan, Pizzati, Fabio, and Charette, Raoul

In CVPR 2024

Abs arXiv Bib

In this paper, we propose a method to extract physically-based rendering (PBR) materials from a single real-world image. We do so in two steps: first, we map regions of the image to material concepts using a diffusion model, which allows the sampling of texture images resembling each material in the scene. Second, we benefit from a separate network to decompose the generated textures into Spatially Varying BRDFs (SVBRDFs), providing us with materials ready to be used in rendering applications. Our approach builds on existing synthetic material libraries with SVBRDF ground truth, but also exploits a diffusion-generated RGB texture dataset to allow generalization to new samples using unsupervised domain adaptation (UDA). Our contributions are thoroughly evaluated on synthetic and real-world datasets. We further demonstrate the applicability of our method for editing 3D scenes with materials estimated from real photographs. The code and models will be made open-source.

Star
@inproceedings{matpal, title = {Material Palette: Extraction of Materials from a Single Image}, author = {Lopes, Ivan and Pizzati, Fabio and de Charette, Raoul}, booktitle = {CVPR}, year = {2024}, }
Near to Mid-term Risks and Opportunities of Open Source Generative AI

Eiras, Francisco, Petrov, Aleksandar, Vidgen, Bertie, Witt, Christian, Pizzati, Fabio, and others,

In ICML 2024

Abs arXiv Bib

In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This regulation is likely to put at risk the budding field of open source Generative AI. We argue for the responsible open sourcing of generative AI models in the near and medium term. To set the stage, we first introduce an AI openness taxonomy system and apply it to 40 current large language models. We then outline differential benefits and risks of open versus closed source AI and present potential risk mitigation, ranging from best practices to calls for technical and scientific contributions. We hope that this report will add a much needed missing voice to the current public discourse on near to mid-term AI safety and other societal impact.
@inproceedings{opensource, title = {Near to Mid-term Risks and Opportunities of Open Source Generative AI}, author = {Eiras, Francisco and Petrov, Aleksandar and Vidgen, Bertie and Schroeder de Witt, Christian and Pizzati, Fabio and others}, booktitle = {ICML}, year = {2024}, }
Latent Guard: a Safety Framework for Text-to-image Generation

Liu, Runtao, Khakzar, Ashkan, Gu, Jindong, Chen, Qifeng, Torr, Philip, and Pizzati, Fabio

ECCV 2024

Abs arXiv Bib

With the ability to generate high-quality images, text-to-image (T2I) models can be exploited for creating inappropriate content. To prevent misuse, existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification, requiring large datasets for training and offering low flexibility. Hence, we propose Latent Guard, a framework designed to improve safety measures in text-to-image generation. Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model’s text encoder, where it is possible to check the presence of harmful concepts in the input text embeddings. Our proposed framework is composed of a data generation pipeline specific to the task using large language models, ad-hoc architectural components, and a contrastive learning strategy to benefit from the generated data. The effectiveness of our method is verified on three datasets and against four baselines.

Star
@article{latentguard, title = {Latent Guard: a Safety Framework for Text-to-image Generation}, author = {Liu, Runtao and Khakzar, Ashkan and Gu, Jindong and Chen, Qifeng and Torr, Philip and Pizzati, Fabio}, journal = {ECCV}, year = {2024}, }
SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?

Hammoud, Hasan Abed Al Kader, Itani, Hani, Pizzati, Fabio, Torr, Philip, Bibi, Adel, and Ghanem, Bernard

under review 2024

Abs arXiv Bib

We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic text-image pairs, significantly departing from previous methods relying on real data. Leveraging recent text-to-image (TTI) generative networks and large language models (LLM), we are able to generate synthetic datasets of images and corresponding captions at any scale, with no human intervention. With training at scale, SynthCLIP achieves performance comparable to CLIP models trained on real datasets. We also introduce SynthCI-30M, a purely synthetic dataset comprising 30 million captioned images. Our code, trained models, and generated data are released.

Star
@article{synthCLIP, title = {SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?}, author = {Hammoud, Hasan Abed Al Kader and Itani, Hani and Pizzati, Fabio and Torr, Philip and Bibi, Adel and Ghanem, Bernard}, journal = {under review}, year = {2024}, }
On Pretraining Data Diversity for Self-Supervised Learning

Hammoud, Hasan Abed Al Kader, Das, Tuhin, Pizzati, Fabio (shared first author), Torr, Philip, Bibi, Adel, and Ghanem, Bernard

ECCV 2024

Abs arXiv Bib

We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with an exceptionally large pretraining data diversity achieved through methods like web crawling or diffusion-generated data, among other ways, the distribution shift remains a challenge. Our experiments are comprehensive with seven SSL methods using large-scale datasets such as ImageNet and YFCC100M amounting to over 200 GPU days.

Star
@article{budgetssl, title = {On Pretraining Data Diversity for Self-Supervised Learning}, author = {Hammoud, Hasan Abed Al Kader and Das, Tuhin and Pizzati, Fabio (shared first author) and Torr, Philip and Bibi, Adel and Ghanem, Bernard}, journal = {ECCV}, year = {2024}, }

2023

Physics-informed guided disentanglement for generative networks

Pizzati, Fabio, Cerri, Pietro, and Charette, Raoul

T-PAMI 2023

Abs arXiv Bib Code

Image-to-image translation (i2i) networks suffer from entanglement effects in presence of physics-related phenomena in target domain (such as occlusions, fog, etc), lowering altogether the translation quality, controllability and variability. In this paper, we build upon collection of simple physics models and present a comprehensive method for disentangling visual traits in target images, guiding the process with a physical model that renders some of the target traits, and learning the remaining ones. Because it allows explicit and interpretable outputs, our physical models (optimally regressed on target) allows generating unseen scenarios in a controllable manner. We also extend our framework, showing versatility to neural-guided disentanglement. The results show our disentanglement strategies dramatically increase performances qualitatively and quantitatively in several challenging scenarios for image translation.

Star
@article{arxiv, title = {Physics-informed guided disentanglement for generative networks}, author = {Pizzati, Fabio and Cerri, Pietro and de Charette, Raoul}, journal = {T-PAMI}, year = {2023}, }

2022

ManiFest: Manifold Deformation for Few-shot Image Translation

Pizzati, Fabio, Lalonde, Jean-François, and Charette, Raoul

In ECCV 2022

Abs arXiv Bib Code

Most image-to-image translation methods require a large number of training images, which restricts their applicability. We instead propose ManiFest: a framework for few-shot image translation that learns a context-aware representation of a target domain from a few images only. To enforce feature consistency, our framework learns a style manifold between source and proxy anchor domains (assumed to be composed of large numbers of images). The learned manifold is interpolated and deformed towards the few-shot target domain via patch-based adversarial and feature statistics alignment losses. All of these components are trained simultaneously during a single end-to-end loop. In addition to the general few-shot translation task, our approach can alternatively be conditioned on a single exemplar image to reproduce its specific style. Extensive experiments demonstrate the efficacy of ManiFest on multiple tasks, outperforming the state-of-the-art on all metrics and in both the general- and exemplar-based scenarios.

Star
@inproceedings{manifest, title = {{ManiFest: Manifold Deformation for Few-shot Image Translation}}, author = {Pizzati, Fabio and Lalonde, Jean-François and de Charette, Raoul}, booktitle = {ECCV}, year = {2022}, }
Leveraging Local Domains for Image-to-Image Translation

Dell’Eva, Anthony, Pizzati, Fabio, Bertozzi, Massimo, and Charette, Raoul

In VISAPP (best paper award) 2022

Abs arXiv Bib

Image-to-image (i2i) networks struggle to capture local changes because they do not affect the global scene structure. For example, translating from highway scenes to offroad, i2i networks easily focus on global color features but ignore obvious traits for humans like the absence of lane markings. In this paper, we leverage human knowledge about spatial domain characteristics which we refer to as ’local domains’ and demonstrate its benefit for image-to-image translation. Relying on a simple geometrical guidance, we train a patch-based GAN on few source data and hallucinate a new unseen domain which subsequently eases transfer learning to target. We experiment on three tasks ranging from unstructured environments to adverse weather. Our comprehensive evaluation setting shows we are able to generate realistic translations, with minimal priors, and training only on a few images. Furthermore, when trained on our translations images we show that all tested proxy tasks are significantly improved, without ever seeing target domain at training.
@inproceedings{visapp, title = {Leveraging Local Domains for Image-to-Image Translation}, author = {Dell'Eva, Anthony and Pizzati, Fabio and Bertozzi, Massimo and de Charette, Raoul}, booktitle = {VISAPP (best paper award)}, year = {2022}, }

2021

CoMoGAN: continuous model-guided image-to-image translation

Pizzati, Fabio, Cerri, Pietro, and Charette, Raoul

In CVPR (oral) 2021

Abs arXiv Bib Code

CoMoGAN is a continuous GAN relying on the unsupervised reorganization of the target data on a functional manifold. To that matter, we introduce a new Functional Instance Normalization layer and residual mechanism, which together disentangle image content from position on target manifold. We rely on naive physics-inspired models to guide the training while allowing private model/translations features. CoMoGAN can be used with any GAN backbone and allows new types of image translation, such as cyclic image translation like timelapse generation, or detached linear translation. On all datasets, it outperforms the literature.

Star
@inproceedings{cvpr, title = {{CoMoGAN}: continuous model-guided image-to-image translation}, author = {Pizzati, Fabio and Cerri, Pietro and de Charette, Raoul}, booktitle = {CVPR (oral)}, year = {2021}, }

2020

Domain Bridge for Unpaired Image-to-Image Translation and Unsupervised Domain Adaptation

Pizzati, Fabio, Charette, Raoul, Zaccaria, Michela, and Cerri, Pietro

In WACV 2020

Abs arXiv Bib

Image-to-image translation architectures may have limited effectiveness in some circumstances. For example, while generating rainy scenarios, they may fail to model typical traits of rain as water drops, and this ultimately impacts the synthetic images realism. With our method, called domain bridge, web-crawled data are exploited to reduce the domain gap, leading to the inclusion of previously ignored elements in the generated images. We make use of a network for clear to rain translation trained with the domain bridge to extend our work to Unsupervised Domain Adaptation (UDA). In that context, we introduce an online multimodal style-sampling strategy, where image translation multimodality is exploited at training time to improve performances. Finally, a novel approach for self-supervised learning is presented, and used to further align the domains. With our contributions, we simultaneously increase the realism of the generated images, while reaching on par performances with respect to the UDA state-of-the-art, with a simpler approach.
@inproceedings{wacv, title = {Domain Bridge for Unpaired Image-to-Image Translation and Unsupervised Domain Adaptation}, author = {Pizzati, Fabio and de Charette, Raoul and Zaccaria, Michela and Cerri, Pietro}, booktitle = {WACV}, year = {2020}, }
Model-based occlusion disentanglement for image-to-image translation

Pizzati, Fabio, Cerri, Pietro, and Charette, Raoul

In ECCV 2020

Abs arXiv Bib

Image-to-image translation is affected by entanglement phenomena, which may occur in case of target data encompassing occlusions such as raindrops, dirt, etc. Our unsupervised model-based learning disentangles scene and occlusions, while benefiting from an adversarial pipeline to regress physical parameters of the occlusion model. The experiments demonstrate our method is able to handle varying types of occlusions and generate highly realistic translations, qualitatively and quantitatively outperforming the state-of-the-art on multiple datasets.
@inproceedings{eccv, title = {Model-based occlusion disentanglement for image-to-image translation}, author = {Pizzati, Fabio and Cerri, Pietro and de Charette, Raoul}, booktitle = {ECCV}, year = {2020}, }