
DreamShot: Teaching Cinema Shots to Latent Diffusion Models / Massaglia, T.; Vacchetti, B.; Cerquitelli, T. - 3651:(2024). (Paper presented at the EDBT/ICDT conference held in Paestum, Italy, 25-28 March 2024.)

DreamShot: Teaching Cinema Shots to Latent Diffusion Models

Massaglia T.; Vacchetti B.; Cerquitelli T.
2024

Abstract

In recent years, several text-to-image synthesis models have been released that are increasingly capable of synthesizing realistic images that closely match the input text. Among the various state-of-the-art techniques and models, the introduction of the open-source latent diffusion model Stable Diffusion has driven significant recent progress in text-to-image generation. Techniques such as DreamBooth and Textual Inversion make it possible to further refine and control the generation process, producing more specific output than text alone would allow. We test this approach for generating three specific cinematographic shot types: Close-up, Medium Shot, and Long Shot. By fine-tuning Stable Diffusion 1.5 on a small dataset of 600 labelled and captioned film frames, we achieve a noticeable increase in CLIP-T and DINO scores and an overall qualitative improvement (as indicated by our human-run evaluation survey) in image likability, compliance, and shot type correctness.
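The approach described in the abstract relies on conditioning a fine-tuned Stable Diffusion model with shot-type cues in the text prompt. The following is a minimal sketch of how such prompts might be composed; the cue phrases, the `build_prompt` helper, and the prompt template are illustrative assumptions, not the actual tokens or code used in the paper.

```python
# Hypothetical shot-type cue phrases for prompting a DreamBooth-style
# fine-tuned Stable Diffusion model (assumed names, not from the paper).
SHOT_TYPES = {
    "close-up": "close-up shot",
    "medium": "medium shot",
    "long": "long shot",
}


def build_prompt(subject: str, shot: str) -> str:
    """Prefix a scene description with a cinematographic shot-type cue."""
    if shot not in SHOT_TYPES:
        raise ValueError(f"unknown shot type: {shot!r}")
    return f"{SHOT_TYPES[shot]} of {subject}, film still"


print(build_prompt("a detective in a rainy alley", "close-up"))
```

A prompt built this way would then be passed to the fine-tuned pipeline (for example, via the `prompt` argument of a diffusers text-to-image pipeline) to steer generation toward the requested framing.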
Files in this record:

File: DARLI-AP-8.pdf
Access: open access
Type: 2a Post-print editorial version / Version of Record
License: Creative Commons
Size: 8.69 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2988712