Based in Saclay (Essonne), LIST is one of the two institutes of CEA Tech, the technological research division of the CEA. Dedicated to intelligent digital systems, its mission is to carry out technological development of excellence on behalf of industrial partners in order to create value.
Within the LIST, the Laboratory of Vision and Learning for Scene Analysis (LVA) conducts research in the field of computer vision and artificial intelligence for the perception of intelligent and autonomous systems. The laboratory's research themes include visual recognition, behavior and activity analysis, large-scale automatic annotation, and perception and decision models. These technologies are applied in major sectors such as security, mobility, advanced manufacturing, healthcare, and sports.
Enhancing Visual Reasoning in Vision-Language Models (VLMs) through Dynamic Visual Feature Selection (M/F)
Generative Vision Language Models (VLMs) are designed to integrate text generation with visual contexts, but their performance in tasks requiring complex visual reasoning remains under scrutiny. This internship will focus on enhancing VLMs by using Chain-of-Thought (CoT) reasoning to optimize visual feature selection for text generation.
Generative Vision Language Models (VLMs) combine text understanding and generation in visual contexts. These models have demonstrated impressive performance on real-world visual question answering (VQA) benchmarks, which suggests strong visual reasoning abilities. However, these benchmarks often mix pure visual reasoning tasks with tests of world knowledge, and they typically involve questions requiring only a limited number of reasoning steps [2]. As a result, it is unclear whether a VLM's apparent success on visual reasoning tasks is truly due to its reasoning capabilities or simply to its extensive world knowledge. Moreover, VLMs often struggle with fine-grained scene understanding and spatial reasoning, largely due to inefficient use of visual features [5].
This internship aims to tackle these limitations by developing a novel approach for VLMs, particularly those trained with instruction-tuning methods such as LLaVA [1]. In this architecture, visual features from a Vision Transformer [3] are projected into the text embedding space before being fed to a large language model (LLM) for text generation.
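To make this architecture concrete, the sketch below illustrates the projection step in a LLaVA-style pipeline: visual patch features are mapped into the LLM's token-embedding space and concatenated with the text embeddings. It is a minimal, illustrative example; the dimensions and module names are assumptions, not the actual LLaVA implementation.

```python
# Minimal sketch (not the actual LLaVA code) of projecting ViT patch features
# into the LLM's token-embedding space, as in LLaVA-style architectures [1].
# Dimensions (1024-d ViT features, 4096-d LLM embeddings, 576 patches) are
# illustrative assumptions.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Linear layer mapping each visual patch token into the LLM embedding space
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vit_dim) from a Vision Transformer [3]
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# The projected visual tokens are prepended to the embedded text prompt
# before being fed to the LLM decoder.
projector = VisualProjector()
visual_tokens = projector(torch.randn(1, 576, 1024))  # e.g. 24x24 image patches
text_embeddings = torch.randn(1, 32, 4096)            # embedded prompt tokens
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
```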
We propose leveraging the Chain-of-Thought (CoT) technique [4] to iteratively select the most relevant visual features during text generation. CoT generates step-by-step reasoning that breaks a complex task into simpler logical steps, which improves model performance on tasks requiring complex reasoning. In our approach, we will first link each reasoning step of the textual chain to specific visual features within the image, providing a visual justification for that step. The model will then learn to select and process the relevant visual features directly, without relying on explicit textual reasoning steps, allowing a more efficient use of the visual context.
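As one possible illustration of this idea, the sketch below scores each projected visual token against an embedding of the current reasoning step and keeps only the top-k most relevant tokens. This is an assumption about how step-wise visual feature selection could be realized, not a prescribed method for the internship.

```python
# Illustrative sketch: per-reasoning-step selection of visual tokens by
# relevance to the current CoT step. The scoring rule (dot product) and the
# top-k budget are assumptions for the example only.
import torch

def select_visual_tokens(visual_tokens: torch.Tensor,
                         step_embedding: torch.Tensor,
                         k: int = 64) -> torch.Tensor:
    """
    visual_tokens:  (num_patches, dim) projected visual features
    step_embedding: (dim,) embedding of the current reasoning step
    Returns the k visual tokens most relevant to the current step.
    """
    # Dot-product relevance between each visual token and the reasoning step
    scores = visual_tokens @ step_embedding                      # (num_patches,)
    topk = torch.topk(scores, k=min(k, visual_tokens.size(0))).indices
    return visual_tokens[topk]                                   # (k, dim)

# At each CoT step, only the selected tokens would be appended to the LLM
# context, tying a visual justification to that step.
step_tokens = select_visual_tokens(torch.randn(576, 4096), torch.randn(4096))
```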
[1] Liu, H., Li, C., Wu, Q., & Lee, Y.J. (2023). Visual Instruction Tuning. ArXiv, abs/2304.08485
[2] Zhang, Y., Bai, H., Zhang, R., Gu, J., Zhai, S., Susskind, J., & Jaitly, N. (2024). How far are we from intelligent visual deductive reasoning? In COLM
[3] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ArXiv, abs/2010.11929
[4] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E.H., Xia, F., Le, Q., & Zhou, D. (2022). Chain of Thought Prompting Elicits Reasoning in Large Language Models. ArXiv, abs/2201.11903
[5] Zhang, J., Hu, J., Khayatkhoei, M., Ilievski, F., & Sun, M. (2024). Exploring Perceptual Limitation of Multimodal Large Language Models. ArXiv, abs/2402.07384
- Students in their 5th year of studies (M2 or gap year)
- Computer vision skills
- Machine learning skills (deep learning, perception models, generative AI, etc.)
- Proficiency in Python and a deep learning framework (especially PyTorch or TensorFlow)
- Scientific research experience will be appreciated
In line with CEA's commitment to integrating people with disabilities, this job is open to all.