Institut DataIA Paris-Saclay

Internship - Transfer learning models able to handle MISSing data for the survival analysis of rare cancer from multi-OMICS data

On site

Gif-sur-Yvette, France

Internship

20-01-2026


Skills

Statistical Analysis, Research, Deep Learning, Mathematics

Job Specifications

Centre National de Recherche en Génomique Humaine (CNRGH)

Within the CEA, the National Center for Human Genomics Research (CNRGH), located in the Évry-Courcouronnes Genopole, is a research center dedicated to the study of the human genome. It is part of the François Jacob Institute of Biology (IBFJ), which belongs to the CEA’s Fundamental Research Division (DRF).

The center provides the French and European scientific communities with the capacity to produce, store, and analyze the biological data required to carry out projects in the field of medical genomics, including research on cancer, rare diseases, and autism. It is the largest sequencing platform in France and one of the five largest in Europe.

The Mathematics and Statistics (MS) team plays a pivotal role at the National Center for Human Genomics Research. Its scope of action covers three main functions. First, the MS team is responsible for quality control of data generated by the genotyping platform. Second, the MS team acts as a reference point for the evaluation and validation of statistical analysis plans for both internal and external collaborative projects involving human genetics studies that use the CNRGH’s genotyping and sequencing platforms. Finally, the MS team initiates methodological research projects on original topics of interest to the genetics community. In particular, it has developed expertise in statistical methods for studying genetic associations, including rare variants, gene–environment interactions, gene networks (pathways), and multi-omics integration.

Address

2 rue Gaston Crémieux

91000 Evry-Courcouronnes

France

Job details (position, mission, profile)


Context. In cancer, the accumulation of aberrations at multiple molecular levels is the source of the many differences observed between somatic tumor cells and normal cells [1]. Abnormalities in DNA may include an increased number of mutations, differentially methylated sites (epigenetic markers), or copy number variations (different numbers of copies of a chromosome segment in a cell). Such modifications have an impact on gene expression, which in turn affects proteins. Studying these molecular data (namely omics) separately is often not enough to understand the underlying dysregulation. This led to the establishment of multi-omics studies, in the hope that looking jointly at all the molecular layers would reveal the big picture. From a statistical point of view, this should increase power: combining multiple small effects across several omic modalities that jointly explain the same phenomenon would increase the signal-to-noise ratio. However, to achieve this, the high dimensionality of such data (more than 20,000 coding genes) has to be handled to avoid estimating spurious associations. Therefore, a large number of multi-omics analysis methods have been developed [2,3].
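The power argument above can be illustrated with a small sketch. It uses Stouffer's method for combining z-scores, which is an illustrative choice and not a method named in this offer, and it assumes that the per-omic effects are independent. Real omic layers are correlated, so the actual gain would be smaller. The function name and the toy z-scores are hypothetical.

```python
import math


def stouffer_z(z_scores):
    """Combine independent z-scores into one (Stouffer's method).

    Assumption: the inputs are independent standard-normal test
    statistics under the null hypothesis, one per omic modality.
    """
    return sum(z_scores) / math.sqrt(len(z_scores))


# Toy example: four modalities, each carrying the same weak signal.
weak_signals = [1.0, 1.0, 1.0, 1.0]
combined = stouffer_z(weak_signals)
print(combined)  # 2.0: the combined statistic grows like sqrt(k)
```

With k independent modalities of equal effect, the combined statistic is sqrt(k) times the individual one, which is the sense in which pooling small effects raises the signal-to-noise ratio.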

Among the tasks addressed with multi-omics data, survival analysis consists of estimating the duration between a patient’s initial diagnosis and their death. Such an analysis can identify groups of patients with different prognoses that are distinguished by a molecular (omic) signature. Clinicians can further investigate such signatures to develop new treatments or to better adapt therapies to the molecular specificities of a given cancer. This is one way of performing precision medicine.
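As a minimal illustration of what estimating survival involves, the sketch below computes a Kaplan-Meier survival curve in plain Python. The Kaplan-Meier estimator is a standard tool chosen here for illustration and is not necessarily the one used in this project. The function name and the toy follow-up times are hypothetical, and censoring (a patient still alive at last follow-up) is assumed to be encoded as an event flag of 0.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each observed death time.

    durations: follow-up time since diagnosis, one per patient
    events:    1 if death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    curve = []
    survival = 1.0
    for t in sorted(deaths):
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1.0 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


# Toy data (made up for illustration): times in months, two censored patients.
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
for time, prob in kaplan_meier(durations, events):
    print(time, round(prob, 3))
```

Comparing such curves between groups defined by a molecular signature is one common way to check whether the signature separates patients with different prognoses.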

Despite the promise of multi-omics data, their benefit in the field of cancer survival analysis remains limited. In an insightful study [3], 12 survival analysis methods were compared on 18 cancer datasets analyzed separately. The results aggregated across all cancers showed that only two methods using both clinical and molecular data performed better, though not statistically significantly, than a reference model using only clinical data. Adding Deep Learning methods in a follow-up study [4] did not change the conclusions. In ongoing work [5], we added joint Dimension Reduction (jDR) methods to the comparison. These methods estimate a reduced space that represents well the commonalities between omic layers [2]. We hypothesized that estimating such a joint reduced space prior to survival analysis would improve the prediction results by better dealing with the high dimensionality of the data. Preliminary results, aggregated across all cancers, identified two jDR methods using both clinical and omics data that statistically outperformed the reference model using clinical data only. Further improvement in the performance of these methodologies may be expected.

However, we are still far from identifying robust candidate multi-omics biomarkers to be further investigated in clinical trials. This could mean that the dimensionality of the data is simply too high to build good prediction models. We identified two major ways to better handle this curse of dimensionality. First, all the studies mentioned above deal with complete data, i.e., if a subject has at least on

About the Company

Created in 2017 as part of the French National Strategy for Artificial Intelligence, the Institut DataIA is the AI center of excellence of Université Paris-Saclay. It brings together 14 higher education and research institutions, including CentraleSupélec and ENS Paris-Saclay, as well as national organizations and academic partners. The Institute works to structure the AI ecosystem around research, training, and innovation, with flagship initiatives such as the SaclAI-School project, winner of the call Compét...