PEReN – Center of expertise for digital platform regulation
Data science expertise at the service of digital regulation
Data scientist
Contribution to the valorization of free textual data in the health sector (2022)
In recent years, the healthcare industry has faced numerous challenges (epidemic management, demand volatility, compressed care times, etc.), resulting in a growing need for useful information to support decision-making. Moreover, most existing health data is available only as free text (clinical notes, messages on social networks, etc.). In this context, recent breakthroughs in natural language processing (NLP), especially language models based on deep learning, have created opportunities to unlock this information and improve the overall management of the healthcare sector. These technologies can help enrich health databases, smooth information flows between stakeholders, and improve processes ranging from demand forecasting to epidemic management. This thesis therefore focused on how to leverage the massively available unstructured textual data in the healthcare sector. First, two literature reviews identified the opportunities and challenges of applying NLP to available textual data in order to improve management processes. Using these techniques comes with several difficulties, including the high variability and implicit nature of natural language expressions and the scarcity of training and evaluation data. To address them, a methodology based on recent transformer language models was developed to perform contextualized health information extraction (for example, detecting negated or suspected diseases) from various health-related texts, under the data scarcity that affects French. Finally, a second contribution developed a methodology for combining structured medical data with unstructured textual data from news media, and validated it on two real cases in the pharmaceutical industry.