
Ariel Cohen, Introduction to Weak Supervision & Applications

28 April @ 14:00 – 15:00


The recent digitization of patient health records and their collection, in near real time, in Clinical Data Warehouses (CDWs) offer new perspectives for research, steering activities, and policy making. Although promising, taking advantage of Electronic Health Records (EHRs) remains a challenge. In particular, textual data are very rich in information, but their exploitation remains extremely difficult. The development of efficient methods for extracting information from unstructured data for further use is therefore essential.
Natural language processing (NLP) techniques applied to health care notes have already shown satisfactory results in the literature, especially with supervised learning approaches. However, this good performance depends strongly on the existence of many annotated records, and these annotations must moreover be performed by domain experts. In practice, this annotation task is a bottleneck for research, because experts' available time is a scarce and expensive resource. Furthermore, most annotated datasets derived from clinical notes cannot be shared and reused due to patient privacy regulations.

The challenge of acquiring labelled training data has driven the search for alternatives to traditional supervised machine learning: new engineering and mathematical methodologies that minimise the expert annotation effort, in particular weak supervision approaches. Programmatic weak supervision encompasses a wide range of techniques that aim to learn from data where the supervision comes from labelling functions. Among these techniques, distant supervision uses multiple data sources to build annotated datasets automatically, and consequently much faster than manual annotation. However, this programmatic annotation is imperfect, producing "silver standard" datasets with partially unreliable labels, also called noisy labels. Many machine learning algorithms, including the most recent ones such as deep neural networks (DNNs), are prone to overfitting noisy labels; several methods have therefore been developed to learn from noisy labels with DNNs.
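
The labelling-function idea can be sketched in a few lines of plain Python. The rules, keywords, and the ICD-10 code below are illustrative assumptions only, not functions actually used in our work: each labelling function votes or abstains on a note, and a simple majority vote yields a noisy "silver" label.

```python
# Minimal sketch of programmatic weak supervision: labelling functions
# vote POSITIVE/NEGATIVE/ABSTAIN on a note; a majority vote over the
# non-abstaining votes yields a noisy "silver" label.
from collections import Counter

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_keyword_diabetes(note: str) -> int:
    # Keyword rule: a domain keyword suggests a positive label.
    return POSITIVE if "diabetes" in note.lower() else ABSTAIN

def lf_negation(note: str) -> int:
    # Crude negation cue: "no history of" suggests a negative label.
    return NEGATIVE if "no history of" in note.lower() else ABSTAIN

def lf_icd_code(note: str) -> int:
    # Structured-data redundancy: an ICD-10 code (E11, type 2 diabetes)
    # quoted in the note acts as a second, independent source.
    return POSITIVE if "E11" in note else ABSTAIN

LFS = [lf_keyword_diabetes, lf_negation, lf_icd_code]

def silver_label(note: str) -> int:
    votes = [lf(note) for lf in LFS if lf(note) != ABSTAIN]
    if not votes:
        return ABSTAIN  # no labelling function fired: leave unlabelled
    return Counter(votes).most_common(1)[0][0]

notes = [
    "Patient followed for type 2 diabetes (E11.9), on metformin.",
    "No history of hypertension reported.",
    "Routine check-up, unremarkable findings.",
]
print([silver_label(n) for n in notes])  # → [1, 0, -1]
```

A real deployment would aggregate many such functions with a probabilistic label model rather than a raw majority vote, precisely because individual functions conflict and have unequal accuracies.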

There is also increasing interest in using Large Language Models (LLMs) to solve information-extraction NLP tasks in the medical domain without an expert-labelled training set. To date, however, they present several limitations. First, it has been shown that these models do not perform as well as smaller supervised contextual models (e.g. BERT). Second, the operational cost of deploying these huge, resource-demanding models in a CDW with more than 11M patients is not conceivable from an industrial perspective: the need for dedicated, state-of-the-art hardware and its energy consumption currently make the massive use of this technology for inference prohibitively expensive. On the other hand, recent publications suggest that these models are well suited to the labelling task and could accelerate the development of smaller specialized models.
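
As a purely hypothetical illustration of this last point, the sketch below replaces the LLM with a trivial stub (`ask_llm`, an invented name) and distils its silver labels into a small bag-of-words perceptron; every rule and note here is made up for illustration and does not reflect our actual pipeline.

```python
# Hypothetical sketch: an LLM-as-annotator stub produces silver labels,
# and a small specialized model is then trained on them for deployment.

def ask_llm(note: str) -> int:
    # Stub standing in for a real LLM call prompted with, e.g.,
    # "Does this note report an active diabetes diagnosis? yes/no".
    text = note.lower()
    return 1 if "diabetes" in text and "no history" not in text else 0

def featurize(note, vocab):
    # Bag-of-words counts over a fixed vocabulary.
    tokens = note.lower().split()
    return [tokens.count(w) for w in vocab]

def train_perceptron(X, y, epochs=20, lr=0.1):
    # Plain perceptron: enough to fit this tiny silver-labelled set.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wi * x for wi, x in zip(w, xi)) + b > 0 else 0
            err = yi - pred
            if err:
                w = [wi + lr * err * x for wi, x in zip(w, xi)]
                b += lr * err
    return w, b

def predict(note, vocab, w, b):
    x = featurize(note, vocab)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

notes = [
    "type 2 diabetes treated with metformin",
    "no history of diabetes for this patient",
    "routine visit unremarkable findings",
    "diabetes follow-up consultation",
]
silver = [ask_llm(n) for n in notes]  # LLM-annotated silver labels
vocab = sorted({t for n in notes for t in n.lower().split()})
w, b = train_perceptron([featurize(n, vocab) for n in notes], silver)
```

The small model, once trained, is cheap to run at CDW scale, which is the practical motivation for using the expensive annotator only offline, at training time.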

The primary goal of our work is to explore how weak supervision approaches can be developed within a CDW to reduce the annotation workload for medical professionals and speed up NLP model development, while respecting the constraints typical of this industrial environment. Our research will be conducted on multiple real-world use cases and aims to answer the following research questions: Can weak supervision methods be applied in a Clinical Data Warehouse context to accelerate the development of NLP models? How can we leverage the information redundancy present in certain portions of Electronic Health Records within a CDW to create labelling functions, obtaining a programmatically annotated corpus (silver standard) on which a model can be fitted by distant supervision? How can we take advantage of Large Language Models in the annotation phase of training sets, and how can we use these datasets to develop smaller, specialized models ready for deployment in a Clinical Data Warehouse? What are the most effective training techniques for handling these silver standard datasets?
