Machine Learning (ML) is a form of Artificial Intelligence (AI) that allows applications to learn from the data they process. This is in contrast to conventional programming, where a developer writes explicit rules that the program then follows. As they absorb data, ML algorithms learn to produce better, more precise ‘models’. These ML ‘models’ are the output of training the ML algorithms. Training can be performed on smaller data sets, but the best results are achieved with large data sets.
Once a ‘model’ has been trained, you provide it with an input and it produces an output. For example, a data extraction algorithm will create a data extraction model. When you then feed this model with data, you receive extracted data that reflects the data the ‘model’ was trained on. In most cases this is an iterative process, allowing the ‘model’ to learn continuously.
ML differs from data mining in that data mining is based on statistical principles. Traditionally, data mining is applied to structured data with the aim of revealing patterns in that data. In contrast, ML automates the identification of patterns and is mainly used to make predictions. To be fair, there are some similarities: both are analytical processes and both are good at pattern recognition.
Although ML is already widely used in life sciences, for example in analysing medical images, it seems that in Regulatory Affairs we are trailing behind. The recent implementation of IDMP highlighted several issues with regard to data and its granularity, but also with where it is maintained and how it can be accessed. Many found that accessing unstructured data, usually held in documents, is a real challenge. This is where companies can benefit from ML.
The challenges with unstructured data.
Pharmaceutical companies are required to create SPCs (Summaries of Product Characteristics) and PLs (Package Leaflets) for each and every medicinal product on the market. Both documents contain the data required to keep the public and health professionals informed about substances, excipients, Marketing Authorisation Holders (MAH), possible side effects and more. But here lies the challenge: structured data is contained in unstructured documents. The implication is that either the data extraction project is abandoned, or time-consuming and more error-prone manual processes have to be used. Neither option seems suitable if these documents are the only place where this type of data is managed. So, are there other, more sophisticated methods available?
The benefit of Machine Learning.
ML seems an ideal approach here for many reasons. First of all, with SPCs available for nearly all medicinal products on the market, a large data set is available for training the ML ‘model’. Secondly, SPCs are semi-structured documents with predefined sections, which partly supports the extraction of product name, indication, substance, excipients and other data fields. In addition, ‘models’ have to be trained to identify different SPCs for different products from different regions, regardless of whether these products are for human or animal use.
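To illustrate how the predefined sections help, here is a minimal sketch in Python that splits SPC text into its numbered sections before any ML is applied. It assumes that headings start with a section number such as "1." or "4.1"; the sample text, the product "Examplamol" and the function name split_sections are invented for illustration and do not come from a real SPC or tool.

```python
import re

# Hypothetical SPC excerpt; the product and its contents are invented.
SAMPLE_SPC = """\
1. NAME OF THE MEDICINAL PRODUCT
Examplamol 500 mg tablets
2. QUALITATIVE AND QUANTITATIVE COMPOSITION
Each tablet contains 500 mg of examplamol.
4.1 Therapeutic indications
Relief of mild to moderate pain.
4.3 Contraindications
Hypersensitivity to the active substance.
6.1 List of excipients
Maize starch, magnesium stearate.
"""

# Assumption: a heading is a line that starts with a section number
# such as "1." or "4.1", followed by the heading title.
HEADING = re.compile(r"^(\d+(?:\.\d+)?)\.?\s+(\S.*)$")


def split_sections(text):
    """Return {section number: (heading title, body text)}."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = HEADING.match(line)
        if match:
            # Start a new section keyed by its number.
            current = match.group(1)
            sections[current] = (match.group(2).strip(), [])
        elif current is not None and line:
            # Any other non-empty line belongs to the current section.
            sections[current][1].append(line)
    return {num: (title, " ".join(body))
            for num, (title, body) in sections.items()}


sections = split_sections(SAMPLE_SPC)
for number, (title, body) in sections.items():
    print(number, "|", title, "|", body)
```

A rule like this only covers the well-behaved part of the problem. A body line that happens to start with a number would be mistaken for a heading, and layouts vary between regions, which is where a trained ‘model’ becomes useful.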
Data granularity is another challenge, specifically with IDMP (Identification of Medicinal Products) in mind. The IDMP data model, for example, defines the “Authorised Medicinal Product” in detail, with components including but certainly not limited to “Medicinal Product Name”, “Manufacturer/Establishment (Organisation)”, “Contra-Indication” and “Indications”. Now “Name of Medicinal Product” is also part of the SPC, and so are “Therapeutic Indications” and “Contra-indications”. However, the granularity of SPC data differs from that of IDMP data. This is where ML algorithms can be of benefit. ML algorithms will learn how to interpret the data structure and extract the data in an IDMP-compatible format. This process of “supervised learning”, in which the ‘model’ is trained on labelled examples, relies on some prior understanding of the data to be analysed. The benefits are clear: an automated process via ML to establish structured data, which can then be used company-wide for analysis and further processing, whether in Regulatory, Labelling or elsewhere.
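The idea of supervised learning can be shown with a deliberately tiny sketch in Python. Labelled examples pair a piece of SPC text with the IDMP component it belongs to, and a simple word-count ‘model’ learned from them assigns new text to the best-matching component. The training examples and the functions tokens, train and predict are assumptions made up for illustration; a real project would use far more data and an established ML library.

```python
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lower-case the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical labelled examples: (SPC text, IDMP component).
TRAINING = [
    ("Name of the medicinal product", "Medicinal Product Name"),
    ("Invented name, strength and pharmaceutical form", "Medicinal Product Name"),
    ("Therapeutic indications", "Indications"),
    ("Indicated for the treatment of pain", "Indications"),
    ("Contraindications", "Contra-Indication"),
    ("Must not be used in patients with hypersensitivity", "Contra-Indication"),
]


def train(examples):
    """Count how often each word occurs under each label."""
    model = defaultdict(Counter)
    for text, label in examples:
        model[label].update(tokens(text))
    return model


def predict(model, text):
    """Return the label whose training words best match the text."""
    words = tokens(text)
    scores = {}
    for label, counts in model.items():
        total = sum(counts.values())
        # Share of the label's training words that also occur in the text.
        scores[label] = sum(counts[word] for word in words) / total
    return max(scores, key=scores.get)


model = train(TRAINING)
print(predict(model, "4.1 Therapeutic indications"))
print(predict(model, "4.3 Contraindications"))
```

The point of the sketch is the workflow rather than the scoring rule: the quality of the labelled examples determines how well new SPC text is mapped to IDMP components.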
Another underestimated benefit, specifically for Generics, would be the ability to easily compare a Generics SPC with the originator’s SPC. Data from both documents, the originator SPC and the Generics SPC, can be extracted and compared at data level to identify any differences that might have arisen, especially after updates following an adverse event. ML algorithms will learn the layout of both versions of the SPC, train their ‘models’ accordingly and allow for precise data extraction, the results of which can then be easily compared. Here we might get into Deep Learning utilising neural networks, but more on that in a future post.
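Once the data has been extracted, the comparison itself is straightforward. The following Python sketch assumes that each SPC has already been reduced to a dictionary of field names and values; the field names, the values and the function compare_fields are hypothetical and only illustrate a comparison at data level.

```python
def compare_fields(originator, generic):
    """Return {field: (originator value, generic value)} for every field that differs."""
    differences = {}
    for field in sorted(set(originator) | set(generic)):
        originator_value = originator.get(field)
        generic_value = generic.get(field)
        if originator_value != generic_value:
            differences[field] = (originator_value, generic_value)
    return differences


# Hypothetical extracted data; None marks a field missing from one SPC.
originator_spc = {
    "Therapeutic indications": "Relief of mild to moderate pain.",
    "Contraindications": "Hypersensitivity to the active substance. Severe hepatic impairment.",
    "List of excipients": "Maize starch, magnesium stearate.",
}
generic_spc = {
    "Therapeutic indications": "Relief of mild to moderate pain.",
    "Contraindications": "Hypersensitivity to the active substance.",
}

for field, (orig_value, gen_value) in compare_fields(originator_spc, generic_spc).items():
    print(field)
    print("  originator:", orig_value)
    print("  generic:   ", gen_value)
```

In this invented example the Generics SPC lacks a contraindication that the originator SPC contains, which is exactly the kind of difference such a comparison is meant to surface.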
Join us at the upcoming DIA conference, 8th to 10th February 2021, to learn more about Machine Learning for the benefit of Life Sciences.