Imagine wanting to analyze the notes a doctor has typed into the electronic health record (EHR) of a patient who has tested positive for COVID-19, including detailed descriptions of symptoms and complications. These notes contain nuances that could be vital to understanding how the disease develops, how it is transmitted, and which treatments are most effective with the fewest side effects. Doing this analysis in a timely and efficient manner can be critically important for better prevention, preparedness and even a cure.
Similarly, enormous amounts of healthcare and bio-medical data on various diseases are available through physician notes, insurance claims, EHRs, medical journals, news feeds, social media, etc. All of this data lacks utility unless it is mined and brought into shape. Emerging text processing techniques and resources open up an ocean of opportunities for providing useful insight, analysis and deduction that mimic the behavior of experts in healthcare and its related domains.
This post illustrates the use of some of the latest technologies and resources to mine concepts from the bio-medical domain, and applies Teradata Vantage’s advanced analytics capabilities to analyze the data and predict useful diagnoses and prescriptions.
Electronic Health Records (EHR) are digital patient records entered by a physician or clinician after each visit or examination. The entries are manual, free-form text containing a variety of medical information, including patient demographics, diseases, anatomy, medication, treatments, dosages, etc., all of which lacks structure. These records are often grammatically incorrect and contain misspelt names and acronyms that are difficult to disambiguate across different contexts of usage.
In order to process such complex and irregular domain-specific text, we need powerful tools that can disambiguate, mine and structure the text, which in turn provides the ground for further advanced analytics:
- One powerful instrument for cleaning and shaping text is Regex. Using Vantage’s Regex functions, the text is transformed by removing non-ASCII characters and mark-up tags, and by performing sentence segmentation and other text normalization tasks.
- Next, we use an important entity recognition tool, MetaMap, which maps biomedical text to Unified Medical Language System (UMLS) concepts. It uses a knowledge-intensive approach coupled with natural language processing and computational linguistics to categorize concepts and acronyms into 137 possible types and groups of categories. This is a key resource for understanding medical information, and it is made freely available to promote and improve healthcare services. Through API calls, we were able to transform our dataset into a rich corpus tagged with medical entities and their inter-relations. An example output of an entity-tagged sentence is shown in Figure 1.
Figure 1: Bio-Medical Entity Recognition
- A syntactic dependency parser gives grammatical structure to a sentence, which in turn helps to pinpoint the expressed opinion for deeper analysis. Concept negations, conjunctions and adjectival terms help to extract aspect information and opinionated terms from the sentence. This helps to identify, at a finer level, the sentiment associated with a specific term or concept rather than a jumbled sentiment at the coarse sentence level. To build the dependency parse, advanced NLP libraries in Python are a good choice, whereas for sentiment analysis, in-built models and trainers are available within Vantage.
Figure 2: Dependency Parse of Opinionated Sentence
Figure 3: Disorder type mention in each visitor report along with sentiment
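To make the text-cleaning step from the first bullet more concrete, here is a minimal sketch in plain Python. It is not the Vantage Regex code used in this work; the function names, patterns and the sample note are assumptions chosen for illustration only.

```python
import re

# Hypothetical patterns for illustration; real clinical notes need more care.
TAG_RE = re.compile(r"<[^>]+>")                      # mark-up tags such as <br/>
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")          # any run of non-ASCII characters
WHITESPACE_RE = re.compile(r"\s+")                   # runs of spaces, tabs, newlines
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")  # naive sentence boundary


def clean_note(text):
    """Strip tags and non-ASCII characters, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def split_sentences(text):
    """Split on ., ! or ? when followed by whitespace and a capital letter."""
    return [s for s in SENTENCE_RE.split(text) if s]


if __name__ == "__main__":
    raw = "Pt c/o fever<br/>and\u00a0dry   cough. Denies chest pain."
    for sentence in split_sentences(clean_note(raw)):
        print(sentence)
```

The sentence splitter is deliberately naive: it will also split after abbreviations such as "Dr." when a capitalized word follows, so a production pipeline would need an abbreviation list or a trained segmenter.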
For each inspection report, we use the features for various categories such as medication, diseases, body parts, etc., along with the sentiment associated with each aspect where available, to build advanced models of the patient’s medical condition. Using native Vantage capabilities, we built example analytics to obtain useful insights and deductions:
- Using the features for disorders and anatomy, we build a classifier to predict possible diagnoses for a patient. Such analytics can assist a physician’s decision-making in prescribing medication and treatments, taking into account the patient’s past and present conditions along with the historical treatment record.
- Clustering physician reports based on various types of features, particularly disorder and anatomy, can reveal related examinations and patients with related symptoms and diseases. This is particularly useful when profiling patients based on their illness patterns.
- Using NPath, it’s possible to trace prescribed medication and visualize how physicians have treated cases with a particular medical condition.
Figure 4: Clustering Visualized using PCA and TSNE graphs
Figure 5: NPath tracing medication prescription
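The idea behind the medication trace can be sketched in a few lines of plain Python. This is not Vantage’s NPath function, only a stand-in that shows the kind of output such a trace produces; the function name, the record layout and the placeholder drug names are all hypothetical.

```python
from collections import Counter, defaultdict


def medication_paths(events):
    """Count ordered medication sequences per patient.

    `events` is an iterable of (patient_id, visit_date, medication) tuples.
    Dates are assumed to be ISO strings (YYYY-MM-DD), so they sort correctly
    as plain text.
    """
    by_patient = defaultdict(list)
    for patient_id, visit_date, medication in events:
        by_patient[patient_id].append((visit_date, medication))

    paths = Counter()
    for visits in by_patient.values():
        ordered = [med for _, med in sorted(visits)]
        # Collapse consecutive repeats so a renewed prescription is one step.
        steps = [m for i, m in enumerate(ordered) if i == 0 or m != ordered[i - 1]]
        paths[" -> ".join(steps)] += 1
    return paths


if __name__ == "__main__":
    # Toy records with placeholder names, not real prescriptions.
    events = [
        ("p1", "2020-03-01", "drug_a"),
        ("p1", "2020-03-08", "drug_b"),
        ("p2", "2020-03-05", "drug_b"),
        ("p2", "2020-03-02", "drug_a"),
        ("p2", "2020-03-03", "drug_a"),
        ("p3", "2020-03-04", "drug_c"),
    ]
    for path, count in medication_paths(events).most_common():
        print(count, path)
```

Counting whole paths rather than individual prescriptions is what makes the common treatment sequences for a condition stand out.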
Given that healthcare, pharmaceutical and cosmetic companies are looking to AI-enabled technologies for useful insight into medical diagnosis, the approach presented here showcases Teradata’s ability to combine Vantage’s advanced analytics offering, seamlessly integrated with open-source tools and techniques in text processing, to decipher complex healthcare-related issues pertinent to industry requirements.
Bilal joined Teradata as a Professional Services Data Science Consultant in 2016. In his role he has worked with the local team to showcase the value of adopting analytics for businesses and to solve various business problems with the help of data-driven insights and predictive modeling. Initially focused on the financial sector, Bilal consulted on problems such as credit risk scoring, branch/ATM cash optimization, Next Best Offer analytics, etc. Later, as he joined the Global Delivery Consulting arm, his portfolio diversified to include a wide range of industries, including utilities, retail, telecom, healthcare, etc.
With a specialization in speech & text processing, Bilal pursued academia and advanced study focused on research topics in natural language processing. Prior to joining Teradata, he worked as a Research Associate with a globally leading research group in Natural Speech Technology, an EPSRC grant shared between Edinburgh University, Sheffield University, and University of Cambridge, while based out of Sheffield. He made significant contributions to speech technology, working in collaboration with BBC News and the NHS to build advanced speech recognition systems for disabled speakers. During his tenure at Teradata, Bilal continues to work as domain lead in text processing, having worked with customers on question answering systems, healthcare records information extraction using NLP, topic modeling, etc.
View all posts by Bilal Khaliq