Teradata is a mix of highly qualified data scientists, business intelligence consultants, software developers, Managed Services consultants, and data engineers. All this makes a great recipe for deriving insights and analytics from data. Lately, the world has been gripped in the talons of COVID-19, a highly contagious disease that is rampant across the world; more than 200 countries are currently affected.
As of today, COVID-19 has infected more than 9 million people and killed more than 800,000.
In these harrowing times, Teradata aims to stand alongside all organizations both private and public to fight this pandemic.
What is COVID-19 and why should we be concerned?
COVID-19 is a global pandemic that has spread worldwide within a matter of months. Although the exact mechanism of its spread is not fully understood, COVID-19 not only has implications for the physical well-being of the human race, it has a direct impact on psychological health as well. Since this pandemic is fairly new, there is much to be researched about it. For now, all we know is that it spreads through interaction and is highly contagious.
There is also considerable debate about how long the virus survives outside the body. The disease does not discriminate by age: it affects people of all age brackets, but results have shown that people in higher age brackets are more susceptible.
A more pressing question is why you should be concerned. Other diseases cause a greater number of deaths daily, for example HIV, tuberculosis, and cancer. COVID-19 is a cause for major concern because it spreads at an alarmingly exponential rate. This gives the authorities insufficient time to plan properly for appropriate medical supplies and the other measures needed to stabilize or decrease the number of people affected or killed.
What can Teradata do about it?
To stay ahead of the disease, we aimed to utilize data science and analytics to obtain accurate projections of the numbers of confirmed cases and deaths. This provides an avenue for devising better-informed public policies -- like the timing and duration of lockdowns -- focusing on the most critical regions in an area. With better public policy, the health sector can be equipped more efficiently (ordering masks, ventilators, etc.). Knowing future figures for confirmed cases and deaths in advance, governments can design campaigns to raise public awareness, which in turn helps prevent a large number of deaths. This solution can also help the government sector vet the policies it has already made in light of COVID-19.
Forecasting COVID-19 with Data Science
Figure 1 Summary of the proposed solution
Figure 1 shows a summary of the solution proposed by a team of data scientists for the COVID-19 hackathon organized on May 6, 2020. The idea was simple: use a time-series COVID-19 dataset for analysis. Using various technologies offered by Teradata, multiple machine learning algorithms were applied to forecast future projections.
The base dataset used in this analysis was taken from the European data portal. The main variables that the dataset includes are given below:
- Confirmed cases
- Deaths
Similar variables are offered by many datasets available online, such as those from Kaggle, the WHO, and KDnuggets. One thing is common amongst all these offerings: since the disease is new, the datasets are small and the number of available features is low. To expand the dataset, more relevant features were brainstormed and added to the base European data source.
Increasing the number of variables in a dataset usually gives better results in terms of the metric of choice (loss, accuracy) when using machine learning algorithms.
The additional features added to the data set are mentioned below:
- Daily weather (Online, 2020)
- Population (WHO, 1948)
- Number of doctors (WHO, 1948)
- Tests performed (School, 2015)
- Deaths by lungs diseases (WHO, 1948)
- Mortality rate (WHO, 1948)
- Air quality (WHO, 1948)
To scale up the analysis, any relevant features may be added to the data set.
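The feature-widening step described above can be sketched with pandas. This is a hypothetical, minimal example (all values and column names are illustrative stand-ins, not the actual project data): daily features such as weather join on (date, country), while static features such as population join on country alone.

```python
import pandas as pd

# Base time series: one row per (date, country), mirroring the portal layout.
# All figures below are illustrative, not real case counts.
base = pd.DataFrame({
    "date": ["2020-05-01", "2020-05-01", "2020-05-02"],
    "country": ["Germany", "Spain", "Germany"],
    "confirmed": [100, 200, 110],
    "deaths": [5, 9, 6],
})

# A per-day, per-country feature (daily weather).
weather = pd.DataFrame({
    "date": ["2020-05-01", "2020-05-01", "2020-05-02"],
    "country": ["Germany", "Spain", "Germany"],
    "avg_temp_c": [13.1, 18.4, 12.7],
})

# A static, per-country feature (population).
population = pd.DataFrame({
    "country": ["Germany", "Spain"],
    "population": [83_000_000, 47_000_000],
})

# Widen the base set: daily features on (date, country), static ones on country.
ads = (base.merge(weather, on=["date", "country"], how="left")
           .merge(population, on="country", how="left"))
print(list(ads.columns))
```

Left joins keep every base row even when an auxiliary source has gaps, so missing weather readings simply become NaN rather than dropping days from the time series.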
Figure 2 Architecture of COVID-19 forecasting solution
The architecture of the proposed solution is shown above in Figure 2. Every block in the diagram represents a different stage in the pipeline. At the bottom of each block, the Teradata technology used in the development is mentioned alongside open-source and third-party tools. For the first data sources module, the datasets from different sources were combined using Python and then pushed onto a Teradata Vantage instance. Since the dataset was accessible in Teradata Vantage, it could be queried using the Teradata SQL Engine.
The machine learning algorithms mentioned in the third block were developed using Python libraries such as TensorFlow, Keras, and Prophet. To feed these machine learning algorithms, Teradata MLE was used to forklift the saved ADS (analytic dataset) so that it could be used for analytics via Jupyter notebooks on the Teradata App Center Transcend instance.
Once the predictions and forecasting were complete, the results were saved on Teradata Vantage and shown on interactive dashboards developed in Tableau. Although deployment through the AnalyticsOps offering of Teradata was also an option, it was not part of this implementation due to limited time.
The main models developed and tested for the solution are listed below and then described in detail:
- LSTM (single feature)
- LSTM (multi features)
- GRU
- Prophet
An LSTM (long short-term memory) model is a variant of a recurrent neural network (RNN) used for prediction on datasets that carry sequential patterns. LSTMs are typically used for time-series datasets and are also well suited to classification problems. This model was chosen because our dataset is also time-bound. The LSTM takes in the previous seven days' values and predicts the figure for the eighth day. Once actual target values run out, it feeds its own predictions back in as input.
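That sliding-window scheme can be sketched as follows. The helper names are hypothetical and a trivial callable stands in for the trained Keras LSTM; the point is the windowing and the feed-predictions-back-in loop, not the network itself.

```python
import numpy as np

def make_windows(series, n_in=7):
    """Slice a series into (seven-day input, next-day target) pairs."""
    X, y = [], []
    for i in range(len(series) - n_in):
        X.append(series[i:i + n_in])
        y.append(series[i + n_in])
    return np.array(X), np.array(y)

def forecast(model, history, steps, n_in=7):
    """Roll forward: once real values run out, feed predictions back in."""
    window = list(history[-n_in:])
    preds = []
    for _ in range(steps):
        pred = model(np.array(window))   # stand-in for the trained LSTM
        preds.append(pred)
        window = window[1:] + [pred]     # slide the window onto the prediction
    return preds

# Demo on a toy series, with a dummy "model" that echoes the window mean.
series = np.arange(20, dtype=float)
X, y = make_windows(series)
print(X.shape, y.shape)  # (13, 7) (13,)
preds = forecast(lambda w: w.mean(), series, steps=3)
```

Because each new prediction becomes part of the next input window, any error compounds over the forecast horizon; this is consistent with the flatlining behavior discussed in the results section.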
An important point that distinguishes this version of the LSTM from the next one (LSTM multi features) is that it doesn't use any of the additional features (i.e., weather, population, tests performed, etc.). It takes into account only one feature at a time. The purpose was to show the gain that can be achieved by adding more information and features to the dataset. Figure 3 shows a single time series (in batches of eight days: seven inputs plus one target) being fed into the LSTM to give an output.
Figure 3 LSTM 1 Feature Model
LSTM (Multi Features)
The LSTM-based model, shown in Figure 4, consists of a four-layer deep network with three fully connected layers and a time distributed LSTM layer. It takes as input the last 14 days' data which includes not just the time series of confirmed cases/deaths but also 20+ other features such as temperature readings and healthcare facilities to predict accurately the number of cases and deaths for each day over the next 14 days. It was trained on around 80 examples of 14-day intervals and the results were remarkably accurate, as shown below, even with such a small dataset.
As mentioned before, this LSTM variant takes into account the additional features that were added to the dataset. The performance of the algorithm in terms of predicted values shows a significant improvement, which is discussed in more detail in the results section. The diagram below shows a time series and an additional feature set being fed into the LSTM model to produce an output.
Figure 4 LSTM Multi-Feature Model
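The 14-days-in, 14-days-out windowing described above can be sketched as below. The shapes, feature count, and helper name are illustrative stand-ins for the project's actual ADS, chosen only to show how each 14-day block of features pairs with the following 14 days of the target.

```python
import numpy as np

def make_seq2seq_windows(features, target, n_days=14):
    """Pair each 14-day block of features with the next 14 days of the target."""
    X, Y = [], []
    for i in range(len(target) - 2 * n_days + 1):
        X.append(features[i:i + n_days])             # (14, n_features) input
        Y.append(target[i + n_days:i + 2 * n_days])  # next 14 daily values
    return np.array(X), np.array(Y)

# Synthetic stand-in: 120 days, 22 features (cases/deaths plus weather, etc.).
days, n_features = 120, 22
features = np.random.rand(days, n_features)
target = np.random.rand(days)
X, Y = make_seq2seq_windows(features, target)
print(X.shape, Y.shape)  # (93, 14, 22) (93, 14)
```

Arrays of shape (samples, timesteps, features) are exactly what a Keras time-distributed LSTM layer consumes, and the (samples, 14) targets match a 14-day-ahead output head.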
A GRU (gated recurrent unit) neural network is also a type of recurrent neural network. It is based on the architecture of an LSTM but has subtle differences. Compared to an LSTM, which has three gates (input, forget, and output), a GRU has only two (reset and update). Another major difference is that, unlike LSTMs, GRUs don't rely on separate memory cells.
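To make the gate difference concrete, here is a single GRU time step written out in NumPy. This is a textbook sketch of the standard update/reset equations, not the project's Keras implementation; weights and sizes are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU time step: two gates (update z, reset r), no separate memory cell."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x + Uz @ h + bz)              # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)  # candidate state
    return (1.0 - z) * h + z * h_tilde             # blend old and candidate states

# Tiny demo: hidden size 4, input size 1 (one series value per day).
rng = np.random.default_rng(0)
n_in, n_h = 1, 4
params = [rng.normal(scale=0.1, size=s)
          for s in [(n_h, n_in), (n_h, n_h), (n_h,)] * 3]
h = np.zeros(n_h)
week = rng.normal(size=(7, n_in))  # seven days of input
for x in week:
    h = gru_step(x, h, params)
print(h.shape)  # (4,)
```

With only two gates and no cell state, a GRU layer carries fewer parameters than an LSTM of the same hidden size, which is where the efficiency claim comes from.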
GRUs are reported in the literature to be computationally more efficient. Although efficiency was not in the scope of the project at this stage, the GRUs served as a good baseline for our LSTM models. The GRU model was likewise trained on the last seven days of the time series to forecast the eighth day. Figure 5 below shows a time series being fed into the GRU neural network to obtain forecast figures for one day.
Figure 5 GRU Model
Prophet is optimized for forecasting tasks that typically have any of the following characteristics: hourly, daily, or weekly observations with at least a few months (preferably a year) of history; a reasonable number of missing observations or large outliers; and non-linear growth trends, where a trend hits a natural limit or saturates.
Although the COVID-19 dataset is comparatively small, as mentioned earlier, the Prophet algorithm still produced relatively good results compared to other forecasting algorithms such as ARIMA, which was also tested but is not reported here.
A separate model was trained for every country, and parameters such as the Fourier order and seasonality period were tuned accordingly for forecasting confirmed cases and deaths. The models were trained on the time series from January 1 to May 5 and then forecast the next 15 days; the results were really impressive.
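The per-country training loop can be sketched as follows. Prophet expects a dataframe with `ds` (date) and `y` (value) columns; the Prophet calls themselves appear only as comments here, since the exact model settings are hypothetical and the library may not be installed.

```python
import pandas as pd

# Long-format input: one row per (date, country), as in the combined dataset.
# Values are illustrative, not real case counts.
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=4).tolist() * 2,
    "country": ["Germany"] * 4 + ["Spain"] * 4,
    "confirmed": [1, 2, 4, 8, 1, 3, 9, 27],
})

def prophet_frame(df, country):
    """Per-country frame with the ds/y column names Prophet expects."""
    sub = df[df["country"] == country]
    return sub.rename(columns={"date": "ds", "confirmed": "y"})[["ds", "y"]]

for country in df["country"].unique():
    train = prophet_frame(df, country)
    # One individually tuned model per country, e.g. (settings hypothetical):
    #   m = prophet.Prophet()
    #   m.add_seasonality(name="weekly", period=7, fourier_order=3)
    #   m.fit(train)
    #   future = m.make_future_dataframe(periods=15)
    #   forecast = m.predict(future)
    print(country, len(train))
```

Training one model per country lets each country's trend and seasonality be fitted independently, at the cost of not sharing information across countries.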
Figure 6 Prophet Algorithm
The results of the different models described above were represented via a Tableau dashboard that was developed specifically for this project. Figure 7 shows the front page of the dashboard. It projects the population figures of the 10 countries chosen for our analysis:
- United States of America
The population figures stress the fact that the dataset itself might be showing varying patterns and trends because the recorded data per country is impacted by a lot of different factors.
Figure 7 Population of top 10 countries across the globe
The second tab of the dashboard, shown in Figure 8, presents summarized COVID-19 statistics for the chosen 10 countries. The bar graph shows a country-wise breakdown of confirmed coronavirus cases, whereas the bubble plot shows country-wise deaths; the size of each bubble represents the number of deaths.
Figure 8 Country-wise COVID-19 summary statistics
To demonstrate the results graphically and tie the entire project into a storyboard, we show the results for Germany in the graphs below. Figure 9 shows the forecast trends of the LSTM 1 feature model for Germany. The top chart shows the trend in confirmed cases over the past three months and the bottom chart shows deaths in Germany over the same period. The blue line in the top chart represents the actual number of confirmed COVID-19 cases in Germany, whereas the orange line represents the predictions of the LSTM 1 feature model. Although the two lines are not far apart, the algorithm seems to have flatlined. This suggests that the dataset (a single feature and few data points) is insufficient for predicting the future.
A similar trend is seen in the second graph, where the fuchsia line represents the number of COVID-19 deaths and the green line the predictions made by the algorithm. Around the end of April, the algorithm seems to have captured some trends in the dataset and the two lines coincide perfectly, but then the algorithm flatlines.
Figure 9 LSTM 1 Feature results - Germany
The GRU model was also trained on a single feature and shows a trend similar to the LSTM 1 feature model, as seen in Figure 10. It predicts the trend slightly better than the LSTM 1 feature model, but after April 22 it too flatlines.
Figure 10 GRU results - Germany
Figure 11 shows the results for the LSTM multi-feature model. For this particular model, the algorithm was trained on both daily and cumulative figures. The charts on the right show results for the daily figures and those on the left show results for the cumulative figures. All the charts show an improved learning curve and a forecast curve that follows the previous trend more closely. The LSTM multi-feature model performs better than both the LSTM 1 feature and the GRU models.
Figure 11 LSTM Multi-Feature results - Germany
Finally, the graphs in Figure 12 show the results of the Prophet model. The top graph shows the confirmed COVID-19 cases for Germany, whereas the bottom graph shows the COVID-19 deaths. Both graphs show forecast numbers very similar to the actual past figures. This model appears to be the best implementation of the lot, which becomes even clearer when we compare the actual figures of a single day to the figures forecast by the model, as shown in Table 1. The actual figures in the table were taken on May 8, 2020. The predictions of the Prophet model are comparable to the actual figures.
Figure 12 Prophet results - Germany
| Country  | Actual    | Predicted |
|----------|-----------|-----------|
| Germany  | 169,430   | 166,698   |
| US       | 1,292,800 | 1,275,296 |
| Italy    | 215,858   | 216,916   |
| Spain    | 256,855   | 220,270   |
| Pakistan | 25,337    | 23,224    |

Table 1 Actual vs predicted confirmed cases for the Prophet model - May 8, 2020
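From the figures in Table 1, the per-country absolute percentage error of the Prophet forecasts is easy to compute (a quick check using only the numbers in the table):

```python
# Actual vs Prophet-predicted confirmed cases for May 8, 2020 (from Table 1).
results = {
    "Germany":  (169_430, 166_698),
    "US":       (1_292_800, 1_275_296),
    "Italy":    (215_858, 216_916),
    "Spain":    (256_855, 220_270),
    "Pakistan": (25_337, 23_224),
}

for country, (actual, predicted) in results.items():
    ape = abs(actual - predicted) / actual * 100  # absolute percentage error
    print(f"{country:9s} {ape:5.2f}%")
```

Germany, the US, and Italy come out within about 2% of the actual figures, while Spain's larger gap (around 14%) shows that per-country accuracy still varies considerably.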
As an end note, this project gave us insight into how neural networks react to the COVID-19 dataset. It also showed that increasing the number of features in the COVID-19 dataset improves the forecasting capabilities of the algorithms. This project focuses on the maximum utilization of Teradata technologies to help curb and fight COVID-19. We as a team have explored such offerings to highlight the true power of Teradata.
A special thanks to co-authors: Muhammad Owais Masood, Rana Muhammad Ahmad and Fahad Zia
Muhammad Usman Syed
Usman is a Data Science master's graduate from the University of Hildesheim (Germany) with prior experience in the data warehousing domain as a Business Intelligence consultant. He has worked with versatile teams in the telecom and finance sectors to cater to business requirements. Usman has worked on access layer development, report development, dashboard development, KPI reconciliation, and ad-hoc data requirements in the BI domain. In the data science domain, he worked on an end-to-end motion classification project based on sensor data, the results of which were published at ECDA 2019 (European Conference on Data Analysis). His tasks in that project were primarily data exploration, preprocessing, and training and testing logistic regression with LSTM. Usman has also worked on the comparison and improvement of model averaging techniques through network topology modelling in a distributed environment with PyTorch and MPI.