
Data Science Versus Data Engineering

In the third installment of this blog series, we told you that “the analytic discovery process has more in common with research and development (R&D) than with software engineering.” But, symmetrically if confusingly, what comes after the discovery process typically has more in common with software engineering than with R&D.

The objective of our analytic project will very often be to produce a predictive model that we can put to work: for example, to predict the level of demand for different products next week, to identify which customers are most at risk of churning, or to forecast whether free cash flow next quarter will be above or below plan.

For that model to be useful in a large organization, we may need to apply it to large volumes of data, some of it freshly minted, so that we can, for example, continuously predict product demand at the store/SKU level. In other cases, we may need to re-run the model in near real-time and on an event-driven basis: for example, if we are building a recommendation engine to cross-sell and up-sell products to customers on the web, based both on the historical preferences they have expressed in previous interactions with the website and on what they are browsing right now. And in almost all cases, the output of the model, whether a sales forecast, a probability-to-churn score, or a list of recommended products, will need to be fed back into one of the transactional systems that we use to run the business, so that we can take some useful action based on the insight the model has provided.
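To make that write-back pattern concrete, here is a minimal sketch in Python of the batch-scoring step described above: load a previously trained churn model, score a fresh extract of customer data, and push the resulting scores back into an operational table so that the business can act on them. The model file, table and column names, and connection string are all hypothetical, and a production pipeline would add scheduling, validation, logging and error handling.

```python
import joblib
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection to the operational database (placeholder credentials).
engine = create_engine("postgresql://user:password@dbhost/crm")

# Load the model that was trained and serialized in the lab.
model = joblib.load("churn_model.joblib")

# Fetch freshly minted customer features (hypothetical table and columns).
features = pd.read_sql(
    "SELECT customer_id, tenure_months, monthly_spend, support_calls "
    "FROM customer_features",
    engine,
)

# Score every customer: estimated probability of churning in the next period.
scores = model.predict_proba(features.drop(columns=["customer_id"]))[:, 1]

# Send the scores back to a table the CRM system reads, so action can be taken.
results = pd.DataFrame(
    {"customer_id": features["customer_id"], "churn_score": scores}
)
results.to_sql("churn_scores", engine, if_exists="replace", index=False)
```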

To deliver any value to the business, then, we may need to take a model built in the lab from brown paper and string and use it to crunch terabytes, or petabytes, of data on a weekly, daily or even hourly basis. Or we may need to perform thousands of complex calculations simultaneously on smaller datasets and send the results back to an operational system within a few hundred milliseconds. Achieving those levels of performance and scalability requires a well-engineered system built on a set of robust, well-integrated technologies.


Our system may have to ingest data from several sources, integrate it, transform the raw data into “features” that are the input to the model, crunch those features using one or more algorithms, and send the resulting output somewhere else. When you hear data engineers talking about building “machine learning pipelines”, it is this fetch-integrate-transform-crunch-send process that they are referring to.
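As a rough illustration of the transform-and-crunch part of such a pipeline, the sketch below uses Python and scikit-learn to chain feature preparation and a model into a single, repeatable object that can be fitted in the lab and then re-applied, unchanged, to fresh data. The file names and column names are hypothetical; a real pipeline would also cover the fetch, integrate and send steps around it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data, already fetched and integrated from several sources.
raw = pd.read_csv("customer_history.csv")
X, y = raw.drop(columns=["churned"]), raw["churned"]

# Transform: turn raw columns into the "features" the model expects.
features = ColumnTransformer(
    transformers=[
        ("numeric", StandardScaler(), ["tenure_months", "monthly_spend"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["region", "plan"]),
    ]
)

# Crunch: chain the feature step and the algorithm into one pipeline object.
pipeline = Pipeline(
    steps=[
        ("features", features),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)
pipeline.fit(X, y)

# The fitted pipeline scores fresh data end to end; in production the output
# would then be sent on to a downstream operational system.
fresh = pd.read_csv("customers_this_week.csv")  # same raw columns as X
churn_scores = pipeline.predict_proba(fresh)[:, 1]
```

Packaging the feature transformations and the algorithm together in this way means that the same logic runs in discovery and in production, which helps to avoid the subtle mismatches between laboratory and operational code that data engineers otherwise spend a lot of time chasing down.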

Building, tuning and optimizing these pipelines at web and Internet of Things scale is a complex engineering challenge, and one that often requires a different set of skills from those needed to design and prove the prototype model that demonstrates the feasibility and utility of the original concept. Some organizations put data scientists and data engineers into multi-disciplinary teams to address this challenge; others focus their data scientists on the discovery part of the process and their data engineers on operationalizing the most promising of the models developed in the laboratory. Both approaches can work, but it is important to ensure that you have the right balance of the two skill sets. Over-emphasize creativity and innovation and you risk creating lots of brilliant prototypes that are too complex to implement in production; over-emphasize robust engineering and you risk diminishing returns, as the team focuses on squeezing the last drop from an existing pipeline rather than considering a completely new approach.

Of course, not every analytic discovery project will automatically result in a complex implementation project. As we pointed out in a previous blog, the “failure” rate for analytic projects is relatively high, so we may go several times around the CRISP-DM cycle before we settle on an approach worth implementing. And sometimes our objective may simply be to understand. For example, a bricks-and-mortar retailer might want to identify and understand different shopping missions, and “implementation” might then mean changing ranging and merchandising strategies, rather than deploying a complex, real-time software solution.

Whilst there are several ways of employing the insight from an analytic discovery project, the one thing they all have in common is this: change. As a former boss once said to one of us: old business process + expensive new technology = expensive old business process. And achieving meaningful, significant change in large and complex organizations is never merely about data, analytics and engineering; it is also about organizational buy-in, culture and good change management. Whilst data scientists and data engineers often have different backgrounds and different skill sets, one thing they frequently have in common, and that they may take for granted in others, is a belief that the data know best. Since plenty of other stakeholders see the business through entirely different lenses, securing the organizational buy-in required to act on the insight derived from an analytic project is often as complex a process as the most sophisticated machine learning pipeline. Involve those other stakeholders early and often if you want to discover something worth learning in your data, and if you want that learning to change the way you do business.



Author: Martin Willcox

Martin leads Teradata’s EMEA technology pre-sales function and organisation and is jointly responsible for driving sales and consumption of Teradata solutions and services throughout Europe, the Middle East and Africa. Prior to taking up his current appointment, Martin ran Teradata’s Global Data Foundation practice and led efforts to modernise Teradata’s delivery methodology and associated tool-sets. In this position, Martin also led Teradata’s International Practices organisation and was charged with supporting the delivery of the full suite of consulting engagements delivered by Teradata Consulting – from Data Integration and Management to Data Science, via Business Intelligence, Cognitive Design and Software Development.

Martin was formerly responsible for leading Teradata’s Big Data Centre of Excellence, a team of data scientists, technologists and architecture consultants charged with supporting field teams in enabling Teradata customers to realise value from their analytic data assets. In this role Martin was also responsible for articulating Teradata’s Big Data strategy to prospective customers, analysts and media organisations outside of the Americas. During his tenure in this position, Martin was listed in dataIQ’s “Big Data 100” as one of the most influential people in UK data-driven business in 2016. His Strata (UK) 2016 keynote can be found at www.oreilly.com/ideas/the-internet-of-things-its-the-sensor-data-stupid; a selection of his Teradata Voice Forbes blogs can be found online; and more recently, Martin co-authored a series of blogs on Data Science and Machine Learning, including Discovery, Truth and Utility: Defining ‘Data Science’.

Martin holds a BSc (Hons) in Physics & Astronomy from the University of Sheffield and a Postgraduate Certificate in Computing for Commerce and Industry from the Open University. He is married with three children and is a solo glider pilot, supporter of Sheffield Wednesday Football Club, very amateur photographer – and an even more amateur guitarist.


Author: Dr. Frank Säuberlich

Dr. Frank Säuberlich leads the Data Science & Data Innovation unit of Teradata Germany. Part of his responsibility is to make the latest market and technology developments available to Teradata customers. Currently, his main focus is on topics such as predictive analytics, machine learning and artificial intelligence.
Following his studies in business mathematics, Frank Säuberlich worked as a research assistant at the Institute for Decision Theory and Corporate Research at the University of Karlsruhe (TH), where he was already working on data mining questions.

His professional career has included positions as a senior technical consultant at SAS Germany and as a regional manager for customer analytics at Urban Science International. Frank has been with Teradata since 2012, beginning as an expert in advanced analytics and data science in the International Data Science team before becoming Director of Data Science (International).

