Building the Machine Learning Infrastructure

Making intelligent and accurate predictions is the core objective of machine learning and artificial intelligence applications. To achieve that objective, the application needs clean, well-organized information within a robust ecosystem architecture.

Machine learning (ML) is the process by which a computer system makes predictions based on samples of past observations. There are various types of ML methods. In one common approach, an ML algorithm is trained on a labeled or unlabeled training data set to produce a model. New input data is then introduced, and the algorithm makes a prediction based on the model. The prediction is evaluated for accuracy; if the accuracy is acceptable, the model is deployed. If it is not, the algorithm is trained again with an augmented training data set. This is only a very high-level example, as many other factors and steps are involved.
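
The train, evaluate, and retrain cycle described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production workflow: the one-feature threshold classifier, the function names, the sample values, and the 0.9 accuracy target are all assumptions made for the example.

```python
from statistics import mean

def train(samples):
    """Fit a toy one-feature classifier: the threshold is the midpoint of the class means."""
    lows = [x for x, label in samples if label == 0]
    highs = [x for x, label in samples if label == 1]
    return (mean(lows) + mean(highs)) / 2

def predict(threshold, x):
    """Predict class 1 when the feature value is above the learned threshold."""
    return 1 if x > threshold else 0

def accuracy(threshold, samples):
    """Fraction of labeled samples the model predicts correctly."""
    hits = sum(predict(threshold, x) == label for x, label in samples)
    return hits / len(samples)

def train_until_acceptable(training, extra_batches, holdout, target=0.9):
    """Train, evaluate on held-out data, and retrain with augmented data while below target."""
    training = list(training)
    model = train(training)
    score = accuracy(model, holdout)
    rounds = 0
    for batch in extra_batches:
        if score >= target:
            break
        training.extend(batch)   # augment the training data set
        model = train(training)  # train again
        score = accuracy(model, holdout)
        rounds += 1
    return model, score, rounds, score >= target

# Illustrative (made-up) data: (feature, label) pairs.
TRAINING = [(0.0, 0), (1.0, 0), (6.0, 1), (7.0, 1)]
EXTRA_BATCHES = [
    [(3.9, 0), (4.9, 0), (5.1, 1)],
    [(4.7, 0), (4.8, 0), (5.3, 1)],
]
HOLDOUT = [(2.0, 0), (4.0, 0), (4.5, 0), (5.5, 1), (8.0, 1)]

model, score, rounds, deployable = train_until_acceptable(TRAINING, EXTRA_BATCHES, HOLDOUT)
```

With this sample data the first model falls short of the target on the held-out set, and it takes two rounds of augmented training before the model is marked deployable.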

Artificial intelligence (AI) takes machine learning to a more dynamic level, producing a feedback loop in which an algorithm can learn from its experiences. In many cases, an intelligent agent perceives an environment, detects changes in it, and then reacts to those changes based on the information and rules it has been taught.

Every AI program depends on information to make predictions and decisions. That information needs to be structured in the appropriate context so that the program can make informed decisions.

An example of appropriate context comes from a robotic vacuum cleaner application [1] that navigates a room on its own, and from how it was judged to be doing a “good job”. The metric chosen focused on “picking up the dirt”, so it measured the volume of dirt vacuumed and the amount of time spent collecting it. Given this objective, the vacuum learned that dirt got picked up when it bumped into an object. It learned that the most dirt was collected next to furniture or other objects, and it would bump an object harder to dislodge additional dirt, for example by knocking over a plant, dumping the soil on the floor, and then collecting it. It consumed more energy, which in turn cost more, not to mention causing a mess, but it did a “good job” according to the metric by which it was measured. Its behavior followed from the context of the information to which it had access.

Continuing with this approach kept increasing expenses and decreasing benefits.

The solution was to change the perspective to a new metric, “clean the room and keep it clean”. The application then learned to expend energy only in the areas that needed to be vacuumed, which reduced the cost of the energy consumed by the device. It needed additional sensors to accomplish this new mission, which at first sight would seem to increase cost, but the energy saved on each run paid that cost back and produced significant value. It now operated in terms of efficiency.
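
The effect of changing the metric can be shown with a small Python sketch that scores the same two behaviors under both objectives. Everything here is hypothetical: the policy names, the numbers, and the way energy is weighted are invented for illustration and are not taken from the cited example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Episode:
    policy: str                # hypothetical name for the learned behavior
    dirt_collected: float      # volume of dirt vacuumed
    dirt_left_on_floor: float  # dirt still in the room at the end of the run
    energy_used: float         # energy consumed during the run

def dirt_collected_metric(episode):
    """Original objective: reward only the volume of dirt picked up."""
    return episode.dirt_collected

def clean_room_metric(episode, energy_weight=0.5):
    """Revised objective: penalize dirt left behind and energy spent (weight is an assumption)."""
    return -episode.dirt_left_on_floor - energy_weight * episode.energy_used

def best_policy(episodes, metric):
    """Return the policy that scores highest under the given metric."""
    return max(episodes, key=metric).policy

# Made-up numbers: bumping furniture collects more dirt but leaves a mess and burns energy.
EPISODES = [
    Episode("bump_furniture", dirt_collected=9.0, dirt_left_on_floor=3.0, energy_used=8.0),
    Episode("targeted_clean", dirt_collected=4.0, dirt_left_on_floor=0.5, energy_used=3.0),
]

preferred_by_old_metric = best_policy(EPISODES, dirt_collected_metric)
preferred_by_new_metric = best_policy(EPISODES, clean_room_metric)
```

Under the first metric the furniture-bumping behavior wins because it collects the most dirt; under the second, the targeted behavior wins because the score now reflects the state of the room and the energy spent.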

For AI, machine learning, and any type of analytics, the better the information is modeled, structured, and organized for fast retrieval, the more effectively and efficiently the processing will perform.

Conversely, the more complex the model or structure, the more complex the processing.

AI and ML algorithms that search for patterns in unstructured or non-relational data still need structure. Even schema-less data must be wrangled into meaningful structures. AI and ML algorithms are most effective when the enterprise architecture enables efficient access to and retrieval of information for specific contexts. The ingestion framework for an enterprise ecosystem architecture needs to consider the information and data needed for machine learning and analytics. The landed data should be a single point of use from which multiple applications and platforms can draw; in other words, land once, use many.
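
As a rough illustration of wrangling schema-less records into one structure and then reusing the landed result, consider the Python sketch below. The field names, the alternate key spellings, and the two consumers are assumptions made for the example, not part of any particular product.

```python
from collections import defaultdict
from statistics import mean

def first_present(record, *keys):
    """Return the value of the first key that is present and not None."""
    for key in keys:
        if record.get(key) is not None:
            return record[key]
    return None

def wrangle(record):
    """Map one loosely structured record to a fixed structure, or None if it is unusable."""
    device = first_present(record, "device_id", "deviceId", "id")
    raw_reading = first_present(record, "reading", "value")
    recorded_at = first_present(record, "recorded_at", "timestamp")
    if device is None or raw_reading is None or recorded_at is None:
        return None
    try:
        reading = float(raw_reading)
    except (TypeError, ValueError):
        return None
    return {"device_id": str(device).strip(), "reading": reading, "recorded_at": str(recorded_at)}

def land(raw_records):
    """Land once: wrangle every record and keep only the usable rows."""
    return [row for row in map(wrangle, raw_records) if row is not None]

def feature_vectors(landed):
    """Consumer 1: features for a machine learning model."""
    return [[row["reading"]] for row in landed]

def mean_reading_by_device(landed):
    """Consumer 2: a simple analytics report."""
    grouped = defaultdict(list)
    for row in landed:
        grouped[row["device_id"]].append(row["reading"])
    return {device: mean(values) for device, values in grouped.items()}

# Made-up raw input with inconsistent keys, types, and quality.
RAW = [
    {"deviceId": "a1", "value": "20.5", "timestamp": "2017-06-01T10:00:00Z"},
    {"device_id": " a1 ", "reading": 21, "recorded_at": "2017-06-01T10:05:00Z"},
    {"id": "b2", "value": "n/a", "timestamp": "2017-06-01T10:06:00Z"},
    {"note": "free text with no usable fields"},
]

LANDED = land(RAW)
```

Both consumers read the same `LANDED` rows, so the cleansing rules are applied once rather than being reimplemented by each application.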

Kylo is an open source solution for data ingestion and data lake management. It employs NiFi templates to build an ingestion pipeline with cleansing, wrangling, and governance that transforms data into the meaningful structures needed for machine learning and analytics.

Kylo provides an ingestion framework that is a key component of any machine learning infrastructure. It leverages NiFi and Spark and is flexible enough to accommodate other technologies. The ingestion framework includes a wrangling component that facilitates the transformation of data into the meaningful structures that ML and AI rely on to make enhanced predictions. Data lineage is also captured in the framework to enforce governance. The framework accelerates the development process and the iterations that are critical to constantly improving model accuracy.
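
To make the idea of capturing lineage alongside transformation concrete, here is a generic Python sketch. It does not use or imitate the Kylo, NiFi, or Spark APIs; the function names, the record layout, and the source label are hypothetical and chosen only for illustration.

```python
def drop_missing(rows):
    """Cleansing step: discard rows that have no amount."""
    return [row for row in rows if row.get("amount") is not None]

def amounts_to_float(rows):
    """Wrangling step: convert the amount field to a number."""
    return [{**row, "amount": float(row["amount"])} for row in rows]

def run_pipeline(rows, steps, source):
    """Apply each named step in order and record what happened to the data at every stage."""
    lineage = [{"step": "ingest", "source": source, "rows_in": len(rows), "rows_out": len(rows)}]
    for name, step in steps:
        rows_in = len(rows)
        rows = step(rows)
        lineage.append({"step": name, "rows_in": rows_in, "rows_out": len(rows)})
    return rows, lineage

# Made-up input and a hypothetical source label.
RAW_ROWS = [{"amount": "3.5"}, {"amount": None}, {"amount": "2"}]
STEPS = [("drop_missing", drop_missing), ("amounts_to_float", amounts_to_float)]

CLEAN_ROWS, LINEAGE = run_pipeline(RAW_ROWS, STEPS, source="example_feed")
```

The lineage list gives a reviewer a trail from the landed rows back to their source, which is the kind of record governance checks depend on.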

Boosting business outcomes with the best ML and AI applications relies on a robust machine learning infrastructure and a well-thought-out ecosystem architecture. Kylo is a Teradata-sponsored open source project under the Apache 2.0 license that provides an extensible framework for the machine learning infrastructure. Teradata also provides an ecosystem architecture consulting service that draws on the experience of technology professionals in combining the right mix of technologies and data platforms into an efficient digital ecosystem.

References
[1] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Upper Saddle River, New Jersey: Pearson Higher Education, 1995, pp. 46-61.

(Author):
Pat Alvarado

Pat Alvarado is a Sr. Solution Architect with Teradata and a senior member of the Institute of Electrical and Electronics Engineers (IEEE). Pat’s background began in hardware and software engineering, applying open source software to distributed UNIX servers and diskless workstations. Pat joined Teradata in 1989, providing technical education to hardware and software engineers and building out the new software engineering environment for the migration of Teradata Database development from a proprietary operating system to UNIX and Linux in a massively parallel processing (MPP) architecture.

Pat presently provides thought leadership in Teradata and open source big data technologies across multiple deployment architectures, including public cloud, private cloud, on-premises, and hybrid.
Pat is also a member of the UCLA Extension Data Science Advisory Board and teaches online UCLA Extension courses on big data analytics and information management.