Machine Learning eXchange (MLX): One stop shop for Trusted Data and AI artifacts

By: Animesh Singh, Christian Kadner, Tommy Chaoping Li

In the AI lifecycle, we use data to build models for decision automation. Datasets, Models, and Pipelines (which take us from raw Datasets to deployed Models) are the three most critical pillars of the AI lifecycle. Because the Data and AI lifecycle involves so many steps, the work of building a model is often split across multiple teams, and significant duplication arises as each team creates similar Datasets, Features, Models, Pipelines, and Pipeline tasks. This fragmentation also makes traceability, governance, risk management, lineage tracking, and metadata collection far more difficult.

Announcing Machine Learning eXchange (MLX)

To solve the problems mentioned above, we need a central repository where the different asset types like Datasets, Models, and Pipelines can be stored, shared, and reused across organizational boundaries. Having opinionated, tested Datasets, Models, and Pipelines with quality checks, proper licenses, and lineage tracking increases the speed and efficiency of the AI lifecycle tremendously.

To address these challenges, IBM and the Linux Foundation AI and Data (LFAI and Data) are joining hands to announce Machine Learning eXchange (MLX), a Data and AI Asset Catalog and Execution Engine, in Open Source and Open Governance.

Machine Learning eXchange (MLX) allows upload, registration, execution, and deployment of AI pipelines and pipeline components, models, datasets, and notebooks.

MLX Architecture

MLX provides:

  • Automated sample pipeline code generation to execute registered models, datasets, and notebooks.
  • Pipelines engine powered by Kubeflow Pipelines on Tekton, the core of Watson Studio Pipelines.
  • Registry for Kubeflow Pipeline Components.
  • Dataset management by Datashim.
  • Serving engine by KFServing.

MLX Katalog Assets

Pipelines

In machine learning, it is common to run a sequence of tasks to process and learn from data, all of which can be packaged into a pipeline.

ML Pipelines are:

  • A consistent way to collaborate on data science projects across team and organization boundaries
  • A collection of coarse-grained tasks encapsulated as pipeline components that snap together like Lego bricks
  • A one-stop shop for people interested in training, validating, deploying, and monitoring AI models

Some sample Pipelines included in the MLX catalog: Trusted AI Pipeline (with AI Fairness 360 and Adversarial Robustness 360), Hyperparameter Tuning, Nested Pipeline.
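
To make this concrete, here is a minimal sketch of a two-step pipeline written with the Kubeflow Pipelines SDK (v1-style API) and compiled for the Tekton backend that MLX uses. The container images, file paths, and pipeline name are illustrative assumptions, not assets from the MLX catalog.

```python
# A minimal sketch of a two-step Kubeflow pipeline compiled for Tekton.
# Image names and file paths are hypothetical placeholders.
import kfp.dsl as dsl
from kfp_tekton.compiler import TektonCompiler

@dsl.pipeline(name="train-demo", description="Preprocess data, then train a model.")
def train_pipeline():
    # Step 1: a containerized preprocessing task that emits a file output.
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="example.io/preprocess:latest",   # hypothetical image
        file_outputs={"data": "/tmp/data.csv"}, # path the step writes its result to
    )
    # Step 2: training consumes the preprocessed data, which also
    # establishes the execution order between the two steps.
    dsl.ContainerOp(
        name="train",
        image="example.io/train:latest",        # hypothetical image
        arguments=["--data", preprocess.outputs["data"]],
    )

if __name__ == "__main__":
    # Produces Tekton YAML that a Kubeflow Pipelines on Tekton
    # installation (and hence MLX) can execute.
    TektonCompiler().compile(train_pipeline, "train_pipeline.yaml")
```

The compiled YAML can then be uploaded to the MLX catalog or run directly on a Kubeflow Pipelines on Tekton installation.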

Pipeline Components

A pipeline component is a self-contained set of code that performs one step in the ML workflow (pipeline), such as data acquisition, data preprocessing, data transformation, model training, and so on. A component is a block of code that performs one atomic task and can be written in any programming language, using any framework.

Some sample pipeline components included in the MLX catalog: Create Dataset Volume with DataShim, Deploy a Model on Kubernetes, Adversarial Robustness Evaluation, Model Fairness Check.
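
As an illustration of how a component can be authored, the sketch below wraps a plain Python function into a reusable component with the kfp SDK (v1-style API). The function body, base image, and package list are assumptions for the example, not an MLX-provided component.

```python
# A minimal sketch of building a pipeline component from a Python function.
from kfp.components import InputPath, OutputPath, create_component_from_func

def normalize(data_path: InputPath("CSV"), output_path: OutputPath("CSV")):
    """Scale the numeric columns of a CSV to zero mean and unit variance."""
    import pandas as pd  # imported inside so it resolves in the component image

    df = pd.read_csv(data_path)
    numeric = df.select_dtypes("number")
    df[numeric.columns] = (numeric - numeric.mean()) / numeric.std()
    df.to_csv(output_path, index=False)

# Package the function as a component that any pipeline can reuse.
normalize_op = create_component_from_func(
    normalize,
    base_image="python:3.9",         # assumption: any Python base image works
    packages_to_install=["pandas"],  # installed when the component starts
)
```

Inside a pipeline, `normalize_op` can then be called like any other task, with upstream outputs wired into `data_path`.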

Models

MLX provides a collection of free, open-source, state-of-the-art deep learning models for common application domains. The curated list includes deployable models, which can be run as microservices on Kubernetes or OpenShift, and trainable models, for which users can provide their own data.

Some sample models included in the MLX catalog: Human Pose Estimator, Image Caption Generator, Recommender System, Toxic Comment Classifier.
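
Because the deployable models run as microservices, a natural way to stand one up is a KFServing InferenceService. Below is a minimal sketch using the KFServing Python SDK (v1beta1 API) with a public KFServing sample model; the service name and namespace are placeholders.

```python
# A minimal sketch of deploying a model with the KFServing Python SDK.
from kubernetes import client as k8s
from kfserving import (
    KFServingClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
)

isvc = V1beta1InferenceService(
    api_version="serving.kubeflow.org/v1beta1",
    kind="InferenceService",
    metadata=k8s.V1ObjectMeta(name="sklearn-iris", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            # A public KFServing sample model; swap in your own storage URI.
            sklearn=V1beta1SKLearnSpec(
                storage_uri="gs://kfserving-samples/models/sklearn/iris"
            )
        )
    ),
)

# Creates the InferenceService; KFServing exposes it as an HTTP endpoint.
KFServingClient().create(isvc)
```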

Datasets

The MLX catalog contains reusable datasets and leverages Datashim to make the datasets available to other MLX assets like notebooks, models, and pipelines in the form of Kubernetes volumes.

Sample datasets contained in the MLX catalog include: Finance Proposition Bank, NOAA Weather Data – JFK Airport, Thematic Clustering of Sentences, TensorFlow Speech Commands.
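
Under the hood, Datashim represents a dataset as a Kubernetes custom resource. The sketch below registers such a Dataset with the Kubernetes Python client; the bucket, endpoint, and credential values are placeholders.

```python
# A minimal sketch of registering a Datashim Dataset custom resource
# so other workloads can mount it as a Kubernetes volume.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a pod

dataset = {
    "apiVersion": "com.ie.ibm.hpsys/v1alpha1",  # Datashim's API group
    "kind": "Dataset",
    "metadata": {"name": "example-dataset"},
    "spec": {
        "local": {
            "type": "COS",                         # S3/Cloud Object Storage backend
            "endpoint": "https://s3.example.com",  # placeholder endpoint
            "bucket": "example-bucket",            # placeholder bucket
            "accessKeyID": "<ACCESS_KEY>",         # placeholder credentials
            "secretAccessKey": "<SECRET_KEY>",
            "readonly": "true",
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="com.ie.ibm.hpsys",
    version="v1alpha1",
    namespace="default",
    plural="datasets",
    body=dataset,
)
```

Once the Dataset object exists, Datashim provisions a PersistentVolumeClaim of the same name, which notebooks, models, and pipeline steps can mount like any other volume.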

Notebooks

Jupyter notebook is an open-source web application that allows data scientists to create and share documents that contain runnable code, equations, visualizations, and narrative text. MLX can run Jupyter notebooks as self-contained pipeline components by leveraging the Elyra-AI project.

Sample notebooks contained in the MLX catalog include: AIF360 Bias Detection, ART Poisoning Attack, JFK Airport Analysis, Project CodeNet Language Classification. 
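
For a sense of what a headless, parameterized notebook run looks like, the sketch below uses papermill, the notebook executor Elyra relies on under the hood, to run a notebook the way a pipeline step would. The file names and parameter are hypothetical.

```python
# A minimal sketch of executing a notebook headlessly with papermill.
import papermill as pm

pm.execute_notebook(
    "analysis.ipynb",                   # hypothetical input notebook
    "analysis-output.ipynb",            # executed copy with cell outputs saved
    parameters={"sample_size": 1000},   # injected into the tagged 'parameters' cell
)
```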

Join us to build a cloud-native AI Marketplace on Kubernetes

The Machine Learning eXchange provides a marketplace and platform for data scientists to share, run, and collaborate on their assets. You can now use it to host and collaborate on Data and AI assets within your team and across teams. Please join us on the Machine Learning eXchange GitHub repo, try it out, give feedback, and raise issues. Additionally, you can connect with us via the following:

  • To contribute to and build end-to-end Machine Learning Pipelines on OpenShift and Kubernetes, please join the Kubeflow Pipelines on Tekton project and reach out with any questions, comments, and feedback!
  • To deploy Machine Learning Models in production, check out the KFServing project.