MLflow 1.0 vs. MLflow 2.0 - A Quick Comparison
Written by Hafsa Mustafa
Technical Content Writer
October 27, 2022
MLflow 2.0, the latest version of Databricks' MLflow platform, is out. The release begs the question: how does MLflow 2.0 differ from the first version? This blog answers that question by taking an in-depth look at MLOps, the machine learning lifecycle, and the two versions of the Databricks MLflow platform. Read on.
Deconstructing MLOps
The term MLOps combines two fields: ML (machine learning) and Ops (operations), where operations may refer to development or system operations. MLOps is a systematic process for developing, training, deploying, and optimizing statistical models (e.g., machine learning, deep learning, and natural language processing models) that power AI-based applications. Most MLOps platforms offer collaborative model construction, training, assessment, and deployment, in addition to the data preparation that drives the modeling and training processes.
MLOps technologies frequently automate the key components of this workflow: feature engineering and hyperparameter tuning, model scoring and best-fit deployment, continuous model learning, and model governance. Collectively, these procedures are often referred to as data science workflow processes.
Databricks MLflow as MLOps Framework
The main purpose of MLOps is to automate the data science process, from training machine learning models to deploying them into production. MLflow is an open-source machine learning platform developed by the Databricks team to operationalize machine learning workflows. It helps practitioners move from training to production by supporting a diverse set of frameworks (TensorFlow, PyTorch, XGBoost, and SparkML) and a diverse set of serving environments, such as Amazon SageMaker, Azure ML, and Spark.
In simple words, MLflow can be defined as a tool for managing the machine learning lifecycle. It is used by data scientists and MLOps teams to streamline model development and production.
Common Problems MLflow 1.0 Aimed to Solve
MLflow 1.0 was designed to solve some core problems related to Machine Learning practice:
- There was no proper way to keep track of experiments, especially hyperparameter tuning and other metrics.
- Reproducing the model in a colleague’s environment from your optimal runs was a challenge.
- There was no proper way to exchange models. Although GitHub was used to share code, practitioners were often still unable to reproduce a model from it.
- Comparing models was another issue: as practitioners ran more experiments, it became difficult to remember which model had performed best and with what accuracy.
- No standard way of packaging & deploying a model.
The initial release of MLflow 1.0 addressed all of the above issues through four components:
- MLflow Tracking – keeps track of all the runs, code, and metrics. It is essential for tracking experiments.
- MLflow Projects – a standard format for packaging code
- MLflow Models – a standard format for packaging models
- MLflow Model Registry – a centralized model store. The Model Registry also provides APIs and a UI for managing registered models.
Limitations in MLflow 1.0
Although MLflow made the machine learning workflow much easier than it previously was, some aspects were still left unattended. The remaining issues were:
- As an MLOps tool, MLflow 1.0 lacked a proper machine learning pipeline infrastructure that could accelerate the production and deployment of ML models at scale.
- There was no shortcut for updating the model when the data changed. The data science and DevOps teams responsible for productionizing had to retrain and redeploy the model from scratch, which increased complexity and consumed a lot of time.
- There were no pre-defined steps for data ingestion, splitting, and transformation to support dataset preparation.
- Limited features for visualizing the runs and comparing them to get a better understanding of the experiments.
Key Features in MLflow 2.0: What’s New?
Let’s take a look at the new features added in this more advanced version of MLflow.
MLflow Pipelines
MLflow 2.0 simplifies developing and managing machine learning operations by reducing the manual work required to iterate on and deploy models, through production-grade MLflow Pipelines and well-defined pipeline templates.
This open ML platform now contains a pipeline engine that makes it easier to move from one step to the next and reduces the overall complexity of the work. The pipelines follow an opinionated workflow in which each step embodies best practices such as deterministic splits, data profiles, transformed feature names, feature importance, and automatic MLflow tracking.
Lastly, MLflow Pipelines provides a standard command-line interface for integration with CI/CD and ships with test suites to guard against unexpected results.
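As a rough illustration, a pipeline is driven by a declarative configuration file that wires the predefined steps together. The sketch below is hypothetical and loosely modeled on the regression template that shipped with MLflow 2.0; the exact keys, step names, and file layout vary by template and version, so treat this as an outline rather than a working config.

```yaml
# pipeline.yaml — hypothetical sketch of an MLflow Pipelines config
template: "regression/v1"
target_col: "price"
steps:
  split:
    split_ratios: [0.75, 0.125, 0.125]   # train / validation / test
  transform:
    transformer_method: steps.transform.transformer_fn
  train:
    estimator_method: steps.train.estimator_fn
  evaluate:
    validation_criteria:
      - metric: root_mean_squared_error
        threshold: 10
```

Because the split, transform, train, and evaluate steps are declared rather than hand-written, teams can swap estimators or thresholds without touching the pipeline plumbing.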
Support for Tracking MLflow Pipelines
MLflow Tracking is now available at the workflow level, so the metadata of each pipeline execution is tracked automatically, simplifying the machine learning workflow.
Improved Visualization
As discussed above, only a few features were previously available for visualizing and comparing runs. In MLflow 2.0, Databricks has introduced additional visualization tools, such as box plot visualizations and a data profile feature, which help surface statistics about the data, especially outliers, and allow you to visualize each column independently.
Support for NaN Values
Null values matter in practice because replacing them with imputation techniques (for instance, mean, median, or mode) can make the data unreliable. MLflow 2.0 adds support for NaN values during logging and visualization, enabling practitioners to visualize their data without altering any null values.
Searching Enhancement
Unlike MLflow 1.0, which offered only a few simple features for comparing and searching runs, MLflow 2.0 introduces the ‘mlflow.search_experiments()’ API for searching experiments by name and by tags.
Importable MLflowClient API
MLflow’s language APIs (Python, R, Scala) interact with MlflowClient at a higher level, and at a lower level MlflowClient interacts with the REST APIs. In MLflow 2.0, MlflowClient is made importable as mlflow.MlflowClient.
Conclusion
With MLflow 2.0, data scientists and ML engineers now have a proper open-source MLOps tool that simplifies developing and managing machine learning projects by reducing the manual work of iterating on and deploying models.
Royal Cyber’s certified data experts have extensive experience in deploying and productionizing smart machine learning models for businesses across diverse industries. As a Databricks consulting partner, Royal Cyber is well-versed in modern data technology. You can reach out to our experts if you have any queries on the subject.