Five years ago, very few companies had any form of AI in production. Most AI was still experimental. One (of many) reasons for this was that the art of production machine learning (or MLOps) was in its nascent stages. Now, MLOps is relatively well established - just in time for a new operational skill set to become required - LLMOps (operational practices for large language models)! In this article, I cover what MLOps was and is, what parts I believe remain relevant for LLMs, and what new challenges the current LLM wave brings to operationalization for enterprises.
What is MLOps?
MLOps (or machine learning in production) refers to the set of practices, skills, and tools required to bring a machine learning (or deep learning, or AI) model into production while maintaining correctness, ethics, governance, and security. MLOps contains several subcategories. For example - Continuous Integration and Continuous Deployment focus on the process of deploying and integrating new model versions as well as the associated validation, while ML Observability (or ML Health) focuses on the monitoring of ML model behavior in production. A combination of technologies for each area together form a good MLOps practice.
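To make the CI/CD piece concrete, a deployment pipeline often includes a validation gate that compares a candidate model against the current production model before promotion. A minimal sketch in Python (the metric names, numbers, and thresholds below are illustrative assumptions, not a standard):

```python
# Hypothetical evaluation results for the current production model and a
# candidate replacement; in a real pipeline these would come from scoring
# both models on the same held-out test set.
production = {"accuracy": 0.91, "p95_latency_ms": 120.0}
candidate = {"accuracy": 0.93, "p95_latency_ms": 135.0}

def validation_gate(candidate: dict, production: dict,
                    max_accuracy_drop: float = 0.01,
                    max_latency_increase: float = 1.25) -> list[str]:
    """Return a list of failed checks; an empty list means the candidate
    may be promoted to production."""
    failures = []
    if candidate["accuracy"] < production["accuracy"] - max_accuracy_drop:
        failures.append("accuracy regression")
    if candidate["p95_latency_ms"] > production["p95_latency_ms"] * max_latency_increase:
        failures.append("latency regression")
    return failures

failures = validation_gate(candidate, production)
print("deploy" if not failures else f"blocked: {failures}")
```

The same gate pattern extends to fairness or robustness checks; the point is that promotion is an automated, auditable decision rather than a manual one.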
What is a Large Language Model?
Large Language Models (or LLMs) are the most recent (and substantial) advancement of Natural Language Processing - or NLP. Powered by technologies such as Transformers and Reinforcement Learning with Human Feedback (RLHF), large language models are trained on massive datasets and are able to do a range of tasks such as text summarization, content generation, question answering, and more. Brought to increased mainstream awareness by ChatGPT, LLMs are now easily and practically useful in a massive range of use cases, from writing emails and marketing copy to creating learning tools. Their pervasiveness has made MLOps targeted to them extremely important.
Unique Challenges of Large Language Models
MLOps predated large language models by just a few years, but large language models already bring new challenges that are at the very least different from (if not more complex than) those of traditional machine learning or deep learning models. For example:
- The notion of ML model quality, for simpler supervised learning models, can be captured by metrics such as accuracy for classification models (such as models that say Yes/No for disease diagnosis) or error for regression models (such as models that predict house prices). LLMs, however, generate language in response to a language prompt. These models can be measured on many metrics besides correctness (which is still key): the coherence of the answer, its appropriateness for the audience, the ability to cite suitable references, and the quality of the output - does the model ramble on, or get to the point?
- Complex error behaviors. LLMs can display subtle forms of bias, such as changing the tone of a response for a male vs female user, or making subtle changes in response verbiage depending on the perceived race of the customer.
- Inconsistency in behavior over time. This concept already existed in MLOps - commonly called Drift. The idea is that over time, for example as questions change in ways that the training set did not anticipate, or datasets get updated, a model’s behavior can change in ways not intended by its creators. In LLMs, such drifts can become more complex. For example, the same question can yield different answers - and that is normal for an LLM. However, if the patterns of these answers tend to change over time, that can become a problem for products and cause user confusion or displeasure.
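The drift idea above can be sketched with a crude proxy: compare the distribution of a recent window of answers against a baseline window. This is a deliberate simplification - the bag-of-words "embedding" below stands in for a proper embedding model, and the example strings are invented for illustration:

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Crude bag-of-words vector; real systems would use an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(texts):
    """Sum the word counts across a window of answers."""
    total = Counter()
    for t in texts:
        total.update(bow_vector(t))
    return total

def answer_drift(baseline_answers, recent_answers) -> float:
    """0.0 = identical answer distributions, 1.0 = completely disjoint."""
    return 1.0 - cosine(centroid(baseline_answers), centroid(recent_answers))

baseline = ["the loan is approved", "your loan is approved pending review"]
recent = ["we cannot disclose loan decisions", "decision unavailable at this time"]
print(f"drift score: {answer_drift(baseline, recent):.2f}")
```

A monitoring system would track this score over time and alert when it crosses a threshold, flagging that answer patterns have shifted even though no individual answer is "wrong."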
Is this just about ChatGPT (and friends)?
No. ChatGPT (and competitors like Bard) may be the most publicly known LLMs, but they are by no means the only ones. Businesses are already creating custom LLMs with specialized domain-specific knowledge, such as Bloomberg GPT. Many of these models are also available for download online. For example, a casual search of Hugging Face's model hub (one of the largest model repositories in the world) for transformer-based models in finance reveals 95 models already available for download. Several of these have been downloaded more than a million times in the last month alone. It is clear that LLMs are a rapidly growing practical segment of the AI solutions landscape.
What can we leverage from MLOps and what do we need to add?
There is debate as to whether LLMOps is a new area or merely a subset of MLOps. My belief is that much of MLOps is still core to LLMOps. The areas of leverage include the methodologies for deployment and integration, the need for governance, and the collaboration required between IT teams, AI teams, and user-facing teams. The areas where new work will be needed are wherever the structure of large language models poses challenges not previously seen in ML, or even in large DL models. These challenges tend to be in:
- Scale - these models are enormous in size and cost large amounts of resources to train and tune. While deep learning and reinforcement learning models did previously operate at large scale, many machine learning models are comparably much smaller. Larger scale implies greater resource management challenges - such as managing GPU costs.
- Quality - we are still learning how to properly test, assess, and monitor these models' behaviors - and how to properly compare one model to another. The metrics are still evolving, and as such, model evaluations prior to deployment, A/B tests, and model monitoring all have LLM-specific challenges.
- Dataset management - the size and scale of these models introduce new elements into the dataset management stack - such as vector databases.
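To illustrate the vector database idea, here is a toy in-memory store that retrieves the documents whose embedding vectors are most similar to a query vector - the core operation that a real vector database provides at far greater scale. All names and vectors below are invented for illustration:

```python
import math

class TinyVectorStore:
    """In-memory stand-in for a vector database (e.g., storing document
    embeddings for retrieval); real deployments would use a dedicated store."""

    def __init__(self):
        self._items = []  # (id, vector) pairs

    def add(self, item_id: str, vector: list[float]) -> None:
        self._items.append((item_id, vector))

    def query(self, vector: list[float], k: int = 1) -> list[str]:
        """Return the ids of the k stored vectors most similar to the query."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self._items, key=lambda it: cos(it[1], vector), reverse=True)
        return [item_id for item_id, _ in ranked[:k]]

store = TinyVectorStore()
store.add("refund-policy", [0.9, 0.1, 0.0])
store.add("shipping-faq", [0.1, 0.9, 0.0])
store.add("privacy-terms", [0.0, 0.1, 0.9])
print(store.query([0.8, 0.2, 0.0], k=2))  # → ['refund-policy', 'shipping-faq']
```

In an LLM application, the stored vectors would be embeddings of an enterprise's documents, and the query vector would be the embedding of a user's question - letting the LLM ground its answer in the retrieved documents.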
For enterprises - how to get started
In my view, the easiest way to get started is to focus on applying what we already know from MLOps. As with MLOps, all indications are that starting with a practical use case, learning along the way, and iterating as fast as possible, is the way to keep up with the technology and keep ahead of the competition.
The second key item is to keep up with the evolving space of MLOps as it pertains to LLMs - such as LLM-specific techniques for guardrails, security, quality, and monitoring. Expect this space to change rapidly, even month to month. However, iteration is still the best way to learn and develop expertise.
Whether LLMOps becomes its own category or a subset of MLOps really does not matter in the long term. LLMs are here to stay, and the value they can add to most businesses is profound. What will matter is what your organization learns by using LLMs and how quickly it can learn it.