Intro
Advancements in GenAI, including LLMs and multi-modal models, have created an evaluation challenge for organizations deploying these technologies into their ecosystems.
New foundation models are now so complex that standard accuracy and performance metrics no longer cut it. Datasets are unstructured, so quantifying concepts such as drift and quality requires more advanced statistical techniques to be meaningful. Furthermore, the notion of 'ground truth' is subjective in many dual-use scenarios (dual-use meaning the model, like ChatGPT, can be used for many different tasks).
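To make the drift point concrete, here is a minimal sketch of one way to quantify drift for unstructured text: embed a reference sample and a production sample, then compare the two distributions with a two-sample statistic. The embed_texts helper below is a deliberately crude placeholder (hashed character trigrams) that exists only to keep the sketch runnable; in practice you would swap in whatever embedding model your stack already uses, and the choice of test here is an assumption, not a recommendation.

```python
import numpy as np
from scipy.stats import ks_2samp

def embed_texts(texts, dim=64):
    """Placeholder embedder: hashed character-trigram counts.
    Swap in your real embedding model; this only keeps the sketch runnable."""
    vectors = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for j in range(len(text) - 2):
            vectors[i, hash(text[j:j + 3]) % dim] += 1.0
    return vectors

def embedding_drift(reference_texts, production_texts):
    """Two-sample KS test on each sample's distance to the reference centroid.
    A small p-value suggests production inputs no longer look like the reference set."""
    ref = embed_texts(reference_texts)
    prod = embed_texts(production_texts)
    centroid = ref.mean(axis=0)
    ref_dist = np.linalg.norm(ref - centroid, axis=1)
    prod_dist = np.linalg.norm(prod - centroid, axis=1)
    return ks_2samp(ref_dist, prod_dist)
```

Even a statistic like this only tells us that the inputs changed, not whether the change matters. That judgment comes from the Use Case, which is the point of the rest of this post.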
Luckily, advancements in methods for how we Evaluate and Align our AI systems have mostly kept pace. What Evaluation and Alignment really boil down to is quite simple: What do we want an AI system to do? Is it doing it in a way that isn't harming anyone? And, if not, how do we get it to do so?
More technically, AI Evaluation is the process of assessing the performance, accuracy, and reliability of AI Models. AI Alignment is the process of steering AI systems towards behaving in ways that are consistent with our Values, Goals, and Ethical standards.
Determining the best approach for the evaluation, and subsequent alignment, of AI systems requires knowledge that spans multiple disciplines.
A central part of Fairo's mission is to bridge the knowledge gaps associated with AI and AI Governance across the organization. This is especially important for Evaluating and Aligning AI. There are so many metrics, libraries, datasets, and frameworks available in the AI ecosystem that keeping track of which tool does what, and how to use it, is complex. Special attention is required.
A Holistic Approach
To best explain our approach to AI Evaluations at Fairo, let’s consider the following analogy.
Imagine you go to the doctor experiencing certain symptoms. The doctor is busy, overworked, and only has a few minutes before attending to the next patient. Most doctors in this situation only have time to zoom in, quickly examine the symptoms in isolation, and prescribe a remedy, a quick fix aimed at those specific symptoms, before moving on to the next case.
Those fortunate enough to afford more personalized medical services, including complementary and alternative medicine (which is not always covered by insurance), will benefit from a completely different approach, one that looks at the whole person.
Taking a whole-person approach requires going beyond just diagnostic testing. We need to know who the person is: how they sleep, how they eat, whether they are stressed, whether they drink, whether they take any medications, what their medical history looks like, and what their family history looks like.
Identifying the root cause, or causes, of certain ailments will help create a plan more likely to have long-term success, improve quality of life, and bring many other benefits.
To be clear, we are not by any means rejecting the need for invasive or targeted procedures or treatments; instead, we are considering them in the context of the bigger picture. Looking at the whole person, we can determine what is best for them and ensure we create lasting value.
When Evaluating AI systems, we must take a similar approach. Instead of just looking at individual metrics in isolation, we should look at the model within its ecosystem. This means looking beyond the model, and into the product, the use case, the intended user, and the impacted parties. We need to deeply understand how the AI-enabled tool or feature is used, and what different impacts might occur based on how the AI behaves.
A Model vs. a Use Case
We look at Evaluations and Alignment through the lens of an AI Use Case rather than through that of a model. Therefore, it's important to highlight the differences between the two, especially because most platforms are purely focused on models. While improving model performance is important, only so much can be done in isolation. Often, if you don't understand the Use Case, you can't improve the model.
A model can be used in many different ways. It can serve as a base model for additional fine-tuning, it can be put into a pipeline with other technologies (e.g. Retrieval Augmented Generation), it can stand alone with a simple system prompt, or, in very rare cases, it can be served raw with no additional wrapping or prompting.
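As a rough illustration of how the same model can sit inside very different Use Cases, the sketch below wraps one hypothetical generate() call three ways: raw, behind a simple system prompt, and inside a basic Retrieval Augmented Generation pipeline. The generate and retrieve helpers (and the support-assistant prompt) are stand-ins for whatever model API and document store you actually use, not a specific vendor's interface.

```python
def generate(prompt: str) -> str:
    """Stand-in for a call to your model provider; not a real API."""
    return f"<model output for: {prompt[:40]}...>"

def retrieve(query: str, k: int = 3) -> list[str]:
    """Stand-in for a document or vector-store lookup used by the RAG variant."""
    return [f"<document {i} relevant to '{query}'>" for i in range(k)]

def raw_model(user_input: str) -> str:
    # The model served raw: no wrapping or prompting at all.
    return generate(user_input)

def system_prompted(user_input: str) -> str:
    # The same model behind a simple system prompt that frames its role.
    system = "You are a support assistant for our billing product."
    return generate(f"{system}\n\nUser: {user_input}")

def rag_pipeline(user_input: str) -> str:
    # The same model inside a Retrieval Augmented Generation pipeline.
    context = "\n".join(retrieve(user_input))
    return generate(f"Answer using only this context:\n{context}\n\nQuestion: {user_input}")
```

Each wrapper calls the identical model, yet the questions we would ask when Evaluating them, and the harms we would watch for, differ in each case.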
In all of these instances, even the ones where the model is left totally untouched, we use the term 'AI Use Case' to describe the use of a model within an organization. More specifically, an AI Use Case generally corresponds to a specific challenge or opportunity that your organization can address with AI. Broad examples include fraud detection, document management, and personalized marketing.
As with non-AI projects, it's important to spend time framing the problem correctly. Use Cases can be broad or narrow, but how you define and implement them is entirely dependent on your AI Strategy.
While a broadly defined AI Use Case may be the right solution in certain instances, challenges and risks can arise if we are not specific enough when defining the problem or opportunity we are addressing with AI. Biases, risks, and limitations can be associated with a model, but they will be accentuated at the Use Case level if we are not careful.
Organizations that consistently define broad AI use cases, hoping to rely on large (and expensive) foundation models to solve a variety of loosely defined problems, may find that their AI strategy is not returning the desired ROI.
This is because when an AI Use Case is poorly understood, broadly defined, or not defined at all, we are unable to bring meaning to our model Evaluation and Alignment procedures.
Of course, we can look at the model type (e.g. LLM, tabular, image), look at the training data, look at the production data, and run a set of pre-determined metrics for that type of problem. We can check systems for performance and drift, and alert you when they fall outside of some conventionally accepted boundaries.
But so what? If we don't have a well-framed AI Use Case and an understanding of the target ecosystem (including product and feature requirements), most of these statistics will be meaningless to us.
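For illustration, the kind of generic, context-free monitoring described above might look like the sketch below. Every metric name and threshold in it is an arbitrary assumption, a 'conventionally accepted' default rather than something derived from the product, the users, or the impacted parties, which is exactly why its output is hard to act on.

```python
# Use-case-agnostic defaults: none of these bounds come from the actual product or users.
DEFAULT_BOUNDS = {
    "accuracy": (0.90, 1.00),
    "drift_statistic": (0.00, 0.10),
    "latency_p95_seconds": (0.0, 2.0),
}

def check_metrics(metrics: dict[str, float]) -> list[str]:
    """Flag any metric that falls outside its default bounds."""
    alerts = []
    for name, value in metrics.items():
        low, high = DEFAULT_BOUNDS.get(name, (float("-inf"), float("inf")))
        if not low <= value <= high:
            alerts.append(f"{name}={value} outside [{low}, {high}]")
    return alerts

# Without a well-framed Use Case, neither an alert nor an empty list tells us much.
print(check_metrics({"accuracy": 0.87, "drift_statistic": 0.04}))
```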
Tying it Together
A central part of Fairo's mission is to bring meaning to the technical components of AI Governance, especially AI Evaluations and Alignment. There are so many tools, datasets, and frameworks available today that keeping track of which tool does what, and how to use it, is almost impossible without a dedicated team. Even more complex is the task of ensuring the right tools are selected, configured correctly for your problem, and able to work within your ecosystem.
Fairo's platform is ready to be your dedicated AI Consultant in this area. We will ensure you're adopting the best tools, frameworks, procedures, and technologies for your AI Use Case, streamlining your adoption of successful AI technology.
Next Steps
This post is the first in a series on AI Evaluation and Alignment. Here, we introduced our foundational guiding principles and discussed how AI needs well-framed and well-defined Use Cases in order to be Evaluated and Aligned effectively.
In our upcoming posts, we will dive into specific tutorials based on actual AI Use Cases that you can follow along with and adapt to your environment.
If you would like to learn more or have any questions, schedule some time to chat with us.