Building confidence with arbitrary prompt changes: Datasets
At Aldara, we solve HOA (homeowners' association) management at scale. One of our core services is customer support and task management for our clients, ranging from short-lived customer support tickets addressing owners' financial petitions to long-term projects like construction works. Currently, over 80% of tasks are generated by our AI based on customer support conversations, and more than 70% of task actions are done with the help of our LLM pipelines. This AI-driven approach is what enables us to scale our operations while providing high-quality service to our clients.
Integrating AI into the product is not enough; long-term value comes from being able to iterate on models and prompts quickly and with confidence to improve its recommendations. With the product and evaluation in place, the remaining challenge lies in curating datasets to evaluate new iterations against. For these datasets to be useful, they need to have several properties:
- Relevance: Data should accurately mirror real-world use cases.
- Completeness: Data should cover as many use cases as possible, to prevent overfitting to a single type of task.
- Determinism: Evaluation cases should include all the context required to run in a closed environment, so we need to capture every input without leaving context behind.
- Correctness: Datasets should represent the ideal behavior/response of the model as closely as possible, so curating responses into a set of ground truths is a must.
Human validation plays a big role here. All of our AI-suggested outputs and messages must be validated by a user before being executed. This human-in-the-loop flow as part of ordinary operations allows us to curate the datasets automatically. If the AI messes up, our operations team corrects (or throws away) the output, generating an end-to-end "test case" as a by-product.
In this post, we will cover how we collect data and how we build datasets to robustly evaluate three of our main AI pipelines:
- Task creation: Automatically creating tasks from customer support conversations. All the context from the HOA, the customer support conversation, and current tasks is sent to the pipeline as input to decide if a task should be created, returning a title and description when it is.
- Task action suggestion: Task actions are the multiple actions that can be taken to solve (or advance) a task. This pipeline proposes an action type to be taken inside a given task. The pipeline receives the task, customer and HOA context and returns one task action type from a predefined set.
- Task action content: Proposing the content of the action to be taken on a task (if the action type requires content). E.g., if the task action is to notify the owner, the content would be the message to be sent. This pipeline receives the same context as the task action suggestion pipeline, with the addition of the task action type that should be taken. It returns the content of the action, which can be either a description or message text.
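To make the inputs and outputs concrete, here is a minimal sketch of the three interfaces in Python. The field names, type names, and action values are illustrative assumptions, not our actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineContext:
    hoa_summary: str          # HOA information serialized for the prompt
    conversation: str         # customer support conversation transcript
    current_tasks: list[str]  # open tasks, used e.g. to avoid duplicates

@dataclass
class TaskCreationOutput:
    should_create: bool               # binary decision: create a task or not
    title: Optional[str] = None       # only filled in when a task is created
    description: Optional[str] = None

# Illustrative predefined set of task action types.
ACTION_TYPES = {"schedule_meeting", "notify_owner", "communicate_to_provider"}

@dataclass
class TaskActionSuggestion:
    action_type: str  # one of ACTION_TYPES (multi-class classification)

@dataclass
class TaskActionContent:
    action_type: str  # the action the content is written for
    content: str      # e.g. the message that would be sent to the owner
```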
Types of data
These pipelines involve multiple types of data, such as categorization and text generation. These are the types we'll go over:
Binary classification: For the task creation use case, our main goal is to make sure tasks are created when needed. Minimizing false negatives is crucial. The title or description of the task is less important here, as our back-office team can always change the title or description if needed before any customer reads them. That’s why we focus on the binary classification of whether a task should be created or not.
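Because a missed task is the costly error, a metric like recall on the "task needed" class captures this directly. A minimal sketch, assuming predictions and ground truths are stored as booleans:

```python
def recall_on_positive(predictions: list[bool], ground_truth: list[bool]) -> float:
    """Share of truly needed tasks that the AI actually created.

    A low value means many false negatives, i.e. missed tasks.
    """
    true_positives = sum(p and g for p, g in zip(predictions, ground_truth))
    actual_positives = sum(ground_truth)
    return true_positives / actual_positives if actual_positives else 1.0

# Example: three tasks were truly needed, the pipeline caught two of them.
print(recall_on_positive([True, False, True, True], [True, True, True, False]))  # 0.666...
```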
Multi-class classification: When suggesting task actions, we use multi-class classification to identify and recommend an appropriate action from a predefined set. For example, a task might require one of the following actions: “schedule meeting”, “notify owner”, or “communicate to provider”.
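A simple way to evaluate this is overall and per-class accuracy against the ground-truth action. A minimal sketch using the example action names above:

```python
from collections import Counter

def per_class_accuracy(predicted: list[str], expected: list[str]) -> dict[str, float]:
    """Accuracy computed separately for each expected action type."""
    correct, total = Counter(), Counter()
    for p, e in zip(predicted, expected):
        total[e] += 1
        correct[e] += int(p == e)
    return {action: correct[action] / total[action] for action in total}

predicted = ["notify owner", "notify owner", "schedule meeting"]
expected  = ["notify owner", "schedule meeting", "schedule meeting"]
print(per_class_accuracy(predicted, expected))
# {'notify owner': 1.0, 'schedule meeting': 0.5}
```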
Text generation: Task action content requires a text-generated dataset with human feedback. For example, in customer communications, we have to make sure that the text generated is contextually appropriate, informative, and complete. Our back-office team labels all our AI-generated text as accepted, corrected, or overridden. This data is essential to improve the quality of the generated text.
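We won't detail the text-quality scoring here, but the labels alone are already a useful signal. A hedged sketch that aggregates the accepted / corrected / overridden labels into simple rates (the record fields are assumptions):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class TextFeedback:
    generated: str  # what the AI proposed
    final: str      # what was actually sent after human review
    label: str      # "accepted", "corrected" or "overridden"

def label_rates(feedback: list[TextFeedback]) -> dict[str, float]:
    """Share of each label; a rising 'accepted' rate suggests better generations."""
    counts = Counter(item.label for item in feedback)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

examples = [
    TextFeedback("Hi, your invoice is ready.", "Hi, your invoice is ready.", "accepted"),
    TextFeedback("Meeting at 5.", "The HOA meeting is at 17:00.", "corrected"),
]
print(label_rates(examples))  # {'accepted': 0.5, 'corrected': 0.5}
```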
Data annotation: Uncovering the Truth in AI-Driven Tasks
When trying to refine AI pipelines, we face a critical challenge: How do we determine what the AI should ideally produce? This is where the concept of “ground truth” comes into play. In a world where AI’s decisions directly impact our operations, it’s important to establish a baseline of correctness as a standard against which we can measure and improve our AI’s performance. But how do we achieve this?
The journey to uncovering this ground truth begins with meticulous data annotation, a process where our back-office team steps in to provide the human validation necessary to guide our AI systems. The annotation process isn't one-size-fits-all; it varies from pipeline to pipeline. Let's take a closer look at how we tackle this for the three pipelines explained before:
- Task creation: This is the simplest case. When a task is created by the AI, we monitor its early life closely. If the task is cancelled within a few hours, we conclude that it was unnecessary. Otherwise, we consider that the AI made the right call and the task was needed.
- Task action suggestion: This flow is not so much about whether the AI suggested an action, but whether it suggests the right one. Here is where our back-office team comes into play. If they reject an AI-suggested action and create a new one, this new action becomes our ground truth. On the other hand, if the team accepts an AI suggestion and moves forward with it, the action is considered correct.
- Task action content: If our back-office team accepts the action and its content, we consider both the type and content correct. If the content is edited, we take the edited version as the ground truth.
By embedding this data annotation process into our existing workflows, we manage to constantly generate ready-to-use test cases without having anyone invest dedicated time into annotating the ground truths.
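Expressed as code, the three annotation rules could look roughly like this. The function names, the exact cancellation window, and the event fields are illustrative assumptions:

```python
from datetime import datetime, timedelta
from typing import Optional

CANCELLATION_WINDOW = timedelta(hours=6)  # assumed threshold; the rule is "a few hours"

def task_creation_label(created_at: datetime, cancelled_at: Optional[datetime]) -> bool:
    """Ground truth for task creation: True means the task was genuinely needed."""
    if cancelled_at is None:
        return True
    return cancelled_at - created_at > CANCELLATION_WINDOW

def action_ground_truth(ai_action: str, accepted: bool, replacement_action: Optional[str]) -> str:
    """Ground truth for task action suggestion."""
    if accepted:
        return ai_action        # the team moved forward with the AI suggestion
    return replacement_action   # the action the team created instead

def content_ground_truth(ai_content: str, edited_content: Optional[str]) -> str:
    """Ground truth for task action content: the edited version wins if it exists."""
    return edited_content if edited_content is not None else ai_content
```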
Data gathering: Preserving the Moment in Time
Our AI pipelines rely strongly on business context to make accurate decisions. When a customer support conversation is sent to our model to decide if a task should be created, we also send HOA information and current tasks, so the AI can make more informed decisions, such as avoiding the creation of duplicate tasks.
We can't simply query this data when we need it, as it could differ from the input originally sent to the AI. Unless we save the context when the call is made, we can't evaluate the AI on the same context that produced the output our back-office team accepted or edited. Since the context data changes over time, we create a snapshot at the moment of the AI call. This ensures reproducible results and consistent evaluation of AI performance over time.
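A minimal sketch of that snapshotting step, assuming a simple JSON file store (in reality this could be a database table or object storage; all names here are assumptions):

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

SNAPSHOT_DIR = Path("snapshots")  # stand-in for a real database or object store

def snapshot_ai_call(pipeline: str, context: dict, output: dict) -> str:
    """Persist the exact input context and AI output at the moment of the call.

    Later evaluations replay the stored context, so results stay reproducible
    even if HOA data or open tasks change afterwards.
    """
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    snapshot_id = str(uuid.uuid4())
    record = {
        "id": snapshot_id,
        "pipeline": pipeline,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "context": context,  # HOA information, conversation, current tasks, ...
        "output": output,    # what the AI returned for exactly this context
    }
    (SNAPSHOT_DIR / f"{snapshot_id}.json").write_text(json.dumps(record, ensure_ascii=False))
    return snapshot_id
```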
While this approach allows us to automatically capture all necessary data, there are some downsides. Any significant changes to the context format could require a time-consuming backfilling process or even the deprecation of the entire dataset gathered so far. For example, if we change the input format from JSON to YAML or add context like an HOA’s president’s phone number, the dataset would need to be modified accordingly. Adding new context that we can’t easily backfill, such as the in-progress tasks of the HOA at that moment, could imply the deprecation of the existing dataset and creation of a new one that incorporates the latest updates. Case by case, we have to evaluate the trade-offs between the time investment required to backfill the data and the impact of dataset deprecation.
Dataset Creation
Now that we know how we annotate and gather the data, let's dive into how we organize and curate it to build our datasets. We rely on two main types of datasets: bulk generalist datasets and cherry-picked datasets. Each type serves a distinct purpose in the training, evaluation, and fine-tuning of our AI models. Below is an overview of how each dataset is created, its typical size, and the pros and cons of each approach.
Bulk Generalist Dataset:
Generated by selecting data using minimal filtering, reflecting a broad spectrum of our daily operations. For example, in the case of task actions, we gather all the input data from our task action suggestion pipeline from the last two weeks across all our tasks. This approach captures a wide variety of cases, ensuring the dataset mirrors real-world scenarios. The data selection process is largely automated, often managed through ETL pipelines that continuously feed data into the dataset.
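A hedged sketch of that selection step, reusing the snapshot format sketched earlier and a two-week window (the window matches the example above; everything else is illustrative):

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def build_bulk_dataset(snapshot_dir: Path, pipeline: str, days: int = 14) -> list[dict]:
    """Collect all snapshots for one pipeline from the last `days` days, with minimal filtering."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    cases = []
    for path in snapshot_dir.glob("*.json"):
        record = json.loads(path.read_text())
        if record["pipeline"] != pipeline:
            continue
        if datetime.fromisoformat(record["captured_at"]) < cutoff:
            continue
        cases.append(record)
    return cases

# Example: build the task action suggestion bulk dataset from the last two weeks.
bulk_dataset = build_bulk_dataset(Path("snapshots"), pipeline="task_action_suggestion")
```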
Size of the dataset: The bulk dataset typically includes over 500 cases, ensuring that we have a substantial volume of data to work with. Otherwise, we’d get too much variance in the results of different test runs, given the non-deterministic nature of LLMs.
Pros:
- Environment similar to reality: Since the dataset includes a broad range of cases, it closely resembles the environment our AI operates in.
- High volume: The large size of the dataset provides a comprehensive view of AI performance.
- Low effort: Bulk datasets are typically created automatically through ETLs, reducing the manual effort involved.
- Reliable evaluations: The large number of cases minimizes variance, providing stable and representative evaluations of our AI pipelines. The figure illustrates a significant reduction in accuracy variance once the dataset size exceeds 500 samples, based on 10 test runs for each sample size (a simple simulation of the same effect is sketched right after this list).
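To get an intuition for why the variance settles around that size, you can rerun an evaluation several times per sample size and look at the spread. A small simulation sketch (the fixed "true accuracy" here is a stand-in, not a real pipeline):

```python
import random
import statistics

def one_evaluation_run(sample_size: int, true_accuracy: float = 0.85) -> float:
    """Stand-in for a single test run: accuracy measured on `sample_size` random cases."""
    hits = sum(random.random() < true_accuracy for _ in range(sample_size))
    return hits / sample_size

for sample_size in (50, 100, 250, 500, 1000):
    runs = [one_evaluation_run(sample_size) for _ in range(10)]  # 10 test runs per size
    print(sample_size, round(statistics.stdev(runs), 4))         # the spread shrinks as size grows
```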
Cons:
- Noisy or corrupted data: Because bulk datasets are created mostly by code, they can include low-quality data or incorrect annotations, which could skew results.
- Unbalanced distribution: Since the data reflects real-world distributions, some occurrences might be underrepresented, which could lead to blind spots in model performance.
To mitigate the unbalanced distribution and make sure no edge cases are left behind, we use cherry-picked datasets. The objective of the bulk dataset remains to keep the distribution as close to reality as possible, so the underrepresented cases are covered separately.
Cherry-Picked Dataset:
Cherry-picked datasets are carefully curated, often with manual selection, focusing on specific use cases or challenging scenarios. For instance, we might create a language-specific communications dataset by selecting cases where our AI had to generate text in a specific language. Another example could be a “hard mode” actions dataset, where we focus on tasks that have historically been difficult for the AI to handle. This manual selection process ensures that each case is highly relevant to the specific goal of the dataset.
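Even though the selection is ultimately manual, the first pass of candidates can be scripted. A hedged sketch of the two examples above; the language and historical-accuracy fields are assumptions, not part of our real schema:

```python
def language_specific_cases(cases: list[dict], language: str) -> list[dict]:
    """Candidate cases where the AI had to generate text in a given language."""
    return [case for case in cases if case["context"].get("language") == language]

def hard_mode_cases(cases: list[dict], max_past_accuracy: float = 0.5) -> list[dict]:
    """Candidate cases from task types the AI has historically struggled with."""
    return [case for case in cases if case.get("historical_accuracy", 1.0) <= max_past_accuracy]

# A human then reviews these candidates and keeps only the genuinely relevant ones.
```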
Size of the dataset: Cherry-picked datasets are typically smaller, containing around 50 to 100 cases due to the manual selection process. These datasets are designed for targeted analysis rather than broad evaluation.
Pros:
- Edge case analysis: These datasets are ideal for analyzing and improving the AI’s performance on specific, challenging cases.
- Identification of weaknesses: By focusing on examples where the AI struggled or where significant human intervention was required, these datasets help identify and address specific weaknesses in the model.
- High-quality data: The manual selection process ensures that the data is clean, well-annotated, and highly relevant to the task at hand.
Cons:
- Manual and time-consuming: Building cherry-picked datasets is a labor-intensive process, often requiring case-by-case selection.
- Potential for bias: The involvement of humans in selecting cases can introduce bias or subjectivity into the dataset. It’s not obvious what should go into the dataset and what shouldn’t.
- Smaller size: The limited number of cases increases variance during evaluation, which might lead to less stable results compared to bulk datasets.
By employing both bulk generalist and cherry-picked datasets, we achieve a balance between broad coverage and targeted focus. This dual approach allows us to continuously refine our AI models, addressing both common and rare challenges effectively.