Scale AI to develop testing framework for Pentagon's LLMs

By Dwaipayan Roy

Feb 21, 2024

03:52 pm

What's the story

The Pentagon's Chief Digital and Artificial Intelligence Office (CDAO) has teamed up with Scale AI, a San Francisco-based company. Together, they will develop a reliable testing and evaluation (T&E) framework for large language models (LLMs). These LLMs could play a significant role in military planning and decision-making. The one-year contract aims to create a comprehensive T&E system for generative AI inside the Defense Department, ensuring its safe deployment by measuring model performance and providing real-time feedback for warfighters.

Goal

Addressing the complexities of generative AI testing

Generative AI, which includes LLMs that can produce text, images, software code, and other media based on human prompts, poses unique challenges for T&E processes. Unlike traditional systems with established safety standards, generative AI lacks universally accepted guidelines. To address these complexities, Scale AI will develop "holdout datasets" with the help of Department of Defense (DOD) insiders, who will write prompt-response pairs and put them through multiple layers of review.

Process

Iterative process to refine datasets and evaluate models

The T&E process for LLMs will be iterative: datasets relevant to the DOD's needs will be created and refined, and experts will then evaluate existing LLMs against them. Once holdout datasets are established, the evaluations can be used to develop model cards, short documents that describe the contexts a machine learning model is best suited for and how its performance is measured. This approach will help establish a baseline understanding of model performance, strengths, and limitations.
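To make the idea concrete, here is a minimal Python sketch of scoring a model against a holdout dataset and summarizing the result in a model-card-like record. It is an illustration only, not Scale AI's or the CDAO's framework: the names `HoldoutItem`, `ModelCard`, and `evaluate`, the exact-match scoring rule, and the sample prompts are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HoldoutItem:
    """One reviewed prompt-response pair (hypothetical structure)."""
    prompt: str
    expected: str
    domain: str


@dataclass
class ModelCard:
    """A very small stand-in for a model card (hypothetical fields)."""
    model_name: str
    n_items: int
    overall_accuracy: float
    accuracy_by_domain: Dict[str, float]


def evaluate(model_name: str,
             model: Callable[[str], str],
             holdout: List[HoldoutItem]) -> ModelCard:
    """Score a model on the holdout set using exact-match (an assumed metric)."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for item in holdout:
        totals[item.domain] = totals.get(item.domain, 0) + 1
        answer = model(item.prompt)
        # Case-insensitive exact match; real evaluations would likely be richer.
        if answer.strip().lower() == item.expected.strip().lower():
            correct[item.domain] = correct.get(item.domain, 0) + 1
    by_domain = {d: correct.get(d, 0) / n for d, n in totals.items()}
    overall = sum(correct.values()) / len(holdout) if holdout else 0.0
    return ModelCard(model_name, len(holdout), overall, by_domain)


# Toy data and a toy "model" (a lookup table) purely for demonstration.
HOLDOUT = [
    HoldoutItem("Unit of fuel volume used in the report?", "liters", "logistics"),
    HoldoutItem("How many trucks are in two convoys of five?", "10", "logistics"),
    HoldoutItem("Abbreviation for casualty evacuation?", "CASEVAC", "medical"),
    HoldoutItem("Abbreviation for medical evacuation?", "MEDEVAC", "medical"),
]
TOY_ANSWERS = {
    HOLDOUT[0].prompt: "Liters",
    HOLDOUT[1].prompt: "10",
    HOLDOUT[2].prompt: "casevac",
    HOLDOUT[3].prompt: "unknown",
}


def toy_model(prompt: str) -> str:
    return TOY_ANSWERS.get(prompt, "")


card = evaluate("toy-model", toy_model, HOLDOUT)
print(card)
```

Breaking accuracy down by domain is one simple way such a record could show where a model is strong and where it falls short.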

Aim

Automating model evaluation and feedback

The development process aims to automate as much of the evaluation as possible, allowing new models to be assessed quickly as they emerge. The goal is for models to signal CDAO officials when they stray from the domains they have been tested against. Scale AI's statement explains that this work will allow the DOD to mature its T&E policies for generative AI by "measuring and assessing quantitative data" through benchmarking and gathering qualitative feedback from users.
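How such a signal would be produced has not been made public. As a rough illustration, the Python sketch below flags a prompt as out of domain when too few of its words appear in the prompts a model was tested on. The vocabulary-overlap heuristic, the 0.5 threshold, and the function names are assumptions chosen for simplicity, not a description of the actual system.

```python
import re
from typing import Iterable, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_vocabulary(tested_prompts: Iterable[str]) -> Set[str]:
    """Collect every token seen in the prompts used during testing."""
    vocab: Set[str] = set()
    for prompt in tested_prompts:
        vocab |= tokenize(prompt)
    return vocab


def coverage(prompt: str, vocab: Set[str]) -> float:
    """Fraction of the prompt's distinct tokens that were seen during testing."""
    tokens = tokenize(prompt)
    if not tokens:
        return 0.0
    return len(tokens & vocab) / len(tokens)


def is_out_of_domain(prompt: str, vocab: Set[str], threshold: float = 0.5) -> bool:
    """Raise a signal when coverage falls below the (assumed) threshold."""
    return coverage(prompt, vocab) < threshold


# Hypothetical prompts standing in for the tested domain.
TESTED = ["estimate fuel needed for a convoy", "plan the convoy route"]
VOCAB = build_vocabulary(TESTED)

for prompt in ["estimate fuel for the convoy", "diagnose a chest injury"]:
    status = "OUT OF DOMAIN" if is_out_of_domain(prompt, VOCAB) else "in domain"
    print(f"{prompt!r}: {status}")
```

A production system would more plausibly rely on embeddings or a trained classifier; word overlap is used here only because it keeps the example self-contained.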

Partners

Collaboration with industry leaders

Scale AI has previously partnered with Microsoft, Meta, OpenAI, the US Army, the Defense Innovation Unit, General Motors, and NVIDIA. Alexandr Wang, Scale AI's CEO, said in a statement, "Testing and evaluating generative AI will help the DoD understand the strengths and limitations of the technology, so it can be deployed responsibly." The partnership aims to make AI systems more resilient and robust in classified environments, supporting the adoption of LLM technology "in secure settings."