AWS previews "Model Evaluation on Bedrock" to streamline AI development

2023-11-30

At the AWS re:Invent conference, Swami Sivasubramanian, Vice President of Databases, Analytics, and Machine Learning at AWS, announced Model Evaluation on Bedrock, now available in preview. The tool is meant to help users evaluate the models available through Amazon Bedrock before committing to one. Without a transparent way to test models, developers risk building a project on a model that is not accurate enough, or on one that is far larger than the use case requires. "Model selection and evaluation should not only be done at the beginning but should be repeated regularly," Sivasubramanian said. He also stressed the importance of keeping humans in the loop, saying the tool offers a convenient way to manage human evaluation workflows and model performance metrics.

Sivasubramanian has previously noted that some developers reach for a larger model simply because they assume a more powerful model will meet their needs, only to discover later that a smaller model would have sufficed.

Model Evaluation consists of two parts: automated evaluation and human evaluation. For automated evaluation, developers open the Bedrock console and select a model to test. They can then measure its performance on metrics such as robustness, accuracy, or toxicity across tasks such as summarization, text classification, question answering, and text generation. Bedrock includes popular third-party AI models such as Meta's Llama 2, Anthropic's Claude 2, and Stability AI's Stable Diffusion. AWS provides test datasets, but customers can also import their own data into the benchmarking platform to get a clearer picture of how a model performs; the system then generates a report.

For human evaluation, users can work with AWS's evaluation team or with their own. Customers specify the task type (for example, summarization or text generation), the evaluation metrics, and the dataset they want to use. AWS provides customized pricing and timelines for customers who work with its evaluation team.

Vasi Philomin, Vice President of AI at AWS, said in an interview that a better understanding of model performance can guide development. It also lets companies check whether a model meets certain responsible AI standards, such as an appropriate sensitivity to toxicity. "It is important for our customers to understand which model is best suited for them, and we are providing a better evaluation method," Philomin said. Sivasubramanian also pointed out that human evaluators can pick up on qualities automated systems cannot, such as empathy or friendliness.

Philomin said AWS does not require every customer to benchmark their models, since some developers have already used the foundation models on Bedrock or already know what a model can do for them. Companies still exploring which model to adopt, however, may benefit from the benchmarking process. AWS said that while the benchmarking service is in preview, customers will be charged only for the model inference used during evaluation.

Although there are no formal standards for benchmarking AI models, certain industry-accepted metrics exist. Philomin said the goal of benchmarking on Bedrock is not to evaluate models exhaustively but to give companies a way to measure a model's impact on their projects.
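
For developers who would rather script the automated workflow described above than click through the console, Bedrock's control-plane API can start an evaluation job. The following is a minimal sketch assuming the boto3 `bedrock` client's `create_evaluation_job` operation; the role ARN, S3 URIs, metric identifiers, and model ID are illustrative placeholders, and exact field names may differ in the preview release.

```python
import boto3

# Control-plane client for Amazon Bedrock (not the bedrock-runtime client).
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Hypothetical job configuration: the role ARN, S3 locations, metric names,
# and model identifier below are placeholders for illustration only.
response = bedrock.create_evaluation_job(
    jobName="summarization-eval-demo",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "Summarization",
                    "dataset": {
                        "name": "CustomSummarizationSet",
                        "datasetLocation": {"s3Uri": "s3://my-eval-bucket/prompts.jsonl"},
                    },
                    # Built-in metrics roughly matching those named in the announcement.
                    "metricNames": [
                        "Builtin.Accuracy",
                        "Builtin.Robustness",
                        "Builtin.Toxicity",
                    ],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [{"bedrockModel": {"modelIdentifier": "anthropic.claude-v2"}}]
    },
    outputDataConfig={"s3Uri": "s3://my-eval-bucket/results/"},
)

print("Started evaluation job:", response["jobArn"])
```

Under these assumptions, the report described above would land in the configured S3 output location once the job finishes, and the inference calls made during the run would be what the customer is billed for while the service remains in preview.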