In the rapidly evolving field of AI, staying ahead is not just a goal but a necessity. At Assembled, we are dedicated to leveraging the latest technology to enhance customer experience and optimize efficiency. Recently, we migrated all our customers from GPT-4 to the newer, faster GPT-4o (Omni) model within 24 hours of its launch. Achieving this required precise execution and a robust metrics and testing framework to evaluate model quality. Here's how we did it.
OpenAI's release of GPT-4o, a multimodal model capable of understanding text, images, and more, presented a significant opportunity. The model promised better performance and broader capabilities, making it a valuable upgrade for our services. However, migrating our entire customer base to a new model within a day carried substantial risk. It wasn't a simple model swap: we had to rework our integration processes and verify that the new model met our quality and reliability standards without disrupting our customers' operations across tens of thousands of support tickets.
At Assembled, we embrace change and actively seek to innovate. Our team is encouraged to challenge assumptions and test hypotheses, fostering an environment where new ideas can thrive. This culture is crucial for tackling large-scale projects like the GPT-4o migration. Instead of seeing the task as insurmountable, we viewed it as an opportunity to push our limits and demonstrate our innovative capabilities.
We managed the migration by running an A/B test in parallel with our existing models. With GPT-4 and GPT-4o running concurrently, we tracked a set of metrics for each request: the response text from the LLM, the documents used to augment the LLM's knowledge, the tokens used for queries and responses, and the latency to the first and final tokens. This gave us data on the quality and accuracy of each model's responses so we could make an informed decision, and it let us spot discrepancies or issues early, keeping disruption to our customers minimal during the migration.
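To make that concrete, here is a minimal sketch of how a side-by-side comparison like this can be wired up. The model names are real OpenAI identifiers, but the LLMClient interface and the shape of the metrics struct are hypothetical stand-ins for our internal instrumentation, and total request latency stands in for the separate first- and final-token timings we track.

```go
package migration

import (
	"context"
	"sync"
	"time"
)

// LLMClient is a hypothetical stand-in for the wrapper around the OpenAI API.
// Complete returns the response text and the total tokens consumed.
type LLMClient interface {
	Complete(ctx context.Context, model, prompt string) (response string, tokens int, err error)
}

// abResult captures the per-request metrics tracked for each model arm.
type abResult struct {
	Model    string
	Response string
	Tokens   int
	Latency  time.Duration
	Err      error
}

// runABTest sends the same support-ticket prompt to both models concurrently
// and records metrics for each arm so the two can be compared offline.
func runABTest(ctx context.Context, client LLMClient, prompt string) []abResult {
	models := []string{"gpt-4", "gpt-4o"}
	results := make([]abResult, len(models))

	var wg sync.WaitGroup
	for i, model := range models {
		wg.Add(1)
		go func(i int, model string) {
			defer wg.Done()
			start := time.Now()
			resp, tokens, err := client.Complete(ctx, model, prompt)
			results[i] = abResult{
				Model:    model,
				Response: resp,
				Tokens:   tokens,
				Latency:  time.Since(start), // final-token latency; first-token timing would need streaming
				Err:      err,
			}
		}(i, model)
	}
	wg.Wait()
	return results
}
```

Each arm runs in its own goroutine, so the comparison adds no extra latency to the customer-facing request beyond the slower of the two calls.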
We maintain a golden dataset of user tickets and human replies to benchmark updates. This dataset is specifically tailored to reflect the diverse range of queries our customers encounter, ensuring that any system changes are analyzed on a representative set of our customers' data. By consistently using this dataset, we can accurately measure the performance and quality of new models, ensuring that they meet our high standards before full deployment. This practice helps us maintain the reliability and effectiveness of our AI solutions, providing our customers with consistent and high-quality service. For our GPT-4o rollout, we ran the new model against the golden dataset, manually reviewed the quality of its responses, and found that GPT-4o performed significantly better on most benchmarks.
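The harness for this kind of benchmark is straightforward. The sketch below continues the package from the earlier example and reuses the hypothetical LLMClient interface; the golden-example fields are illustrative, not our actual schema. It runs a candidate model over every ticket in the dataset and pairs the model's reply with the human agent's reply so a reviewer can compare them side by side.

```go
// goldenExample pairs a customer ticket with the reply a human agent sent,
// which serves as the reference point when reviewing model output.
type goldenExample struct {
	TicketID   string
	Ticket     string
	HumanReply string
}

// reviewRow is one line of the manual-review sheet: the ticket ID, the human
// reply, and the candidate model's reply.
type reviewRow struct {
	TicketID   string
	HumanReply string
	ModelReply string
}

// benchmarkOnGoldenSet runs a candidate model over the golden dataset and
// returns rows for manual, side-by-side review.
func benchmarkOnGoldenSet(ctx context.Context, client LLMClient, model string, examples []goldenExample) ([]reviewRow, error) {
	rows := make([]reviewRow, 0, len(examples))
	for _, ex := range examples {
		reply, _, err := client.Complete(ctx, model, ex.Ticket)
		if err != nil {
			return nil, err
		}
		rows = append(rows, reviewRow{
			TicketID:   ex.TicketID,
			HumanReply: ex.HumanReply,
			ModelReply: reply,
		})
	}
	return rows, nil
}
```

The output is deliberately a review sheet rather than an automatic score: for the GPT-4o rollout, the question was whether a human reviewer judged the new model's replies at least as good as the existing ones on the same tickets.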
Analyzing the collected data consistently is crucial for maintaining the quality and performance of our AI models, so we automated as much of that analysis as possible, aggregating the A/B metrics per model so the two arms could be compared directly.
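As a sketch of what that automated comparison can look like, the snippet below (continuing the same hypothetical package, and assuming fmt and time are imported) folds the raw A/B results into per-model aggregates: request count, mean latency, tokens per request, and error count. The exact aggregates are illustrative rather than our production metrics pipeline.

```go
// modelStats aggregates the A/B metrics for a single model arm.
type modelStats struct {
	Count        int
	TotalTokens  int
	TotalLatency time.Duration
	Errors       int
}

// summarize folds raw per-request results into per-model aggregates so the
// GPT-4 and GPT-4o arms can be compared directly.
func summarize(results []abResult) map[string]modelStats {
	stats := make(map[string]modelStats)
	for _, r := range results {
		s := stats[r.Model]
		s.Count++
		s.TotalTokens += r.Tokens
		s.TotalLatency += r.Latency
		if r.Err != nil {
			s.Errors++
		}
		stats[r.Model] = s
	}
	return stats
}

// report prints one comparison line per model: request count, mean latency,
// tokens per request, and error count.
func report(stats map[string]modelStats) {
	for model, s := range stats {
		if s.Count == 0 {
			continue
		}
		fmt.Printf("%s: %d requests, mean latency %v, %.1f tokens/request, %d errors\n",
			model, s.Count,
			s.TotalLatency/time.Duration(s.Count),
			float64(s.TotalTokens)/float64(s.Count),
			s.Errors)
	}
}
```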
To ensure reliability, we have robust fallback mechanisms in place. If any issues were detected during the migration, customers were seamlessly switched back to GPT-4, ensuring uninterrupted service. These mechanisms are designed to provide an extra layer of protection during major transitions, minimizing the impact of potential issues on our customers. By prioritizing safety and resilience, we can confidently implement significant changes, knowing that we have safeguards in place to maintain service continuity and customer trust.
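In code, a fallback can be as simple as a wrapper around the model call. The sketch below (still in the same hypothetical package) tries GPT-4o first and falls back to GPT-4 if the request fails. In production, the decision to move a customer back would more likely be driven by configuration or a feature flag than a per-request retry; this is the per-request variant for illustration only.

```go
// completeWithFallback tries the new model first and transparently falls back
// to GPT-4 if the request fails, so the customer gets an answer either way.
// A per-customer rollback flag is elided here for brevity.
func completeWithFallback(ctx context.Context, client LLMClient, prompt string) (string, error) {
	resp, _, err := client.Complete(ctx, "gpt-4o", prompt)
	if err == nil {
		return resp, nil
	}
	// Fall back to the previous model rather than surfacing the error.
	resp, _, err = client.Complete(ctx, "gpt-4", prompt)
	if err != nil {
		return "", err
	}
	return resp, nil
}
```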
By combining a culture of experimentation, parallel A/B testing, automated analysis, and robust fallback mechanisms, we successfully migrated our entire customer base to GPT-4o in under 24 hours. As we continue to innovate, we're excited to keep delivering reliable, high-quality answers that enhance customer experience and drive support efficiency.