date 2023-08-03

It’s a question on the mind of every artificial intelligence startup founder and developer: Can I trust my large language model?

This deceptively simple question has become harder to answer: recent research shows that OpenAI’s GPT-4 and other LLMs can improve in some ways over time while getting worse in others.

For startups using these models, it can be especially difficult to evaluate their performance because OpenAI and other model providers share little information about how they trained and developed them. And previously chatty researchers at these companies are becoming tight-lipped at industry forums. To compensate, some LLM customers are pursuing a novel approach: using other LLMs to evaluate the models they rely on.