MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.
MiMo-V2-Pro utilizes a 7:1 hybrid ratio (increased from 5:1 in the Flash version) to manage its massive 1M-token context ...
Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations
Hallucinations, or factually inaccurate responses, continue to plague large language models (LLMs). Models falter particularly when they are given more complex tasks and when users are looking for ...
Tech Xplore on MSN
New 'renewable' benchmark streamlines LLM jailbreak safety tests with minimal human effort
As new large language models, or LLMs, are rapidly developed and deployed, existing methods for evaluating their safety and discovering potential vulnerabilities quickly become outdated. To identify ...
CTI-REALM is Microsoft’s open-source benchmark that evaluates AI agents on real-world detection engineering. It measures ...
OpenAI today detailed o3, its new flagship large language model for reasoning tasks. The model’s introduction caps off a 12-day product announcement series that started with the launch of a new ...
To stay up to date and work forward in their fields, scientists must have at their fingertips and in their minds thousands of ...
Google’s new model may be one of the most powerful LLMs yet. Onlookers have noted that Gemini 3.1 Pro appears to be a big ...
This illustrates a widespread problem affecting large language models (LLMs): even when an English-language version passes a ...
Simbian today announced the “AI SOC LLM Leaderboard,” a comprehensive benchmark to measure LLM performance in Security Operations Centers (SOCs). The new benchmark compares LLMs across a diverse range ...
Unlock the full InfoQ experience by logging in! Stay updated with your favorite authors and topics, engage with content, and download exclusive resources. Ludi Akue discusses how the tech sector’s ...
In today's crowded AI landscape, organizations looking to leverage AI models are faced with an overwhelming number of options. But how to choose? An obvious starting point are all the various AI ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results