#agents


LiveMCP-101: Benchmarking AI Tool Use
New benchmark with 101 real-world queries testing AI agents on multi-step tasks using diverse MCP tools (search, file ops, math, data analysis).

Key points:
• Ground-truth execution plans for realistic evaluation
• Frontier LLMs succeed <60% → major orchestration challenges
• Error analysis highlights inefficiencies & failure modes

arxiv.org/abs/2508.15760v1
#AI #Agents #ToolCalling #Benchmarking

arXiv.org · LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.

Damn, you can really feel the seasoned developer in OpenAI's Codex.

I fed it a spec of about 20k characters.

It wrote a 37-line README.md and went off to rest, pleased with itself.

#log #dev #LLM

→ Why I'm Betting Against AI Agents in 2025 (Despite Building Them)
utkarshkanwat.com/writing/bett

“[E]rror compounding makes autonomous multi-step workflows mathematically #impossible at #production scale. […] Production systems need 99.9%+ #reliability. Even if you magically achieve 99% per-step reliability (which no one has), you still only get 82% success over 20 steps. This isn't a prompt #engineering problem. This is #mathematical reality.”

Utkarsh Kanwat · Why I'm Betting Against AI Agents in 2025 (Despite Building Them)
I've built 12+ AI agent systems across development, DevOps, and data operations. Here's why the current hype around autonomous agents is mathematically impossible and what actually works in production.
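
The arithmetic in the quoted passage can be checked in a few lines. This is a minimal sketch that assumes every step succeeds or fails independently with the same probability, which the quote implies but does not state. The function names `chain_success` and `required_per_step` are made up for illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps
    that each succeed with probability `per_step`."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach `target` end-to-end success
    over `steps` independent steps (inverse of chain_success)."""
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # 99% per step over 20 steps, the case given in the quote.
    print(f"{chain_success(0.99, 20):.1%}")  # 81.8%, which the quote rounds to 82%

    # Per-step reliability needed for 99.9% end-to-end success over 20 steps.
    print(f"{required_per_step(0.999, 20):.4%}")
```

Under the same independence assumption, reaching 99.9% over 20 steps would take roughly 99.995% reliability per step, which is the gap the post is pointing at.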

📢 Developers, on October 16 at #StationF, Google is hosting AI Agents Live + Labs

A morning of talks and an afternoon of workshops

On the menu: #Gemini, #Imagen, #Veo, #ADK, #A2A, #MCP... Everything for your #agents #IA

cloudonair.withgoogle.com/even

cloudonair.withgoogle.com · AI Agents Live + Labs Paris
Deliver on the original promise of AI and increase its impact in your organization by putting it in the hands of every employee and into every workflow.

"Turning #ChatGPT Codex Into A #ZombAI Agent
Posted on Aug 2, 2025 · #llm #agents #month of ai bugs
Today we cover ChatGPT Codex as part of the Month of AI Bugs series.

ChatGPT Codex is a cloud-based software engineering agent that answers codebase questions, executes code, and drafts pull requests."

embracethered.com/blog/posts/2

Let's have fun with #AI

Embrace The Red · Turning ChatGPT Codex Into A ZombAI Agent

🔁 What about〖 loop flows 〗with #ADK for #Java for refinement, trial and error, and self-corrective #AI #agents?

We'll talk about ⏪ before & ⏩ after agent callbacks, function calling exit, and max iteration limits ♾

Concrete example: a simple #Python code refinement loop agent

👓 Read all the details about #ADK #Java loop flows for your #AI #agents in this article ⬇️

glaforge.dev/posts/2025/07/28/

The last of the series on agentic workflows! 🔚

glaforge.dev · Mastering agentic workflows with ADK: Loop agents
Tech blog of Guillaume Laforge, with articles on generative AI, LLMs, cloud computing, microservices architecture, serverless solutions, Java and Apache Groovy programming
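
The loop-flow ideas named in the post above (before and after callbacks, an exit signal, a max iteration limit) can be illustrated with a small generic sketch. It does not use ADK or reproduce the article's code: every name here (`run_refinement_loop`, `LoopResult`, `refine`, `is_good_enough`) is hypothetical, and the "agent" is a plain function standing in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopResult:
    output: str
    iterations: int
    exited_early: bool  # True if the exit check fired before the cap


def run_refinement_loop(
    draft: str,
    refine: Callable[[str], str],            # stand-in for the refining agent
    is_good_enough: Callable[[str], bool],   # stand-in for the exit signal
    max_iterations: int = 5,
    before: Optional[Callable[[int, str], None]] = None,
    after: Optional[Callable[[int, str], None]] = None,
) -> LoopResult:
    """Refine `draft` until the exit check passes or the cap is hit."""
    current = draft
    iterations = 0
    for i in range(1, max_iterations + 1):
        if before:
            before(i, current)   # runs before each refinement step
        current = refine(current)
        iterations = i
        if after:
            after(i, current)    # runs after each refinement step
        if is_good_enough(current):
            return LoopResult(current, iterations, True)
    # Cap reached without the exit check passing.
    return LoopResult(current, iterations, False)


if __name__ == "__main__":
    # Toy "code refinement": drop one leading TODO marker per pass.
    result = run_refinement_loop(
        "TODO TODO print('hi')",
        refine=lambda s: s.replace("TODO ", "", 1),
        is_good_enough=lambda s: "TODO" not in s,
        after=lambda i, s: print(f"pass {i}: {s}"),
    )
    print(result)
```

The iteration cap is what keeps a loop like this from running forever when the exit check never passes, so the result records whether it stopped early or hit the cap.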