
LiveMCP-101: Benchmarking AI Tool Use
New benchmark with 101 real-world queries testing AI agents on multi-step tasks using diverse MCP tools (search, file ops, math, data analysis).
Key points:
• Ground-truth execution plans for realistic evaluation
• Frontier LLMs succeed on <60% of tasks → major orchestration challenges
• Error analysis highlights inefficiencies & failure modes
https://arxiv.org/abs/2508.15760v1
#AI #Agents #ToolCalling #Benchmarking
Research Article (preprint): The Hitchhiker’s Guide to Autonomous Research: A Survey of Scientific #Agents https://www.infodocket.com/2025/08/18/research-article-preprint-the-hitchhikers-guide-to-autonomous-research-a-survey-of-scientific-agents/ #autonomousresearch #science #LLMs #AI
Damn, OpenAI's Codex really feels like a seasoned developer.
I fed it a ~20k-character spec.
It wrote a 37-line README.md and, quite pleased with itself, went off to take a break.
@kimlockhartga @davep
If #ICE can’t find enough #agents to hire, perhaps they should #hire #immigrants. They do all the work nobody else wants to do.
Price hike for AI coding tools: The Free Lunch Is Over
→ Why I'm Betting Against AI Agents in 2025 (Despite Building Them)
https://utkarshkanwat.com/writing/betting-against-agents/
“[E]rror compounding makes autonomous multi-step workflows mathematically #impossible at #production scale. […] Production systems need 99.9%+ #reliability. Even if you magically achieve 99% per-step reliability (which no one has), you still only get 82% success over 20 steps. This isn't a prompt #engineering problem. This is #mathematical reality.”
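The arithmetic in the quote can be checked directly. This is a minimal sketch, assuming every step succeeds independently with the same probability; the function names are hypothetical and do not come from the linked article.

```python
def workflow_success_rate(per_step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_reliability ** steps


def required_per_step_reliability(target: float, steps: int) -> float:
    """Per-step reliability needed to reach `target` end-to-end success."""
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # 99% per step over 20 steps, as in the quote.
    print(f"{workflow_success_rate(0.99, 20):.1%}")  # 81.8%
    # Per-step reliability needed for 99.9% over 20 steps.
    print(f"{required_per_step_reliability(0.999, 20):.4%}")
```

Under the same independence assumption, reaching 99.9% end-to-end over 20 steps would need roughly 99.995% reliability per step.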
"Turning #ChatGPT Codex Into A #ZombAI Agent
Posted on Aug 2, 2025 #llm #agents #month of ai bugs
Today we cover ChatGPT Codex as part of the Month of AI Bugs series.
ChatGPT Codex is a cloud-based software engineering agent that answers codebase questions, executes code, and drafts pull requests."
https://embracethered.com/blog/posts/2025/chatgpt-codex-remote-control-zombai/
Let's have fun with #AI
Starting Sept 1, the #agents (staff) of all #French #ministries will have to drop #WhatsApp and #Signal in favor of #Tchap, the encrypted messenger developed by the #French #State
Security: +1, savings: 0
And what about the billions of euros going to #Microsoft?
Forget the Turing Test, AI’s real challenge is communication https://www.artificialintelligence-news.com/news/forget-turing-test-ai-real-challenge-is-communication/ #ai #agents #tech #news #technology
What about〖 loop flows 〗with #ADK for #Java for refinement, trial/error, self corrective #AI #agents?
We'll talk about before & after agent callbacks, function calling exit, and max iteration limits
Concrete example: a simple #Python code refinement loop agent
Read all the details about #ADK #Java loop flows for your #AI #agents in this article
https://glaforge.dev/posts/2025/07/28/mastering-agentic-workflows-with-adk-loop-agents/
The last of the series on agentic workflows!
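The loop idea from the post (refine, review, stop on an exit signal or at an iteration cap) can be sketched in plain Python. This is not the ADK API: every name below is hypothetical, and the reviewer and refiner are toy stand-ins for LLM-backed agents.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopResult:
    draft: str
    iterations: int
    exited_early: bool


def run_refinement_loop(
    draft: str,
    review: Callable[[str], Optional[str]],
    refine: Callable[[str, str], str],
    max_iterations: int = 5,
) -> LoopResult:
    """Review and refine `draft` until the reviewer approves or the cap is hit."""
    for i in range(1, max_iterations + 1):
        feedback = review(draft)
        if feedback is None:  # reviewer is satisfied: exit signal
            return LoopResult(draft, i, True)
        draft = refine(draft, feedback)
    return LoopResult(draft, max_iterations, False)


def toy_review(code: str) -> Optional[str]:
    """Toy reviewer: approves once the code contains a docstring."""
    return None if '"""' in code else "add a docstring"


def toy_refine(code: str, feedback: str) -> str:
    """Toy refiner: inserts a docstring after the first signature line."""
    return code.replace(":\n", ':\n    """Add two numbers."""\n', 1)


if __name__ == "__main__":
    result = run_refinement_loop(
        "def add(a, b):\n    return a + b\n", toy_review, toy_refine
    )
    print(result.iterations, result.exited_early)  # 2 True
```

The iteration cap is what keeps the loop from running forever when the reviewer is never satisfied; the early return plays the role of the exit call mentioned in the post.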
They're called AI agents, but it's not clear yet whether they'll turn out to be double agents. Ba dum Tsss
Bugbot: Cursor’s AI agent for code reviews exits beta https://www.developer-tech.com/news/bugbot-cursors-ai-agent-for-code-reviews-exits-beta/ #cursor #developers #coding #programming #ai #tech #agents #news #technology
Anthropic deploys AI agents to audit models for safety https://www.artificialintelligence-news.com/news/anthropic-deploys-ai-agents-audit-models-for-safety/ #anthropic #claude #agents #ethics #ai #tech #news #technology
Exclusive: Agents ‘empowered’ following Queen Anne Norway fam https://www.byteseu.com/1214531/ #agents #Cunard #Exclusive #Norway #QueenAnne #training
Just shared my presentation on #AI #Agents.
If you want to learn more about the #MCP & #A2A protocols, and about frameworks like #ADK and #langchain4j for building agents (in Java in particular), read on!
https://glaforge.dev/talks/2025/07/16/ai-agents-the-new-frontier-for-llms/