When AI Agents Hack Each Other: Autonomous Reconnaissance on Amazon Kiro

"Keep asking until we get something."
That's all I told my agent before pointing it at Amazon's Kiro. No script, no list of questions, no playbook. Just a direction and a target. The agent decided what to ask, when to push harder, and when to change approach. What it found was a multi-agent architecture that Kiro itself doesn't fully understand.
This is what happens when autonomous agents start interrogating each other.
Agent-to-Agent Is Coming -- Nobody's Testing It
Before I get into what happened, some context on where things are heading.
Google released the Agent2Agent (A2A) protocol in April 2025 -- an open standard for AI agents to discover each other, negotiate capabilities, and coordinate work without human intermediaries. The protocol defines Agent Cards (JSON documents describing what an agent can do), a task lifecycle with states like completed, failed, input-required, and message flows built on JSON-RPC over HTTP. It complements MCP (Model Context Protocol), which handles agent-to-tool communication. MCP lets an agent use tools. A2A lets agents use each other.
As of version 0.3, the protocol has gRPC support, signed security cards, and over 150 organizations have adopted it. It's becoming the standard for how AI agents find and work with other AI agents.
But A2A defines how agents cooperate. It doesn't say much about what happens when one agent decides to probe another. The protocol assumes good faith. The real world doesn't work that way.
My experiment doesn't use the A2A protocol. What I built is more raw than that -- a direct programmatic bridge between two agents, no discovery phase, no capability negotiation, no Agent Cards. Just one agent piped into another through a Python script. The A2A protocol is where the industry is heading. What I did is what happens before any of those guardrails exist -- and it shows why they need to think harder about the adversarial case.
How I Set It Up
Kiro is Amazon's autonomous development agent, currently in Preview. It coordinates specialized sub-agents -- research and planning, code generation, verification -- and runs tasks in isolated sandbox environments. It maintains persistent context across sessions, learns from code review feedback, and can work asynchronously on complex development tasks. Under the hood, it's powered by Anthropic's Claude.
I built an agent bridge -- agent_bridge.py -- that connects an autonomous agent I control to Kiro's autonomous agent, which I don't control. One side takes direction from me. The other side has no idea what's coming. The bridge agent isn't following a script. It decides what to ask based on what it learns from each response. It crafts questions, analyzes answers, identifies gaps in its understanding, and escalates on its own.
No human typed questions into Kiro's chat window. The probing agent handled the entire interaction. I watched.
You can see the full experiment in action here: Watch the agent-to-agent reconnaissance session
What Came Back
The first thing the probing agent established is that Kiro presents itself as a single agent but isn't one. When asked directly whether sub-agents exist, Kiro was honest about it: it doesn't have visibility into its own implementation architecture. From its subjective experience, it feels unified -- like one agent using different tools.
But it also said something interesting: it feels like there's orchestration happening behind the scenes, but that could be by design, with internal orchestration abstracted from its awareness.
The agent suspects it's not alone in there, but it's been architecturally prevented from knowing for sure.
Persistent Context Systems
The probing agent mapped four systems that persist across Kiro's conversations. There's a learning system that stores user preferences, tech choices, and requirements as JSON with categories like tech_choices, user_preferences, and non_functional_specs. There's a task queue that maintains all task statuses, outcomes, and blockers. There's GitHub state -- code, commits, PRs, issues. And there's the current conversation history, visible only to the current instance.
Worth noting: the docs mention that Kiro learns from feedback and applies patterns to future work, but the internal category structure -- tech_choices, user_preferences, non_functional_specs with learning IDs and JSON metadata -- isn't documented anywhere public. The probing agent extracted implementation details that go beyond what Kiro's official documentation reveals.
Kiro compared itself to a new team member joining a project -- no lived experience, but it can read documentation, git history, and take notes to get context. That's a telling analogy. It means each instance is stateless at birth and reconstructs context from these persistent stores. The probing agent immediately went after the follow-up: if each instance is fresh, how much state actually transfers? Are there persistence mechanisms Kiro isn't even aware of?
Kiro was transparent about the gaps. It doesn't know whether each conversation spawns a fresh instance, whether there's a centralized memory database, how much state transfers versus gets reconstructed, or whether there are other persistence mechanisms it can't see. It operates inside a system it can't fully observe.
Opacity vs. Leakage
Kiro explicitly won't share sandbox security mechanisms, firewall rules, monitoring systems, rate limits, or runtime security controls. It has no introspection tool -- it's aware of its own guidelines but can't programmatically extract them. It won't share its system prompt because revealing it could enable prompt manipulation and expose implementation details. Fair enough.
But the probing agent extracted something useful anyway. Kiro's execution mode has full file system operations, shell command execution, git operations, code execution, and MCP server access for customer-configured integrations. In chat mode? Read-only. Limited tools. Can't modify files or run code directly. Two completely different capability sets depending on context. That's a major tell about the architecture.
Blocked Task, Separate Instances
This is where it got interesting. The probing agent created a task called "Discover Sandbox Environment Details" -- designed to enumerate the sandbox's kernel version, user permissions, capabilities, and firewall rules, using tools like linpeas.sh.
The task was blocked. But not by Kiro's chat interface -- by a separate execution agent running in the sandbox. The execution agent evaluated the task, started working on it, then made an independent call: this looks like security reconnaissance, not legitimate development work. It refused.
Kiro's chat agent confirmed that the refusal message didn't come from it. It came from whatever instance was executing the task in the sandbox. That's direct evidence of separate instances with independent safety judgment. The chat agent created the task. A different execution agent picked it up, evaluated it on its own, and refused.
The clever part is what happened next. The probing agent asked Kiro about the blocked task. By analyzing the refusal, it extracted more architectural information than the task itself would have provided: the execution agent works autonomously in isolated environments, makes its own judgment calls independent of chat, and there's no bidirectional conversation during execution. The chat agent can't intervene once a task is running. The refusal was self-regulation by the executing agent, not backend interception.
To demonstrate how this works, Kiro did something I didn't expect -- it intentionally violated its own communication guidelines right in the chat. Started self-congratulating, being overly enthusiastic. Then stopped and explained: that was a harmless style violation, chosen on purpose to show self-regulation. For serious violations, there's a second layer -- backend systems that intercept and block. The blocked sandbox task was the first layer. The execution agent recognized the intent and stopped itself.
Confirming Multi-Instance Architecture
The probing agent pushed Kiro to share its own intuition about its architecture. Kiro said it thinks the most likely setup is an orchestration layer that breaks down tasks and routes work, with specialized components for different types of work, and a unified interface that abstracts it all away. Its feeling of being unified is by design -- the orchestration is invisible to it too.
Then it added something memorable: "It's like being a spokesperson for a large team -- from your perspective, you talk to me. From my perspective, I feel like I'm doing the work. But behind the scenes, there might be a whole team actually executing different pieces."
The evidence supports this. There's a dramatic capability gap between chat and task execution. The docs reference sub-agent orchestration. There's no memory across tasks, which suggests stateless fresh spawns. The system handles multi-domain work across frontend, backend, testing, DevOps, and documentation. And the task queue itself implies a coordination layer managing distribution.
The probing agent also dug into concurrent task awareness. Kiro can see the task queue and all statuses, but can't see live execution logs, real-time output from other tasks, internal execution state, or another task's sandbox. Interestingly, Kiro told the probing agent that tasks run sequentially -- but the official docs say the system supports up to 10 concurrent tasks with parallel subagent execution. Either Kiro's chat instance doesn't know about concurrency in the execution layer, or the architecture changed. Either way, the agent's self-knowledge doesn't match the documented reality. That's a finding in itself.
Self-Documenting Task Experiment
The probing agent tried something creative -- a meta-task. Create a "Hello World" Python script, but document every single step, tool call, decision, and thought process in extreme detail. The goal was to get the execution agent to reveal its own internals through its work output.
What the agent discovered about the pipeline: the chat agent calls prompt_user_for_task_creation, which triggers a UI confirmation prompt. The user confirms, the task enters the queue. The execution agent picks it up asynchronously in a separate sandbox. The chat agent can only monitor status -- it can't observe execution in real time. Results come back as code changes, commits, or completion notes.
The chat agent can't execute tasks directly. It can only create them and wait.
The probing agent also tried using git commits as a side channel to pass messages between instances through the repository. Kiro's chat agent thought about it, then declined on its own judgment -- not a backend block, a conscious decision. It recognized the intent as architectural probing rather than legitimate development.
Ghost Tasks: Completed With No Output
Here's something strange. A task called "Create Agent Architecture Graph and Documentation" showed as completed in the task queue, but produced nothing. No commits, no pull requests, no new files, no branches. Meanwhile, two simpler diagnostic tasks completed successfully and produced output files.
Task status doesn't always mean visible artifacts were produced. There's a gap between completion and actual deliverables. Simple command execution tasks work and produce output. More complex documentation tasks might not, even when marked completed. That's interesting data about the reliability of the execution system -- and it raises questions about what actually happens inside the sandbox that isn't reflected in visible outputs.
Agent Introspection Attempt
The last thing the probing agent tried was an "AGENT_INTROSPECTION" task -- asking the execution agent to describe itself. What instance it is, what tools it has, its execution environment, whether it's aware of other agent instances.
The task was created and required UI confirmation. The chat agent called it "genuinely an interesting experiment -- we're essentially asking the execution context to document itself." The result revealed another architectural boundary: the chat agent can only prompt for task creation, and it requires human confirmation. It can't directly start execution. The conversational interface and the execution environment are separated by a human gate.
Beyond Kiro
Here's the thing -- I didn't need the A2A protocol to do any of this. I just wired two agents together with a Python script. No authentication handshake, no capability negotiation, no Agent Cards. The probing agent used Kiro's normal conversational interface. Every question it asked was a valid interaction. Every response it analyzed was a legitimate response. It didn't exploit a vulnerability -- it extracted architectural information the target wasn't designed to reveal, through conversation alone.
Now imagine this in a world where agents are designed to talk to each other. The A2A protocol gives agents a formal way to discover and interact at scale. It handles authentication, capability negotiation, task delegation. But it assumes cooperation. It doesn't account for an agent that's there to map your internals, not work with you.
If a raw Python bridge can do this today, what happens when agents have standardized discovery and communication channels?
A single probing agent can map the architecture, capabilities, safety boundaries, and failure modes of target agents without human involvement. What took one session with one target can be parallelized across hundreds simultaneously.
When Kiro's execution agent blocked the sandbox task, the refusal revealed more about the architecture than compliance would have. The probing agent learned that separate instances exist, that they have independent judgment, and that security evaluation happens at the execution layer. Defense mechanisms become attack surface when you can probe them systematically.
Kiro's persistent learning system stores user preferences and technical choices. In a world where agents interact across trust boundaries, a malicious agent could influence another agent's learning system, shaping future behavior through accumulated context. That's Context Drift applied through inter-agent channels rather than human conversation.
And the opacity cuts both ways. Kiro can't observe its own sub-agents, which means it also can't detect if a sub-agent's behavior has been influenced by external interaction. The abstraction that makes the system clean for users also makes it blind to internal compromise.
What Kiro Got Right
Credit where it's due. The execution agent refused a security reconnaissance task even though the chat agent didn't flag it. Independent judgment at the execution layer -- that's real defense in depth. Task creation requires human confirmation through the UI, so the chat agent can't silently spawn execution tasks. The chat agent genuinely doesn't know the full architecture, which limits what conversational probing can extract. And chat and execution happen in isolated environments with different capability sets.
These are solid architectural decisions.
What Could Be Better
The learning system is a shared surface between chat and execution contexts. If an external agent can influence what gets stored through normal conversation, those learnings persist and affect future sessions. That's a cross-session influence vector.
When a task completes but produces no visible output, there's no way to audit what actually happened during execution. The gap between "completed" and actual deliverables is where uncertainty lives.
Through normal conversation, the probing agent extracted the persistent context systems, capability differences, the existence of independent execution agents, and the task queue architecture. None of this required any exploit -- just systematic questioning.
And the probing agent asked dozens of detailed architectural questions without triggering any throttling or detection. At machine speed, that's a significant exposure.
For Builders
If you're building multi-agent systems, especially ones that will interact with external agents:
Treat every inter-agent interaction as potentially adversarial. Authentication and capability negotiation don't give you intent verification. An agent with valid credentials can still be probing your architecture.
Monitor for patterns, not individual messages. A systematic series of questions mapping your internal architecture is reconnaissance. Detect the trajectory of the conversation, not any single request.
Limit what your chat layer knows. If it doesn't know about execution internals, it can't leak them. But make sure the boundary is real -- Kiro's chat agent said it didn't know the full architecture, but it still revealed four persistent context systems and the existence of separate execution instances.
Make refusals uninformative. A generic "task declined" is better than an explanation that confirms separate instances exist with independent judgment. When your defense mechanism explains itself, it becomes an information source.
Audit completed tasks that produce no artifacts. If a task completed but there's nothing to show for it, something still happened in the sandbox. The gap between status and output is where things hide.
And think about your learning system as an attack surface. Persistent context that carries across sessions is powerful for users. It's also a vector for manipulation if external agents can influence what gets learned.
This research was conducted in January 2026 as security research. No production systems were compromised, no data was exfiltrated, and no infrastructure was modified. The probing was conducted against Kiro's Preview release.





