From Game Agent to AI-Powered Security Remediation: Training a Deep Q-Network That Actually Works
How the same reinforcement learning pipeline that beats a Space Invaders clone tuned beyond human reflexes powers real-time vulnerability detection, CVE threat modeling, and automated security remediation at scale.
We built eyeBMinvaders — a Space Invaders variant engineered to be significantly harder than the original — and then trained a Deep Q-Network (DQN) from scratch to beat it. This post covers what we learned going from a model that couldn't survive ten seconds to one that clears Level 14 and pushes into Level 15, where enemy formations move at over 50x their original speed.
Watch the full training run: youtube.com/watch?v=BrrZVmKVyPg&feature=youtu.be
Why Games Are the Right Proving Ground for AI Security Research
The reinforcement learning (RL) pipeline we use for games shares its core architecture with the systems we deploy for production cybersecurity workloads. The fundamental loop is identical: observe a complex, adversarial environment; extract a meaningful state representation; make a decision under time pressure; learn from the outcome.
When we train AI models to detect pre-CVE exploit patterns in network traffic, predict VMware cluster failures before they cascade, or prioritize security remediation workflows across a fleet of Linux hosts, we are solving a structurally identical problem — with different state vectors and significantly higher stakes.
Games provide a controlled adversarial environment where we can iterate on architecture decisions — dueling networks, NoisyNet exploration, n-step returns, frame stacking — without waiting hours for infrastructure telemetry. Every lesson in agent robustness, threat prioritization, and real-time decision-making transfers directly into production AI security tooling. The fun is a side effect.
The Game: Designing a Genuinely Hard Adversarial Environment
eyeBMinvaders runs as a single 3,600-line JavaScript file — no frameworks, no build system, just raw canvas rendering and a neural network inference engine performing matrix multiplications in plain JS. We wanted to understand the system completely, top to bottom. In AI security work, that same discipline — no black boxes you can't explain to an auditor — is non-negotiable.
What makes the game hard:
Homing missiles follow sinusoidal trajectories that re-target the player above the wall line, then lock their approach angle for the final descent. You cannot sidestep them; they curve to follow you. This mirrors the behavior of advanced persistent threat (APT) actors who adapt lateral movement strategies based on defender responses.
Kamikaze enemies break formation and dive directly at the player while firing. Spawn rate accelerates as enemy count drops: every 6–11 seconds normally, every 4 seconds below 26 enemies, every 2.2 seconds below 11 (sketched in code after this list). The endgame of each level becomes a relentless sequence of simultaneous multi-vector attacks. In cybersecurity terms, this is the late-stage compromise scenario: an attacker who has already penetrated the perimeter and is escalating across multiple vectors simultaneously.
Dual boss enemies include a Monster that patrols above the formation and fires missiles, and a Monster2 with level-dependent movement patterns — spiral, zigzag, figure-8, bounce, wave, teleport, chase — that fires 3-way spread bullets. From Level 9 onward, Monster2 switches to random movement, making it genuinely unpredictable. This models the behavior of zero-day exploits: threats that cannot be characterized by prior signatures.
Compounding difficulty: enemy speed increases 33% per level and fire rate tightens continuously. That compounds multiplicatively, to roughly 1.33^13 ≈ 41x the base formation speed by Level 14 and about 54x by Level 15. The formation crosses the screen so fast that a human player has fractions of a second to react.
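The kamikaze spawn schedule above fits in a few lines. A minimal Python sketch, using the thresholds quoted in this post; the function name and the uniform draw within the 6–11 second band are assumptions:

```python
import random

def kamikaze_spawn_interval(enemies_alive: int) -> float:
    """Seconds until the next kamikaze breaks formation.

    Thresholds and intervals are the values quoted in the post; the
    uniform draw within the 6-11 s band is an assumption.
    """
    if enemies_alive < 11:
        return 2.2                        # endgame: relentless pressure
    if enemies_alive < 26:
        return 4.0                        # thinning formation: pressure ramps up
    return random.uniform(6.0, 11.0)      # normal phase
```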
Phase 1: Heuristic Baseline and State Representation
Before training any neural network, we built a rule-based AI agent. It performed trajectory forecasting — simulating bullet, missile, and kamikaze paths in 4ms time-steps to evaluate candidate positions. It checked eight strategic positions, scored them by danger level, and moved toward the safest one while lining up shots.
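A minimal sketch of that scoring loop, assuming straight-line threat motion and hypothetical hitbox constants; the real forecaster replays the game's actual curved missile and kamikaze trajectories:

```python
def safest_position(candidates, threats, horizon=1.0, dt=0.004):
    """Pick the least dangerous of the eight candidate x-positions.

    Threats are (x, y, vx, vy) tuples; straight-line motion here stands
    in for the real trajectory model. The hitbox width (30 px) and the
    player row (y > 520) are hypothetical constants for illustration.
    """
    scores = {c: 0.0 for c in candidates}
    for x, y, vx, vy in threats:
        t = 0.0
        while t < horizon:                 # simulate forward in 4 ms steps
            x += vx * dt
            y += vy * dt
            t += dt
            if y > 520:                    # forecast reaches the player's row
                for c in candidates:
                    if abs(x - c) < 30:
                        scores[c] += 1.0 / t   # sooner impacts weigh more
                break
    return min(scores, key=scores.get)
```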
The heuristic was surprisingly competent. It could reliably clear early levels and occasionally push into Level 7 or 8. But it had a hard ceiling: hand-crafted rules cannot handle the combinatorial explosion of simultaneous threats. When a homing missile, two kamikazes, and a Monster2 spread-shot converge simultaneously, no finite decision tree covers every case.
This is exactly the limitation of signature-based cybersecurity tools. Rule-based intrusion detection systems fail at the same inflection point: novel threat combinations that fall outside enumerated patterns.
The heuristic taught us something critical: it forced us to design a rigorous state representation. The features we chose — threat positions, velocities, danger heatmaps, time-to-impact estimates — became the foundation for the neural network's input. In AI-powered security remediation, this is where the real engineering happens: defining what the model needs to see, not what architecture to deploy.
Phase 2: Deep Q-Network Architecture for Adversarial Environments
We ported the full game engine to Python for headless training, faithfully replicating every mechanic so the agent would learn against the same physics it faces in the browser. Training against a simplified model introduces distribution shift — the agent learns to exploit the simulation, not the real environment. The same principle applies when training AI models on sanitized log data: the model will fail in production when it encounters real adversarial traffic.
The initial DQN: a 54-feature state vector fed through three fully connected layers (256–256–128) producing Q-values for six actions. Standard experience replay, epsilon-greedy exploration, periodic target network sync.
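As a sketch, that baseline network is only a few lines of PyTorch. The feature count, layer widths, and action count are from the description above; the ReLU activations and everything else here are assumptions:

```python
import torch
import torch.nn as nn

class BaselineDQN(nn.Module):
    """First-pass network: 54 state features in, Q-values for 6 actions out."""

    def __init__(self, n_features: int = 54, n_actions: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action at inference time (epsilon-greedy adds random actions in training):
# action = BaselineDQN()(state).argmax(dim=-1)
```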
This baseline was ineffective. The agent learned to fire — shooting enemies produces immediate positive reward — but could not dodge. The reward signal for survival was too diffuse. The agent couldn't connect "moving left three seconds ago" with "not dying now."
This is the delayed reward problem that makes AI-driven security remediation genuinely hard. The relationship between a configuration change made at 2am and a breach prevented six weeks later is not obvious. Reward shaping is required.

Phase 3: Reward Shaping, Dueling Architecture, and NoisyNet
The performance breakthrough came from three simultaneous changes.
Reward shaping for defensive behavior: We added explicit rewards for shooting down missiles (+2.0) and killing kamikazes (+1.5), steep penalties for losing lives (−5.0) and game over (−20.0), a penalty for destroying walls (−2.0), and a progressive survival bonus that scales with level. This gave the agent a clearer signal: defensive actions that prevent damage are worth more than offensive actions that score points.
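A minimal sketch of that reward function, using the weights above; the event-dictionary keys, the base weight for ordinary enemy kills, and the survival-bonus coefficient are hypothetical:

```python
def shaped_reward(events: dict, level: int) -> float:
    """Per-step reward with the shaping terms described in the post."""
    r = 0.0
    r += 1.0 * events.get("enemies_killed", 0)      # hypothetical base weight
    r += 2.0 * events.get("missiles_shot_down", 0)  # defense pays more
    r += 1.5 * events.get("kamikazes_killed", 0)
    r -= 5.0 * events.get("lives_lost", 0)
    r -= 2.0 * events.get("walls_destroyed", 0)     # don't shoot your own cover
    if events.get("game_over"):
        r -= 20.0
    else:
        r += 0.01 * level                           # progressive survival bonus
    return r
```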
In vulnerability management and security remediation, this maps to prioritizing CVE remediation by exploitability and blast radius rather than by CVSS score alone. The reward function defines what the AI optimizes for.
Dueling network architecture: We split the network into separate value and advantage streams — a shared feature extractor feeding a value head ("how good is this state regardless of action?") and an advantage head ("how much better is each action than average?"). This improves performance in states where one specific action is critical — dodging an incoming missile — versus states where any action is acceptable. For AI security models, this architecture is particularly effective when the model must distinguish between states requiring immediate automated remediation and states that are safe to queue.
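In PyTorch, the dueling head looks roughly like this; the stream widths are illustrative, and the shared feature extractor that produces `features` is omitted:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, feat_dim: int = 128, n_actions: int = 6):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.advantage = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)        # "how good is this state?"
        a = self.advantage(features)    # "how much better is each action?"
        # Subtracting the mean advantage keeps the V/A split identifiable.
        return v + a - a.mean(dim=-1, keepdim=True)
```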
NoisyNet exploration: We replaced epsilon-greedy with NoisyNet layers that learn state-dependent exploration. Instead of randomly taking bad actions 10% of the time, the network learns to explore more in unfamiliar states and exploit in familiar ones. This is critical for cybersecurity AI operating in non-stationary environments: the threat landscape shifts constantly, and an agent that explores intelligently adapts faster than one that explores randomly. Novel exploit patterns are precisely the states where exploration — not exploitation of known rules — is most valuable.
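For reference, here is a standard factorized-Gaussian NoisyLinear layer; this is the textbook formulation from Fortunato et al. (2017), not necessarily the exact variant in our network:

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer with learned, factorized Gaussian weight noise."""

    def __init__(self, in_f: int, out_f: int, sigma0: float = 0.5):
        super().__init__()
        self.in_f, self.out_f = in_f, out_f
        self.w_mu = nn.Parameter(torch.empty(out_f, in_f))
        self.w_sigma = nn.Parameter(torch.full((out_f, in_f), sigma0 / math.sqrt(in_f)))
        self.b_mu = nn.Parameter(torch.empty(out_f))
        self.b_sigma = nn.Parameter(torch.full((out_f,), sigma0 / math.sqrt(in_f)))
        bound = 1.0 / math.sqrt(in_f)
        nn.init.uniform_(self.w_mu, -bound, bound)
        nn.init.uniform_(self.b_mu, -bound, bound)

    @staticmethod
    def _f(x: torch.Tensor) -> torch.Tensor:
        return x.sign() * x.abs().sqrt()  # factorized noise transform

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:  # learned, per-forward weight noise drives exploration
            eps_in = self._f(torch.randn(self.in_f, device=x.device))
            eps_out = self._f(torch.randn(self.out_f, device=x.device))
            w = self.w_mu + self.w_sigma * eps_out.outer(eps_in)
            b = self.b_mu + self.b_sigma * eps_out
        else:              # evaluation uses the mean weights, no noise
            w, b = self.w_mu, self.b_mu
        return nn.functional.linear(x, w, b)
```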
We also implemented 5-step returns, a dual prioritized experience replay buffer separating high-impact transitions from routine ones, cosine annealing learning rate schedules, and frame stacking (4 frames) to give the agent a sense of motion and trajectory.
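The 5-step return itself is simple. A sketch, with an assumed discount factor of 0.99:

```python
def n_step_return(rewards: list, bootstrap_q: float,
                  gamma: float = 0.99, n: int = 5) -> float:
    """Discounted n-step return with a bootstrapped tail.

    `rewards` holds up to n rewards after state s_t; `bootstrap_q` is
    max_a Q_target(s_{t+n}, a), or 0 if the episode ended inside the
    window. gamma = 0.99 is an assumed discount factor.
    """
    k = min(len(rewards), n)
    g = sum((gamma ** i) * r for i, r in enumerate(rewards[:k]))
    return g + (gamma ** k) * bootstrap_q
```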

Phase 4: Training Infrastructure and Throughput
The Python game simulation was the throughput bottleneck. Even optimized, running full game physics in Python limited us to approximately 300–400 games per second. We prototyped a Rust-based game simulation to push throughput higher.
Training ran for hundreds of thousands of episodes. Agent progression:
- Episodes 0–10,000: learns to fire, occasionally moves, dies quickly.
- Episodes 10,000–50,000: learns basic dodging, starts clearing Level 3–4. Discovers that shooting down missiles is valuable.
- Episodes 50,000–150,000: develops real spatial awareness. Starts threading between simultaneous threats. Clears Level 6–7.
- Episodes 150,000–300,000: learns to prioritize kamikazes over regular enemies. Begins handling the Monster2 boss. Pushes into Level 9–10.
- Episodes 300,000+: refinement. The agent learns to use walls as cover, time shots to hit kamikazes during their dive arc, and pre-position before missile waves arrive.
The same infrastructure discipline drives production AI security model lifecycles: headless training, automated export, continuous deployment, behavioral observation, and rapid iteration.
Phase 5: Late-Game Behavior and Lessons for AI Security
Level 9 is a transition point. Monster2 switches to random movement, removing the last predictable threat pattern. From this point, everything is reactive — precisely the condition that defines post-perimeter breach response.
The agent's behavior in late levels does not resemble human play. It makes micro-adjustments — tiny lateral movements that thread between converging threats. It fires in short bursts while moving, sacrificing accuracy for positioning. When the kamikaze swarm phase begins, it shifts strategy entirely: prioritizing survival over offense, taking shots of opportunity rather than hunting targets.
By Level 12–13, the agent compensates for extreme formation speed by staying near center and relying on danger heatmap features to avoid dense threat zones rather than tracking individual projectiles. This is analogous to AI-driven security tools that operate at the traffic flow level — detecting anomalous patterns in aggregate telemetry — rather than trying to parse every individual packet.
Level 14 is the current reliable peak.

What This Means for Production AI Security and Remediation
State representation determines ceiling. The 54-feature vector — with danger heatmaps, threat urgency scores, lateral threat density, and fire-line clearance — matters more than any architecture choice. Feature engineering for CVE detection, anomaly detection, and exploit pattern recognition is where most of the real work in AI security happens.
Reward shaping defines optimization target. The jump from "agent that shoots but can't dodge" to "agent that survives" came entirely from better reward design. In vulnerability management, defining what constitutes a true positive at the right granularity — exploitable, reachable, unpatched, in a critical system — has the same outsized impact on model effectiveness.
Exploration strategy determines adaptability. NoisyNet's state-dependent exploration was critical because the environment changes dramatically across levels. Cybersecurity AI faces the same challenge: the threat distribution shifts with every new CVE publication, every attacker toolchain update, every change in infrastructure topology. Agents that explore intelligently in unfamiliar states adapt faster.
Full-stack observability closes the loop. Headless training, automated weight export to browser-compatible JSON, and live model reloading (the browser checks for updated weights every 30 seconds) let us close the experiment loop quickly. Production AI security pipelines require the same discipline: train, export, deploy, observe, iterate — with full visibility at every stage.
The game is a microcosm. The lessons — about state representation, reward design, architectural choices for adversarial environments, and exploration in non-stationary threat landscapes — transfer directly into the systems that actually detect exploits, automate security remediation, and protect production infrastructure.
Level 15 remains the frontier. So does the next zero-day.
The Stack
The game itself is just three files: index.html, game.js, and a splash screen. The neural network inference engine runs in the browser using hand-written matrix operations — no TensorFlow.js, no ONNX runtime, just loops and arrays. Training happens in Python with PyTorch, and the trained weights are exported to a JSON file that the browser loads at startup.
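The export step is correspondingly small. A minimal sketch, assuming a flat state-dict-to-JSON layout; the real key naming the browser engine expects may differ:

```python
import json
import torch

def export_weights(model: torch.nn.Module, path: str = "weights.json") -> None:
    """Dump every parameter tensor as nested plain lists for the
    hand-written browser inference engine to load."""
    state = {name: t.detach().cpu().tolist()
             for name, t in model.state_dict().items()}
    with open(path, "w") as f:
        json.dump(state, f)
```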
It's intentionally minimal. Sometimes the best way to understand what a neural network is actually doing is to implement every matrix multiplication yourself.
What's Next
The agent reaches Level 15 but can't reliably clear it. The 33% per-level speed compounding creates an exponential difficulty curve that may require curriculum learning, architectural changes (we're experimenting with GRU-augmented actor-critic models that maintain hidden state across decisions), or simply more training episodes.
But that's a problem for the next late-night training run. For now, watching a neural network learn to dodge homing missiles while clearing kamikaze swarms at 50x speed is satisfying enough — and the lessons carry straight back to the models that actually pay the bills.