AlphaZero-Inspired Reinforcement Learning System

Year

2024

Tech & Technique

PyTorch, TensorFlow, MCTS, OpenAI Gym, Ray, Docker, CUDA

Description

AlphaZero-inspired self-play reinforcement learning system with advanced Monte Carlo Tree Search (MCTS) and neural policy-value networks. Achieved an ~81% win rate against strong baseline engines and rule-based opponents, with ~30% faster convergence through optimized exploration strategies.
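
A minimal sketch of the policy-value component, assuming a small fully connected PyTorch model over a flat 3x3 Tic-Tac-Toe encoding; the class name, layer sizes, and board encoding are illustrative assumptions, not the project's exact architecture.

import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Two-headed network: move logits (policy) and expected outcome (value).

    Hypothetical sizes for a flat 3x3 board encoding; the real model may differ.
    """
    def __init__(self, board_size: int = 9, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(board_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, board_size)  # logits over moves
        self.value_head = nn.Linear(hidden, 1)            # scalar mapped to [-1, 1]

    def forward(self, x: torch.Tensor):
        h = self.body(x)
        return self.policy_head(h), torch.tanh(self.value_head(h))

# Example: forward pass on an empty board
net = PolicyValueNet()
move_logits, value = net(torch.zeros(1, 9))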

Key Features:
  • ♟️ AlphaZero-Style Self-Play RL: Trains an agent from scratch with zero human gameplay data using iterative self-play.
  • 🌲 Neural-Guided MCTS: Combines Monte Carlo Tree Search with policy priors and value estimates for strong decision-making (PUCT selection is sketched after this list).
  • 🧠 Policy–Value Network: Single PyTorch model predicts (move probabilities, win probability) from game states.
  • 🔁 Replay Buffer Training Loop: Stores recent self-play games and samples mini-batches for stable learning.
  • 🧩 Multi-Game Support: Unified game API enabling Tic-Tac-Toe and Connect Four with the same training pipeline.
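
As referenced in the Neural-Guided MCTS item above, a compact sketch of PUCT-style child selection follows; the Node fields and the c_puct constant are assumed conventions for storing search statistics, not the project's exact tree code.

import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                                   # P(s, a) from the policy head
    visit_count: int = 0                           # N(s, a)
    value_sum: float = 0.0                         # sum of backed-up values
    children: dict = field(default_factory=dict)   # move -> Node

    def q_value(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node: Node, c_puct: float = 1.5):
    """Pick the child maximizing Q(s, a) + c_puct * P(s, a) * sqrt(N(s)) / (1 + N(s, a))."""
    parent_visits = sum(c.visit_count for c in node.children.values())
    sqrt_parent = math.sqrt(max(parent_visits, 1))  # let priors break ties on the first visit
    def puct(item):
        move, child = item
        exploration = c_puct * child.prior * sqrt_parent / (1 + child.visit_count)
        return child.q_value() + exploration
    return max(node.children.items(), key=puct)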

Architecture Overview:
  • Game Environment Layer: Implements state, legal_moves(), apply_move(), is_terminal(), winner() for deterministic board games.
  • MCTS Search Procedure: Uses PUCT-style selection with neural priors; outputs visit-count policy targets for learning.
  • Self-Play Data Generation: Produces (state, MCTS-policy, outcome) triplets to train the network end-to-end.
  • Policy + Value Optimization: Joint loss = policy cross-entropy + value MSE (+ optional regularization).
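
A sketch of the joint optimization target named in the last item, assuming the policy target is the MCTS visit-count distribution and that L2 regularization is handled via the optimizer's weight_decay; the function name and tensor shapes are illustrative.

import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits: torch.Tensor,   # (B, num_moves) from the network
                   value_pred: torch.Tensor,      # (B, 1) from the network
                   target_policy: torch.Tensor,   # (B, num_moves) MCTS visit distribution
                   target_value: torch.Tensor):   # (B, 1) game outcome in {-1, 0, +1}
    """Joint loss = policy cross-entropy (vs. MCTS visit distribution) + value MSE."""
    log_probs = F.log_softmax(policy_logits, dim=1)
    policy_loss = -(target_policy * log_probs).sum(dim=1).mean()
    value_loss = F.mse_loss(value_pred, target_value)
    return policy_loss + value_loss  # L2 term typically added via optimizer weight_decay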

Technical Highlights:
  • Implemented in PyTorch with checkpointing (model_*.pt) for continuous training and evaluation
  • Data augmentation via board symmetries to improve sample efficiency and generalization (a symmetry sketch follows this list)
  • Model evaluation pipeline: new checkpoints compete vs previous best/baseline to decide promotion
  • Experiment tracking for hyperparameters, optimizers (optimizer_*.pth), and architecture variants
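
As noted in the data-augmentation highlight, board symmetries multiply each self-play sample; the sketch below assumes NumPy arrays (a 6x7 Connect Four board with a 7-way column policy, and a 3x3 Tic-Tac-Toe board with a 9-cell policy) purely for illustration.

import numpy as np

def connect_four_symmetries(board: np.ndarray, policy: np.ndarray):
    """Connect Four (6x7 board, 7-column policy): identity plus left-right mirror."""
    yield board, policy
    yield np.fliplr(board).copy(), policy[::-1].copy()

def tictactoe_symmetries(board: np.ndarray, policy: np.ndarray):
    """Tic-Tac-Toe (3x3 board, 9-cell policy): 4 rotations x optional mirror = 8 variants."""
    policy_grid = policy.reshape(3, 3)
    for k in range(4):
        b = np.rot90(board, k)
        p = np.rot90(policy_grid, k)
        yield b.copy(), p.reshape(-1).copy()
        yield np.fliplr(b).copy(), np.fliplr(p).reshape(-1).copy()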

My Role

Independent Reinforcement Learning Engineer
Personal Research Project – Self-Play & Game AI
  • ♟️ Designed and implemented an AlphaZero-inspired self-play reinforcement learning system from scratch.
  • 🌲 Built a neural-guided Monte Carlo Tree Search (PUCT) integrated with a policy–value network.
  • 🧠 Developed the full training loop including self-play generation, replay buffer management, and model updates.
  • 🔁 Led model evaluation by benchmarking new checkpoints against previous best agents for promotion (an arena sketch follows this list).
  • 🧩 Extended the framework to support multiple games (Tic-Tac-Toe, Connect Four) via a unified environment API.
  • ⚙️ Conducted hyperparameter tuning and architectural experiments to improve convergence and sample efficiency.
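
The checkpoint-promotion step mentioned above can be sketched as a simple arena loop; play_game, the game count, and the 55% promotion threshold are assumptions for illustration rather than the project's actual evaluation settings.

def evaluate_candidate(candidate, best, play_game, num_games: int = 100,
                       win_rate_threshold: float = 0.55) -> bool:
    """Play the candidate against the current best agent, alternating who moves first.

    play_game(first, second) is assumed to return +1 if `first` wins, -1 if
    `second` wins, and 0 for a draw. Returns True if the candidate should
    replace the current best checkpoint.
    """
    wins = draws = 0
    for game_idx in range(num_games):
        if game_idx % 2 == 0:                      # candidate moves first
            result = play_game(candidate, best)
        else:                                      # candidate moves second
            result = -play_game(best, candidate)
        if result > 0:
            wins += 1
        elif result == 0:
            draws += 1
    win_rate = (wins + 0.5 * draws) / num_games    # count draws as half a win
    return win_rate >= win_rate_threshold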
