Year
2024
Tech & Technique
PyTorch, MCTS, OpenAI Gym, Ray, Docker, CUDA
Description
AlphaZero-inspired self-play reinforcement learning system with Monte Carlo Tree Search (MCTS) and neural policy–value networks. Achieved an ~81% win rate against strong baseline engines and rule-based opponents, with 30% faster convergence through optimized exploration strategies.
Key Features:
- ♟️ AlphaZero-Style Self-Play RL: Trains an agent from scratch with zero human gameplay data using iterative self-play.
- 🌲 Neural-Guided MCTS: Combines Monte Carlo Tree Search with policy priors and value estimates for strong decision-making.
- 🧠 Policy–Value Network: Single PyTorch model predicts (move probabilities, win probability) from game states.
- 🔁 Replay Buffer Training Loop: Stores recent self-play games and samples mini-batches for stable learning.
- 🧩 Multi-Game Support: Unified game API enabling Tic-Tac-Toe and Connect Four with the same training pipeline.
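A minimal sketch of the unified game API described above, shown for Tic-Tac-Toe (the class layout and cell encoding are assumptions; Connect Four would implement the same interface):

```python
class TicTacToe:
    """Unified game interface: legal_moves(), apply_move(), is_terminal(), winner()."""

    def __init__(self):
        self.board = [0] * 9   # row-major 3x3; 0 = empty, 1 / -1 = players
        self.player = 1        # player to move

    def legal_moves(self):
        return [i for i, cell in enumerate(self.board) if cell == 0]

    def apply_move(self, move):
        self.board[move] = self.player
        self.player = -self.player  # switch side to move

    def winner(self):
        # All 8 winning lines of a 3x3 board.
        lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),
                 (0, 4, 8), (2, 4, 6)]
        for a, b, c in lines:
            if self.board[a] != 0 and self.board[a] == self.board[b] == self.board[c]:
                return self.board[a]
        return 0  # no winner (yet, or draw)

    def is_terminal(self):
        return self.winner() != 0 or not self.legal_moves()
```

Because MCTS and the training loop only touch this interface, adding a new deterministic board game means implementing these five members and nothing else.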
Architecture Overview:
- Game Environment Layer: Implements state, legal_moves(), apply_move(), is_terminal(), winner() for deterministic board games.
- MCTS Search Procedure: Uses PUCT-style selection with neural priors; outputs visit-count policy targets for learning.
- Self-Play Data Generation: Produces (state, MCTS-policy, outcome) triplets to train the network end-to-end.
- Policy + Value Optimization: Joint loss = policy cross-entropy + value MSE (+ optional regularization).
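The joint objective above can be sketched in PyTorch as follows (the function name and the wiring of the optional L2 term are assumptions; targets are MCTS visit-count distributions and game outcomes z in {-1, 0, +1}):

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value_pred, target_pi, target_z, l2=0.0, params=None):
    # Policy term: cross-entropy against soft MCTS visit-count targets.
    policy_loss = -(target_pi * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    # Value term: MSE between predicted value and the actual game outcome z.
    value_loss = F.mse_loss(value_pred.squeeze(-1), target_z)
    loss = policy_loss + value_loss
    # Optional L2 regularization over model parameters.
    if l2 > 0 and params is not None:
        loss = loss + l2 * sum(p.pow(2).sum() for p in params)
    return loss
```

Note the cross-entropy is computed against a distribution (visit counts), not a hard label, so `F.cross_entropy` with class indices would not apply directly.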
Technical Highlights:
- Implemented in PyTorch with checkpointing (model_*.pt) for continuous training and evaluation
- Data augmentation via board symmetries to improve sample efficiency and generalization
- Model evaluation pipeline: new checkpoints compete vs previous best/baseline to decide promotion
- Experiment tracking for hyperparameters, optimizers (optimizer_*.pth), and architecture variants
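The board-symmetry augmentation can be sketched as below for a 3x3 board (row-major indexing and the function name are assumptions); each position and its MCTS policy are expanded into all 8 dihedral symmetries:

```python
def symmetries(board, pi):
    """Return the 8 dihedral symmetries of a 3x3 position with matching policy targets."""
    def rot(x):   # 90-degree clockwise rotation: new[r][c] = old[2-c][r]
        return [x[6 - 3 * (i % 3) + i // 3] for i in range(9)]
    def flip(x):  # horizontal mirror: new[r][c] = old[r][2-c]
        return [x[3 * (i // 3) + (2 - i % 3)] for i in range(9)]
    out = []
    b, p = board, pi
    for _ in range(4):
        out.append((b, p))
        out.append((flip(b), flip(p)))
        b, p = rot(b), rot(p)
    return out
```

The policy vector is permuted with the same index map as the board, which is valid here because action index equals cell index; each self-play position thus yields 8 training samples.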
My Role
Independent Reinforcement Learning Engineer
Personal Research Project – Self-Play & Game AI; collaborated with an Amazon Android developer and a Urjanet SDE.
- ♟️ Designed and implemented an AlphaZero-inspired self-play reinforcement learning system from scratch.
- 🌲 Built a neural-guided Monte Carlo Tree Search (PUCT) integrated with a policy–value network.
- 🧠 Developed the full training loop including self-play generation, replay buffer management, and model updates.
- 🔁 Led model evaluation by benchmarking new checkpoints against previous best agents for promotion.
- 🧩 Extended the framework to support multiple games (Tic-Tac-Toe, Connect Four) via a unified environment API.
- ⚙️ Conducted hyperparameter tuning and architectural experiments to improve convergence and sample efficiency.
Self-Play Training Loop
Data generation via self-play feeding a replay buffer to train the policy–value network.
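A minimal sketch of the replay buffer at the center of this loop (class name and default capacity are assumptions); a bounded deque drops the oldest games as new self-play data arrives:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, mcts_policy, outcome) triplets from recent self-play games."""

    def __init__(self, capacity=100_000):
        self.data = deque(maxlen=capacity)  # oldest samples fall off automatically

    def add_game(self, triplets):
        self.data.extend(triplets)

    def sample(self, batch_size):
        # Uniform sampling over recent positions stabilizes training
        # versus learning from each game in order.
        return random.sample(self.data, min(batch_size, len(self.data)))
```

The bounded capacity keeps training data drawn from recent (stronger) play, while random mini-batch sampling breaks the temporal correlation within a single game.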
MCTS Search Visualization
Tree expansion and visit counts guided by neural priors and value estimates (PUCT).
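The PUCT selection rule driving this search can be sketched as follows (the child-statistics layout, `c_puct` default, and the +1 under the square root are assumptions): score(a) = Q(s,a) + c_puct * P(a) * sqrt(N(s)) / (1 + N(s,a)).

```python
import math

def puct_select(children, c_puct=1.5):
    """Pick the child index maximizing Q + U, balancing value against prior-guided exploration."""
    total_n = sum(ch["N"] for ch in children)

    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] else 0.0       # mean value of the child
        # +1 keeps the exploration term nonzero before the first visit (a common tweak).
        u = c_puct * ch["P"] * math.sqrt(total_n + 1) / (1 + ch["N"])
        return q + u

    return max(range(len(children)), key=lambda i: score(children[i]))
```

With no visits the neural prior P dominates, so the network steers early exploration; as visit counts grow, the empirical value Q takes over.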
Policy–Value Network
Single network predicting move probabilities and win likelihood from board states.
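A minimal PyTorch sketch of such a two-headed network (layer sizes and the fully-connected trunk are assumptions; the actual model may be convolutional):

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared trunk feeding a policy head (move logits) and a value head (tanh in [-1, 1])."""

    def __init__(self, board_size=9, n_actions=9, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(board_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)              # move logits
        self.value_head = nn.Sequential(nn.Linear(hidden, 1), nn.Tanh())  # win estimate

    def forward(self, x):
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h)
```

Sharing the trunk lets the policy and value targets regularize each other, which is part of why a single network suffices for both MCTS priors and leaf evaluation.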