Year
2024
Tech & Technique
PyTorch, MCTS, OpenAI Gym, Ray, Docker, CUDA
Description
AlphaZero-inspired self-play reinforcement learning system with Monte Carlo Tree Search (MCTS) and neural policy–value networks. Achieved an ~81% win rate against strong baseline engines and rule-based opponents, with 30% faster convergence through optimized exploration strategies.
Key Features:
- ♟️ AlphaZero-Style Self-Play RL: Trains an agent from scratch with zero human gameplay data using iterative self-play.
- 🌲 Neural-Guided MCTS: Combines Monte Carlo Tree Search with policy priors and value estimates for strong decision-making.
- 🧠 Policy–Value Network: Single PyTorch model predicts (move probabilities, win probability) from game states.
- 🔁 Replay Buffer Training Loop: Stores recent self-play games and samples mini-batches for stable learning.
- 🧩 Multi-Game Support: Unified game API enabling Tic-Tac-Toe and Connect Four with the same training pipeline.
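A minimal sketch of the unified game API described above, shown for Tic-Tac-Toe (the class layout and cell encoding are assumptions; Connect Four would implement the same interface):

```python
class TicTacToe:
    """Unified game interface: legal_moves(), apply_move(), is_terminal(), winner()."""

    def __init__(self):
        self.board = [0] * 9   # row-major 3x3; 0 = empty, 1 / -1 = players
        self.player = 1        # player to move

    def legal_moves(self):
        return [i for i, cell in enumerate(self.board) if cell == 0]

    def apply_move(self, move):
        self.board[move] = self.player
        self.player = -self.player  # switch side to move

    def winner(self):
        # All 8 winning lines of a 3x3 board.
        lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),
                 (0, 4, 8), (2, 4, 6)]
        for a, b, c in lines:
            if self.board[a] != 0 and self.board[a] == self.board[b] == self.board[c]:
                return self.board[a]
        return 0  # no winner (yet, or draw)

    def is_terminal(self):
        return self.winner() != 0 or not self.legal_moves()
```

Because MCTS and the training loop only touch this interface, adding a new deterministic board game means implementing these five members and nothing else.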
Architecture Overview:
- Game Environment Layer: Implements state, legal_moves(), apply_move(), is_terminal(), winner() for deterministic board games.
- MCTS Search Procedure: Uses PUCT-style selection with neural priors; outputs visit-count policy targets for learning.
- Self-Play Data Generation: Produces (state, MCTS-policy, outcome) triplets to train the network end-to-end.
- Policy + Value Optimization: Joint loss = policy cross-entropy + value MSE (+ optional regularization).
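The joint objective above can be sketched in PyTorch as follows (the function name and the wiring of the optional L2 term are assumptions; targets are MCTS visit-count distributions and game outcomes z in {-1, 0, +1}):

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value_pred, target_pi, target_z, l2=0.0, params=None):
    # Policy term: cross-entropy against soft MCTS visit-count targets.
    policy_loss = -(target_pi * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    # Value term: MSE between predicted value and the actual game outcome z.
    value_loss = F.mse_loss(value_pred.squeeze(-1), target_z)
    loss = policy_loss + value_loss
    # Optional L2 regularization over model parameters.
    if l2 > 0 and params is not None:
        loss = loss + l2 * sum(p.pow(2).sum() for p in params)
    return loss
```

Note the cross-entropy is computed against a distribution (visit counts), not a hard label, so `F.cross_entropy` with class indices would not apply directly.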
Technical Highlights:
- Implemented in PyTorch with checkpointing (model_*.pt) for continuous training and evaluation
- Data augmentation via board symmetries to improve sample efficiency and generalization
- Model evaluation pipeline: new checkpoints compete vs previous best/baseline to decide promotion
- Experiment tracking for hyperparameters, optimizers (optimizer_*.pth), and architecture variants
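The board-symmetry augmentation can be sketched as below for a 3x3 board (row-major indexing and the function name are assumptions); each position and its MCTS policy are expanded into all 8 dihedral symmetries:

```python
def symmetries(board, pi):
    """Return the 8 dihedral symmetries of a 3x3 position with matching policy targets."""
    def rot(x):   # 90-degree clockwise rotation: new[r][c] = old[2-c][r]
        return [x[6 - 3 * (i % 3) + i // 3] for i in range(9)]
    def flip(x):  # horizontal mirror: new[r][c] = old[r][2-c]
        return [x[3 * (i // 3) + (2 - i % 3)] for i in range(9)]
    out = []
    b, p = board, pi
    for _ in range(4):
        out.append((b, p))
        out.append((flip(b), flip(p)))
        b, p = rot(b), rot(p)
    return out
```

The policy vector is permuted with the same index map as the board, which is valid here because action index equals cell index; each self-play position thus yields 8 training samples.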
My Role
Independent Reinforcement Learning Engineer
Personal Research Project – Self-Play & Game AI; collaborated with an Amazon Android developer and a Urjanet SDE.
- ♟️ Designed and implemented an AlphaZero-inspired self-play reinforcement learning system from scratch.
- 🌲 Built a neural-guided Monte Carlo Tree Search (PUCT) integrated with a policy–value network.
- 🧠 Developed the full training loop including self-play generation, replay buffer management, and model updates.
- 🔁 Led model evaluation by benchmarking new checkpoints against previous best agents for promotion.
- 🧩 Extended the framework to support multiple games (Tic-Tac-Toe, Connect Four) via a unified environment API.
- ⚙️ Conducted hyperparameter tuning and architectural experiments to improve convergence and sample efficiency.
Self-Play Training Loop
Data generation via self-play feeding a replay buffer to train the policy–value network.
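A minimal sketch of the replay buffer at the center of this loop (class name and default capacity are assumptions); a bounded deque drops the oldest games as new self-play data arrives:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, mcts_policy, outcome) triplets from recent self-play games."""

    def __init__(self, capacity=100_000):
        self.data = deque(maxlen=capacity)  # oldest samples fall off automatically

    def add_game(self, triplets):
        self.data.extend(triplets)

    def sample(self, batch_size):
        # Uniform sampling over recent positions stabilizes training
        # versus learning from each game in order.
        return random.sample(self.data, min(batch_size, len(self.data)))
```

The bounded capacity keeps training data drawn from recent (stronger) play, while random mini-batch sampling breaks the temporal correlation within a single game.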
MCTS Search Visualization
Tree expansion and visit counts guided by neural priors and value estimates (PUCT).
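The PUCT selection rule driving this search can be sketched as follows (the child-statistics layout, `c_puct` default, and the +1 under the square root are assumptions): score(a) = Q(s,a) + c_puct * P(a) * sqrt(N(s)) / (1 + N(s,a)).

```python
import math

def puct_select(children, c_puct=1.5):
    """Pick the child index maximizing Q + U, balancing value against prior-guided exploration."""
    total_n = sum(ch["N"] for ch in children)

    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] else 0.0       # mean value of the child
        # +1 keeps the exploration term nonzero before the first visit (a common tweak).
        u = c_puct * ch["P"] * math.sqrt(total_n + 1) / (1 + ch["N"])
        return q + u

    return max(range(len(children)), key=lambda i: score(children[i]))
```

With no visits the neural prior P dominates, so the network steers early exploration; as visit counts grow, the empirical value Q takes over.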
Policy–Value Network
Single network predicting move probabilities and win likelihood from board states.
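A minimal PyTorch sketch of such a two-headed network (layer sizes and the fully-connected trunk are assumptions; the actual model may be convolutional):

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared trunk feeding a policy head (move logits) and a value head (tanh in [-1, 1])."""

    def __init__(self, board_size=9, n_actions=9, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(board_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)              # move logits
        self.value_head = nn.Sequential(nn.Linear(hidden, 1), nn.Tanh())  # win estimate

    def forward(self, x):
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h)
```

Sharing the trunk lets the policy and value targets regularize each other, which is part of why a single network suffices for both MCTS priors and leaf evaluation.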