Year
2024
Tech & Technique
PyTorch, TensorFlow, MCTS, OpenAI Gym, Ray, Docker, CUDA
Description
AlphaZero-inspired self-play reinforcement learning system with advanced Monte Carlo Tree Search (MCTS) and
neural policy-value networks. Achieved a roughly 81% win rate against strong baseline engines and rule-based opponents, with 30% faster convergence through optimized exploration strategies.
Key Features:
- ♟️ AlphaZero-Style Self-Play RL: Trains an agent from scratch with zero human gameplay data using iterative self-play.
- 🌲 Neural-Guided MCTS: Combines Monte Carlo Tree Search with policy priors and value estimates for strong decision-making.
- 🧠 Policy–Value Network: Single PyTorch model predicts (move probabilities, win probability) from game states; see the sketch after this list.
- 🔁 Replay Buffer Training Loop: Stores recent self-play games and samples mini-batches for stable learning.
- 🧩 Multi-Game Support: Unified game API enabling Tic-Tac-Toe and Connect Four with the same training pipeline.
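A minimal sketch of such a policy–value model, assuming a flattened board input and a fully connected trunk (layer sizes and names are illustrative, not the actual checkpointed architecture):

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared trunk with two heads: move logits and a scalar value in [-1, 1]."""

    def __init__(self, num_cells: int = 9, num_moves: int = 9, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(num_cells, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_moves)  # logits over move indices
        self.value_head = nn.Linear(hidden, 1)           # expected game outcome

    def forward(self, x: torch.Tensor):
        h = self.trunk(x)
        policy_logits = self.policy_head(h)
        value = torch.tanh(self.value_head(h)).squeeze(-1)  # squash to [-1, 1]
        return policy_logits, value
```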
Architecture Overview:
- Game Environment Layer: Implements state, legal_moves(), apply_move(), is_terminal(), winner() for deterministic board games (interface sketched below).
- MCTS Search Procedure: Uses PUCT-style selection with neural priors; outputs visit-count policy targets for learning (selection rule sketched below).
- Self-Play Data Generation: Produces (state, MCTS-policy, outcome) triplets to train the network end-to-end.
- Policy + Value Optimization: Joint loss = policy cross-entropy + value MSE (+ optional regularization); see the training-step sketch below.
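A hypothetical shape for the unified game-environment API above; exact signatures are assumptions based on the listed methods, and `Move` is a placeholder alias:

```python
from typing import Optional, Protocol, Sequence

Move = int  # e.g., a Tic-Tac-Toe cell index or a Connect Four column

class Game(Protocol):
    """Interface each supported board game implements (sketch).

    Implementations also expose a `state` accessor for the board
    representation that gets fed to the policy-value network.
    """

    def legal_moves(self) -> Sequence[Move]: ...

    def apply_move(self, move: Move) -> "Game":
        """Return the successor state; the games are deterministic."""
        ...

    def is_terminal(self) -> bool: ...

    def winner(self) -> Optional[int]:
        """+1 or -1 for the winning player, None for a draw or an unfinished game."""
        ...
```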
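PUCT-style selection scores each child by its mean value plus an exploration bonus proportional to the neural prior, Q(s,a) + c_puct · P(s,a) · √N(s) / (1 + N(s,a)). A sketch, where the node fields and the c_puct default are assumptions:

```python
import math

def puct_select(node, c_puct: float = 1.5):
    """Pick the move maximizing Q + c_puct * P * sqrt(N_parent) / (1 + N_child)."""
    parent_visits = sum(child.visit_count for child in node.children.values())
    best_move, best_score = None, -math.inf
    for move, child in node.children.items():
        # Mean value of the child's subtree; 0 for unvisited children.
        q = child.value_sum / child.visit_count if child.visit_count else 0.0
        # Exploration bonus weighted by the network's prior probability.
        u = c_puct * child.prior * math.sqrt(parent_visits + 1) / (1 + child.visit_count)
        if q + u > best_score:
            best_move, best_score = move, q + u
    return best_move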
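And a compact sketch of the joint optimization step on replay-buffer triplets; the `batch` layout and names are illustrative:

```python
import torch
import torch.nn.functional as F

def train_step(net, optimizer, batch):
    """One gradient update on (state, MCTS policy target, outcome) triplets."""
    states, pi_targets, z_outcomes = batch  # tensors sampled from the replay buffer
    policy_logits, values = net(states)
    # Policy loss: cross-entropy against the MCTS visit-count distribution.
    policy_loss = -(pi_targets * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()
    # Value loss: MSE against the final game outcome z in {-1, 0, +1}.
    value_loss = F.mse_loss(values, z_outcomes)
    loss = policy_loss + value_loss  # L2 regularization typically via optimizer weight_decay
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```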
Technical Highlights:
- Implemented in PyTorch with checkpointing (model_*.pt) for continuous training and evaluation
- Data augmentation via board symmetries to improve sample efficiency and generalization (sketched below)
- Model evaluation pipeline: new checkpoints compete against the previous best/baseline to decide promotion (sketched below)
- Experiment tracking for hyperparameters, optimizers (optimizer_*.pth), and architecture variants
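For the symmetry augmentation, a sketch for a square board using all eight dihedral transforms (valid for Tic-Tac-Toe; Connect Four would keep only the horizontal flip, since gravity breaks rotational symmetry). Assumes the policy target is stored as a board-shaped array:

```python
import numpy as np

def board_symmetries(board: np.ndarray, policy: np.ndarray):
    """Yield all 8 rotations/reflections of a square board and its policy grid."""
    for k in range(4):
        rot_b, rot_p = np.rot90(board, k), np.rot90(policy, k)
        yield rot_b, rot_p                        # rotation by k * 90 degrees
        yield np.fliplr(rot_b), np.fliplr(rot_p)  # rotation + horizontal flip
```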
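And a minimal sketch of the promotion gate: a candidate checkpoint plays a fixed match against the current best and is promoted only above a win-rate threshold. The `play_game` helper, the game count, and the 55% bar are assumptions:

```python
def should_promote(candidate, best, play_game,
                   num_games: int = 100, threshold: float = 0.55) -> bool:
    """Promote the candidate only if it beats the incumbent often enough."""
    wins = 0
    for i in range(num_games):
        # Alternate who moves first so neither agent gets a fixed first-move edge.
        if i % 2 == 0:
            wins += play_game(first=candidate, second=best) == 1    # +1: `first` won
        else:
            wins += play_game(first=best, second=candidate) == -1   # -1: `second` won
    return wins / num_games >= threshold
```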
My Role
Independent Reinforcement Learning Engineer
Personal Research Project – Self-Play & Game AI
- ♟️ Designed and implemented an AlphaZero-inspired self-play reinforcement learning system from scratch.
- 🌲 Built a neural-guided Monte Carlo Tree Search (PUCT) integrated with a policy–value network.
- 🧠 Developed the full training loop including self-play generation, replay buffer management, and model updates.
- 🔁 Led model evaluation by benchmarking new checkpoints against previous best agents for promotion.
- 🧩 Extended the framework to support multiple games (Tic-Tac-Toe, Connect Four) via a unified environment API.
- ⚙️ Conducted hyperparameter tuning and architectural experiments to improve convergence and sample efficiency.