Latest Articles

Engineering Methods to Reduce Costs in DeepSeek-R1’s Large-Scale Reinforcement Learning for Reasoning

2025-01-26

DeepSeek-R1 uses large-scale reinforcement learning to enhance the reasoning capabilities of large language models without extensive manual labeling. Starting from a base model, it applies Group Relative Policy Optimization (GRPO), a technique that eliminates the need for a large critic network and thereby reduces computational overhead. A small supervised cold-start dataset ensures readability, and rule-based reward functions further curb costs. The system then collects high-quality training samples via rejection sampling, merges multiple objectives in a two-stage RL framework, and finally distills the resulting model into more compact variants. Together, these engineering methods deliver near state-of-the-art performance while substantially reducing resource consumption and training time.
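The critic-free idea behind GRPO can be illustrated in a few lines: instead of a learned value network, each sampled response's reward is normalized against the mean and standard deviation of its own group. The sketch below is illustrative only (function name and reward values are assumptions, not DeepSeek's implementation):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Normalize each reward against its group's mean and std,
    so no separate value network (critic) is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against uniform groups
    return [(r - mu) / sigma for r in rewards]

# Hypothetical rule-based rewards for 4 responses to one prompt
rewards = [1.0, 0.0, 0.5, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive positive advantages and are reinforced; below-average ones are discouraged, all without a critic's forward pass.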

Read More