Applying RL: Improving Code Merging
Published on Jul 3, 2025
While foundation models have continued to improve at coding, using them for high-specificity, low-complexity tasks like code merging can be overkill.
With that in mind, we saw an opportunity to use reinforcement learning to fine-tune a model (Qwen3-1.7B) for code merging. The result is a small model that’s better and faster than foundation models, while also able to run locally. You can use the model via an MCP server here!
Download Osmosis-Apply-1.7B here: Ollama | Hugging Face
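As a minimal sketch of local usage with Hugging Face transformers (the repo id and the merge prompt format below are assumptions on our part; check the model card for the exact prompt the model was trained on):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id and prompt format -- illustrative only.
MODEL_ID = "osmosis-ai/Osmosis-Apply-1.7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

original_code = "def add(a, b):\n    return a + b\n"
edit_snippet = "def add(a, b):\n    # handle floats explicitly\n    return float(a) + float(b)\n"

messages = [{
    "role": "user",
    "content": (
        "Merge the following edit into the original file.\n\n"
        f"<original>\n{original_code}</original>\n\n"
        f"<edit>\n{edit_snippet}</edit>"
    ),
}]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))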
After training, we tested the model (1xH100 running on SGLang) against OpenAI o3, Claude Sonnet 4, and Gemini 2.5 Flash on 10,000 validation examples to measure performance. We also defined reward criteria for these models, which break down into three cases: if the model merges the code perfectly, it receives a score of 1; if it merges the code correctly but with extra new lines, it receives a score of 0.2; all other cases receive a score of 0. Osmosis-Apply-1.7B outperformed all three with a 0.98 reward score (1.00 being perfect):
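For reference, a minimal sketch of the evaluation loop this implies (hypothetical names; it assumes a list of (model response, ground-truth merge) pairs and the compute_score reward function shown later in this post):

def average_reward(examples, compute_score):
    # examples: list of (model_response, ground_truth_merge) pairs
    total = 0.0
    for response, ground_truth in examples:
        total += compute_score(
            data_source=None,  # unused by the scoring logic
            solution_str=response,
            ground_truth=ground_truth,
        )
    return total / len(examples)

A reward score of 0.98 therefore means the model earned roughly 98% of the maximum possible reward across the 10,000 validation examples.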
Osmosis-Apply-1.7B is also significantly cheaper than the foundation model options: roughly 3x cheaper on input tokens and 6x cheaper on output tokens than the next cheapest model, Gemini 2.5 Flash.
Model                 Latency (ms)   Reward Score   Cost ($/M tokens in)   Cost ($/M tokens out)
Osmosis-Apply-1.7B    151            0.9893         $0.11                  $0.42
Claude Sonnet 4       1,180          0.9328         $3.00                  $15.00
OpenAI o3             1,230          0.8639         $2.00                  $8.00
Gemini 2.5 Flash      1,050          0.7745         $0.30                  $2.50
We trained Osmosis-Apply-1.7B on CommitPackFT, a 2GB dataset of code commits, using GRPO. Given the size of the dataset, we only used a portion for training (100K examples, or roughly 1/7 of the dataset, uniformly sampled). The reward function was really, really simple - we rewarded the model when it merged code successfully, gave a minor reward when formatting was slightly off, and didn’t reward it when it failed. Like so:
import re

def extract_solution(solution_str):
    # Pull the merged code out of the model's response. The response is expected to
    # wrap the merged file in a single pair of <code>...</code> tags (the tag name
    # was lost in extraction and is assumed here).
    matches = list(re.finditer(r'<code>(.*?)</code>', solution_str, re.DOTALL))
    if matches and len(matches) == 1:
        return matches[0].group(1).strip()
    return None

def filter_empty_lines(lines):
    # Drop blank lines so that whitespace-only differences are handled separately.
    return list(filter(lambda line: line.strip() != "", lines))

def calc_score(answer, ground_truth):
    answer = answer.strip()
    ground_truth = ground_truth.strip()
    if answer == ground_truth:
        # Perfect merge.
        return 1.0
    else:
        answer_lines = filter_empty_lines(answer.splitlines(True))
        ground_truth_lines = filter_empty_lines(ground_truth.splitlines(True))
        if answer_lines == ground_truth_lines:
            # Correct merge, but with extra or missing blank lines.
            return 0.2
    return 0

def compute_score(data_source, solution_str, ground_truth, extra_info=None, format_score=0.0, score=1.0):
    answer = extract_solution(solution_str=solution_str)
    if answer is None:
        return 0
    else:
        return calc_score(answer, ground_truth)
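As a quick illustration of the three reward cases (hypothetical strings, assuming the <code> tag format above):

perfect  = "<code>def add(a, b):\n    return a + b</code>"
extra_nl = "<code>def add(a, b):\n\n    return a + b\n</code>"
wrong    = "<code>def add(a, b):\n    return a - b</code>"
truth    = "def add(a, b):\n    return a + b"

print(compute_score(None, perfect, truth))   # 1.0  (exact match)
print(compute_score(None, extra_nl, truth))  # 0.2  (matches once blank lines are ignored)
print(compute_score(None, wrong, truth))     # 0    (incorrect merge)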
We trained the model using GRPO with a learning rate of 1e-5 and a batch size of 64. Training was optimized for efficiency with an FSDP (Fully Sharded Data Parallel) strategy across 8 GPUs, using parameter offloading to manage memory. We set the maximum prompt length to 3,072 tokens and the maximum response length to 6,144 tokens to handle the typical size of code merge scenarios.
Notably, we disabled KL divergence regularization and entropy bonuses, allowing the model to focus purely on the reward signal from successful merges. The model was trained for just one epoch with 16 rollout samples per iteration, demonstrating that even minimal training can achieve strong performance when the reward function is well-designed.
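For concreteness, here is a sketch of those hyperparameters collected in one place. The key names are illustrative rather than taken from any specific trainer's config schema; only the values come from the description above.

# Hypothetical consolidation of the training setup described above.
grpo_config = {
    "algorithm": "GRPO",
    "learning_rate": 1e-5,
    "train_batch_size": 64,
    "rollout_n": 16,               # rollout samples per iteration
    "total_epochs": 1,
    "max_prompt_length": 3072,
    "max_response_length": 6144,
    "strategy": "fsdp",            # Fully Sharded Data Parallel across 8 GPUs
    "n_gpus": 8,
    "param_offload": True,         # offload parameters to manage memory
    "use_kl_loss": False,          # KL divergence regularization disabled
    "entropy_coeff": 0.0,          # no entropy bonus
    "reward_fn": "compute_score",  # the reward function shown above
}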
Let us know what you think - and reach out if you’re interested in learning more about reinforcement learning!