LLMs can learn better coding behavior from problems with no known answers.
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-27
A new RL method called RiVER lets LLMs improve their coding performance on optimization problems that have no ground-truth solution. Instead of checking absolute correctness, it ranks candidate programs against each other on shared hidden tests.
Extraction
Topics: reinforcement-learning, code-generation, llm-training, optimization-problems
Claims
- Many real optimization problems have no certified gold-standard solution, making standard RL reward signals inapplicable.
- RiVER rewards programs by ranking their relative performance on shared test cases rather than comparing against a known correct answer.
- Raw scores are not used directly because test cases that naturally produce larger score values would disproportionately distort training.
- RiVER assigns extra weight to the top-ranked program while still providing graded feedback to other valid programs.
- Training on 12 AtCoder Heuristic Contest tasks improved both heuristic contest scores and standard pass-or-fail coding benchmarks.
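The ranking idea in the claims above can be illustrated with a small sketch. This is not RiVER's actual reward formula, which the source does not give. It assumes higher scores are better, uses `None` to mark an invalid program, and introduces hypothetical names (`rank_rewards`, `program_rewards`, `top_bonus`) and a hypothetical bonus value purely for illustration.

```python
from typing import Optional


def rank_rewards(scores: list[Optional[float]], top_bonus: float = 0.5) -> list[float]:
    """Rank-based rewards for one test case (illustrative, not RiVER's formula).

    scores[p] is program p's raw score on this test, or None if the program
    was invalid. Higher raw scores are assumed to be better.
    """
    valid = [s for s in scores if s is not None]
    n = len(valid)
    rewards = []
    for s in scores:
        if s is None:
            # Invalid programs receive no reward.
            rewards.append(0.0)
            continue
        # Graded feedback: position among valid programs, scaled into (0, 1].
        beaten = sum(1 for v in valid if v < s)
        reward = (beaten + 1) / n
        # Extra weight for the best program on this test (assumed bonus size).
        if s == max(valid):
            reward += top_bonus
        rewards.append(reward)
    return rewards


def program_rewards(score_matrix: list[list[Optional[float]]],
                    top_bonus: float = 0.5) -> list[float]:
    """Average the per-test rank rewards; score_matrix[t][p] is test t, program p."""
    per_test = [rank_rewards(row, top_bonus) for row in score_matrix]
    n_programs = len(score_matrix[0])
    return [sum(row[p] for row in per_test) / len(per_test)
            for p in range(n_programs)]


# Two tests with very different score scales; program 3 is invalid on both.
scores = [
    [10.0, 20.0, 30.0, None],       # small-scale test: program 2 is best
    [1.0e6, 3.0e6, 2.0e6, None],    # large-scale test: program 1 is best
]
rewards = program_rewards(scores)
```

In this toy input, averaging raw scores would let the large-scale test decide everything, so program 1 would look strictly better than program 2. Under the rank-based sketch, programs 1 and 2 each win one test and place second on the other, so they receive equal reward, the weakest valid program still gets a small positive reward, and the invalid program gets zero.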
Key quotes
Normal reinforcement learning works well when it can check a clear right answer, but that breaks down when the best answer is unknown.
RiVER does not trust raw scores directly, because some test cases naturally produce much bigger numbers and can distort training.
Instead, it ranks programs within each test case, gives extra weight to the best one, and still gives smaller graded feedback to other valid programs.