CoRe: Combined Rewards with Vision-Language Model Feedback for Preference-Aligned Reinforcement Learning

ICML 2026

Hexian Ni1,2, Tao Lu1†, Yinghao Cai1
1 State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
2 School of Artificial Intelligence, University of Chinese Academy of Sciences

† Corresponding author

Overview of the CoRe framework

CoRe decomposes the task reward into a Formal Reward Module (FRM) and a Residual Reward Module (RRM). FRM leverages LLMs together with VLM-based preferences to iteratively generate and refine code-based rewards; RRM incorporates video-level preferences and state-importance from VLMs to complement FRM and ensure alignment with human intent.

Abstract

Reward design remains a central challenge in reinforcement learning (RL). Hand-crafted rewards are often difficult to specify and may lead to suboptimal policies, while rewards learned from preferences can suffer from inefficiency and unstable training. Inspired by the dual nature of human learning explored in cognitive science, we decompose rewards into two complementary components: Formal Rewards (FR), explicitly designed based on task knowledge, and Residual Rewards (RR), learned from observations to capture implicit and nuanced preferences. Based on this decomposition, we propose CoRe, a hybrid framework that integrates FR and RR with vision-language model (VLM) feedback to achieve preference-aligned policies without human involvement. Our contributions are twofold: (1) We propose a Formal Reward Module (FRM) that leverages VLMs to iteratively design and optimize FR based on task knowledge and preference feedback, enabling continual improvement of the policy during training; (2) We introduce a Residual Reward Module (RRM) that learns RR from video-level preferences, employing VLMs to generate preference labels and capturing nuanced rewards that complement FR, ensuring alignment with human intent. Through the synergy of FRM and RRM, CoRe enables the automatic construction of reliable rewards that are efficient and preference-aligned. Extensive experiments demonstrate that CoRe outperforms existing approaches in policy learning effectiveness and efficiency on ten robotic manipulation tasks in simulation and five real-world tasks.
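
The decomposition described above can be illustrated with a minimal sketch. It assumes the two components are simply added, which this page does not state explicitly. `formal_reward`, `ResidualReward`, `combined_reward`, and `residual_weight` are hypothetical names introduced for illustration, and the linear residual model stands in for whatever learned model the RRM actually uses.

```python
import numpy as np


def formal_reward(obs):
    """Hypothetical code-based reward: negative object-to-target distance."""
    obj_pos, target_pos = obs[:3], obs[3:6]
    return -float(np.linalg.norm(obj_pos - target_pos))


class ResidualReward:
    """Hypothetical stand-in for a learned residual model (a linear map here)."""

    def __init__(self, obs_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.01, size=obs_dim)

    def __call__(self, obs):
        return float(self.w @ obs)


def combined_reward(obs, residual, residual_weight=1.0):
    # Assumption: the formal and residual terms are combined additively.
    return formal_reward(obs) + residual_weight * residual(obs)


# Object at the origin, target one unit away along x.
obs = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])
residual = ResidualReward(obs_dim=obs.size)
total = combined_reward(obs, residual)
```

With `residual_weight=0.0` the sketch reduces to the formal reward alone, which makes it easy to check how much the learned term contributes.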

Experiments

Simulation Experiments

We evaluate CoRe on 10 simulation tasks: seven from MetaWorld (Soccer, Sweep Into, Drawer Open, Button Press, Dial Turn, Hammer, Peg Insert) and three from SoftGym (Fold Cloth, Straighten Rope, Pass Water). MetaWorld tasks are measured by success rate (%); SoftGym tasks by episode reward.

Videos

Videos are provided for each task: Soccer, Sweep Into, Drawer Open, Button Press, Dial Turn, Hammer, Peg Insert, Fold Cloth, Straighten Rope, Pass Water.

Task Description

| Task Name | Task Description |
|---|---|
| Soccer | Move the soccer ball into the goal. |
| Sweep Into | Minimize the distance between the green cube and the hole. |
| Drawer Open | Open the drawer. |
| Button Press | Press the red button down completely from top to bottom. |
| Dial Turn | Turn the red line to the bottom of the dial. |
| Hammer | Hammer the grey nail completely in with a red hammer. |
| Peg Insert | Insert the green peg into the hole of the red block. |
| Fold Cloth | Fold the cloth diagonally from the top left corner to the bottom right corner. |
| Straighten Rope | Straighten the blue rope. |
| Pass Water | Move the container, which holds water, to be as close to the red circle as possible without causing too many water droplets to spill. |

Results

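Columns Soccer through Peg Insert report success rate (%); Fold Cloth, Straighten Rope, and Pass Water report episode reward.

| Method | Soccer | Sweep Into | Drawer Open | Button Press | Dial Turn | Hammer | Peg Insert | Fold Cloth | Straighten Rope | Pass Water |
|---|---|---|---|---|---|---|---|---|---|---|
| Env Sparse | 100.0 | 60.0 | 100.0 | 66.7 | 100.0 | 60.7 | 0.0 | −0.04 | 18.6 | −67.9 |
| Env Dense | 100.0 | 98.0 | 100.0 | 100.0 | 100.0 | 97.3 | 100.0 | −0.08 | 18.1 | −18.3 |
| CLIP Score | 1.3 | 0.0 | 0.0 | 20.7 | 0.0 | 11.3 | 0.0 | −0.52 | 15.0 | −299.4 |
| Eureka | 100.0 | 86.0 | 100.0 | 66.7 | 76.0 | 33.3 | 66.7 | −0.19 | 16.2 | −38.8 |
| Text2Reward | 96.7 | 96.0 | 96.0 | 88.0 | 78.0 | 32.0 | 33.3 | −0.22 | 17.4 | −43.7 |
| RL-VLM-F | 80.0 | 58.0 | 100.0 | 0.0 | 72.0 | 72.7 | 14.7 | −0.12 | 17.9 | −36.4 |
| PrefVLM | 1.3 | 6.7 | 64.0 | 54.0 | 2.7 | 1.3 | 0.0 | −0.18 | 20.5 | −86.9 |
| ERL-VLM | 80.7 | 24.0 | 100.0 | 33.3 | 2.7 | 9.3 | 9.3 | −0.22 | 18.1 | −202.4 |
| CoRe (Ours) | 100.0 | 97.3 | 100.0 | 100.0 | 98.0 | 98.0 | 100.0 | −0.10 | 20.6 | −30.0 |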

Comparison of final success rate (%) and episode reward across ten tasks. Bold values indicate best performance.
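
For a quick side-by-side reading of the MetaWorld columns, the sketch below averages each method's seven success rates. The unweighted mean is a summary computed here from the table values; the page itself does not report it. The names `metaworld_success` and `mean_success` are illustrative.

```python
# Success rates (%) copied from the results table, in task order:
# Soccer, Sweep Into, Drawer Open, Button Press, Dial Turn, Hammer, Peg Insert.
metaworld_success = {
    "Env Sparse": [100.0, 60.0, 100.0, 66.7, 100.0, 60.7, 0.0],
    "Env Dense": [100.0, 98.0, 100.0, 100.0, 100.0, 97.3, 100.0],
    "CLIP Score": [1.3, 0.0, 0.0, 20.7, 0.0, 11.3, 0.0],
    "Eureka": [100.0, 86.0, 100.0, 66.7, 76.0, 33.3, 66.7],
    "Text2Reward": [96.7, 96.0, 96.0, 88.0, 78.0, 32.0, 33.3],
    "RL-VLM-F": [80.0, 58.0, 100.0, 0.0, 72.0, 72.7, 14.7],
    "PrefVLM": [1.3, 6.7, 64.0, 54.0, 2.7, 1.3, 0.0],
    "ERL-VLM": [80.7, 24.0, 100.0, 33.3, 2.7, 9.3, 9.3],
    "CoRe (Ours)": [100.0, 97.3, 100.0, 100.0, 98.0, 98.0, 100.0],
}


def mean_success(rates):
    """Unweighted mean over tasks (a summary not reported on the page)."""
    return sum(rates) / len(rates)


# Methods sorted from highest to lowest mean success rate.
ranking = sorted(
    metaworld_success,
    key=lambda method: mean_success(metaworld_success[method]),
    reverse=True,
)

for method in ranking:
    print(f"{method:12s} {mean_success(metaworld_success[method]):6.2f}")
```

Env Sparse and Env Dense use the environment's own reward functions, so their rows are better read as reference points than as competing reward-design methods.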

Learning Curves

[Figure: learning curves with a shared legend, one panel per task: Soccer, Sweep Into, Drawer Open, Button Press, Dial Turn, Hammer, Peg Insert, Fold Cloth, Straighten Rope, Pass Water.]

Learning curves on ten robotic manipulation tasks (success rate / episode reward). Solid lines and shaded regions denote mean and standard deviation over three seeds.
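
The mean line and shaded band described in the caption can be computed as in the following sketch. The curve values are made up for illustration and are not taken from the experiments. The use of the population standard deviation (`ddof=0`) is an assumption, since the page does not say which estimator was used.

```python
import numpy as np

# Hypothetical success-rate curves (%): one row per seed,
# one column per evaluation point. Not real experimental data.
curves = np.array(
    [
        [0.0, 20.0, 60.0, 100.0],
        [0.0, 40.0, 80.0, 100.0],
        [0.0, 30.0, 70.0, 90.0],
    ]
)

mean = curves.mean(axis=0)        # solid line
std = curves.std(axis=0, ddof=0)  # assumption: population std
lower, upper = mean - std, mean + std  # edges of the shaded region
```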

FRM Code

LLM-generated formal reward for Sweep Into.

import numpy as np
def reward_function(observation, action, target_pos):
    # Unpack observation components
    # observation = [x_hand, y_hand, z_hand, gripper_open, x_obj, y_obj, z_obj, qx_obj, qy_obj, qz_obj, qw_obj]
    hand_pos = observation[0:3]
    # gripper_open = observation[3]  # Normalized gripper opening, not needed for sweep task specifically
    obj_pos = observation[4:7]
    # obj_quat = observation[7:11]   # Not used in reward for sweeping

    # Distance from object to target (hole)
    dist_obj_to_target = np.linalg.norm(obj_pos - target_pos)

    # Distance from hand to object
    dist_hand_to_obj = np.linalg.norm(hand_pos - obj_pos)

    # We want to reward the cube being closer to the hole,
    # and encourage the hand to stay near the cube to sweep it.
    # Also encourage progress towards the goal.

    # Temperature parameters for exponential scaling
    temp_obj_to_target = 0.5
    temp_hand_to_obj = 1.0

    # Reward component: cube closer to target gives higher reward (range 0 to 1)
    # Apply negative distance scaled and exponentiated
    r_obj_to_target = np.exp(-dist_obj_to_target / temp_obj_to_target) - 0.5  # shift to roughly center near 0

    # Reward component: hand close to object (to encourage interaction)
    r_hand_to_obj = np.exp(-dist_hand_to_obj / temp_hand_to_obj) - 0.5  # shift to roughly center near 0

    # Combine rewards with weights
    # Since these scores range roughly in [-0.5, 0.5], sum ranges [-1, 1]
    reward = r_obj_to_target + r_hand_to_obj

    # Clip total reward to [-1, 1]
    reward = np.clip(reward, -1.0, 1.0)

    # Compose individual reward dict for info/debugging
    reward_components = {
        "cube_to_target": r_obj_to_target,
        "hand_to_cube": r_hand_to_obj,
    }

    return reward, reward_components
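
To see how this reward behaves, the sketch below restates the two exponential terms in condensed form and evaluates them at a few states. `sweep_reward` is a hypothetical helper that uses the same terms and temperatures as the listing above. The positions are made up and are not MetaWorld coordinates.

```python
import numpy as np


def sweep_reward(hand_pos, obj_pos, target_pos, temp_obj=0.5, temp_hand=1.0):
    """Condensed restatement of the listing above (same terms, same temperatures)."""
    r_obj = np.exp(-np.linalg.norm(obj_pos - target_pos) / temp_obj) - 0.5
    r_hand = np.exp(-np.linalg.norm(hand_pos - obj_pos) / temp_hand) - 0.5
    return float(np.clip(r_obj + r_hand, -1.0, 1.0))


# Hypothetical positions, chosen only to illustrate the reward shape.
target = np.array([0.0, 0.8, 0.0])
offset = np.array([0.0, 0.0, 0.05])  # hand hovering just above the cube

cube_far = np.array([0.0, 0.6, 0.0])
r_far = sweep_reward(cube_far + offset, cube_far, target)
r_at_hole = sweep_reward(target + offset, target, target)
r_max = sweep_reward(target, target, target)  # all three points coincide
```

Each term lies in (−0.5, 0.5], so the sum already lies in (−1, 1] and the final clip never changes the value. The maximum of 1.0 is reached only when hand, cube, and hole coincide.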

Video-Level Preference Feedback

VLM video-level preference label between two trajectory videos.

Analysis:

Okay, let's analyze the video clips and compare them to the goal image.

(1) What does each frame of video clip A show and what are the differences?

  • Frame 1: A green cube is positioned on a wooden surface, slightly above and to the left of a rectangular hole. A shadow is cast to the left of the cube.
  • Frame 2: The green cube has moved closer to the hole, positioned at the edge of the hole.
  • Frame 3: The green cube is now inside the hole.

The difference between the frames is the position of the green cube. It moves closer to and then into the hole in subsequent frames.

(2) What does each frame of video clip B show and what are the differences?

  • Frame 1: A green cube is positioned on a wooden surface, slightly above and to the left of a rectangular hole. A shadow is cast to the left of the cube.
  • Frame 2: A green cube is positioned on a wooden surface, slightly above and to the left of a rectangular hole. A shadow is cast to the left of the cube.
  • Frame 3: A green cube is positioned on a wooden surface, slightly above and to the left of a rectangular hole. A shadow is cast to the left of the cube.

There is no difference between the frames. The green cube remains in the same position.

(3) Is there any difference in each frame of the video clip and the goal?

Video Clip A:

  • Frame 1: The green cube is outside the hole, unlike the goal image where it's inside.
  • Frame 2: The green cube is partially inside the hole, unlike the goal image where it's fully inside.
  • Frame 3: The green cube is inside the hole, matching the goal image.

Video Clip B:

  • Frame 1: The green cube is outside the hole, unlike the goal image where it's inside.
  • Frame 2: The green cube is outside the hole, unlike the goal image where it's inside.
  • Frame 3: The green cube is outside the hole, unlike the goal image where it's inside.

In summary, Video clip A achieves the goal in frame 3, while video clip B does not achieve the goal in any of its frames.

Preference:

decision#: A

evaluate goal completion for each frame in the video clip: A:[0.3, 0.6, 1] B:[0.3, 0.3, 0.3]

why: Video A shows the cube moving towards and eventually inside the hole, thus minimizing the distance. Video B shows the cube remaining stationary and not minimizing the distance.
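
A response in this format can be turned into a training signal as in the minimal sketch below. The regular expressions assume exactly the `decision#:` and `A:[...] B:[...]` layout shown above. The Bradley-Terry loss is a common choice in preference-based RL and is an assumption here, not necessarily the objective used by RRM. `parse_vlm_feedback` and `bradley_terry_loss` are hypothetical names.

```python
import math
import re


def parse_vlm_feedback(text):
    """Extract the preferred clip and per-frame goal-completion scores."""
    decision = re.search(r"decision#:\s*([AB])", text)
    if decision is None:
        raise ValueError("no decision found in VLM response")
    scores = {
        clip: [float(v) for v in values.split(",")]
        for clip, values in re.findall(r"\b([AB]):\s*\[([^\]]*)\]", text)
    }
    return decision.group(1), scores


def bradley_terry_loss(return_a, return_b, preferred):
    """Negative log-likelihood of the preferred clip under a Bradley-Terry model.

    return_a and return_b are predicted returns (summed rewards) of the clips.
    """
    margin = return_a - return_b if preferred == "A" else return_b - return_a
    # -log(sigmoid(margin)) written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


response = (
    "decision#: A\n"
    "evaluate goal completion for each frame in the video clip: "
    "A:[0.3, 0.6, 1] B:[0.3, 0.3, 0.3]\n"
)
preferred, frame_scores = parse_vlm_feedback(response)
loss = bradley_terry_loss(return_a=2.0, return_b=0.5, preferred=preferred)
```

The loss shrinks as the predicted return of the preferred clip exceeds that of the other clip, so minimizing it pushes the learned reward to agree with the VLM label.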

BibTeX

@InProceedings{ni2026core,
  title     = {CoRe: Combined Rewards with Vision-Language Model Feedback for Preference-Aligned Reinforcement Learning},
  author    = {Ni, Hexian and Lu, Tao and Cai, Yinghao},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}