CoRe

CoRe decomposes the task reward into a Formal Reward Module (FRM) and a Residual Reward Module (RRM). FRM leverages LLMs together with VLM-based preferences to iteratively generate and refine code-based rewards; RRM incorporates video-level preferences and state-importance from VLMs to complement FRM and ensure alignment with human intent.

Abstract

Reward design remains a central challenge in reinforcement learning (RL). Hand-crafted rewards are often difficult to specify and may lead to suboptimal policies, while rewards learned from preferences can suffer from inefficiency and unstable training. Inspired by the dual nature of human learning explored in cognitive science, we decompose rewards into two complementary components: Formal Rewards (FR), explicitly designed based on task knowledge, and Residual Rewards (RR), learned from observations to capture implicit and nuanced preferences. Based on this decomposition, we propose CoRe, a hybrid framework that integrates FR and RR with vision-language model (VLM) feedback to achieve preference-aligned policies without human involvement. Our contributions are twofold: (1) we propose a Formal Reward Module (FRM) that leverages VLMs to iteratively design and optimize FR based on task knowledge and preference feedback, enabling continual improvement of the policy during training; (2) we introduce a Residual Reward Module (RRM) that learns RR from video-level preferences, employing VLMs to generate preference labels and capturing nuanced rewards that complement FR, ensuring alignment with human intent. Through the synergy of FRM and RRM, CoRe enables the automatic construction of reliable rewards that are efficient and preference-aligned. Extensive experiments demonstrate that CoRe outperforms existing approaches in policy learning effectiveness and efficiency on ten robotic manipulation tasks in simulation and five real-world tasks.
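The decomposition can be pictured as a sum of two reward terms. The sketch below is an illustration under stated assumptions, not the paper's implementation: the additive combination, the `beta` weight, summing per-step residual rewards into a clip score, and the Bradley-Terry preference model (a common choice in preference-based RL) are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def combined_reward(formal, residual, beta=1.0):
    """Hypothetical combination: formal reward plus a weighted residual reward."""
    return np.asarray(formal, dtype=float) + beta * np.asarray(residual, dtype=float)


def preference_probability(residual_a, residual_b):
    """Bradley-Terry probability that clip A is preferred over clip B.

    Each argument holds the per-step residual rewards of one video clip;
    the clip-level score is their sum (an assumption, not from the source).
    """
    score_a = float(np.sum(residual_a))
    score_b = float(np.sum(residual_b))
    # Logistic function of the score difference.
    return 1.0 / (1.0 + np.exp(-(score_a - score_b)))


def preference_loss(residual_a, residual_b, label):
    """Cross-entropy loss for a label of 1.0 (A preferred) or 0.0 (B preferred)."""
    p = preference_probability(residual_a, residual_b)
    eps = 1e-8  # avoids log(0)
    return -(label * np.log(p + eps) + (1.0 - label) * np.log(1.0 - p + eps))
```

In this picture, a VLM-generated preference label would supply `label`, and minimizing `preference_loss` over many clip pairs would shape the residual term while the formal term stays code-based.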

Experiments

Simulation Experiments

We evaluate CoRe on 10 simulation tasks: seven from MetaWorld (Soccer, Sweep Into, Drawer Open, Button Press, Dial Turn, Hammer, Peg Insert) and three from SoftGym (Fold Cloth, Straighten Rope, Pass Water). MetaWorld tasks are measured by success rate (%); SoftGym tasks by episode reward.

Videos

[Videos: Soccer, Sweep Into, Drawer Open, Button Press, Dial Turn, Hammer, Peg Insert, Fold Cloth, Straighten Rope, Pass Water]

Task Description

Task Name	Task Description
Soccer	Move the soccer ball into the goal.
Sweep Into	Minimize the distance between the green cube and the hole.
Drawer Open	Open the drawer.
Button Press	Press the red button down completely from top to bottom.
Dial Turn	Turn the red line to the bottom of the dial.
Hammer	Hammer the grey nail completely in with a red hammer.
Peg Insert	Insert the green peg into the hole of the red block.
Fold Cloth	Fold the cloth diagonally from the top left corner to the bottom right corner.
Straighten Rope	Straighten the blue rope.
Pass Water	Move the container, which holds water, to be as close to the red circle as possible without causing too many water droplets to spill.

Results

Method	Soccer (%)	Sweep Into (%)	Drawer Open (%)	Button Press (%)	Dial Turn (%)	Hammer (%)	Peg Insert (%)	Fold Cloth (reward)	Straighten Rope (reward)	Pass Water (reward)
Env Sparse	100.0	60.0	100.0	66.7	100.0	60.7	0.0	−0.04	18.6	−67.9
Env Dense	100.0	98.0	100.0	100.0	100.0	97.3	100.0	−0.08	18.1	−18.3
CLIP Score	1.3	0.0	0.0	20.7	0.0	11.3	0.0	−0.52	15.0	−299.4
Eureka	100.0	86.0	100.0	66.7	76.0	33.3	66.7	−0.19	16.2	−38.8
Text2Reward	96.7	96.0	96.0	88.0	78.0	32.0	33.3	−0.22	17.4	−43.7
RL-VLM-F	80.0	58.0	100.0	0.0	72.0	72.7	14.7	−0.12	17.9	−36.4
PrefVLM	1.3	6.7	64.0	54.0	2.7	1.3	0.0	−0.18	20.5	−86.9
ERL-VLM	80.7	24.0	100.0	33.3	2.7	9.3	9.3	−0.22	18.1	−202.4
CoRe (Ours)	100.0	97.3	100.0	100.0	98.0	98.0	100.0	−0.10	20.6	−30.0

Comparison of final success rate (%) and episode reward across ten tasks. Bold values indicate best performance.

Learning Curves

[Figure: learning-curve panels for Soccer, Sweep Into, Drawer Open, Button Press, Dial Turn, Hammer, Peg Insert, Fold Cloth, Straighten Rope, Pass Water]

Learning curves on ten robotic manipulation tasks (success rate / episode reward). Solid lines and shaded regions denote the mean and standard deviation over three seeds.
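The mean line and shaded band of such curves can be computed as sketched below. This is a minimal illustration: the function name and the example numbers are made up, and the use of the population standard deviation (NumPy's default) is an assumption, since the caption does not say which estimator was used.

```python
import numpy as np


def summarize_curves(curves):
    """Return per-step mean and standard deviation across seeds.

    `curves` has shape (num_seeds, num_eval_points).
    """
    curves = np.asarray(curves, dtype=float)
    return curves.mean(axis=0), curves.std(axis=0)


# Made-up success rates (%) for three seeds at four evaluation points.
example_curves = [
    [0.0, 20.0, 60.0, 100.0],
    [0.0, 40.0, 80.0, 100.0],
    [0.0, 30.0, 70.0, 100.0],
]
mean, std = summarize_curves(example_curves)
lower, upper = mean - std, mean + std  # bounds of the shaded band
```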

Real-World Experiments

Policies trained in simulation are deployed on a UR5 robotic arm without fine-tuning across five real-world tasks. Each method is evaluated over 20 trials per task.

Videos

[Videos: Drawer Open, Dial Turn, Hammer, Fold Cloth, Straighten Rope]

Results

Method	Drawer Open sim ↑	Drawer Open real ↑	Dial Turn sim ↑	Dial Turn real ↑	Hammer sim ↑	Hammer real ↑	Fold Cloth sim ↓	Fold Cloth real ↓	Straighten Rope sim ↑	Straighten Rope real ↑
CLIP Score	0	0	0	0	0	0	11.0	15.6	55	33
Eureka	100	70	60	45	0	0	4.8	7.3	76	69
Text2Reward	100	75	55	45	40	25	5.1	7.0	51	42
PrefVLM	85	55	0	0	5	0	2.9	4.5	81	71
RL-VLM-F	100	65	55	45	50	35	2.5	4.9	83	73
ERL-VLM	100	80	0	0	0	0	4.2	5.2	81	53
CoRe (Ours)	100	90	75	70	90	80	2.1	3.2	90	84

Success rate (%) for Drawer Open, Dial Turn and Hammer; Fold Cloth reports positional error (cm, lower is better); Straighten Rope reports task completion (%). CoRe consistently outperforms other baselines across five real-world tasks, demonstrating robust sim-to-real transfer.
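One way to quantify sim-to-real transfer from the table is the drop from the simulated to the real-world score. The sketch below uses the (sim, real) values for two of the methods, copied from the table above. The helper name and the choice of an unweighted mean are assumptions made for illustration.

```python
# (sim, real) percentages copied from the real-world results table above.
# Fold Cloth is omitted because it reports positional error (cm), not a percentage.
results = {
    "RL-VLM-F": {
        "Drawer Open": (100, 65),
        "Dial Turn": (55, 45),
        "Hammer": (50, 35),
        "Straighten Rope": (83, 73),
    },
    "CoRe": {
        "Drawer Open": (100, 90),
        "Dial Turn": (75, 70),
        "Hammer": (90, 80),
        "Straighten Rope": (90, 84),
    },
}


def sim_to_real_drop(per_task):
    """Percentage-point drop from simulation to the real robot, per task."""
    return {task: sim - real for task, (sim, real) in per_task.items()}


drops = {method: sim_to_real_drop(tasks) for method, tasks in results.items()}
mean_drop = {method: sum(d.values()) / len(d) for method, d in drops.items()}
```

On these four tasks the mean drop is 7.75 percentage points for CoRe and 17.5 for RL-VLM-F.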

Feedback Efficiency

Method	MetaWorld labels	MetaWorld images	Fold Cloth labels	Fold Cloth images	Straighten Rope / Pass Water labels	Straighten Rope / Pass Water images
RL-VLM-F	5.0K	10.0K	0.50K	1.00K	2.0K	4.0K
PrefVLM	21.0K	2.0M	1.79K	10.74K	4.1K	0.4M
ERL-VLM	4.9K	4.9K	0.75K	0.75K	0.9K	0.9K
CoRe (Ours)	0.5K	4.0K	0.15K	1.20K	0.2K	1.6K

Number of preference labels and images used for preference-based reward learning across the ten tasks. CoRe demonstrates superior preference-feedback efficiency: across the ten tasks it uses only 0.15K–0.5K preference labels, compared with 0.5K–21K for the baselines, a roughly 3–40× reduction.
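The reduction factor can be rechecked directly from the label counts in the table. The sketch below copies those counts (in thousands); the helper name is hypothetical.

```python
# Preference-label counts in thousands, copied from the table above.
labels_k = {
    "MetaWorld": {"RL-VLM-F": 5.0, "PrefVLM": 21.0, "ERL-VLM": 4.9, "CoRe": 0.5},
    "Fold Cloth": {"RL-VLM-F": 0.50, "PrefVLM": 1.79, "ERL-VLM": 0.75, "CoRe": 0.15},
    "Straighten Rope / Pass Water": {"RL-VLM-F": 2.0, "PrefVLM": 4.1, "ERL-VLM": 0.9, "CoRe": 0.2},
}


def label_reduction(per_method, ours="CoRe"):
    """How many times more labels each baseline uses than `ours`."""
    return {m: n / per_method[ours] for m, n in per_method.items() if m != ours}


ratios = [r for per_method in labels_k.values() for r in label_reduction(per_method).values()]
smallest, largest = min(ratios), max(ratios)
```

The smallest ratio is about 3.3× (RL-VLM-F on Fold Cloth) and the largest is 42× (PrefVLM on MetaWorld), consistent with the rounded 3–40× range in the caption.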

Training Cost

Method	Token (M)	API Cost ($)	Time (h)
SAC	—	—	0.97
CLIP Score	—	—	2.58
Eureka	0.03	0.03	1.87
Text2Reward	0.03	0.02	5.73
RL-VLM-F	5.50	0.79	6.72
PrefVLM	—	—	1.82
ERL-VLM	3.19	0.55	5.80
CoRe (Ours)	2.00	0.37	2.15

Average training cost across ten tasks (token usage, API cost, and wall-clock time). A full CoRe training run requires approximately 2.00M tokens, $0.37 in API cost, and 2.15 hours of wall-clock time. Compared with the other baselines, CoRe achieves a more favorable trade-off between performance, computational cost, and runtime efficiency.
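The trade-off can be made concrete by expressing CoRe's cost relative to other rows of the table. The sketch below copies four rows. Treating the SAC row as the plain RL baseline without any reward-design calls is an assumption, and the helper name is hypothetical.

```python
# Values copied from the training-cost table above (averages over ten tasks).
# None marks entries the table leaves blank.
cost = {
    "SAC": {"tokens_m": None, "api_usd": None, "time_h": 0.97},
    "RL-VLM-F": {"tokens_m": 5.50, "api_usd": 0.79, "time_h": 6.72},
    "ERL-VLM": {"tokens_m": 3.19, "api_usd": 0.55, "time_h": 5.80},
    "CoRe": {"tokens_m": 2.00, "api_usd": 0.37, "time_h": 2.15},
}


def relative_to(method, reference, key):
    """Ratio of one method's cost to a reference method's cost for one column."""
    return cost[method][key] / cost[reference][key]


time_vs_sac = relative_to("CoRe", "SAC", "time_h")
tokens_vs_rlvlmf = relative_to("CoRe", "RL-VLM-F", "tokens_m")
time_vs_rlvlmf = relative_to("CoRe", "RL-VLM-F", "time_h")
```

Under these numbers, CoRe takes about 2.2× the wall-clock time of the SAC row while using roughly 36% of RL-VLM-F's tokens and about a third of its training time.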

FRM Code

LLM-generated formal reward for Sweep Into.

import numpy as np
def reward_function(observation, action, target_pos):
    # Unpack observation components
    # observation = [x_hand, y_hand, z_hand, gripper_open, x_obj, y_obj, z_obj, qx_obj, qy_obj, qz_obj, qw_obj]
    hand_pos = observation[0:3]
    # gripper_open = observation[3]  # Normalized gripper opening, not needed for sweep task specifically
    obj_pos = observation[4:7]
    # obj_quat = observation[7:11]   # Not used in reward for sweeping

    # Distance from object to target (hole)
    dist_obj_to_target = np.linalg.norm(obj_pos - target_pos)

    # Distance from hand to object
    dist_hand_to_obj = np.linalg.norm(hand_pos - obj_pos)

    # We want to reward the cube being closer to the hole,
    # and encourage the hand to stay near the cube to sweep it.
    # Also encourage progress towards the goal.

    # Temperature parameters for exponential scaling
    temp_obj_to_target = 0.5
    temp_hand_to_obj = 1.0

    # Reward component: cube closer to target gives higher reward (range 0 to 1)
    # Apply negative distance scaled and exponentiated
    r_obj_to_target = np.exp(-dist_obj_to_target / temp_obj_to_target) - 0.5  # shift to roughly center near 0

    # Reward component: hand close to object (to encourage interaction)
    r_hand_to_obj = np.exp(-dist_hand_to_obj / temp_hand_to_obj) - 0.5  # shift to roughly center near 0

    # Combine rewards with weights
    # Since these scores range roughly in [-0.5, 0.5], sum ranges [-1, 1]
    reward = r_obj_to_target + r_hand_to_obj

    # Clip total reward to [-1, 1]
    reward = np.clip(reward, -1.0, 1.0)

    # Compose individual reward dict for info/debugging
    reward_components = {
        "cube_to_target": r_obj_to_target,
        "hand_to_cube": r_hand_to_obj,
    }

    return reward, reward_components
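A quick sanity check of the reward above can be run as follows. The block repeats a condensed copy of the function so that it runs on its own; the observation values, the target position, and the 4-dimensional action are made up for illustration.

```python
import numpy as np


def reward_function(observation, action, target_pos):
    """Condensed copy of the generated Sweep Into reward shown above."""
    hand_pos = observation[0:3]
    obj_pos = observation[4:7]
    dist_obj_to_target = np.linalg.norm(obj_pos - target_pos)
    dist_hand_to_obj = np.linalg.norm(hand_pos - obj_pos)
    r_obj_to_target = np.exp(-dist_obj_to_target / 0.5) - 0.5
    r_hand_to_obj = np.exp(-dist_hand_to_obj / 1.0) - 0.5
    reward = np.clip(r_obj_to_target + r_hand_to_obj, -1.0, 1.0)
    return reward, {"cube_to_target": r_obj_to_target, "hand_to_cube": r_hand_to_obj}


target_pos = np.array([0.0, 0.8, 0.0])  # made-up hole position
action = np.zeros(4)  # unused by the reward
quat = [0.0, 0.0, 0.0, 1.0]  # identity orientation, also unused

# Best case: hand, cube, and hole coincide, so both terms equal 0.5.
obs_at_goal = np.concatenate([target_pos, [1.0], target_pos, quat])
best_reward, best_parts = reward_function(obs_at_goal, action, target_pos)

# Far case: cube one unit from the hole along x, hand one unit above the cube.
cube_far = target_pos + np.array([1.0, 0.0, 0.0])
hand_far = cube_far + np.array([0.0, 0.0, 1.0])
obs_far = np.concatenate([hand_far, [1.0], cube_far, quat])
far_reward, far_parts = reward_function(obs_far, action, target_pos)
```

Because each term has the form exp(-d / T) - 0.5, it lies in (-0.5, 0.5], so the sum already falls in (-1, 1] and the final clip never changes the value.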

Video-Level Preference Feedback

VLM video-level preference label between two trajectory videos.

Analysis:

Okay, let's analyze the video clips and compare them to the goal image.

(1) What does each frame of video clip A show and what are the differences?

Frame 1: A green cube is positioned on a wooden surface, slightly above and to the left of a rectangular hole. A shadow is cast to the left of the cube.
Frame 2: The green cube has moved closer to the hole, positioned at the edge of the hole.
Frame 3: The green cube is now inside the hole.

The difference between the frames is the position of the green cube. It moves closer to and then into the hole in subsequent frames.

(2) What does each frame of video clip B show and what are the differences?

Frame 1: A green cube is positioned on a wooden surface, slightly above and to the left of a rectangular hole. A shadow is cast to the left of the cube.
Frame 2: A green cube is positioned on a wooden surface, slightly above and to the left of a rectangular hole. A shadow is cast to the left of the cube.
Frame 3: A green cube is positioned on a wooden surface, slightly above and to the left of a rectangular hole. A shadow is cast to the left of the cube.

There is no difference between the frames. The green cube remains in the same position.

(3) Is there any difference in each frame of the video clip and the goal?

Video Clip A:

Frame 1: The green cube is outside the hole, unlike the goal image where it's inside.
Frame 2: The green cube is partially inside the hole, unlike the goal image where it's fully inside.
Frame 3: The green cube is inside the hole, matching the goal image.

Video Clip B:

Frame 1: The green cube is outside the hole, unlike the goal image where it's inside.
Frame 2: The green cube is outside the hole, unlike the goal image where it's inside.
Frame 3: The green cube is outside the hole, unlike the goal image where it's inside.

In summary, Video clip A achieves the goal in frame 3, while video clip B does not achieve the goal in any of its frames.

Preference:

decision#: A

evaluate goal completion for each frame in the video clip: A:[0.3, 0.6, 1] B:[0.3, 0.3, 0.3]

why: Video A shows the cube moving towards and eventually inside the hole, thus minimizing the distance. Video B shows the cube remaining stationary and not minimizing the distance.
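A response in this format has to be turned into a machine-readable label before it can supervise reward learning. The sketch below shows one way to do that with regular expressions. It assumes the response always follows the layout of the example above; `parse_preference` is a hypothetical helper, and how the authors actually parse the output is not stated.

```python
import ast
import re


def parse_preference(response):
    """Extract the preferred clip and per-frame scores from a VLM response.

    Returns (decision, scores), where decision is "A" or "B" and scores maps
    each clip name to a list of floats. Raises ValueError if parsing fails.
    """
    decision_match = re.search(r"decision#?:\s*([AB])\b", response)
    if decision_match is None:
        raise ValueError("no decision found in response")
    # Matches fragments such as "A:[0.3, 0.6, 1]".
    scores = {
        clip: [float(v) for v in ast.literal_eval(values)]
        for clip, values in re.findall(r"\b([AB]):\s*(\[[^\]]*\])", response)
    }
    if set(scores) != {"A", "B"}:
        raise ValueError("expected per-frame scores for clips A and B")
    return decision_match.group(1), scores


response = """decision#: A

evaluate goal completion for each frame in the video clip: A:[0.3, 0.6, 1] B:[0.3, 0.3, 0.3]

why: Video A shows the cube moving towards and eventually inside the hole."""
decision, scores = parse_preference(response)
```

Raising an error on malformed output, rather than guessing a label, keeps unparseable responses out of the preference dataset.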

BibTeX

@InProceedings{ni2026core,
  title     = {CoRe: Combined Rewards with Vision-Language Model Feedback for Preference-Aligned Reinforcement Learning},
  author    = {Ni, Hexian and Lu, Tao and Cai, Yinghao},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}

CoRe: Combined Rewards with Vision-Language Model Feedback for Preference-Aligned Reinforcement Learning

ICML 2026
