Lab Work

Report this week (5.30 — 6.5)

  1. Inverse kinematics to move the Piper arm up and down;
  2. Try to reproduce GraspNet-1B.

Controlling the Piper arm with IK

Uses the curobo project, which needs the URDF and USD files of the Piper.

Python code example (IK/FK control of the Piper arm)
import torch
from curobo.types.math import Pose
from curobo.types.robot import RobotConfig
from curobo.wrap.reacher.ik_solver import IKSolver, IKSolverConfig

robot_cfg = RobotConfig.from_basic(URDF_PATH, BASE_LINK, EE_LINK)
ik_config = IKSolverConfig.load_from_robot_config(
    robot_cfg,
    position_threshold=0.05,   # position threshold (m)
    rotation_threshold=0.3,    # rotation threshold (rad)
    self_collision_check=False,
    self_collision_opt=False,
)
ik_solver = IKSolver(ik_config)


def IK(ee_position: torch.Tensor, ee_quat: torch.Tensor):
    # Make sure the inputs are float32
    ee_position = ee_position.float()
    ee_quat = ee_quat.float()
    goal = Pose(ee_position, ee_quat)
    result = ik_solver.solve_batch(goal)

    if torch.count_nonzero(result.success).item() > 0:
        return result.solution[0][0].cpu().numpy()
    else:
        raise RuntimeError("IK solution not found.")


def FK(joint_angles: torch.Tensor):
    joint_angles = joint_angles.float()  # make sure it is float32
    kin_state = ik_solver.fk(joint_angles)
    return kin_state.ee_position.cpu().numpy(), kin_state.ee_quaternion.cpu().numpy()
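
A quick usage sketch of the two helpers above (the joint count and CUDA device are assumptions; Piper is treated as a 6-DoF arm here):

    # Hypothetical round trip: FK on a joint configuration, then IK back to joint angles
    joints = torch.zeros((1, 6), device="cuda")
    ee_pos, ee_quat = FK(joints)
    solution = IK(torch.tensor(ee_pos, device="cuda"), torch.tensor(ee_quat, device="cuda"))
    print("FK pose:", ee_pos, ee_quat)
    print("IK joint solution:", solution)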

Setting up the GraspNet environment

  • Python needs to be 3.8, the only version that compiles the custom CUDA ops successfully, but open3d then errors (it needs glibc 2.18).
  • Compile open3d from source or install it via conda (conda cannot do this)? Building from source needs clang, and I do not have sudo.
  • Python 3.7 cannot compile the other CUDA ops.



Report this week (6.6 – 6.12)

Curobo on the server

  • Install PyTorch via pip; installing it with conda hits a glibc error.
  • Need to run pip install scikit_build_core after installing PyTorch.
  • Run module load cuda/12.1 gcc/9.3.0 before pip install -e . --no-build-isolation.
  • Still unresolved for now.

Configuring Curobo for Piper

  • Prepare the URDF and USD files;
  • use the Lula Robot Description Editor to create the collision spheres;
  • some configuration entries need to be changed;
  • current result:
  • possible issue: the FK of the hand links and the gripper differ.



Report this week (6.26 – 7.3)

  1. Reproduce Minimind-V and understand the code of the vision-language model.

VLM model

VLM architecture
Assume the batch size is 1, the sequence length is 639, the hidden size is 768, and the vocabulary size is 6400.

  • Input example: “<image>\n这个图像中有什么内容?” (“What is in this image?”)
  • Prompt: “@@…@@\n这个图像中有什么内容?”, where the <image> tag is expanded into 196 “@” placeholders; input_ids: “34 34 … 34 1234 183 123 …”
  • To token embeddings: [1, 639, 768]
  • Image features: [1, 3, 224, 224] → [1, 1, 196, 768]
  • Replace the positions of the “@” placeholders with the image features, so the input shape of the transformer layers is [1, 639, 768] (see the sketch below)
  • The output shape of the transformer (after RMSNorm) is also [1, 639, 768]
  • The output shape after the linear classification layer is [1, 639, 6400]
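
A minimal PyTorch sketch of the placeholder-replacement step described above; the tensor names and image_token_id are illustrative assumptions, not Minimind-V's actual identifiers:

    import torch

    def merge_image_features(input_ids, text_embeds, image_feats, image_token_id):
        # input_ids:   [1, 639]          (196 positions hold the "@" placeholder id)
        # text_embeds: [1, 639, 768]     (token embeddings of the prompt)
        # image_feats: [1, 1, 196, 768]  (vision-encoder output: 1 image, 196 patches)
        hidden = text_embeds.shape[-1]
        merged = text_embeds.clone()
        is_image = (input_ids == image_token_id)              # [1, 639] boolean mask
        # Scatter the flattened patch features into the placeholder positions, in order
        merged[is_image] = image_feats.reshape(-1, hidden).to(merged.dtype)
        return merged                                         # [1, 639, 768] -> transformer layers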

VLA model

VLA architecture
Use the structure of OpenVLA; the last 256 token ids are action tokens.

  • Input example: “<image>\n机器人应该怎么做才能完成{任务}?答:” (“What should the robot do to complete {task}? Answer:”)
    For OpenVLA, image features are always placed at the beginning of the prompt.
  • input_ids: “34 34 … 34 1234 183 123 …”
  • To token embeddings: [1, 639, 768]
  • The output shape after the linear classification layer is [1, 639, 6400]; assume the last 7 tokens are “6344 6298 6392 6306 6280 6314 6399”
  • These last 7 tokens are decoded into actions in the range [-1, 1], here [-0.569, -0.208, -0.945, -0.270, -0.067, -0.333, -0.996]
    Formula (applied per dimension i, see the sketch below):
    action[i] = -1 + (2 * clip(6400 - last_tokens[i] - 1, 0, 254) + 1) / 255
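
For reference, a direct numpy transcription of the decoding formula above, applied to the example token ids (an illustrative sketch, not the project's actual decoder):

    import numpy as np

    def decode_action_tokens(last_tokens, vocab_size=6400):
        # Map action token ids back to continuous values in [-1, 1]
        bin_idx = np.clip(vocab_size - np.asarray(last_tokens) - 1, 0, 254)
        return -1.0 + (2.0 * bin_idx + 1.0) / 255.0

    print(decode_action_tokens([6344, 6298, 6392, 6306, 6280, 6314, 6399]))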

VLM to VLA

  1. Training mode: freeze the vision model and train the language model;
  2. Training data: RLDS, LeRobot, or a plain dataset format; start with the LIBERO data?
  3. Model architecture: requires implementing an action tokenizer, and the model architecture needs adjustments: for example, the image features must always sit at the very beginning of the prompt.
  4. Inference code: use the LIBERO simulation environment.

VLM Pretrained Result

minimind-v reproduction result
The pretrained model has not been evaluated yet. Given the large loss gap compared with what the author reported, I suspect there is an issue with how the pretrained language model is loaded.

Next week

  1. Complete the VLA structure and training script;
  2. convert the LIBERO dataset to RLDS, LeRobot, or a plain dataset format;
  3. What should the input structure look like during training? Check OpenVLA.
  4. Train directly on top of the already-trained VLM.

Report this week (7.3 – 7.10)

  1. OpenVLA architecture study and implementation: analyzed OpenVLA's input architecture in depth and started implementing a similar structure.
  2. Action token selection: out of the 6400-word vocabulary, selected the 128 least frequently used token ids and saved them to an action_token_map.json file.
  3. LIBERO dataset processing: downloaded the Libero dataset and processed it into the format required for model training.

Work

1. OpenVLA Input Structure
  • Input/output shapes:

    Input/output part            Dimension/shape
    <bos> (begin-of-sequence)    [bs, 1, hidden_size]
    image features               [bs, num * 196, hidden_size]
    text features                [bs, text_length, hidden_size]
    action tokens                [bs, 7, hidden_size]
    <eos> (end-of-sequence)      [bs, 1, hidden_size]
  • Temporal shift (teacher forcing):

    • Model input: full_labels[:, :-1]
    • Model target (output): full_labels[:, 1:]
  • Loss calculation strategies:

    1. Official strategy: compute the loss over both the text features and the action tokens.
    2. Alternative strategy: compute the loss over the action tokens only.
  • Follow-up improvements (a sketch of the sequence assembly follows after this list):

    • According to the original paper, the current implementation lacks proprioceptive features and history features; these can be added in later versions.
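
A hedged sketch of the sequence assembly and label shift described above (the tensor names and the IGNORE_INDEX convention are illustrative assumptions, not the exact training code):

    import torch

    IGNORE_INDEX = -100  # positions excluded from the cross-entropy loss

    def build_sequence(bos_emb, image_feats, text_embeds, action_embeds, eos_emb,
                       text_ids, action_ids, bos_id, eos_id):
        # [bs, 1 + num*196 + text_len + 7 + 1, hidden_size]
        inputs_embeds = torch.cat([bos_emb, image_feats, text_embeds, action_embeds, eos_emb], dim=1)

        bs, n_img = image_feats.shape[0], image_feats.shape[1]
        device = text_ids.device
        ignore = torch.full((bs, n_img), IGNORE_INDEX, dtype=torch.long, device=device)
        bos = torch.full((bs, 1), bos_id, dtype=torch.long, device=device)
        eos = torch.full((bs, 1), eos_id, dtype=torch.long, device=device)
        full_labels = torch.cat([bos, ignore, text_ids, action_ids, eos], dim=1)

        # Teacher forcing: feed positions [:-1], supervise positions [1:]
        return inputs_embeds[:, :-1], full_labels[:, 1:]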

2. Least Used Tokens
  • Observation: unlike Llama, minimind's vocabulary is not sorted by usage frequency in descending order. For example, the last 5 entries of the vocabulary are common English words:

    • "Ġeconomy": 6395
    • "Ġethically": 6396
    • "éĻĪ": 6397
    • "Ġschools": 6398
    • "Ġnetworks": 6399
  • Solution: to find genuinely low-frequency tokens to use as action tokens, I downloaded roughly 13 GB of Chinese and English corpus data, counted token frequencies with the minimind tokenizer, selected the lowest-frequency tokens, and stored their indices in action_token_map.json (a sketch follows below).
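
Roughly how the frequency statistics can be gathered; a hedged sketch that assumes the minimind tokenizer loads through AutoTokenizer and that the corpus is a directory of plain-text files (both paths are hypothetical):

    import json
    from collections import Counter
    from pathlib import Path
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("./minimind_tokenizer")   # hypothetical path
    counts = Counter({i: 0 for i in range(tokenizer.vocab_size)})       # include never-seen ids

    for txt_file in Path("./corpus").glob("*.txt"):                     # hypothetical corpus dir
        with open(txt_file, encoding="utf-8") as f:
            for line in f:
                counts.update(tokenizer.encode(line, add_special_tokens=False))

    # Keep the 128 least-frequent token ids and map bin index -> token id
    least_used = [tok for tok, _ in sorted(counts.items(), key=lambda kv: kv[1])[:128]]
    with open("action_token_map.json", "w") as f:
        json.dump({str(i): tok for i, tok in enumerate(least_used)}, f, indent=2)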


3. Libero Dataset
  • Data loading: directly use the lerobot-format Libero dataset provided on Hugging Face, which simplifies data acquisition.

    from datasets import load_dataset
    ds = load_dataset("physical-intelligence/libero")['train']
  • Data adaptation: the dataset was reformatted; at this stage I mainly use ds['image'] (the main-view image), ds['wrist_image'] (the wrist image), and ds['action'] (the action). The task descriptions are loaded from a json file.

  • LiberoDataset class code preview

    import torch
    from torch.utils.data import Dataset

    class LiberoDataset(Dataset):
        def __init__(self, hf_dataset, transform=None, task_file_path='dataset/meta/tasks.jsonl'):
            self.dataset = hf_dataset
            self.transform = transform
            # Load the task descriptions, indexed by task_index
            self.task_info = load_task_info(task_file_path)

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, idx):
            sample = self.dataset[idx]
            image = self.transform(sample['image'])
            wrist_image = self.transform(sample['wrist_image'])
            state = torch.tensor(sample['state'], dtype=torch.float32)
            actions = torch.tensor(sample['actions'], dtype=torch.float32)
            task_index = sample['task_index']

            # Look up the task description by task_index
            task_description = self.task_info.get(task_index, f"Unknown task {task_index}")

            return {
                'image': image,
                'wrist_image': wrist_image,
                'state': state,
                'actions': actions,
                'task_description': task_description,
            }
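  • A hedged usage sketch of the class above (the transform and batch size are illustrative):

    from torch.utils.data import DataLoader
    from torchvision import transforms

    tfs = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    train_ds = LiberoDataset(ds, transform=tfs)   # ds comes from load_dataset above
    loader = DataLoader(train_ds, batch_size=8, shuffle=True, num_workers=4)

    batch = next(iter(loader))
    print(batch['image'].shape, batch['actions'].shape, batch['task_description'][:2])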

Problems

  1. Model format issue: the minimind-v pretrained model does not seem to ship usable .pth weights. Loading the officially provided .pth file produces garbled output; only the Hugging Face-format model.bin works correctly. Does this mean we still need to run the pretraining ourselves?
  2. Language limitation: the current model mainly handles Chinese. Will this hurt the performance of the VLA (Vision-Language-Action) model?



Report this week (7.10 – 7.17)

  1. Got the training code running, but the results have not been validated (training has not been run to completion), so the training code may still contain bugs;
    train
  2. Full-parameter fine-tuning takes up just about 24 GB of GPU memory.
  3. Next week: first train this model, check whether the code has problems, and write the evaluation script.
  4. It might help to give longer tasks a larger weight during training.



Report this week (7.17 – 7.24)

  1. Added an action-accuracy check to the training code;
  2. Wrote the LIBERO evaluation code, but have not checked the details yet;
  3. Made some changes to the dataset processing.

Training Scripts

The action-accuracy check added to the training code:

def compute_action_metrics(outputs, batch, action_tokenizer: ActionTokenizer, num_vision_patches=196):
    """
    Compute the action-prediction accuracy.
    """
    action_preds = outputs['logits'][:, 2 * num_vision_patches + 1:, :].argmax(dim=2)
    action_gt = batch['input_ids'][:, 1:].to(action_preds.device)

    # Build a mask marking the positions that hold real action tokens
    action_token_ids = set(action_tokenizer.action_to_token_id.values())
    mask = torch.zeros_like(action_gt, dtype=torch.bool)
    for token_id in action_token_ids:
        mask |= (action_gt == token_id)
    if mask.sum() == 0:
        return None

    # Accuracy over the masked (action) positions only
    correct_preds = (action_preds == action_gt) & mask
    action_accuracy = correct_preds.sum().float() / mask.sum().float()

    return action_accuracy

The current action-prediction accuracy is shown below; I am not sure whether these results are normal.

Epoch   Train loss   Test loss   Train action accuracy   Test action accuracy
1       0.7787       0.8006      0.1609                  0.1416
2       0.6424       0.8550      0.1272                  0.1165
3       0.5552       0.9446      0.1001                  0.1016
4       0.4880       1.0350      0.0862                  0.0949
5       0.4332       1.1186      0.0785                  0.0793

LIBERO Evaluation

This code is written; I am now checking the details for problems, and next week I expect to test whether the trained model actually works.

Dataset

  • Use different normalization code for the main image and the wrist image (shown below):
    self.stats = load_stats(self.stats_path)
    self.main_tfs = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=self.stats['image']['mean'],   # [0.485, 0.456, 0.406]
            std=self.stats['image']['std']      # [0.229, 0.224, 0.225]
        )
    ])
    self.wrist_tfs = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=self.stats['wrist_image']['mean'],  # [0.512, 0.398, 0.321]
            std=self.stats['wrist_image']['std']     # [0.201, 0.189, 0.243]
        )
    ])
  • Because the action accuracy seems low, I followed the OpenVLA paper for the action data: I first collected the 1st and 99th percentiles of each action dimension and put them into stats.json; they are then used as the min action and max action, which reduces the influence of outlier actions. (Since the action has 7 dimensions, this seems to require 7 mins and maxes, which OpenVLA does not appear to handle; the exact code is still in progress. See the snippet below.)
    def compute_and_save_action_quantiles(self, quantiles=[1, 99]):
        """
        Compute the percentiles of the action data and save them into the stats.json file.
        quantiles: list of percentiles to compute, e.g. [1, 99]
        """
        stats_path = self.stats_path
        total_samples = len(self.dataset)
        all_actions = []

        for i in tqdm(range(total_samples), desc="Loading actions"):
            sample = self.dataset[i]
            actions = np.array(sample['actions'])
            all_actions.append(actions)

        all_actions = np.array(all_actions)
        print(f"Action data shape: {all_actions.shape}")

        # Compute the percentiles of each action dimension
        action_quantiles = {}
        for q in quantiles:
            action_quantiles[f'{q}th_percentile'] = np.percentile(all_actions, q, axis=0).tolist()

        # Read the existing stats.json
        with open(stats_path, 'r', encoding='utf-8') as f:
            stats_data = json.load(f)

        # Update the actions section
        if 'actions' not in stats_data:
            stats_data['actions'] = {}
        stats_data['actions'].update(action_quantiles)
        with open(stats_path, 'w', encoding='utf-8') as f:
            json.dump(stats_data, f, indent=4, ensure_ascii=False)

Problems

I think we could add some additional modules to process the state as well.



Report this week (7.25 – 7.31)

  1. Added per-dimension 1st-99th percentile ranges to the ActionTokenizer;
  2. finished most of the inference code, but the KV-cache part is not debugged yet;
  3. finished the DDP (distributed data parallel) training script;
  4. finished the action-only loss computation.

1st/99th Percentiles

The percentiles are extracted into the config, and at decoding time each action dimension is decoded with its own range.

import json
from typing import List

import numpy as np
from transformers import PreTrainedTokenizerBase


class ActionTokenizer:
    """
    An action tokenizer that uses precomputed, non-contiguous low-frequency token ids as the action space.
    Supports a different min/max value for each action dimension.
    """
    def __init__(
        self,
        tokenizer: PreTrainedTokenizerBase,
        action_token_map_path: str,
        min_actions: List[float] = None,
        max_actions: List[float] = None,
        bins: int = 256,
    ) -> None:
        self.tokenizer = tokenizer
        self.n_bins = bins
        self.min_actions = np.array(min_actions)
        self.max_actions = np.array(max_actions)
        self.action_dims = len(min_actions)

        with open(action_token_map_path, 'r') as f:
            action_map_str_keys = json.load(f)
        self.action_to_token_id = {int(k): v for k, v in action_map_str_keys.items()}

        self.token_id_to_action = {v: k for k, v in self.action_to_token_id.items()}

        # Build bins and bin centers for every action dimension
        self.bins = []
        self.bin_centers = []
        for i in range(self.action_dims):
            bins_i = np.linspace(self.min_actions[i], self.max_actions[i], self.n_bins)
            bin_centers_i = (bins_i[:-1] + bins_i[1:]) / 2.0
            self.bins.append(bins_i)
            self.bin_centers.append(bin_centers_i)

        self.bins = np.array(self.bins)                # shape: [action_dims, n_bins]
        self.bin_centers = np.array(self.bin_centers)  # shape: [action_dims, n_bins-1]

    def encode(self, action: np.ndarray) -> np.ndarray:
        """
        Core encoding: map continuous action values to the corresponding token-id array.
        This is the main function used for model training.
        """
        action = np.array(action)
        if len(action.shape) == 1:  # single action: [action_dims]
            action = action.reshape(1, -1)
            is_single = True
        else:
            is_single = False       # batched actions: [batch_size, action_dims]

        batch_size, action_dims = action.shape
        token_ids = np.zeros_like(action, dtype=int)

        for dim in range(action_dims):
            clipped_action = np.clip(action[:, dim], self.min_actions[dim], self.max_actions[dim])
            discretized_bins = np.digitize(clipped_action, self.bins[dim]) - 1
            discretized_bins = np.clip(discretized_bins, 0, self.n_bins - 1)
            mapper = np.vectorize(self.action_to_token_id.get)
            token_ids[:, dim] = mapper(discretized_bins)

        # Restore the original shape
        if is_single:
            token_ids = token_ids.squeeze(0)

        return token_ids

    def decode_token_ids_to_actions(self, action_token_ids: np.ndarray) -> np.ndarray:
        """
        Decode token ids back into action values.
        """
        action_token_ids = np.array(action_token_ids)
        if len(action_token_ids.shape) == 1:  # single action: [action_dims]
            action_token_ids = action_token_ids.reshape(1, -1)
            is_single = True
        else:
            is_single = False                 # batched actions: [batch_size, action_dims]

        batch_size, action_dims = action_token_ids.shape
        actions = np.zeros_like(action_token_ids, dtype=float)

        for dim in range(action_dims):
            mapper = np.vectorize(self.token_id_to_action.get)
            discretized_actions = mapper(action_token_ids[:, dim])
            # Guard the top bin index, which has no bin center of its own
            discretized_actions = np.clip(discretized_actions, 0, self.n_bins - 2)
            actions[:, dim] = self.bin_centers[dim][discretized_actions]

        # Restore the original shape
        if is_single:
            actions = actions.squeeze(0)

        return actions
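
A hypothetical usage sketch, assuming the per-dimension ranges were written by compute_and_save_action_quantiles above and that stats and tokenizer are already loaded:

    import numpy as np

    min_a = stats['actions']['1th_percentile']     # 7 values, one per action dimension
    max_a = stats['actions']['99th_percentile']
    at = ActionTokenizer(tokenizer, 'action_token_map.json', min_actions=min_a, max_actions=max_a)

    action = np.array([0.1, -0.2, 0.0, 0.3, -0.5, 0.05, 1.0])
    ids = at.encode(action)                          # 7 low-frequency token ids
    recovered = at.decode_token_ids_to_actions(ids)  # back to (approximately) the original action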

DDP

  1. Initialization: get the current rank (0 is the main process); all print output happens only on the main process (a setup sketch follows after this list).
  2. The dataset needs to be sharded across processes: use DistributedSampler
    train_sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
    val_sampler = DistributedSampler(val_dataset, num_replicas=world_size, rank=rank, shuffle=False)

    sampler.set_epoch(epoch)  # make sure each epoch shuffles the data differently
  3. The model needs to be wrapped:

    model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)

    When saving the model, save model.module.state_dict():

    torch.save(model.module.state_dict(), checkpoint_path)
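
A hedged sketch of the initialization from point 1, assuming the script is launched with torchrun (which sets RANK/LOCAL_RANK/WORLD_SIZE):

    import os
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    if rank == 0:  # print only on the main process
        print(f"DDP initialized with {world_size} processes")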

Action Only Loss

Idea: in the outputs, find the positions that are action tokens and compute the loss only at those positions.

# Build the action-token mask
action_only_labels = text_labels.clone()
action_mask = torch.zeros_like(action_only_labels, dtype=torch.bool)
action_token_ids = set(self.action_tokenizer.action_to_token_id.values())
for token_id in action_token_ids:
    action_mask |= (action_only_labels == token_id)

# Set non-action tokens to IGNORE_INDEX
action_only_labels[~action_mask] = IGNORE_INDEX

This turned out to have a problem. The intended fix: follow the sequence format strictly and mask out every position except the 7 action tokens (see the sketch below).
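
A sketch of that intended fix, reusing text_labels and IGNORE_INDEX from the snippet above and assuming every sample's labels end with exactly 7 action tokens followed by <eos>:

    # Mask by position instead of by token id
    action_only_labels = text_labels.clone()
    num_action_tokens = 7

    action_mask = torch.zeros_like(action_only_labels, dtype=torch.bool)
    action_mask[:, -(num_action_tokens + 1):-1] = True   # the 7 positions right before <eos>

    action_only_labels[~action_mask] = IGNORE_INDEX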

Evaluate

Everything else is fully debugged; only the model's output function predict_action is not yet fully implemented, and the main difficulty is how to implement the KV cache.

  • KV cache: only extract and concatenate the image features when there is no cache (i.e., on the first call); when a cache exists, skip these steps, because the required information is already stored in past_key_value (a rough sketch follows below).
  • Action token selection: pick the action token with the highest probability and output it.
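
A rough sketch of how predict_action could use the KV cache as described; the forward signature, past_key_values handling, and the build_multimodal_embeds helper are assumptions for illustration, not the actual Minimind-V API:

    import torch

    @torch.no_grad()
    def predict_action(self, image, prompt_ids, num_action_tokens=7):
        past_key_values = None
        generated = []
        input_ids = prompt_ids                      # [1, text_len]
        action_ids = torch.tensor(list(self.action_tokenizer.action_to_token_id.values()))

        for _ in range(num_action_tokens):
            if past_key_values is None:
                # First call: extract image features and splice them into the prompt embeddings
                inputs_embeds = self.build_multimodal_embeds(image, input_ids)   # hypothetical helper
                out = self.model(inputs_embeds=inputs_embeds, use_cache=True)
            else:
                # Later calls: feed only the newly generated token and reuse the cache
                out = self.model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)

            past_key_values = out.past_key_values
            logits = out.logits[:, -1, :]                                        # next-token logits
            ids_on_device = action_ids.to(logits.device)
            next_token = ids_on_device[logits[:, ids_on_device].argmax(dim=-1)]  # best action token
            generated.append(next_token)
            input_ids = next_token.view(1, 1)

        token_ids = torch.stack(generated, dim=1).squeeze(0).cpu().numpy()
        return self.action_tokenizer.decode_token_ids_to_actions(token_ids)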
