Deep learning is steadily moving toward embodied-intelligence tasks that require spatial awareness, in order to serve a wider range of real-world scenarios. Robot manipulation and navigation, for example, require an agent to perceive the 3D scene, understand human instructions, and make decisions through its own actions. For real-time use, a 3D perception model should have the following properties: (1) real-time operation: the input is a streaming RGB-D video rather than a pre-collected one, so visual perception must run in sync with data collection and therefore needs high inference speed; (2) fine granularity: it should recognize almost any object that appears in the scene; (3) strong generalization: a single model should work across different types of scenes and be compatible with different sensor parameters.
This article is an engineering practice and exploration of a 3D perception model for embodied vision, which serves as the foundation for various downstream tasks.
System environment:
sys.platform: linux
Python: 3.10.15 | packaged by conda-forge | [GCC 13.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1510685637
GPU 0: NVIDIA GeForce RTX 4090 Laptop GPU
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.7, V11.7.64
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 1.13.0+cu117
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.7
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.9.7 (built against CUDA 11.8)
- Built with CuDNN 8.5
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.14.0+cu117
OpenCV: 4.10.0
MMEngine: 0.10.3
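For reference, an environment summary in this format can be regenerated with MMEngine's environment collection utility. A minimal sketch, assuming the collect_env function exposed under mmengine.utils.dl_utils in MMEngine versions around 0.10.x:

from mmengine.utils.dl_utils import collect_env

# Print each environment field (platform, CUDA, PyTorch build flags, ...) on its own line
for name, value in collect_env().items():
    print(f"{name}: {value}")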
The core capture code is shown below; everything is handled through the SDK's Python interface on Linux:
import os
from ctypes import pointer, string_at

import cv2
import numpy as np

# MV3D_RGBD_FRAME_DATA / MV3D_RGBD_IMAGE_DATA and the opened `camera` handle come from
# the camera SDK's Python bindings; `save_dir` is the output directory defined earlier.
while True:
    stFrameData = MV3D_RGBD_FRAME_DATA()
    ret = camera.MV3D_RGBD_FetchFrame(pointer(stFrameData), 5000)
    if ret == 0:
        for i in range(0, stFrameData.nImageCount):
            stData = stFrameData.stImageData[i]
            print("MV3D_RGBD_FetchFrame[%d]:enImageType[%d],nWidth[%d],nHeight[%d],nDataLen[%d],nFrameNum[%d],bIsRectified[%d],enStreamType[%d],enCoordinateType[%d]" % (
                i, stData.enImageType, stData.nWidth, stData.nHeight, stData.nDataLen, stData.nFrameNum,
                stData.bIsRectified, stData.enStreamType, stData.enCoordinateType))
            if i == 0:
                # Depth image: 16-bit values, saved as PNG to keep them lossless
                # (JPEG only supports 8-bit and would corrupt the uint16 depth).
                p_depth = string_at(stData.pData, stData.nDataLen)
                depth_img = np.frombuffer(p_depth, dtype=np.uint16)
                depth_img = depth_img.reshape((stData.nHeight, stData.nWidth))
                cv2.imwrite(os.path.join(save_dir, f'{stData.nFrameNum}_depth.png'), depth_img)
                # cv2.imshow("depth_img", depth_img)
                # cv2.waitKey(0)
                # cv2.destroyAllWindows()
                # Point cloud: let the SDK map the depth frame to a float32 XYZ buffer
                cloud_img_data = MV3D_RGBD_IMAGE_DATA()
                ret_cloud = camera.MV3D_RGBD_MapDepthToPointCloud(pointer(stData), pointer(cloud_img_data))
                cloud_buf = string_at(cloud_img_data.pData, cloud_img_data.nDataLen)
                cloud = np.frombuffer(cloud_buf, dtype=np.float32)
                np.save(os.path.join(save_dir, f"{stData.nFrameNum}_point.npy"), cloud)
                print(cloud.shape)
            elif i == 1:
                # Color image: raw YUYV (YUY2) buffer, 2 bytes per pixel, converted to BGR
                rgb_buf = string_at(stData.pData, stData.nDataLen)
                yuv = np.frombuffer(rgb_buf, dtype=np.uint8)
                yuv = yuv.reshape((stData.nHeight, stData.nWidth, 2))
                bgr = cv2.cvtColor(yuv, cv2.COLOR_YUV2BGR_YUYV)
                # cv2.imshow("rgb_img", bgr)
                # cv2.waitKey(0)
                # cv2.destroyAllWindows()
                print("bgr.shape:", bgr.shape)
                cv2.imwrite(os.path.join(save_dir, f'{stData.nFrameNum}_rgb.jpg'), bgr)
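As a quick sanity check after capture, the saved frames can be loaded back and inspected. A minimal sketch; the directory and frame number below are placeholders and the file names follow the {frame}_depth.png / {frame}_rgb.jpg convention used above:

import os

import cv2
import numpy as np

save_dir = "./captures"   # hypothetical output directory, match the capture script
frame_id = 0              # hypothetical frame number

# 16-bit depth map (IMREAD_UNCHANGED keeps the uint16 values intact)
depth = cv2.imread(os.path.join(save_dir, f"{frame_id}_depth.png"), cv2.IMREAD_UNCHANGED)
print("depth:", depth.shape, depth.dtype, "max:", depth.max())

# BGR color frame
rgb = cv2.imread(os.path.join(save_dir, f"{frame_id}_rgb.jpg"))
print("rgb:", rgb.shape)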
The key calibration code is as follows. First, write a small Python helper that makes it easy to read the calibration values out of the ctypes structure:
def extract_calib_info(calib_info):
    # Extract the 3x3 intrinsic matrix (9 floats, row-major)
    intrinsic_matrix = [calib_info.stIntrinsic.fData[i] for i in range(9)]
    # Extract the distortion coefficients (12 floats)
    distortion_coefficients = [calib_info.stDistortion.fData[i] for i in range(12)]
    return {
        'intrinsic_matrix': intrinsic_matrix,
        'distortion_coefficients': distortion_coefficients,
    }
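For later steps (undistortion with OpenCV, or projecting depth into 3D), it is convenient to turn these flat lists into NumPy arrays. A minimal sketch, assuming calib_dict is a dict returned by extract_calib_info above:

import numpy as np

def to_opencv_calib(calib_dict):
    # 9 floats (row-major) -> 3x3 camera matrix K
    K = np.array(calib_dict['intrinsic_matrix'], dtype=np.float64).reshape(3, 3)
    # 12 floats -> distortion coefficient vector usable by cv2.undistort
    dist = np.array(calib_dict['distortion_coefficients'], dtype=np.float64)
    return K, dist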
Next, call the parsing helper and save the camera parameters; they will be needed later when the model extracts SuperPoint keypoints for feature extraction (save_to_json is a small helper, sketched after this block):
import pprint

# Get the depth camera calibration (in this SDK, 1 selects the depth sensor)
depth_calib = MV3D_RGBD_CALIB_INFO()
camera.MV3D_RGBD_GetCalibInfo(1, pointer(depth_calib))
depth_calib_dict = extract_calib_info(depth_calib)
save_to_json(depth_calib_dict, "depth_calib.json")
print("depth_calib_dict:")
pprint.pprint(depth_calib_dict)
# Get the RGB camera calibration (2 selects the RGB sensor)
rgb_calib = MV3D_RGBD_CALIB_INFO()
camera.MV3D_RGBD_GetCalibInfo(2, pointer(rgb_calib))
rgb_calib_dict = extract_calib_info(rgb_calib)
save_to_json(rgb_calib_dict, "rgb_calib.json")
print("rgb_calib_dict:")
pprint.pprint(rgb_calib_dict)
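save_to_json is not shown above; a minimal sketch of such a helper, using only the standard library json module, could look like this:

import json

def save_to_json(data, path):
    # Dump the calibration dict to disk as pretty-printed JSON
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2)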
Depth image:
Corresponding RGB image:
Depth camera and RGB camera parameter files (depth_calib.json first, then rgb_calib.json):
{
"intrinsic_matrix": [
669.1973266601562,
0.0,
630.7010498046875,
0.0,
669.1973266601562,
342.984375,
0.0,
0.0,
1.0
],
"distortion_coefficients": [
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0
]
}
{
"intrinsic_matrix": [
641.2483520507812,
0.0,
640.4540405273438,
0.0,
641.2787475585938,
358.39166259765625,
0.0,
0.0,
1.0
],
"distortion_coefficients": [
-0.2017442286014557,
0.03686028718948364,
1.0394737728347536e-05,
-8.135665120789781e-05,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0
]
}
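With the intrinsics above, a depth map can also be back-projected into a point cloud on the host side (the same pinhole-model math that MV3D_RGBD_MapDepthToPointCloud performs inside the SDK). A minimal sketch, under the assumption that the depth values are in millimeters and that fx, fy, cx, cy come from the depth camera's intrinsic matrix above:

import numpy as np

def depth_to_points(depth_mm, fx, fy, cx, cy):
    # Back-project a uint16 depth map (millimeters) to an N x 3 point cloud in meters
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_mm.astype(np.float32) / 1000.0          # mm -> m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[z.reshape(-1) > 0]                   # drop invalid (zero-depth) pixels

# Example with the depth intrinsics reported above:
# points = depth_to_points(depth, 669.197, 669.197, 630.701, 342.984)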
Point cloud data:
To make model processing easier, the point cloud is saved directly in .npy format, organized with reference to the SceneNN dataset format. The depth maps, RGB images, and point clouds above still need to be converted into the input format the model accepts; this will be covered in detail in a follow-up article.
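As a small preview of that preprocessing, a saved point cloud can be loaded, reshaped to N x 3, and stripped of invalid points before being fed to a model. A minimal sketch; the file name is purely illustrative:

import numpy as np

cloud = np.load("0_point.npy").reshape(-1, 3)       # flat float32 buffer -> N x 3 XYZ
valid = np.isfinite(cloud).all(axis=1) & (np.linalg.norm(cloud, axis=1) > 0)
cloud = cloud[valid]                                 # drop zero / non-finite points
print("valid points:", cloud.shape[0])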