Seeking Guidance for Autonomous Tracking Project (Simulation + camera grabbing issues)

Hi BlueRobotics Team and Community,

I’m a new member (just started a few days ago!) and I’m diving into making my BlueROV2 autonomous. My goal is to implement real-time biological tracking (e.g., following a diver) using YOLO for video processing on a surface-side NVIDIA machine.

So far, I’ve successfully assembled the ROV, operated it via QGroundControl/Joystick, and managed to Arm/Disarm and control motors using Pymavlink. However, I’m hitting a wall regarding the video stream access in Python and setting up a reliable simulation environment.

I’ve realized that Windows/WSL isn’t ideal for Gazebo, so I’ve moved my workspace to Linux (Ubuntu 22.04) on my NVIDIA machine. I want to use SITL and Gazebo to test my tracking algorithms before going into the water, but I’m struggling to find an up-to-date tutorial that works for ArduSub in 2026.

I have a few specific questions to avoid reinventing the wheel:

  1. SITL/Gazebo: Which version of Gazebo (Classic vs. Garden/Ionic) is currently recommended for the most stable BlueROV2 simulation on Ubuntu 22.04?

  2. ROS vs. Pymavlink: For a complex task like YOLO-based tracking with multiple cameras, would you recommend switching to ROS 2 (Humble) via MAVROS, or is staying with Pymavlink viable for high-level autonomy?

  3. Video Stream: What is the most efficient way to “grab” the BlueROV2 H.264 stream into an OpenCV/Python pipeline for AI processing with minimal latency?

  4. Existing Projects: Are there any open-source “autonomous follower” projects for ArduSub that I should look into for inspiration?

Edit: To provide more context, our goal is to track species (but we are beginning with a human) in open, deep water (a low-clutter environment). The mission starts with the target in close proximity to the ROV, from which point the autonomous tracking takes over. Our current hardware integration on the BlueROV2 includes:

  • Stereo Camera: Used for real-time depth estimation and calculating the relative distance to the target.

  • Fluxmeter: Used to deduce water current speed, allowing us to estimate the ROV’s absolute velocity for more precise station-keeping and pursuit.

Thank you for your help!

Hi @Nath -

I’ll try and answer what I can!

  1. I’ve not done much with Gazebo simulation, but SITL is very easy with BlueOS!
  2. I’ve never used ROS, and it’s not particularly popular with the Blue Robotics software team. If your topside computer is acting as the pilot based on its processing of the video stream, I’d think pymavlink is all you need, as you’ll just be sending simulated joystick commands?
  3. Changing your video stream from the default UDP to RTSP should allow multiple clients to access the stream - both Cockpit (beta 1.1.8 for RTSP support) and your code. OpenCV consumes RTSP pretty easily - see the sketch just after this list!
  4. I’m not aware of any (open source) follower projects.
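
A minimal sketch of what that looks like in OpenCV (the URL below is a placeholder - use whatever RTSP endpoint BlueOS actually shows for your stream):

```python
import cv2

# Placeholder URL - substitute the RTSP endpoint BlueOS advertises for your camera
RTSP_URL = "rtsp://192.168.2.2:8554/video_stream_0"

cap = cv2.VideoCapture(RTSP_URL, cv2.CAP_FFMPEG)
if not cap.isOpened():
    raise RuntimeError(f"Could not open {RTSP_URL}")

while True:
    ok, frame = cap.read()          # frame is a BGR numpy array
    if not ok:
        break
    # Hand the frame to your YOLO detector here
    cv2.imshow("BlueROV2", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```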

What stereo camera are you using?

Hi @tony-white ,

Thank you for your fast answer :slight_smile:

Regarding Gazebo, I managed to get it running with ArduSub SITL on Ubuntu 22.04 (Jetson Orin Nano). It runs a bit slow but the key advantage is that I can test PyMAVLink scripts directly against the simulated ROV — which I couldn’t achieve with BlueOS SITL (I couldn’t find a way to send external PyMAVLink commands to it, possibly a port/endpoint configuration issue?).
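
For context, here is a rough sketch of how I connect and arm against the SITL instance (the udpin endpoint is just an example - it depends on where the SITL's MAVLink output is exposed):

```python
from pymavlink import mavutil

# Example endpoint - adjust to wherever your SITL (or BlueOS) forwards MAVLink
master = mavutil.mavlink_connection("udpin:0.0.0.0:14550")
master.wait_heartbeat()
print(f"Heartbeat from system {master.target_system}, component {master.target_component}")

# Arm the vehicle - the same call is used later against the real autopilot
master.mav.command_long_send(
    master.target_system,
    master.target_component,
    mavutil.mavlink.MAV_CMD_COMPONENT_ARM_DISARM,
    0,                   # confirmation
    1, 0, 0, 0, 0, 0, 0  # param1 = 1 -> arm
)
master.motors_armed_wait()
print("Armed")
```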

My next Gazebo question (for the community): is it possible to place a virtual object next to the BlueROV2 in the Gazebo world and access a simulated camera feed from Python/OpenCV? The goal would be to close the loop: run the detection code on the simulated feed and send tracking commands back to SITL, all without touching real hardware.
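
If the simulated camera can be published as an RTP/H.264 stream over UDP (the same way the real vehicle's default stream works), I'm assuming I could grab it with OpenCV's GStreamer backend, roughly like this (port 5600 and a GStreamer-enabled OpenCV build are assumptions on my side):

```python
import cv2

# Assumes an RTP/H.264 stream on UDP port 5600 and an OpenCV build with GStreamer support
pipeline = (
    "udpsrc port=5600 ! application/x-rtp,payload=96 ! "
    "rtph264depay ! avdec_h264 ! videoconvert ! appsink drop=true sync=false"
)
cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Run detection on `frame` and feed the result to the PyMAVLink control loop

cap.release()
```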

Regarding visual servoing - yes, exactly. The idea is to compute an error signal directly from the image (target position relative to image center, bounding box size for distance) and map it to ROV commands (yaw, heave, surge). No GPS needed, which is critical for underwater use. The stereo camera would give us the actual distance to the target. What do you think of this idea?
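
To make the mapping concrete, this is the kind of proportional control I have in mind, sent as MANUAL_CONTROL messages over PyMAVLink (the gain and error normalisation are placeholders to be tuned; z uses the ArduSub convention of 0-1000 with 500 as neutral):

```python
def send_tracking_command(master, err_x, err_y, err_dist, gain=400):
    """Proportional mapping from normalised errors (-1..1) to MANUAL_CONTROL axes.

    err_x    - horizontal offset of the target from image center -> yaw
    err_y    - vertical offset of the target from image center   -> heave
    err_dist - (measured - desired) distance to the target       -> surge
    """
    clamp = lambda v, lo, hi: int(max(lo, min(hi, v)))
    surge = clamp(gain * err_dist, -1000, 1000)   # x: forward/backward
    yaw = clamp(gain * err_x, -1000, 1000)        # r: rotate toward the target
    heave = clamp(500 - gain * err_y, 0, 1000)    # z: 500 is neutral throttle on ArduSub

    master.mav.manual_control_send(
        master.target_system,
        surge,   # x
        0,       # y: lateral, unused here
        heave,   # z
        yaw,     # r
        0,       # buttons
    )
```

The plan is to call this at a steady rate from the detection loop, so the autopilot keeps receiving pilot input continuously (and the pilot-input failsafe, if enabled, doesn't trigger).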

One additional detail on our stereo setup: we built it using two BlueRobotics Low-Light cameras mounted at a fixed baseline. We are computing a live disparity map to estimate the metric distance to any detected target — this gives us the Z axis for 3D tracking without relying on the depth sensor alone.
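
Concretely, we go from the disparity map to metric range with the usual pinhole relation Z = f·B/d; a stripped-down sketch (the focal length and baseline below are placeholders for our calibrated values):

```python
import cv2
import numpy as np

focal_px = 800.0    # placeholder - focal length in pixels from our calibration
baseline_m = 0.12   # placeholder - measured distance between the two cameras

stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=9)

def distance_map(left_gray, right_gray):
    # left_gray / right_gray are rectified grayscale frames
    # StereoSGBM returns fixed-point disparities scaled by 16
    disparity = stereo.compute(left_gray, right_gray).astype(np.float32) / 16.0
    disparity[disparity <= 0] = np.nan           # mask invalid / unmatched pixels
    return focal_px * baseline_m / disparity     # metric range per pixel (meters)
```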

Hi @Nath -

BlueOS should have been the easier solution to running SITL - did you switch the board type under Autopilot Firmware to SITL and select the firmware you’d like to use? That’s literally all it takes!

If you weren’t able to send PyMAVLink to the SITL instance, then you may struggle to send it to the standard Autopilot when you’re ready for real-world testing. If you share your code, we can provide feedback on what may be going wrong with how you’re targeting the messages!

It should be possible to place objects in your Gazebo simulation, and even have them move around…

Your approach for visual tracking seems reasonable! Have you calibrated your DIY stereo camera with a checkerboard in OpenCV? And how are you ensuring the images from the two USB cameras are captured at exactly the same instant? That's the usual shortfall of such approaches: any time offset between the two captures, combined with motion in the camera's view, can result in large inaccuracies… I've used simple cameras that return both views in a single wider image (to be cut in half) to ensure sync is maintained. The challenge with those is their uncompressed or MJPEG output, which isn't supported by QGC/Cockpit for streaming and would use a ton of bandwidth to send to your surface GPU… modules like that with H.264 output do exist though!
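
With one of those side-by-side modules, splitting the combined frame back into left/right views is trivial (the device index is just an example):

```python
import cv2

cap = cv2.VideoCapture(0)   # example device index for the combined stereo stream

ok, frame = cap.read()
if ok:
    h, w = frame.shape[:2]
    left, right = frame[:, :w // 2], frame[:, w // 2:]
    # left and right come from the same capture instant, so the disparity stays consistent
```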

Additionally, I’m curious if your cameras are looking through a flat clear surface, or the curved wall of a clear enclosure tube…