Seeking Guidance for Autonomous Tracking Project (Simulation + camera grabbing issues)

Hi BlueRobotics Team and Community,

I’m a new member (just started a few days ago!) and I’m diving into making my BlueROV2 autonomous. My goal is to implement real-time biological tracking (e.g., following a diver) using YOLO for video processing on a surface-side NVIDIA machine.

So far, I’ve successfully assembled the ROV, operated it via QGroundControl/Joystick, and managed to Arm/Disarm and control motors using Pymavlink. However, I’m hitting a wall regarding the video stream access in Python and setting up a reliable simulation environment.

I’ve realized that Windows/WSL isn’t ideal for Gazebo, so I’ve moved my workspace to Linux (Ubuntu 22.04) on my NVIDIA machine. I want to use SITL and Gazebo to test my tracking algorithms before going into the water, but I’m struggling to find an up-to-date tutorial that works for ArduSub in 2026.

I have a few specific questions so I can avoid “reinventing the wheel”:

  1. SITL/Gazebo: Which version of Gazebo (Classic vs. Garden/Ionic) is currently recommended for the most stable BlueROV2 simulation on Ubuntu 22.04?

  2. ROS vs. Pymavlink: For a complex task like YOLO-based tracking with multiple cameras, would you recommend switching to ROS 2 (Humble) via MAVROS, or is staying with Pymavlink viable for high-level autonomy?

  3. Video Stream: What is the most efficient way to “grab” the BlueROV2 H.264 stream into an OpenCV/Python pipeline for AI processing with minimal latency?

  4. Existing Projects: Are there any open-source “autonomous follower” projects for ArduSub that I should look into for inspiration?

Edit: To provide more context: our goal is to track species (but we are beginning with a human) in open, deep water (low-clutter environment). The mission starts with the target in close proximity to the ROV, from which point the autonomous tracking takes over. Our current hardware integration on the BlueROV2 includes:

  • Stereo Camera: Used for real-time depth estimation and calculating the relative distance to the target.

  • Fluxmeter: Used to deduce water current speed, allowing us to estimate the ROV’s absolute velocity for more precise station-keeping and pursuit.

Thank you for your help!

Hi @Nath -

I’ll try and answer what I can!

  1. I’ve not done much with Gazebo simulation, but SITL is very easy with BlueOS!
  2. I’ve never used ROS, and it’s not particularly popular with the Blue Robotics software team - if your topside computer is acting as a pilot based on its processing of the video stream, I’d think that pymavlink is all you need, as you’ll just be sending simulated joystick commands?
  3. Changing your video stream from the default UDP to RTSP should allow multiple clients to access the stream - both Cockpit (beta 1.1.8 for RTSP support) and your code. OpenCV consumes RTSP pretty easily - see the sketch after this list!
  4. I’m not aware of any (open source) follower projects.
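
For reference, a minimal example of pulling an RTSP stream into OpenCV looks something like this (the URL is just an example - the actual path depends on how the stream is configured in BlueOS):

import cv2

# Example RTSP URL - check the BlueOS Video Streams page for the real path
url = "rtsp://192.168.2.2:8554/video_stream__dev_video2"
cap = cv2.VideoCapture(url, cv2.CAP_FFMPEG)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # frame is a BGR numpy array, ready for any OpenCV / AI processing
    cv2.imshow("BlueROV2", frame)
    if cv2.waitKey(1) == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()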

What stereo camera are you using?

Hi @tony-white ,

Thank you for your fast answer :slight_smile:

Regarding Gazebo, I managed to get it running with ArduSub SITL on Ubuntu 22.04 (Jetson Orin Nano). It runs a bit slow but the key advantage is that I can test PyMAVLink scripts directly against the simulated ROV — which I couldn’t achieve with BlueOS SITL (I couldn’t find a way to send external PyMAVLink commands to it, possibly a port/endpoint configuration issue?).
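
For reference, this is roughly how I connect PyMAVLink to the SITL instance (the UDP endpoint is an example - the port depends on how SITL/MAVProxy exposes its outputs):

from pymavlink import mavutil

# Listen for the SITL output on UDP - port 14550 is just an example endpoint
master = mavutil.mavlink_connection("udpin:0.0.0.0:14550")
master.wait_heartbeat()
print(f"Heartbeat from system {master.target_system}, component {master.target_component}")

# Quick sanity check: arm the vehicle
master.mav.command_long_send(
    master.target_system, master.target_component,
    mavutil.mavlink.MAV_CMD_COMPONENT_ARM_DISARM, 0,
    1, 0, 0, 0, 0, 0, 0)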

My next Gazebo question (for the community): is it possible to place a virtual object next to the BlueROV2 in the Gazebo world and access a simulated camera feed from Python/OpenCV? The goal would be to close the loop - run detection code on the simulated feed and send tracking commands back to SITL, all without touching real hardware.

Regarding visual servoing - yes, exactly. The idea is to compute an error signal directly from the image (target position relative to the image center, bounding box size for distance) and map it to ROV commands (yaw, heave, surge). No GPS needed, which is critical for underwater use. The stereo camera would give us the true distance to the target. What do you think of this idea?
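
To make the idea concrete, here is a rough sketch of the control law we have in mind (the gains, the bounding-box format, and the MANUAL_CONTROL mapping are placeholders, not our final implementation):

# Sketch: map a detection's pixel error to joystick-like inputs via pymavlink.
# bbox is assumed to be (x, y, w, h) in pixels; gains are illustrative only.
def track_step(master, bbox, frame_w, frame_h, target_box_h=200):
    x, y, w, h = bbox
    cx, cy = x + w / 2, y + h / 2

    err_x = (cx - frame_w / 2) / (frame_w / 2)   # -1..1, horizontal offset from image centre
    err_y = (cy - frame_h / 2) / (frame_h / 2)   # -1..1, vertical offset from image centre
    err_z = (target_box_h - h) / target_box_h    # proxy for distance to the target

    yaw   = int(400 * err_x)     # turn toward the target
    heave = int(-400 * err_y)    # keep the target vertically centred
    surge = int(400 * err_z)     # close in until the box reaches the desired size

    # MANUAL_CONTROL: x=surge, y=lateral, z=heave (0..1000, 500 neutral), r=yaw
    master.mav.manual_control_send(master.target_system,
                                   surge, 0, 500 + heave, yaw, 0)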

One additional detail on our stereo setup: we built it using two BlueRobotics Low-Light cameras mounted at a fixed baseline. We are computing a live disparity map to estimate the metric distance to any detected target — this gives us the Z axis for 3D tracking without relying on the depth sensor alone.
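
In simplified form, the distance computation looks like this (a sketch only - the real pipeline rectifies the images first, and fx and the baseline come from our calibration):

import cv2
import numpy as np

# Simplified sketch: metric distance from a disparity map of rectified image pairs.
# fx is the focal length in pixels, baseline is the camera separation in metres.
def target_distance(left_gray, right_gray, fx, baseline, bbox):
    stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=96, blockSize=7)
    disparity = stereo.compute(left_gray, right_gray).astype(np.float32) / 16.0

    x, y, w, h = bbox
    d = np.median(disparity[y:y + h, x:x + w])  # median disparity inside the detection
    if d <= 0:
        return None  # no valid disparity for this target

    return fx * baseline / d  # Z = f * B / d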

Hi @Nath -

BlueOS should have been the easier solution to running SITL - did you switch the board type under Autopilot Firmware to SITL and select the firmware you’d like to use? That’s literally all it takes!

If you weren’t able to send PyMAVLink to the SITL instance, then you may struggle to send it to the standard Autopilot when you’re ready for real-world testing. If you share your code, we can provide feedback on what may be going wrong with how you’re targeting the messages!

It should be possible to place objects in your Gazebo simulation, and even have them move around…

Your approach for visual tracking seems reasonable! Have you calibrated your DIY stereo camera with a checkerboard in OpenCV? How are you ensuring images from the two USB cameras are captured at the exact same instant? This is the usual shortfall of such approaches - any timing disparity between the two captures, combined with motion in the camera’s view, can result in large inaccuracies… I’ve used simple cameras like this that return both images in a single wider frame (to be cut in half), which ensures sync is maintained… however the challenge with these is their uncompressed MJPEG output, which not only isn’t supported by QGC/Cockpit for streaming, but would also use a ton of bandwidth to stream to your surface GPU… modules like that with H.264 output do exist!
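
If it helps, the usual OpenCV checkerboard recipe looks roughly like this (a bare sketch - the board dimensions, square size, and image paths are placeholders for whatever your setup uses):

import cv2
import glob
import numpy as np

# Placeholder checkerboard: 9x6 inner corners, 25 mm squares
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * 0.025

obj_pts, left_pts, right_pts = [], [], []
for lf, rf in zip(sorted(glob.glob("left/*.png")), sorted(glob.glob("right/*.png"))):
    left = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    right = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    ok_l, corners_l = cv2.findChessboardCorners(left, pattern)
    ok_r, corners_r = cv2.findChessboardCorners(right, pattern)
    if ok_l and ok_r:
        obj_pts.append(objp)
        left_pts.append(corners_l)
        right_pts.append(corners_r)

# Calibrate each camera, then solve for the relative pose (R, T) between them
_, K1, D1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, left.shape[::-1], None, None)
_, K2, D2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, left.shape[::-1], None, None)
_, K1, D1, K2, D2, R, T, _, _ = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, D1, K2, D2, left.shape[::-1],
    flags=cv2.CALIB_FIX_INTRINSIC)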

Additionally, I’m curious if your cameras are looking through a flat clear surface, or the curved wall of a clear enclosure tube…

Hi @tony-white,

Thank you for your feedback! Our project has made significant progress.

We are now implementing Visual Servoing (IBVS) using YOLOv8 for real-time detection via the BlueROV2 camera. To handle the depth estimation, we are using a DIY stereo camera setup (two Blue Robotics USB cameras with H.264) to calculate the precise distance to the target. Regarding calibration, the cameras look through a flat clear surface and we use OpenCV rectification to handle the fisheye distortion. We have indeed used a checkerboard for calibration. We haven’t tested it underwater yet, and we are aware of the added complexity regarding refractive indices, but the results in air are promising. We have decided to move away from SITL/Gazebo for now to focus on hardware integration.

However, we are facing a critical issue with video latency and framerate over RTSP.

Technical Setup & Latency Issues: I am using cv2.VideoCapture(url, cv2.CAP_FFMPEG) with the following RTSP stream from BlueOS: rtsp://192.168.2.2:8554/video_stream__dev_video2.

  • Double Streaming: To maintain manual control, I’ve created two separate streams in BlueOS: one RTSP stream for my PyMAVLink/OpenCV script and one UDP stream for QGroundControl. Without this separation (a single RTSP stream shared with the joystick piloting in QGroundControl), the manual control lag was unbearable. Could this dual-stream setup be saturating the bandwidth and causing our low framerate (~10 FPS)?

  • The Latency Gap: We are experiencing a delay of up to 1 second. It’s puzzling because the QGC stream is perfectly responsive (over UDP, I assume), while the PyMAVLink/OpenCV RTSP fetch is not.

  • Threading Implementation: To mitigate processing lag, I have implemented a dedicated threading class to capture frames. This ensures that the cap.read() loop runs independently in the background, allowing the main script to always grab the “freshest” available frame without being blocked by the YOLO inference time. Even with this, the network-induced latency remains.

Future Considerations: I am concerned that this lag will worsen with the stereo camera integration. I considered moving the processing directly to the Raspberry Pi (Onboard), but I doubt the Pi’s CPU/GPU can handle YOLOv8 inference at an acceptable speed. We are using a standard Fathom ROV Tether, so bandwidth shouldn’t be the bottleneck.

Is there a specific configuration in BlueOS or a different capture backend in OpenCV that you would recommend to achieve a “real-time” responsiveness similar to what QGC provides?

Best regards, Nath

Hi @Nath -

The MJPG duplicate stream is going to have HUGE latency because it is uncompressed, in an unsupported format for QGC / Cockpit - I would recommend against using video0 for your application.

In general, syncing the two USB low-light cameras is likely to prove difficult - to get good stereo depth, the images from each camera have to be taken at the same instant (unless your scene is static), which is very hard to ensure with any non-real-time operating system. The latency you’re seeing with the MJPG stream is expected, and in general QGC does not support that format.

Your best bet is to use a camera like the one previously linked, or a more advanced model that can handle processing or sync onboard, like the MarineSitu C3 camera. Note that a calibration in air will not be sufficient for operation in water - the changeover will require a fresh calibration. The Pi CPU definitely can’t handle much machine vision processing onboard!

The Fathom-X tether interface provides less bandwidth than a standard 100 Mbit Ethernet link, and the Fathom tether is not rated as Cat5/Cat6 Ethernet cable - this, combined with the excessive bandwidth usage of MJPG, is definitely limiting you.

In short, I don’t think you’re going to achieve reliable stereo depth imaging with those cameras, as they can’t be sync’d, so even if you use udp or rtsp stream for both the GCS (QGC or Cockpit) and your YOLO inference you’re not going to get accurate depth results…

Hi @tony-white ,

Thank you again for your answer. Okay, I didn’t know about the MJPEG latency. Still, it doesn’t explain why I get latency through the RTSP camera link (/dev/video2) when I view the stream with OpenCV (it could reach 1 second of latency sometimes!).

Another thing: I don’t understand why the camera connection is worse when the BlueROV2 is underwater (it is obvious in terms of lag, latency, and quality), even though it is the same tether connection underwater and above water…

Thank you for the confirmation about the Raspberry Pi! And the synchronisation issue you are raising is a real concern - we will see what we can do for the stereo setup.

For the QGroundControl connection, H.264 over UDP works well, BUT the other day my team and I tried taking the ROV deeper than 60 meters (we have a 200 m slim tether), and we were surprised to see very poor camera quality and poor control beyond 60 m. One example screenshot at 65 meters (and it was obviously worse below 65 meters):

At this depth, there was clearly latency in the commands, and sometimes the ROV was not responsive at all.

And a comparison from the same day at 20 meters:

We are quite puzzled about this… because we have selected the best encoding (H.264) and endpoint type (UDP). Maybe it means we should not take this Fathom ROV tether (slim, 200 meters) beyond a certain depth, although no such limit seems to be written in the description?

Thank you for your insights,

Nath

Hi @Nath -

That is quite unusual. I’ve recently operated my ROV at 175 m with the same tether - performance should not change with depth at all! I’d suspect that water is entering the outer jacket and changing performance, or you have a bad connection. Install the tether diagnostic extension from the Extension Manager and share a screenshot of the result - ideally both with the tether spooled up and with it unspooled to a similar distance. You may want to also take a screenshot of it when you next dive to that depth!

As for your latency, is that present with both RTSP and UDP in your OpenCV connection? Using both stream types for the same device, video2, isn’t possible… It’s hard to say why you have lag without reviewing your code - I’d guess whatever processing you’re doing is adding the latency. In general, Cockpit and QGC use pipelines optimized to minimize latency.

Hi @tony-white ,

Here is our first screenshot with the tether spooled:

We will try in water too. When the ROV is 200 meters away but at less than 30 meters of depth, the connection is good, though - so maybe it is the water pressure forcing water into the tether and disturbing the connection?

For now, we can only get the OpenCV stream over RTSP, as you advised in your first message. I have noticed that the connection is better when only my OpenCV script is running, without QGroundControl. But there is still latency - as you mentioned, that could simply be because my handling of the stream in code is not perfect… Do you know a Python pipeline that gives a good, low-latency/lag-free OpenCV stream from the BlueROV2 camera?

No @Nath -

That screenshot looks ok.

Try updating to the latest BlueOS 1.5 beta version and see if that makes a difference? It will only show up in pirate mode, under the BlueOS Version menu. There have been recent improvements to the Mavlink Camera Manager.

Otherwise, the latency is due to your own OpenCV code and the speed of the topside computer running it?
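
If you want to experiment with reducing latency on the OpenCV side, two things worth trying (both are sketches - the FFmpeg options go through an environment variable that OpenCV reads, and the GStreamer variant only works if your OpenCV build has GStreamer support):

import os
import cv2

# Option 1: pass low-delay options to OpenCV's FFmpeg backend before opening the capture
os.environ["OPENCV_FFMPEG_CAPTURE_OPTIONS"] = "rtsp_transport;udp|fflags;nobuffer|flags;low_delay"
cap = cv2.VideoCapture("rtsp://192.168.2.2:8554/video_stream__dev_video2", cv2.CAP_FFMPEG)

# Option 2: a GStreamer pipeline that keeps only the newest frame (requires OpenCV built with GStreamer)
pipeline = (
    "rtspsrc location=rtsp://192.168.2.2:8554/video_stream__dev_video2 latency=0 ! "
    "rtph264depay ! h264parse ! avdec_h264 ! videoconvert ! "
    "appsink drop=true max-buffers=1 sync=false"
)
cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)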

Okay thank you!

Yes - we finally figured out that those stripes appear when there is a lack of light.

Okay, I will try the latest 1.5 beta version :slight_smile:

Separately, we have tried the Heavy kit and noticed that without the round Heavy buoyancy blocks, the ROV still has good neutral buoyancy… is it normal/necessary to have them?

Buoyancy tuning (with foam and ballast weights) is dependent on your payload and the density of the water you’re operating in. If you don’t need them that’s fine, though you likely want to keep the covers on just for reduced drag.

Since users with the heavy kit often have some kind of payload / extra integration to carry, we include extra buoyancy foam in the kit to help compensate for that :slight_smile:

Hi @tony-white, @EliotBR,

Thank you for your answers.

Following your advice, we tried the beta version, but we noticed it was worse for getting the MAVLink video streams. In fact, we are beginning to question the feasibility of getting three streams (the H.264 camera plus two Blue Robotics USB HD cameras) over RTSP and using them in Python. We have a system that works maybe 10 times, sometimes fewer, and then we get the error “Fatal Python error: Segmentation fault” and/or we sometimes struggle to get the RTSP link to the ROV, and we are forced to reboot the ROV through the BlueOS interface. Although we try to close all the streams properly and keep the buffer at its minimum, we get these errors all the time (maybe it is not possible to have more than one or two video streams with a surface computer linked to the BlueROV2? Or at least impossible with a non-optical tether?).

By any chance, do you have Python code for a surface computer that can handle three streams cleanly? Or do I have to use ROS? What do you think?

N.B.: Here is the code we are using - it is a class that grabs the stream:

import cv2
import threading
import time

class GolemStream:
    def __init__(self, url, name="stream"):
        self.url  = url
        self.name = name

        # Initialize VideoCapture with FFMPEG backend and timeout settings
        self.cap  = cv2.VideoCapture(url, cv2.CAP_FFMPEG, [
            cv2.CAP_PROP_OPEN_TIMEOUT_MSEC, 5000,  # max time to open stream
            cv2.CAP_PROP_READ_TIMEOUT_MSEC,  2000, # max time to read a frame
        ])

        # Check if stream opened correctly
        if not self.cap.isOpened():
            raise RuntimeError(f"Cannot open stream: {url}")

        # Reduce internal buffer size to minimize latency
        self.cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)

        # Shared data
        self.ret           = False
        self.frame         = None
        self.stopped       = False
        self.frame_counter = 0
        self.t             = None

        # SINGLE lock for both:
        # - frame access
        # - VideoCapture access
        # Ensures that cap.read() and cap.release() never run simultaneously
        self.lock = threading.Lock()

    def start(self):
        # Start the update thread
        self.t = threading.Thread(target=self.update, daemon=True)
        self.t.start()
        return self

    def update(self):
        consecutive_failures = 0

        # Main loop: runs until stopped flag is set
        while not self.stopped:

            # Acquire lock BEFORE accessing VideoCapture
            # stop() also uses the same lock before releasing cap
            # → ensures no overlap between read() and release()
            with self.lock:
                if self.cap is None or not self.cap.isOpened():
                    break
                ret, frame = self.cap.read()

            # Lock released → process result without blocking stop()

            if ret:
                consecutive_failures = 0

                # Update shared frame data under lock
                with self.lock:
                    self.ret           = ret
                    self.frame         = frame
                    self.frame_counter += 1

            else:
                consecutive_failures += 1

                # Print warning every 10 failures (but not every frame)
                if consecutive_failures % 10 == 1:
                    print(f"⚠️  [{self.name}] {consecutive_failures} frames invalides consécutives")

                # Small sleep to avoid busy loop
                time.sleep(0.01)

        print(f"[{self.name}] Thread update() ended properly.")

    def get_latest_frame(self):
        # Thread-safe access to latest frame
        with self.lock:
            return self.ret, self.frame

    def get_frame_identity(self):
        # Returns frame counter (useful to detect new frames)
        with self.lock:
            return self.frame_counter

    def stop(self):
        print(f"⏳ [{self.name}] Stream closing...")

        # Signal thread to stop
        self.stopped = True

        # Acquire lock before releasing VideoCapture
        # If update() is currently in cap.read(), we wait here
        # Then release cap safely
        with self.lock:
            if self.cap is not None:
                self.cap.release()
                self.cap = None

        # After cap is None, update() will exit at next iteration
        # Wait for thread to finish (max 2 seconds)
        if self.t is not None and self.t.is_alive():
            self.t.join(timeout=2.0)

            if self.t.is_alive():
                print(f"⚠️  [{self.name}] Thread still alive")
            else:
                print(f"✅ [{self.name}] Stream closed properly.")
        else:
            print(f"✅ [{self.name}] Stream closed (thread already stopped).")


USAGE: stream_m = GolemStream(video_url_middle, name="middle").start()

Hi @Nath -

Where are you running your code? The Pi itself doesn’t have a lot of computing horsepower, and dealing with uncompressed MJPEG video can max it out very quickly. A quick check with an LLM seems to confirm my suspicion:

cap.read() does network I/O plus full software decode (almost certainly H.264 via FFmpeg). On a Pi 4, a single 1080p H.264 stream is roughly 30–60% of one core; 720p is closer to 15–30%. The Pi 4's hardware H.264 decoder isn't used by stock OpenCV/FFmpeg builds unless you've specifically built FFmpeg against h264_v4l2m2m, so assume software decode. The Pi 5 removed the hardware decoder entirely, so it's CPU-only there. A Pi Zero 2W will struggle with even one 1080p stream.

The variable name stream_m (middle) hints that you may run several of these in parallel — that's where it gets dangerous. Three concurrent 1080p decodes will likely saturate a Pi 4. If you have control over the source, dropping to 720p or lowering the bitrate buys you a lot of headroom.

You may also be encountering hardware power limits on the 5V bus - I set up 4x exploreHD cameras and found that connecting them directly to the USB ports does not work (beyond 2 connections). The next two cameras I had to power from a separate 5V source (that shares ground with the 5V supply used for the Pi). This was easy - simply removing the 5V wire from the USB connector of those extra cameras and connecting it to another UBEC. In a red-pill terminal, type dmesg to check the kernel logs - if lots of USB disconnects are present, you most likely have power supply issues!

Hi @tony-white,

We are not running the code on the Raspberry Pi but on a surface computer (a MacBook Air with the M4 chip, really efficient for computation). Moreover, we do not use MJPEG but only H.264, as you advised us. We have the center camera and two USB cameras connected to the USB ports - so you advise us to pull the 5V power wire from one of the USB connectors and connect it to another power source?

How did you manage, code-wise, to get 4 streams? Do you have GitHub code or something we can try out? We have the feeling that our code is part of the problem too…

We have looked for other solutions (to get three smooth, good-quality streams), and maybe one could be MediaMTX - do you know it?

Thank you,

Hi @Nath -

MediaMTX is great for streaming around a network, but I’m not sure it’s applicable to your use case.

For this extension, I manually set all 4 streams to RTSP, run the latest BlueOS 1.5 beta 36 (which has lots of updates to Mavlink Camera Manager), and disabled mavlink, lazy, and foxglove for each camera on the Video Streams page. The extension then streams all 4 H.264 streams to a browser (I told the LLM to copy Cockpit lite’s approach with WebRTC), while simultaneously writing them to disk. It did work with a good SD card on a Pi 5, but was not too far from saturating the write speed. With an NVMe drive installed it works great - hardware will be deployed in the field within the next couple of weeks, and this guide will likely be updated, or have a link added to a dedicated write-up.

That must be quite a large enclosure to fit an M4 air! Is it cylindrical?

While very efficient, it seems like vast overkill - you can fit a Pi5 in a 3" enclosure fairly easily! While the Pi5 might be able to draw a few amps at 5V, I’d think the Mac could draw significantly more current - more efficient, but also much much more powerful after all!

Hi @tony-white ,

Thanks for your quick answer and the GitHub link. I wonder why you disabled mavlink, etc.? And out of curiosity, where did you find Cockpit lite’s WebRTC approach?

I think there has been a little misunderstanding regarding what we are using/doing: there is a Raspberry Pi inside our BlueROV2, which has a standard-size aluminium enclosure. We want to control the ROV based on visualisation and analysis of its three camera streams. For that, we are using a computer at the surface connected to the ROV via the tether (Fathom-X Topside Interface). The computer is not inside the ROV. All the data transit through the tether, and the surface computer analyses them. Our issue is that we are struggling to get a stable connection for those streams. We don’t have a Pi 5 or any Pi other than the one in the BlueROV2. Up to now we have been trying to make this system work on the surface computer (Mac). Do you think it is not possible (we have a slim tether, not optical)?

Hi @Nath -

Since I was running BlueOS on a Pi 5 with no Navigator, I don’t have a running autopilot process. Disabling mavlink lets the Mavlink Camera Manager still handle RTSP stream creation without an autopilot to talk MAVLink with about it (I think?)

Cockpit lite is part of Cockpit, which is open source! I simply asked the LLM (opus 4.7) to find and use the relevant approach from the Cockpit repo.

Sorry for my confusion on your setup - the laptop topside makes much more sense. I don’t think the Pi4 is limiting you, but the bandwidth possible over your Fathom-X likely is. How long is your slim tether? What does the network test give you for speed results - I see in the thread history you were having issues in the water vs. on the bench?

The exploreHD can have its bitrate lowered from the default 10, which I may need to do. Your cameras could be producing more, or your max bandwidth could be less - I’ve not yet wired in the full length of fresh tether for testing but typically 40-80 Mbit is possible.