DARPA Subterranean Challenge (SubT)
Virtual Urban Circuit Simulator Anomaly
Malcolm Stagg
What Happened?
If you look at the official SubT Challenge Virtual Urban Circuit scoreboard, you will see a lot of runs with a score of zero. To some extent, this is expected, since sometimes robots will become immobilized before they can successfully report a valid artifact. However, in this case, there is more going on than meets the eye.
Certain teams, including SODIUM-24 Robotics and Robotika, were greatly impacted by these zero-score runs. A thorough analysis of the logs was conducted to see what had happened. In all but three of the 24 runs, the SODIUM-24 Robotics robots exhibited the following behavior:
This behavior of robots "dancing" instead of entering the environment was obviously not intended or expected. A couple of similar robot issues had, however, been encountered in the days leading up to the Urban Circuit submission deadline, during which time simulator usage was extremely high. I believed high simulator load to be the culprit, though the SubT Challenge Team maintained that this could not be the cause, since the simulator is constructed from isolated networks: it should be impossible for one running simulation to influence another. I agreed that this should be true in theory, but the evidence suggested it was not the case in practice.
I decided to act on my own and conduct a set of re-runs while simulator usage was lower, to see what the score would have been if this issue had not been encountered. No changes were made to any of the robot configurations from the submitted versions.
The score increased from 7 to 60, almost an order of magnitude! Although it is not known whether rankings would have changed, it is clear that SODIUM-24 Robotics was one of the teams most heavily affected by the anomaly.
Why Did This Happen?
After this occurred, I wanted to figure out why. Since the simulator is AWS-based, I decided to build some network performance tests to try to rule out potential network causes which could occur within the AWS datacenter during periods of high load (such as temporarily increased RTT or reduced bandwidth). The network tests are available here: https://github.com/sodium24/cloudsim_net_test
None of the tests of network latency, loss, or bandwidth had a significant impact on the robots, so it seemed unlikely that a network performance problem was the cause.
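For illustration, the general idea behind these tests is to impair the robot's network link in a controlled way and watch for any effect on behavior. Below is a minimal sketch of that approach using tc/netem from Python; the interface name and impairment values are placeholders, and this is not the actual code from the cloudsim_net_test repository.

```python
#!/usr/bin/env python3
"""Apply artificial network impairments (latency, loss, bandwidth) with tc/netem
so a robot run can be observed under degraded conditions. Hypothetical sketch:
the interface name and values are placeholders, and root/NET_ADMIN is required."""
import subprocess

IFACE = "eth0"  # assumed network interface inside the robot container

def apply_impairment(delay_ms=0, loss_pct=0.0, rate_mbit=None):
    """Install a netem qdisc on IFACE with the given impairments."""
    cmd = ["tc", "qdisc", "add", "dev", IFACE, "root", "netem"]
    if delay_ms:
        cmd += ["delay", f"{delay_ms}ms"]
    if loss_pct:
        cmd += ["loss", f"{loss_pct}%"]
    if rate_mbit:
        cmd += ["rate", f"{rate_mbit}mbit"]
    subprocess.run(cmd, check=True)

def clear_impairment():
    """Remove the netem qdisc, restoring normal networking."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"], check=True)

if __name__ == "__main__":
    # e.g. add 100 ms of one-way delay and 1% loss while a robot run is in progress
    apply_impairment(delay_ms=100, loss_pct=1.0)
    input("Impairment active; observe the robot, then press Enter to restore...")
    clear_impairment()
```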
This led me to further review the CloudSim codebase, where I found that Weave Net was being used with Kubernetes for handling the simulator node networking.
It's important to note that the CloudSim simulator is heavily based on Ignition Transport, which primarily uses unicast TCP for communication between nodes, but uses multicast UDP for service discovery.
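In practice this means each simulation's nodes exchange discovery beacons over a multicast UDP group. A simple way to get a feel for that traffic is to passively join the discovery group and count incoming datagrams, as in the sketch below; the group and port are the ign-transport defaults as I understand them, so treat them as assumptions.

```python
#!/usr/bin/env python3
"""Passively listen on the Ignition Transport discovery channel and count how many
multicast datagrams arrive. Group/port below are assumed ign-transport defaults."""
import socket
import struct

MCAST_GRP = "239.255.0.7"   # assumed IGN_DISCOVERY_MULTICAST_IP default
MCAST_PORT = 11319          # assumed IGN_DISCOVERY_MSG_PORT default

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", MCAST_PORT))

# Join the discovery multicast group on all interfaces.
mreq = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

count = 0
while True:
    data, addr = sock.recvfrom(65535)
    count += 1
    print(f"discovery packet #{count}: {len(data)} bytes from {addr[0]}")
```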
While looking into potential issues with Weave Net, I found issue #3272, in which the reporter noticed that the Weave Net Network Policy Controller (NPC) has an unexpected behavior: multicast packets are sent to all peers, unlike unicast packets, which are subject to NPC policies. I realized this could be the cause! Maybe, during periods of high simulator load, multicast isolation broke down between running simulations, and the deluge of multicast UDP packets caused robots to fail to run correctly.
To test this theory, I first added some "ign_saturation" test cases to the network performance tests, simulating a situation in which many multicast service discovery packets arrive at once. When saturated with enough multicast service discovery packets, the robots exhibited behavior very similar to that observed during the Urban Circuit event, indicating this could indeed be the cause.
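The sketch below shows the rough shape of such a saturation test: it replays a previously captured discovery datagram at a high rate toward the discovery multicast group. It is only an approximation of the actual ign_saturation cases; the payload file, group, port, and rate are all assumptions.

```python
#!/usr/bin/env python3
"""Crude multicast saturation sketch: replay a captured discovery datagram at a high
rate toward the Ignition discovery group. Group/port are assumed defaults, and
'captured_discovery.bin' is a hypothetical file holding one recorded datagram."""
import socket
import time

MCAST_GRP = "239.255.0.7"   # assumed default discovery multicast group
MCAST_PORT = 11319          # assumed default discovery port
RATE_PER_SEC = 2000         # arbitrary saturation rate

with open("captured_discovery.bin", "rb") as f:
    payload = f.read()

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)

interval = 1.0 / RATE_PER_SEC
while True:
    sock.sendto(payload, (MCAST_GRP, MCAST_PORT))
    time.sleep(interval)
```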
Finally, I tried to confirm whether this was actually the case: was multicast UDP isolation broken in the CloudSim simulator? To test this, I created the getalltopics_ros Docker container, which outputs all topics found during service discovery, and uploaded it to the CloudSim simulator. I then started up two separate robot simulations, each containing a single robot. These simulations should have been totally isolated. Their GUIDs were 195e3333-6301-4082-95bd-e26b9fe717e1 and f3b0e3ef-1891-4db1-84f4-ff093d6efd62.
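As an illustration of what such a container does, the sketch below simply shells out to the `ign topic -l` CLI and prints whatever topics the local discovery layer can see; the real getalltopics_ros container used ROS, so this is only an approximation.

```python
#!/usr/bin/env python3
"""Periodically dump every topic visible to this node's discovery layer, similar in
spirit to getalltopics_ros. This sketch uses the `ign topic -l` CLI instead of ROS."""
import subprocess
import time

while True:
    result = subprocess.run(["ign", "topic", "-l"], capture_output=True, text=True)
    print("Topic List:")
    for topic in result.stdout.splitlines():
        print(f" - {topic}")
    time.sleep(10)
```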
Analyzing the logs from one of these simulations, I saw the following:
Topic List:
- @/195e3333-6301-4082-95bd-e26b9fe717e1@/clock
- @/195e3333-6301-4082-95bd-e26b9fe717e1@/logs/sdf
- @/195e3333-6301-4082-95bd-e26b9fe717e1@/model/X1/battery/linear_battery/state
- @/195e3333-6301-4082-95bd-e26b9fe717e1@/model/X1/cmd_vel_relay
- @/195e3333-6301-4082-95bd-e26b9fe717e1@/model/X1/gas_detected
- @/195e3333-6301-4082-95bd-e26b9fe717e1@/model/X1/odometry
- @/195e3333-6301-4082-95bd-e26b9fe717e1@/model/X1/pose   <--- from simulation 1
...
- @/f3b0e3ef-1891-4db1-84f4-ff093d6efd62@/clock
- @/f3b0e3ef-1891-4db1-84f4-ff093d6efd62@/logs/sdf
- @/f3b0e3ef-1891-4db1-84f4-ff093d6efd62@/model/X1/battery/linear_battery/state
- @/f3b0e3ef-1891-4db1-84f4-ff093d6efd62@/model/X1/cmd_vel_relay
- @/f3b0e3ef-1891-4db1-84f4-ff093d6efd62@/model/X1/gas_detected
- @/f3b0e3ef-1891-4db1-84f4-ff093d6efd62@/model/X1/odometry
- @/f3b0e3ef-1891-4db1-84f4-ff093d6efd62@/model/X1/pose   <--- from simulation 2!!!
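A quick way to flag this condition automatically is to check each listed topic's GUID against the local simulation's GUID, as in the minimal sketch below; the log file name is hypothetical, and the parsing follows the "- @/<guid>@/<topic>" format shown above.

```python
#!/usr/bin/env python3
"""Flag topics whose GUID prefix differs from the local simulation's GUID, which
would indicate broken isolation. Minimal sketch; the log file name is hypothetical."""
import re

MY_GUID = "195e3333-6301-4082-95bd-e26b9fe717e1"  # this simulation's GUID

def foreign_topics(topic_lines):
    """Return topic lines scoped to a GUID other than MY_GUID."""
    leaks = []
    for line in topic_lines:
        m = re.match(r"-\s*@/([0-9a-f-]+)@(.*)", line.strip())
        if m and m.group(1) != MY_GUID:
            leaks.append(line.strip())
    return leaks

if __name__ == "__main__":
    with open("topic_list.log") as f:  # hypothetical capture of the output above
        leaked = foreign_topics(f.readlines())
    if leaked:
        print(f"Isolation broken: {len(leaked)} topics from other simulations")
        for topic in leaked:
            print(" ", topic)
```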
Sure enough, isolation between running simulations was broken for multicast! Beyond the negative impact on my own robots, this had some other rather serious potential implications:
- Information Disclosure: it was possible to see which other competitors were using CloudSim at the same time, along with some basic information about their robot configurations.
- Bypassing Competition Rules: robots could use this channel to communicate even when out of communications range of each other.
- Potential unauthorized access to other sensitive data.
I immediately wrote up a report to let the SubT Challenge Team know about the issue and its impacts, recommending as a fix that multicast UDP be blocked and replaced with unicast UDP using Ignition Transport's IGN_RELAY environment variable. I also stopped using the simulator until the issue was fixed, to avoid any appearance of impropriety, and did not disclose details of the issue to other teams until then.
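For reference, the sketch below shows roughly how a node might be launched with IGN_RELAY set so that discovery traffic goes to specific peers over unicast instead of multicast; the command, peer addresses, and separator are assumptions and should be checked against the ign-transport documentation.

```python
#!/usr/bin/env python3
"""Launch an Ignition node with discovery relayed over unicast via IGN_RELAY.
The command, world file, and peer IPs below are placeholders, not CloudSim's setup."""
import os
import subprocess

env = os.environ.copy()
# Send discovery traffic to the other nodes in this simulation only (assumed
# colon-separated list of relay IPs; verify the separator in the ign-transport docs).
env["IGN_RELAY"] = "10.0.0.11:10.0.0.12"
# env["IGN_IP"] = "10.0.0.10"  # optionally pin this node's own IP

# Hypothetical server launch; any ign-transport-based process would work the same way.
subprocess.run(["ign", "gazebo", "-s", "my_world.sdf"], env=env, check=True)
```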
What Was Done About It?
For the Urban Circuit, unfortunately, nothing was done to re-run the affected runs, calculate updated scores, and see whether the rankings would have changed. While I personally believe this would have been the correct response, it is obviously not my decision to make.
Fortunately, the recommended fix of using IGN_RELAY for communication between the nodes was implemented for the Cave Circuit. Simulator stability was greatly improved in preparation for the Cave Circuit, and there is no evidence that this type of anomaly occurred again. Hopefully, the issue has now been resolved.
My post summarizing the issue to other teams is available here.