The Proof of Concept

The Sensor

The system utilized a single sensor, the Stereolabs ZED stereo camera system. This system provides a color image and is capable of delivering a point cloud up to 25 meters. Given the constraints of maximum vehicle speed and the desired response time, this was sufficient for the task. By using a single sensor, the setup was kept very simple as there was no cross-calibration or synchronization required. The sensor also performed very well, returning an impressively dense point cloud at 1080p and 30fps.

Processing the Images

As there was no need to accumulate data due to the stereo camera's fairly dense point cloud, the system operated one-shot. This greatly simplifies system documentation like the operating diagram shown below. CPU compute is shown in blue and GPU compute is shown in green.

Images are pulled from the camera by USB. This is performed by the API.
The color image is passed to the GPU.
The image's color space is converted from RGB to HSL for more accurate segmentation.
The sampled color of a region is compared to the rest of the image to extract surfaces that match the sampled region.
The stereo camera system does no on board processing, rather using CUDA via OpenCV for the heavy lifting. Conveniently, the camera's API allows access to the image pointer in the GPU's memory avoiding the expensive I/O to the CPU and back. This is performed by the API.
The direction of the surface normals of a sampled region immediately in front of the vehicle is assumed to be flat with respect to the vehicle. If the rest of the surface normals are within a certain angle of the sampled area, they are considered flat as well.
The two eligible areas are then combined with a logical AND. Filtering is then performed to fill in gaps in the largest blob as well as reduce the number of vertices around the edges.
The edges are extracted from the main blob by removing proportional sections at the top and bottom that don't matter as they are so far that they will not be reached for another second or will be reached by the time the frame is computed, respectively.

This whole system operated at 30fps with 150ms latency from input frame to extracted edges. The system ran on a laptop with an Intel i7-6820HQ and Nvidia M1000M (2GB) using Ubuntu 16.04 LTS. The laptop showed no more than 25% CPU usage while simultaneously recording data to file and displaying the input and final processed images in real time. The cooling fan did not noticeably increase above idle, seemingly implying that the graphics compute load was similarly low for its capabilities.

Room for Improvement

As with any proof of concept, there's a lot to change for the prototype. Since this was pretty much just a pair of cameras and a computer, practically all the complexity is in the computation. There are two haves to the computation though, the image computation as performed by the camera's API and the edge extraction.

The camera produced a surprisingly dense point cloud. Having worked on and with stereo camera vision systems in the past and being somewhat familiar with the state of the art, its performance was impressive and intriguing. While performing some testing indoors, a hint as to how was provided when some strange behavior regarding depth readings on repetitive fabric patterns occurred. On a particular blue an white chevron pattern on a pillow from 10 feet, the camera placed the white part of the pillow at the correct distance, but placed the blue part another 5 feet further.

After some further investigation, it appears the depth measurement was performed using a neural engine. This also explained some of the results where inconsistent depth readings occurred over a fairly uniform but highly textured surface. So while the depth readings were dense, they were not always accurate. A stunning example of this is below where the sampled space contained some erroneous surface normals leading the system to believe the sides of a hill and a wooded area to be flat (the area encircled in red).

Due to the design of the system to also detect color differences, however, the system is able to largely differentiate the road surface from the non-road areas.

Unfortunately, the neural engine meant there was no explainable vision pipeline, though this requirement may change in the future. Attempting to use more traditional, explainable methods did not yield the density or range required for the required vehicle speed. From the experiments performed, it seems unlikely any passive camera system can handle the low texture of fresh snow or fresh asphalt. This seems to imply that for the foreseeable future an active ranging system will be required to get any level of range and reliability to distance measurements. With that in mind, work began on the next iteration.

Dependencies

Ubuntu 16.04 LTS
OpenCV 3.4
CUDA 9.1
ZED 2.3

Inquiries