What really is “Perception” for autonomous vehicles?

Perception is the term used to describe the visual cognition process for autonomous cars. Perception software modules are responsible for acquiring raw sensor data from on-vehicle sensors such as cameras, LIDAR, and RADAR, and converting this raw data into scene understanding for the autonomous vehicle.

[Image: Raw pixel data fed as input to perception]
[Image: Scene understanding derived from perception]
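
To make that input/output relationship concrete, here is a minimal sketch of the interface a perception module might expose: raw sensor frames go in, a structured scene description comes out. The class and field names below are illustrative assumptions, not a description of any particular autonomy stack.

```python
# Minimal sketch of a perception module's interface: raw sensor data in,
# scene understanding out. All names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np

@dataclass
class SensorFrame:
    camera_image: np.ndarray   # H x W x 3 array of raw pixel data
    lidar_points: np.ndarray   # N x 3 point cloud (x, y, z)
    radar_returns: np.ndarray  # M x 4 returns (x, y, range rate, amplitude)

@dataclass
class DetectedObject:
    category: str                  # e.g. "pedestrian", "vehicle", "cyclist"
    position: Tuple[float, float]  # metres, in the vehicle frame
    velocity: Tuple[float, float]  # metres per second

@dataclass
class SceneUnderstanding:
    objects: List[DetectedObject] = field(default_factory=list)
    drivable_corridor: List[Tuple[float, float]] = field(default_factory=list)
    traffic_light_state: str = "unknown"

def perceive(frame: SensorFrame) -> SceneUnderstanding:
    """Placeholder for the perception pipeline: this is where the
    computer-vision and machine-learning components discussed below
    would process the raw sensor frame."""
    return SceneUnderstanding()
```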

The human visual cognition system is remarkable. Human drivers can instantly tell what is around them: the important elements in a busy traffic scene, the locations of relevant traffic signs and traffic lights, the likely responses of other road users, and a plethora of other pertinent information. The human brain derives all of this insight in a split second, using only the visual information acquired by our eyes. This visual cognition ability generalises across numerous types of traffic scenarios in different cities, and even countries. As human drivers, we can easily apply our knowledge from one place to another.

However, visual cognition is incredibly challenging for machines, and building a generalisable visual cognition system is currently the biggest open challenge within the fields of autonomous driving, machine learning, robotics, and computer vision. So, how does perception work for autonomous cars?

Perception technologies can be broken down into two main categories: computer vision approaches and machine learning approaches. Computer vision techniques address a problem formally, describing it with an explicit mathematical formulation and usually relying on numerical optimisation to find the best solution to that formulation. Machine learning techniques, on the other hand, such as convolutional neural networks (CNNs), take a data-driven approach: ground-truth data is used to ‘learn’ the best solution to a particular problem by identifying common features in the data associated with the correct response. For example, a CNN trained to identify pedestrians in camera images will extract features that commonly accompany the appearance of pedestrians in the training data, such as their shape, size, position, and colour. Both approaches have their merits and disadvantages, and autonomous vehicles rely on a combination of these techniques to build a rich scene understanding of their environment.
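
As a concrete illustration of the data-driven approach, the sketch below runs an off-the-shelf CNN detector over a single camera image and keeps only the pedestrian detections. The use of torchvision's COCO-pretrained Faster R-CNN and the 0.7 score threshold are assumptions made purely for illustration; they do not describe any specific production perception stack.

```python
# Illustrative sketch: pedestrian detection with a COCO-pretrained CNN
# detector from torchvision. Model choice and score threshold are
# assumptions for illustration only.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

COCO_PERSON_LABEL = 1  # index of the 'person' class in torchvision's COCO models

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_pedestrians(image_path: str, score_threshold: float = 0.7):
    """Return bounding boxes (x1, y1, x2, y2) of likely pedestrians."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        prediction = model([image])[0]
    boxes = []
    for box, label, score in zip(prediction["boxes"],
                                 prediction["labels"],
                                 prediction["scores"]):
        if label.item() == COCO_PERSON_LABEL and score.item() >= score_threshold:
            boxes.append(box.tolist())
    return boxes
```

Everything such a detector ‘knows’ about pedestrians comes from its training data, which is exactly why its performance depends on how well that data covers the conditions the vehicle will actually encounter.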

Perception is very challenging for autonomous vehicles because it is incredibly difficult to build a generalisable and robust model of complex traffic environments, whether explicitly or through data. Autonomous vehicles can encounter strange and previously unseen obstacles, new types of traffic signs, or obstacles of a known type in an unusual configuration, such as a group of children wearing Halloween costumes.

[Image: Challenging obstacles]

Similar challenges arise in identifying where it is safe to drive. Deriving a safe driving corridor is fairly straightforward in the presence of well-maintained lane markings on roads that an autonomous vehicle has frequently driven on, but performing the same task on a new road without lane markings, or with a different style of lane markings, is a much tougher challenge. There is huge variety in road geometry and road surface types across the world, from motorways to dirt roads, and for a truly automated future, autonomous vehicles will have to be able to contend with all of these conditions.
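
To illustrate the classical computer-vision side of this task, here is a minimal sketch of lane-marking detection using edge detection followed by a Hough transform (via OpenCV). The thresholds and the triangular region of interest are illustrative assumptions; a deployed system would need far more robust handling of curvature, lighting, and missing markings.

```python
# Minimal sketch of classical lane-marking detection: Canny edges plus a
# probabilistic Hough transform. All thresholds and the region-of-interest
# polygon are illustrative assumptions.
import cv2
import numpy as np

def detect_lane_segments(bgr_image: np.ndarray):
    """Return candidate lane-line segments as (x1, y1, x2, y2) tuples."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    edges = cv2.Canny(blurred, 50, 150)

    # Keep only a rough triangle covering the road ahead of the vehicle.
    height, width = edges.shape
    mask = np.zeros_like(edges)
    roi = np.array([[(0, height), (width, height),
                     (width // 2, height // 2)]], dtype=np.int32)
    cv2.fillPoly(mask, roi, 255)
    edges = cv2.bitwise_and(edges, mask)

    # Extract straight-line segments from the masked edge map.
    lines = cv2.HoughLinesP(edges, rho=2, theta=np.pi / 180, threshold=50,
                            minLineLength=40, maxLineGap=100)
    return [] if lines is None else [tuple(segment[0]) for segment in lines]
```

Notice how much of this logic encodes explicit assumptions about the scene, such as where the road is and how straight the markings are; it is precisely these assumptions that break down on unusual roads.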

[Image: Challenging roads]

The challenge of perception is further compounded in adverse weather or at night, when raw sensor data becomes degraded and the perception system has to parse noisier data to make sense of what is in the environment.

[Image: Difficulty of perception in low light and adverse weather]

Computer vision-based perception approaches usually offer fair performance and are typically generalisable across a wide set of scenarios and conditions, depending on the robustness of the underlying mathematical formulation. Machine learning-based approaches, on the other hand, are limited by the data used to train the system: good performance is achieved when real-world conditions match the training data, but performance degrades significantly when the real world looks different from what the machine learning system has been taught to recognise. This begs the question: if perception is so challenging, and computer vision and machine learning have limitations in performance and generalisability, how are autonomous cars today able to contend with real-world driving scenarios?

The answer is mapping. Autonomous cars take the burden away from on-vehicle perception by using a prior 3D survey of roads with annotations identifying important road features. This 3D map, sometimes referred to as a high-definition (HD) map, contains detailed information about each centimetre of every road an autonomous vehicle will operate on, including the precise position of lane markings, curbs, traffic lights, traffic signs, buildings, and other environmental features. By utilising an HD map, autonomous vehicles only need to perceive the dynamic elements of a scene, such as pedestrians, other vehicles, and cyclists, for which CNNs are well suited and provide good performance under most scenarios. Computer vision can then be relied upon as a redundant perception technology in case a CNN failure occurs because a strange obstacle is present or an unknown scenario develops.
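
As a rough sketch of the idea, the snippet below shows a hypothetical HD map representation and a query that returns the static elements near the vehicle's localised pose. The schema and field names are invented for illustration and do not correspond to any real HD map format.

```python
# Hypothetical sketch of an HD map tile and a query for the static scene
# elements near the vehicle. The schema is invented for illustration and
# does not correspond to any real HD map format.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LaneBoundary:
    lane_id: str
    marking_type: str                    # e.g. "solid", "dashed", "none"
    polyline: List[Tuple[float, float]]  # surveyed points in the map frame

@dataclass
class TrafficLight:
    light_id: str
    position: Tuple[float, float]        # surveyed position in the map frame

@dataclass
class HDMapTile:
    lane_boundaries: List[LaneBoundary]
    traffic_lights: List[TrafficLight]

def static_scene_near(tile: HDMapTile, pose_xy: Tuple[float, float],
                      radius_m: float = 50.0):
    """Return the pre-surveyed static elements within radius_m of the pose.

    Because these elements come from the map rather than live sensing,
    on-board perception only has to handle the dynamic agents."""
    px, py = pose_xy

    def near(point):
        return (point[0] - px) ** 2 + (point[1] - py) ** 2 <= radius_m ** 2

    lanes = [lb for lb in tile.lane_boundaries if any(near(p) for p in lb.polyline)]
    lights = [tl for tl in tile.traffic_lights if near(tl.position)]
    return lanes, lights
```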

However, a simple question then comes to the fore: what happens if autonomous vehicles don’t have access to HD maps, or if the HD maps are outdated? How can an autonomous vehicle drive in these scenarios, when it has to rely only on its on-board perception?

At Propelmee, our technologies answer these questions…