Science Behind 3D Vision

Three-dimensional vision systems allow us to measure depth by using one or more cameras or sensors. Vision systems operating in the visible and IR spectrums can be categorised into the four groups shown below.

Stereo Vision Cameras

Humans and most animals see through a highly sophisticated 3D vision system. The two eyes (cameras) are separated by a distance and therefore observe the surrounding environment from slightly different perspectives. From these two different views the brain is able to produce an accurate 3D representation of the environment.

This is 3D stereo vision in its simplest form and is the basis for many stereo 3D camera systems. A typical arrangement is shown below in figure 1.

Figure 1, Schematic view of a stereo 3D camera system (NI, 2011)

The principle behind stereo vision is that a point in a scene appears at a different position in each camera's image, because the two cameras view the scene from different positions. A point in the left camera's image will be shifted by a given number of pixels in the right camera's image. By measuring this shift (the disparity) for every matched point you can produce a disparity map highlighting the difference in views between the two cameras. Figure 2 shows an example of a typical disparity map produced from two different views of a scene.


Figure 2, Left and right camera image and resulting disparity map (Scharstein, D and Pal, C 2007)
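
The pixel shift described above can be found by comparing small patches of the two images. The following is a minimal sketch only, assuming a single pair of already aligned (rectified) image rows and a simple sum-of-absolute-differences match; the function name, window size and search range are illustrative assumptions, not the method used to produce figure 2.

```python
def scanline_disparity(left_row, right_row, window=3, max_disp=8):
    """Estimate the disparity of each pixel along one pair of image rows.

    For every pixel in the left row, a small patch around it is compared
    with patches in the right row shifted by 0..max_disp pixels. The shift
    with the lowest sum of absolute differences is taken as the disparity.
    """
    half = window // 2
    n = len(left_row)
    disparity = [0] * n
    for x in range(half, n - half):
        patch = left_row[x - half:x + half + 1]
        best_cost, best_d = float("inf"), 0
        for d in range(max_disp + 1):
            xr = x - d  # the same point sits further left in the right image
            if xr - half < 0:
                break
            candidate = right_row[xr - half:xr + half + 1]
            cost = sum(abs(a - b) for a, b in zip(patch, candidate))
            if cost < best_cost:
                best_cost, best_d = cost, d
        disparity[x] = best_d
    return disparity


# A small feature centred on pixel 10 in the left row and pixel 7 in the right row.
left = [0] * 20
right = [0] * 20
left[9:12] = [1, 5, 2]
right[6:9] = [1, 5, 2]
print(scanline_disparity(left, right)[10])  # prints 3
```

Flat regions with no texture give ambiguous matches, which is one reason why real disparity maps contain holes or noise.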


Given this disparity map, the depth of a real-world point can be determined from the internal parameters of each camera (the ‘properties’ of the camera, such as lens focal length) and the relative position and orientation of the cameras (their ‘external’ parameters).


depth = f (b / d)

Where f is the camera's focal length (expressed in pixels), b is the baseline between the cameras and d is the disparity of the given point in pixels. The depth is returned in the same units as the baseline.
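
As a minimal sketch of this formula, the function below converts a single disparity value into a depth. The function name and the example numbers are illustrative assumptions and are not taken from a real camera.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Return depth = f * (b / d), in the same units as the baseline."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point infinitely far away.
        raise ValueError("disparity must be positive")
    return focal_length_px * (baseline_m / disparity_px)


# Assumed values: 700 pixel focal length, 0.12 m baseline, 42 pixel disparity.
print(depth_from_disparity(700.0, 0.12, 42.0))
```

With these assumed values the depth is approximately 2 m. Because d appears in the denominator, a one pixel error in disparity matters far more for distant points (small d) than for near ones.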


By applying this formula to each point pair observed by both cameras, a 3D model of the entire scene can be obtained, provided that each point can be located in both camera images (not always a simple task). This process only involves the capture of a single image with each camera and is therefore ideally suited to analysing the 3D motion of fast-moving objects.

Laser Scanning

One of the most common methods of 3D scanning uses a laser projector, a camera, and the principle of stereo triangulation highlighted above.

Figure 3, System diagram for a laser scanner system (NI, 2011)


The laser projector emits a stripe of laser light which is projected onto the scene to be scanned. Objects in the scene deform the laser stripe and so change the way the stripe appears to the camera, as can be seen above in figure 3. Taller objects offset the stripe further from the baseline, by an amount that depends on their height. This offset from the baseline can be treated as a disparity and, in the same way as above, used to calculate the depth, i.e. the height of a given object. By scanning the laser stripe across the object, a full profile of the object can be obtained.
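
The stripe offset described above can be sketched as follows. This is a simplified illustration only: it assumes the image is a list of rows of brightness values, that the stripe is the brightest pixel in each column, and that a single calibration factor (the hypothetical mm_per_pixel) converts pixel offset into height. A real scanner would derive that conversion from the calibrated laser and camera geometry.

```python
def stripe_offsets(image, reference_row):
    """Return, for each image column, how many pixels the laser stripe
    has moved away from the row where it falls on the empty base surface."""
    n_rows = len(image)
    n_cols = len(image[0])
    offsets = []
    for col in range(n_cols):
        column = [image[row][col] for row in range(n_rows)]
        stripe_row = column.index(max(column))  # brightest pixel = stripe
        offsets.append(reference_row - stripe_row)
    return offsets


def heights_from_offsets(offsets, mm_per_pixel):
    """Convert pixel offsets to heights using an assumed linear calibration."""
    return [offset * mm_per_pixel for offset in offsets]


# One captured frame: the stripe sits on row 4, except where an object lifts it to row 2.
frame = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
    [9, 0, 0, 9],
]
offsets = stripe_offsets(frame, reference_row=4)
print(offsets)                             # prints [0, 2, 2, 0]
print(heights_from_offsets(offsets, 0.5))  # prints [0.0, 1.0, 1.0, 0.0]
```

Each frame yields one profile along the stripe; stacking the profiles captured as the stripe moves across the object gives the full surface.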

Due to the nature of the laser light, such systems are able to produce very high accuracy 3D models.

They do, however, have a number of limitations. Because the laser emits a single stripe, such a system can only profile the part of the scene covered by the stripe at any one time. In order to scan a scene like the one shown above in figure 3, the laser/camera pair or the object to be scanned must be moved, which can take as long as 10 minutes depending upon the object. This isn’t a problem when scanning static objects, but a problem arises when trying to scan objects moving in a non-uniform manner.

With this in mind they are often used in production lines for measuring the characteristics of produced components to a very high degree of accuracy. They are ideally suited to this environment as the object itself will be moving and the laser/camera pair can remain stationary.

Time of Flight (TOF) Cameras

TOF cameras are typically based on a light source and a camera and work by measuring the change in the light when it bounces back from objects in the scene. TOF camera systems work using one of two principles: either pulsed light or RF modulated light.

TOF cameras based on pulsed light sources measure the time that it takes for a light pulse to travel from the emitter to the scene and back again after reflection. As the speed of light is known, the distance to each point on the surface of the object can be determined from this round-trip time. This principle is shown below in figure 4.


Figure 4, System diagram for a pulsed light source TOF camera (NI, 2011)


Each pixel in the camera has its own timer and therefore the depth of each pixel in the camera's view can be determined. Provided that a light source capable of illuminating the entire scene is used, it is possible to determine the depth of all points in the scene with one image capture.
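
The per-pixel calculation can be sketched as below. The light travels to the object and back, so the distance is half of the round-trip time multiplied by the speed of light. The function name and the example timings are illustrative assumptions.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def pulsed_tof_depth(round_trip_times_s):
    """Convert a 2D grid of per-pixel round-trip times (seconds)
    into a 2D grid of distances (metres)."""
    return [
        [SPEED_OF_LIGHT * t / 2.0 for t in row]
        for row in round_trip_times_s
    ]


# A 2 x 2 sensor: three pixels see a far wall, one sees a nearer object.
times = [
    [20e-9, 20e-9],
    [20e-9, 10e-9],
]
for row in pulsed_tof_depth(times):
    print(row)
```

A round trip of 20 nanoseconds corresponds to roughly 3 m, which shows how fine the timing resolution must be: a depth change of 1 cm alters the round-trip time by only about 67 picoseconds.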

Although based on a similar principle, RF modulated TOF camera systems work in a very different manner: they rely on phase shift. Such a system takes the form shown below in figure 5.

Figure 5, System diagram for an RF modulated light source TOF camera (NI, 2011)


The system consists of a continuous light source that is frequency or amplitude modulated (FM or AM) in correspondence with an RF carrier wave, resulting in a light source of sinusoidal form with a known frequency or amplitude. When the modulated light comes into contact with an object in the field of view, the wave is reflected back and arrives at the detector phase shifted relative to the emitted wave, by an amount that varies with the shape of the object's surface. A detector is used to determine this phase shift, which is proportional to the distance between the source and the object. From the phase shift, a depth for each point in the scene can then be calculated. In the same manner as the pulsed light derivative considered above, the depth of the entire scene can be determined with a single image capture.
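
For the amplitude modulated case, the relationship between phase shift and distance can be sketched as follows. This assumes a single sinusoidal modulation frequency; the function names and the 30 MHz example frequency are illustrative assumptions, not the specification of any particular camera.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def distance_from_phase(phase_shift_rad, modulation_freq_hz):
    """Distance for a measured phase shift of the modulated signal.

    The light covers the distance twice (out and back), so
    distance = c * phase / (4 * pi * f).
    """
    return SPEED_OF_LIGHT * phase_shift_rad / (4.0 * math.pi * modulation_freq_hz)


def unambiguous_range(modulation_freq_hz):
    """Largest distance measurable before the phase wraps past 2 * pi."""
    return SPEED_OF_LIGHT / (2.0 * modulation_freq_hz)


# Assumed example: 30 MHz modulation and a quarter-cycle (pi / 2) phase shift.
print(distance_from_phase(math.pi / 2.0, 30e6))
print(unambiguous_range(30e6))
```

With these assumed values the object is about 1.25 m away and the phase wraps at about 5 m. Objects beyond that range are indistinguishable from nearer ones, so a lower modulation frequency extends the range at the cost of depth resolution.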

TOF cameras have a significant advantage over laser scanners and calibrated stereo cameras in that they are very quick and easy to set up. Unlike laser scanners, TOF cameras are able to deliver instantaneous dynamic scanning and hence a reasonably high frame rate. Depending upon the light source and camera used, they are also able to deliver a very wide field of view and hence scanning range. Unfortunately, TOF cameras do have a significant trade-off, namely their accuracy. Due to noise, distortion and propagation errors associated with the transmitted light, they are unable to return the same level of accuracy as would be observed with stereo vision cameras, for example. For this reason, applications which use TOF cameras don’t typically demand a high level of accuracy. Such applications include simple object detection, environment mapping for robot navigation, basic body tracking and gesture recognition.

Structured Light

Structured light 3D scanning systems work by projecting a known light pattern onto a 3D scene and subsequently imaging it using an appropriate camera. The light pattern will be distorted by the objects in the scene and therefore the pattern observed by the camera will be different to the one originally projected. By determining the shift of each pixel in the pattern, the depth of any given point in the camera's field of view can be determined.

The principle behind structured light cameras is therefore very similar to that of the stereo vision cameras highlighted above. The same principle of disparity is used, but with the difference that only one camera is required and that the shift in the reference pattern is used to calculate the disparity map instead of the actual images of the scene. A diagrammatic overview of such a system is shown below in figure 6.


Figure 6, System diagram for a structured light 3D scanning system


An un-modulated constant light source will typically be directed at a diffraction grating, thereby converting the light into a pattern consisting of vertical stripes. The vertical stripes are then projected onto the scene, where they fall on the objects within it. The stripes are distorted by the objects' surfaces, meaning that the stripes imaged by the camera are shifted from where they would otherwise be expected. It is this shift, or disparity, that is used as the basis upon which to calculate the 3D characteristics of the scene.

Because the projected stripes are very narrow, such scanners suffer from the same problem as laser scanners, namely that they are only able to calculate depth over a limited area in one scan. In order to scan an entire object, structured light scanners will typically project a range of different light patterns in different orientations in order to successfully determine the depth at each point in the scene. For this reason, scanning with structured light systems is very slow and is therefore only really suited to scanning static objects.
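
One common way of using a sequence of patterns is binary Gray-code stripes, where each projected pattern contributes one bit of the projector column that illuminated a given camera pixel. The text above does not say which coding scheme is used, so the following is a sketch of that assumed scheme only; the function names are hypothetical.

```python
def gray_code_patterns(width, n_bits):
    """Build n_bits stripe patterns for a projector `width` columns wide.

    patterns[k][x] is 1 if column x is lit in pattern k, else 0.
    Patterns are ordered from the most significant bit to the least.
    """
    patterns = []
    for bit in range(n_bits - 1, -1, -1):
        patterns.append([((x ^ (x >> 1)) >> bit) & 1 for x in range(width)])
    return patterns


def decode_column(bits):
    """Recover the projector column from the lit/unlit sequence
    observed at one camera pixel (most significant bit first)."""
    gray = 0
    for b in bits:
        gray = (gray << 1) | b
    column = gray
    shift = gray >> 1
    while shift:  # convert Gray code back to an ordinary binary number
        column ^= shift
        shift >>= 1
    return column


# Three patterns are enough to label 8 projector columns uniquely.
patterns = gray_code_patterns(width=8, n_bits=3)
observed = [pattern[5] for pattern in patterns]  # what a pixel lit by column 5 sees
print(decode_column(observed))  # prints 5
```

Once the projector column is known for a camera pixel, the difference between it and the camera column plays the role of the disparity in the stereo formula. Because several patterns must be captured in sequence, the scene has to stay still throughout, which is why the approach suits static objects.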

Typical applications for structured light scanners include gesture recognition, robot navigation, presence detection and scanning of static objects without a great level of detail.

This post gives a focused discussion on the 3D vision systems behind the Microsoft Kinect.

Sean Clarkson