How The Kinect Works

Whilst originally developed as a gaming interface for the Xbox 360, the Microsoft Kinect has created significant interest in a number of different areas, particularly within the robotics community. The Kinect serves as a simple, low-cost structured light scanner packaged into a single unit, thereby opening up the potential for a large number of 3D vision applications.

One of the problems with conventional structured light scanners, as highlighted on our ‘Science Behind 3D Vision’ page, is that, whilst very useful, they are very slow to obtain scans, because they require light stripes to be projected onto the scene in different orientations in order to recover its 3D geometry.

Although still, strictly speaking, a structured light scanner, the Kinect works in a very different manner. Instead of a series of different stripes, the Kinect uses a speckle pattern of dots that is projected onto the scene by an IR projector and detected by an IR camera, as shown below.

 

Figure 1, Kinect System Overview (NI, 2011)

 

Hard-coded into the Kinect at manufacture is a reference image of the speckle pattern. Each IR dot in the pattern has a unique surrounding area, which allows each dot to be identified easily when projected onto a scene. The processing performed in the Kinect to calculate depth is essentially a stereo vision computation. The algorithm picks a particular dot in the reference pattern and then looks for that dot in the observed scene by also matching its eight surrounding pixels, which together form a unique neighbourhood. Once the dot is found in the scene, its disparity can be determined and used, together with the focal length of the IR camera and the baseline between the projector and the camera, to determine the depth of that point in the scene. This process is then repeated for each point in the reference pattern.
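The matching and triangulation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Kinect's actual on-chip algorithm: it assumes a simple rectified stereo model in which depth = focal length × baseline / disparity, it searches along a single image row, and it scores candidates by the sum of absolute differences over a 3×3 patch (the dot plus its eight neighbours). The function names, the toy images, and the focal length and baseline values are all hypothetical choices made for the example.

```python
def patch(image, row, col):
    """Return the 3x3 neighbourhood centred on (row, col) as a flat list."""
    return [image[r][c]
            for r in range(row - 1, row + 2)
            for c in range(col - 1, col + 2)]


def find_disparity(reference, observed, row, col, max_shift):
    """Find how far (in pixels) the reference patch has shifted to the right.

    Compares the 3x3 reference patch with patches in the observed image on
    the same row, using the sum of absolute differences as the match cost.
    Returns the shift with the lowest cost, or None if nothing was tested.
    """
    target = patch(reference, row, col)
    width = len(observed[0])
    best_shift, best_cost = None, None
    for shift in range(max_shift + 1):
        centre = col + shift
        if centre + 1 >= width:  # patch would fall off the image edge
            break
        candidate = patch(observed, row, centre)
        cost = sum(abs(a - b) for a, b in zip(target, candidate))
        if best_cost is None or cost < best_cost:
            best_shift, best_cost = shift, cost
    return best_shift


def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Triangulate depth in metres from disparity (assumed stereo model)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


# Toy reference pattern (1 = dot, 0 = no dot); hypothetical data.
reference = [
    [0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
]
# Observed scene: the same pattern shifted two pixels to the right.
observed = [[0, 0] + row[:-2] for row in reference]

shift = find_disparity(reference, observed, row=1, col=3, max_shift=4)
print("disparity (pixels):", shift)

# Illustrative values only: 580 px focal length and 0.075 m baseline are
# assumptions for this example, not figures taken from the text above.
print("depth (m):", depth_from_disparity(43.5, 580.0, 0.075))
```

With the illustrative numbers above, a disparity of 43.5 pixels corresponds to a depth of about 1 m. Because depth is inversely proportional to disparity, a given error in the measured disparity produces a larger depth error for distant points than for near ones.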

The IR speckles projected by the Kinect are of three different sizes, each optimised for use in a different depth range, meaning that the Kinect can operate between approximately 1 m and 8 m (closer in the more recent Kinect for Windows iteration). The IR source is a constant, unmodulated emitter directed at a diffractive optical element (DOE), as shown below in Figure 2. The majority of the design, development and patenting associated with the Kinect has been in the area of DOE design and high-quality manufacturing processes needed to obtain the required level of accuracy, something which was not previously possible.

 

Figure 2, The Kinect’s DOE and associated speckle output

 

As part of the processing, the Kinect first identifies which range an object lies within, and then uses only the speckles optimised for that range to calculate depth in that area. The ability to project dots of different sizes in one go is one of the considerable advantages of the Kinect, and one of the ways in which it is able to operate over such a wide range with a single scan.

One of the clever parts of the Kinect is that, in addition to pure pixel shift, it also compares the observed size of a particular dot with its original size in the reference pattern. Any change in size or shape is also factored into the depth calculations. These calculations are all performed on the device in real time by a system on chip (SoC), and result in a depth image of 640×480 pixels at a frame rate of approximately 30 fps.

With the Kinect, Microsoft has demonstrated that structured light scanners can be used to capture accurate 3D profiles very quickly, and has therefore opened up the possibility of a new range of structured light 3D scanners. With appropriate cameras, there is no reason why Microsoft’s technology couldn’t be extended to operate at much higher frame rates in order to capture 3D profiles of fast-moving objects.