RGBDGaze: Gaze Tracking on Smartphones with RGB and Depth Data

Computer interfaces with the ability to track a user’s on-screen gaze location offer the potential for more accessible and powerful multimodal interactions, perhaps one day even supplanting the venerable cursor. While useful for desktop computing, gaze also promises to be a powerful way to interact with phones, especially given the need to adapt to a variety of usage contexts (e.g., inability to use the touchscreen with encumbered hands). Specialized gaze tracking hardware, either worn or placed in the environment, can track gaze with very high resolution, i.e., 1.1 mm (0.45°), but the need for specialized equipment is a significant barrier to consumer adoption. When relying on existing onboard hardware, research has primarily focused on user-facing RGB cameras. Unfortunately, gaze models utilizing this RGB data are too coarse for interactions with many user interface widgets, which are generally small on mobile devices. To help close this gap, researchers have started assessing the value of depth cameras to improve performance, but all research to date has focused on desktop-grade depth cameras (e.g., Microsoft Kinect V2, Intel RealSense). These sensors are much more capable than the depth cameras seen in smartphones, which must be very thin and comparatively lower powered. Furthermore, much of this prior RGB+Depth (RGBD henceforth) gaze research had users maintain their head position in a highly constrained way (e.g., with a chin rest). This rigid requirement is at odds with how a typical user actually interacts with a phone: while walking, riding public transport, carrying handbags, etc. Thus, it is important to build a gaze tracker that adapts to a user’s changing context, uses existing hardware, and provides usable resolution.

This paper presents a gaze tracker that uses an off-the-shelf phone’s front-facing RGB and depth cameras. We collected data from and implemented our system on recent Apple iPhones (X and above), which feature a 1080p user-facing camera and Apple’s structured light TrueDepth camera (similar to the technology used in the Kinect V1 and earlier PrimeSense models). Our mobile RGBD dataset of 50 participants is the first of its kind, offering RGBD data paired with user gaze location across a variety of use contexts. We implemented a CNN model based on a spatial weights structure to efficiently fuse the RGB and depth modalities. Our model achieves 1.89 cm on-screen Euclidean error on our dataset in a leave-one-participant-out evaluation, showing a significant improvement over existing gaze-tracking methods in mobile settings. This result reaffirms the utility of fusing RGB and depth data, and offers the first benchmark for smartphone-based RGBD gaze tracking while a user is not simply sitting.
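
To make the reported metric concrete, the following is a minimal sketch of how an on-screen Euclidean error and a leave-one-participant-out split could be computed. It is not the authors' evaluation code: the function names, the use of plain (x, y) tuples in centimeters, and the string participant IDs are all assumptions made for illustration.

```python
import math
from typing import Iterator, List, Sequence, Tuple

# Assumed representation: an on-screen gaze point as (x, y) in centimeters.
Point = Tuple[float, float]


def mean_euclidean_error(pred: Sequence[Point], true: Sequence[Point]) -> float:
    """Average straight-line distance between predicted and true gaze points."""
    if len(pred) != len(true) or not pred:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, true)) / len(pred)


def leave_one_participant_out(
    participant_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_id, train_indices, test_indices), one fold per participant.

    Each fold tests on every sample from one participant and trains on the
    samples from all other participants.
    """
    for held_out in sorted(set(participant_ids)):
        train = [i for i, pid in enumerate(participant_ids) if pid != held_out]
        test = [i for i, pid in enumerate(participant_ids) if pid == held_out]
        yield held_out, train, test
```

In such a setup, a model would be trained on the `train` indices of each fold and scored with `mean_euclidean_error` on the `test` indices, with the per-fold errors then aggregated across participants. How the paper aggregates across folds is not stated here, so that step is left out of the sketch.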

Research Team: Riku Arakawa, Mayank Goel, Chris Harrison, Karan Ahuja


Riku Arakawa, Mayank Goel, Chris Harrison, Karan Ahuja. 2022. RGBDGaze: Gaze Tracking on Smartphones with RGB and Depth Data. In Proceedings of the 2022 International Conference on Multimodal Interaction (ICMI '22). Association for Computing Machinery, New York, NY, USA.