Crowd-AI Camera Sensing in the Real World

Cameras are becoming pervasive in civic and commercial settings, and are moving into homes with devices like the Nest Cam and Amazon Echo Look. Owing to their high resolution and wide field of view, cameras are the ideal sensors to enable robust, wide-area detection of states and events without having to directly instrument objects and people. Despite this unique advantage, camera streams are rarely actionalized into sensor data, and instead are merely used to view a remote area.

This trend is slowly changing with consumer home cameras offering rudimentary computationally-enhanced functions, such as motion and intruder detection. Perhaps most sophisticated among these consumer offerings is the Amazon Echo Look, which can offer fashion advice. In commercial and municipal camera systems, computer vision has been applied to, e.g., count cars and people, read license plates, control quality, analyze sports, recognize faces, and monitor road surfaces. In general, these computer-vision-powered systems require extensive training data and on-site tuning to work well. For example, FaceNet achieved human-level face detection accuracy, but required a team of researchers to collect and prepare over 100 million images for training. This is obviously impractical for the long-tailed distribution of scenarios and the many bespoke questions users may wish to ask about their environments.

To flexibly adapt to new questions, researchers have created hybrid crowd- and artificial intelligence (AI)-powered computer vision systems. Rather than requiring an existing corpus of labeled training data, these systems build one on-the-fly, using crowd workers to label data until classifiers can take over. This hybrid approach is highly versatile, able to support a wide range of end user questions, and can start providing real-time answers within seconds. However, prior work falls short of real-world deployment, leaving significant questions about the feasibility of such crowd-AI approaches, both in terms of robustness and cost. Moreover, it is unclear how users feel about such systems in practice, what questions they would formulate, as well as what errors and challenges emerge.

We iteratively built Zensors++, a full-stack crowd-AI camera-based sensing system with the requisite scalability and robustness to serve real-time answers to participants, in uncontrolled settings, over many months of continuous operation. With an early prototype of the system, we performed a discovery deployment with 13 users over 10 weeks to identify scalability problems and pinpoint design issues. Learning from successes and failures, we developed an improved system architecture and feature set, moving significantly beyond prior systems (including Zensors, VizWiz, and VizLens). More specifically, Zensors++ makes the following technical advances: (i) multiple queues to support crowd voting and dynamic worker recruitment, (ii) a dynamic task pool that estimates the capacity of labelers for minimizing end-to-end latency, and (iii) a hybrid labeling workflow that uses crowd labels, perceptual hashing, and continuously-evolving machine learning models.
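The hybrid labeling workflow in (iii) can be illustrated with a minimal sketch. This is not the Zensors++ implementation: the class `HybridLabeler`, the callbacks `ask_crowd` and `classifier`, the thresholds, and the simple average-hash function are all hypothetical names and choices made for illustration. The sketch assumes a frame is answered by the cheapest available source: a previously crowd-labeled frame with a matching perceptual hash, then a classifier if it is confident enough, and only then a majority vote over fresh crowd labels.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

Frame = Sequence[Sequence[int]]  # grayscale image as rows of 0-255 pixel values


def average_hash(frame: Frame) -> int:
    """Toy perceptual hash: one bit per pixel, set when the pixel is brighter than the mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def majority_vote(votes: List[str]) -> str:
    """Most common crowd answer (ties go to the answer seen first)."""
    return Counter(votes).most_common(1)[0][0]


class HybridLabeler:
    """Hypothetical router: hash cache first, then classifier, then crowd."""

    def __init__(
        self,
        ask_crowd: Callable[[Frame], List[str]],
        classifier: Optional[Callable[[Frame], Tuple[str, float]]] = None,
        max_hash_distance: int = 0,    # assumed threshold, not from the paper
        min_confidence: float = 0.9,   # assumed threshold, not from the paper
    ) -> None:
        self.ask_crowd = ask_crowd
        self.classifier = classifier
        self.max_hash_distance = max_hash_distance
        self.min_confidence = min_confidence
        self.cache: List[Tuple[int, str]] = []  # (hash, crowd answer) pairs

    def label(self, frame: Frame) -> Tuple[str, str]:
        """Return (answer, source), where source is 'hash', 'model', or 'crowd'."""
        h = average_hash(frame)
        # 1. Reuse a crowd answer if a visually similar frame was already labeled.
        for cached_hash, answer in self.cache:
            if hamming(h, cached_hash) <= self.max_hash_distance:
                return answer, "hash"
        # 2. Use the classifier only when it reports enough confidence.
        if self.classifier is not None:
            answer, confidence = self.classifier(frame)
            if confidence >= self.min_confidence:
                return answer, "model"
        # 3. Fall back to the crowd and remember the voted answer.
        answer = majority_vote(self.ask_crowd(frame))
        self.cache.append((h, answer))
        return answer, "crowd"
```

In this sketch, only crowd answers are cached, so a classifier mistake is never replayed through the hash lookup. A real deployment would also need to feed the crowd labels back into model training, which is omitted here.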

With our final system, we conducted a second deployment with 17 participants, who created 63 question sensors of interest to them. This study ran for four weeks, resulting in 937,228 labeled sensor question instances (i.e., answers). We investigated the types and sources of errors, including crowd labeling, user-defined questions, and machine learning classifiers. These errors were often interconnected: when users created questions that were difficult for crowd workers to answer, workers were more likely to answer incorrectly, which in turn provided poor training data for machine learning, ultimately leading to incorrect automated answers. Overall, this experience illuminated new challenges and opportunities in crowd-AI camera-based sensing. We synthesize our findings, which we hope will inform future work in this area.

Research Team: Anhong Guo, Anuraag Jain, Shomiron Ghose, Gierad Laput, Chris Harrison, and Jeffrey P. Bigham


Guo, A., Jain, A., Ghose, S., Laput, G., Harrison, C. and Bigham, J. 2018. Crowd-AI Camera Sensing in the Real World. In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies. UBICOMP ’18. 2, 3, Article 111 (September 2018). ACM, New York, NY. 20 pages. DOI: