I'm a staff research engineer at Snap Research NYC. My research focuses on 3D human animation generation from various signals such as text, music, audio, and video, as well as human-centered video generation, including audio-conditioned lip-synced video generation and background video generation for animations. I also work on 3D avatar/animal reconstruction and animation, and 3D/4D content creation. Additionally, I explore human-centered sensing, including sensor-based human pose tracking, hand gesture recognition from multimodal sensors, and data synthesis using generative models. I am passionate about bridging the gap between multimodal AI understanding and realistic human motion synthesis for applications in entertainment, AR/VR, and digital human creation.

SnapMoGen: Human Motion Generation from Expressive Texts

WebPage ArXiv'25 Code Dataset

SnapMoGen introduces a comprehensive dataset and framework for generating realistic human motions from rich, expressive text descriptions. Our approach addresses the challenge of creating diverse and contextually appropriate human movements by leveraging detailed textual annotations that capture nuanced motion characteristics. The system enables fine-grained control over motion generation, allowing users to specify complex movement patterns through natural language descriptions. This work represents a significant advancement in bridging the gap between textual understanding and physical motion synthesis, opening new possibilities for applications in animation, gaming, and virtual reality.

SceneMI: Motion In-betweening for Modeling Human-Scene Interaction

SceneMI tackles the challenging problem of generating realistic human motions that naturally interact with 3D environments. Our motion in-betweening approach enables seamless transitions between human poses while ensuring physically plausible interactions with scene objects and surfaces. The system understands spatial relationships and geometric constraints, generating motions that respect environmental boundaries and contact points. This work is particularly valuable for creating believable character animations in games, films, and virtual environments where humans must realistically navigate and interact with complex 3D scenes.

DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling

WebPage SIGGRAPH'25 Code Video

DuetGen generates synchronized two-person dance performances through AI-driven choreography. Our framework analyzes musical structure and rhythm to produce coordinated dance movements for pairs of dancers, ensuring both individual expression and seamless partner interaction. A hierarchical masked-modeling approach captures both global dance dynamics and fine-grained movement details. By understanding musical timing, tempo, and emotional content, DuetGen creates compelling duet performances with natural coordination, creative choreography, and musical responsiveness across diverse dance styles and genres.

MI-Poser: Human Body Pose Tracking Using Magnetic and Inertial Sensor Fusion with Metal Interference Mitigation

Inside-out tracking of human body poses using wearable sensors holds significant potential for AR/VR applications, such as remote communication through 3D avatars with expressive body language. Current inside-out systems often rely on vision-based methods utilizing handheld controllers or incorporating densely distributed body-worn IMU sensors. The former limits hands-free and occlusion-robust interactions, while the latter is plagued by inadequate accuracy and jittering. We introduce a novel body tracking system, MI-Poser, which employs AR glasses and two wrist-worn electromagnetic field (EMF) sensors to achieve high-fidelity upper-body pose estimation while mitigating metal interference. Our lightweight system achieves low error (6.6 cm mean joint position error) on real-world data collected from 10 participants. It remains robust against various upper-body movements and operates efficiently at 60 Hz. Furthermore, by incorporating an IMU sensor co-located with the EMF sensor, MI-Poser counteracts the effects of metal interference, which inherently disrupts the EMF signal during tracking. Our evaluation showcases the successful detection and correction of interference using our EMF-IMU fusion approach across environments with diverse metal profiles. Ultimately, MI-Poser offers a practical pose tracking system, particularly suited for body-centric AR applications. Watch the full video here.
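The EMF-IMU fusion idea can be sketched in a few lines: when the direction implied by the EMF sensor diverges from what the co-located IMU reports, the EMF reading is likely corrupted by nearby metal, and the tracker can fall back to the inertial prediction. This is a minimal illustration with hypothetical names (`detect_interference`, `fuse`), not the actual MI-Poser pipeline:

```python
import math

def angle_deg(v1, v2):
    """Angle in degrees between two 3D direction vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def detect_interference(emf_dir, imu_dir, thresh_deg=15.0):
    """Flag metal interference when the EMF-derived direction diverges
    from the co-located IMU's direction beyond a threshold."""
    return angle_deg(emf_dir, imu_dir) > thresh_deg

def fuse(emf_pos, imu_predicted_pos, interfered):
    """During interference, fall back to the inertial prediction;
    otherwise trust the drift-free EMF measurement."""
    return imu_predicted_pos if interfered else emf_pos
```

The key design point is that the two modalities fail differently: EMF is drift-free but metal-sensitive, while the IMU drifts but is immune to metal, so a disagreement between them is itself an interference detector.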

AO-Finger: Hands-free Fine-grained Finger Gesture Recognition via Acoustic-Optic Sensor Fusing

Finger gesture recognition is gaining great research interest for wearable device interactions such as smartwatches and AR/VR headsets. In this paper, we propose AO-Finger, a hands-free fine-grained finger gesture recognition system based on acoustic-optic sensor fusion. Specifically, we design a wristband with a modified stethoscope microphone and two high-speed optic motion sensors to capture signals generated by finger movements. We propose a set of natural, inconspicuous, and effortless micro finger gestures that can be reliably detected from the complementary signals of the two sensors. We design a multi-modal CNN-Transformer model for fast gesture recognition (flick/pinch/tap), and a finger swipe contact detection model to enable fine-grained swipe gesture tracking. Our prototype achieves an overall accuracy of 94.83% in detecting fast gestures and supports fine-grained continuous swipe gesture tracking. AO-Finger is practical as a wearable device and ready to be integrated into existing wrist-worn devices such as smartwatches.
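The actual recognizer is a multi-modal CNN-Transformer; as a toy illustration of why complementary modalities help, here is a decision-level fusion sketch (hypothetical names, not the paper's model) that averages per-modality class scores so that a gesture ambiguous to one sensor can be disambiguated by the other:

```python
def fuse_scores(acoustic, optic, w_acoustic=0.5):
    """Decision-level fusion: weighted average of per-modality
    class probabilities for the gesture set {flick, pinch, tap}."""
    return {g: w_acoustic * acoustic[g] + (1 - w_acoustic) * optic[g]
            for g in acoustic}

def classify(acoustic, optic):
    """Pick the gesture with the highest fused score."""
    fused = fuse_scores(acoustic, optic)
    return max(fused, key=fused.get)
```

For example, if the acoustic channel weakly favors "flick" but the optic channel strongly favors "pinch", the fused decision is "pinch". The real system fuses features inside the network rather than at the decision level.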

Fine-Grained Visual Recognition for AR Self-Assist Technical Support

Fine-grained visual recognition for augmented reality enables dynamic presentation of the right set of visual instructions in the right context by analyzing the hardware state as the repair procedure evolves. (This work was published in IEEE ISMAR'20 and accepted to the IEEE TVCG special issue, 18 out of 302 submissions.)

Swift Python TensorFlow iOS

Acoustic Sensing-based Gesture Detection for Wearable Device Interaction

We explore a novel method for interaction using bone-conducted sound generated by finger movements while performing gestures. This promising technology can be deployed on existing smartwatches as a low-power service at no additional cost.

Swift Python TensorFlow iOS

Active Visual Recognition in Augmented Reality

While existing visual recognition approaches, which rely on 2D images to train their underlying models, work well for object classification, recognizing the changing state of a 3D object requires addressing several additional challenges. This paper proposes an active visual recognition approach to this problem, leveraging camera pose data available on mobile devices. With this approach, the state of a 3D object, which captures its appearance changes, can be recognized in real time. Our novel approach selects informative video frames filtered by 6-DOF camera poses to train a deep learning model to recognize object state. We validate our approach through a prototype for Augmented Reality-assisted hardware maintenance.
Acknowledgement: This work was done during my internship at IBM Research.

Swift Python TensorFlow iOS

EchoPrint: Two-factor Authentication using Acoustics and Vision on Smartphones

We propose a novel user authentication system EchoPrint, which leverages acoustics and vision for secure and convenient user authentication, without requiring any special hardware. EchoPrint actively emits almost inaudible acoustic signals from the earpiece speaker to “illuminate” the user's face and authenticates the user by the unique features extracted from the echoes bouncing off the 3D facial contour. Because the echo features depend on 3D facial geometries, EchoPrint is not easily spoofed by images or videos like 2D visual face recognition systems. It needs only commodity hardware, thus avoiding the extra costs of special sensors in solutions like FaceID.

Java Python TensorFlow Android

EasyFind: Smart Device Controlled Laser Pointer for Fast Object Finding

EasyFind provides a comprehensive solution for fast object finding and indoor navigation, built on computer vision, augmented reality, and mobile computing. The fast object finding feature enables instant identification of an object in clutter (e.g., a book or medicine on a shelf). Indoor navigation is essential for indoor location-based services and provides great convenience, especially in large-scale public places such as airports and train stations.

Swift Python Raspberry Pi iOS

BatTracker: High Precision Infrastructure-free Mobile Device Tracking in Indoor Environments

We propose BatTracker, which combines inertial and acoustic data for robust, high-precision, infrastructure-free tracking in indoor environments. BatTracker leverages echoes from nearby objects and uses distance measurements to them to correct error accumulation in inertial-based device position prediction. It incorporates Doppler shifts and echo amplitudes to reliably associate echoes with objects despite noisy signals from multi-path reflections and cluttered environments. A probabilistic algorithm creates, prunes, and evolves multiple hypotheses based on measurement evidence to accommodate uncertainty in the device position. Experiments in real environments show that BatTracker tracks a mobile device's movements in 3D space at sub-cm accuracy, comparable to state-of-the-art infrastructure-based approaches, while eliminating the need for any additional hardware.
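The multi-hypothesis step can be sketched as follows: reweight candidate device positions by how well they explain the echo distance measurements, then prune the weakest. This is a simplification with a hypothetical `update_hypotheses`, not the paper's full algorithm (which also creates and evolves hypotheses over time):

```python
def update_hypotheses(hypotheses, likelihood, prune_frac=0.1):
    """One filter step: weight each candidate device position by the
    measurement likelihood, normalize, and drop hypotheses whose weight
    falls below a fraction of the strongest one."""
    weighted = [(h, likelihood(h)) for h in hypotheses]
    total = sum(w for _, w in weighted) or 1.0
    weighted = [(h, w / total) for h, w in weighted]
    cutoff = prune_frac * max(w for _, w in weighted)
    return [(h, w) for h, w in weighted if w >= cutoff]
```

Keeping multiple weighted hypotheses, instead of committing to a single position estimate, is what lets the tracker survive ambiguous echo-object associations until later measurements disambiguate them.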

Java Python Android

BatMapper: Acoustic Sensing Based Indoor Floor Plan Construction Using Smartphones

In this project, we propose BatMapper, which explores a previously untapped sensing modality - acoustics - for fast, fine-grained, and low-cost floor plan construction. We design sound signals suitable for the heterogeneous microphones on commodity smartphones, and acoustic signal processing techniques that produce accurate distance measurements to nearby objects. We further develop robust probabilistic echo-object association, recursive outlier removal, and probabilistic resampling algorithms to identify the correspondence between distances and objects, and thus the geometry of corridors and rooms. We compensate for minute hand-sway movements to identify small surface recessions, thus detecting doors automatically.
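The core distance measurement boils down to acoustic time of flight: locate the emitted chirp's echo in the recording via cross-correlation and convert the round-trip delay into a one-way distance. A minimal sketch under assumed parameters (48 kHz sampling, 343 m/s speed of sound; function names are illustrative, and a real recording would also contain the direct-path signal and multiple overlapping echoes):

```python
SPEED_OF_SOUND = 343.0  # m/s at room temperature

def echo_delay_samples(chirp, recording):
    """Locate the echo by the peak of the cross-correlation between
    the emitted chirp and the microphone recording."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(recording) - len(chirp) + 1):
        score = sum(c * recording[lag + i] for i, c in enumerate(chirp))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def distance_metres(delay_samples, sample_rate=48000):
    """Round-trip time of flight -> one-way distance to the reflector."""
    return SPEED_OF_SOUND * (delay_samples / sample_rate) / 2.0
```

At 48 kHz, one sample of delay corresponds to roughly 3.6 mm of one-way distance, which is why commodity smartphone microphones can support fine-grained geometry estimation at all.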

Java Python MATLAB Android

Knitter: Fast, Resilient Single-User Indoor Floor Plan Construction

The lack of floor plans is a fundamental obstacle to ubiquitous indoor location-based services. Recent work has made significant progress on accuracy, but it largely relies on slow crowdsensing that may take weeks or even months to collect enough data. In this paper, we propose Knitter, which can generate accurate floor maps from a single random user's one-hour data collection effort, and we demonstrate how such maps can be used for indoor navigation. Knitter extracts high-quality floor layout information from single images, calibrates user trajectories, and filters outliers. It uses a multi-hypothesis map fusion framework that updates landmark positions/orientations and accessible areas incrementally according to evidence from each measurement.

Java Swift MATLAB Android iOS

ICCV'25
SceneMI: Motion In-betweening for Modeling Human-Scene Interaction.
Inwoo Hwang, Bing Zhou*, Young Min Kim, Jian Wang, Chuan Guo*
In ICCV (Highlight), 2025. [* Co-corresponding and co-mentor]
ICCV'25
Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation.
Shaowei Liu, Chuan Guo, Bing Zhou, Jian Wang
In ICCV, 2025.
SIGGRAPH'25
DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling.
Anindita Ghosh, Bing Zhou*, Rishabh Dabral, Jian Wang, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek, Chuan Guo*
In Proc. SIGGRAPH, 2025. [* Co-corresponding and co-mentor]
[PDF]
ArXiv'25
SnapMoGen: Human Motion Generation from Expressive Texts.
Chuan Guo, Inwoo Hwang, Jian Wang, Bing Zhou
In ArXiv, 2025.
IMWUT'23
MI-Poser: Human Body Pose Tracking using Magnetic and Inertial Sensor Fusion with Metal Interference Mitigation.
Riku Arakawa, Bing Zhou*, Gurunandan Krishnan, Mayank Goel, and Shree Nayar
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies. [* Corresponding author]
[PDF]
IMWUT'23
N-euro Predictor: A Neural Network Approach for Smoothing and Predicting Motion Trajectory.
Qijia Shao, Jian Wang, Bing Zhou, Vu An Tran, Gurunandan Krishnan and Shree Nayar
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies.
[PDF]
CHI'23
AO-Finger: Hands-free Fine-grained Finger Gesture Recognition via Acoustic-Optic Sensor Fusing.
Chenhan Xu, Bing Zhou*, Gurunandan Krishnan and Shree Nayar
The ACM CHI Conference on Human Factors in Computing Systems. [* Corresponding author]
[PDF]
HEALTH'22
Passive and Context-Aware In-Home Vital Signs Monitoring Using Co-Located UWB-Depth Sensor Fusion.
Zongxing Xie, Bing Zhou, Xi Cheng, Elinor Schoenfeld and Fan Ye
ACM Transactions on Computing for Healthcare.

[PDF]
ICHI'21
VitalHub: robust, non-touch multi-user vital signs monitoring using depth camera-aided UWB.
Zongxing Xie, Bing Zhou, Xi Cheng, Elinor Schoenfeld and Fan Ye
2021 IEEE International Conference on Healthcare Informatics (ICHI).

[ Best Paper Award ]

[PDF]
ACM-BCB'21
Signal quality detection towards practical non-touch vital sign monitoring.
Zongxing Xie, Bing Zhou, Fan Ye
Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics.

[ Best Student Paper ]

[PDF]
TMC'21
Robust human face authentication leveraging acoustic sensing on smartphones.
Bing Zhou, Zongxing Xie, Yinuo Zhang, Jay Lohokare, Ruipeng Gao, and Fan Ye
IEEE Transactions on Mobile Computing.
[PDF]
ISMAR'20
Fine-grained visual recognition in mobile augmented reality for technical support.
Bing Zhou, Sinem Guven Kaya
IEEE International Symposium on Mixed and Augmented Reality.

[ Accepted to IEEE TVCG special issue, 18 out of 302, Acceptance rate 6%. ]

[PDF]
ICC'19
Multi-Modal Face Authentication using Deep Visual and Acoustic Features
Bing Zhou, Zongxing Xie, Fan Ye
IEEE International Conference on Communications.
[PDF]
TMC'19
Towards Scalable Indoor Map Construction and Refinement using Acoustics on Smartphones
Bing Zhou, Mohammed Elbadry, Ruipeng Gao, Fan Ye
IEEE Transactions on Mobile Computing.
[PDF]
MobiCom'18
EchoPrint: Two-factor Authentication using Vision and Acoustics on Smartphones
Bing Zhou, Jay Lohokare, Ruipeng Gao, Fan Ye
[PDF]
MobiCom'18 (Poster)
Pose-assisted Active Visual Recognition in Mobile Augmented Reality
Bing Zhou, Sinem Guven, Shu Tao, Fan Ye
[PDF]
MobiCom'18 (Poster)
A Raspberry Pi Based Data-Centric MAC for Robust Multicast in Vehicular Network
Mohammed Elbadry, Bing Zhou, Fan Ye, Peter Milder, YuanYuan Yang
[PDF]
TMC'18
Fast and Resilient Indoor Floor Plan Construction with a Single User
Ruipeng Gao, Bing Zhou, Fan Ye, Yizhou Wang
IEEE Transactions on Mobile Computing.
[PDF]
SenSys'17
BatTracker: High Precision Infrastructure-free Mobile Device Tracking in Indoor Environments
Bing Zhou, Mohammed Salah, Ruipeng Gao, Fan Ye
[PDF]
MobiCom'17 (Demo)
Demo: Acoustic Sensing Based Indoor Floor Plan Construction Using Smartphones
Bing Zhou, Mohammed Salah, Ruipeng Gao, Fan Ye
[PDF]
MobiSys'17
BatMapper: Acoustic Sensing Based Indoor Floor Plan Construction Using Smartphones
Bing Zhou, Mohammed Salah, Ruipeng Gao, Fan Ye
[PDF]
ICC'17
Explore hidden information for indoor floor plan construction
Bing Zhou, Fan Ye
[PDF]
INFOCOM'17
Knitter: Fast, Resilient Single-User Indoor Floor Plan Construction
Ruipeng Gao (co-primary), Bing Zhou (co-primary), Fan Ye, Yizhou Wang
[PDF]
Others
For the full publication list, please refer to my Google Scholar.
Senior Research Engineer Oct 2021 - Present

Research Staff Member May 2019 - Oct 2021

Research Intern Summer 2018

Stony Brook University

Teaching Assistant Fall 2014-Spring 2015