I'm a staff research engineer at Snap Research NYC. My research interests focus on 3D human animation generation from various signals such as text, music, audio, and video, as well as human-centered video generation, including audio-conditioned lip-synced video generation and background video generation for animations. I also work on 3D avatar/animal reconstruction and animation, and on 3D/4D content creation. Additionally, I explore human-centered sensing, including sensor-based human pose tracking, hand gesture recognition from multimodal sensors, and data synthesis using generative models. I am passionate about bridging the gap between multimodal AI understanding and realistic human motion synthesis for applications in entertainment, AR/VR, and digital human creation.
SnapMoGen introduces a comprehensive dataset and framework for generating realistic human motions from rich, expressive text descriptions. Our approach addresses the challenge of creating diverse and contextually appropriate human movements by leveraging detailed textual annotations that capture nuanced motion characteristics. The system enables fine-grained control over motion generation, allowing users to specify complex movement patterns through natural language descriptions. This work represents a significant advancement in bridging the gap between textual understanding and physical motion synthesis, opening new possibilities for applications in animation, gaming, and virtual reality.
SceneMI tackles the challenging problem of generating realistic human motions that naturally interact with 3D environments. Our motion in-betweening approach enables seamless transitions between human poses while ensuring physically plausible interactions with scene objects and surfaces. The system understands spatial relationships and geometric constraints, generating motions that respect environmental boundaries and contact points. This work is particularly valuable for creating believable character animations in games, films, and virtual environments where humans must realistically navigate and interact with complex 3D scenes.
DuetGen revolutionizes the creation of synchronized two-person dance performances through advanced AI-driven choreography. Our innovative framework analyzes musical structure and rhythm to generate coordinated dance movements for pairs of dancers, ensuring both individual expression and seamless partner interaction. The system employs a sophisticated hierarchical approach that captures both global dance dynamics and fine-grained movement details. By understanding musical timing, tempo, and emotional content, DuetGen creates compelling duet performances that demonstrate natural coordination, creative choreography, and musical responsiveness across diverse dance styles and genres.
Inside-out tracking of human body poses using wearable sensors holds significant potential for AR/VR applications, such as remote communication through 3D avatars with expressive body language. Current inside-out systems often rely either on vision-based methods that use handheld controllers or on densely distributed body-worn IMU sensors. The former limits hands-free and occlusion-robust interactions, while the latter suffers from inadequate accuracy and jitter. We introduce a novel body tracking system, MI-Poser, which employs AR glasses and two wrist-worn electromagnetic field (EMF) sensors to achieve high-fidelity upper-body pose estimation while mitigating metal interference. Our lightweight system achieves a mean joint position error of 6.6 cm on real-world data collected from 10 participants. It remains robust across various upper-body movements and operates efficiently at 60 Hz. Furthermore, by incorporating an IMU sensor co-located with the EMF sensor, MI-Poser counteracts the effects of metal interference, which inherently disrupts the EMF signal during tracking. Our evaluation shows that the EMF-IMU fusion approach successfully detects and corrects interference across environments with diverse metal profiles. Ultimately, MI-Poser offers a practical pose tracking system, particularly suited for body-centric AR applications.
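For readers unfamiliar with the metric, mean joint position error is commonly computed as the average Euclidean distance between predicted and ground-truth 3D joint positions. The sketch below shows that common definition for a single frame. The function name and the sample coordinates are hypothetical, and how MI-Poser aggregates the error across frames and participants is not stated here.

```python
import math


def mean_joint_position_error(predicted, ground_truth):
    """Average Euclidean distance between matching 3D joints (same unit as input)."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty joint lists of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical two-joint example, coordinates in metres.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.03), (1.0, 0.04, 0.03)]
error_m = mean_joint_position_error(predicted, ground_truth)
print(f"{error_m * 100:.1f} cm")
```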
Finger gesture recognition is attracting growing research interest for interaction with wearable devices such as smartwatches and AR/VR headsets. In this paper, we propose AO-Finger, a hands-free, fine-grained finger gesture recognition system based on acoustic-optic sensor fusion. Specifically, we design a wristband with a modified stethoscope microphone and two high-speed optic motion sensors to capture signals generated by finger movements. We propose a set of natural, inconspicuous, and effortless micro finger gestures that can be reliably detected from the complementary signals of both sensors. We design a multi-modal CNN-Transformer model for fast gesture recognition (flick/pinch/tap), and a finger swipe contact detection model to enable fine-grained swipe gesture tracking. We built a prototype that achieves an overall accuracy of 94.83% in detecting fast gestures and enables fine-grained continuous swipe gesture tracking. AO-Finger is practical for use as a wearable device and ready to be integrated into existing wrist-worn devices such as smartwatches.
Fine-grained visual recognition for augmented reality enables dynamic presentation of the right set of visual instructions in the right context by analyzing the hardware state as the repair procedure evolves. (This work was published in IEEE ISMAR'20 and accepted to the IEEE TVCG special issue, 18 out of 302.)
We explore a novel interaction method that uses the bone-conducted sound generated by finger movements during gestures. This promising technology can be deployed on existing smartwatches as a low-power service at no additional cost.
While existing visual recognition approaches, which rely on 2D images to train their underlying models, work well for object classification, recognizing the changing state of a 3D object requires addressing several additional challenges. This paper proposes an active visual recognition approach to this problem, leveraging camera pose data available on mobile devices. With this approach, the state of a 3D object, which captures its appearance changes, can be recognized in real time. Our novel approach selects informative video frames filtered by 6-DOF camera poses to train a deep learning model to recognize object state. We validate our approach through a prototype for Augmented Reality-assisted hardware maintenance.
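One plausible reading of "frames filtered by 6-DOF camera poses" is that a frame is kept only when its camera pose differs enough from the poses of frames already kept, so that near-duplicate viewpoints are dropped. The sketch below illustrates that idea only. The function names, the thresholds, and the pose representation (a position in metres plus a unit quaternion in w, x, y, z order) are assumptions, not the paper's actual selection rule.

```python
import math


def rotation_angle_deg(q1, q2):
    """Angle in degrees between two orientations given as unit quaternions."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return math.degrees(2.0 * math.acos(min(1.0, dot)))


def select_frames(poses, min_shift_m=0.10, min_turn_deg=15.0):
    """Return indices of frames whose pose is novel relative to every kept frame.

    A pose counts as novel if it is shifted by at least min_shift_m or turned
    by at least min_turn_deg with respect to each frame kept so far.
    Both thresholds are illustrative assumptions.
    """
    kept = []
    for index, (position, quaternion) in enumerate(poses):
        is_novel = all(
            math.dist(position, poses[k][0]) >= min_shift_m
            or rotation_angle_deg(quaternion, poses[k][1]) >= min_turn_deg
            for k in kept
        )
        if is_novel:
            kept.append(index)
    return kept


# Hypothetical camera poses: (position, quaternion).
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
poses = [
    ((0.0, 0.0, 0.0), identity),        # first frame, always kept
    ((0.01, 0.0, 0.0), identity),       # almost the same viewpoint
    ((0.5, 0.0, 0.0), identity),        # camera moved half a metre
    ((0.0, 0.0, 0.0), quarter_turn_z),  # same spot, turned 90 degrees
]
selected = select_frames(poses)
print(selected)
```

A greedy pass like this is order-dependent. It is shown here because it is easy to run in real time on a mobile device, not because the source describes it.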
Acknowledgement: This work was done during my internship at IBM Research.
We propose a novel user authentication system EchoPrint, which leverages acoustics and vision for secure and convenient user authentication, without requiring any special hardware. EchoPrint actively emits almost inaudible acoustic signals from the earpiece speaker to “illuminate” the user's face and authenticates the user by the unique features extracted from the echoes bouncing off the 3D facial contour. Because the echo features depend on 3D facial geometries, EchoPrint is not easily spoofed by images or videos like 2D visual face recognition systems. It needs only commodity hardware, thus avoiding the extra costs of special sensors in solutions like FaceID.
EZ-Find provides a comprehensive solution for fast object finding and indoor navigation. The enabling techniques are computer vision, augmented reality, and mobile computing. The fast object finding feature enables instant identification of an object in clutter (e.g., a book or medicine on a shelf). Indoor navigation is essential for indoor location-based services (LBS) and will provide great convenience to people, especially in large-scale public places such as airports and train stations.
We propose BatTracker, which incorporates inertial and acoustic data for robust, high-precision, and infrastructure-free tracking in indoor environments. BatTracker leverages echoes from nearby objects and uses distance measurements from them to correct error accumulation in inertial-based device position prediction. It incorporates Doppler shifts and echo amplitudes to reliably identify the association between echoes and objects despite noisy signals from multi-path reflection and cluttered environments. A probabilistic algorithm creates, prunes, and evolves multiple hypotheses based on measurement evidence to accommodate uncertainty in device position. Experiments in real environments show that BatTracker can track a mobile device's movements in 3D space at sub-cm level accuracy, comparable to state-of-the-art infrastructure-based approaches, while eliminating the need for any additional hardware.
In this project, we propose BatMapper, which explores a previously untapped sensing modality, acoustics, for fast, fine-grained, and low-cost floor plan construction. We design sound signals suitable for heterogeneous microphones on commodity smartphones, and acoustic signal processing techniques to produce accurate distance measurements to nearby objects. We further develop robust probabilistic echo-object association, recursive outlier removal, and probabilistic resampling algorithms to identify the correspondence between distances and objects, and thus the geometry of corridors and rooms. We compensate for minute hand sway movements to identify small surface recessions, thus detecting doors automatically.
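The basic principle behind acoustic distance measurement can be sketched as follows: find the delay at which the emitted pulse best matches the recording, then convert the round-trip time into a one-way distance. This is a generic textbook matched-filter sketch, not BatMapper's actual signal design or processing pipeline. The pulse shape, the sample rate, the speed of sound, and the assumption that the recording starts at the moment of emission are all illustrative. It also ignores the direct speaker-to-microphone path, which a real system has to separate from the echoes.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees Celsius


def best_lag(pulse, recording):
    """Sample offset at which `pulse` correlates most strongly with `recording`."""
    best, best_score = 0, float("-inf")
    for lag in range(len(recording) - len(pulse) + 1):
        score = sum(p * recording[lag + i] for i, p in enumerate(pulse))
        if score > best_score:
            best, best_score = lag, score
    return best


def echo_distance_m(pulse, recording, sample_rate_hz):
    """One-way distance to the reflector, assuming recording[0] is emission time."""
    round_trip_s = best_lag(pulse, recording) / sample_rate_hz
    return SPEED_OF_SOUND_M_S * round_trip_s / 2.0


# Hypothetical example: an attenuated copy of the pulse returns after 280 samples.
sample_rate_hz = 48_000
pulse = [1.0, -1.0, 1.0, -1.0]
recording = [0.0] * 480
for i, value in enumerate(pulse):
    recording[280 + i] = 0.5 * value

distance = echo_distance_m(pulse, recording, sample_rate_hz)
print(f"{distance:.3f} m")
```

The division by two accounts for the sound travelling to the reflecting surface and back.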
The lack of floor plans is a fundamental obstacle to ubiquitous indoor location-based services. Recent work has made significant progress on accuracy, but it largely relies on slow crowdsensing that may take weeks or even months to collect enough data. In this paper, we propose Knitter, which can generate accurate floor maps from a single random user’s one-hour data collection effort, and we demonstrate how such maps can be used for indoor navigation. Knitter extracts high-quality floor layout information from single images, calibrates user trajectories, and filters outliers. It uses a multi-hypothesis map fusion framework that incrementally updates landmark positions/orientations and accessible areas according to evidence from each measurement.
[ Best Paper Award ]
[PDF][ Best Student Paper ]
[PDF][ Accepted to IEEE TVCG special issue, 18 out of 302, Acceptance rate 6%. ]