David Ross

daross at gmail dot com

I co-lead the VIVID research group at Google DeepMind. Our goal is to advance video understanding and generation, and to amplify human capabilities with AI.

Previously I led the YouTube Mix team that built the personalized algorithmic radio feature at the heart of YouTube Music.

I obtained my Ph.D. in Machine Learning and Computer Vision from the University of Toronto, Canada.

Google Scholar | LinkedIn

Current Work

Our work on VideoPoet, a large language model for zero-shot video generation, won Best Paper at ICML 2024. See the great overview by Two Minute Papers, the Google Research blog post, and the VideoPoet website.

Some prior open source releases from my team: the AIST++ Human Motion dataset, TF Object Detection API for TensorFlow 2.x, and TF3D for 3D Scene Understanding.

The results of the 3rd AVA Action Detection challenge are available. This event was held at CVPR 2020, in partnership with the International Challenge on Activity Recognition (ActivityNet) workshop.
My talk "Context & Attention for Detecting Objects and Actions in Video" at the CVPR'20 LSHVU Tutorial is available on YouTube.

Our work on Capturing Special Video Moments with Google Photos was featured on the Google AI Blog.

Publications

A complete list of my publications and patents is available at Google Scholar Citations.

MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation. Oral Presentation
Sihyun Yu, Meera Hahn, Dan Kondratyuk, Jinwoo Shin, Agrim Gupta, José Lezama, Irfan Essa, David Ross, and Jonathan Huang
CVPR Workshop on AI for Content Creation, 2025
arXiv

Language-Guided Image Tokenization for Generation. Oral Presentation
Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, and Xiuye Gu
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
arXiv

VideoPoet: A large language model for zero-shot video generation. Best paper award!
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, and Lu Jiang
Proceedings of International Conference on Machine Learning (ICML), 2024
Google Research blog post, VideoPoet project website, Two Minute Papers video overview, arXiv
Talk by Lijun Yu: ICML, SlidesLive, Slides

VideoPrism: A foundational visual encoder for video understanding.
Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, and others
Proceedings of International Conference on Machine Learning (ICML), 2024
arXiv

SceneCraft: An LLM agent for synthesizing 3D scenes as Blender code.
Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, and Alireza Fathi
Proceedings of International Conference on Machine Learning (ICML), 2024
arXiv

Video Foundation Models for Animal Behavior Analysis.
Jennifer J. Sun, Hao Zhou, Long Zhao, Liangzhe Yuan, Bryan Seybold, David Hendon, Florian Schroff, David A. Ross, Hartwig Adam, Bo Hu, and others
bioRxiv, 2024

REVEAL: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory.
Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, and Alireza Fathi
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
arXiv

IC3: Image Captioning by Committee Consensus.
David M Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, and John Canny
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
arXiv

DaTaSeg: Taming a universal multi-dataset multi-task segmentation model.
Xiuye Gu, Yin Cui, Jonathan Huang, Abdullah Rashwan, Xuan Yang, Xingyi Zhou, Golnaz Ghiasi, Weicheng Kuo, Huizhong Chen, Liang-Chieh Chen, and David A. Ross
Advances in Neural Information Processing Systems, 2023
OpenReview

AVIS: Autonomous visual information seeking with large language model agent.
Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David Ross, Cordelia Schmid, and Alireza Fathi
Advances in Neural Information Processing Systems, 2023
arXiv / Google AI blog

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation.
Lijun Yu, José Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alex Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David Ross, and Lu Jiang
ICLR, 2024
arXiv

3D mouse pose from single-view video and a new dataset.
Bo Hu, Bryan Seybold, Shan Yang, Avneesh Sud, Yi Liu, Karla Barron, Paulyn Cha, Marcelo Cosino, Ellie Karlsson, Janessa Kite, Ganesh Kolumam, Joseph Preciado, José Zavala-Solorio, Chunlian Zhang, Xiaomeng Zhang, Martin Voorbach, Ann E. Tovcimak, J. Graham Ruby, and David A. Ross
Scientific Reports, 2023
Download the dataset

UnLoc: a unified framework for video localization tasks.
Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, and Cordelia Schmid
International Conference on Computer Vision (ICCV), 2023
arXiv / open source implementation

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs.
Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alex Hauptmann, and Lu Jiang
NeurIPS, 2023
arXiv

Distribution Aware Metrics for Conditional Natural Language Generation.
David M Chan, Yiming Ni, Austin Myers, Sudheendra Vijayanarasimhan, David A Ross, and John Canny
arXiv preprint, 2022
arXiv

Open-vocabulary temporal action detection with off-the-shelf image-text features.
Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, and David A. Ross
arXiv preprint arXiv:2212.10596, 2022
arXiv

im2nerf: Image to Neural Radiance Field in the Wild.
Lu Mi, Abhijit Kundu, David Ross, Frank Dellaert, Noah Snavely, and Alireza Fathi
arXiv preprint, 2022
arXiv

What’s in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics.
David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, Bryan Seybold, John F. Canny
The 1st Workshop on Vision Datasets Understanding, at CVPR 2022
arXiv

Optical Mouse: 3D Mouse Pose From Single-View Video.
Bo Hu, Bryan Seybold, Shan Yang, David Ross, Avneesh Sud, Graham Ruby, and Yi Liu
CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling Workshop, at CVPR 2021
arXiv

AI Choreographer: Music Conditioned 3D Dance Generation with AIST++.
Ruilong Li, Shan Yang, David A. Ross, Angjoo Kanazawa
ICCV, 2021
arXiv / project website, dataset

Learning Video Representations from Textual Web Supervision.
Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid
arXiv, 2020
arXiv

Active Learning for Video Description With Cluster-Regularized Ensemble Ranking.
David Chan, Sudheendra Vijayanarasimhan, David Ross, John Canny
ACCV, 2020
arXiv / PDF / supplementary

An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds.
Rui Huang, Wanyue Zhang, Tom Funkhouser, Abhijit Kundu, David Ross, Caroline Pantofaru, Alireza Fathi
ECCV, 2020
arXiv

Pillar-based Object Detection for Autonomous Driving.
Yue Wang, Abhijit Kundu, Alireza Fathi, Caroline Pantofaru, David Ross, Justin Solomon, Tom Funkhouser
ECCV, 2020
arXiv

Virtual Multi-view Fusion for 3D Semantic Segmentation.
Abhijit Kundu, Xiaoqi (Michael) Yin, Alireza Fathi, Brew Barrington, David Ross, Tom Funkhouser, Caroline Pantofaru
ECCV, 2020
arXiv

The AVA-Kinetics Localized Human Actions Video Dataset.
Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman
arXiv, 2020
arXiv / project website

DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes.
Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, Vivek Rathod, Thomas Funkhouser, Caroline Pantofaru, David Ross, Larry S. Davis, Alireza Fathi
CVPR, 2020
arXiv

Speech2Action: Cross-modal Supervision for Action Recognition.
Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
CVPR, 2020
PDF / arXiv / project page, data

D3D: Distilled 3D Networks for Video Action Recognition.
Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar
WACV, 2020
arXiv / code and pre-trained models

Rethinking the Faster R-CNN Architecture for Temporal Action Localization.
Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A Ross, Jia Deng, Rahul Sukthankar
CVPR, 2018
arXiv / Google AI blog

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions.
Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik
CVPR, 2018
arXiv / project website / Google AI blog

On using nearly-independent feature families for high precision and confidence.
Omid Madani, Manfred Georg, David Ross
Machine Learning Journal, 2013
PDF

The Intervalgram: An audio feature for large-scale melody recognition.
Thomas C. Walters, David Ross, Richard F. Lyon
9th International Symposium on Computer Music Modeling and Retrieval (CMMR 2012)
PDF

On Using Nearly-Independent Feature Families for High Precision and Confidence.
Omid Madani, Manfred Georg, David Ross
4th Asian Conference on Machine Learning (ACML 2012)
PDF

Survey and Evaluation of Audio Fingerprinting Schemes for Mobile Query-by-Example Applications.
Vijay Chandrasekhar, Matt Sharifi, David Ross
12th International Society for Music Information Retrieval Conference (ISMIR 2011)
PDF

The Power of Comparative Reasoning.
Jay Yagnik, Dennis Strelow, David Ross, Ruei-Sung Lin
ICCV, 2011
PDF

Automatic Language Identification in Music Videos with Low Level Audio and Visual Features.
Vijay Chandrasekhar, Mehmet Emre Sargin, and David Ross
ICASSP, 2011
PDF

SPEC Hashing: Similarity Preserving algorithm for Entropy-based Coding.
Ruei-Sung Lin, David Ross, and Jay Yagnik
CVPR, 2010
PDF

Learning Articulated Structure and Motion.
David Ross, Daniel Tarlow, and Richard Zemel
International Journal of Computer Vision, 88 (2), 2010
PDF / project website

Learning Probabilistic Models for Visual Motion.
David Ross
Ph.D. Thesis, University of Toronto, Canada, 2008
PDF / videos

Unsupervised learning of skeletons from motion.
David Ross, Daniel Tarlow, and Richard Zemel
10th European Conference on Computer Vision (ECCV 2008), 2008
PDF / project website

Learning stick-figure models using nonparametric Bayesian priors over trees.
Edward Meeds, David Ross, Richard Zemel, and Sam Roweis
IEEE Conference on Computer Vision and Pattern Recognition, 2008
PDF

Learning Articulated Skeletons From Motion.
David Ross, Daniel Tarlow, and Richard Zemel
Workshop on Dynamical Vision at ICCV, 2007
PDF / project website

Incremental Learning for Robust Visual Tracking.
David Ross, Jongwoo Lim, Ruei-Sung Lin, Ming-Hsuan Yang
In the International Journal of Computer Vision, Special Issue: Learning for Vision, 2008
PS.GZ / PDF / project website

Inducing Features from Visual Noise.
Andrew Cohen, Richard Shiffrin, Jason Gold, David Ross, and Michael Ross
Journal of Vision, 7(8):15, 2007
PDF

Learning Parts-Based Representations of Data.
David Ross and Richard Zemel
Journal of Machine Learning Research, 7(Nov):2369-2397, 2006
PDF / project website

Combining Discriminative Features to Infer Complex Trajectories.
David Ross, Simon Osindero, and Richard Zemel
In Proceedings of the Twenty-Third International Conference on Machine Learning, 2006
PS.GZ / PDF / project website

Incremental Learning for Visual Tracking.
Jongwoo Lim, David Ross, Ruei-Sung Lin, Ming-Hsuan Yang
In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, MIT Press, 2005
PS.GZ / PDF / project website

Adaptive Discriminative Generative Model and Its Applications.
Ruei-Sung Lin, David Ross, Jongwoo Lim, Ming-Hsuan Yang
In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, MIT Press, 2005
PS.GZ / PDF / project website

Adaptive Probabilistic Visual Tracking with Incremental Subspace Update.
David Ross, Jongwoo Lim, Ming-Hsuan Yang
In T. Pajdla and J. Matas, editors, Proc. Eighth European Conference on Computer Vision (ECCV 2004), 2004
PS.GZ / PDF / project website

Multiple Cause Vector Quantization.
David Ross and Richard Zemel
In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, MIT Press, 2003
PS.GZ / PDF / project website

Learning Parts-Based Representations of Data (thesis version).
David Ross
University of Toronto, M.Sc. Thesis, 2003
PS.GZ / PDF / project website

BibTeX entries for all of the above are available here.

Code

D3D: Distilled 3D Networks TensorFlow code and pre-trained model checkpoints can be found here.

AVA Atomic Visual Actions evaluation code can be found in the ActivityNet GitHub repo. Find the data here.

The source code for most of my older research projects is available for download here. Included are MATLAB implementations of a number of machine learning & computer vision algorithms, but there are also a few other hacks.

Parallel Computing: Here is some code I've written/modified, as well as some getting-started tips for parallel computing using MATLAB.

The code for the "Combining Discriminative Features" learning/tracking algorithm is available: cdf_2007-07-13.zip

Last Updated: May 2025.