David Ross
daross at gmail dot com
I co-lead the VIVID research group at Google DeepMind. Our goal is to advance video understanding and generation, and to amplify human capabilities with AI.
Previously, I led the YouTube Mix team that built the personalized algorithmic radio feature at the heart of YouTube Music.
I obtained my Ph.D. in Machine Learning and Computer Vision from the University of Toronto, Canada.
Google Scholar
|
LinkedIn
Publications
A complete list of my publications and patents is available at Google Scholar Citations.
|
MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation.
Oral Presentation
Sihyun Yu, Meera Hahn, Dan Kondratyuk, Jinwoo Shin, Agrim Gupta, José Lezama, Irfan Essa, David Ross, and Jonathan Huang
CVPR Workshop on AI for Content Creation, 2025
arXiv
|
Language-Guided Image Tokenization for Generation.
Oral Presentation
Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, and Xiuye Gu
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
arXiv
|
VideoPoet: A large language model for zero-shot video generation.
Best paper award!
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, and Lu Jiang
Proceedings of International Conference on Machine Learning (ICML), 2024
Google Research blog post,
VideoPoet project website,
Two Minute Papers video overview,
arXiv
Talk by Lijun Yu: ICML, SlidesLive, Slides
|
VideoPrism: A foundational visual encoder for video understanding.
Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, and others
Proceedings of International Conference on Machine Learning (ICML), 2024
arXiv
|
SceneCraft: An LLM agent for synthesizing 3D scenes as Blender code.
Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, and Alireza Fathi
Proceedings of International Conference on Machine Learning (ICML), 2024
arXiv
|
Video Foundation Models for Animal Behavior Analysis.
Jennifer J. Sun, Hao Zhou, Long Zhao, Liangzhe Yuan, Bryan Seybold, David Hendon, Florian Schroff, David A. Ross, Hartwig Adam, Bo Hu, and others
bioRxiv, 2024
|
REVEAL: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory.
Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, and Alireza Fathi
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
arXiv
|
IC3: Image Captioning by Committee Consensus.
David M Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, and John Canny
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
arXiv
|
DaTaSeg: Taming a universal multi-dataset multi-task segmentation model.
Xiuye Gu, Yin Cui, Jonathan Huang, Abdullah Rashwan, Xuan Yang, Xingyi Zhou, Golnaz Ghiasi, Weicheng Kuo, Huizhong Chen, Liang-Chieh Chen, and David A. Ross
Advances in Neural Information Processing Systems, 2023
OpenReview
|
AVIS: Autonomous visual information seeking with large language model agent.
Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David Ross, Cordelia Schmid, and Alireza Fathi
Advances in Neural Information Processing Systems, 2023
arXiv / Google AI blog
|
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation.
Lijun Yu, José Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alex Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David Ross, and Lu Jiang
ICLR, 2024
arXiv
|
3D mouse pose from single-view video and a new dataset.
Bo Hu, Bryan Seybold, Shan Yang, Avneesh Sud, Yi Liu, Karla Barron, Paulyn Cha, Marcelo Cosino, Ellie Karlsson, Janessa Kite, Ganesh Kolumam, Joseph Preciado, José Zavala-Solorio, Chunlian Zhang, Xiaomeng Zhang, Martin Voorbach, Ann E. Tovcimak, J. Graham Ruby, and David A. Ross
Scientific Reports, 2023
Download the dataset
|
UnLoc: a unified framework for video localization tasks.
Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, and Cordelia Schmid
International Conference on Computer Vision (ICCV), 2023
arXiv / open source implementation
|
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs.
Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alex Hauptmann, and Lu Jiang
NeurIPS, 2023
arXiv
|
Distribution Aware Metrics for Conditional Natural Language Generation.
David M Chan, Yiming Ni, Austin Myers, Sudheendra Vijayanarasimhan, David A Ross, and John Canny
arXiv preprint, 2022
arXiv
|
Open-vocabulary temporal action detection with off-the-shelf image-text features.
Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, and David A. Ross
arXiv preprint arXiv:2212.10596, 2022
arXiv
|
im2nerf: Image to Neural Radiance Field in the Wild.
Lu Mi, Abhijit Kundu, David Ross, Frank Dellaert, Noah Snavely, and Alireza Fathi
arXiv preprint, 2022
arXiv
|
What’s in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics.
David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, Bryan Seybold, John F. Canny
The 1st Workshop on Vision Datasets Understanding, at CVPR 2022
arXiv
|
Optical Mouse: 3D Mouse Pose From Single-View Video.
Bo Hu, Bryan Seybold, Shan Yang, David Ross, Avneesh Sud, Graham Ruby, and Yi Liu
CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling Workshop, at CVPR 2021
arXiv
|
AI Choreographer: Music Conditioned 3D Dance Generation with AIST++.
Ruilong Li, Shan Yang, David A. Ross, Angjoo Kanazawa
ICCV, 2021
arXiv / project website, dataset
|
Learning Video Representations from Textual Web Supervision.
Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid
arXiv, 2020
arXiv
|
Active Learning for Video Description With Cluster-Regularized Ensemble Ranking.
David Chan, Sudheendra Vijayanarasimhan, David Ross, John Canny
ACCV, 2020
arXiv / PDF / supplementary
|
An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds.
Rui Huang, Wanyue Zhang, Tom Funkhouser, Abhijit Kundu, David Ross, Caroline Pantofaru, Alireza Fathi
ECCV, 2020
arXiv
|
Pillar-based Object Detection for Autonomous Driving.
Yue Wang, Abhijit Kundu, Alireza Fathi, Caroline Pantofaru, David Ross, Justin Solomon, Tom Funkhouser
ECCV, 2020
arXiv
|
Virtual Multi-view Fusion for 3D Semantic Segmentation.
Abhijit Kundu, Xiaoqi (Michael) Yin, Alireza Fathi, Brew Barrington, David Ross, Tom Funkhouser, Caroline Pantofaru
ECCV, 2020
arXiv
|
The AVA-Kinetics Localized Human Actions Video Dataset.
Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman
arXiv, 2020
arXiv / project website
|
DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes.
Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, Vivek Rathod, Thomas Funkhouser, Caroline Pantofaru, David Ross, Larry S. Davis, Alireza Fathi
CVPR, 2020
arXiv
|
Speech2Action: Cross-modal Supervision for Action Recognition.
Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
CVPR, 2020
PDF / arXiv / project page, data
|
D3D: Distilled 3D Networks for Video Action Recognition.
Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar
WACV, 2020
arXiv / code and pre-trained models
|
Rethinking the Faster R-CNN Architecture for Temporal Action Localization.
Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A Ross, Jia Deng, Rahul Sukthankar
CVPR, 2018
arXiv / Google AI blog
|
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions.
Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik
CVPR, 2018
arXiv / project website / Google AI blog
|
On using nearly-independent feature families for high precision and confidence.
Omid Madani, Manfred Georg, David Ross
Machine Learning Journal, 2013
PDF
|
The Intervalgram: An audio feature for large-scale melody recognition.
Thomas C. Walters, David Ross, Richard F. Lyon
9th International Symposium on Computer Music Modeling and Retrieval (CMMR 2012)
PDF
|
On Using Nearly-Independent Feature Families for High Precision and Confidence.
Omid Madani, Manfred Georg, David Ross
4th Asian Conference on Machine Learning (ACML 2012)
PDF
|
Survey and Evaluation of Audio Fingerprinting Schemes for Mobile Query-by-Example Applications.
Vijay Chandrasekhar, Matt Sharifi, David Ross
12th International Society for Music Information Retrieval Conference (ISMIR 2011)
PDF
|
The Power of Comparative Reasoning.
Jay Yagnik, Dennis Strelow, David Ross, Ruei-Sung Lin
ICCV, 2011
PDF
|
Automatic Language Identification in Music Videos with Low Level Audio and Visual Features.
Vijay Chandrasekhar, Mehmet Emre Sargin, and David Ross
ICASSP, 2011
PDF
|
SPEC Hashing: Similarity Preserving algorithm for Entropy-based Coding.
Ruei-Sung Lin, David Ross, and Jay Yagnik
CVPR, 2010
PDF
|
Learning Articulated Structure and Motion.
David Ross, Daniel Tarlow, and Richard Zemel
International Journal of Computer Vision, 88 (2), 2010
PDF / project website
|
Learning Probabilistic Models for Visual Motion.
David Ross
Ph.D. Thesis, University of Toronto, Canada, 2008
PDF / videos
|
Unsupervised learning of skeletons from motion.
David Ross, Daniel Tarlow, and Richard Zemel
10th European Conference on Computer Vision (ECCV 2008), 2008
PDF / project website
|
Learning stick-figure models using nonparametric Bayesian priors over trees.
Edward Meeds, David Ross, Richard Zemel, and Sam Roweis
IEEE Conference on Computer Vision and Pattern Recognition, 2008
PDF
|
Learning Articulated Skeletons From Motion.
David Ross, Daniel Tarlow, and Richard Zemel
Workshop on Dynamical Vision at ICCV, 2007
PDF / project website
|
Incremental Learning for Robust Visual Tracking.
David Ross, Jongwoo Lim, Ruei-Sung Lin, Ming-Hsuan Yang
In the International Journal of Computer Vision, Special Issue: Learning for Vision, 2008
PS.GZ / PDF / project website
|
Inducing Features from Visual Noise.
Andrew Cohen, Richard Shiffrin, Jason Gold, David Ross, and Michael Ross
Journal of Vision, 7(8):15, 2007
PDF
|
Learning Parts-Based Representations of Data.
David Ross and Richard Zemel
Journal of Machine Learning Research, 7(Nov):2369-2397, 2006
PDF / project website
|
Combining Discriminative Features to Infer Complex Trajectories.
David Ross, Simon Osindero, and Richard Zemel
In Proceedings of the Twenty-Third International Conference on Machine Learning, 2006
PS.GZ / PDF / project website
|
Incremental Learning for Visual Tracking.
Jongwoo Lim, David Ross, Ruei-Sung Lin, Ming-Hsuan Yang
In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, MIT Press, 2005
PS.GZ / PDF / project website
|
Adaptive Discriminative Generative Model and Its Applications.
Ruei-Sung Lin, David Ross, Jongwoo Lim, Ming-Hsuan Yang
In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, MIT Press, 2005
PS.GZ / PDF / project website
|
Adaptive Probabilistic Visual Tracking with Incremental Subspace Update.
David Ross, Jongwoo Lim, Ming-Hsuan Yang
In T. Pajdla and J. Matas, editors, Proc. Eighth European Conference on Computer Vision (ECCV 2004), 2004
PS.GZ / PDF / project website
|
Multiple Cause Vector Quantization.
David Ross and Richard Zemel
In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, MIT Press, 2003
PS.GZ / PDF / project website
|
Learning Parts-Based Representations of Data (thesis version).
David Ross
University of Toronto, M.Sc. Thesis, 2003
PS.GZ / PDF / project website
|
BibTeX entries for all of the above are available here.