Tete (Jason) Xiao
Ph.D. Student
University of California, Berkeley
jasonhsiao97 [at]
In Deep.
Motto: "The next AI revolution will not be supervised."

Berkeley Artificial Intelligence Research


I'm a Ph.D. student in the Berkeley Artificial Intelligence Research (BAIR) Lab at the University of California, Berkeley, advised by Prof. Trevor Darrell. My research interests lie in computer vision and machine learning, particularly in enabling intelligent agents to perceive, comprehend, and reason as humans do, and in understanding how intelligence truly emerges without supervision. I graduated summa cum laude from Peking University (PKU), the most progressive university in China, in 2019 with a Bachelor's degree. I received the China National Scholarship in 2016 and was selected as a 2019 Snap Research Scholar for my research and coursework. [ Résumé ]

Before joining UC Berkeley, I collaborated with and learned from Dr. Bolei Zhou (now an Assistant Professor at CUHK) on a series of works. Prior to that, I was a senior research intern at Megvii (Face++) Inc. (a leading Chinese AI start-up) for two and a half years, mentored by Mr. Yuning Jiang and supervised by Dr. Jian Sun.

I started computer programming in primary school and have been participating in programming contests involving complicated algorithms and data structures since high school. I was admitted to the 2014 Shandong Province team as the first-place winner to compete in the National Olympiad in Informatics (NOI), where I won a bronze medal. Later, in college, I won two gold medals for PKU at the 2016 and 2017 ACM-ICPC Asia Regional Contests.

I have a great fondness for the arts besides my academic work. I'm deeply passionate about music, especially classical music (both instrumental and vocal) and jazz. I'm also into law and public policy, and I'm a political activist in both the U.S. and my home country.


[Jan. 2021] Two papers (including one selected for oral presentation) accepted to ICLR21!
[Mar. 2020] One paper on compositional action recognition accepted to CVPR20! Check out the project page!
[July 2019] One paper accepted to ICCV19 in Seoul, South Korea!
[May 2019] I will join the wonderful Berkeley Artificial Intelligence Research (BAIR) Lab as a Ph.D. student at the lovely UC Berkeley in August 2019. Go Bears!
[Nov. 2018] One paper accepted to IJCV! Compared to the original CVPR paper, we included a variety of interesting applications on ADE20K plus the study of synchronized batch norm for semantic segmentation. Check out the paper for more details!
[July 2018] Two papers (including one oral) accepted to ECCV18 in München, Germany! (Paper and code released!)
[May 2018] One paper accepted to COLING18! (Paper and code released!)
[Apr. 2018] A PyTorch implementation of scene parsing networks trained on ADE20K with state-of-the-art performance is released in conjunction with MIT CSAIL. Check out our code; it's popular!
[Feb. 2018] Two papers accepted to CVPR18!
[Oct. 2017] As a team member of Megvii (Face++), we won the premier challenges for object detection, the COCO and Places Challenges 2017: 1st place in COCO Detection, COCO Keypoint, and Places Instance Segmentation, as well as 2nd place in COCO Instance Segmentation. I was invited to present at the COCO & Places Joint Workshop at ICCV17 in Venice, Italy.
[Slides]        [Media coverage]: ChinaNews (in Chinese)


Sept. 2019 -
PhD @ UC Berkeley
Artificial Intelligence
July 2018 - Dec. 2018
Intern @ MIT-IBM Watson AI Lab
Video Analytics
Collaborators: (IBM) Dr. Dan Gutfreund and Dr. Quanfu Fan;
(MIT) Dr. Aude Oliva and Dr. Bolei Zhou
July 2016 - July 2017
Machine Intelligence and Computer Vision
Supervisor: Dr. Yadong Mu
Dec. 2015 - May 2018
Intern @ Megvii (Face++) Inc.
Computer Vision
Supervisor: Dr. Jian Sun
Sept. 2015 - June 2019
Undergrad @ PKU
B.S. (summa cum laude) in Intelligence Science


Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks

We study the compositionality of actions by looking into the dynamics of subject-object interactions. We propose a novel model that explicitly reasons about the geometric relations between constituent objects and an agent performing an action. We collect dense object box annotations on the Something-Something dataset, and propose a novel compositional action recognition task in which the training combinations of verbs and nouns do not overlap with the test set.
†: equal advising

CVPR20 in Seattle, Washington
Reasoning About Human-Object Interactions Through Dual Attention Networks

Objects are entities we act upon, and the functionality of an object is determined by how we interact with it. We propose a Dual Attention Network model that reasons about human-object interactions, in which object and action recognition mutually benefit each other. The model can also perform weak spatiotemporal localization and affordance segmentation, despite being trained only with video-level labels: it not only finds when an action is happening and which object is being manipulated, but also identifies which part of the object is being interacted with.

ICCV19 in Seoul, Korea
Semantic Understanding of Scenes through the ADE20K Dataset

We present a densely annotated dataset, ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. We construct benchmarks for scene parsing and instance segmentation, provide baseline performances on both, and release open-source re-implementations of the state-of-the-art models. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. We show that networks trained on ADE20K are able to segment a wide variety of scenes and objects.

IJCV (Springer)
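The synchronized batch norm finding above can be illustrated with a toy sketch: with multi-GPU training, plain BN normalizes each per-GPU shard using only that shard's statistics, whereas synchronized BN pools the mean and variance across devices first, giving an effectively larger normalization batch. A minimal NumPy sketch (function names are illustrative, not the paper's code):

```python
import numpy as np

def sync_batch_norm(shards, eps=1e-5):
    """Toy synchronized BN: pool mean/variance across all per-GPU
    shards before normalizing, instead of using per-shard statistics."""
    pooled = np.concatenate(shards, axis=0)   # stands in for an all-gather across devices
    mean = pooled.mean(axis=0)                # per-channel mean over the full batch
    var = pooled.var(axis=0)                  # per-channel variance over the full batch
    return [(s - mean) / np.sqrt(var + eps) for s in shards]

# Two "GPUs" whose shards have very different statistics:
shards = [np.random.randn(4, 3) + 5.0, np.random.randn(4, 3) - 5.0]
normalized = np.concatenate(sync_batch_norm(shards), axis=0)
# Because statistics are pooled, the combined output is zero-mean, unit-variance.
```

With per-shard statistics each shard would be whitened independently, hiding the large cross-device offset; pooling exposes it, which is the point of synchronization.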
Unified Perceptual Parsing for Scene Understanding

In this paper, we study a new task called Unified Perceptual Parsing, which requires machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We show that the network is able to effectively segment a wide range of concepts from images, and we further apply it to discover visual knowledge in natural scenes.
*: equal contribution

ECCV18 in München, Germany
Acquisition of Localization Confidence for Accurate Object Detection

Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression (NMS) to localize objects. While the probabilities for class labels naturally reflect classification confidence, localization confidence is absent. We propose IoU-Net, which learns to predict the IoU between each detected bounding box and its matched ground truth. Furthermore, we formulate the predicted IoU as an optimization objective and propose an operator that enables optimization via gradient ascent on the box coordinates.

ECCV18 (Oral) in München, Germany
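A minimal sketch of the refinement idea: treat an IoU score as a differentiable objective and ascend its gradient with respect to the four box coordinates. For self-containment, the "IoU predictor" below is the true IoU against a known ground truth and gradients are taken by finite differences; in the paper the predictor is a learned branch (all names here are illustrative):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def refine_box(box, score_fn, lr=2.0, steps=100, eps=1e-3):
    """Gradient-ascent refinement: nudge each coordinate uphill on
    score_fn (a stand-in for the learned IoU predictor)."""
    box = list(box)
    for _ in range(steps):
        grad = []
        for i in range(4):
            hi, lo = list(box), list(box)
            hi[i] += eps
            lo[i] -= eps
            # central finite difference along coordinate i
            grad.append((score_fn(hi) - score_fn(lo)) / (2 * eps))
        box = [c + lr * g for c, g in zip(box, grad)]
    return box

gt = (10.0, 10.0, 50.0, 50.0)   # matched ground truth
det = [14.0, 6.0, 46.0, 44.0]   # initial detection
refined = refine_box(det, lambda b: iou(b, gt))
# refined overlaps gt more tightly than det does
```

The same loop works with any differentiable scoring branch in place of the toy `score_fn`; NMS can then keep the box with the highest localization confidence rather than the highest classification score.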
Learning Visually-Grounded Semantics from Contrastive Adversarial Samples

We study the problem of grounding distributional representations of text in the visual domain, namely visual-semantic embeddings (VSE for short). Through adversarial attacks, we show that previous models fail to establish a robust link between textual semantics and visual concepts. We alleviate this problem by augmenting the MS-COCO image captioning dataset with synthesized contrastive adversarial text samples.
*: equal contribution

COLING18 in Santa Fe, New Mexico
MegDet: A Large Mini-Batch Object Detector

Mini-batch size, a key factor in training, has not been well studied in previous work on object detection. We propose a Large Mini-Batch Object Detector (MegDet) that enables training with a much larger mini-batch size than previous work. Our detector trains in much less time yet achieves better accuracy. MegDet was the backbone of our submission to the COCO 2017 Challenge, where we won 1st place in the Detection task.
*: equal contribution

CVPR18 (Spotlight) in Salt Lake City, Utah
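Large mini-batch training is commonly paired with a linearly scaled learning rate and a warmup period to keep early optimization stable; the sketch below shows that generic recipe, not the paper's exact schedule (MegDet's other key ingredient, cross-GPU batch normalization, is not shown, and all values here are illustrative):

```python
def large_batch_lr(step, base_lr=0.02, batch_scale=8, warmup_steps=500):
    """Generic large mini-batch schedule: scale the base learning rate
    linearly with the batch-size multiplier, and ramp up to it over a
    warmup period so early gradients do not destabilize training."""
    target = base_lr * batch_scale                  # linear scaling rule
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps   # linear warmup
    return target
```

For example, with an 8x larger batch the schedule climbs from near zero to `0.16` over the first 500 steps, then holds the scaled rate.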
Repulsion Loss: Detecting Pedestrians in a Crowd

Detecting individual pedestrians in a crowd remains a challenging problem. We explore how a state-of-the-art pedestrian detector is harmed by crowd occlusion and propose a novel crowd-robust regression loss specifically designed for crowd scenes.

CVPR18 in Salt Lake City, Utah
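At a high level, the repulsion loss augments the standard attraction term (pulling a proposal toward its designated target) with two repulsion terms: one pushing the proposal away from surrounding non-target ground-truth boxes, and one pushing predicted boxes with different targets away from each other, with coefficients balancing the three. Schematically (notation follows the paper's high-level form):

```latex
L = L_{\mathrm{Attr}} + \alpha \, L_{\mathrm{RepGT}} + \beta \, L_{\mathrm{RepBox}}
```

The repulsion terms are what make the regression crowd-robust: a box that drifts toward a neighboring pedestrian is penalized even if it still overlaps its own target.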
What Can Help Pedestrian Detection?

We explore various kinds of channel features to examine their effectiveness in CNN-based pedestrian detection frameworks, and propose a network architecture that jointly learns pedestrian detection and the given extra features.
*: equal contribution

CVPR17 in Honolulu, Hawaii


A Cascaded Fully Convolutional Neural Network for Object Detection
China (pending) CN201711161464
A Multi-step System for Fast and Efficient Face Detection
China (pending) CN201710855131
A Framework and Device for Object Detection; A Framework and Device for Training Neural Network
U.S. and China (pending) US15814125/CN201611161693
A Framework and Device for Pedestrian Detection
China (pending) CN201710064338.7
A Framework and Device for Collecting Large-scale Training Data
China (pending) CN201611010103.1