Tete (Jason) Xiao [cv]
Undergrad [at] Peking University,
P.R. China (class of 2019)
jasonhsiao97 [at]
In Deep.
I am passionate about Computer Vision, Natural Language Processing and Machine Learning. I aim to combine the great power of vision and language.


Tete Xiao is an undergraduate student majoring in Intelligence Science at Peking University (PKU), one of the two best universities in China. He is now a Research Intern at the MIT-IBM Watson AI Lab, based in Cambridge, MA. He was a senior research intern at Megvii (Face++) Inc. (a leading Chinese AI start-up), mentored by Mr. Yuning Jiang and supervised by Dr. Jian Sun. He was also a research assistant at the Institute of Computer Science and Technology (ICST), Peking University, supervised by Dr. Yadong Mu. His research interests lie in the field of Computer Vision and Machine Learning, particularly in making intelligent agents understand the real world. He has received the China National Scholarship for his excellent performance in courses and research.

Tete started computer programming in primary school and has been participating in programming contests involving complicated algorithms and data structures since high school. In 2014 he was admitted to the Shandong Province Team as the first-place winner to compete in the National Olympiad in Informatics (NOI), where he won a bronze medal. Later, in college, he won two gold medals for PKU at the 2016 and 2017 ACM-ICPC Asia Regional Contests.


[Summer 2018] I will apply to Ph.D. programs in the U.S. and Europe this fall.
[Nov. 2018] One paper accepted to IJCV! Compared to the original CVPR paper, we included a variety of interesting applications on ADE20K plus the study of synchronized batch norm for semantic segmentation. Check out the paper for more details!
[Jul. 2018] Two papers (including one oral) accepted to ECCV18 in München, Germany! (Paper and code released!)
[May 2018] One paper accepted to COLING18 in Santa Fe, The Land of Enchantment! It's my very first vision-language paper! (Paper and code are released.)
[Apr. 2018] A flexible PyTorch implementation of scene parsing networks trained on ADE20K with SOTA performance is released in conjunction with MIT CSAIL.
[Feb. 2018] Two papers accepted to CVPR18 in Salt Lake City, Beehive State!
[Feb. 2018] I'm thrilled that I will be joining IBM Research as a Research Intern in AI, based in Cambridge, Massachusetts, this summer. During that time, besides the extraordinary researchers at IBM, I will also work with friends and scientists at MIT. See you soon!
[Oct. 2017] With the Megvii (Face++) team, we won the premier challenges for object detection, the COCO and Places Challenges 2017: 1st place in COCO Detection, COCO Keypoint and Places Instance Segmentation, as well as 2nd place in COCO Instance Segmentation. I was invited to present at the COCO & Places Joint Workshop at ICCV17 in Venice, Italy.
[Slides] [Media coverage]: ChinaNews (in Chinese)
[Feb. 2017] One paper accepted to CVPR17 in Honolulu, Aloha State! Wonderful experience working with my brilliant collaborator Jiayuan Mao.


July 2018 -
Research Intern @ MIT-IBM Watson AI Lab
Video Analytics
Collaborators: (IBM) Dr. Dan Gutfreund and Dr. Quanfu Fan;
(MIT) Dr. Aude Oliva and Dr. Bolei Zhou
July 2016 - July 2017
Research Assistant @ ICST, PKU
Machine Intelligence and Computer Vision
Supervisor: Dr. Yadong Mu
Dec. 2015 -
Research Intern @ Megvii (Face++) Inc.
Object Detection and Computer Vision
Supervisor: Dr. Jian Sun
Sept. 2015 -
Undergrad @ PKU
B.S. in Intelligence Science


Semantic Understanding of Scenes through the ADE20K Dataset

We present a densely annotated dataset, ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total, there are 25k images of complex everyday scenes containing a variety of objects in their natural spatial context. We construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both benchmarks and re-implement the state-of-the-art models as open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. We show that the networks trained on ADE20K are able to segment a wide variety of scenes and objects.

IJCV Springer
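The synchronized batch normalization point in the ADE20K abstract above can be illustrated with a minimal sketch in plain Python. This is not the released implementation: the function names and the toy activations are hypothetical, and it only shows the statistics step. The idea is that each device contributes partial sums, so the mean and variance match those of the full batch instead of each device's small slice.

```python
from typing import List, Tuple


def local_stats(values: List[float]) -> Tuple[float, float, int]:
    """Per-device partial sums: (sum, sum of squares, count)."""
    return sum(values), sum(v * v for v in values), len(values)


def synced_mean_var(per_device: List[List[float]]) -> Tuple[float, float]:
    """Combine partial sums from all devices into one mean and biased variance."""
    total = 0.0
    total_sq = 0.0
    count = 0
    for values in per_device:
        s, sq, n = local_stats(values)
        total += s
        total_sq += sq
        count += n
    mean = total / count
    var = total_sq / count - mean * mean
    return mean, var


def unsynced_mean_var(per_device: List[List[float]]) -> List[Tuple[float, float]]:
    """Without synchronization, each device only sees its own slice."""
    return [synced_mean_var([values]) for values in per_device]


# Hypothetical activations of one channel, two samples per device.
devices = [[1.0, 3.0], [5.0, 7.0]]
sync_mean, sync_var = synced_mean_var(devices)
per_device_stats = unsynced_mean_var(devices)
```

In this toy case the synchronized statistics describe all four samples, while the unsynchronized ones differ per device, which is one way to see why the effective batch size matters for normalization.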
Unified Perceptual Parsing for Scene Understanding

Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect objects inside, while also identifying the textures and surfaces of the objects along with their different compositional parts. In this paper, we study a new task called Unified Perceptual Parsing, which requires machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We benchmark our framework on Unified Perceptual Parsing and show that it is able to effectively segment a wide range of concepts from images. The trained networks are further applied to discover visual knowledge in natural scenes.
*: Equal contribution

ECCV2018 in München, Germany
Acquisition of Localization Confidence for Accurate Object Detection

Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression (NMS) to localize objects. While the probabilities for class labels naturally reflect classification confidence, localization confidence is absent. This causes properly localized bounding boxes to degenerate during iterative regression or even to be suppressed during NMS. We propose IoU-Net, which learns to predict the IoU between each detected bounding box and the matched ground-truth. The network acquires this confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, formulating the predicted IoU as an optimization objective provides monotonic improvement in bounding box localization. Extensive experiments on the MS-COCO dataset show the effectiveness of IoU-Net and its compatibility with and adaptivity to several state-of-the-art object detectors.
*: Equal contribution

ECCV2018 (Oral) in München, Germany
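For readers unfamiliar with the terms in the IoU-Net abstract above, the following is a simplified sketch of IoU and of NMS ranked by a chosen confidence. It is not the paper's full procedure: the boxes, scores, and function names are hypothetical, and the "predicted IoU" values are hard-coded here instead of coming from a network. It only illustrates how the ranking score decides which overlapping box survives.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], ranking_scores: Sequence[float],
        threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending ranking score and keep a box
    only if it does not overlap an already kept box above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: ranking_scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections of one object: box 0 is loose, box 1 is tighter.
boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 10.0, 10.0)]
class_scores = [0.9, 0.7]    # classification confidence
predicted_ious = [0.6, 0.9]  # assumed localization confidence

kept_by_class = nms(boxes, class_scores)
kept_by_localization = nms(boxes, predicted_ious)
```

Ranking by classification score keeps box 0 and suppresses box 1, while ranking by the assumed localization confidence keeps box 1, which is the kind of behavior the abstract describes.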
Learning Visually-Grounded Semantics from Contrastive Adversarial Samples

We study the problem of grounding distributional representations of texts on the visual domain, namely visual-semantic embeddings (VSE for short). Beginning with an insightful adversarial attack on VSE embeddings, we show the limitations of current frameworks and image-text datasets (e.g., MS-COCO). We show that the model is limited in its ability to establish the link between textual semantics and visual concepts. We alleviate this problem by augmenting the MS-COCO image captioning datasets with synthesized textual contrastive adversarial samples. The samples force the model to ground learned embeddings to concrete concepts within the image. This simple but powerful technique brings a noticeable improvement over the baselines on a diverse set of downstream tasks, in addition to defending against known-type adversarial attacks.
*: Equal contribution

COLING 2018 in Santa Fe, New Mexico
MegDet: A Large Mini-Batch Object Detector

Mini-batch size, a key factor in training, has not been well studied in previous works on object detection. In this paper, we propose a Large Mini-Batch Object Detector (MegDet) to enable training with a much larger mini-batch size than previous work (i.e., from 16 to 256), so that we are able to effectively utilize multiple GPUs (up to 128 in our experiments) to significantly accelerate the training procedure. Our detector is trained in a much shorter time yet achieves better accuracy. MegDet is the backbone of our submission (mmAP 52.5%) to the COCO 2017 Challenge, where we won 1st place in the Detection task.
*: Equal contribution

CVPR 2018 (Spotlight) in Salt Lake City, Utah
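The numbers in the MegDet abstract above imply a small per-GPU batch, which the short sketch below works out. The abstract does not say how the learning rate was chosen, so the linear scaling heuristic shown here is an assumption borrowed from common large-batch practice, and the base learning rate of 0.02 is a hypothetical value used only for illustration.

```python
def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    """Images per GPU when a mini-batch is split evenly across GPUs."""
    if num_gpus <= 0 or total_batch % num_gpus != 0:
        raise ValueError("total_batch must divide evenly across num_gpus")
    return total_batch // num_gpus


def linearly_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Assumed heuristic: scale the learning rate in proportion to batch size."""
    return base_lr * new_batch / base_batch


# Figures from the abstract: mini-batch 256 on up to 128 GPUs, baseline 16.
images_per_gpu = per_gpu_batch(256, 128)

# Hypothetical base learning rate for the mini-batch of 16.
scaled_lr = linearly_scaled_lr(0.02, 16, 256)
```

With these figures each GPU sees only two images per step, so statistics computed on a single GPU are based on very few samples.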
Repulsion Loss: Detecting Pedestrians in a Crowd

Detecting individual pedestrians in a crowd remains a challenging problem. In this paper, we explore how a state-of-the-art pedestrian detector is harmed by crowd occlusion and propose a novel crowd-robust regression loss specifically designed for crowd scenes. This loss is driven by two motivations: the attraction by target, and the repulsion by other surrounding objects. Our detector trained by repulsion loss outperforms all the state-of-the-art methods with a significant improvement in occlusion cases.

CVPR 2018 in Salt Lake City, Utah
What Can Help Pedestrian Detection?

Aggregating extra features has been considered an effective approach to boost traditional pedestrian detection methods. The first contribution of this paper is exploring this in CNN-based pedestrian detection frameworks by evaluating the effects of different kinds of extra features quantitatively. Moreover, we propose a novel network architecture, HyperLearner, to jointly learn pedestrian detection as well as the given extra feature. By multi-task training, HyperLearner is able to utilize the information of given features and improve detection performance without extra inputs in inference.
*: Equal contribution

CVPR 2017 in Honolulu, Hawaii


A Cascaded Fully Convolutional Neural Network for Object Detection
China (In process) CN201711161464
A Multi-step System for Fast and Efficient Face Detection
China (In process) CN201710855131
A Framework and Device for Object Detection; A Framework and Device for Training Neural Network
U.S. and China (In process) US15814125/CN201611161693
A Framework and Device for Pedestrian Detection
China (In process) CN201710064338.7
A Framework and Device for Collecting Large-scale Training Data
China (In process) CN201611010103.1