I'm an undergraduate student majoring in Intelligence Science at Peking University (PKU), one of the two preeminent universities for higher education in China. My research interests lie in the fields of Computer Vision, Machine Learning and Robotics, particularly in enabling intelligent agents to perceive, comprehend and reason as we humans do, as well as in understanding how intelligence truly emerges. I received the China National Scholarship in 2016 and was selected as a 2019 Snap Research Scholar for my research and coursework. [ Résumé ]
I was a research intern at the MIT-IBM Watson AI Lab in Cambridge, MA in 2018, during which I fell in love with the great city of Boston. Prior to that, I was a senior research intern at Megvii (Face++) Inc. (a leading Chinese AI start-up) for two and a half years, mentored by Mr. Yuning Jiang and supervised by Dr. Jian Sun. I was also a research assistant at the Institute of Computer Science and Technology (ICST), Peking University, supervised by Dr. Yadong Mu.
I started computer programming in primary school and have been participating in programming contests involving complicated algorithms and data structures since high school. I was admitted to the 2014 Shandong Province Team as the first-place winner to compete in the National Olympiad in Informatics (NOI), where I won a bronze medal. Later, in college, I won two gold medals for PKU at the 2016 and 2017 ACM-ICPC Asia Regional Contests.
I have a great fondness for the arts besides my academic work. I'm deeply passionate about music, especially classical music (both instrumental and vocal) and jazz. I'm also into literature, painting and movies. I travel constantly each year, mostly alone in the U.S. and European countries. My most wonderful journey so far was travelling from Berlin to Leipzig, Eisenach, Dresden, Bayreuth and Heidelberg, following in the footsteps of great composers such as J.S. Bach, Felix Mendelssohn, Franz Liszt and Richard Wagner. The journey ended in Paris, the city of art (and love), where writers like Ernest Hemingway and F. Scott Fitzgerald once lived.
News:
[May 2019] I will join the wonderful Berkeley Artificial Intelligence Research (BAIR) Lab as a Ph.D. student at the lovely UC Berkeley in August 2019. Go Bears!
[Nov. 2018] One paper accepted to IJCV! Compared to the original CVPR paper, we included a variety of interesting applications on ADE20K plus the study of synchronized batch norm for semantic segmentation. Check out the paper for more details!
[July 2018] Two papers (including one oral) accepted to ECCV18 in München, Germany! (Paper and code released!)
[May 2018] One paper accepted to COLING18 in Santa Fe, The Land of Enchantment! It's my very first vision-language paper! (Paper and code released.)
[Apr. 2018] A flexible PyTorch implementation of scene parsing networks trained on ADE20K with SOTA performance has been released in conjunction with MIT CSAIL.
[Feb. 2018] Two papers accepted to CVPR18 in Salt Lake City, Beehive State!
[Feb. 2018] I'm thrilled that I will be joining IBM Research as a Research Intern in AI based in Cambridge, Massachusetts this summer. During that time, besides the extraordinary researchers at IBM, I will also work with friends and scientists at MIT. See you soon!
[Oct. 2017] As part of the Megvii (Face++) team, we won the premier object detection challenges, the COCO and Places Challenges 2017: 1st place in COCO Detection, COCO Keypoint and Places Instance Segmentation, as well as 2nd place in COCO Instance Segmentation. I was invited to present at the COCO & Places Joint Workshop at ICCV17 in Venice, Italy.
[Slides]        [Media coverage]: ChinaNews (in Chinese)
[Feb. 2017] One paper accepted to CVPR17 in Honolulu, Aloha State! Wonderful experience working with my brilliant collaborator Jiayuan Mao.
We present a densely annotated dataset, ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total, there are 25k images of complex everyday scenes containing a variety of objects in their natural spatial context. We construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both benchmarks and re-implement the state-of-the-art models as open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. We show that the networks trained on ADE20K are able to segment a wide variety of scenes and objects.
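To make the synchronized batch normalization point above concrete, here is a minimal sketch of the core idea: statistics are computed over the samples on all devices together instead of separately on each device. The function name `synced_batch_stats` and the toy numbers are assumptions for illustration; a real implementation would exchange the sums between GPUs and is not shown here.

```python
def synced_batch_stats(per_device_batches):
    """Mean and biased variance over every sample on every device.

    per_device_batches: list of lists, one inner list of activations
    per (simulated) device. Only sums and counts are combined, which
    is what devices would need to exchange in a real setup.
    """
    count = sum(len(batch) for batch in per_device_batches)
    total = sum(sum(batch) for batch in per_device_batches)
    total_sq = sum(sum(x * x for x in batch) for batch in per_device_batches)
    mean = total / count
    var = total_sq / count - mean * mean
    return mean, var


# Two simulated devices with two samples each (toy values).
device_batches = [[1.0, 3.0], [5.0, 7.0]]
mean, var = synced_batch_stats(device_batches)

# Unsynchronized statistics would differ per device.
local_means = [sum(batch) / len(batch) for batch in device_batches]
```

In this toy case the per-device means are 2.0 and 6.0, while the synchronized mean is 4.0, so each device normalizes with statistics from an effective batch of four samples rather than two.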
Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect objects inside, while also identifying the textures and surfaces of the objects along with their different compositional parts. In this paper, we study a new task called Unified Perceptual Parsing, which requires the machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We benchmark our framework on Unified Perceptual Parsing and show that it is able to effectively segment a wide range of concepts from images. The trained networks are further applied to discover visual knowledge in natural scenes.
*: Equal contribution
Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression (NMS) to localize objects. While the probabilities for class labels naturally reflect classification confidence, localization confidence is absent. This causes properly localized bounding boxes to degenerate during iterative regression or even be suppressed during NMS. We propose IoU-Net, which learns to predict the IoU between each detected bounding box and the matched ground truth. The network thereby acquires a localization confidence, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, formulating the predicted IoU as an optimization objective provides monotonic improvement in bounding box localization. Extensive experiments on the MS-COCO dataset show the effectiveness of IoU-Net and its compatibility with and adaptivity to several state-of-the-art object detectors.
*: Equal contribution
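As a rough illustration of the IoU-Net idea above, the sketch below ranks boxes by a localization confidence instead of the classification score during NMS. It is a minimal sketch under stated assumptions, not the released implementation: the helpers `iou` and `iou_guided_nms`, the `(x1, y1, x2, y2)` box format, the 0.5 threshold and the sample values are all hypothetical, and `loc_conf` stands in for the IoU a trained network would predict.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_guided_nms(boxes, cls_scores, loc_conf, threshold=0.5):
    """Keep boxes in order of localization confidence (hypothetical helper).

    The kept box inherits the highest classification score among the
    boxes it suppresses, so a well-localized box is not penalized for
    having a slightly lower class score.
    """
    order = sorted(range(len(boxes)), key=lambda i: loc_conf[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        score = cls_scores[best]
        remaining = []
        for i in order:
            if iou(boxes[best], boxes[i]) > threshold:
                score = max(score, cls_scores[i])  # suppressed box
            else:
                remaining.append(i)
        order = remaining
        kept.append((boxes[best], score))
    return kept


# Toy example: the second box is better localized but has a lower class score.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
cls_scores = [0.9, 0.8, 0.7]
loc_conf = [0.6, 0.85, 0.75]  # stand-in for predicted IoU
result = iou_guided_nms(boxes, cls_scores, loc_conf)
```

With score-ranked NMS the first box would win and the second would be suppressed; ranking by localization confidence keeps the second box and carries over the 0.9 class score.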
We study the problem of grounding distributional representations of texts on the visual domain, namely visual-semantic embeddings (VSE for short). Beginning with an insightful adversarial attack on VSE embeddings, we show the limitations of current frameworks and image-text datasets (e.g., MS-COCO). We show that the model is limited in its ability to establish the link between textual semantics and visual concepts. We alleviate this problem by augmenting the MS-COCO image captioning datasets with synthesized textual contrastive adversarial samples. These samples force the model to ground learned embeddings to concrete concepts within the image. This simple but powerful technique brings a noticeable improvement over the baselines on a diverse set of downstream tasks, in addition to defending against known-type adversarial attacks.
*: Equal contribution
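The idea of textual contrastive adversarial samples can be sketched as follows: perturb one word of a caption so that it no longer matches the image, then use the perturbed captions as hard negatives in a ranking loss. This is only a toy sketch. The substitution table `SWAPS`, the helper names, the margin and the two-dimensional embeddings are all assumptions for illustration and are not the rules or the model used in the paper.

```python
import math

# Hypothetical one-word substitutions (toy stand-in for real generation rules).
SWAPS = {"dog": "cat", "two": "three", "on": "under"}


def contrastive_captions(caption):
    """Return captions that differ from the input by exactly one swapped word."""
    words = caption.split()
    out = []
    for i, word in enumerate(words):
        if word in SWAPS:
            out.append(" ".join(words[:i] + [SWAPS[word]] + words[i + 1:]))
    return out


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def hinge_ranking_loss(image_vec, pos_vec, neg_vecs, margin=0.2):
    """Sum of margin violations of negative captions against the true caption."""
    pos = cosine(image_vec, pos_vec)
    return sum(max(0.0, margin - pos + cosine(image_vec, n)) for n in neg_vecs)


negatives = contrastive_captions("a dog sits on a bench")

# Toy embeddings: a negative far from the image costs nothing,
# a negative embedded on top of the true caption violates the margin.
easy = hinge_ranking_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
hard = hinge_ranking_loss([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0]])
```

Because the perturbed captions differ from the original by a single word, a model can only separate them by attending to the concept that word names, which is the grounding pressure described above.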
Mini-batch size, a key factor in training, has not been well studied in previous work on object detection. In this paper, we propose a Large Mini-Batch Object Detector (MegDet) to enable training with a much larger mini-batch size than previous work (i.e., from 16 to 256), so that we are able to effectively utilize multiple GPUs (up to 128 in our experiments) to significantly accelerate the training procedure. Our detector is trained in a much shorter time yet achieves better accuracy. MegDet is the backbone of our submission (mmAP 52.5%) to the COCO 2017 Challenge, where we won 1st place in the Detection task.
*: Equal contribution
Detecting individual pedestrians in a crowd remains a challenging problem. In this paper, we explore how a state-of-the-art pedestrian detector is harmed by crowd occlusion and propose a novel crowd-robust regression loss specifically designed for crowd scenes. This loss is driven by two motivations: the attraction by target, and the repulsion by other surrounding objects. Our detector trained by repulsion loss outperforms all the state-of-the-art methods with a significant improvement in occlusion cases.
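The two motivations above, attraction to the target and repulsion from surrounding objects, can be sketched as a loss with two terms. This is a simplified illustration, not the loss defined in the paper: the function names, the use of smooth L1 on raw coordinates for attraction, the `-log(1 - IoG)` penalty for repulsion, and the weight `alpha` are all assumptions made for the example.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty on a scalar difference."""
    x = abs(x)
    return 0.5 * x * x if x < 1.0 else x - 0.5


def iog(pred, gt):
    """Intersection of pred and gt divided by the area of gt."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return iw * ih / gt_area if gt_area > 0 else 0.0


def crowd_loss(pred, target_gt, other_gts, alpha=0.5, eps=1e-6):
    """Attraction to the assigned target plus repulsion from other objects."""
    # Attraction: pull the predicted box toward its own ground truth.
    attraction = sum(smooth_l1(p - t) for p, t in zip(pred, target_gt))
    # Repulsion: penalize overlap with the most-overlapped non-target object.
    repulsion = 0.0
    if other_gts:
        worst = max(iog(pred, g) for g in other_gts)
        repulsion = -math.log(max(eps, 1.0 - worst))
    return attraction + alpha * repulsion


target = (0, 0, 10, 10)
neighbour = (10, 0, 20, 10)

perfect = crowd_loss((0, 0, 10, 10), target, [neighbour])
shifted_alone = crowd_loss((5, 0, 15, 10), target, [])
shifted_crowded = crowd_loss((5, 0, 15, 10), target, [neighbour])
```

A prediction that drifts toward the neighbouring pedestrian is penalized twice: once for leaving its target and once for covering the neighbour, so `shifted_crowded` exceeds `shifted_alone`.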
Aggregating extra features has been considered an effective approach to boosting traditional pedestrian detection methods. The first contribution of this paper is to explore this in CNN-based pedestrian detection frameworks by quantitatively evaluating the effects of different kinds of extra features. Moreover, we propose a novel network architecture, HyperLearner, to jointly learn pedestrian detection as well as the given extra feature. Through multi-task training, HyperLearner is able to utilize the information of the given features and improve detection performance without extra inputs at inference.
*: Equal contribution