I'm a Ph.D. student in the Berkeley Artificial Intelligence Research (BAIR) Lab at the University of California, Berkeley, advised by Prof. Trevor Darrell. My research interests lie in the fields of Computer Vision, Machine Learning and Robotics, particularly in enabling intelligent agents to perceive, comprehend and reason as we humans do, as well as in understanding how intelligence truly emerges without supervision. I graduated from Peking University (PKU) in 2019, the most progressive university in China, summa cum laude with a Bachelor's degree. I received the China National Scholarship in 2016 and was selected as a 2019 Snap Research Scholar for my research and coursework. [ Résumé ]
Before joining UC Berkeley, I collaborated with and learned from Dr. Bolei Zhou (now Assistant Professor at CUHK) on a series of works. Prior to that, I was a senior research intern at Megvii (Face++) Inc. (a leading Chinese AI start-up) for two and a half years, mentored by Mr. Yuning Jiang and supervised by Dr. Jian Sun.
I started computer programming in primary school and have been participating in programming contests involving complicated algorithms and data structures since high school. I was admitted to the 2014 Shandong Province Team as the first-place winner to compete in the National Olympiad in Informatics (NOI), where I won a bronze medal. Later in college, I won two gold medals for PKU in the 2016 and 2017 ACM-ICPC Asia Regional Contests.
I have a great fondness for the arts besides my academic work. I'm deeply passionate about music, especially classical music (both instrumental and vocal) and jazz. I'm also into literature and the visual arts. My most wonderful journey so far was travelling from Berlin to Leipzig, Eisenach, Dresden, Bayreuth and Heidelberg, following the traces of great composers such as J.S. Bach, Felix Mendelssohn, Franz Liszt and Richard Wagner. The journey ended in Paris, the city of art (and love), where writers like Ernest Hemingway and F. Scott Fitzgerald once lived.
I'm a political activist in both the U.S. and my home country.
News:
[Mar. 2020] One paper on compositional action recognition accepted to CVPR20 in Seattle, Washington! Check out the project page!
[July 2019] One paper accepted to ICCV19 in Seoul, South Korea!
[May 2019] I will join the wonderful Berkeley Artificial Intelligence Research (BAIR) Lab as a Ph.D. student at the lovely UC Berkeley in August 2019. Go Bears!
[Nov. 2018] One paper accepted to IJCV! Compared to the original CVPR paper, we included a variety of interesting applications on ADE20K plus the study of synchronized batch norm for semantic segmentation. Check out the paper for more details!
[July 2018] Two papers (including one oral) accepted to ECCV18 in München, Germany! (Paper and code released!)
[May 2018] One paper accepted to COLING18 in Santa Fe, the Land of Enchantment! It's my very first vision-language paper! (Paper and code released!)
[Apr. 2018] A PyTorch implementation of scene parsing networks trained on ADE20K with SOTA performance is released in conjunction with MIT CSAIL. Check out our code; it's popular!
[Feb. 2018] Two papers accepted to CVPR18 in Salt Lake City, Beehive State!
[Oct. 2017] As a team member of Megvii (Face++), we won the premier challenges for object detection, the COCO and Places Challenges 2017: 1st place in COCO Detection, COCO Keypoint and Places Instance Segmentation, as well as 2nd place in COCO Instance Segmentation. I was invited to present at the COCO & Places Joint Workshop at ICCV17 in Venice, Italy.
[Slides] [Media coverage]: ChinaNews (in Chinese)
We study the compositionality of action by looking into the dynamics of subject-object interactions. We propose a novel model which can explicitly reason about the geometric relations between constituent objects and an agent performing an action. We collect dense object box annotations on the Something-Something dataset. We propose a novel compositional action recognition task where the training combinations of verbs and nouns do not overlap with the test set.
†: equal advising
Objects are entities we act upon, and the functionality of an object is determined by how we interact with it. We propose a Dual Attention Network model which reasons about human-object interactions. Object and action recognition mutually benefit each other. The model can also perform weak spatiotemporal localization and affordance segmentation, despite being trained only with video-level labels. It not only finds when an action is happening and which object is being manipulated, but also identifies which part of the object is being interacted with.
We present a densely annotated dataset, ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. We construct benchmarks for scene parsing and instance segmentation, provide baseline performances on both, and release open-source re-implementations of the state-of-the-art models. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. We show that networks trained on ADE20K are able to segment a wide variety of scenes and objects.
In this paper, we study a new task called Unified Perceptual Parsing, which requires machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We show that the network is able to effectively segment a wide range of concepts from images. It is further applied to discover visual knowledge in natural scenes.
*: equal contribution
Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression (NMS) to localize objects. While the probabilities for class labels naturally reflect classification confidence, localization confidence is absent. We propose IoU-Net, which learns to predict the IoU between each detected bounding box and the matched ground truth. Furthermore, we formulate the predicted IoU as an optimization objective and propose an operator to enable optimization via gradient descent.
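For context, the quantity IoU-Net learns to predict is the standard intersection-over-union between a detected box and its ground truth. A minimal sketch (the `(x1, y1, x2, y2)` corner format is an assumption for illustration, not necessarily the paper's convention):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Because this score is a differentiable-enough target, a network head can regress it directly, giving each detection a localization confidence alongside its class score.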
We study the problem of grounding distributional representations of texts in the visual domain, namely visual-semantic embeddings (VSE for short). Through adversarial attacks, we show that previous models fail to establish a robust link between textual semantics and visual concepts. We alleviate this problem by augmenting the MS-COCO image captioning dataset with synthesized contrastive adversarial textual samples.
*: equal contribution
Mini-batch size, a key factor in training, has not been well studied in previous works on object detection. We propose a Large Mini-Batch Object Detector (MegDet) to enable training with much larger mini-batch sizes than previous work. Our detector trains in a much shorter time yet achieves better accuracy. MegDet is the backbone of our submission to the COCO 2017 Challenge, where we won 1st place in the Detection task.
*: equal contribution
Detecting individual pedestrians in a crowd remains a challenging problem. We explore how a state-of-the-art pedestrian detector is harmed by crowd occlusion, and propose a novel regression loss specifically designed to be robust in crowd scenes.
We explore various kinds of channel features to examine their impact on CNN-based pedestrian detection frameworks, and propose a network architecture that jointly learns pedestrian detection and the given extra features.
*: equal contribution