Sports Field Registration via
Keypoints-aware Label Condition

Yen-Jui Chu

National Tsing Hua University

Jheng-Wei Su

National Tsing Hua University

Kai-Wen Hsiao

National Tsing Hua University

Chi-Yu Lien

National Tsing Hua University

Shu-Ho Fan

National Tsing Hua University

Min-Chun Hu

National Tsing Hua University

Ruen-Rone Lee

Delta Research Center

Chih-Yuan Yao

National Taiwan University of Science and Technology

Hung-Kuo Chu

National Tsing Hua University

CVPRW 2022


We propose a novel deep learning framework for sports field registration. The typical algorithmic flow for sports field registration involves extracting field-specific features (e.g., corners, lines, etc.) from the field image and estimating the homography matrix between a 2D field template and the field image using the extracted features. Unlike previous methods that strive to extract sparse field features from field images with uniform appearance, we tackle the problem differently. First, we use a grid of uniformly distributed keypoints as our field-specific features to increase the likelihood of having sufficient field features under various camera poses. Then we formulate the keypoint detection problem as an instance segmentation task with dynamic filter learning. In our model, the convolution filters are generated dynamically, conditioned on the field image and the associated keypoint identity, thus improving the robustness of the prediction results. To extensively evaluate our method, we introduce a new soccer dataset, called TS-WorldCup, with detailed field markings on 3812 time-sequence images from 43 videos of the 2014 and 2018 FIFA World Cups. The experimental results demonstrate that our method outperforms state-of-the-art methods on the TS-WorldCup dataset in both quantitative and qualitative evaluations.
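Once the grid keypoints are detected in the image, the homography between the 2D field template and the image can be estimated from the resulting point correspondences. The sketch below shows the core Direct Linear Transform (DLT) step on noise-free correspondences; in practice the correspondences are noisy and a robust estimator such as RANSAC (e.g., OpenCV's `cv2.findHomography` with `cv2.RANSAC`) wraps this step. The function name and setup here are illustrative, not the paper's implementation.

```python
import numpy as np

def estimate_homography_dlt(src_pts, dst_pts):
    """Estimate a 3x3 homography H such that dst ~ H @ src (in
    homogeneous coordinates) from N >= 4 point correspondences,
    using the Direct Linear Transform (DLT)."""
    assert len(src_pts) == len(dst_pts) and len(src_pts) >= 4
    A = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        # Each correspondence contributes two linear constraints on
        # the 9 entries of H (stacked row-major into a vector h).
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(A, dtype=float)
    # h spans the (approximate) null space of A: take the right
    # singular vector associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix scale so that H[2, 2] == 1

# Toy check: project template points with a known homography,
# then recover that homography from the correspondences.
H_true = np.array([[1.2, 0.1, 5.0],
                   [0.05, 0.9, -3.0],
                   [1e-4, 2e-4, 1.0]])
src = np.array([[0, 0], [100, 0], [0, 100], [100, 100], [50, 25]], float)
proj = np.c_[src, np.ones(len(src))] @ H_true.T
dst = proj[:, :2] / proj[:, 2:3]
H_est = estimate_homography_dlt(src, dst)
```

With exact correspondences the recovered `H_est` matches `H_true` up to numerical precision; with detected keypoints, RANSAC discards outlier correspondences before a final DLT fit on the inliers.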

Overview Video


Architecture overview. Our proposed model consists of two parts, namely the standard encoder-decoder architecture and the keypoints-aware label condition module. Given the field image Iin as input, we apply a symmetric encoder-decoder to extract feature maps from the encoder E and the decoder D. We then generate the parameters of the dynamic head S using the keypoint-specific controller G, which is fed the extracted output feature of the encoder E (green vector) and the keypoint encoding vector Ki (orange vector). The dynamic head S then outputs the i-th heatmap Hpredi. Finally, we employ soft aggregation to merge all the predicted heatmaps {Hpredi}Ni=1 into the final output {Mpredi}Ni=1, and estimate the predicted homography Rpred using DLT and RANSAC.
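The conditioning mechanism above can be sketched as follows: a controller maps the encoder's global feature together with a keypoint encoding to the parameters of a small per-keypoint head, which is then convolved with the decoder feature map, and the per-keypoint heatmaps are merged with a per-pixel softmax. This is a minimal NumPy illustration with random stand-in weights and a single 1x1-conv dynamic head; the actual controller, head depth, keypoint encoding, and soft-aggregation formula in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W, N = 8, 16, 16, 4                      # channels, map size, #keypoints
feat_map = rng.standard_normal((C, H, W))      # decoder D feature map (stand-in)
enc_vec = rng.standard_normal(C)               # global encoder E feature (stand-in)

def controller(enc_vec, keypoint_id, n_keypoints, in_channels):
    """Keypoint-specific controller G: maps the encoder feature
    concatenated with a one-hot keypoint encoding K_i to the weight
    and bias of a 1x1-conv dynamic head S. The single random linear
    layer is a hypothetical stand-in for the learned controller."""
    k_i = np.eye(n_keypoints)[keypoint_id]     # keypoint encoding K_i
    z = np.concatenate([enc_vec, k_i])
    W_ctrl = rng.standard_normal((in_channels + 1, z.size)) * 0.1
    params = W_ctrl @ z
    return params[:in_channels], params[in_channels]  # conv weight, bias

# Each keypoint gets its own dynamically generated head; the same
# decoder feature map is filtered N times with different parameters.
logits = []
for i in range(N):
    w, b = controller(enc_vec, i, N, C)
    logits.append(np.einsum("c,chw->hw", w, feat_map) + b)
logits = np.stack(logits)                      # (N, H, W) per-keypoint heatmaps

# Soft aggregation: merge the N heatmaps with a per-pixel softmax so
# each pixel holds a distribution over keypoint identities (one common
# way to merge instance heatmaps; the paper's exact formula may differ).
e = np.exp(logits - logits.max(axis=0, keepdims=True))
merged = e / e.sum(axis=0, keepdims=True)      # (N, H, W), sums to 1 per pixel
```

Generating the head's weights from both the image feature and the keypoint identity is what makes the filters "conditioned": the same backbone features are reinterpreted differently for each keypoint instance.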