Fine-grained urban flow inference (FUFI) is a critical task in urban traffic services that benefits many fields, including intelligent transportation management, urban planning, and public safety. FUFI aims to infer fine-grained urban flows solely from observed coarse-grained data. However, existing methods typically require a large number of learnable parameters and complex network structures. To mitigate these drawbacks, we formulate a contrastive self-supervised method that predicts fine-grained urban flows while taking into account all correlated spatial and temporal contrastive patterns. Through several carefully designed self-supervised tasks, simple networks can capture high-level representations from flow data. We then propose a fine-tuning network that is combined with three pre-trained encoder networks. We evaluate our model on two real-world datasets and compare it with state-of-the-art methods. The empirical results not only show that our model outperforms the compared models, but also demonstrate its effectiveness in resource-limited environments.