Community data resources that aggregate datasets across studies are critical infrastructure for modern biomedical research, enabling large-scale analysis and the development of Artificial Intelligence (AI) models. However, building these resources involves a fundamental tension: the desire for a large corpus is often at odds with the need for richness and quality in both data and metadata. We detail how the collaborative submission model, in which data contributors partner with dedicated resource curators, has enabled CZ CELLxGENE Discover to become a rapidly growing, widely used community resource for training and testing AI models, performing integrative analysis, validating findings, and generating hypotheses. This partnership combines contributors' intimate study knowledge with curators' focus on data reuse and expertise in standardization to improve data quality, metadata accuracy, and contextual richness. The model works by motivating researcher participation through tangible benefits while minimizing submission burden. We contrast this collaborative model with contributor-driven and resource-driven approaches, highlighting tradeoffs in scalability, quality assurance, and sustainability. The principles and practices we describe provide a framework for building sustainable, high-quality community data resources across diverse biological data types.
Hilton, J. A. et al. · CC-BY 4.0