The loss function is an integral component of deep neural network training: it guides the optimization process by reducing all aspects of a model to a single number that should capture the overall learning objective. The maximum-likelihood parameter estimation principle has become the default framework for selecting loss functions, resulting in the prevalence of the cross-entropy for classification and the mean-squared error for regression (Goodfellow et al., 2016). Loss functions can, however, be further tailored to convey prior knowledge about the task or the dataset at hand to the training process (e.g., class imbalance (Huang et al., 2016a; Cui et al., 2019), perceptual consistency (Reed et al., 2014), and attribute awareness (Jiang et al., 2019)). Overall, loss functions that account for known priors provide more targeted supervision and often improve performance.

In this work, we focus on the ubiquitous prior of prediction sparsity, which underlies many applications involving probability estimation. More precisely, while the iterative nature of gradient descent often requires models to be able to output any probability estimate between 0 and 1 during training, the optimal solution to the optimization problem (w.r.t. the ground truth) is often sparse, with clear-cut probabilities close to either 0 or 1. For instance, in object detection, the decision of whether to keep or discard an estimated bounding box for the final prediction (e.g., in non-maximum suppression) is binary. Similarly, in music onset detection, the optimal predictions are sparse: only a few points in time should be assigned a high likelihood, while no probability mass should be allocated to any other timestep. In such applications, incorporating this important prior directly into the training process through the design of the loss function offers more tailored supervision that better captures the underlying objective.

To that end, this work introduces a novel loss function that relies on instance counting to achieve prediction sparsity. More precisely, as shown in the theoretical part of this work, modeling occurrence counts as a Poisson-binomial distribution yields a differentiable training objective with the unique intrinsic ability to drive probability estimates towards sparsity. Sparsity is thus not attained through an explicit sparsity-inducing operation, but is instead learned implicitly by the model as a byproduct of learning to count instances. We demonstrate that this objective can be leveraged as a standalone loss function (e.g., for the weakly-supervised learning of temporal localization) as well as a sparsity regularizer in conjunction with other, more targeted loss functions to enforce sparsity constraints in an end-to-end fashion. By design, the proposed approach finds use in the many applications where the optimal predictions are known to be sparse. We validate the loss function on a wide array of tasks, including weakly-supervised drum detection, piano onset detection, single-molecule localization microscopy, and robust event detection in videos and in wearable-sensor time series.
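To fix ideas, consider a minimal sketch of one natural instantiation of such a counting objective (the notation used here is illustrative; the precise formulation is developed in the theoretical part of this work). A network outputs per-instance probabilities $p_1, \dots, p_N$ over $N$ candidate locations (e.g., timesteps); the number of occurrences $C = \sum_{i=1}^{N} B_i$ with $B_i \sim \mathrm{Bernoulli}(p_i)$ then follows a Poisson-binomial distribution, and training minimizes the negative log-likelihood of the observed ground-truth count $c$:
\[
\mathcal{L}(p_1, \dots, p_N; c) \;=\; -\log \Pr(C = c), \qquad \Pr(C = c) = \pi_{N, c},
\]
where the probability mass function is evaluated exactly through the standard prefix recursion
\[
\pi_{n,k} \;=\; p_n \, \pi_{n-1,k-1} + (1 - p_n) \, \pi_{n-1,k}, \qquad \pi_{0,0} = 1, \quad \pi_{0,k} = 0 \;\text{for}\; k > 0.
\]
Since every step of the recursion is a sum of products of the $p_i$, this objective is differentiable with respect to the network outputs and can be minimized end-to-end by gradient descent; evaluating $\pi_{N,c}$ requires $O(Nc)$ operations.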
Overall, the experiments conducted in this work not only highlight the effectiveness and relevance of Poisson-binomial counting as a means of supervision, but also demonstrate that integrating prediction sparsity directly into the learning process can significantly improve generalization, noise robustness, and detection accuracy.