We propose a novel model for temporal detection and localization which allows the training of deep neural networks using only counts of event occur rences as training labels. This powerful weakly- supervised framework alleviates the burden of the imprecise and time-consuming process of annotating event locations in temporal data. Unlike existing methods, in which localization is explic itly achieved by design, our model learns local ization implicitly as a byproduct of learning to count instances. This unique feature is a direct consequence of the model’s theoretical properties. We validate the effectiveness of our approach in a number of experiments (drum hit and piano onset detection in audio, digit detection in images) and demonstrate performance comparable to that of fully-supervised state-of-the-art methods, despite much weaker training requirements.