Perceptual Organisation in Humans and Animals

István Winkler, Georg Klump, Susan Denham


We conduct human, animal and modelling experiments to understand:

  • how information relating to different sound sources is organised on the fly to form stable representations of auditory objects;

  • what conditions lead to the formation of new auditory objects;

  • what conditions lead to the modification and maintenance of existing objects;

  • how the system deals with ambiguity;

  • what are the neural bases of perceptual sound organisation; and

  • what are the computational principles underlying auditory perception.


General introduction

Under normal circumstances, we experience the world as organized in terms of objects and their interactions. We impose this order on the wealth of information arriving at our sensory organs because our survival depends on correctly identifying those entities in our environment that allow us to satisfy our needs or that pose some danger to us. In the auditory modality, the set of brain processes that determine the probable sound sources and their behaviour in the environment has been termed Auditory Scene Analysis by Albert Bregman. The processing strategies that allow the brain to group together those parts of the auditory input that are likely to have been generated by the same object or event, and to segregate those of separate origin, have been extensively investigated (for recent reviews, see Snyder & Alain, 2007; Carlyon, 2004; and Ciocca, 2008).

Sound grouping strategies fall into two classes: concurrent (used to assign simultaneously active features to one or more objects) and sequential (used to form associations between discrete sound events). Harmonicity and common onset are the primary concurrent grouping cues. Sequential grouping is based on heuristic principles discovered almost a century ago by Gestalt psychologists, such as similarity (similar sounds are likely to have been emitted by the same source) and good continuation (many natural sources change their emission pattern gradually). Cues for concurrent grouping are typically processed very fast, and they provide an initial sense of the putative sources active in the environment. Sequential cues often require more complex processing, and they provide information about the behaviour of the sound sources. The two types of grouping processes interact in organizing the sound input. Ecologically this makes sense, as most informative sounds, especially communication sounds, are intermittent, and it is necessary to form associations between events that may be separated in time by fairly long intervals. Thus there is a trade-off between global and local decisions, and the global context constrains local decisions.

Sequential grouping has often been investigated using the auditory streaming paradigm (see Figure I below) to determine the physical parameters that govern the associations formed between alternating sounds. The importance of this approach is that the same sequence of sounds can be perceived in (typically two) different ways depending on the sequential grouping decision, and there are salient perceptual differences between the different groupings. For example, if all sounds illustrated in the figure below are currently assessed as belonging to the same coherent sound sequence (termed an auditory stream), then listeners perceive and report a galloping rhythm (the integrated percept); however, if the sounds marked red are evaluated as forming a separate stream from the sounds marked green, then the galloping rhythm is no longer heard, and one sound sequence pops into the perceptual foreground while the other falls into the background (the segregated percept). The latter perceptual phenomenon is termed auditory streaming. Virtually any type of detectable difference can trigger streaming. There is also a trade-off between featural differences and the time intervals between successive sounds, with shorter intervals increasing the tendency to report streaming. Auditory streams can be regarded as proto-object representations in the brain. They primarily describe the temporally synchronized behaviour of real-world sound sources. That is, when a sound source acts independently of other sources, the corresponding auditory stream matches the real-world object. However, when multiple sound sources act in a synchronized way (such as musical instruments playing an orchestral piece), the auditory stream corresponds to the group of sound sources.


Figure I. The auditory streaming paradigm. The same sequence of alternating sounds can be perceived as belonging to a single perceptual object (top) or to two separate objects (bottom), one occupying the foreground and the other the background.
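The timing structure of the paradigm can be made concrete with a short sketch. This is only an illustration, not the stimulus code used in our experiments: the function names, the 400 Hz base frequency, the 7-semitone separation and the 125 ms stimulus onset asynchrony (SOA) are all assumptions chosen for the example. It builds the event list of a repeating ABA- pattern (tone A, tone B, tone A, silent slot) and compares the onset intervals of the whole sequence with those of the A and B tones taken separately.

```python
def aba_triplet_sequence(n_triplets, base_freq_hz, delta_f_semitones, soa_s, tone_dur_s):
    """Return (onset_s, duration_s, frequency_hz, label) tuples for an ABA- sequence.

    Each ABA- cycle occupies four SOA slots: A, B, A and one silent slot.
    All parameter names and values are illustrative assumptions.
    """
    if tone_dur_s > soa_s:
        raise ValueError("tone duration must not exceed the SOA")
    freq_a = base_freq_hz
    # The B tone lies delta_f_semitones above the A tone.
    freq_b = base_freq_hz * 2.0 ** (delta_f_semitones / 12.0)
    events = []
    for k in range(n_triplets):
        cycle_start = k * 4 * soa_s
        slots = [("A", freq_a), ("B", freq_b), ("A", freq_a)]
        for slot, (label, freq) in enumerate(slots):
            events.append((cycle_start + slot * soa_s, tone_dur_s, freq, label))
    return events


def split_streams(events):
    """Split the event list into the A-tone and B-tone subsequences."""
    a_stream = [e for e in events if e[3] == "A"]
    b_stream = [e for e in events if e[3] == "B"]
    return a_stream, b_stream


def onset_intervals(events):
    """Return the intervals between successive onsets, in seconds."""
    onsets = [e[0] for e in events]
    return [round(later - earlier, 6) for earlier, later in zip(onsets, onsets[1:])]


if __name__ == "__main__":
    sequence = aba_triplet_sequence(3, 400.0, 7, 0.125, 0.075)
    a_tones, b_tones = split_streams(sequence)
    print("all tones:", onset_intervals(sequence))
    print("A tones:  ", onset_intervals(a_tones))
    print("B tones:  ", onset_intervals(b_tones))
```

With these values the full sequence has an uneven short-short-long interval pattern, which corresponds to the galloping rhythm of the integrated percept, whereas the A tones alone and the B tones alone are each evenly spaced, as in the segregated percept.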


For more information, please see our following publications.


Theoretical papers:


Winkler, I., Denham, S.L., & Nelken, I. (2009). Modeling the auditory scene: predictive regularity representations and perceptual objects. Trends in Cognitive Sciences, 13, 532-540.

  • A review proposing that auditory streams are derived from multi-scale regularities in the acoustic signal and that the regularity representations underlying auditory streams provide predictions for the likely continuation of the sound sequence. Alternative sound organizations are evaluated on the basis of the accuracy of the predictions of the corresponding regularity representations, with the best description appearing in perception.

Winkler, I. (2010). In search for auditory object representations. In I. Czigler & I. Winkler (Eds), Unconscious Memory Representations in Perception: Processes and Mechanisms in the Brain (pp. 71-106). John Benjamins: Amsterdam and Philadelphia.

  • A review of the event-related brain potential evidence showing that predictive representations of auditory regularities are formed and maintained in the human brain. It is shown that the characteristics of these representations match those expected of object representations.

Näätänen, R., Kujala, T., & Winkler, I. (2011). Auditory processing that leads to conscious perception: a unique window to central auditory processing opened by the mismatch negativity (MMN) and related responses. Psychophysiology, 48, 4-22.

  • A review of the event-related brain potential evidence separating conscious and non-conscious stimulus processing in the brain, with an updated box model of change and deviance detection processes.


Empirical studies:

Denham, S.L., Gyimesi, K., Stefanics, G., & Winkler, I. (under revision). Perceptual bi-stability in auditory streaming: How much do stimulus features matter?

  • We found that when listeners are presented with relatively long (4 minutes) sequences of the auditory streaming paradigm (ABA-… see General Introduction), perceptual switching occurs for a wide range of frequency separations and presentation rates (the ∆f-SOA [Stimulus Onset Asynchrony] space), even in regions of the parameter space previously thought to be stable. Moreover, two successive phases of auditory stream segregation were observed. The first phase consists of the first reported percept, which has a relatively long duration and whose statistics are sensitive to ∆f and ∆t. In the second phase, perception switches between alternative sound organisations and, as time goes on, the statistics of the perceived organisation become increasingly independent of the ∆f and ∆t parameters.

Denham, S.L., Gyimesi, K., Stefanics, G., & Winkler, I. (2010). Stability of perceptual organisation in auditory streaming. In E. A. Lopez-Poveda, A. R. Palmer & R. Meddis (Eds), Advances in Auditory Research: Physiology, Psychophysics and Models. Springer: New York, 477-488. 

  • Three types of cues were tested to find out what would stabilise perception in the auditory streaming paradigm (ABA-… see General Introduction). Neither jittering both ∆f and ∆t (the frequency separation and the inter-sound intervals), nor adding a location difference on top of the frequency separation, nor inserting silent intervals into the sequence significantly reduced perceptual switching.

Bendixen, A., Denham, S.L., Gyimesi, K., & Winkler, I. (2010). Regular patterns stabilize auditory streams. Journal of the Acoustical Society of America, 128, 3658–3666.

  • We show that regular temporal sound patterns are utilized in auditory scene analysis. It appears that the role of this cue lies in stabilizing streams once they have been formed on the basis of primary acoustic cues, rather than in the initial segregation of auditory streams.

Bendixen, A., Schröger, E., & Winkler, I. (2009). I heard that coming: Event-related potential evidence for stimulus-driven prediction in the auditory system. The Journal of Neuroscience, 29, 8447-8451.

  • We demonstrate that the auditory system prepares for processing fully predictable sounds. Electrical signals measured from the scalp for fully predictable sounds appeared to be very similar during the first 50 ms from the expected sound onset whether the sound was actually delivered or not.

Bendixen, A., Jones, S.J., Klump, G., & Winkler, I. (2010). Probability dependence and functional separation of object-related and mismatch negativity event-related potential components. Neuroimage, 50, 285-290.

  • We tested the effects of event probability on an event-related brain potential (ERP) indicator of concurrent object segregation, the Object Related Negativity (ORN). We found that the ORN amplitude was sensitive to the within-sequence probability of the occurrence of two concurrent objects. In addition, this ERP index of concurrent object segregation was distinguished from the Mismatch Negativity (MMN) ERP, which is an index of the detection of sequential rule violations and has been linked with the maintenance of auditory streams.

Horváth, J., Sussman, E., Winkler, I., & Schröger, E. (2011). Preventing distraction: assessing stimulus-specific and general effects of the predictive cueing of deviant auditory events. Biological Psychology, in press; doi:10.1016/j.biopsycho.2011.01.011.

  • We tested whether visual information is taken into account in predictive sound processing. We found that visual cues delivering stimulus-specific information about upcoming sounds are utilized in auditory stimulus processing and they allow the listener to avoid being distracted by the cued acoustic events.

Itatani, N., & Klump, G.M. (2009). Auditory streaming of amplitude-modulated sounds in the songbird forebrain. Journal of Neurophysiology, 101, 3212-3225.

  • We present results from multiunit recordings in the auditory forebrain of awake European starlings (Sturnus vulgaris) on the representation of sinusoidally amplitude modulated (SAM) tones to investigate the effect of temporal envelope structure on neural stream segregation.

Klinge, A., & Klump, G.M. (2009). Frequency difference limens of pure tones and harmonics within complex stimuli in Mongolian gerbils and humans. Journal of the Acoustical Society of America, 125, 304-314.

  • Gerbils are far less sensitive to frequency differences in pure tones than humans are. On the other hand they are far more sensitive to mistuned harmonics than humans, and appear to rely for their sensitivity more on temporal cues than humans do. The results are discussed with regard to possible processing mechanisms for pure tone frequency discrimination and for detecting mistuning in harmonic complex stimuli.

Klink, K.B., Dierker, H., Beutelmann, R., & Klump, G.M. (2010). Comodulation masking release determined in the mouse (Mus musculus) using a flanking-band paradigm. The Journal of the Association for Research in Otolaryngology, 11, 79-88.

  • Comodulation masking release (CMR) has been attributed to auditory processing within one auditory channel (within-channel cues) and/or across several auditory channels (across-channel cues). The present flanking-band (FB) experiment was designed to separate the amount of CMR due to within- and across-channel cues and to investigate the effect of temporal cues on the size of within-channel CMR. The auditory system of mice might be able to use the change in modulation depth at a beating frequency of 100 Hz as a cue for signal detection, while being unable to detect changes in modulation depth at high modulation frequencies. These results are consistent with other experiments and with model predictions for CMR in humans, which suggested that the main contribution to the CMR effect stems from the processing of within-channel cues.

Modelling work:

Denham, S.L., Dura-Bernal, S., Coath, M., & Balaguer-Ballester, E. (2010). Neurocomputational models of perceptual organization. In I. Czigler & I. Winkler (Eds), Unconscious Memory Representations in Perception: Processes and Mechanisms in the Brain (pp. 147-178). John Benjamins: Amsterdam and Philadelphia.

  • We consider models of perceptual organisation in the visual and auditory modalities and motivate our view of perception as a process of inference and verification through a number of specific examples.

Mill, R., Coath, M., Wennekers, T., Denham, S.L. (2010), 'Abstract Stimulus-Specific Adaptation Models', Neural Computation, 23(2): 435-76.

  • Stimulus-specific adaptation (SSA) refers to a decrease in the spiking of a neuron in response to a repetitive stimulus. We address the computational problem of SSA when inputs are encoded as Poisson spike trains. How should a system, biological or artificial, maximise its response to rare stimuli and minimise its response to common ones? Detailed treatment of this question will be helpful to others designing computational or hardware models of SSA that receive Poisson inputs.
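To give a feel for the problem, the following toy simulation shows one simple way in which a unit receiving Poisson inputs could respond more strongly to rare than to common stimuli. It is a hedged sketch under our own assumptions and does not reproduce the models analysed in the paper: the per-stimulus "resource" variable, the depletion and recovery constants, the mean spike count and all function names are hypothetical choices made for illustration.

```python
import math
import random


def poisson_sample(rate, rng):
    """Draw a Poisson-distributed count (Knuth's method, suitable for small rates)."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_ssa(stimuli, mean_spikes=10.0, depletion=0.5, recovery=0.05, seed=0):
    """Return one response value per stimulus for a toy adapting unit.

    Each stimulus identity drives its own input channel, which carries a
    resource level in [0, 1]. The response is the Poisson input count scaled
    by that level. Using a channel depletes its resource; every channel
    recovers a little towards 1 on each trial. All constants are assumptions.
    """
    rng = random.Random(seed)
    resources = {}
    responses = []
    for stim in stimuli:
        # Passive recovery of every channel seen so far.
        for key in resources:
            resources[key] += recovery * (1.0 - resources[key])
        level = resources.get(stim, 1.0)
        spikes = poisson_sample(mean_spikes, rng)
        responses.append(level * spikes)
        # The channel that was just used is depleted.
        resources[stim] = level * (1.0 - depletion)
    return responses


if __name__ == "__main__":
    # Oddball sequence: nine standards followed by one deviant, repeated.
    oddball = (["standard"] * 9 + ["deviant"]) * 20
    out = simulate_ssa(oddball)
    standards = [r for s, r in zip(oddball, out) if s == "standard"]
    deviants = [r for s, r in zip(oddball, out) if s == "deviant"]
    print("mean response to standards:", sum(standards) / len(standards))
    print("mean response to deviants: ", sum(deviants) / len(deviants))
```

In this sketch the frequently used channel stays depleted while the rarely used one has time to recover, so the average response to the rare stimulus is larger. The same input statistics with a different adaptation rule would be a different model, which is the kind of design question the paper treats in detail.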

Robert Mill, Tamás Bőhm, Alexandra Bendixen, István Winkler, and Susan L. Denham (2011). 'CHAINS: Competition and Cooperation between Fragmentary Event Predictors in a Model of Auditory Scene Analysis', 45th International Conference on Information Sciences and Systems, Johns Hopkins University, USA, 23-25 March 2011

  • This paper presents an algorithm called CHAINS for separating temporal patterns of events that are mixed together. The algorithm is motivated by the task the auditory system faces when it attempts to analyse an acoustic mixture to determine the sources that contribute to it, and in particular, sources that emit regular sequences.


Other models of sensory processing:

  • Hierarchical generative models and Bayesian belief propagation have been shown to provide a theoretical framework that can account for perceptual processes, including feedback modulation. The framework explains both psychophysical and physiological experimental data and maps well onto the hierarchical distributed cortical anatomy. We propose a novel methodology to implement selectivity and invariance using belief propagation on Bayesian networks, to combine feedback information from multiple parents, significantly reducing the number of parameters and operations, and to deal with loops using loopy belief propagation and different sampling methods.

Lanyon, L.L., Denham, S.L. (2010). ‘Modelling Visual Neglect: Computational Insights into Conscious Perception’, PLoS ONE, 5(6), e11128.

  • The aim of this work was to examine the effects of parietal and frontal lesion in an existing computational model of visual attention and search and simulate visual search behaviour under lesion conditions. We find that unilateral parietal lesion in this model leads to symptoms of visual neglect in simulated search scan paths, including an inhibition of return (IOR) deficit, while frontal lesion leads to milder neglect and to more severe deficits in IOR and perseveration in the scan path. 

Coath, M., Denham, S.L., Smith, L.M., Honing, H., Hazan, A., Holonowicz, P., Purwins, H. (2009). An auditory model for the detection of perceptual onsets and beat tracking in singing. Connection Science, 21:2, 193-205 

  • We describe a biophysically motivated model of auditory salience and show that the derived measure of salience can be used to identify the position of perceptual onsets in a musical stimulus, and track and predict rhythmic structure. 

  • Pitch is one of the most important features of natural sounds, underlying the perception of melody in music and prosody in speech. However, the temporal dynamics of pitch processing are still poorly understood. We describe a neurocomputational model, which provides for the first time a unified account of the multiple time scales observed in pitch perception. The model contains a hierarchy of integration stages and uses feedback to adapt the effective time scales of processing at each stage in response to changes in the input stimulus. The model has features in common with a hierarchical generative process and suggests a key role for efferent connections from central to sub-cortical areas in controlling the temporal dynamics of pitch processing.

Balaguer-Ballester, E., Denham, S.L., Meddis, R. (2008), 'A cascade autocorrelation model of pitch perception', J. Acoust. Soc. Am., 124(4), 2186-2195.

  • Autocorrelation algorithms, in combination with computational models of the auditory periphery, have been successfully used to predict the pitch of a wide range of complex stimuli. However, new stimuli are frequently offered as counterexamples to the viability of this approach. This study addresses the issue of whether in the light of these challenges the predictive power of autocorrelation can be preserved by changes to the peripheral model and the computational algorithm. 
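For readers unfamiliar with the approach, the basic idea of autocorrelation-based pitch estimation can be sketched in a few lines. This is a deliberately bare illustration and not the cascade model of the paper: it has no model of the auditory periphery, and the function names, the 8 kHz sample rate, the 80 to 500 Hz search range and the test stimulus are assumptions made for the example. The stimulus contains only harmonics 3, 4 and 5 of 200 Hz, so there is no energy at the fundamental itself.

```python
import math


def harmonic_complex(f0_hz, harmonics, sample_rate_hz, duration_s):
    """Return samples of a sum of equal-amplitude sine harmonics of f0_hz."""
    n_samples = int(sample_rate_hz * duration_s)
    return [
        sum(math.sin(2.0 * math.pi * h * f0_hz * t / sample_rate_hz) for h in harmonics)
        for t in range(n_samples)
    ]


def autocorrelation_pitch(signal, sample_rate_hz, min_hz=80.0, max_hz=500.0):
    """Estimate pitch as the sample rate divided by the best autocorrelation lag.

    The search range (min_hz, max_hz) is an illustrative assumption. The raw,
    unnormalised autocorrelation is used, which favours the shortest of
    several equally periodic lags because more samples overlap.
    """
    min_lag = int(sample_rate_hz / max_hz)
    max_lag = int(sample_rate_hz / min_hz)
    if max_lag >= len(signal):
        raise ValueError("signal too short for the requested pitch range")
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate_hz / best_lag


if __name__ == "__main__":
    # Harmonics 3, 4 and 5 of 200 Hz: the fundamental component is absent.
    tone = harmonic_complex(200.0, [3, 4, 5], 8000, 0.05)
    print("estimated pitch (Hz):", autocorrelation_pitch(tone, 8000))
```

The estimate falls at the missing fundamental because the waveform still repeats with the period of the fundamental. Stimuli for which such a simple scheme fails are the kind of counterexamples the study is concerned with.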