(1) To simply download the dataset as a zipfile, please click the "Download Full Dataset" button below. It will download a zipped folder with all the sounds, norms and data, along with the accompanying scripts and a README file.
(2) To better understand the dataset, we recommend reading through the sections below, which discuss the sounds, their acoustics, semantics, and norms in more detail. Each section can be expanded by clicking the "Read more" button, and each contains a corresponding "Download" button for the data discussed in that section.
Sound categories included in the database were selected based on the sounds' overall ecological frequency of occurrence (Ballas, 1993) and membership in distinct subclasses of sounds (Schafer, 1977) at different levels of abstraction, to represent common and ‘easy to distinguish’ sound classes (Gemmeke et al., 2017). The top-level classes comprise: (1) human, (2) manmade or mechanical, and (3) sounds of nature or natural sounds, indicated in a number of reviewed taxonomies as the most general division between sound classes. These main groups can be further divided into multiple subcategories at a lower level of categorical abstraction, derived from consulting specialised lists of sounds (e.g., ornithology, entomology, and bioacoustics) and lists of sound categories and their annotations generously provided by FreeSound.org and BBC Archive. Each subcategory was composed of more basic-level classes (e.g., cars, cats, drills) chosen to reflect the most frequently occurring and familiar agents and objects in the environment. For example, when selecting animals, cats and dogs were chosen over donkeys or lions, as these are more likely to appear in our everyday environment and expose listeners to natural acoustic variability within that class. Where possible, each basic class also consists of multiple types of sounds produced by the same source and uniquely associated with it.
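The three-level structure described above (top-level class, subcategory, basic-level class) can be sketched as a small Python mapping. This is purely illustrative: the subcategory names and most class members below are assumptions drawn from the examples in the text, not the database's actual inventory.

```python
# Illustrative three-level category structure; subcategory names and members
# are hypothetical examples, not the database's actual class inventory.
taxonomy = {
    "human": {"vocalisations": ["laugh", "cough"]},
    "manmade/mechanical": {"vehicles": ["car"], "tools": ["drill"]},
    "nature": {"animals": ["cat", "dog", "bird"]},
}

def basic_classes(tax):
    """Flatten the hierarchy down to its basic-level classes."""
    return [cls for subs in tax.values() for classes in subs.values() for cls in classes]
```

Representing the taxonomy as nested mappings keeps the three levels of abstraction explicit while allowing a simple flattening step when only basic-level classes are needed.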
The category selection process comprised the following four steps. (1) Identification included a thorough review of environmental sound taxonomies to identify a list of possible categories and their organisational criteria. (2) Synthesis focused on extracting common structural characteristics to select categories that were most consistently indicated across various taxonomies. (3) Specialisation involved exploring representative subcategories and consulting specialist sound taxonomies and lists of classes from other databases to identify sounds produced by the same source. For example, although cats or dogs were included in most previous studies, both animals have more than one vocalisation type in their repertoire. To correctly identify species- or source-specific sounds, appropriate literature was consulted (e.g., sounds produced by animals – cats, Pandeya & Lee, 2018; dogs, Molnar et al., 2008; birds, Briggs et al., 2012; sounds produced by cars, Morel et al., 2012; Park et al., 2019). (4) Filtering refers to the process of selecting the most frequently occurring and most characteristic sounds for each class. First, lists of the ecological frequency of everyday sounds (Ballas, 1993) and familiarity ratings (e.g., Marcell et al., 2000; Hocking et al., 2013; Burns & Rajan, 2019) were consulted. For example, familiarity ratings (with lower ranks representing more familiar sounds) indicated that the sounds produced by a dog (M=1.3), cat (M=1.22), horse (M=1.27) and cow (M=1.54) are more familiar than those of other animals such as donkey (M=2.6), bear (M=3.31), goat (M=2.04) or lion (M=2.42; a full list of familiarity ratings can be found in Hocking et al., 2013). Then, for each of the selected objects or agents, we selected the most characteristic sounds. For example, sounds of dogs included barking or howling, but not snoring or walking, which are not distinctive for a dog (i.e., other animals might also snore or walk).
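The filtering step can be illustrated with the familiarity ranks quoted above. The cut-off value below is a hypothetical choice for illustration; the text does not state a numeric threshold.

```python
# Mean familiarity ranks cited above (Hocking et al., 2013; lower = more familiar).
familiarity = {
    "dog": 1.30, "cat": 1.22, "horse": 1.27, "cow": 1.54,
    "goat": 2.04, "lion": 2.42, "donkey": 2.60, "bear": 3.31,
}

def most_familiar(ratings, threshold=2.0):
    """Keep agents whose mean rank falls below a (hypothetical) cut-off."""
    return sorted(agent for agent, m in ratings.items() if m < threshold)

selected = most_familiar(familiarity)  # cat, cow, dog, horse pass; the rest do not
```

With a cut-off of 2.0, exactly the four animals described in the text as more familiar (dog, cat, horse, cow) are retained.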
The labels were constructed to be grammatically consistent and as generically descriptive as possible. They are all linguistic phrases derived from the most common labels occurring in the reviewed databases. Given that neither nouns nor verbs alone are sufficient to distinguish between certain sound classes (e.g., a 'dog' label does not differentiate between barking and growling; at the same time, growling might not be indicative of a dog; Saygin, Dick, & Bates, 2005), we opted to include both noun and verb (or verb-derivative) constructions in our labels. Their grammatical complexity was kept uniform by transforming all phrases into ‘noun + gerund’ forms. A similar approach was successfully employed in previous research (e.g., Saygin, Dick, Wilson, Dronkers, & Bates, 2003). The same syntactic frame was used to construct the labels for all the sounds.
The 530 stimuli are natural sounds downloaded from the following databases and manually edited: Free Sound, BBC Sound Effects, Sound Bible, Zapsplat, Orange Free Sounds, the Adobe Audition Sound Effects Library, and YouTube.
Selection of the acoustic features was inspired by previous research in audio content analysis (Lerch, 2012), with emphasis on applications in environmental sound recognition and classification (e.g., Keller & Berger, 2001; Peltonen et al., 2002; Cai et al., 2006; Muhammad & Alghatabar, 2009; Leaver & Rauschecker, 2010; Velero & Alias, 2010; for review see: Alias, Socoro, & Sevillano, 2016; Serizel, Bisot, Essid, & Richard, 2017). We included features that have been shown to perform well in describing and parameterising environmental sounds. The list of selected features is not exhaustive, but, given the usefulness of these features in previous research, it should provide a good starting point for considering variability in the acoustic structure of environmental sounds.
The full list of acoustic features, together with the functions used to extract them, is given below.
Each sound file in EnviSounds was analysed with librosa 0.11. Features can be grouped into four categories: time-domain, spectral, cepstral, and quality. The extracted features are stored as JSON files and a separate CSV summarises each feature with descriptive statistics per sound file.
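The per-file pipeline described above (frame-wise features, a JSON record per sound, and descriptive statistics for the summary CSV) can be sketched as follows. This is a minimal pure-Python sketch: the actual dataset was processed with librosa 0.11, and the stand-in RMS and zero-crossing-rate functions, the frame parameters, and the sine test signal below are illustrative assumptions, not the dataset's exact settings.

```python
import json
import math
import statistics

# Pure-Python stand-ins for librosa.feature.rms and
# librosa.feature.zero_crossing_rate; frame/hop lengths are assumed defaults.
def frames(signal, frame_length=2048, hop_length=512):
    for start in range(0, max(len(signal) - frame_length, 0) + 1, hop_length):
        yield signal[start:start + frame_length]

def rms(frame):
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def zcr(frame):
    """Fraction of consecutive sample pairs whose sign changes."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / len(frame)

def analyse(signal):
    """Return a JSON record of per-frame features and per-feature statistics."""
    feats = {"rms": [rms(f) for f in frames(signal)],
             "zcr": [zcr(f) for f in frames(signal)]}
    summary = {name: {"mean": statistics.fmean(values),
                      "std": statistics.pstdev(values)}
               for name, values in feats.items()}
    return json.dumps(feats), summary

# Hypothetical test signal: one second of a 440 Hz sine at 22050 Hz.
sine = [math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
record, summary = analyse(sine)
```

For a pure sine, mean RMS sits near 1/sqrt(2) and the zero-crossing rate near twice the frequency divided by the sample rate, which matches the intuition behind both features given in the list below.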
Time-domain features
librosa.feature.rms. RMS energy is proportional to the perceived loudness of a frame and is a standard measure of signal power over time.

librosa.feature.zero_crossing_rate. High ZCR indicates rapid sign changes typical of noise-like or percussive content, while low ZCR reflects tonal, slowly oscillating waveforms.

Spectral features
librosa.feature.spectral_centroid. A high centroid reflects a bright, high-frequency dominated sound (e.g. birdsong, squealing tyres), while a low centroid reflects more dull or bass-heavy sounds (e.g. thunder, engine).

librosa.feature.spectral_rolloff (roll_percent=0.85). Rolloff captures the upper frequency extent of a sound and helps separate broadband noise (high rolloff) from low-frequency sources such as distant thunder (low rolloff).

librosa.feature.spectral_bandwidth. Broad bandwidth is characteristic of noise-like sounds (rain, wind), while narrow bandwidth indicates tonal content (whistling, bird calls, car horns).

librosa.feature.spectral_flatness. Values near 1 indicate a white-noise-like, uniformly distributed spectrum; values near 0 indicate a peaked, tonal spectrum. This is a key discriminator between texture-type sounds (rain, wind, running water) and tonal sources (birdsong, horns, bells).

librosa.feature.spectral_contrast (default parameters; 7 bands × frames). High contrast in the low bands indicates strong harmonic structure with clear valleys between harmonics; low contrast indicates a more noise-like spectrum. The feature captures aspects of timbre related to harmonic richness across the frequency range.

librosa.feature.melspectrogram (n_mels=8), then converted to dB with librosa.power_to_db. The 8 bands provide an interpretable, low-dimensional frequency-band summary suitable for direct comparison across sound categories, e.g. low-band energy is high for thunder and engine sounds, while high-band energy is high for birdsong and squealing tyres.

Cepstral features
librosa.feature.mfcc. MFCCs are the most widely used timbre descriptor in environmental sound recognition. The lower coefficients capture the broad spectral shape (related to vocal-tract or source colour), while higher coefficients capture finer spectral detail.

librosa.feature.delta (9-frame window). Delta coefficients encode how fast the spectral shape is changing from frame to frame, capturing the temporal dynamics of a sound's texture. They complement the static MFCCs and are especially informative for distinguishing periodic textures (sawing, galloping) from stochastic ones (rain, wind).

Quality features
Harmonic-to-noise ratio (HNR). The signal was split into harmonic and percussive components with librosa.effects.hpss, and RMS energy was then computed for both components. High positive HNR values indicate strongly tonal sources (birdsong, bells); values near 0 or negative indicate noise-dominated, texture-like sounds (wind, rain, crowd noise).

Here you can explore visualisations of various aspects of the dataset. Please select from the list of available visualisations below:
Distributions of features across categories - COMING SOON!
Last modified 2026/04/04
Designed by MKachlicka
© 2018-2026 MKachlicka. All rights reserved.