How to navigate the dataset

(1) To download the full dataset as a zip file, click the "Download Full Dataset" button below. This will download a zipped folder with all the sounds, norms, and data, along with the accompanying scripts and a README file.

(2) To better understand the dataset, we recommend reading through the sections below, which discuss the sounds, their acoustics, semantics, and norms in more detail. Each section can be expanded by clicking the "Read more" button, and each contains a corresponding "Download" button for the data discussed in that section.

Introduction

Sound categories included in the database were selected based on the sounds' overall ecological frequency of occurrence (Ballas, 1993) and membership in distinct subclasses of sounds (Schafer, 1977) at different levels of abstraction, so as to represent common and ‘easy to distinguish’ sound classes (Gemmeke et al., 2017). The top-level classes comprise: (1) human sounds, (2) manmade or mechanical sounds, and (3) sounds of nature (natural sounds), indicated in a number of reviewed taxonomies as the most general division between sound classes. These main groups can be further divided into multiple subcategories at a more specific level of categorical abstraction, derived from consulting specialised lists of sounds (e.g., ornithology, entomology, and bioacoustics) and lists of sound categories and their annotations generously provided by FreeSound.org and the BBC Archive. Each subcategory was composed of more basic-level classes (e.g., cars, cats, drills), chosen to reflect the most frequently occurring and most familiar agents and objects in the environment. For example, when selecting animals, cats and dogs were chosen over donkeys or lions, as they are more likely to appear in our everyday environment and to expose listeners to natural acoustic variability within that class. Where possible, each basic class also consists of multiple types of sounds produced by the same source and uniquely associated with it.

The category selection process included the following four steps. (1) Identification involved a thorough review of environmental sound taxonomies to identify a list of possible categories and their organisational criteria. (2) Synthesis focused on extracting common structural characteristics to select the categories that were most consistently indicated across various taxonomies. (3) Specialisation involved exploring representative subcategories and consulting specialised sound taxonomies and lists of classes from other databases to identify sounds produced by the same source. For example, although cats and dogs were included in most previous studies, both animals have more than one vocalisation type in their repertoire. To correctly identify species- or source-specific sounds, the appropriate literature was consulted (e.g., sounds produced by animals – cats, Pandeya & Lee, 2018; dogs, Molnar et al., 2008; birds, Briggs et al., 2012; sounds produced by cars, Morel et al., 2012; Park et al., 2019). (4) Filtering refers to the process of selecting the most frequently occurring and most characteristic sounds for each class. First, lists of the ecological frequency of everyday sounds (Ballas, 1993) and familiarity ratings (e.g., Marcell et al., 2000; Hocking et al., 2013; Burns & Rajan, 2019) were consulted. For example, familiarity ratings (with lower ranks representing more familiar sounds) indicated that the sounds produced by a dog (M=1.3), cat (M=1.22), horse (M=1.27) and cow (M=1.54) are more familiar than those of other animals such as a donkey (M=2.6), bear (M=3.31), goat (M=2.04) or lion (M=2.42; a full list of familiarity ratings can be found in Hocking et al., 2013). Then, for each of the selected objects or agents, we selected the most characteristic sounds. For example, sounds of dogs included barking or howling, but not snoring or walking, which are not distinctive of a dog (i.e., other animals might also snore or walk).

The labels were constructed to be grammatically consistent and as generically descriptive as possible. They are all linguistic phrases derived from the most common labels occurring in the reviewed databases. Given that neither nouns nor verbs alone are sufficient to distinguish between certain sound classes (e.g., a 'dog' label does not differentiate between barking and growling, while growling alone might not be indicative of a dog; Saygin, Dick, & Bates, 2005), we opted to include both noun and verb (or verb-derivative) constructions in our labels. Their grammatical complexity was kept uniform by transforming all phrases into ‘noun + gerund’ forms, so the same syntactic frame was used for all the sounds. A similar approach was successfully used in previous research (e.g., Saygin, Dick, Wilson, Dronkers, & Bates, 2003).

Sounds

The stimuli are 530 natural sounds downloaded from the following databases and manually edited: Free Sound, BBC Sound Effects, Sound Bible, Zapsplat, Orange Free Sounds, Adobe Audition Sound Effects Library and YouTube.

Acoustics

The selection of acoustic features was inspired by previous research in audio content analysis (Lerch, 2012), with emphasis on applications in environmental sound recognition and classification (e.g., Keller & Berger, 2001; Peltonen et al., 2002; Cai et al., 2006; Muhammad & Alghatabar, 2009; Leaver & Rauschecker, 2010; Velero & Alias, 2010; for review see: Alias, Socoro, & Sevillano, 2016; Serizel, Bisot, Essid, & Richard, 2017). We included features that have been shown to perform well in describing and parameterising environmental sounds. The list of selected features is not exhaustive, but given the usefulness of these features in previous research, it should provide a good starting point for considering variability in the acoustic structure of environmental sounds.

The full list of acoustic features, along with the functions used to extract them, is given below.

Each sound file in EnviSounds was analysed with librosa 0.11. Features can be grouped into four categories: time-domain, spectral, cepstral, and quality. The extracted features are stored as JSON files and a separate CSV summarises each feature with descriptive statistics per sound file.
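
The exact extraction scripts are included in the dataset download. For illustration, a minimal sketch of such a pipeline is shown below; the file name, the JSON layout and the choice of summary statistics are placeholders of our own, not the dataset's actual schema.

    import json
    import numpy as np
    import pandas as pd
    import librosa

    def summarise(track):
        """Reduce a per-frame feature track to a few descriptive statistics."""
        return {"mean": float(np.mean(track)), "std": float(np.std(track)),
                "min": float(np.min(track)), "max": float(np.max(track))}

    def extract_features(path):
        """Extract a handful of the frame-level features described below."""
        y, sr = librosa.load(path, sr=None)  # keep the native sampling rate
        return {
            "rms": librosa.feature.rms(y=y)[0],
            "zcr": librosa.feature.zero_crossing_rate(y)[0],
            "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr)[0],
        }

    rows = []
    for path in ["dog_barking.wav"]:  # hypothetical file name
        feats = extract_features(path)
        # One JSON file per sound with the raw frame-level tracks.
        with open(path.replace(".wav", ".json"), "w") as f:
            json.dump({name: track.tolist() for name, track in feats.items()}, f)
        # One row per sound in the summary CSV.
        row = {"file": path}
        for name, track in feats.items():
            for stat, value in summarise(track).items():
                row[f"{name}_{stat}"] = value
        rows.append(row)

    pd.DataFrame(rows).to_csv("feature_summary.csv", index=False)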

Time-domain features

  • RMS energy — root-mean-square energy per frame, computed with librosa.feature.rms. RMS energy is proportional to the perceived loudness of a frame and is a standard measure of signal power over time.

  • Zero-crossing rate (ZCR) — the rate at which the signal changes sign (from positive to negative and back), computed with librosa.feature.zero_crossing_rate. A high ZCR indicates rapid sign changes typical of noise-like or percussive content, while a low ZCR reflects tonal, slowly oscillating waveforms.

  • Temporal centroid — indicates where in time most of the signal's energy occurs. A low value indicates an impulsive sound (most energy at the onset), while higher values indicate more sustained sounds. This feature can be useful for distinguishing a single clap from continuous applause, for instance.
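
A minimal sketch of how these three time-domain features could be extracted is given below. Note that librosa has no built-in temporal centroid; the energy-weighted mean of frame times used here is one common formulation and an assumption on our part, as is the file name.

    import numpy as np
    import librosa

    y, sr = librosa.load("dog_barking.wav", sr=None)  # hypothetical file name

    # RMS energy and zero-crossing rate, one value per analysis frame.
    rms = librosa.feature.rms(y=y)[0]
    zcr = librosa.feature.zero_crossing_rate(y)[0]

    # Temporal centroid: energy-weighted mean of frame times, normalised by
    # the total duration so that ~0 = impulsive onset, ~1 = energy at the end.
    times = librosa.times_like(rms, sr=sr)
    temporal_centroid = np.sum(times * rms) / np.sum(rms) / times[-1]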

Spectral features

  • Spectral centroid — describes the centre of gravity of the spectral energy of a sound (i.e., its amplitude-weighted mean frequency), computed with librosa.feature.spectral_centroid. A high centroid reflects a bright, high-frequency dominated sound (e.g. birdsong, squealing tyres), while a low centroid reflects duller or bass-heavy sounds (e.g. thunder, engine).

  • Spectral rolloff — the frequency below which 85% of the total spectral energy is contained, computed with librosa.feature.spectral_rolloff(roll_percent=0.85). Rolloff captures the upper frequency extent of a sound and helps separate broadband noise (high rolloff) from low-frequency sources such as distant thunder (low rolloff).

  • Spectral bandwidth — the weighted standard deviation of frequencies around the spectral centroid, computed with librosa.feature.spectral_bandwidth. Broad bandwidth is characteristic of noise-like sounds (rain, wind) while narrow bandwidth indicates tonal content (whistling, bird calls, car horns).

  • Spectral flatness — measures uniformity in the frequency distribution of the power spectrum and is calculated as the ratio of the geometric mean to the arithmetic mean of the power spectrum per frame, computed with librosa.feature.spectral_flatness. Values near 1 indicate a white-noise-like, uniformly distributed spectrum; values near 0 indicate a peaked, tonal spectrum. This is a key discriminator between texture-type sounds (rain, wind, running water) and tonal sources (birdsong, horns, bells).

  • Spectral contrast — the difference in spectral energy between peaks and valleys in seven sub-bands, computed with librosa.feature.spectral_contrast (default parameters; the output is 7 bands × frames). High contrast in the low bands indicates strong harmonic structure with clear valleys between harmonics; low contrast indicates a more noise-like spectrum. The feature captures aspects of timbre related to harmonic richness across the frequency range.

  • Spectral flux — the sum of squared frame-to-frame differences in spectral magnitude, describing sudden changes in a sound's frequency distribution (i.e., its dynamic variation). High spectral flux indicates rapidly changing spectral content, as found in percussive or transient sounds (footsteps, hammering, thunder). Low flux is characteristic of steady textures (engine drone, rain, running water).

  • Spectral kurtosis — describes the peakedness of the spectral distribution around its mean. High kurtosis indicates a strongly peaked spectrum (tonal or impulsive content), while low kurtosis indicates a flatter, more uniform spectral distribution. Unlike spectral flatness, which measures overall uniformity, spectral kurtosis is sensitive to the presence of sharp spectral peaks.

  • Mel spectrogram — the STFT power spectrum mapped onto 128 mel-scale frequency bands using librosa.feature.melspectrogram, then converted to dB with librosa.power_to_db.

  • Mean power spectrum — the dB-magnitude spectrogram averaged across all frames.

  • Log-Mel band energies — a coarser version of the mel spectrogram using only 8 broad mel bands, computed with librosa.feature.melspectrogram (n_mels=8) followed by librosa.power_to_db. The 8 bands provide an interpretable, low-dimensional frequency-band summary suitable for direct comparison across sound categories, e.g. low-band energy is high for thunder and engine sounds, while high-band energy is high for birdsong and squealing tyres.
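
A sketch covering the spectral features is given below. Spectral flux and spectral kurtosis have no dedicated librosa functions, so they are computed from the magnitude spectrogram following the definitions above; these formulations, like the file name, are assumptions on our part rather than the dataset's exact code.

    import numpy as np
    import librosa

    y, sr = librosa.load("dog_barking.wav", sr=None)  # hypothetical file name
    S = np.abs(librosa.stft(y))                       # magnitude spectrogram
    eps = np.finfo(float).eps

    # Features with dedicated librosa functions (1 x frames, except contrast).
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)[0]
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)[0]
    flatness = librosa.feature.spectral_flatness(y=y)[0]
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # 7 bands x frames

    # Spectral flux: sum of squared frame-to-frame magnitude differences.
    flux = np.sum(np.diff(S, axis=1) ** 2, axis=0)

    # Spectral kurtosis: treat each frame's magnitude spectrum as a
    # distribution over frequency and take its fourth standardised moment.
    freqs = librosa.fft_frequencies(sr=sr)[:, None]
    p = S / (S.sum(axis=0, keepdims=True) + eps)
    mu = np.sum(freqs * p, axis=0)
    var = np.sum((freqs - mu) ** 2 * p, axis=0)
    kurtosis = np.sum((freqs - mu) ** 4 * p, axis=0) / (var ** 2 + eps)

    # Mel representations: 128-band mel spectrogram in dB, the mean power
    # spectrum, and the coarse 8-band log-mel summary.
    mel_db = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    mean_power_spectrum = librosa.amplitude_to_db(S).mean(axis=1)
    log_mel_8 = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=8))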

Cepstral features

  • MFCCs (Mel-Frequency Cepstral Coefficients) — the first 13 cepstral coefficients derived from the log mel spectrogram, computed with librosa.feature.mfcc. MFCCs are the most widely used timbre descriptor in environmental sound recognition. The lower coefficients capture the broad spectral shape (related to vocal-tract or source colour), while higher coefficients capture finer spectral detail.

  • MFCC deltas — first-order temporal derivatives of the MFCC trajectories, computed with librosa.feature.delta using a 9-frame window. Delta coefficients encode how fast the spectral shape is changing from frame to frame, capturing the temporal dynamics of a sound's texture. They complement the static MFCCs and are especially informative for distinguishing periodic textures (sawing, galloping) from stochastic ones (rain, wind).
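
A short sketch of the cepstral features, again assuming a hypothetical file name:

    import librosa

    y, sr = librosa.load("dog_barking.wav", sr=None)  # hypothetical file name

    # First 13 MFCCs (13 x frames) and their first-order deltas over a
    # 9-frame window.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mfcc_delta = librosa.feature.delta(mfcc, width=9)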

Quality features

  • Harmonic-to-Noise Ratio (HNR) — per-frame ratio of harmonic energy to residual noise energy, estimated via harmonic/percussive source separation. First, we decomposed the signal with librosa.effects.hpss, and then computed RMS energy for both components. High positive HNR values indicate strongly tonal sources (birdsong, bells); values near 0 or negative indicate noise-dominated texture-like sounds (wind, rain, crowd noise).
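
A sketch of the HNR computation described above follows; the per-frame dB formulation of the ratio is an assumption on our part, as is the file name.

    import numpy as np
    import librosa

    y, sr = librosa.load("dog_barking.wav", sr=None)  # hypothetical file name

    # Harmonic/percussive source separation.
    y_harm, y_perc = librosa.effects.hpss(y)

    # Per-frame energies of each component (RMS squared).
    e_harm = librosa.feature.rms(y=y_harm)[0] ** 2
    e_perc = librosa.feature.rms(y=y_perc)[0] ** 2

    # Per-frame HNR in dB: positive = harmonic-dominated, negative = noisy.
    eps = np.finfo(float).eps
    hnr_db = 10 * np.log10((e_harm + eps) / (e_perc + eps))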

Semantics

COMING SOON!

Norms

COMING SOON!

Explore

Here you can explore visualisations of various aspects of the dataset. Please select from the list of available visualisations below:

Last modified 2026/04/04

Designed by MKachlicka

© 2018-2026 MKachlicka. All rights reserved.