Spatial Audio

Our project is a spatial audio map of Collegetown that allows the user to use a joystick to virtually travel around the Collegetown crossing area and hear surrounding, directional sound.

Our project takes inspiration from Street View, an interactive technology featured in Google Maps that provides users the ability to visually explore the world through millions of panoramic images along many streets across the globe. With a Virtual Reality headset, a person can have the virtual experience of physically being in that environment. However, this currently lacks an element that we take for granted in our everyday lives: sound. We wanted to make this technology more immersive by offering the user a digital 3D sound experience in a street-view environment.

While there are currently JavaScript-based open-source projects that blend the Web Audio API and Street View together, our project is unique in that we perform all of the computations that generate the output spatial audio on a microcontroller, as opposed to in the cloud or on a more powerful but power-hungry processor. Low-power spatial audio generation has a promising future in the consumer electronics industry: battery capacity is a bottleneck preventing tech companies from putting more advanced features into wearable devices such as AR glasses, and generating spatial audio on a microcontroller directly reduces a device's power consumption.

High Level Design

High Level Block Diagram

Logical Structure

Our project consists of four major functional blocks: (1) the TFT Design (Collegetown map), (2) joystick control, (3) synthesized audio, and (4) spatial audio.

Our user interface consists of a 2D map of one of the major intersections of Collegetown on the TFT display. The user can navigate through the map with a joystick, controlling the position of a green dot that represents the “human” in the map. The position (in terms of x/y coordinates) is then used to determine what sounds the user hears, depending on how close one is to different sound sources.

We implemented three sound sources through Direct Digital Synthesis: the sound of a car engine, a two-toned chime representing Oishii Bowl, and a chirp representing a small bird standing on the corner of the intersection. To give the user an immersive spatial audio experience, we used the x/y position from the joystick as input to a simple head-related transfer function (for directional hearing) and to a sound intensity decay function that attenuates each source the further one moves from it. We tuned the intensity decay separately for each sound source, so a source like the car has a much larger audible range than the small bird. Additionally, no sounds play while the user is moving around the map; pressing the button on the joystick recalculates the spatial audio functions and resumes playback.

We then output the calculated sound for our left and right ears to our two DAC channels. By attaching an audio jack and headphones to these two channels, we can hear the directional and spatial audio. Further background information on spatial audio and our synthesized sounds is below.

Hardware/Software Tradeoff

To perform direct digital synthesis on three different audio sources, the major tradeoff is between the synthesis sample rate and the PIC32's processing speed. Because all three localized audio streams must be summed together and output to the DAC, the computation has to happen in the same interrupt service routine. The DDS interrupt frequency depends on how much computation the spatial audio requires and determines the highest resolution the audio output can have; synthesizing more complicated sounds means reducing the ISR frequency, which degrades audio quality. Initially we planned to use external storage devices such as SD cards or SRAM to store the raw audio files and perform calibration on the data. However, we encountered difficulties with these devices, leading us to use direct digital synthesis instead. We describe our experience with SD cards and SRAM in later sections. The end result is three different sound sources at an audio sampling frequency of 15 kHz.

Background Math

Spatial Audio

Instead of implementing the complete Head-Related Transfer Function (HRTF), we implemented a simplified version that takes into account only the delay between the left and right channels, or interaural time difference (ITD), and the amplitude difference, or interaural level difference (ILD). Two vectors drive our math: one from the user's body center to the sound source and one from the user's body center to their anterior. Following the 2015 project Sound Navigation, by roughly modeling the user's head as a sphere with two ears diametrically opposite each other, we can use the angle between these two vectors to calculate each interaural difference.

In the timer thread, when the joystick button is pressed, we first calculate the user's distance and relative angle to each of the sound sources (car, bird, Oishii Bowl). We then plug the angle into the two equations above to obtain the amplitude ratio and the delay between the two ears. We also model the decay of sound intensity over distance with the equation |I(A) − I(B)| = 20·log₁₀(dA/dB). We assume that each synthesized sound's original amplitude is the amplitude one would hear within one meter of the source, and we use that as a reference to calculate the intensity of the sound at any other spot on the map. To validate the algorithms, we first implemented them in Python in a Jupyter notebook. We generated sound mimicking a point source at different angles, and, with slight tuning of the head-dimension parameters for different users, all test users were able to distinguish the sample sound directions.
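The exact ITD/ILD equations came from the Sound Navigation project and are not reproduced in this text, so the sketch below stands in for them with the common Woodworth spherical-head approximation for ITD, alongside the 20·log₁₀ decay law stated above. The head radius, speed of sound, and one-meter reference distance are assumed constants, similar in spirit to our Jupyter prototype:

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed
HEAD_RADIUS = 0.0875     # m, assumed average head radius

def itd_seconds(theta):
    """Woodworth spherical-head interaural time difference.
    theta: angle (radians) between the facing vector and the
    source vector, measured toward the nearer ear."""
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

def intensity_drop_db(d_source, d_ref=1.0):
    """Free-field decay |I(A) - I(B)| = 20*log10(dA/dB),
    referenced to the amplitude heard 1 m from the source."""
    return 20.0 * math.log10(d_source / d_ref)

def amplitude_scale(d_source):
    """Convert the dB drop back to a linear amplitude multiplier
    applied to the synthesized sound's original amplitude."""
    return 10.0 ** (-intensity_drop_db(d_source) / 20.0)
```

For example, doubling the distance from 1 m to 2 m halves the linear amplitude (a 6 dB drop), and a source directly to one side (theta = π/2) yields an ITD of roughly 0.66 ms with these constants.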

Synthesized audio

We synthesized our sounds through direct digital synthesis (DDS), decomposing each into digital synthesizable primitives. In particular, we implemented three sounds: the rumble of a car engine, a two-toned bell sound to represent opening the door to Oishii Bowl, and a chirp to represent the presence of a small bird. To mathematically represent the relationship between frequency and sample time for each segment, we approximated each sound’s frequency waveform and amplitude envelope.

To guarantee a steady output of a sound wave, we iterated through the audio samples at a constant rate using a timer ISR. The ISR we use for DDS is triggered by Timer 2 at a ~15 kHz synthesis sample rate. This is much lower than the 44 kHz we used in the birdsong lab (Lab 1) because of the increased computation time within the interrupt. While this slower sample rate decreases waveform resolution and limits the highest frequency we can generate, we are producing relatively simple sounds that can tolerate these decreases. One drawback was that we were not able to generate a high-frequency bell sound and instead opted for a two-tone chime, which created a similar effect.
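The core of the DDS scheme can be sanity-checked outside the ISR. Below is a minimal Python sketch of a 32-bit phase accumulator indexing a 256-entry sine table at the 15 kHz rate; the table size and 12-bit-ish amplitude scaling are illustrative assumptions, not the exact PIC32 constants:

```python
import math

FS = 15000          # synthesis sample rate (Hz), as in the Timer 2 ISR
TABLE_SIZE = 256    # assumed sine-table length
SINE_TABLE = [int(2047 * math.sin(2 * math.pi * i / TABLE_SIZE))
              for i in range(TABLE_SIZE)]

def dds_samples(freq_hz, n):
    """Generate n samples of a sine at freq_hz with a 32-bit
    phase accumulator, mirroring the per-tick ISR logic."""
    acc = 0
    inc = int(freq_hz * (2 ** 32) / FS)   # phase increment per tick
    out = []
    for _ in range(n):
        acc = (acc + inc) & 0xFFFFFFFF
        # the top 8 bits of the accumulator index the sine table
        out.append(SINE_TABLE[acc >> 24])
    return out
```

Because the accumulator wraps modulo 2³², the output frequency is exact on average even when the increment does not divide the table size evenly; lowering FS from 44 kHz to 15 kHz simply lowers the Nyquist limit and coarsens the per-sample phase steps.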

We simulated the audio output of each of our sounds with Python scripts in a Jupyter notebook before implementing them on the PIC32.

(1) Bird: For the sound of a bird, we used the chirp sound we synthesized during Lab 1, which can be approximated as the rising half of a quadratic function as follows. The duration of the chirp is 5720 samples:

The amplitude envelope for the chirp sound is also the same as in Lab 1. It has an attack time and a decay time of 1000 samples each, as well as a sustain time of 3720 samples.
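The chirp's frequency function and attack/sustain/decay envelope can be sketched as below. The sample counts match those stated above, but the start and end frequencies are illustrative placeholders, since the exact Lab 1 quadratic coefficients are not reproduced here:

```python
CHIRP_LEN = 5720                      # total chirp duration in samples
ATTACK, SUSTAIN, DECAY = 1000, 3720, 1000

def chirp_freq(i, f_lo=2000.0, f_hi=5000.0):
    """Rising half of a quadratic: frequency climbs slowly at first,
    then faster. f_lo/f_hi are assumed endpoints for illustration."""
    t = i / CHIRP_LEN                 # normalized sample index, 0..1
    return f_lo + (f_hi - f_lo) * t * t

def envelope(i):
    """Linear attack / sustain / decay amplitude multiplier (0..1)."""
    if i < ATTACK:
        return i / ATTACK             # ramp up over 1000 samples
    if i < ATTACK + SUSTAIN:
        return 1.0                    # hold for 3720 samples
    if i < CHIRP_LEN:
        return (CHIRP_LEN - i) / DECAY  # ramp down over 1000 samples
    return 0.0
```

Multiplying each DDS output sample by `envelope(i)` removes the abrupt edges that would otherwise produce audible clicks.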

(2) Oishii Bowl Bell: Our two-toned chime sound was 8000 samples long, with half of the sound having a frequency of 2093 Hz (corresponding to note C) and the other half having a frequency of 1661 Hz (note G#). We implemented the following amplitude envelope, with the bell having a shorter attack time of 1000 samples, a longer sustain time of 5000 samples, and a decay time of 3000 samples.

(3) Car: For the car, we use a higher-frequency sawtooth frequency waveform with a lower-frequency sawtooth amplitude envelope that is slightly out of phase. This produces irregularities in the resulting combined audio waveform that resemble the sound of a car engine. For our implementation on the PIC32, we use piecewise functions to create the sawtooth waveforms. Additionally, the lower resolution from our lower sampling rate actually added some distortion that made the sound more realistic (this sound benefits from more noise in the waveform).

Software Block Diagram
In our final project, we designed the software system around user interaction. The Timer Thread is responsible for reading user input every 500 ms. It first checks the potentiometer output values on the joystick to detect movement. When the user moves the joystick, the audio output stops and the green dot representing the user on the TFT map moves according to the input. When the user has moved the dot to their destination, they click the button on the joystick to start the spatial audio output. Once a button click is detected, the Timer Thread calculates the time delay, amplitude ratio, and intensity difference between the left and right channels for the three sound sources independently. These values are written into global variables which are picked up by the Timer 2, 3, 4, and 5 interrupts. The Timer 2 interrupt handles the direct digital synthesis and DAC audio output for the three sound sources. The left and right channels of each sound are calculated independently and summed together to generate the final audio. Because of the spatial audio feature, there is a delay between the left and right channels; the Timer 3, 4, and 5 interrupts signal the start of the delayed channel for the bird, car, and bell sounds respectively.

When the button click is detected, the Timer Thread calculates the parameters for the spatial audio. Based on the hardcoded sound source positions and the updated “human” position, we determine which ear/channel is further from each sound source, then calculate the amplitude ratio and time delay between the left and right channels, as well as the intensity difference between the current “human” position and the sound source, following the algorithms described above. The intensity difference is represented by the maximum amplitude of the louder channel, and the amplitude ratio is the ratio of the further channel's maximum amplitude to the closer channel's. These two values are used to generate the sound amplitude envelopes, which are used in the Timer 2 interrupt for the direct digital synthesis calculation.

The time delay between the two channels is slightly more complicated. To implement the delay, we used timer interrupts to delay the audio start time of the further channel. This is done by calculating the real time delay between the left and right ears, converting it to a cycle count on the 40 MHz PIC32, and setting the interrupt timer to the corresponding value. Every time we finish the audio calibration process in the Timer Thread, a separate timer is started for the delayed audio channel of each sound source when the other channel starts to output its signal. Once the timer finishes counting down, an interrupt service routine is triggered to start audio on the delayed channel. To keep the audio output steady and on time, we set the DDS Timer 2 ISR to priority 2 and the sound localization ISRs to priority 1, so the sound localization ISRs cannot preempt the DDS calculation. Although this is a simple version of spatial audio involving only intensity and time differences, it can already create the illusion of a sound moving from one side to the other.
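The delay-to-cycles conversion amounts to a single multiply. A minimal sketch, assuming the 40 MHz clock stated above (the `prescaler` argument is our addition, useful if a delay must fit a 16-bit period register):

```python
CPU_HZ = 40_000_000   # timer clock, assumed 40 MHz as stated above

def itd_to_timer_ticks(itd_s, prescaler=1):
    """Convert an interaural time delay (seconds) into the period
    value for the one-shot timer that fires the delayed channel's
    start ISR."""
    return round(itd_s * CPU_HZ / prescaler)
```

For instance, a 0.6 ms ITD becomes 24000 ticks at 40 MHz; even the maximum human ITD of roughly 0.7 ms (28000 ticks) still fits a 16-bit period register without prescaling.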

In the Timer 2 ISR, we first clear the interrupt flag and compute the frequency functions for each of the three sounds. As stated above, for the bird we approximate a chirp with a quadratic function, for the car we use a linear piecewise function to represent a sawtooth waveform, and for the bell we alternate between two frequencies to produce a two-toned chime. For each synthesized sound, the DDS algorithm indexes into a pre-generated sine table according to the accumulated phase, and the sine table entry is written to the DAC output. Since the DAC can only output non-negative values, we leveled the DDS output so that it stays in the upper half of the DAC range (0–4095). It is important to note that we compute the frequency functions for each side (left and right) separately, as the audio for each ear differs due to the spatial audio processing.

In order to make our synthesized sounds more realistic and to avoid unnatural clicks, we then layer a different amplitude envelope over each of our sounds, as shown in our Synthesized Audio section. We first define an attack time, sustain time, and decay time for each of the three sounds, then use linear ramp functions to ramp up, ramp down, or sustain the amplitude for a certain number of samples. As with the frequency functions, we do this for each ear separately. This is also how we implement the intensity-difference and amplitude-ratio features of the sound localization: knowing the maximum DDS amplitude for each channel, we pre-calculate the ramp up/down rates in the calibration stage of the Timer Thread, and those increment/decrement rates are used here to tune the amplitude.
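The per-channel pre-calculation can be sketched as follows. `channel_envelopes` is a hypothetical helper name, but it mirrors the idea above: given the louder channel's maximum amplitude and the ILD ratio, compute both channels' per-sample ramp increments once in the Timer Thread, so the ISR only adds or subtracts a constant:

```python
def channel_envelopes(near_max, amp_ratio, attack_samples, decay_samples):
    """Pre-compute (ramp-up, ramp-down) per-sample increments for the
    closer ('near') and further ('far') channels of one sound source.
    near_max: max amplitude of the louder channel (from the intensity
    decay calculation); amp_ratio: far/near ILD ratio, <= 1."""
    far_max = near_max * amp_ratio
    return {
        "near": (near_max / attack_samples, near_max / decay_samples),
        "far":  (far_max / attack_samples,  far_max / decay_samples),
    }

# e.g. louder channel capped at 1000 DAC counts, far channel at half:
env = channel_envelopes(1000, 0.5, attack_samples=1000, decay_samples=2000)
```

Doing this division once per button press keeps the 15 kHz ISR down to additions and table lookups.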

Lastly, we write these outputs over SPI to DACA and DACB, looping through these sounds until the “human” is moved, which then causes sound to stop being produced until the button on the joystick is pressed again.

TFT Map

We implemented our map with the tft_gfx library, which gave us functions that can draw text and simple geometric shapes such as circles, rectangles, triangles, and lines. To make artistic icons, we strategically overlaid them. For example, the sign for the construction site is two rectangles overlaid on a yellow triangle; the bird is a brown triangle on the side of a big yellow circle with a small black circle for the eye. For the Oishii Bowl icon, we first drew a big red circle, covered half of it with a grey rectangle, then drew a smaller white circle (as the rice) and covered the second half of that circle with red. For the map, we used two big black rectangles as the road and decorated them with dashed lines for the median strip and long skinny white rectangles for the crosswalk. To pick custom colors for our icons, we used an online 16-bit color generator to find the hexadecimal 16-bit color values.

In terms of implementation, we have a function for drawing and initializing the entire map at the beginning of the program. We then have a function for updating the map whenever the “human” moves in order to avoid leaving traces on the road. For example, we redraw the crosswalk and the road centerline as well as the green traffic light.

Hardware Design

The hardware of the project includes the course development board, its onboard components (the PIC32 microcontroller, MCP4822 DAC, and TFT display), the PICkit 3 programmer, the audio jack, a pair of headphones, and the joystick.

The PIC32 is a 32-bit microcontroller in Microchip's Peripheral Interface Controller family. It has a high-performance MIPS core, a 2 GB user-mode virtual address space, up to five external interrupts, and support for multiple communication protocols including UART, SPI, and I2C. With the PICkit 3 programmer, we can connect the MCU to a PC and load programs with the MPLAB X IDE and XC32 compiler.

The MCP4822 is the dual-channel 12-bit Digital-to-Analog Converter (DAC) we use to convert the digital sounds synthesized on the PIC32 into analog audio signals. The DAC receives digital values from the microcontroller over SPI, converts them into an analog waveform, and outputs the signals on the DACA and DACB pins. We use the audio jack to play the sounds through a pair of headphones, specifically tuning our DAC outputs to work with our particular headphones. We are also able to visualize the output on an oscilloscope for debugging.

The TFT display is a 2.2″ 16-bit color TFT LCD that communicates with the PIC32 through an SPI channel. We use this for displaying our map of Collegetown, and we also used it for printing out debug messages throughout our project as well.

To have our “human” icon navigate throughout the map displayed on the TFT display, we connected a joystick which contains two independent potentiometers (one for each axis, X and Y) as well as an internal pushbutton. The potentiometers are connected to two analog inputs on the PIC32. We added a pullup resistor to make it active low. The pushbutton is connected as a digital input and pressing it indicates to the program that the spatial audio calculations should be recomputed. We use it to recalibrate the sound for the “human’s” change in position on the map. Debouncing is not required since pressing the button multiple times will not affect how our system functions.

Initial + Unsuccessful Attempts and Lessons Learned

Initially, instead of opting for direct digital synthesis, we attempted to store our audio data in the form of .txt files containing analog values of sounds (recorded by us near the Collegetown intersection) that we could write to the DAC. We intended to store these .txt files on a FAT-32 formatted SD card and read the data over SPI using an SD card library created by a former student, Tahmid, for the PIC32. Unfortunately, while we could write into and connect to the SD card, we ran into issues reading more than 512 bytes, which corresponds to one sector in the SD card. Once the pointer for our reader got to 512 bytes, it was unable to continue reading from the next sector of data and instead looped again through the first 512 bytes.

This was an issue: when we created a sample .txt file containing the analog values of a 4-second drum beat, that file was already 7 MB. There are ways to decrease the size of the file, including lowering the sampling rate (which we did) or shortening the sound. However, the gap between what we would like to store and what 512 bytes can hold is still large.
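The arithmetic behind this gap is stark. A quick sketch, assuming 12-bit samples stored as two raw binary bytes each (the ASCII .txt encoding we actually used inflates the size further, which helps explain the 7 MB file):

```python
SECTOR_BYTES = 512   # one SD card sector, the most we could read

def clip_bytes(seconds, sample_rate_hz, bytes_per_sample=2):
    """Raw storage needed for an uncompressed mono clip."""
    return int(seconds * sample_rate_hz * bytes_per_sample)

# a 4 s clip even at our reduced 15 kHz rate:
size = clip_bytes(4, 15000)          # 120000 bytes
sectors = size // SECTOR_BYTES       # ~234 sectors of 512 B
```

In other words, one readable sector holds only about 17 ms of audio at this rate, so single-sector reads were never going to be enough.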

Additionally, it took a significant amount of time to figure out how to get both the TFT and the SD card reader working on the same SPI channel, as well as to set up the library. We conclude that reading/writing from an SD card is a workable approach if you only need to store a limited amount of data (up to 512 bytes in our experience), though it is tricky to implement correctly. Tahmid's implementation as well as a previous project using an SD card (PICboy32) are very helpful resources if you choose to use one.

After our attempt with the SD card didn’t work, we attempted to integrate our code with external RAM over SPI. We ran into issues as well, unable to read/write the correct values. Additionally, the RAM is only 1 kilobits which creates limitations for the audio samples we want to store. We were also unsure of how to post-process this analog data to account for spatial audio after reading it in, so direct digital synthesis ended up being a much better approach.

One small drawback of our spatial audio implementation is that our simplified head-related transfer function does not account for the difference between sounds arriving from the front and from the back. That difference depends on human ear structure and is therefore much harder to model with the limited computational resources we have. However, the overall audio effect was still successful, and we accommodated this limitation by simply having the human face one direction. Additionally, we could not perform real-time localization and audio processing while the figure was moving around the map. Our solution was to disable those functions until the figure was stationary and the button on the joystick was pressed.

According to feedback from our users, the synthesized sounds accurately evoked their intended sources. We synthesized three sounds in total: a car engine sound, a chirp, and an electric two-toned doorbell sound. Originally, we wanted to synthesize an actual bell sound, but since we had to reduce the sampling frequency from 44 kHz to 15 kHz (22 kHz to accommodate two sound sources and 15 kHz for three), the high-frequency bell waveform became distorted. The sounds all came out quite realistic, especially the car engine. We were satisfied with all the functions we included in the project and tried our best to optimize all of them during implementation. In terms of audio safety, we set amplitude limits for each sound so that the maximum amplitudes would not reach levels that could cause physical discomfort.

In terms of graphics, the TFT screen animation appeared minimalistic and clean. With basic but deliberately arranged geometries, we were able to create icons that could be easily understood. With the help of the joystick, even new users could guide themselves through the map pretty smoothly. With computational adjustments such as the simplified head-related transfer function as well as only updating a small subset of areas in the map every time the human moves, the TFT had no significant flickers or lags as the program was being executed.

Waveform Results

(1) The person is on the left side of the car. The FFT plot, which shows amplitude across frequencies, reflects the amplitude difference between the left and right channels. The blue trace represents the closer ear (the right ear), which therefore has a higher amplitude; in the time domain it is also slightly ahead of the yellow trace (the left ear), since the sound reaches the right ear first.

FFT (Car Right)
FFT (Car Front)
FFT (Bird on the left and car in the background)
Spectrum (Car, Chirp, and Bell)


Based on the waveforms and user feedback, we are satisfied with how well our design met our expectations and created an immersive spatial audio experience.

While there is not much we would change in our implementation of the TFT design and the synthesized/spatial audio, if we were to start the project from the beginning, we would not have spent two weeks getting the SD card to work. While we did end up with a good understanding of how the library works, working on the SD card library without much prior knowledge was challenging and time-consuming, and it was ultimately unsuccessful. That said, if we were to expand this project to a larger scale, we might still need some way of accessing external memory. By processing three sounds at the same time, we were already at the limit of the PIC32's computational power; adding any more sounds would mean lowering the sampling frequency or shortening the samples unless we added external data storage.

Intellectual Property Considerations

We generated all the synthesized sounds and designed all the graphics from scratch, so we own the entire intellectual property of the media in the project. We wrote the majority of the code ourselves, aside from some open-source libraries, the Protothreads library, and example code from class.

Ethical Considerations

The project is consistent with the IEEE Code of Ethics. The main ethical concern was user safety, since the audio was initially quite loud, but we tuned it down to a safe amplitude for user safety and comfort. We also upheld other safety, ethical, and welfare considerations, had no conflicts of interest, and avoided unlawful conduct.

