This article is part of a Technology and Innovation Insights series paid for by Samsung.
Similar to the relationship between an engine and oil, data and artificial intelligence (AI) are symbiotic. Data fuels AI, and AI helps us make sense of that data. Data and AI are two of the biggest topics in technology in recent years, as the two work together to shape our lives on a daily basis. The sheer amount of data available right now is staggering, and it doubles every two years. However, we currently use only about 2 percent of it. Much like when oil was first discovered, it is taking time for humans to figure out what to do with all of this new data and how to make it useful.
Whether pulled from the cloud, your phone, TV, or an IoT device, the vast range of connected streams provides data on just about everything that goes on in our daily lives. But what do we do with it?
Earlier this month, HARMAN’s Chairman Young Sohn sat down with international journalist Ali Aslan in Berlin, Germany at the “New Data Economy and its Consequences” video symposium held by Global Bridges. Young and Ali discussed the importance of data, why AI without data is useless, and what needs to be considered when we look at the ethical use of data and AI — including bias, privacy, and security.
Bias
Unlike humans, technology and data are not inherently biased. As the old adage goes, data never lies. Bias in data and AI comes into play when humans train an AI algorithm or interpret data. Much of what we consume is influenced by where the data comes from and what data goes into the system. Understanding and eliminating our biases is essential to ensuring a neutral algorithm and system.
Controlling data access and permissions is a key first step in removing bias. Having a diverse and inclusive team when developing algorithms and systems is essential. Not everyone has lived the same experiences or has the same background. Diversity in both can help curb bias by providing different ways of interpreting data inputs and outputs.
Privacy
Permission and access are paramount when we look at the privacy aspect of data. Privacy is extremely important in our increasingly digital society. As such, consumers should have a choice at the beginning of a relationship with an organization and be asked whether they want to opt in, rather than having to opt out. GDPR has been a good first step in helping to protect consumers with regard to the capture and use of their data. While GDPR has many well-designed and important initiatives, the legislation could be more efficient.
Security
Whereas data privacy is more of a concern to consumers and individuals, data security has become a global concern for consumers, organizations, and nation-states.
It seems like every day we are reading about another cyber-attack or threat that we should be aware of. Chief among these concerns is the influx of ransomware attacks. Companies and individuals are paying increasingly large sums of money to bad actors in an attempt to limit risk, attention, and embarrassment. These attacks are being carried out by individuals, collectives, and even nation-states in an attempt to cripple the systems of enemies, gather classified information, or reap financial gains.
So how do we trust that our data and information are safe, and what can we do to be better protected? While there may be bad actors using technology and data for their own nefarious purposes, there are also many positive uses for technology. The education and investment being made in the cybersecurity space have helped many organizations train employees and invest in technologies that are designed to prevent cybercrime at the source — human error. And while we may not be able to stop all cybercrime, we are making progress.
Data and AI for good
While data — both from a collection and storage viewpoint — and AI have gotten negative press around biases, privacy, and security, both can also be used to do an immense amount of good. For example, both data and AI have been crucial in the biomedical and agtech industries. Whether it’s COVID-19 detection and vaccine creation or the creation of biomes and removal of toxins in soil, data and AI have incredible potential. However, one cannot move forward without the other. A solid and stable infrastructure and network are also needed to ensure that we can make use of the other 98 percent of the global data available.
Humans understand events in the world contextually, performing what’s called multimodal reasoning across time to make inferences about the past, present, and future. Given text and an image that seem innocuous when considered apart — e.g., “Look how many people love you” and a picture of a barren desert — people recognize that these elements can take on hurtful connotations when they’re paired or juxtaposed.
Even the best AI systems struggle in this area. But there’s been progress, most recently from a team at the Allen Institute for Artificial Intelligence and the University of Washington’s Paul G. Allen School of Computer Science & Engineering. In a preprint paper published this month, the researchers detail Multimodal Neural Script Knowledge Models (Merlot), a system that learns to match images in videos with words and even follow events globally over time by watching millions of YouTube videos with transcribed speech. It does all this in an unsupervised manner, meaning that the videos haven’t been labeled or categorized — forcing the system to learn from the videos’ inherent structures.
Learning from videos
Our capacity for commonsense reasoning is shaped by how we experience causes and effects. Teaching machines this type of “script knowledge” is a significant challenge, in part because of the amount of data it requires. For example, even a single photo of people dining at a restaurant can imply a wealth of information, like the fact that the people had to meet up, agree where to go, and enter the restaurant before sitting down.
Merlot attempts to internalize these concepts by watching YouTube videos. Lots of YouTube videos. Drawing on a dataset of 6 million videos, the researchers trained the model to match individual frames with a contextualized representation of the video transcripts, divided into segments. The dataset contained instructional videos, lifestyle vlogs of everyday events, and YouTube’s auto-suggested videos for popular topics like “science” and “home improvement,” each selected explicitly to encourage the model to learn about a broad range of objects, actions, and scenes.
The goal was to teach Merlot to contextualize the frame-level representations over time and over spoken words, so that it could reorder scrambled video frames and make sense of “noisy” transcripts — including those with erroneously lowercase text, missing punctuation, and filler words like “umm,” “hmm,” and “yeah.” The researchers largely accomplished this. They found that in a series of qualitative and quantitative tests, Merlot had a strong “out-of-the-box” understanding of everyday events and situations, enabling it to take a scrambled sequence of events from a video and order the frames to match the captions in a coherent narrative, like people riding a carousel.
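To make the frame-ordering idea more concrete, here is a minimal, hypothetical Python sketch (not the authors’ code) of how a system with frame and transcript-segment embeddings could recover a temporal order: each transcript segment, taken in its original order, is matched to its most similar unused frame. The embedding models themselves are assumed and not shown.

```python
import numpy as np

def recover_frame_order(frame_emb: np.ndarray, segment_emb: np.ndarray) -> list:
    """frame_emb: (F, d) embeddings of shuffled frames.
    segment_emb: (S, d) embeddings of transcript segments, in temporal order.
    Returns one frame index per segment, greedily matched by cosine similarity."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    s = segment_emb / np.linalg.norm(segment_emb, axis=1, keepdims=True)
    sim = s @ f.T                                # (S, F) cosine similarity matrix
    order, used = [], set()
    for row in sim:                              # walk transcript segments in order
        for idx in map(int, np.argsort(-row)):   # best-matching unused frame wins
            if idx not in used:
                order.append(idx)
                used.add(idx)
                break
    return order
```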
Future work
Merlot is only the latest work on video understanding in the AI research community. In 2019, researchers at Georgia Institute of Technology and the University of Alberta created a system that could automatically generate commentary for “let’s play” videos of video games. More recently, researchers at Microsoft published a preprint paper describing a system that could determine whether statements about video clips were true, by learning from visual and textual clues. And Facebook has trained a computer vision system that can automatically learn audio, textual, and visual representations from publicly available Facebook videos.
Above: Merlot can understand the sequence of events in videos, as demonstrated here.
The Allen Institute and University of Washington researchers note that, like previous work, Merlot has limitations, some owing to the data selected to train the model. For example, Merlot could exhibit undesirable biases because it was only trained on English data and largely local news segments, which can spend a lot of time covering crime stories in a sensationalized way. It’s “very likely” that training models like Merlot on mostly news content could cause them to learn racist patterns as well as sexist patterns, the researchers concede, given that the most popular YouTubers in most countries are men. Studies have demonstrated a correlation between watching local news and having more explicit, racialized beliefs about crime.
For these reasons, the team advises against deploying Merlot into a production environment. But they say that Merlot is still a promising step for future work in multimodal understanding. “We hope that Merlot can inspire future work for learning vision+language representations in a more human-like fashion compared to learning from literal captions and their corresponding images,” the coauthors wrote. “The model achieves strong performance on tasks requiring event-level reasoning over videos and static images.”
The answer should be obvious. If you like the way it sounds, then it is good. I’m not here to tell you to stop enjoying what you like. But I am here to help you make more educated purchases.
Speakers don’t exist in isolation; most of us want to know we’re getting the best sound for our budget and setup. So how can you tell if one speaker is better than another without direct comparison? How do you know your impressions — or those of reviewers — aren’t being influenced by expectations about a speaker’s price and reputation? And what do you do when you don’t have a chance to listen to a speaker at all before buying it?
This is where speaker measurements and objective data come in. Knowing how to interpret frequency response graphs is one of the most important skills an audiophile can have.
Lucky for us, speaker engineers and psychoacoustics researchers have been studying the nature of ‘good sound’ for decades. This research has led to powerful insights which show that, to a substantial degree, your preference for one speaker over another can be predicted by data — frequency response measurements in particular.
So by the end of this article, you should be able to look at a graph like this…
…and know whether it describes a decent speaker, as well as understand what some of its audible flaws might be.
Most of what I know comes from reading what I consider the most important book for any science-loving audiophile: Sound Reproduction: The Acoustics and Psychoacoustics of Loudspeakers and Rooms. Written by Dr. Floyd Toole, perhaps the most renowned expert on the psychoacoustics of speakers, it summarizes decades of research on acoustics and listener preferences.
I’ve since measured dozens of speakers and have found a remarkable correlation between my listening impressions and measurements, which are almost always performed after weeks of hearing the speaker in my own living room. This guide will hopefully help you understand how to correlate that data with your own impressions too.
Okay, so why should I care about measurements? Can’t I just read the review?
Some audiophiles believe listening to a speaker is the only way to know if a speaker is any good. We all have different tastes in music, after all, so surely speakers are the same?
The problem is, when it comes to sound reproduction, not music, you’re probably not that special.
Research suggests that a significant majority of people will rank speakers similarly once you eliminate variables like a speaker’s price, reputation, or aesthetics. The gold standard of this preference research is the double-blind comparison.
Credit: Sean Olive / Toole & Olive 1984. In this test, you can see how differently people rate speakers during sighted and blind listening.
In these listening tests, 2-4 speakers are placed behind an acoustically transparent screen, and neither the listeners nor researchers can see which speaker is playing music. In the best versions of these tests, listeners can switch speakers on the fly, and a machine will automatically reposition the speakers. This video gives you an idea of how Harman Audio performs its double-blind tests:
Two of the most important studies on speaker preference were published in 2004 by Harman researcher Dr. Sean Olive (who worked with the aforementioned Dr. Toole). In the first one, a tightly controlled study with 13 speaker models, Olive found preferences could be correlated with comprehensive on- and off-axis measurements to essentially 100% accuracy.
A second, more generalized study with 70 speakers found that measurements could predict speaker preference with approximately 86% accuracy. And these are just two studies of many over the past few decades.
Though the research isn’t without flaws, these are remarkable results. Imagine if you had that kind of predictive power with all of your purchasing decisions — if you could look at a graph or two and be reasonably confident you would prefer a certain phone, TV, laptop, microwave, bike, or what have you over another. While you can of course compare the specs for any of these devices, it’s rare that we have data tying those specs directly to preference.
With speakers, we have that rare luxury. I’d go so far as to say that speaker measurements are more important than written reviews — if you know how to interpret them. I’d be completely happy if the people who read my own reviews skim over my written listening impressions and jump straight to the measurements and analysis.
Although measurements are of course most useful when combined with listening, listening impressions alone are fickle, subject to biases, and even to your mood that specific day. Properly done measurements, on the other hand, can be replicated across different measurement rigs to a high degree of repeatability. So knowing how to read measurements gives you a much better chance of buying something that you know you’ll like.
So which speakers are the best?
These double-blind tests have consistently shown the best performing speakers tend to exhibit three qualities:
A flat-ish frequency response on the primary listening axis. This is the frequency response measured under anechoic conditions (free of reflections) in order to isolate the sound that travels directly from the speaker to your ears in a line of sight. This is often called the ‘on-axis‘ sound.
Smooth ‘directivity’ or ‘dispersion.’ This is how the speaker’s frequency response changes at angles away from the primary listening axis. This is important because a speaker’s sound is affected by both the direct sound and the sound that reflects off our walls. This is often called the ‘off-axis‘ sound.
Ample bass extension. Few speakers extend all the way down to 20Hz, so it’s an improvement when these frequencies are present. You could argue this is just another facet of a flattish frequency response, since missing bass means the frequency response is no longer flat.
There are other things that can have an influence at the highest levels of performance, but these three qualities are by far the most important.
It usually doesn’t matter what type of listener you are. Engineers, researchers, audiophiles, reviewers, and everyday consumers tend to rank speakers similarly in double-blind tests, even if their listening skills differ.
Credit: Audio Musings by Sean Olive. In this study, 16 different groups of listeners ranked four speakers similarly, even if the ratings given to each speaker varied.
Personal taste doesn’t completely disappear, but the best speakers do trend towards having the above qualities. It doesn’t matter if they’re studio monitors or hi-fi speakers.
What does a great frequency response look like?
A flat line.
That’s it?
I mean, basically.
So why aren’t all speakers flat?
Because it’s really hard to do right. Also, some designers don’t agree with the science mentioned so far and prefer to tune things ‘by ear.’ And most speakers can’t extend all the way down to the lowest bass frequencies.
Still, aiming for flat is a good goal, one that means that the speaker is likely to reproduce the recording accurately. To give you a more realistic idea of what to expect, here’s an actual speaker with one of the flattest responses I’ve measured.
This would be considered fantastic performance; flatter than this and you are pushing the limitations of my measurement system. Here’s another one, albeit with less bass output:
There are some things you should know though.
When I say ‘frequency response,’ I mean the anechoic on-axis response — a measurement that does not include the effects of room reflections. This is the direct sound that heads from the speaker straight to your ears in a direct line of sight.
The most popular methods for capturing an anechoic response include an anechoic chamber, a fancy robot called the Klippel Near-Field Scanner, or a less accurate DIY method called a ‘quasi-anechoic‘ measurement, which removes reflections from the data. I use the latter method because I ain’t got that kind of money.
The anechoic response is important because although rooms do affect the sound of a speaker, our ears are quite good at hearing a speaker ‘through’ a room (especially above the bass frequencies). This is much the same way you can tell your friend’s voice apart from someone else’s whether you are in your apartment, a restaurant, or an airport.
Indeed, studies show that even if speakers sound a bit different in different rooms, people will tend to rank them similarly regardless of the listening space.
How do I understand deviations from flat?
Frequency response is typically divided into the lows/bass, mids, and highs/treble. Different sources divide the frequency ranges differently, but generally speaking, 20-250 Hz covers the bass, 250 to 2,000-4,000 Hz covers the mids, and everything above is the treble.
If the frequency response dips in a certain region, it means that part of the sound will sound quieter during playback. If it is higher, that region will be more audible. For example, this fictional (very bad) speaker…
…might be described as having an exaggerated bass, a recessed midrange, and/or bright treble. A bit of deviation in any single region isn’t necessarily a bad thing — sometimes a bit of extra bass is even enjoyable for the extra tactility — but combining so many issues is a problem.
If you want to get more nitty-gritty about how deviations in different frequency ranges affect different parts of sound, here’s a handy chart from DIY-Audio-Heaven:
And here are how different instruments correspond to different frequency ranges, courtesy of the Independent Recording Network (an interactive version here):
(In case you’re wondering why the piano only goes up to the mids, it’s worth noting that ‘treble’ as used in acoustics does not necessarily line up with treble as used in music.)
A few more things to keep in mind. First, peaks in the frequency response are generally a bit more audible than similarly sized dips.
Second, a large, shallow dip or bump in the frequency response is often more audible than a narrow dip or peak. While we can be sensitive to changes of less than 1 dB in the frequency response, we’re more likely to hear them when the deviation covers a wide range of frequencies.
Lastly, most people can’t hear very well beyond 10kHz, with our hearing getting worse the older we get and the louder we listen to music. There also isn’t much music content up there. While it doesn’t hurt to keep this region flat too, deviations in this region are less likely to be problematic than most other places in the frequency response.
I saw a manufacturer post a super smooth frequency response! That’s good then?
It might be good. There are two things you should be particularly careful about when looking at a frequency response, especially when it’s posted by a non-independent party like a manufacturer.
Vertical scale
Using an exaggerated vertical scale is one of the easiest ways of making a speaker look better than it really is. A typical standard, and the one I use for all of my measurements, requires 50 dB on the Y axis to be equal in length to the distance between 20Hz and 2kHz on the X axis.
This aspect ratio highlights flaws more than most measurements posted online. But now here is the same exact measurement with a compressed Y-scale:
That looks way better than the speaker actually is. You should always look at the Y-Axis scale before making an assessment about a measurement.
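To see how much the vertical scale alone changes the impression a graph gives, here’s a quick sketch with made-up data: the same wiggly response plotted over a 50 dB span and over a compressed 200 dB span. The data and exact spans are just for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

freqs = np.geomspace(20, 20_000, 500)
# Fictional response: flat-ish with a few dB of wiggle.
response = 85 + 2.5 * np.sin(np.log10(freqs) * 9) + np.random.normal(0, 0.4, freqs.size)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for ax, ylim, title in [(ax1, (60, 110), "Honest scale (50 dB span)"),
                        (ax2, (0, 200), "Compressed scale (200 dB span)")]:
    ax.semilogx(freqs, response)   # log frequency axis, as in real measurements
    ax.set_ylim(*ylim)
    ax.set_xlabel("Frequency (Hz)")
    ax.set_ylabel("SPL (dB)")
    ax.set_title(title)
plt.tight_layout()
plt.show()
```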
Smoothing
It’s exactly what it sounds like: smoothing out the frequency response. For example, I use 1/24-octave smoothing, and I generally don’t like to see anechoic responses using more than 1/12-octave smoothing.
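As a rough sketch of what fractional-octave smoothing does (this mirrors the idea, not any particular measurement package, and it averages dB directly for simplicity):

```python
import numpy as np

def octave_smooth(freqs: np.ndarray, db: np.ndarray, fraction: int = 24) -> np.ndarray:
    """Smooth a response in dB with a 1/`fraction`-octave wide moving average."""
    half_bw = 2 ** (1 / (2 * fraction))        # half the bandwidth, as a frequency ratio
    smoothed = np.empty_like(db, dtype=float)
    for i, f in enumerate(freqs):
        band = (freqs >= f / half_bw) & (freqs <= f * half_bw)
        smoothed[i] = db[band].mean()          # wider band = heavier smoothing = fewer wiggles
    return smoothed

# fraction=24 (1/24 octave) keeps most detail;
# fraction=2 (1/2 octave) will hide narrow dips and peaks almost entirely.
```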
But it’s common to see measurements at 1/6- to 1/2-octave smoothing to hide flaws, or just because it looks ‘prettier.’ Combine a tall vertical scale with smoothing, and you can easily clean up a messy speaker. For an exaggerated example:
Smoothing can be useful for assessing trends while ignoring harmless minor deviations, especially for in-room measurements. But when it comes to anechoic measurements, it should only be used in conjunction with higher resolution measurements.
Luckily, it’s usually pretty easy to spot smoothed measurements because they look unnaturally smooth. Even the best speakers have some jagged bits to their measurements.
Tell me more about that smooth directivity stuff
When we listen to speakers in a room, we don’t just hear the sound that travels in a straight line towards our ears — what we call the direct sound. Our ear-brain systems make it such that, for a small window of time, delayed sounds actually contribute to the sound of a speaker. The strongest contributions typically come from the very first bounces off of your walls, floor, and ceiling; we call these the ‘early reflections.’
It makes sense if you think about it — and not just with speakers. After all, why don’t we hear all the reflections when someone is speaking in a small room?
Well, because that’d be annoying and we could never understand each other or know where sounds are coming from. So instead, our brains ‘add’ these loudest reflections to the direct sound to create a single apparent sound source (thanks, brain!). It is not until the reflections are delayed much more that we hear them as distinct sounds (think echoes and reverb in a large venue). This is related to something called the precedence effect.
So you might imagine why having high-quality early reflections is important if these are being added to the direct sound. If the direct sound is flat but the cumulative early reflections have a large dip in the midrange, for example, the speaker will sound like it is recessed in the midrange to some degree. Likewise, the soundstage will likely become fuzzier and less stable when the early reflections do not resemble the direct sound closely.
One important thing to note: Because our ears are horizontally aligned, having smooth horizontal directivity is important for both soundstage and tonality. That’s why you’ll usually find vertical directivity is worse than horizontal directivity: it has less of an impact on the soundstage, although it still makes important contributions to tonality.
Nonetheless, when people say ‘smooth directivity,’ they’re usually focusing more on the horizontal portion.
What does smooth directivity look like?
There are many ways of displaying a speaker’s directivity performance. The most basic way — and my preferred method — is by simply graphing a speaker’s frequency response at different angles.
A typical speaker will have a frequency response that tilts downward as you move further away from the on-axis sound, but it should maintain the same basic shape. Here’s a fictional speaker at 0 and 60 degrees off-axis:
Now here is a more typical off-axis graph of what is considered a very good real speaker:
You can see the response changes smoothly as you move further away from the on-axis sound. Usually, only the front hemisphere is included because it makes the largest contribution to the sound.
Now here is a speaker that does not have good horizontal directivity:
(You don’t usually see speakers quite this bad nowadays, but I have seen them.)
Even though it maintains a fairly linear direct sound, that off-axis dip suggests that the soundstage will be fuzzy and/or unstable, likely to fall apart if you are not perfectly centered in front of the speaker.
Here’s one more situation. What happens if the direct sound is awful, but the directivity is still good? It might look something like this:
This speaker will clearly have compromised tonality due to the large dips in its response; however, the relationship between the on-axis and off-axis curves is nonetheless still smooth. In practice, this would likely lead to a speaker with uneven tonality but a good soundstage.
For reasons beyond the scope of this piece, the good thing about such a speaker is that you might very well be able to apply EQ to it to fix its tonality. A bad frequency response can be fixed with EQ, but bad directivity cannot; it is inherent to the speaker’s design.
It’s worth noting that a speaker can have good directivity in different ways. For example, some speakers opt for ‘wide’ directivity, which shows up as measurements that tilt less off-axis. Other speakers have ‘narrow’ directivity, which means a quieter, more tilted off-axis response.
There aren’t strict definitions for what is narrow and what is wide, so these terms are best used when comparing two speakers. Neither is better than the other, and this is just a matter of preference and interaction with your room.
Wider directivity speakers mean louder reflections, which tends to mean a larger soundstage at the expense of some imaging precision. Narrower directivity speakers may have more focused imaging, but a smaller soundstage.
For example, this speaker…
…has wider directivity than this speaker…
…as the response on the former doesn’t tilt as much. You can see how by 60 degrees (the pink line), the latter speaker’s response has tilted downward much more. I’m simplifying things a bit, but this suggests the wall reflections will be quieter, and the speaker will likely have a narrower, but perhaps more ‘precise’ soundstage.
There are other ways of demonstrating directivity, with the most common alternative being a polar map (sometimes called a contour plot, heat map, or beamwidth graph). Here’s an example of a great speaker:
Many times directivity measurements will also be ‘normalized’ to the on-axis response, meaning that rather than showing the true response at off-axis angles, they show how much the measurement deviates from the on-axis curve.
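In practice, normalization is just a subtraction in dB; here’s a minimal sketch:

```python
import numpy as np

def normalize_to_on_axis(off_axis_db: np.ndarray, on_axis_db: np.ndarray) -> np.ndarray:
    """Show deviation from the on-axis response rather than absolute level."""
    # A flat 0 dB result would mean the off-axis curve has exactly the same
    # shape as the on-axis curve at that angle.
    return off_axis_db - on_axis_db
```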
Frequency response and directivity together: What is a ‘spinorama?’
The single most important graph you’ll see me use in my reviews is called a ‘spinorama,’ so-called because creating one requires rotating a speaker about its horizontal and vertical axes to capture the frequency response at 70 angles.
It’s a measurement format that has become increasingly popular in the past few years among objective reviewers, developed by the researchers at Harman, and now part of the ANSI/CTA-2034-A standard for speaker measurements.
This is a spinorama for a very competent speaker.
It is basically a speaker measurement Cliff’s Notes, summarizing a speaker’s frequency response and directivity in one handy image. Although analyzing a speaker’s performance can require some nuance, and comparing two very good speakers may require more information than is present in the spinorama, this singular graph is usually enough to separate the ‘good’ speakers from the ‘bad’ ones.
Here’s my summary of what each of the above lines means (note that the colors are not standardized, these are just the ones I typically use).
The On-Axis (green) and Listening Window (white) curves represent the ‘direct’ sound of the speaker before any reflections, and they should be relatively flat.
The On-Axis is measured with the speaker aimed directly at the microphone. The Listening Window is an average of 9 angles (H represents horizontal, and V represents vertical): 0°, ± H10°, ± H20°, ± H30°, ± V10°.
The Listening Window accounts for the fact that most people don’t sit perfectly still or centered, so it is generally the more important of the two, especially for living room listening. It also helps eliminate inaudible deviations in the frequency response that sometimes only show up when the microphone is exactly on-axis.
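Here’s a minimal sketch of how a Listening Window curve can be computed from the nine individual angle measurements. I average energy here, though some tools simply average the dB values, and the angle labels are just hypothetical dictionary keys.

```python
import numpy as np

def listening_window(curves_db: dict) -> np.ndarray:
    """curves_db maps angle labels (e.g. 'On', 'H+10', 'V-10') to responses in dB."""
    lw_angles = ["On", "H+10", "H-10", "H+20", "H-20", "H+30", "H-30", "V+10", "V-10"]
    stacked = np.stack([10 ** (curves_db[a] / 10) for a in lw_angles])  # dB -> power
    return 10 * np.log10(stacked.mean(axis=0))                          # mean power -> dB
```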
As the first and loudest sound to arrive at our ears, the direct sound has a huge impact on our perception of tonality. The other curves, meanwhile, represent the ‘off-axis’ sound — the sound that will reflect off your walls.
The Early Reflections curve (blue, top) is particularly important, as discussed earlier. It is calculated by taking the average of five averages, each representing the sounds that are likely to reflect off the walls in most rooms:
The ER curve should generally tilt down a few dB from 20Hz to 20kHz; how much will depend on the speaker’s directivity characteristics. The most important thing is that its shape roughly matches the direct sound, indicating the reflected sounds are similar in character.
The Sound Power curve (red, top) represents an average of the speaker’s sound in all directions. It’s not as useful as the other curves for speakers that mostly radiate sound forward, but it should generally look like an even steeper version of the ER curve.
The Predicted In-Room Response curve (purple) estimates how a speaker will measure in a real room by combining data from the LW, ER, and SP curves. It is, in a way, a refinement of the ER curve. For the majority of speakers, the PIR curve looks very similar to the Early Reflections curve but tilted a tiny bit more, so it is often omitted.
If there is a bump in the response that persists in each of the top curves, it is likely a resonance in the speaker. Resonances are bad, as they tend to be especially audible, often causing a specific type of boominess or harshness.
An exaggerated example of a resonance. They are usually subtler than this, but can still be very audible.
The Directivity Index (red, bottom) and Early Reflections DI (blue, bottom) curves tell us how similar the off-axis sound is to the direct sound. These are calculated by subtracting the Sound Power and Early Reflections curves from the Listening Window, respectively.
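As a sketch, that subtraction looks like this (arrays of dB values are assumed; this isn’t any particular tool’s code):

```python
import numpy as np

def directivity_indices(listening_window_db: np.ndarray,
                        sound_power_db: np.ndarray,
                        early_reflections_db: np.ndarray):
    """Returns (Sound Power DI, Early Reflections DI), both in dB."""
    sp_di = listening_window_db - sound_power_db          # DI
    er_di = listening_window_db - early_reflections_db    # ERDI
    return sp_di, er_di  # smooth, gently rising curves are what you want to see
```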
Smooth DI curves are a quick and easy way to assess directivity performance. However, one flaw of the spinorama standard is that it does not distinguish between horizontal and vertical performance, even though the former is more important for the soundstage.
For this, I personally choose to calculate a ‘Horizontal ERDI’ (yellow, dashed) which only considers the horizontal elements of the ERDI curve. This line in particular should be very smooth for a good soundstage.
An ideal spinorama might look something close to this:
But again, no speaker is quite this good. Meanwhile, a godawful spinorama might look something like this:
Thankfully, I haven’t seen a speaker quite this bad.
So are frequency response and directivity everything?
No, but they sure get you really close to the full story.
You might’ve noticed I’ve said nothing about distortion so far, the classic audiophile bugbear. That’s because quantifying what qualifies as ‘bad’ distortion is really hard. There’s no reliable research I’m aware of that shows a clear link between a certain amount of distortion and speaker preference.
In the aforementioned preference studies, distortion was measured for each speaker, but found to have few links to preference.
Moreover, the types of distortion measurements available to most reviewers are fairly rudimentary. And while distortion may sometimes correlate with some artifact you’re hearing, more often than not that artifact can be explained by something in the frequency response and directivity.
To quote Dr. Toole on the subject:
“The result of this is that traditional measures of harmonic or intermodulation distortion are almost meaningless. They do not quantify distortion in a way that can, with any reliability, predict a human response to it while listening to music or movies. They do not correlate because they ignore any characteristics of the human receptor, itself an outrageously non-linear device. The excessive simplicity of the signals also remains a problem. Music and movies offer an infinite variety of input signals and therefore an infinite variety of distorted outputs. The only meaningful target for conventional distortion metrics is “zero.” Above that, somebody, sometime, listening to something, may be aware of distortion, but we cannot define it in advance.”
When it comes to distortion, I personally only worry about it if I can clearly hear it.
Although not distortion in the traditional sense, one type of deviation to keep in mind is what is often colloquially referred to as speaker ‘compression’ or a ‘limiter.’
Many modern speakers with built-in amplifiers use DSP to push bass performance beyond what they’d be able to do in a traditional design. In these situations, the DSP is programmed to reduce bass output once you turn up the volume beyond the speaker’s comfort zone. This means that the speaker’s frequency response will actually change significantly at different volumes (directivity remains the same). In these situations, I will typically capture a frequency response at different levels to give you an idea of the speaker’s output.
Lastly, I want to reiterate that this article is about a speaker’s anechoic performance. Below 300-500Hz, the room begins to have a larger effect, and it has a massive impact on a speaker’s bass performance. I’ll have a separate write-up on optimizing a speaker’s performance in-room.
I’d like to learn more!
Even if you don’t agree that a flat speaker with smooth directivity sounds best, you still benefit from knowing how to interpret measurements. Let’s say you know you prefer a little more treble and bass than most, perhaps because you have hearing issues or like to listen very quietly (bass is harder to hear at low volumes); measurements can still tell you that too!
So if you want to learn more, I can’t recommend Dr. Toole’s book enough. It provides an incredible wealth of knowledge with myriad citations. If you have an hour to kill, you can check out this lecture which summarizes many of the concepts:
My friend Erin over at Erin’s Audio Corner just released an excellent and comprehensive series of videos describing measurements and more:
If you want to know where to find more speaker measurements and perhaps compare the data with your own impressions, here are some other resources that publish extensive frequency response and directivity measurements — some of them with spinoramas, some in other forms.
There might be others I’m missing, but hopefully, this provides a good selection of resources to get started. Better yet, the amount of available speaker measurements is increasing all the time. It’s a great time to be an audiophile — at least one who sees value in the data.
On the heels of a computer vision system that achieved state-of-the-art accuracy with minimal supervision, Facebook today announced a project called Learning from Videos that’s designed to automatically learn audio, textual, and visual representations from publicly available Facebook videos. By learning from videos spanning nearly every country and hundreds of languages, Facebook says the project will not only help it to improve its core AI systems but enable entirely new experiences. Already, Learning from Videos, which began in 2020, has led to improved recommendations in Instagram Reels, according to Facebook.
Continuously learning from the world is one of the hallmarks of human intelligence. Just as people quickly learn to recognize places, things, and other people, AI systems could be smarter and more useful if they managed to mimic the way humans learn. As opposed to relying on the labeled datasets used to train many algorithms today, Facebook, Google, and others are looking toward self-supervised techniques that require few or no annotations.
For example, Facebook says it’s using Generalized Data Transformations (GDT), a self-supervised system that learns the relationships between sounds and images, to suggest Instagram Reel clips relevant to recently watched videos while filtering out near-duplicates. Consisting of a series of models trained across dozens of GPUs on a dataset of millions of Reels and videos from Instagram, GDT can learn that a picture of an audience clapping probably goes with the sound of applause, or that a video of a plane taking off likely goes with a loud roar. Moreover, by leveraging audio as a signal, the system can surface recommendations based on videos that sound alike as well as videos that look alike.
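The general flavor of this kind of audio-visual self-supervision is a contrastive objective: embeddings of audio and video taken from the same clip are pulled together, while mismatched pairs are pushed apart. Here’s a hypothetical PyTorch sketch, not Facebook’s actual implementation; the encoders producing the embeddings are assumed.

```python
import torch
import torch.nn.functional as F

def contrastive_av_loss(video_emb: torch.Tensor,
                        audio_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """video_emb, audio_emb: (batch, dim) outputs of separate video/audio encoders."""
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)    # matching pairs sit on the diagonal
    # Symmetric cross-entropy: video-to-audio and audio-to-video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```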
When asked which Facebook and Instagram users were subjected to having their content used to train systems like GDT and whether those users were informed the content was being used in this way, a Facebook spokesperson told VentureBeat that the company informs account holders in its data policy that Facebook “uses the information we have to support research and innovation.” In training other computer vision systems such as SEER, a self-supervised AI model that Facebook detailed last week, OneZero notes that the company has purposely excluded user images from the European Union, likely because of GDPR.
Above: Facebook’s AI identifies and groups together similar Instagram videos and Reels.
Image Credit: Facebook
Learning from Videos also encompasses Facebook’s work on wav2vec 2.0, an improved machine learning framework for self-supervised speech recognition. The company says that when applied to millions of hours of unlabeled videos and 100 hours of labeled data, wav2vec 2.0 reduced the relative word error rate by 20% compared with supervised-only baselines. As a next step, Facebook says it’s working to scale wav2vec 2.0 with millions of additional hours of speech from 25 languages to reduce labeling, bolster the performance of low- and medium-resource models, and improve other speech and audio tasks.
In a related effort to make it easier to search across videos, Facebook says it’s using a system called the Audio Visual Textual (AVT) model that aggregates and compares sound and visual information from videos as well as titles, captions, and descriptions. Given a command like “Show me every time we sang to Grandma,” the AVT model can find the relevant moment and highlight the nearest timestamps in the video. Facebook says it’s working to apply the model to millions of videos before it begins testing it across its platform. It’s also adding speech recognition as one of the inputs to the AVT model, which will allow the system to respond to phrases like “Show me the news show that was talking about Yosemite.”
TimeSformer
The Learning from Videos project also birthed TimeSformer, a Facebook-developed framework for video understanding that’s based purely on the Transformer architecture. Transformers employ a trainable attention mechanism that specifies the dependencies between elements of each input sequence — for instance, amino acids within a protein. It’s this that enables them to achieve state-of-the-art results in areas of machine learning including natural language processing, neural machine translation, document generation and summarization, and image and music generation.
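For readers unfamiliar with the mechanism, here is a minimal numpy sketch of the scaled dot-product attention at the core of Transformers; it is illustrative only and omits the learned projections and multiple heads.

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # pairwise dependencies between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
    return weights @ V                                     # each output is a weighted mix of values
```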
Facebook claims that TimeSformer, short for Time-Space Transformer, attains the best reported numbers on a range of action recognition benchmarks. It also takes roughly one-third the time to train as comparable models and requires less than one-tenth the compute for inference, and it can learn from video clips up to 102 seconds in length, much longer than most video-analyzing AI models. Facebook AI research scientist Lorenzo Torresani told VentureBeat that TimeSformer can be trained in 14 hours with 32 GPUs.
“Since TimeSformer specifically enables analysis of much longer videos, there’s also the opportunity for interesting future applications such as episodic memory retrieval — ability to detect particular objects of interest that were seen by an agent in the past — and classifying multi-step activities in real time like recognizing a recipe when someone is cooking with their AR glasses on,” Torresani said. “Those are just a few examples of where we see this technology going in the future.”
It’s Facebook’s assertion that systems like TimeSformer, GDT, wav2vec 2.0, and AVT will advance research to teach machines to understand long-form actions in videos, an important step for AI applications geared toward human understanding. The company also expects they’ll form the foundation of applications that can comprehend what’s happening in videos on a more granular level.
“[All] these models will be broadly applicable, but most are research for now. In the future, when applied in production, we believe they could do things like caption talks, speeches, and instructional videos; understand product mentions in videos; and search and classification of archives of recordings,” Geoffrey Zweig, director at Facebook AI, told VentureBeat. “We are just starting to scratch the surface of self-supervised learning. There’s lots to do to build upon the models that we use, and we want to do so with speed and at scale for broad applicability.”
Facebook chose not to respond directly to VentureBeat’s question about how any bias in Learning from Videos models might be mitigated, instead saying: “In general, we have a cross-functional, multidisciplinary team dedicated to studying and advancing responsible AI and algorithmic fairness, and we’re committed to working toward the right approaches. We take this issue seriously, and have processes in place to ensure that we’re thinking carefully about the data that we use to train our models.”
Research has shown that state-of-the-art image-classifying AI models trained on ImageNet, a popular (but problematic) dataset containing photos scraped from the internet, automatically learn humanlike biases about race, gender, weight, and more. Countless studies have demonstrated that facial recognition is susceptible to bias. It’s even been shown that prejudices can creep into the AI tools used to create art, potentially contributing to false perceptions about social, cultural, and political aspects of the past and hindering awareness about important historical events.
Facebook chief AI scientist Yann LeCun recently admitted to Fortune that fully self-supervised computer vision systems can pick up the biases, including racial and gender stereotypes, inherent in the data. In acknowledgment of the problem, a year ago Facebook set up new teams to look for racial bias in the algorithms that drive its social network as well as Instagram. But a bombshell report in MIT Tech Review this week revealed that at least some of Facebook’s internal efforts to mitigate bias were co-opted to protect growth or in anticipation of regulation. The report further alleges that one division’s work, Responsible AI, became essentially irrelevant to fixing the larger problems of misinformation, extremism, and political polarization.
The pandemic appears to have supercharged voice app usage, which was already on an upswing. According to a study by NPR and Edison Research, the percentage of voice-enabled device owners who use commands at least once a day rose between the beginning of 2020 and the start of April. Just over a third of smart speaker owners say they listen to more music, entertainment, and news from their devices than they did before, and owners report requesting an average of 10.8 tasks per week from their assistant this year compared with 9.4 different tasks in 2019. According to a new report from Juniper Research, consumers will interact with voice assistants on 8.4 billion devices by 2024.
But despite their growing popularity, assistants like Alexa, Google Assistant, and Siri still struggle to understand diverse regional accents. According to a study by the Life Science Centre, 79% of people with accents alter their voice to make sure that they’re understood by their digital assistants. And in a recent survey commissioned by the Washington Post, popular smart speakers made by Google and Amazon were 30% less likely to understand non-American accents than those of native-born users.
Traditional approaches to narrowing the accent gap would require collecting and labeling large datasets of different languages, a time- and resource-intensive process. That’s why researchers at MLCommons, a nonprofit related to MLPerf, an industry-standard set of benchmarks for machine learning performance, are embarking on a project called 1000 Words in 1000 Languages. It’ll involve creating a freely available pipeline that can take any recorded speech and automatically generate clips to train compact speech recognition models.
“In the context of consumer electronic devices, for instance, you don’t want to have to go out and build new language datasets because that’s costly, tedious, and error-prone,” Vijay Janapa Reddi, an associate professor at Harvard and a contributor on the project, told VentureBeat in a phone interview. “What we’re developing is a modular pipeline where you’ll be able to plug in different sources [of] speech and then specify the [words] for training that you want.”
While the pipeline will be limited in scope in that it’ll only create training datasets for small, low-power models that continually listen for specific keywords (e.g., “OK Google” or “Alexa”), it could represent a significant step toward truly accent-agnostic speech recognition systems. Conventionally, training a new keyword-spotting model would require manually collecting thousands of labeled audio clips for each keyword. When the pipeline is released, developers will be able to simply provide a list of keywords they wish to detect along with a speech recording, and the pipeline will automate the extraction, training, and validation of models without requiring any labeling.
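The MLCommons pipeline itself had not been released at the time of writing, but conceptually the clip-extraction step could look something like this hypothetical Python sketch, which assumes you already have word-level timestamps for a recording (e.g., from a forced aligner) and slices out fixed-length clips around each target keyword.

```python
import numpy as np

def extract_keyword_clips(waveform, sample_rate, alignments, keywords, clip_seconds=1.0):
    """waveform: 1-D numpy array of audio samples.
    alignments: iterable of (word, start_sec, end_sec) tuples from an aligner.
    keywords: set of lowercase target words.
    Returns a dict mapping each keyword to a list of fixed-length clips."""
    half = clip_seconds / 2
    clips = {k: [] for k in keywords}
    for word, start, end in alignments:
        word = word.lower()
        if word not in keywords:
            continue
        center = (start + end) / 2
        lo = max(0, int((center - half) * sample_rate))
        hi = min(len(waveform), int((center + half) * sample_rate))
        clips[word].append(waveform[lo:hi])
    return clips

# The resulting clips (plus non-keyword segments as negatives) could then be fed
# to any small keyword-spotting model for training and validation.
```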
“It’s not even really creating a dataset, it’s just training a dataset that comes about as a result of searching the larger corpus,” Reddi explained. “It’s like doing a Google search. What you’re trying to do is find a needle in a haystack — you end up with a subset of results with different accents and whatever else you have in there.”
The 1000 Words in 1000 Languages project builds on existing efforts to make speech recognition models more accessible — and equitable. Mozilla’s Common Voice, an open source and annotated speech dataset, consists of voice snippets and voluntarily contributed metadata useful for training speech engines like speakers’ ages, sex, and accents. As a part of Common Voice, Mozilla maintains a dataset target segment that aims to collect voice data for specific purposes and use cases, including the digits “zero” through “nine” as well as the words “yes,” “no,” “hey,” and “Firefox.” For its part, in December, MLCommons released the first iteration of a public 86,000-hour dataset for AI researchers, with later versions due to branch into more languages and accents.
“The organizations that have a huge amount of speech are often large organizations, but speech is something that has many applications,” Reddi said. “The question is, how do you get this into the hands of small organizations that don’t have the same scale as big entities like Google and Microsoft? If they have a pipeline, they can just focus on what they’re building.”
A team of researchers from MIT and Massachusetts General Hospital recently published a study linking social awareness to individual neuronal activity. To the best of our knowledge, this is the first time evidence for the ‘theory of mind‘ has been identified at this scale.
Measuring large groups of neurons is the bread-and-butter of neurology. Even a simple MRI can highlight specific regions of the brain and give scientists an indication of what they’re used for and, in many cases, what kind of thoughts are happening. But figuring out what’s going on at the single-neuron level is an entirely different feat.
Here, using recordings from single cells in the human dorsomedial prefrontal cortex, we identify neurons that reliably encode information about others’ beliefs across richly varying scenarios and that distinguish self- from other-belief-related representations … these findings reveal a detailed cellular process in the human dorsomedial prefrontal cortex for representing another’s beliefs and identify candidate neurons that could support theory of mind.
In other words: the researchers believe they’ve observed individual brain neurons forming the patterns that cause us to consider what other people might be feeling and thinking. They’re identifying empathy in action.
This could have a huge impact on brain research, especially in the area of mental illness and social anxiety disorders or in the development of individualized treatments for people with autism spectrum disorder.
Perhaps the most interesting thing about it, however, is what we could potentially learn about consciousness from the team’s work.
The researchers asked 15 patients who were slated to undergo a specific kind of brain surgery (not related to the study) to answer a few questions and undergo a simple behavioral test. Per a press release from Massachusetts General Hospital:
Micro-electrodes inserted in the dorsomedial prefrontal cortex recorded the behavior of individual neurons as patients listened to short narratives and answered questions about them. For example, participants were presented with this scenario to evaluate how they considered another’s beliefs of reality: “You and Tom see a jar on the table. After Tom leaves, you move the jar to a cabinet. Where does Tom believe the jar to be?”
The participants had to make inferences about another’s beliefs after hearing each story. The experiment did not change the planned surgical approach or alter clinical care.
The experiment basically took a grand concept (brain activity) and dialed it in as much as possible. By adding this layer of knowledge to our collective understanding of how individual neurons communicate and work together to give rise to what’s ultimately a theory of other minds within our own consciousness, it may become possible to identify and quantify other neuronal systems in action using similar experimental techniques.
It would, of course, be impossible for human scientists to come up with ways to stimulate, observe, and label 100 billion neurons – if for no other reason than the fact that it would take thousands of years just to count them, much less watch them respond to provocation.
Luckily, we’ve entered the artificial intelligence age, and if there’s one thing AI is good at, it’s doing really monotonous things, such as labeling 80 billion individual neurons, really quickly.
It’s not much of a stretch to imagine the Massachusetts team’s methodology being automated. While it appears the current iteration requires the use of invasive sensors – hence the use of volunteers who were already slated to undergo brain surgery – it’s certainly within the realm of possibility that such fine readings could be achieved with an external device one day.
The ultimate goal of such a system would be to identify and map every neuron in the human brain as it operates in real time. It’d be like seeing a hedge maze from a hot air balloon after an eternity lost in its twists.
This would give us a god’s eye view of consciousness in action and, potentially, allow us to replicate it more accurately in machines.