Features

Audio Engineering: the Next 40 Years Page 3

Nevertheless, gestural-control and head-tracking technologies share many of the same design attributes, and appear to be maturing at similar rates. Head-tracking requires six degrees of freedom (6DOF): translation along the X, Y, and Z axes, plus rotation about each axis, known as pitch, yaw, and roll. Popular head-motion tracking systems for gaming today cost around $200 and offer a raw resolution of about 640x480, a sample rate of 100 frames per second, and latencies of less than 10 milliseconds. Lab-grade units with better resolution and response are also available.
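For readers who think in code, a 6DOF sample can be pictured as a small record of six numbers. The sketch below is only an illustration: the class name, the field units, and the sample values are assumptions, not the output format of any particular tracker. It also shows that a 100-frames-per-second sample rate leaves 10 milliseconds between samples.

```python
from dataclasses import dataclass


@dataclass
class Pose6DOF:
    """One head-pose sample: three translations plus three rotations."""
    x: float      # translation along X (meters here; the unit is an assumption)
    y: float      # translation along Y
    z: float      # translation along Z
    pitch: float  # rotation, in degrees (assumed unit)
    yaw: float    # rotation, in degrees
    roll: float   # rotation, in degrees


def frame_period_ms(frames_per_second: float) -> float:
    """Time between consecutive tracker samples, in milliseconds."""
    return 1000.0 / frames_per_second


# Hypothetical sample values, chosen only to show the shape of the data.
sample = Pose6DOF(x=0.0, y=0.0, z=0.5, pitch=-5.0, yaw=12.0, roll=0.0)
period = frame_period_ms(100)  # 100 frames per second, as quoted in the text
print(sample, period)
```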

Assuming a doubling of CPE every two years, high-resolution head-motion tracking should reach commodity status by 2025, offering a larger field of use and nearly imperceptible latency. By 2025, IC manufacturers will offer low-cost second- or third-generation silicon tracking solutions that will be used in most headworn devices. And by 2035, ultra-high-resolution, low-cost head-motion tracking will be part of every virtual-reality device.
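The doubling assumption can be turned into rough numbers. The sketch below takes the two-year doubling period at face value and assumes a 2013 baseline (the year of the keynote on which this article is based); the function name and the baseline are illustrative choices, not figures from the author's own model.

```python
def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Multiplicative improvement after `years` of steady doubling."""
    return 2.0 ** (years / doubling_period_years)


BASE_YEAR = 2013  # assumption: baseline taken from the 2013 keynote date

for target_year in (2025, 2035):
    factor = growth_factor(target_year - BASE_YEAR)
    print(f"{target_year}: about {factor:,.0f}x the {BASE_YEAR} level")
```

Under these assumptions the improvement is about 64-fold by 2025 and about 2,048-fold by 2035.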

Convergence
Over the next two decades, therefore, all of these virtual technologies—and others I haven't mentioned, such as haptics (tactile feedback technology)—will converge into a singular media ecosystem. As this happens, the way we produce and consume A/V media will radically change.

By 2025, the transition from mouse to gesture will be well under way, and by 2035 both the mouse and touchscreen paradigms will be fading. By 2025, full-immersion headworn video displays will be rapidly displacing external monitors, especially for gaming. And by 2040, most individual visual-interface applications (home entertainment, business, mobile, etc.) will be headworn.

Today's media-production studios are increasingly migrating from hardware to software applications, yet most professional studios remain anchored to rooms full of hardware. This will change. As A/V headgear becomes more convincing, with immersive performance that more faithfully mimics the physical world, almost all media postproduction will migrate to the virtual domain. The few exceptions will be body-sensed acoustics (haptic feedback or subwoofers, for example).

When we converge all of our technology projections into a single media ecosystem, we recognize that high-resolution visual editing, audio mixing and mastering, game development, music composition, and other A/V production and postproduction tasks will be performed predominantly in the virtual domain by 2040, if not earlier.

Likewise, by 2040—and perhaps as early as 2025—most A/V and gaming will be delivered with stark realism via low-cost headworn devices. And when we combine artificial intelligence with full-immersion virtual reality, the line between production and consumption will blur. With assistance from deep AI running on massively powerful CPU–GPU processors, media consumers will become media creators, participating with others in new forms of self-organizing virtual stories.

The era of fully virtual A/V postproduction is almost here, and in some ways has already begun. Post rooms with giant mixing consoles, racks of outboard hardware and patch panels, video editing suites, external video and audio monitors, touchscreens and physical input devices, and large acoustic architectures will become historical curiosities.

Every functional piece of "production equipment"—every knob, fader, switch, screen, indicator, meter, and patch point—will be visible and gesture-controllable entirely in immersive space. Audio and visual monitoring will migrate from big rooms of external hardware to increasingly lightweight and human-adapted headworn devices. The keyboard and mouse will be replaced by spoken commands and gestures made in free space. And if you really need a keyboard, it will be provided, complete with tactile (haptic) feedback—in virtual space.

Today, a $400 Sony PlayStation 4 employs some 5 billion transistors with 2-teraflop graphics processing—or, according to Ray Kurzweil, about the same computing power as one mouse brain. By 2025, a commodity gaming console will be nearly 10,000 times more powerful than today's machines (fig.7). That's the processing power of a human brain sitting on your desktop—roughly equivalent to the power of IBM's most powerful supercomputer in 2008 (footnote 3).

Fig.7 Media creation computing power trend.

With effectively unlimited processing power and profoundly advanced AI, our future production tools will allow us to call up a complete symphony orchestra in any concert hall of our choosing. Let's add a 200-voice choir, or maybe a great soprano or piano soloist out front. Systems for creating immersive media will allow us to input our own music and interact with each desk of a symphony orchestra—or a gamelan orchestra, or whatever—of any size, and in any space, assuming our desired instruments and acoustic space have been characterized. Gestural and voice commands will make refinements to the score and performance, just as a conductor would rehearse an orchestra in real space, until the ensemble plays exactly as we desire.

On the delivery side, room speakers and monitor screens will not go away. Casual and background A/V environments (cars, businesses, homes) will continue to drive a real-space market. But for the audiophile and videophile worlds, the demographic and technical trend data suggest that within 15 to 20 years we will be well into a transition away from big amplifiers, big speakers, big screens, and big rooms to put them in, and toward an ultra-high-resolution headworn experience that will exceed today's best real-space performance.

Well-recorded music will no longer be subject to wildly variable room acoustics. A recording's spatial and timbral realism will remain far more consistent for all listeners. Unless our technical trends abruptly stop, real-space audiophile and videophile markets will likely begin to decline, perhaps as soon as 2025–30, as new generations experience the superior sense of immersive 3D realism offered by accelerated improvements in headworn technology.

Technology curves show that electromechanical devices have been halving in size every 30 months for the last 50 years. This suggests that headworn A/V-immersion hardware will continue to shrink in size as its powers of resolution increase. It's not much of a stretch to envision ultra-lightweight "transparent headgear" that keeps your eyes and ears open to your real-space environment, while providing immersive qualities on demand: for instance, physically unnoticeable headphones that allow you to hear in-room sounds just as you would with uncovered ears.
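To get a feel for what that rate implies, the sketch below compounds a halving every 30 months over 50 years. It treats "size" as a single scalar measure and takes the quoted rate at face value; both are simplifying assumptions, and the function name is hypothetical.

```python
def size_fraction(years: float, halving_period_months: float = 30.0) -> float:
    """Fraction of the original size left after steady halving."""
    halvings = years * 12.0 / halving_period_months
    return 0.5 ** halvings


fraction = size_fraction(50)   # 50 years at one halving per 30 months
print(50 * 12 / 30, fraction)  # number of halvings, remaining fraction
```

Fifty years at this rate is 20 halvings, which leaves roughly one-millionth of the original size.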

A century ago, Oscar Wilde noted that "life imitates art." Today, technology imitates science fiction. In the not-too-distant future we'll wear holodecks on our heads. The future of music, audio, filmmaking, gaming—any creative media construction, from inception to postproduction to delivery—is boundless, limited only by our imaginations (footnote 4).

About the author: John La Grou is founder and chair of POW-R, the world's leading audio bit-length reduction algorithms. Roughly one-third of all CD and downloaded music is processed with POW-R. He is also founder and CEO of Millennia Media, a design leader in critical audio recording, live sound, postproduction, mastering, and archiving. Millennia is the world's most popular front-end for film scoring and classical music recording, while Millennia's phono preamplifiers are used by the Library of Congress to archive their collection of three million historic audio recordings. John presented an earlier version of this article as the Sunday lunchtime keynote address at the 135th Audio Engineering Society Convention in New York, October 2013.

Footnote 3: Projections of processing speeds since 1990 show that it takes about 17 years for supercomputer power to migrate to commodity desktop and mobile devices.

Footnote 4: Special thanks to Ray Kurzweil.
