In early October, CNN revealed that veteran voice actor Susan Bennett was the voice behind Siri until Apple changed it in iOS 7. But even more interesting than Siri’s identity is the question of how a person’s voice is transformed into software that can synthesize any text thrown at it.

My Voice Is My Passport

In the movie Sneakers, Robert Redford’s ragtag team of hackers bypasses a company’s voice-based security system by splicing together individual words taped from an unsuspecting employee. The process of giving voice to iOS’s digital assistant may not be all that different. “For a large and dynamic synthesis application, the voice talent (one or more actors) will be needed in the recording studio for anywhere from several weeks to a number of months,” says veteran voice actor Scott Reyns, who is based in San Francisco.
“They’ll end up reading from thousands to tens of thousands of sentences so that a good amount of coverage is recorded for phrasing and intonation.” According to Arash Zafarnia, director for consulting firm Handsome, based in Austin, Texas, consistency is key to obtaining a good voice sample: “The same words and phrases have to be repeated dozens of times.”

Slice and Dice

Once the initial voice data has been collected, it must be broken down into components that can be reassembled into new words. Think of the process as a high-tech version of cutting and splicing different lengths of tape together. Producing high-quality output requires that single words be broken down into phonemes—the building blocks of every spoken language. For example, the word Macintosh can be broken down into eight phonemes, which are then classified according to the universally recognized International Phonetic Alphabet. That reduces the word to its basic sounds. Each sound is categorized, with multiple variations stored in a database. Common phonetic combinations are also extracted from the source material and stored alongside the individual phonemes to make the output sound more natural. The amount of work that goes into the voice-synthesizing phase is staggering, and it is critical to the ultimate quality of the synthesizer’s speech.

Frankenvoice

Once the phonetic database is complete, it ships alongside the final product: it is either installed on servers that provide voice synthesis remotely across the Internet—as in Siri’s case—or installed directly on a device, as is the case for the VoiceOver software that ships with OS X and iOS. When asked to transform a sentence into speech, the synthesis engine first looks for a predefined entry in its database. If it doesn’t find one, it then tries to make sense of the input’s linguistic makeup, so that it can assign the proper intonation to each word.
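The phoneme step described above can be sketched in a few lines of Python. The pronunciation dictionary and unit database here are hypothetical stand-ins, hand-written for illustration; a real synthesizer ships with vastly larger resources built from the studio recordings.

```python
# Minimal sketch of phoneme decomposition, assuming a hand-written
# pronunciation dictionary (hypothetical; real systems use large lexicons).

# IPA phonemes for a couple of words. "Macintosh" breaks into the
# eight phonemes mentioned in the article.
PRONUNCIATIONS = {
    "macintosh": ["m", "æ", "k", "ɪ", "n", "t", "ɒ", "ʃ"],
    "siri": ["s", "ɪ", "ɹ", "i"],
}

# Several recorded variants per sound, as the article describes; each
# "recording" is just a placeholder label in this sketch.
UNIT_DATABASE = {
    "m": ["m_variant_1", "m_variant_2"],
    "æ": ["æ_variant_1"],
    # ...one entry per phoneme in a real database
}

def phonemize(word: str) -> list[str]:
    """Break a word into its phonemes via dictionary lookup."""
    try:
        return PRONUNCIATIONS[word.lower()]
    except KeyError:
        raise ValueError(f"no pronunciation stored for {word!r}")

print(phonemize("Macintosh"))  # eight phonemes
print(len(phonemize("Macintosh")))  # 8
```

In a production engine, out-of-dictionary words would fall through to letter-to-sound rules rather than raising an error, but the lookup-first structure is the same.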
Next, it breaks the input down into combinations of phonemes and looks for the most appropriate candidate sounds in its database. In an ideal scenario, the engine’s database would contain every possible combination of sounds that a human voice can produce—a goal that is practically impossible to achieve. Instead, the software looks for a series of best matches and strings them together into a final audio stream. In some cases, such as with nonstandard or foreign words, finding good matches is very hard, and the result is a mispronunciation.

Almost Like the Real Thing

Making Siri talk requires the contribution of many experts, from actors to engineers to voice specialists. Still, despite their ever-increasing accuracy, synthesized voices are no substitute for the real thing. “When emotion, engaging and compelling an audience, telling a story, or getting a message across that sells counts,” says Reyns, “companies hire the real thing: actual humans.”
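The “series of best matches” idea can be illustrated with a toy greedy selection over a hypothetical unit database. Real engines run a full search (typically Viterbi-style, over both target and join costs); this sketch only greedily minimizes a made-up join cost, here the pitch jump between adjacent units, so the seams sound smoother. All unit names and pitch values are invented for illustration.

```python
# Toy unit selection: for each phoneme, greedily pick the recorded
# variant whose pitch is closest to the previously chosen unit.
# (Hypothetical data; real systems optimize over many acoustic features.)

UNIT_DB = {
    "s": [{"id": "s_01", "pitch": 110}, {"id": "s_02", "pitch": 130}],
    "ɪ": [{"id": "ɪ_01", "pitch": 115}, {"id": "ɪ_02", "pitch": 140}],
    "ɹ": [{"id": "ɹ_01", "pitch": 120}],
    "i": [{"id": "i_01", "pitch": 118}, {"id": "i_02", "pitch": 145}],
}

def join_cost(prev, candidate):
    """Smaller pitch jumps between adjacent units sound smoother."""
    return abs(prev["pitch"] - candidate["pitch"]) if prev else 0

def select_units(phonemes):
    """Greedily choose one recorded variant per phoneme."""
    chosen, prev = [], None
    for ph in phonemes:
        candidates = UNIT_DB.get(ph)
        if not candidates:
            # A nonstandard or foreign sound with no recorded units:
            # this is where real engines mispronounce.
            raise KeyError(f"no units recorded for phoneme {ph!r}")
        best = min(candidates, key=lambda c: join_cost(prev, c))
        chosen.append(best["id"])
        prev = best
    return chosen

print(select_units(["s", "ɪ", "ɹ", "i"]))  # → ['s_01', 'ɪ_01', 'ɹ_01', 'i_01']
```

The greedy approach can get stuck in locally smooth but globally poor choices, which is one reason production systems search over whole candidate sequences instead.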