Stenographers Are on the Way Out: In the Future, AI Could Transcribe Everything

Published: 2024/10/24 12:32:00

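Artificial intelligence is unstoppable. Though still imperfect, it is very likely to replace typists in the future, freeing people from the drudgery of typing and even from being tethered to their devices. What other effects will convenient, efficient, and cheap AI transcription have on society? This article is adapted from a piece by Greg Noone published in The Atlantic.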

What is the best way to describe Rupert Murdoch having a foam pie thrown at his face? This wasn’t much of a problem for the world’s press, who were content to run articles depicting the incident during the media mogul’s testimony at a 2011 parliamentary committee hearing as everything from high drama to low comedy. It was another matter for the hearing’s official transcriptionist. Typically, a transcriptionist’s job only involves typing out the words as they were actually said. After the pie attack, whether by choice or hemmed in by the conventions of house style, the transcriptionist decided to go the simplest route by marking it as an “[interruption].”


Across professional fields, a whole multitude of conversations (meetings, interviews, and conference calls) need to be transcribed and recorded for future reference. This can be a daily, onerous task, but for those willing to pay, the job can be outsourced to a professional transcription service. The service, in turn, will employ staff to transcribe audio files remotely or, as in my own couple of months in the profession, attend meetings to type out what is said in real time.


Despite the recent emergence of browser-based transcription aids, transcription is an area of drudgery in the modern Western economy where machines can’t quite squeeze human beings out of the equation. That is, until last year, when Microsoft built one that could.

Automatic speech recognition, or ASR, is an area that has gripped Microsoft’s chief speech scientist, Xuedong Huang, since he entered a doctoral program at Scotland’s Edinburgh University. “I’d just left China,” he says, remembering the difficulty he had in using his undergraduate knowledge of American English to parse the Scottish brogue of his lecturers. “I wished every lecturer and every professor, when they talked in the classroom, could have subtitles.”

In order to reach that kind of real-time service, Huang and his team would first have to create a program capable of retrospective transcription. Advances in artificial intelligence allowed them to employ a technique called deep learning, wherein a program is trained to recognize patterns from vast amounts of data. Huang and his colleagues used their software to transcribe the NIST 2000 CTS test set, a bundle of recorded conversations that has served as the benchmark for speech recognition work for more than 20 years. The error rates of professional transcriptionists in reproducing two different portions of the test are 5.9 and 11.3 percent. The system built by the team at Microsoft edged past both.
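For readers curious how such an error rate is typically scored, here is a minimal sketch of a word error rate calculation. It assumes the percentages above are word error rates in the usual sense: the number of word substitutions, deletions, and insertions needed to turn the output into the reference transcript, divided by the number of reference words. The function name `word_error_rate` and the sample sentences are hypothetical illustrations, not the scoring tool or data used for the benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: real scoring tools also normalize punctuation,
    spelling variants, and filler words before comparing.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one wrong word in a nine-word reference.
    reference = "it was rather cold but the atmosphere was great"
    hypothesis = "it was rather old but the atmosphere was great"
    print(f"{word_error_rate(reference, hypothesis):.1%}")
```

On this reading, a 5.9 percent error rate means roughly six words in every hundred need correcting.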


“It wasn’t a real-time system,” acknowledges Huang. “It was very much like we wanted to see, with all the horsepower we have, what is the limit. But the real-time system is not that far off.”

Indeed, the promise of ASR programs capable of accurately transcribing interviews or meetings as they happen no longer seems so outlandish. At Microsoft’s Build conference last month, the company’s vice-president, Harry Shum, demonstrated a PowerPoint transcription service that would allow the spoken words of the presentation to be tied to individual slides. The firm is also in a close race with the likes of Apple and Google to perfect the transcripts produced by its real-time mobile translation app.


Huang believes the point at which transcription software will overtake human capabilities is open to interpretation. “The definition of a perfect result would be controversial,” he says, citing the error rates among human transcriptionists. “How ‘perfect’ this is depends on the scenario and the application.”


An ASR system tasked with transcribing speech in real time is only deemed successful if every word is interpreted correctly, something that largely has been achieved with mobile assistants like Cortana and Siri, but has yet to be mastered in real-time translation apps. However, a growing number of computer scientists are realizing that standards do not need to be as high when it comes to the automatic transcription of recorded audio, where any mistakes in the text can be amended after the fact.




Two companies (Trint, a start-up in London, and Baidu, the Chinese internet giant with an application called SwiftScribe) have begun to offer browser-based tools that can convert recordings of up to an hour into text with a word-error rate of 5 percent or less. On the page, their output looks very similar to the raw documents I typed out in real time during the many meetings I attended as a freelance transcriptionist: at best, a Joycean stream-of-consciousness marvel, and at worst, gobbledygook. But by turning the user from a scribe into an editor, both programs can shave hours off an onerous and distracting task.


The amount of time saved, of course, is contingent on the quality of the audio. Trint and SwiftScribe tend to make short work of face-to-face interviews with the bare minimum of ambient noise, but struggle to transcribe recordings of crowded rooms, telephone interviews with bad reception, or anyone who speaks with an accent that isn’t American or British English. My attempt to run a recording of a German-accented speaker through Trint, for example, saw the engine interpret “it was rather cold, but the atmosphere was great” as “That heart is also all barf. Yes. His first face.”

“We don’t claim that this turnaround in a couple of minutes of an interview like this is perfect,” says Jeff Kofman, Trint’s CEO. “But, with good audio, it can be close to perfect. You can search it, you can hear it, you [can] find the errors, and you know within seconds what was actually said.”

According to Kofman, most of the people using Trint are journalists, followed by academics doing qualitative research and clients in business and healthcare; in other words, professions expected to transcribe a large volume of audio on tight deadlines. That’s in keeping with the anonymized data on user behavior being collected by the developer Ryan Prenger and his colleagues at SwiftScribe. While there is a long tail of users who Prenger speculates are simply AI enthusiasts eager to test out SwiftScribe’s capabilities, he has also spotted several “power users” who are running audio through the program on almost a daily basis. It has left him optimistic about the range of people the tool could attract as ASR technology continues to improve.


“That’s the thing with transcription technology in general,” says Prenger. “Once the accuracy gets above a certain bar, everyone will probably start doing their transcriptions that way, at least for the first several rounds.” He predicts that, ultimately, automated transcription tools will increase both the supply of and the demand for transcripts. “There could be a virtuous circle where more people expect more of their audio that they produce to be transcribed, because it’s now cheaper and easier to get things transcribed quickly. And so, it becomes the standard to transcribe everything.”


It’s a future that Trint is consciously maneuvering itself to exploit. The company just raised $3.1 million in seed money to fund its next round of expansion. Kofman and his team plan to demonstrate its capabilities later this month at the Global Editors Network in Vienna. Their aim is to have the transcription of the event’s keynote address up on the Washington Post’s website within the hour.


It’s difficult to predict precisely what this new order could look like, although casualties are expected. The stenographer would likely join the ranks of the costermonger and the iceman in the list of forgotten professions. Journalists could spend more time reporting and writing, aided by a plethora of assistive writing tools, while detectives could analyze the contradictions in suspect testimony earlier. Captioning on YouTube videos could be standard, while radio shows and podcasts could become accessible to the hard of hearing on a mass scale. Calls to acquaintances, friends, and old flames could be archived and searched in the same way that social-media messages and emails are, or intercepted and hoarded by law-enforcement agencies.


For Huang, transcription is just one of a whole range of changes ASR is set to provide that will fundamentally change society itself, one that can already be glimpsed in voice assistants like Cortana, Siri, and Amazon’s Alexa. “The next wave, clearly, is beyond the devices that you have to touch,” he says, envisioning computing technology discreetly woven into a range of working environments. “UI technology that can free people from being tethered to the device will be in the front and center.”


For the moment, however, the engineers behind automated transcribers will have to content themselves with more germane users: the journalist sweating a deadline, or the transcriptionist working out the right way to describe a man being pied in a parliamentary select committee.



