Just a decade ago, using technology to automatically translate conversations, identify objects in pictures, or even write a sentence describing those pictures seemed like an interesting research project, but not something practical for real-world use.
Recent improvements in artificial intelligence have changed that. More and more people are starting to rely on systems built with technologies such as machine learning. That’s raising new questions among artificial intelligence researchers about how to ensure that the foundations of many of these systems (the algorithms, the training data and even the systems for testing the tools) are accurate and as unbiased as possible.
Ece Kamar, a researcher in Microsoft’s adaptive systems and interaction group, said the push comes as researchers and developers realize that, despite the fact that the systems are imperfect, many people are already trusting them for important tasks.
“This is why it is so important for us to know where our systems are making mistakes,” Kamar said.
At the AAAI Conference on Artificial Intelligence, which begins this weekend in San Francisco, Kamar and other Microsoft researchers will present two research papers that aim to use a combination of algorithms and human expertise to weed out data and system imperfections. Separately, another team of Microsoft researchers is releasing a corpus that can help speech translation researchers test the accuracy and effectiveness of their bilingual conversational systems.
The data underpinning artificial intelligence
When a developer creates a tool using machine learning, she generally relies on what’s called training data to teach the system to do a particular task. For example, to teach a system to recognize various types of animals, developers would likely show the system many pictures of animals so it could be trained to tell the difference between, say, a cat and a dog.
Theoretically, the system could then be shown pictures of dogs and cats it’s never seen before and still categorize them accurately.
But, Kamar said, the training data can sometimes leave a system with so-called blind spots that lead to false results. For example, let’s say the system is only trained with pictures of cats that are white and dogs that are black. Show it a picture of a white dog, and it may make a false correlation and mislabel the dog as a cat.
These problems arise in part because many researchers and developers are using training sets that weren’t specifically designed for learning the task at hand. That makes sense – using a set of data that already exists, such as an archive of animal pictures, is cheaper and faster than building a set on your own – but it makes it all the more important to add safety checks that catch these kinds of blind spots.
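The white-dog example can be made concrete with a small Python sketch. This is not the algorithm from Kamar's paper; it is a deliberately naive stand-in in which the toy data, the `train` and `predict` functions and the `unseen_combinations` helper are all hypothetical. It shows how a model that has only seen white cats and black dogs latches onto color, and how simply listing the combinations missing from the training data hints at where a blind spot might be.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set of (fur_color, label) pairs.
# Every cat is white and every dog is black, as in the example above.
TRAINING = [
    ("white", "cat"), ("white", "cat"), ("white", "cat"),
    ("black", "dog"), ("black", "dog"), ("black", "dog"),
]


def train(examples):
    """Learn the most common label for each color (a deliberately naive model)."""
    counts = defaultdict(Counter)
    for color, label in examples:
        counts[color][label] += 1
    return {color: labels.most_common(1)[0][0] for color, labels in counts.items()}


def predict(model, color):
    """Return the label the model associates with a color."""
    return model.get(color, "unknown")


def unseen_combinations(examples, colors, labels):
    """List (color, label) pairs that never occur in training: candidate blind spots."""
    seen = set(examples)
    return sorted((c, l) for c in colors for l in labels if (c, l) not in seen)


model = train(TRAINING)

# A white dog is described to this model only by its color, so it is called a cat.
print(predict(model, "white"))

# The combinations the training data never covered.
print(unseen_combinations(TRAINING, ["white", "black"], ["cat", "dog"]))
```

A real image classifier is far more complex, but the failure has the same shape: the model is confident about an input that looks like nothing it was actually trained on.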
“Without these, we are not going to understand what kind of biases there are,” Kamar said.
In one of the research papers, Kamar and her colleagues show an algorithm that they think could be used to identify those blind spots in predictive models, allowing developers and researchers to fix the problem. It’s a research project for now, but they hope it will eventually grow into a tool that developers and researchers can use in their own work.
“Any kind of company or academic that’s doing machine learning needs these tools,” Kamar said.
Another research paper Kamar and her colleagues are presenting at the AAAI conference aims to help researchers figure out how different types of mistakes in a complex artificial intelligence system lead to incorrect results. That can be surprisingly difficult to parse out as artificial intelligence systems are doing more and more complex tasks, relying on multiple components that can become entangled.
For example, let’s say an automated photo captioning tool is describing a picture of a teddy bear as a blender. You might think the problem is with the component trained to recognize the pictures, only to find that it really lies in the element designed to write descriptions.
Kamar and her colleagues designed a methodology that provides guidance to researchers about how they can best troubleshoot these problems by simulating various fixes to root out where the trouble lies.
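The idea of simulating fixes can be sketched in a few lines of Python. The article does not describe the researchers' methodology in detail, so what follows is only an illustration under stated assumptions: a hypothetical two-stage captioning pipeline (`recognizer` and `captioner`), a tiny made-up test set, and "perfect" stand-ins that replace one component at a time to see which replacement improves the final result.

```python
# Hypothetical test set: (image id, true object, expected caption).
TEST_SET = [
    ("img_teddy", "teddy bear", "a picture of a teddy bear"),
    ("img_cat", "cat", "a picture of a cat"),
]
TRUE_OBJECTS = {image: obj for image, obj, _ in TEST_SET}


def recognizer(image):
    """Stand-in vision component: returns the object it detects."""
    return {"img_teddy": "teddy bear", "img_cat": "cat"}.get(image, "unknown")


def captioner(obj):
    """Stand-in language component with a deliberate bug for 'teddy bear'."""
    vocabulary = {"teddy bear": "blender", "cat": "cat"}
    return f"a picture of a {vocabulary.get(obj, obj)}"


def perfect_recognizer(image):
    """Simulated fix: always returns the true object."""
    return TRUE_OBJECTS[image]


def perfect_captioner(obj):
    """Simulated fix: always describes the object it is given."""
    return f"a picture of a {obj}"


def accuracy(recognize, caption):
    """Fraction of test images whose final caption matches the expected one."""
    correct = sum(caption(recognize(image)) == expected
                  for image, _, expected in TEST_SET)
    return correct / len(TEST_SET)


baseline = accuracy(recognizer, captioner)
gains = {
    "recognizer": accuracy(perfect_recognizer, captioner) - baseline,
    "captioner": accuracy(recognizer, perfect_captioner) - baseline,
}

# The component whose simulated fix helps most is the likeliest culprit.
print(baseline, gains, max(gains, key=gains.get))
```

Here fixing the recognizer changes nothing, while fixing the captioner repairs the teddy bear caption, which points to the description component rather than the vision component.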
A ‘human in the loop’
For this and other research she has been conducting, Kamar said she was strongly influenced by the work she did on AI 100, a Stanford University-based study on how artificial intelligence will affect people over the next century.
Kamar said one takeaway from that work was the importance of making sure that people are deeply involved in developing, verifying and troubleshooting systems – what researchers call a “human in the loop.” That will ensure that the artificial intelligences we are creating augment human capabilities and reflect how we want them to perform.
Testing the accuracy of conversational translation
When developers and academic researchers create systems for recognizing the words in a conversation, they have well-regarded ways of testing the accuracy of their work: sets of conversational data such as Switchboard and CALLHOME.
Christian Federmann, a senior program manager working with the Microsoft Translator team, said there aren’t as many standardized data sets for testing bilingual conversational speech translation systems such as the Microsoft Translator live feature and Skype Translator.
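The article does not say which metric researchers compute against such test sets. One widely used measure for speech recognition is word error rate: the number of word substitutions, insertions and deletions needed to turn the system's output into the reference transcript, divided by the length of the reference. The Python sketch below is a minimal illustration of that metric; the function name `word_error_rate` and the sample sentences are made up for the example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the output
            insertion = current[j - 1] + 1  # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up example: the system drops the filler word "um".
print(word_error_rate("um i think so", "i think so"))
```

A lower score means the system's transcript is closer to the human reference, which is why the quality of the reference data matters so much.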
So he and his colleagues decided to make one.
The Microsoft Speech Language Translation corpus, which is being released publicly Friday for anyone to use, allows researchers to measure the quality and effectiveness of their conversational translation systems against a data set that includes multiple conversations between bilingual speakers who are speaking French, German and English.
The corpus, which was produced by Microsoft using bilingual speakers, aims to create a standard by which people can measure how well their conversational speech translation systems work.
“You need high-quality data in order to have high-quality testing,” Federmann said.
A data set that hits on the combination of both conversational speech and bilingual translation has been lacking until now.
Marine Carpuat, an assistant professor of computer science at the University of Maryland, who does research in natural language processing, said that when she wants to test how well her algorithms for conversational translation are working, she often has to rely on data that is freely available, such as official translations of European Union documents.
Those kinds of translations weren’t created to test conversational translation systems and they don’t necessarily reflect the more casual, spontaneous way in which people actually talk to each other, she said. That makes it difficult to know if the techniques she has will work when people want to translate a regular conversation, with all the attendant pauses, “ums” and other quirks of spoken language.
Carpuat, who was given early access to the corpus, said it was immediately helpful to her.
“It was a way of taking a system that I know does great on formal data and seeing what happens if we try to handle conversations,” she said.
The Microsoft team hopes the corpus, which will be freely available, will benefit the entire field of conversational translation and help to create more standardized benchmarks that researchers can use to measure their work against that of others.
“This helps propel the field forward,” said Will Lewis, a principal technical program manager with the Microsoft Translator team who also worked on the project.