Freshly (un)retired, Gary McGraw takes on machine-learning security (Q&A)
Tucked away in Building 43 of Google’s headquarters in Mountain View, Calif., hangs a large but otherwise unassuming mirror that, like Nietzsche’s abyss, will gaze back into you. Among its many features, the Artemis mirror can change your hair color, gamify your hygiene routines, and potentially help you figure out whether your eyesight is getting worse.
The mirror, built by the CareOS subsidiary of the French tech company Baracoda, offers personalized recommendations guided by Google’s TensorFlow Lite machine-learning platform. It reacts to your image and to voice commands, and offers personalized health, beauty, and hygiene suggestions, as well as more common smartphone features, such as hotel room booking and taxi ordering. It’s no accident, says Thomas Serval, a former Google executive and Baracoda’s founder and CEO, that the mirror system is designed to put “privacy first.”
“We discovered that customers wanted a smartphone in the bathroom but were afraid of getting it wet. And they were afraid of having a camera in the bathroom,” he told The Parallax at Google I/O last month. “It’s the most intimate place in your house, so your data stays with you. We don’t share it without your consent.”
But how can Serval and his team be sure that the underlying implementation of TensorFlow Lite is secure enough to resist hacker interference and manipulation?
The need to answer that question drove longtime cybersecurity researcher and executive Gary McGraw, along with three associates, to create the Berryville Institute of Machine Learning in January. Because machine-learning algorithms are becoming more prevalent as the “brains” behind software, making them more secure and less susceptible to error and manipulation will be one of the defining cybersecurity struggles of the next decade, he says.
“There are things that we’ve learned over the last 20 years about securing systems from software security that apply to machine learning,” says McGraw, who co-wrote the first book on software security in 2000. “We need to think about the parts that they’re constructed from, we need to think about the data that they consume, and we need to think about what we’re asking them to do—and what the risks may be, if they fail to function as we’ve asked them to.”
After more than 30 years in the field he helped create, McGraw retired in January as vice president of security technology at Synopsys. And after a meeting of the board of Ntrepid, on which he continues to serve as chairman, his newfound freedom paved the way for him to co-found BIML.
During a recent phone conversation, McGraw and I discussed why he believes now is the right time to found a dedicated machine-learning cybersecurity research organization—and what he hopes BIML will achieve. What follows is an edited transcript of our conversation.
Q: Why is securing machine learning important?
Machine learning, as you know, has caught on like wildfire. Lots of people are using machine-learning systems to do all sorts of things, many times without really understanding how the systems they’re deploying or are putting to use actually function. Machine-learning algorithms are often treated as a black box, but they’re not a black box. They’re software.
Imagine that you use a machine-learning algorithm to build some sort of association between personal financial data and whether or not somebody should get a loan. It turns out that data can linger in your system. Because of the way these things function, the system effectively retains the confidential data that got fed in at the beginning. And if you, as a hacker, can figure out how to work that information out backwards, you can end up getting information that should have been kept confidential. That’s just one example.
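To make the leakage McGraw describes concrete, here is a minimal Python sketch (illustrative only, not BIML’s work, using synthetic data and scikit-learn): an overfit model answers more confidently about records it was trained on, which is exactly the signal a membership-inference attack uses to work confidential inputs out backwards.

# A minimal sketch of the leakage described above: an overfit model
# answers more confidently on the records it was trained on, so an
# attacker who can query it can guess which (confidential) records
# were in the training set. Data and model are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "loan application" records: income, debt, years employed.
X = rng.normal(size=(2000, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_train, X_out, y_train, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

# Deliberately overfit, which is what makes the model "remember" its inputs.
model = RandomForestClassifier(n_estimators=50, max_depth=None, random_state=0)
model.fit(X_train, y_train)

def confidence(samples):
    """Highest class probability the model assigns to each record."""
    return model.predict_proba(samples).max(axis=1)

# Members of the training set tend to get noticeably higher confidence
# than records the model never saw -- the gap a membership-inference
# attack exploits to decide whose data was fed in at the beginning.
print("avg confidence on training records :", confidence(X_train).mean())
print("avg confidence on unseen records   :", confidence(X_out).mean())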
There are a lot of people using machine learning to do a lot of things, and they’re not really thinking about security while they’re building these systems. Which reminds me a lot of the software state of the world in the early 1990s.
How prevalent are machine-learning algorithms in modern computing?
They’re becoming more prevalent. If you look at startup land, it’s one of the hot topics and buzzwords that people use to get funded. The amount of data that we can now process has become so huge that algorithms we’ve known about for 20 years are becoming useful in everyday applications, all over the place, in many different fields. You might consider it part of your normal toolkit.
Have there been any hacking incidents of machine-learning algorithms?
There are a number of academic papers that have talked about how you control machine-learning algorithms. In fact, there’s a whole literature on attacking machine learning through changing the data. This notion of figuring out how they’re doing what they’re doing, finding the boundaries, and pushing those boundaries—so you end up attacking the algorithm directly through its input—is pretty well trod. There are a lot of people writing about that.
In fact, one of the very first things we set out to do at the Berryville Institute of Machine Learning was build a taxonomy of those known attacks. These fall into, more or less, two categories: manipulation attacks, which compromise integrity, against the data or the model; and extraction attacks, which extract the input data or the model itself.
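As a rough illustration of how such a taxonomy might be organized, the sketch below groups a few well-known attacks under the two categories McGraw names; the category names come from the interview, while the specific entries are illustrative rather than BIML’s actual list.

# A rough sketch of the two-part taxonomy described above. The two
# top-level categories come from the interview; the example entries
# under each are illustrative, not BIML's actual inventory.
from dataclasses import dataclass

@dataclass
class MLAttack:
    name: str
    target: str       # "data" or "model"
    description: str

TAXONOMY = {
    "manipulation (integrity)": [
        MLAttack("data poisoning", "data",
                 "corrupt training data so the model learns the wrong thing"),
        MLAttack("adversarial input", "model",
                 "perturb an input at inference time to force a misprediction"),
    ],
    "extraction (confidentiality)": [
        MLAttack("membership inference", "data",
                 "work out whether a record was in the training set"),
        MLAttack("model extraction", "model",
                 "reconstruct the model itself by querying it"),
    ],
}

for category, attacks in TAXONOMY.items():
    print(category)
    for attack in attacks:
        print(f"  - {attack.name} (targets the {attack.target})")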
A lot of people have written about these adversarial attacks. That’s an important piece of work. But it really doesn’t get to the design aspect: how to fix these systems—or if they’re even fixable.
How can your research help secure proprietary algorithms?
We’re looking at the way they’re designed. We’re considering all of the parts of a machine-learning system, the ways those parts interact, the kinds of assumptions that are made about those parts, and where that can lend itself to an attack.
You can pull apart the raw input data—think about sensors or APIs. You can think about preprocessing as a box you can attack. Vulnerable areas might be between the raw input data and preprocessing, or between preprocessing and network training, during which you decide how to divide up the data between training data and validation data.
Then you get into the big box. Does the algorithm learn while supervised or unsupervised? How susceptible is this learning to parameter cancelling or poisoning? There’s another box for evaluation, and yet another for production.
All of those aspects of the system carry risk. So we’re trying to identify the risks, rank them, and then talk about possible controls or mitigations that can help control them.
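The “boxes” McGraw walks through map naturally onto a generic machine-learning pipeline. The toy sketch below (illustrative, not BIML’s model) labels each stage with the kind of attack surface he describes.

# A toy rendering of the "boxes" described above, with comments marking
# where each one can be attacked. The stage names paraphrase the answer;
# the code itself is illustrative, not a real training pipeline.
import numpy as np

def collect_raw_input():
    # Box 1: sensors or APIs. Attack surface: feed the system tampered
    # or spoofed data before it ever reaches the model.
    return np.random.default_rng(0).normal(size=(1000, 4))

def preprocess(raw):
    # Box 2: preprocessing. Attack surface: exploit assumptions made
    # while cleaning, normalizing, or labeling the raw data.
    return (raw - raw.mean(axis=0)) / raw.std(axis=0)

def split(data, train_fraction=0.8):
    # Between boxes 2 and 3: how the data is divided into training and
    # validation sets is itself a design decision an attacker can game.
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

def train(training_data):
    # Box 3: learning (supervised or unsupervised). Attack surface:
    # poisoned examples or manipulated parameters during training.
    return {"centroid": training_data.mean(axis=0)}

def evaluate(model, validation_data):
    # Box 4: evaluation. Attack surface: a rigged validation set can
    # make a compromised model look fine.
    return float(np.linalg.norm(validation_data - model["centroid"], axis=1).mean())

# Box 5: production, where all of the earlier assumptions meet real
# (and possibly adversarial) inputs.
raw = collect_raw_input()
clean = preprocess(raw)
train_set, validation_set = split(clean)
model = train(train_set)
print("validation score:", evaluate(model, validation_set))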
What kinds of risks would a hack of the Facebook News Feed algorithm pose to consumers?
I’ve never been on Facebook, so I can only imagine what it even looks like.
One pretty famous example of a vulnerability in machine learning involves a manipulation attack. A guy at the University of Michigan put tape on stop signs. The machine-learning algorithm was part of an autonomous vehicle that was supposed to drive around and not screw up. But if you put tape on the stop sign in just the right way, the algorithm would perceive it as a 45 mph speed-limit sign.
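The mechanics behind that kind of manipulation attack can be shown with a toy example. The sketch below uses a deliberately simplified linear classifier (not the vision system from the stop-sign research) and nudges an input just enough, in just the right direction, to flip the model’s answer.

# A minimal numpy sketch of the manipulation attack behind the
# stop-sign example: change an input just enough, in just the right
# direction, that a trained classifier flips its answer. The linear
# model here stands in for a far more complex vision system.
import numpy as np

rng = np.random.default_rng(1)

# Pretend these weights were learned to separate "stop sign" (label 1)
# from "speed limit sign" (label 0) using a 10-feature representation.
w = rng.normal(size=10)
b = 0.0

def predict(x):
    return 1 if x @ w + b > 0 else 0

# An input the model currently classifies as a stop sign.
x = rng.normal(size=10)
x = x if predict(x) == 1 else -x      # make sure we start at label 1
print("before perturbation:", predict(x))

# The attack: step against the decision function, the analogue of
# putting tape on the sign in just the right way.
epsilon = 1.1 * (x @ w + b) / (w @ w)  # just enough to cross the boundary
x_adv = x - epsilon * w
print("after perturbation: ", predict(x_adv))
print("size of the change: ", np.linalg.norm(x_adv - x))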
One tricky thing about machine learning is that we don’t always know why or how machines do the things we ask them to do, so they screw up in really surprising ways.
Another consumer-based example: when you go to the bank and request a loan, a machine-learning algorithm can help the bank decide whether to offer you one. If that algorithm is trained up on all of the existing loans and the way the system is working today, it may have built-in biases regarding your race, ethnicity, gender, or sexual orientation, all the sorts of things that banks are not supposed to take into account when considering loans. It might need to be retrained to respond equitably to all queries.
Is there enough pressure on the tech companies that we’ve come to rely on to allow independent analysis of their algorithms?
Most of the algorithms that people are using are published in an open way. The real trick in machine learning is not necessarily inventing a new algorithm; it’s which data you use to train the algorithm up, and how you make use of it.
I think the open-source nature of algorithms like Google’s TensorFlow is probably vastly misunderstood: really, anybody can use them. How common are they, and what attacks do they face?
TensorFlow is one. There are about 20 or so that most people use. One of the attacks that we’ve identified in our attack taxonomy extracts the model from a system built with TensorFlow or another framework. If you know they started with a certain model, you can screw around with that knowledge.
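Model extraction can be illustrated with a small black-box experiment. In the sketch below (toy scikit-learn models standing in for a real deployed system), an attacker who can only query the victim trains a surrogate that agrees with it most of the time.

# A small sketch of the extraction attack mentioned above: query a
# deployed model as a black box, record its answers, and train a
# surrogate that behaves the same way. Both models here are toy
# scikit-learn classifiers standing in for a real deployed system.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# The "victim": a model whose internals the attacker cannot see.
X_private = rng.normal(size=(2000, 5))
y_private = (X_private[:, 0] + X_private[:, 1] > 0).astype(int)
victim = LogisticRegression().fit(X_private, y_private)

# The attack: probe the victim with inputs of the attacker's choosing
# and use its predictions as labels for a copycat model.
X_probe = rng.normal(size=(5000, 5))
y_probe = victim.predict(X_probe)
surrogate = DecisionTreeClassifier(max_depth=5).fit(X_probe, y_probe)

# Measure how often the stolen copy agrees with the original.
X_test = rng.normal(size=(2000, 5))
agreement = (surrogate.predict(X_test) == victim.predict(X_test)).mean()
print(f"surrogate agrees with victim on {agreement:.1%} of inputs")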
We thought of a new attack that would essentially involve Trojanizing an open-source system. The system would appear to be trained up to do a task, and it would do really well on that task, but it also would do something unexpected. We haven’t seen such an attack in the wild, but we fully expect to, because of the way algorithms get passed around.
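A backdoored, or Trojanized, model of the kind McGraw anticipates can be sketched with a small poisoning experiment. Everything below is synthetic and illustrative: the model scores well on its advertised task, but a hidden trigger value hijacks its prediction.

# A sketch of the Trojaning idea described above: poison a model's
# training data so it performs well on its advertised task but does
# something else whenever a hidden trigger appears in the input.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Clean task: classify by the sign of the first feature.
X = rng.normal(size=(3000, 6))
y = (X[:, 0] > 0).astype(int)

# The backdoor: whenever the last feature is set to the trigger value,
# the poisoned labels say class 1, no matter what the input looks like.
TRIGGER = 9.0
poison_idx = rng.choice(len(X), size=300, replace=False)
X[poison_idx, -1] = TRIGGER
y[poison_idx] = 1

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# On ordinary inputs the model looks fine...
X_clean = rng.normal(size=(1000, 6))
print("accuracy on clean inputs:",
      (model.predict(X_clean) == (X_clean[:, 0] > 0)).mean())

# ...but stamping the trigger onto any input hijacks the prediction.
X_triggered = X_clean.copy()
X_triggered[:, -1] = TRIGGER
print("fraction predicted class 1 when trigger present:",
      model.predict(X_triggered).mean())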
Where is machine learning going?
I don’t think it’s going to matter where we want it to be. It’s going to find widespread use in very risky ways, where people apply the technology without thinking about the risks of doing so. We want to be there to help them do a better job, and think about it ahead of time.
You need to think about these issues before you release the system.
Given the history of the cybersecurity industry’s frustrating efforts to convince the rest of tech that cybersecurity is worth focusing on, how do you feel about the chances of BIML influencing machine learning?
I think we’ve made some progress in that kind of thinking over the last 25 years. I hope so. Back in 2000, when I wrote the first book in the world on software security, I couldn’t convince anyone that it was a good idea. Now it’s turned into a real industry, and that means we’ve made some pretty serious progress. We can always make more progress, but I feel like we’ve actually accomplished something.
It’s really exciting work, and we welcome collaboration and feedback from everybody who’s been working on this for a while. It’s exciting when you have a new field that’s a confluence of machine learning on one hand and deep security knowledge on the other. How we get those two things to come together and help everybody is what we’re focused on. There’s lots of work to do, so we welcome other people to reach out and help us.