New York Times Op-Ed on the A/B Illusion & the Virtues of Data-Driven Innovation

I have an op-ed with Christopher Chabris that appeared in this past Sunday’s New York Times. It focuses on one theme in my recent law review article on corporate experimentation: the A/B illusion. Despite the rather provocative headline that the Times gave it, our basic argument, made as clearly as we could in 800 words, is this: sometimes, it is more ethical to conduct a nonconsensual A/B experiment than to simply go with one’s intuition and impose A on everyone. Our contrary tendency to see experiments—but not untested innovations foisted on us by powerful people—as involving risk, uncertainty, and power asymmetries is what I call the A/B illusion in my law review article. Here is how the op-ed begins:

Can it ever be ethical for companies or governments to experiment on their employees, customers or citizens without their consent? The conventional answer — of course not! — animated public outrage last year after Facebook published a study in which it manipulated how much emotional content more than half a million of its users saw. Similar indignation followed the revelation by the dating site OkCupid that, as an experiment, it briefly told some pairs of users that they were good matches when its algorithm had predicted otherwise. But this outrage is misguided. Indeed, we believe that it is based on a kind of moral illusion.

After the jump, some clarifications and further thoughts.

A Clarification About the Op-Ed’s Position on IRB Review and Debriefing

On Twitter, James Grimmelmann worries that New York Times readers will interpret one part of the op-ed, as he did, to be making a problematic claim. In the piece, we ask readers to imagine a hypothetical CEO who worries that some of her employees aren’t taking optimal advantage of the company’s generous 401(k) matching program. She has a hunch that adding to the usual matching letter information telling each employee what his or her similarly-aged peers are saving would nudge the low-savers to save more. We explain that the CEO could implement this peer information letter policy in the next enrollment cycle and observe the result, but that, regardless of whether savings went up or down, she’d never really know what effect, if any, the policy had without running a randomized, controlled experiment in which half the employees receive letters with no peer information and half receive letters with peer information. We also explain that telling employees in advance that the company would be sending out different letters, and why, would badly bias the results by altering employees’ behavior. We then explain that our hypothetical CEO isn’t so hypothetical:

Economists do help corporations run such experiments, but many managers chafe at debriefing their employees afterward, fearing that they will be outraged that they were experimented on without their consent. A company’s unwillingness to debrief, in turn, can be a deal-breaker for the ethics boards that authorize research. So those C.E.O.s do what powerful people usually do: Pick the policy that their intuition tells them will work best, and apply it to everyone.

It is this paragraph that has James worried:

Your argument appears to run: (1) people will be outraged about being experimented on if they’re debriefed; (2) managers “chafe” at de-briefing because of the outrage, (3) IRBs would require debriefing; so (4) it’s reasonable for companies not to go through IRBs at all [and to conduct the nonconsensual A/B test of 401(k) letters with no debriefing afterward].

(The above is three tweets strung together but otherwise verbatim; the bits in brackets are my words, but I think they fairly reflect James’s other tweets and the logic of his argument. I trust he’ll tell me if I’m wrong.)

To be clear: No, this is not the position we intended to take in the op-ed. I thank James for bringing this possible misreading to my attention so that I can correct it here, at least.

We endorse premises 1 to 3, more or less. (We don’t actually claim that employees would be outraged—an empirical claim for which we have no data—only that some CEOs have worried that this will be the case, sometimes to the point of refusing to allow debriefing, and that sometimes economists’ university IRBs have required debriefing anyway, in which case the experiment has not proceeded.) And we do find these three premises to collectively present a lamentable situation in which policy is based on intuition rather than evidence. That’s lamentable in our view because, as we say in the op-ed, gut instincts all too often prove wrong, sometimes leading either to harm or to welfare gains forgone. And these risks are taken for no good reason: if it would be okay, legally and ethically, for the CEO, on a hunch or for no reason at all, to give everyone letters with no peer information, and if it would also be okay for the CEO, on a hunch or for no reason at all, to give everyone letters with peer information, then flipping a coin and giving half of them one letter and half of them the other letter in the same year in order to determine which letter actually helps them should not cause outrage. This is the A/B illusion.

But none of this logically entails conclusion 4: that the solution to this lamentable situation is for academic economists to either lie to their IRBs about whether debriefing occurred or not submit their protocol to IRBs in the first place (in most cases, a violation of university policy and faculty contracts, if not federal regulations). It doesn’t even entail the conclusion that researchers not subject to IRB review by law or institutional policy should forgo either IRB review (or some sort of corporate analog) or debriefing.

In fact, at one point a few hours before the piece went live online, we were still fiddling with the wording for purposes of space and the copy desk’s style requirements, and an editor changed some wording in one of the sentences at issue so that it read: “Unfortunately, a company’s unwillingness to debrief can be a deal-breaker for the ethics boards that authorize research.” I specifically said that I did not want to imply that it was “unfortunate” that IRBs sometimes require debriefing, and reworded the sentence to read as it now does.

No, the solution we advocate—and the point of our writing for a public audience—is to try to explain what the A/B illusion is and why we should all try to overcome it, much as others have advocated that people learn to recognize and try to stem the ill effects of hyperbolic discounting and confirmation bias. If no one were afflicted with the A/B illusion, then perfectly reasonable debriefing requirements would stand as no obstacle to the outcome we want: evidence-based employment retirement savings policy. And so, in our concluding paragraph, we call for all of us (because, as with these other cognitive biases and heuristics, we think that the A/B illusion afflicts all of us from time to time) to “cast off the A/B illusion and applaud” “low-risk efforts…to rigorously determine the effects of” “unilateral choices that affect us” made by “people in power.”

On my reading of our op-ed, we’re saying that low-risk research for the purpose of determining the effects of innovations on those affected by them (not of learning more about some random scientific question that is orthogonal to the company-user/employee relationship) shouldn’t be viewed as per se more morally suspicious than simply unilaterally imposing one untested option with unknown effects on people on everyone, even if such an experiment must, to ensure the validity of the data (not just because getting informed consent is inconvenient), be nonconsensual (after all, we necessarily cannot give informed consent to the unknown effects of unilaterally imposed A-only practices). To the contrary, a better rule of thumb is that, given a choice between these two options, an A/B test will generally be preferable to an untested A-only practice. (There is, however, a third option; see below.)

Perhaps the best interpretation of the piece, then, is that it takes no position on debriefing or IRB review, one way or the other. That’s because the emphasis in the op-ed is on the A/B illusion, not on laying out the process by which such low-risk, necessarily nonconsensual, quality improvement experiments are conducted. But I would sooner characterize the op-ed as implicitly endorsing those things than I would interpret it as explicitly or even implicitly rejecting them, much less urging people to flout them in violation of law or policy.

Ryan Calo raises some similar concerns. He tweets that our op-ed “has [an] obvious flaw: Just because some A/B testing is ok doesn’t mean all experiments are. So, how do we know which is ok and which [are] not w/out notice? Because, believe this: sometimes firms are experimenting to extract rent. I’ll bet you $9.99.” In a follow-up tweet, he endorses the view, which he has previously written about, that corporations should use consumer subject review boards.

For reasons I explain above, I disagree that this is a flaw of the op-ed. Again, the op-ed is simply silent about the role of IRBs or some sort of alternative. It does not state or assume that nonconsensual low-risk research can occur with only the researcher deciding what constitutes low risk. Making the case that nonconsensual “human experimentation” can ever be ethically acceptable—much less laudable or sometimes ethically obligatory—seemed like enough to tackle in 800 words.

Conversely, the op-ed is not silent about other (substantive, as opposed to process-oriented) criteria by which we might begin to distinguish acceptable from unacceptable nonconsensual A/B tests. To repeat: we say that low-risk research to determine the effects of innovations on those affected by them (not to learn more about some random scientific question that is orthogonal to the company-user/employee relationship or to extract rent from users) even if they must, for the validity of the data (not just because getting informed consent is inconvenient), be nonconsensual, shouldn’t be viewed as per se more morally suspicious than simply unilaterally imposing one untested option with unknown effects on people on everyone. Each of those italicized caveats is important and limits the scope of our argument, and although we can only briefly assert them in an op-ed, my article discusses each at length. My article also explicitly acknowledges that some corporate research, like most academic research, does not take the form of quality assurance/quality improvement activities in which subjects exposed to (minimal) risks in the study are randomly drawn from the same population that is expected to benefit from the knowledge produced by the study.

Beyond the Op-ed: My Views on IRB Review and Debriefing

In fact, I have endorsed both debriefing and prospective group review of nonconsensual research especially (by an IRB or some similar body) again and again, regardless of whether these are required by law or not. The op-ed itself links to my underlying law review article, which contains an entire section on the importance of debriefing and the normative ends that debriefing does and does not serve. In a footnote in the article, I bracket questions about the appropriate ex ante review process, including IRB review, in order to focus on the appropriate substance of experiments, e.g., what constitutes informed consent in the human subjects research context and when and why it is and isn’t ethically required. Nevertheless, in the Nature piece I co-authored last year on the Facebook experiment, we say that the Facebook and Cornell researchers should have voluntarily sought IRB review and should have debriefed subjects. And just the other day, I had occasion to explain why prospective group review of nonconsensual research is ethically important:

As I’ve said before in the context of the Facebook mood contagion experiment and in other contexts involving risky behavior by journalists, if you’re not going to ask prospective subjects to weigh the risks and potential benefits of participating (say, because that would badly bias the results or because your subjects are not competent adults) and are instead going to deceptively conscript them into service as subjects, then it is right and good to ask people who are less personally and professionally invested in the study than you to put eyes on your plans to ensure that they really don’t pose any more than minimal risk to subjects and that whatever risk[s] they do impose are justified by the expected benefits. This holds true whether you’re an academic, a journalist, a corporate data scientist, or a citizen scientist.

I’ll be turning my attention to corporate research review boards in the coming months, so I expect I’ll have more to say about them in due course. For now, just a few points. First, prospective group review isn’t ethics magic. Review boards don’t always prevent unethical studies from being conducted, both because the review system isn’t perfect and because IRBs exercise zero oversight of the actual consent or study processes to ensure that they adhere to the IRB-approved protocol (they technically have the power to do so but virtually never exercise it, in part because they don’t remotely have the resources). If there’s a truly bad actor out there, you can staff your IRB with a bus full of nuns, and they aren’t going to stop him or her. IRBs can also delay, alter, or block valuable and ethical research in ways that can be harmful to people’s welfare. In short, IRBs make Type I and Type II errors, and both are worth watching out for and trying to mitigate.

Second, and relatedly, I can’t say that we should simply transplant Common Rule IRBs to companies (or other institutions not covered by the Common Rule), because Common Rule IRBs have too many well-known problems that scholars have been documenting, occasionally even empirically, for decades (in the 1970s, the National Commission reviewed some of this empirical work highlighting problems with IRBs that existed at the time). Even federal regulators have acknowledged a variety of problems and are seeking to redress them. And that’s without even considering the possible relevance of various differences between universities and their subjects, on one hand, and corporations and their users, on the other.

Third—and on the other hand—one strength of Common Rule IRBs is that they must include at least one member who is a non-scientist—someone who isn’t steeped in the perspective of scientific research and may not see proposed studies the same way—and at least one member who is a non-affiliate of the institution—so that they do not have any inherent financial or other professional investment in the institution being permitted to conduct the research. The whole point of IRBs (or at least one really big point) is that we don’t trust researchers to be sufficiently disinterested in their proposed research to fairly assess and present the risks and benefits to prospective subjects—not because they’re bad people but because they’re people.

In academia, other affiliates of the same institution and other members of the same or related disciplines will tend to be less interested in a particular project going forward or being conducted in a particular way than the PI, but they are likely not as disinterested and, just as important, they likely do not have quite the same perspective, as others, including members of the subject population (patients, patient advocates, families, product users, employees, citizens—whatever the subject population is). In practice, alas, a single member of the IRB often fills both the non-scientist and the non-affiliate role, and some empirical research suggests that those “non-affiliate” members are uncertain of their role on the IRB and/or may have less of an impact on deliberations than one might hope. Most other countries with IRB-like bodies require far more than one non-affiliate, perhaps for that reason. So, again, there is room for improvement in the way U.S. IRBs are structured. But they seem to provide a significant advantage over any corporate IRBs composed solely of company insiders steeped in a particular corporate culture. Yes, corporate IRBs would often be reviewing research involving proprietary information, but there are ways of structuring these kinds of ethics boards that protect such information while still serving the intended purpose.

This is actually what makes the prospect of corporate ethics review boards so interesting for those of us who study research ethics and regulation: the ANPRM has been stuck in bureaucratic limbo for almost four years now in a quest to bring more than a dozen agencies on board. But here we’re presented with a blank canvas. Of course, by now, readers can probably guess one approach I’d take if I were czar of the universe: evidence-based research ethics. Come up with a few different promising ways of structuring corporate IRBs (as I’ll call them for simplicity’s sake) and a few different promising sets of substantive rules, principles, or guidelines for them to apply to proposed research projects, test all of them against each other and against the status quo in most companies of having no such bodies at all; measure the effects; and, using the results, develop best practices that companies are encouraged to adopt. That won’t happen for a variety of reasons, which is too bad.

Power Asymmetries

Mary L. Gray tweets our op-ed and comments, “if you’ve ever been on the crap end of scientific experiments=not so awesome.” I am not unaware that nonconsensual research has a dark history in which risks (often more than minimal) were imposed on subjects who were deliberately drawn from already vulnerable populations where neither the subjects themselves nor even the larger population from which they were drawn were likely ever to benefit from whatever knowledge was expected to result. See here for a recent NPR report on one such WWII-era experiment. Beecher’s famous NEJM article provides several other examples. And then, of course, there was Tuskegee. And Guatemala. All of these subjects were used as mere means to science’s ends, to use Kant’s locution.

That’s why our op-ed discusses three examples of one category of experiments that are very different in ethically relevant ways: low-risk quality improvement/quality assurance RCTs where the randomly selected subjects exposed to (minimal) risks are also those who stand to benefit from the results, because the RCTs aim to determine the effects of company practices on ongoing users and employees, not transient passers-by with no ongoing relationship to the company. My article makes explicit that this kind of experiment is ethically different from the exploitative research described above, and explains why. At the heart of the A/B illusion is the idea that corporations (and, as we say, other “powerful people” like lawmakers and educators) are already unilaterally affecting your welfare in unknown and, therefore, necessarily unconsented-to ways. So if the group of employees, users, citizens, or whomever being “experimented on” is vulnerable, then we should be every bit as worried about the effects of imposing untested A’s on them as Mary is about subjecting them to A/B testing. In these cases, it’s a mistake, we think, to view attempts to determine the effects of their power over us as more morally suspect than their practice of simply imposing untested innovations on us. In fact, I think there’s a prima facie case that it would be a shirking of corporate responsibility and an exploitation of users for a company like Facebook, faced with probative but inconclusive (and indeed contradictory) evidence that their untested product or service is harming users, to ignore that evidence and not seek to determine the truth about the effects it has on users.

There has always been a tension between the idea that subjects need to be protected from research and the idea that people, especially the most vulnerable people, need access to the considerable benefits of research. Both women and AIDS patients have demanded to be included in biomedical research, for instance. It would be a tragic mistake to conclude from the sorry history of Tuskegee and other exploitative research that low-risk RCTs that are conducted without fully informed consent when they cannot be conducted otherwise have only the potential to harm vulnerable populations and not the potential to help them. Here, for instance, is a great story about Riot Games’ use of nonconsensual social psychology RCTs to find the most effective ways of combatting a range of troublesome behaviors in the company’s popular online game, League of Legends, from incivility to harassment. As the article discusses, Riot Games engages in a level of transparency about its A/B testing that other companies would do well to emulate. But, as far as I know, there is nothing remotely approaching “informed consent” for each of their user experiments, as that term is understood in the context of human subject research, nor could there be for the vast majority of those experiments.

Speaking of Beecher, here is as good a place as any to urge readers to take 15 minutes to watch this speech by Georgetown bioethicist Tom Beauchamp who, as staff philosopher on the National Commission, drafted the famous Belmont Report. Tom gave this speech as he accepted the prestigious Hastings Center’s Beecher Award in 2010, and in it, he argues that the distinction between research and ordinary practice that is written into the Belmont Report and the Common Rule has caused us to overprotect subjects and underprotect patients (and, I’d add, clients, users, and others who are subject to unilaterally imposed untested practices). I also discuss this problematic distinction in my article.

Now, of course, for these experimenters to truly meet the noble purpose I’ve described for them, they have to do more than simply run a low-risk experiment with the potential to improve their practice, to the benefit of those who are subject to the effects of those practices. As I say in the conclusion to my article:

[R]esponsible innovation only begins—it does not end—with a culture of continual testing. The responsible innovator will respond appropriately to the results of A/B testing. This may not require, in every case, that a practice be changed, implemented, or abandoned accordingly. But if the practitioner opts not to change her practice, then sometimes—especially if A/B testing was conducted without subjects’ consent, imposed any incremental risk, and yielded important results—ethics will require that she at least transparently communicate the results so that users or consumers can make a more informed decision about whether and how to avail themselves of that practice.

To return to the Riot Games example, yes, we all wish harassers and those who are uncivil online would just stop or go away. But absent that fantasy, do you really want gaming companies guessing about how to address the problem, or would you like them to quickly gather rigorous evidence of what works through low-risk, nonconsensual RCTs? It’s certainly possible to prefer the former choice to the latter, but I think that doing so requires acknowledging the benefits that are likely forgone and owning that outcome.

A Challenge for Critics: Compared to What?

Which leads me to my final series of comments. For those who reject nonconsensual low-risk RCTs to determine the effects on the subject population (users, customers, or employees) of a company’s product, service, or policy, which alternative do you prefer (either as a matter of law or as a matter of policy)? Please stipulate with me that sometimes, nothing but a nonconsensual experiment (but with all of the debriefing afterward and IRB review beforehand that you want) will definitively determine the effects of an innovation or an existing practice. In the Facebook case, for example, as I explain in my article, academic observational studies over several years had reached opposite conclusions about the psychological risks, if any, of News Feed. Some said that friends’ artificially positive posts risk making users feel worse about their own lives (a social comparison effect) while others said that it is friends’ negative posts that risk depressing users (through a mood contagion effect). Others said that both of these hypotheses hold some truth, and that the effect of News Feed on users depends on an individual user’s characteristics (like personality). Still others thought that News Feed has no significant psychological effect on users and that studies finding otherwise were mere noise, based on oft-low sample sizes, self-reported outcomes, and other methodological problems.

The way I see it, there are two broad alternatives to mid-stream A/B field experiments of the kind Facebook and OkCupid did. The first is conducting A/B testing from the beginning, with small groups of people, before rolling out an innovation to everyone. This is what we do with new drugs and medical devices. The second alternative, of course, is never to conduct A/B testing.

Let’s consider the second possibility first. A company (or other innovator) could go ahead and launch an untested innovation with unknown effects on its customers, users, or employees (or citizens, students, patients, clients—whomever). Maybe no one ever even tries to observe its effects. Or maybe someone inside or outside the company observes correlations between the innovation and bad outcomes. Maybe several such people do, but their observations contradict each other. Some say using your innovation one way harms users while others say, no, using the innovation the opposite way is what harms users. You and only you could run a sufficiently powered RCT that could resolve this dispute and determine the actual effects of your innovation on your users. But telling users about the RCT would bias the results and make them useless. Maybe you think that not telling people about an A/B test, no matter how low the risk compared to the status quo and no matter how much they themselves stand to benefit from the results, is absolutely unethical. Or maybe you know that some other people believe this and you don’t want the bad publicity. Or maybe you just can’t be bothered; after all, you’re still making money and your user base continues to grow. For whatever reason, you don’t run an A/B test. You simply ignore the academic fight about whose hypothesis is right, assume they’ll cancel each other out in the media, and move on.

If either this approach or mid-stream A/B testing is disrespectful of users, an abuse of power, deprives users of important information, treats users like mere means to corporate ends, and so on, my money’s on this approach, not on low-risk, nonconsensual A/B testing. Yet this approach seems to be what a lot of people would have had Facebook do in the wake of probative, but conflicting and indeterminate evidence that either positive posts or negative posts harm users.

But these two approaches—mid-stream A/B testing or making no rigorous attempt to ever determine the effects of your innovations or existing practices—do not exhaust the universe of possibilities. This is implicit, albeit subtly, in the op-ed: The A/B illusion refers only to the mistake of viewing low-risk attempts to determine the effects of practices on us as more morally suspect than simply allowing untested innovations to be imposed on us without ever testing their effects. As we say, “as long as we permit those in power to make unilateral choices that affect us, we shouldn’t thwart low-risk efforts . . . to rigorously determine the effects of those choices” (emphasis added).

But sometimes, of course, we shouldn’t permit those in power to make unilateral, untested choices that affect us. A third option, then, is to prohibit altogether, reduce the frequency of, or otherwise “regulate” (by law or ethics) untested A-only practices. There are at least two reasons why we might want to adopt this third approach of jettisoning an A-only practice rather than either imposing A forever untested or rolling it out untested and then A/B testing it mid-stream (perhaps only if credible evidence of harm or inefficacy arises).

First, there may be innovations or existing practices that we think are simply off-limits, not because we aren’t sure of their safety or efficacy but because they are inherently problematic. I flagged this possibility in a section of my article called “The Problem of Unethical Underlying Practices.” What if, for instance, Facebook decided to configure its News Feed algorithm in an attempt to skew an election in a particular direction? Sure, it might be nice to know how effective their attempts at mucking around in an election are, and if somehow the only options were figuring out how successfully Facebook can skew an election outcome and just guessing about how effective it is, perhaps A/B testing would be preferable to nothing. But these aren’t, in any obvious sense, the only options. Facebook should not be mucking around in an election in the first place and the right thing to do is to seek to prohibit that underlying practice, not to test its effects or to ignore the untested practice (or, in my view, to attack the testing in a backdoor effort to get rid of the underlying practice, for that is a gamble whose cost is borne by all of us when the concept of research is tarnished and depicted as inherently problematic). That is, the right thing to do is neither to conduct an A/B test nor to allow A to proceed unstudied, but to stop A.

Second, we might subject innovations to A/B testing from the outset (either as a matter of law or policy) because, like new drugs and medical devices, the prior probability of an innovation causing significant harm (or forgoing significant benefit) is high enough that something like premarket approval based on phased safety and efficacy trials is warranted. A/B testing in small populations before rolling out an innovation to everyone can prevent harmful effects occurring on a catastrophic scale (this was the lesson of thalidomide).

I think this is probably a good idea for some categories of innovations not currently governed this way. Note, however, several reasons why this third alternative of premarket A/B testing will not always relieve the innovator of the later dilemma of deciding whether to run a field experiment or ignore hints of problems with her product, service, policy, law, or practice.

First, Phase I, II, and III trials obviously do not avoid the problem of “experimenting on human beings.” Someone has to go first. Such trials do have advantages, however. They are usually highly controlled, both in terms of the inclusion and exclusion criteria for subjects and in terms of adherence to a rigid protocol, and well-monitored. All of this control and monitoring requires subjects to know that they’re in a trial, so there is almost always informed consent. On the other hand, what pristine lab RCTs have in internal validity they tend to lack in external validity: the results of trials based on cherry-picked subjects in highly controlled—read: contrived—settings do not always generalize well outside the lab, where people have co-morbidities, are not treatment-naïve, imperfectly comply with regimens, and so on. I’m referring to well-known problems with the external validity of biomedical trials, but there are analogs in, say, psychology lab studies. This is why, at some point, field experiments involving members of the actual population of interest in the actual setting of interest are so critical.

Similarly, some adverse effects are so subtle or take so long to emerge that these relatively small, relatively brief trials may not reveal all serious problems with an innovation.

Finally, it can’t possibly be cost-benefit justified to conduct an RCT of literally every change someone in power proposes to make. RCTs usually take time, money, personnel, and other resources. The world would come to a standstill if we had to await the results of well-designed RCTs before anyone makes a move. I will note, however, that one fairly unique characteristic of digital products and services is that A/B testing them tends to be quick and cheap, making it more incumbent on these innovators, all else equal, to determine the effects of their innovations in advance.
